
Word Level Language Identification of English-Punjabi Code-Mixed Social Media Text


Affiliations
1 Department of Computer Science, Punjabi University College of Engineering & Management, Rampura Phul, India
2 Department of Computer Science, Punjabi University, Patiala, India
3 Department of Computer Science and Engineering, Yadavindra College of Engineering, Talwandi Sabo, India
 

Code mixing is the use of multiple languages within a single utterance. It is pervasive in social media communication, irrespective of the mode used. The fusion of languages makes such text more challenging to process and requires systems to be updated consistently as usage trends change. This paper addresses word-level language identification of code-mixed English-Punjabi text with three approaches: Conditional Random Fields (CRFs), Bidirectional Long Short-Term Memory (Bi-LSTM) networks, and Convolutional Neural Networks (CNNs). First, a CRF-based system uses lexical, contextual, character n-gram, and special-character features. Second, a recurrent neural network, namely a Bi-LSTM with GloVe embeddings, is used. Third, a CNN with GloVe embeddings is used. The CRF-based system performs best, with an F1-score of 0.96.

Keywords

Code Mixing, Language Identification, Deep Learning, GloVe Embedding, Conditional Random Fields.
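
The abstract names four feature groups for the CRF-based system (lexical, contextual, character n-gram, and special-character features) but does not list the individual features. The following is a minimal sketch of what per-token feature extraction of this kind might look like. The specific features, the function names (`char_ngrams`, `token_features`, `sentence_features`), the n-gram range, and the sample tokens are all illustrative assumptions and are not taken from the paper.

```python
def char_ngrams(word, n_min=2, n_max=3):
    """Return character n-grams of the word, padded with boundary markers.

    The 2..3 range is an assumption; the paper does not state the sizes used.
    """
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def token_features(tokens, i):
    """Build a feature dictionary for the token at position i (hypothetical feature set)."""
    word = tokens[i]
    lower = word.lower()
    feats = {
        # Lexical features
        "lower": lower,
        "length": len(word),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        # Special-character features (hashtags, mentions, URLs, pure symbols)
        "starts_special": word[:1] in ("@", "#"),
        "is_url": lower.startswith(("http://", "https://", "www.")),
        "has_no_alnum": not any(ch.isalnum() for ch in word),
        # Contextual features: neighbouring tokens
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
    # Character n-gram features
    for gram in char_ngrams(lower):
        feats["ng=" + gram] = True
    return feats


def sentence_features(tokens):
    """Feature dictionaries for every token in one sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Made-up token sequence, used only to show the output shape.
example = ["main", "kal", "movie", "dekhi", "#fun"]
features = sentence_features(example)
```

Per-token dictionaries like these are a common input format for CRF toolkits, which would be trained on them together with one language label per token.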

Authors

Neetika Bansal
Department of Computer Science, Punjabi University College of Engineering & Management, Rampura Phul, India
Vishal Goyal
Department of Computer Science, Punjabi University, Patiala, India
Simpel Rani
Department of Computer Science and Engineering, Yadavindra College of Engineering, Talwandi Sabo, India

