

Word Level Language Identification of English-Punjabi Code-Mixed Social Media Text
Code mixing is the use of multiple languages within a single utterance. It is pervasive in social media communication, irrespective of the mode being used. The fusion of languages makes language identification more challenging, and systems require consistent updates to keep pace with recent trends. This paper addresses three approaches: CRFs (Conditional Random Fields), Bi-LSTM (Bidirectional Long Short-Term Memory), and CNNs (Convolutional Neural Networks). First, for word-level language identification of code-mixed English-Punjabi text, a CRF-based system uses lexical, contextual, character n-gram, and special character features. Second, a recurrent neural network, namely Bi-LSTM, with GloVe embeddings is used for language identification. Third, a CNN with GloVe embeddings is used for the same task. The CRF is observed to be the best-performing system, with an F1-score of 0.96.
Keywords
Code Mixing, Language Identification, Deep Learning, GloVe Embedding, Conditional Random Fields.
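
The four feature families named in the abstract for the CRF-based system (lexical, contextual, character n-gram, and special character) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the feature keys, the n-gram range of 1 to 3, and the one-word context window are all assumptions chosen for illustration, and the romanized example sentence is hypothetical.

```python
import re


def char_ngrams(word, n_min=1, n_max=3):
    """Return all character n-grams of `word` for n in [n_min, n_max]."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    return grams


def token_features(tokens, i):
    """Build a feature dictionary for the token at position i (illustrative only)."""
    word = tokens[i]
    lower = word.lower()
    feats = {
        # Lexical features (assumed set)
        "lower": lower,
        "length": len(word),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "has_digit": any(c.isdigit() for c in word),
        # Special character features (assumed set)
        "has_special": bool(re.search(r"[^\w\s]", word)),
        "is_mention_or_hashtag": word.startswith(("@", "#")),
        # Contextual features: one word to the left and right (assumed window)
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
    # Character n-gram features, one binary indicator per n-gram
    for gram in char_ngrams(lower):
        feats["ng=" + gram] = True
    return feats


def sentence_features(tokens):
    """Return one feature dictionary per token in the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Hypothetical romanized English-Punjabi sentence, used only to show the output shape.
example = ["main", "office", "ja", "reha", "haan", "!"]
features = sentence_features(example)
```

Per-token dictionaries of this shape are the usual input to CRF toolkits such as sklearn-crfsuite, where each sentence is a list of feature dictionaries paired with a list of language labels. Which toolkit and which exact features the paper used is not stated in the abstract.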