Extraction of Textual Data from Unstructured Malayalam Web Resources


Affiliations
1 Research & Development Division, Centre for Development of Imaging Technology, Thiruvananthapuram, India

Abstract

The exponential growth of digital content in the Malayalam language has created an urgent need for methods that extract useful information from this large corpus. This paper presents a study of text mining techniques tailored to Malayalam, aiming to bridge the gap between the language’s linguistic intricacies and computational approaches. It describes the creation of an integrated Malayalam Text Mining Tool combined with a Part-of-Speech (POS) tagger. To handle Malayalam’s distinctive linguistic characteristics, the research combines Natural Language Processing and Machine Learning concepts. The text mining tool is designed to rapidly extract relevant information from Malayalam text, meeting the growing demand for language-centric technologies in a variety of applications. The POS tagger complements the tool by recognizing and labelling the parts of speech in Malayalam phrases, which enriches the analysis. The tool gathers corpora from diverse newspapers and processes the data in TXT and CSV file formats, which are used by many Natural Language Processing applications. Tokenization, stemming, and lemmatization are employed to standardize word representations. Feature extraction methods such as TF-IDF and word embeddings capture semantic relationships and local patterns, which supports further analysis and machine learning. The processed data is then classified to extract insights. Model performance is assessed using evaluation metrics, and visualization techniques are used to present the results for interpretation and communication.
In the future, additional machine learning algorithms could be integrated for comparative analysis, paving the way for a deeper understanding and more advanced applications of Malayalam text mining across various domains.
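The pipeline outlined in the abstract (extract article text, tokenize, weight terms with TF-IDF, store the corpus as CSV) can be illustrated with a minimal sketch. It is not the authors' implementation: every function name here is hypothetical, the standard-library `html.parser` stands in for Beautiful Soup so the example has no dependencies, the tokenizer simply matches runs of characters from the Malayalam Unicode block, and the TF-IDF variant (relative term frequency times `log(N / df)`) is an assumption, since the abstract does not state which weighting was used. Stemming, lemmatization, POS tagging and classification are left out.

```python
import csv
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: a "word" is a run of characters from the Malayalam Unicode
# block (U+0D00-U+0D7F), plus the zero-width joiners used inside words.
MALAYALAM_WORD = re.compile(r"[\u0D00-\u0D7F\u200C\u200D]+")


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements and ignore everything else."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self._in_paragraph = False
            # Collapse internal whitespace and drop empty paragraphs.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the paragraph texts of an HTML page, in document order."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def tokenize(text):
    """Split text into Malayalam word tokens; other scripts are discarded."""
    return MALAYALAM_WORD.findall(text)


def tf_idf(token_lists):
    """Return one {term: weight} dict per document (tf * log(N / df))."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def write_corpus_csv(paragraphs, handle):
    """Write the extracted paragraphs to an open text file as (doc_id, text) rows."""
    writer = csv.writer(handle)
    writer.writerow(["doc_id", "text"])
    for doc_id, text in enumerate(paragraphs):
        writer.writerow([doc_id, text])


SAMPLE_HTML = (
    "<html><body><nav>Home</nav>"
    "<p>കേരളം മനോഹരം</p><p>കേരളം വലിയ നാട്</p>"
    "<script>var x = 1;</script></body></html>"
)

if __name__ == "__main__":
    paragraphs = extract_paragraphs(SAMPLE_HTML)
    for tokens, doc_weights in zip(map(tokenize, paragraphs), tf_idf([tokenize(p) for p in paragraphs])):
        print(tokens, doc_weights)
```

When writing to a real file, opening it with `encoding="utf-8"` and `newline=""` keeps the Malayalam text and the CSV row endings intact. A term that occurs in every document receives a weight of zero under this weighting, which is why production systems often use a smoothed IDF instead.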

Keywords

Text Mining, Information Extraction, Unstructured Text, Beautiful Soup, Conditional Random Field (CRF)

Authors

Jisha P. Jayan
Research & Development Division, Centre for Development of Imaging Technology, Thiruvananthapuram, India
Anju Vinod
Research & Development Division, Centre for Development of Imaging Technology, Thiruvananthapuram, India
Suresh K.S.
Research & Development Division, Centre for Development of Imaging Technology, Thiruvananthapuram, India
Jayaraj N.
Research & Development Division, Centre for Development of Imaging Technology, Thiruvananthapuram, India
