Extraction of Textual Data from Unstructured Malayalam Web Resources

Jisha P. Jayan; Anju Vinod; Suresh K.S.; Jayaraj N.

Affiliations
Research & Development Division, Centre for Development of Imaging Technology, Thiruvananthapuram, India

Abstract

The exponential growth of digital content in the Malayalam language has created an urgent need for methods that can extract useful information from this large body of text. This paper presents a study of text mining techniques tailored to Malayalam, aiming to bridge the gap between the language's linguistic intricacies and computational approaches. It describes the creation of an integrated Malayalam Text Mining Tool combined with a Part-of-Speech (POS) tagger. To handle the distinctive linguistic characteristics of Malayalam, the research combines Natural Language Processing and Machine Learning concepts. The text mining tool is designed to extract relevant information from Malayalam text quickly, meeting the growing demand for language-centric technologies in a variety of applications. The POS tagger extends the tool by recognizing and labelling the parts of speech in Malayalam phrases, which enriches the analysis. The tool also supports gathering corpora from diverse newspapers and processes data in TXT and CSV file formats, on which many Natural Language Processing applications depend. Techniques such as tokenization, stemming, and lemmatization are employed to standardize word representations. Feature extraction methods such as TF-IDF and word embeddings capture semantic relationships and local patterns, which improves text comprehension for further analysis and machine learning. The processed data then undergoes classification to extract insights. Model performance is assessed using evaluation metrics, and visualization techniques are used to present the results for interpretation and communication.
In the future, the work could be extended by integrating additional machine learning algorithms for comparative analysis, paving the way for a deeper understanding and more advanced applications of Malayalam text mining across various domains.
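
The extraction, tokenization, and TF-IDF steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses Python's standard-library html.parser in place of Beautiful Soup (the library named in the keywords) so that it runs without extra packages, it assumes article text sits in <p> elements, and every name in it (ParagraphText, extract_paragraphs, tokenize, tf_idf, paragraphs_to_csv, SAMPLE_HTML) is hypothetical. The tokenizer simply keeps runs of characters from the Malayalam Unicode block, and the TF-IDF variant shown (term frequency divided by document length, times ln(N/df)) is one common choice among several.

```python
import csv
import io
import math
import re
from collections import Counter
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of <p> elements, ignoring <script>/<style> content."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._skip = 0  # > 0 while inside <script> or <style>
        self._buf = []

    def _flush(self):
        # Join buffered chunks and normalize whitespace.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "p":
            if self._in_p:  # previous <p> was never closed
                self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "p" and self._in_p:
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and self._skip == 0:
            self._buf.append(data)


def extract_paragraphs(html):
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep text from an unclosed trailing <p>
    return parser.paragraphs


# Assumption: a token is a run of Malayalam-block characters, plus the
# zero-width joiners that can occur inside Malayalam words.
MALAYALAM_TOKEN = re.compile(r"[\u0D00-\u0D7F\u200C\u200D]+")


def tokenize(text):
    return MALAYALAM_TOKEN.findall(text)


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return weights


def paragraphs_to_csv(paragraphs):
    """Serialize paragraphs as CSV text with an id column."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["id", "text"])
    for i, paragraph in enumerate(paragraphs, 1):
        writer.writerow([i, paragraph])
    return out.getvalue()


SAMPLE_HTML = """
<html><head><style>p { color: red; }</style></head>
<body>
<p>കേരളം മനോഹരമാണ്.</p>
<script>var x = 1;</script>
<p>കേരളം ഇന്ത്യയിലെ ഒരു സംസ്ഥാനമാണ്.</p>
</body></html>
"""

if __name__ == "__main__":
    paragraphs = extract_paragraphs(SAMPLE_HTML)
    documents = [tokenize(p) for p in paragraphs]
    print(paragraphs_to_csv(paragraphs))
    print(tf_idf(documents))
```

With this weighting, a word that occurs in every document (here കേരളം) receives a weight of zero, while words specific to one document score higher. Stemming, lemmatization, and word embeddings are left out of the sketch because the abstract does not say how they are implemented.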

Keywords

Text Mining, Information Extraction, Unstructured Text, Beautiful Soup, Conditional Random Field (CRF)
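
The keywords name Conditional Random Fields, but the abstract does not give the POS tagger's feature set. As an illustration only, the sketch below shows the kind of per-token feature dictionaries that CRF toolkits (for example sklearn-crfsuite) typically accept as input. The feature template and the function names token_features and sentence_features are assumptions, not the authors' design. Suffix features are included because Malayalam is agglutinative and word endings often carry grammatical information.

```python
def token_features(tokens, i):
    """Build a feature dict for the token at position i (hypothetical template)."""
    word = tokens[i]
    features = {
        "bias": 1.0,
        "word": word,
        "length": len(word),
        # Slices count Unicode code points, not visual glyphs, so a
        # "2-character" suffix may cover a consonant plus a vowel sign.
        "suffix2": word[-2:],
        "suffix3": word[-3:],
        "suffix4": word[-4:],
        "prefix2": word[:2],
        "is_digit": word.isdigit(),
    }
    if i > 0:
        features["prev_word"] = tokens[i - 1]
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        features["next_word"] = tokens[i + 1]
    else:
        features["EOS"] = True  # end of sentence
    return features


def sentence_features(tokens):
    """Feature dicts for every token in one sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    for features in sentence_features(["കേരളം", "മനോഹരമാണ്"]):
        print(features)
```

A tagger would be trained on pairs of such feature sequences and gold POS tag sequences; the tag set and training data are not specified in the abstract and are therefore omitted here.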

Research Cell: An International Journal of Engineering Sciences
