
From Text Corpus to Dewey Number: Designing a Prototype for Automated Classification


Affiliations
1 Department of Library and Information Science, Kalyani University, Kalyani – 741235, West Bengal, India



This research explores the possibilities of an AI/ML-based automated indexing system for book collections in a library. Library classification systems are essentially pre-coordinated indexing approaches. Since the 1980s, researchers have used different techniques for synthesizing classification numbers automatically from a text corpus. With the advent of machine learning techniques in the late 1990s, a more recent approach uses a supervised learning algorithm to train a model on a set of documents that have been manually classified by trained library professionals using classification schemes such as UDC, DDC, or Colon Classification. The trained model (machine learning backend) learns patterns from the training data and then predicts the subject and class number for new documents. In the preliminary phase, we gathered a collection of more than 200,000 MARC 21-formatted bibliographic records from different libraries and curated these datasets to include Tag 082 (DDC Call Number), Tag 245 (Title of Document), Tag 520 (Summary Note), and Tag 650 (Subject Descriptors) for developing the final dataset. This final dataset was subsequently divided into three sections: (i) a training dataset (96% of the final dataset); (ii) a validation dataset (2% of the final dataset); and (iii) a test dataset (2% of the final dataset). We deployed Annif, an open-source AI/ML framework, along with different backends supported by it (Associative group: FastText, Omikuji and SVC; and Ensemble: Simple and Neural Network). In the next stage, the framework was trained using each of these backend algorithms, and finally, the results were combined into an ensemble based on a neural network model. To assess the effectiveness of these models, all of the machine learning backends were compared using two retrieval metrics: F1@5 and NDCG.
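As a rough illustration of the two metrics named above, the following Python sketch computes F1@5 and a binary-relevance NDCG for a single document. It is a minimal sketch under stated assumptions, not the evaluation code used in the study: the function names and the sample class numbers are hypothetical, each suggested class number is assumed to appear only once in the ranked list, and Annif's own evaluation may differ in details such as how scores are averaged over documents.

```python
import math


def f1_at_k(suggested, relevant, k=5):
    """F1 score over the top-k suggested class numbers for one document."""
    top_k = suggested[:k]
    hits = sum(1 for s in top_k if s in relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


def ndcg(suggested, relevant, k=None):
    """NDCG with binary relevance: a suggestion is either correct or not."""
    ranked = suggested if k is None else suggested[:k]
    # Discounted gain of the correct suggestions at their actual ranks.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, s in enumerate(ranked)
        if s in relevant
    )
    # Ideal case: all correct class numbers ranked first.
    ideal_hits = min(len(relevant), len(ranked))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranked suggestions and manually assigned class numbers.
suggested = ["025.431", "025.4", "006.31", "020", "004"]
relevant = {"025.431", "006.31"}

print(round(f1_at_k(suggested, relevant), 4))
print(round(ndcg(suggested, relevant), 4))
```

F1@5 rewards a backend for placing the correct class numbers anywhere in its top five suggestions, while NDCG additionally rewards placing them near the top of the list.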
For automated class number building, we found that the neural network model outperforms all other backends. Moreover, it is feasible to adopt these methods and tools for building a real-life automated classification system, as Annif supports REST API-based access for generating suggestions for DDC-based class numbers (along with accuracy scores) from a given text corpus. The overall framework is based on open-source software, open datasets, and open standards.
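The REST-based access mentioned above could be used along the following lines. This is a hedged sketch using only the Python standard library: the base URL `http://localhost:5000` and the project identifier `ddc-nn-en` are assumptions made for illustration and do not come from the study, and the response fields read here (`results`, `notation`, `label`, `score`) follow Annif's suggest endpoint as we understand it, so they should be checked against the deployed Annif version.

```python
import json
import urllib.parse
import urllib.request


def build_suggest_request(base_url, project_id, text, limit=5):
    """Build a POST request for the suggest endpoint of an Annif project."""
    url = f"{base_url.rstrip('/')}/v1/projects/{project_id}/suggest"
    body = urllib.parse.urlencode({"text": text, "limit": limit}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def parse_suggestions(payload):
    """Turn a JSON response into (class number, score) pairs."""
    results = json.loads(payload).get("results", [])
    # Fall back to the label when a suggestion carries no notation.
    return [(r.get("notation") or r.get("label"), r["score"]) for r in results]


def suggest(base_url, project_id, text, limit=5):
    """Send text to a running Annif instance and return ranked suggestions."""
    request = build_suggest_request(base_url, project_id, text, limit)
    with urllib.request.urlopen(request) as response:
        return parse_suggestions(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical server and project; requires a running Annif instance.
    title_and_summary = "An introduction to library classification schemes"
    for notation, score in suggest("http://localhost:5000", "ddc-nn-en", title_and_summary):
        print(notation, score)
```

Separating request building and response parsing from the network call keeps the client easy to check without a live server, which is useful when the suggestions feed into a cataloguing workflow.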

Keywords

Annif, Automated Indexing, Automatic Classification, DDC, NDCG, Neural Network
About The Authors

Soumik Kerketta
Department of Library and Information Science, Kalyani University, Kalyani – 741235, West Bengal
India

Parthasarathi Mukhopadhyay
Department of Library and Information Science, Kalyani University, Kalyani – 741235, West Bengal
India





DOI: https://doi.org/10.17821/srels%2F2024%2Fv61i6%2F171643