Joorabchi, A. and Mahdi, A. E., An unsupervised approach to automatic classification of scientific literature utilizing bibliographic metadata, Journal of Information Science, 37, 5 (2011), 499–514.
This article describes an unsupervised approach to the automatic classification of scientific literature archived in digital libraries and repositories, according to the Dewey Decimal Classification (DDC) scheme.
It gives a good summary of recent research in this field and categorizes the existing work into two main groups: ML-based systems and string-matching-based systems.
The article proposes the Bibliography Based ATC (BB-ATC) approach, which relies solely on the subject classification metadata of publications that cite either the document to be classified or one of its references. This metadata, as catalogued in the OPACs of conventional libraries, is used to probabilistically infer the most appropriate class for the document.
To demonstrate the application of the proposed BB-ATC approach and evaluate its classification performance, the authors developed a prototype ATC system for the automatic classification of scientific literature archived in the CiteSeer digital library.
The above figure shows an overview of the ATC system. The system is effectively a metadata generator comprising a pre-processing unit, a data mining unit, and an inference unit.
The task of the data mining unit is twofold. (1) It uses the Google Books Search (GBS) engine to compile a pool of publications that cite either the document to be classified or one of its references. This is done by submitting a number of URL queries to the GBS engine in the following format:
http://www.google.com/books/feeds/volumes?max-results=20&q=%22[title]%2C%22
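As a minimal sketch of how such a query URL could be assembled, the Python snippet below percent-encodes a title into the format shown above. The function name build_gbs_query and the max_results parameter are illustrative assumptions, not part of the article, and the snippet only builds the URL string without sending a request (the feed endpoint is quoted from the article and may no longer be served).

```python
from urllib.parse import quote

# Base address of the feed, copied from the query format quoted above.
GBS_FEED = "http://www.google.com/books/feeds/volumes"


def build_gbs_query(title: str, max_results: int = 20) -> str:
    """Return a feed URL searching for the quoted phrase "<title>,".

    Hypothetical helper: the article only gives the URL pattern.
    """
    # %22 is a double quote and %2C a comma, so the search term is the
    # title followed by a comma, wrapped in double quotes.
    phrase = quote(f'"{title},"', safe="")
    return f"{GBS_FEED}?max-results={max_results}&q={phrase}"


if __name__ == "__main__":
    print(build_gbs_query("Data Mining"))
```

Passing safe="" makes quote encode every reserved character, including the comma and the double quotes, so the result matches the %22...%2C%22 pattern.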
(2) It processes the metadata records of the publications in the pool to extract their corresponding ISBNs. These ISBNs are then used as unique identifiers for the publications to query the WorldCat database for their corresponding metadata records. The latter process is done by submitting the following URL query to the WorldCat Search API for each ISBN:
http://www.worldcat.org/webservices/catalog/content/isbn/[ISBN]%3Fwskey%3D[key]%3Dfull
The whole process is shown in the figure below.
The inference process derives the un-normalized local frequency, normalized local frequency, and global frequency weights for each unique DDC number in the pool. These values are then used to compute the weights of all the DDC numbers.
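This summary does not give the formulas for the three frequency weights, so the following is only an illustrative sketch under assumed definitions, not the authors' method. It assumes the pool is grouped by cited item (the document itself or one of its references), that the local frequency is a raw count within a group, that the normalized local frequency is that count divided by the group size, and that the global frequency is the number of groups containing the DDC number. The function name frequency_weights and the dictionary keys are hypothetical.

```python
from collections import Counter


def frequency_weights(groups):
    """Compute illustrative frequency weights per DDC number.

    groups: list of lists of DDC strings, one inner list per cited item
    (assumed layout). All three definitions below are assumptions.
    """
    weights = {}
    for group in groups:
        counts = Counter(group)
        for ddc, n in counts.items():
            w = weights.setdefault(
                ddc, {"local": 0, "local_norm": 0.0, "global": 0}
            )
            w["local"] += n                   # raw count within the group
            w["local_norm"] += n / len(group) # share of the group
            w["global"] += 1                  # number of groups containing it
    return weights


if __name__ == "__main__":
    pool = [["006.3", "006.3", "004"], ["006.3"]]
    for ddc, w in frequency_weights(pool).items():
        print(ddc, w)
```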
A classification hierarchy tree is built to select the DDC number for the document.
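Because each additional digit of a DDC number names a narrower subclass, a hierarchy of this kind can be modelled as a digit-prefix tree. The sketch below is one possible reading, not the selection procedure from the article: it adds every number's weight to all of its ancestor classes and then walks down from the root, following the heaviest child while that child holds at least a given share of its parent's weight. The function name select_ddc and the min_share stopping rule are assumptions made for illustration.

```python
def select_ddc(weights, min_share=0.5):
    """Pick a DDC number by descending a weighted digit-prefix tree.

    weights: dict mapping DDC strings (e.g. "006.31") to non-negative
    weights. min_share is a hypothetical stopping threshold.
    """
    # Add each number's weight to every digit prefix (every ancestor class).
    prefix_weight = {}
    for ddc, w in weights.items():
        digits = ddc.replace(".", "")
        for i in range(1, len(digits) + 1):
            key = digits[:i]
            prefix_weight[key] = prefix_weight.get(key, 0.0) + w

    node, node_weight = "", float(sum(weights.values()))
    while True:
        # Children are the prefixes exactly one digit longer than the node.
        children = {
            p: w for p, w in prefix_weight.items()
            if len(p) == len(node) + 1 and p.startswith(node)
        }
        if not children:
            break
        best = max(children, key=children.get)
        if children[best] < min_share * node_weight:
            break  # evidence is too scattered to go deeper
        node, node_weight = best, children[best]

    if not node:
        return None
    node = node.ljust(3, "0")  # pad broad classes to three digits
    return node if len(node) == 3 else f"{node[:3]}.{node[3:]}"


if __name__ == "__main__":
    pool_weights = {"006.31": 3.0, "006.3": 1.0, "004.6": 1.0, "510": 0.5}
    print(select_ddc(pool_weights))
```

A higher min_share makes the walk stop earlier and return a broader class, which trades specificity for a lower risk of choosing the wrong narrow subclass.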
The developed ATC system was evaluated on a test corpus of 1000 scientific documents, and the classification results were analysed to quantify the prediction performance of the system and to identify the factors influencing it. The authors report micro-averaged values of 0.84, 0.78, and 0.81 for the overall precision, recall, and F1 measures of the system, respectively, and provide a relative comparison between its performance and that of similar reported systems.
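For reference, micro-averaging pools the true-positive, false-positive, and false-negative counts over all classes before computing the measures. The sketch below uses the standard definitions; the function names and the example counts are made up for illustration and are not data from the article. It also checks that the reported F1 follows from the reported precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def micro_prf(per_class_counts):
    """Micro-averaged precision, recall, and F1.

    per_class_counts: list of (tp, fp, fn) tuples, one per class.
    """
    counts = list(per_class_counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


if __name__ == "__main__":
    # Made-up counts for two classes, only to show the call.
    print(micro_prf([(8, 2, 1), (2, 0, 3)]))
    # Reported precision and recall, rounded F1.
    print(round(f1_score(0.84, 0.78), 2))
```

With the reported precision of 0.84 and recall of 0.78, the harmonic mean is 1.3104 / 1.62, which rounds to the reported F1 of 0.81.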