An Extensive Study on Automated Dewey Decimal Classification

Wang, J. (2009). An Extensive Study on Automated Dewey Decimal Classification. JASIST, 60 (11), pp. 2269-2286.

Motivation

  • Manual subject cataloging is expensive and time consuming.
  • The training required to become a subject classifier is complex.
  • Automated subject classification would improve efficiency and free catalogers from a heavy workload.

Objectives

To find a practical solution to automatic bibliographic classification using a supervised learning approach. The study asks two questions:

  • Are library classification schemes suitable for machine processing?
  • Are supervised-learning-based methods able to work effectively on bibliographic data?

Dataset

The bibliographic dataset used in this study consists of ten years of Library of Congress bibliographic data in the science and technology domains. It was formed by extracting MARC records from the OCLC WorldCat system using the following criteria: the record has a proper title in English, was created by the Library of Congress during the period from 1994 to 2004, and falls within the domains of science and technology (i.e., DDC main classes 500 and 600). There are 88,440 records in total.

DDC Restructuring

In their analysis of the DDC subject distribution, the authors found three problems: data sparseness, skewed distribution, and deep hierarchy. To overcome data sparseness, they truncated the DDC subjects by treating a DDC number simply as a decimal in which each digit represents a hierarchical level, thereby building a “decimal classification hierarchy” for machine classification.
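The idea of reading a DDC number digit by digit can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function name `ddc_path` and the `max_depth` cut-off are assumptions introduced here purely for illustration.

```python
from typing import List, Optional


def ddc_path(number: str, max_depth: Optional[int] = None) -> List[str]:
    """Return the chain of hierarchy nodes for a DDC number.

    Each digit is treated as one hierarchical level, so "512.94"
    yields the prefixes "5", "51", "512", "5129", "51294".
    `max_depth` (hypothetical parameter) truncates the number to
    at most that many digits before the path is built.
    """
    # Keep digits only; the decimal point carries no hierarchy information.
    digits = [ch for ch in number if ch.isdigit()]
    if max_depth is not None:
        digits = digits[:max_depth]
    # Every prefix of the digit string is one node on the path.
    return ["".join(digits[:i]) for i in range(1, len(digits) + 1)]


if __name__ == "__main__":
    print(ddc_path("512.94"))
    print(ddc_path("512.94", max_depth=3))
```

Truncating to fewer digits moves records into shallower, better-populated classes, which is one way to reduce sparseness at the cost of subject specificity.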

To implement a pragmatic library classification system, the authors propose to take two measures:

  • introducing human intelligence into the automatic classification process
  • restructuring the DDC hierarchy

Interactive classification requires a user to communicate with the computer system during task execution and to share the task with it, achieving a balance between automation and human control.
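One way to picture this sharing of work is the sketch below, which walks down a class hierarchy and hands the decision to the user only when the classifier is unsure. It is a rough illustration under stated assumptions, not the procedure from the paper: the confidence `threshold`, the `score` and `ask_user` callbacks, and the dictionary representation of the hierarchy are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def interactive_classify(
    children: Dict[str, List[str]],
    score: Callable[[str], Dict[str, float]],
    ask_user: Callable[[str, List[str]], str],
    root: str = "",
    threshold: float = 0.6,
    max_interactions: int = 3,
) -> Tuple[str, int]:
    """Descend the hierarchy, asking the user when confidence is low.

    children: maps a node to its child nodes (assumed representation).
    score:    returns classifier scores for the children of a node.
    ask_user: returns the child chosen by the human for a node.
    Returns the final node and the number of user interactions used.
    """
    node, asked = root, 0
    while children.get(node):
        options = children[node]
        scores = score(node)
        # Machine's preferred child at this level.
        best = max(options, key=lambda c: scores.get(c, 0.0))
        # Defer to the human only if unsure and the budget allows it.
        if scores.get(best, 0.0) < threshold and asked < max_interactions:
            best = ask_user(node, options)
            asked += 1
        node = best
    return node, asked


if __name__ == "__main__":
    tree = {"": ["5", "6"], "5": ["51", "52"]}
    fake_scores = {"": {"5": 0.9, "6": 0.1}, "5": {"51": 0.5, "52": 0.5}}
    print(interactive_classify(tree, lambda n: fake_scores[n],
                               lambda n, opts: "52"))
```

Capping the number of questions keeps the human effort bounded while still letting the user correct the machine at the levels where it is least certain.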

The authors also suggest a trimming machine algorithm for merging, flattening and chopping the hierarchy to achieve better distribution.

Results

The authors performed three groups of experiments: hierarchical classification on the truncated DDC scheme (referred to as HC), which is taken as a performance baseline; hierarchical classification on the trimmed DDC scheme (MC), from which the effects of the DDC restructuring can be observed; and interactive classification (IC) on the trimmed DDC scheme, the practical solution to automatic library classification.

The IC method performed the best, producing classification accuracy of nearly 90% with no more than three user interactions.

 
