Blogs/Discussions | Digging into Metadata | for text analyses, resource classification, and knowledge discovery

An Extensive Study on Automated Dewey Decimal Classification

Posted on March 23, 2012 by Catherine

Wang, J. (2009). An Extensive Study on Automated Dewey Decimal Classification. JASIST, 60 (11), pp. 2269-2286.

Motivation

Manual subject cataloging is expensive and time consuming.
The training required to become a subject classifier is complex.
Automated subject classification would improve efficiency and free catalogers from a heavy workload.

Objectives

To find a practical solution to automatic bibliographic classification with a supervised learning approach.

Are library classifications suitable for machine processing?
Are supervised-learning-based methods able to work effectively on bibliographic data?

Dataset

The bibliographic dataset used in this study consists of ten years of Library of Congress bibliographic data in the science and technology domains. It was formed by extracting MARC records from the OCLC WorldCat system using the following criteria: having a proper title in English, created by the Library of Congress during the period from 1994 to 2004, and being in the domains of science and technology (i.e., DDC main classes 500 and 600). There are 88,440 records in total.

DDC Restructuring

In their analysis of the DDC subject distribution, the authors found problems of data sparseness, skewed distribution, and deep hierarchy. To overcome data sparseness, they truncated the DDC subjects by treating a DDC number simply as a decimal in which each digit represents a hierarchical level, thereby building a “decimal classification hierarchy” for machine classification.
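The digit-per-level idea can be illustrated with a short sketch. This is not the authors' code: the function name, the choice to label each level by its digit prefix, and the `max_depth` parameter used for truncation are all assumptions made for illustration.

```python
def ddc_path(ddc_number, max_depth=None):
    """Return the chain of hierarchy levels for a DDC number, one per digit.

    Assumption: each level is labelled by the digit prefix that leads to it,
    and the decimal point carries no hierarchical meaning of its own.
    """
    digits = [ch for ch in ddc_number if ch.isdigit()]
    if max_depth is not None:
        digits = digits[:max_depth]  # truncate deep, sparsely populated classes
    return ["".join(digits[:i]) for i in range(1, len(digits) + 1)]


print(ddc_path("516.35"))               # full path, one level per digit
print(ddc_path("516.35", max_depth=3))  # truncated to the three-digit class
```

Truncating at a fixed depth merges all records below that level into their ancestor class, which is one simple way to reduce sparseness.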

To implement a pragmatic library classification system, the authors propose to take two measures:

introducing human intelligence into the automatic classification process
restructuring the DDC hierarchy

Interactive classification requires a user to communicate with the computer system during task execution and to share the task with the computer, achieving a balance of automation and human control.

The authors also suggest a trimming machine algorithm for merging, flattening and chopping the hierarchy to achieve better distribution.

Results

The authors performed three groups of experiments: hierarchical classification on the truncated DDC scheme (referred to as HC), which is taken as a performance baseline; hierarchical classification on the trimmed DDC scheme (MC), from which the effects of the DDC restructuring can be observed; and interactive classification (IC) on the trimmed DDC scheme, the practical solution to automatic library classification.

The IC method performed the best, producing classification accuracy of nearly 90% with no more than three user interactions.

Posted in Dewey Decimal Classification, Digging methods and tools, Uncategorized

Findex: search result categories help users when document ranking fails

Posted on March 23, 2012 by mzarro

Käki, M. (2005). Findex: search result categories help users when document ranking fails. Proceedings of the SIGCHI conference on Human factors in computing systems, CHI ’05 (pp. 131–140). New York, NY, USA: ACM. doi:10.1145/1054972.1054991

In this study, a web search tool enhanced with categories was used by 16 users over a 2-month period – a longitudinal, naturalistic study.

It was found that when the search rankings did not deliver a good result at the top of the list, categories enabled users to find results that appear deeper down in the list of results.

Categories were created based on word frequencies in the result summaries. The set of collections is on the left, the main results on the right as seen in the screenshot below.
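A rough sketch of frequency-based categories follows. It is not Käki's implementation: the tokenizer, the small stopword list, the use of document frequency, and the `top_n` cut-off are assumptions chosen only to show the general idea.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "for", "to", "is", "on", "with"}


def build_categories(summaries, top_n=3):
    """Map the top_n most widespread words to the indexes of the summaries that contain them."""
    doc_words = [set(re.findall(r"[a-z]+", s.lower())) - STOPWORDS for s in summaries]
    # Count in how many summaries each word occurs (document frequency).
    doc_freq = Counter(word for words in doc_words for word in words)
    # Most frequent first; ties are broken alphabetically so the output is stable.
    top = sorted(doc_freq, key=lambda w: (-doc_freq[w], w))[:top_n]
    return {w: [i for i, words in enumerate(doc_words) if w in words] for w in top}


results = [
    "Jaguar car dealers and prices",
    "Jaguar the big cat in the wild",
    "Used car prices for the Jaguar",
]
print(build_categories(results))
```

Selecting a category then simply filters the ranked list down to the result indexes stored under that word, which is how a result far down the list can be reached quickly.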

Query length in the study averaged 2 terms, consistent with previous studies.

Categories were used in over 25% of the searches; in these searches, an average of 2.3 categories were used.

Main findings:

“categories are used in real settings and that users find relevant results with them” – users found the category interface helpful when evaluating the result set
“categories are beneficial when result rankings fail” – when results are not found in the top results of the ranked list, categories help users find a useful result. This is particularly useful when the query is vague or broad. 45% of users reported using less precise queries than in a non-enhanced interface
“categories can forgive mistakes in searching” – when the categories were used, there were fewer cases where no results were found. This is useful when less precise queries are used, as is common in unfamiliar domains
“categories make it easier to access multiple results” – their use increases the cases where more than one result was selected.

Posted in Displays, Systems | Tagged Display

Contextual Multi-dimensional Browsing

Posted on March 23, 2012 by mzhang

Citation

Wu, L., Chuang, Y., & Joung, Y. (2008). Contextual multi-dimensional browsing. Computers in Human Behavior, 24(6): 2873–2888

Summary

Browsing with multi-faceted categorization may offer too many choices and too much browsing freedom, which can disorient users and make information gathering more difficult. This paper takes context into account for multi-faceted categorization to help organize facets.

An experiment was carried out with two interfaces: the ROAD interface, which is basic, and the COAD interface, where attributes (facets) were arranged according to their contextual importance.

Context scenarios are constructed from three factors: family size (single person, extended family); revenue/expenditure ratio (low, high); and purchase preference (luxury, speed). These three two-valued factors combine into 8 possible scenarios. Two of them were removed, leaving 6 scenarios.

The importance of each facet under each context scenario was derived from how frequently the facet was clicked.
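As a sketch of the two steps described above, the snippet below enumerates the scenario combinations and ranks facets by click frequency within each scenario. The facet names and the click log are invented for illustration, and ranking by raw click counts is an assumption about how "importance" might be computed, not the authors' exact method.

```python
from collections import Counter, defaultdict
from itertools import product

# The three two-valued context factors named in the summary.
FAMILY_SIZE = ["single person", "extended family"]
REVENUE_EXPENDITURE = ["low", "high"]
PREFERENCE = ["luxury", "speed"]

all_scenarios = list(product(FAMILY_SIZE, REVENUE_EXPENDITURE, PREFERENCE))
print(len(all_scenarios))  # 2 * 2 * 2 combinations


def rank_facets(click_log):
    """click_log: iterable of (scenario, facet) pairs. Returns facets per scenario, most clicked first."""
    counts = defaultdict(Counter)
    for scenario, facet in click_log:
        counts[scenario][facet] += 1
    return {
        scenario: [f for f, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))]
        for scenario, c in counts.items()
    }


# Hypothetical click log with made-up facet names.
s1 = ("single person", "low", "speed")
s2 = ("extended family", "high", "luxury")
log = [(s1, "price"), (s1, "engine"), (s1, "engine"), (s2, "seats")]
print(rank_facets(log)[s1])
```

An interface in the style of COAD could then present facets in the returned order for whichever scenario the current user falls into.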

A user experiment found that the COAD interface, where facets were arranged according to contextual importance, could promote ease of information access, confidence in decision making, and ease of browsing.

Posted in Uncategorized

An unsupervised approach to automatic classification of scientific literature utilizing bibliographic metadata

Posted on March 15, 2012 by xliu

Joorabchi, A. and Mahdi A. E., An unsupervised approach to automatic classification of scientific literature utilizing bibliographic metadata, Journal of Information Science, 37, 5 (2011), 499–514

This article describes an unsupervised approach for automatic classification of scientific literature archived in digital libraries and repositories according to Dewey Decimal Classification scheme.

It gives a good summary of recent research in this field and categorizes the work into two main groups: ML-based systems and string-matching-based systems.

This article proposes the Bibliography Based ATC (BB-ATC) approach which solely relies on the subject classification metadata of the publications citing either the document to be classified or one of its references, as catalogued in the OPACs of conventional libraries, in order to probabilistically infer the most appropriate class for the document.

To demonstrate the application and evaluate the classification performance of the proposed BB-ATC approach, a prototype ATC system for automatic classification of scientific literature archived in the CiteSeer digital library is developed.

The above figure shows an overview of the ATC system. The system is effectively a metadata generator comprising a pre-processing, a data mining, and an inferring unit.

The task of the data mining unit is twofold. (1) It uses the Google Books Search (GBS) engine to compile a list of publications that either cite the document to be classified or one of its references. This is done by submitting a number of URL queries to the GBS engine in the following format:

http://www.google.com/books/feeds/volumes?max-results=20&q=%22[title]%2C%22
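A small sketch of how such a query URL could be assembled from a title is shown below. The function name is hypothetical, the endpoint is simply the one quoted in this post (it may no longer be served), and no request is actually sent.

```python
from urllib.parse import quote

# Endpoint as quoted in the post; its current availability is not assumed.
GBS_FEED = "http://www.google.com/books/feeds/volumes"


def gbs_query_url(title, max_results=20):
    """Build the phrase query: the title followed by a comma, wrapped in double quotes."""
    phrase = quote('"' + title + ',"', safe="")  # encodes quotes, comma, and spaces
    return f"{GBS_FEED}?max-results={max_results}&q={phrase}"


print(gbs_query_url("Data mining"))
```

Percent-encoding the whole phrase keeps titles containing characters such as `&` or `?` from being misread as URL syntax.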

(2) It processes the metadata records of the publications in the pool to extract their corresponding ISBNs. These ISBNs are then used as unique identifiers for the publications to query the WorldCat database for their corresponding metadata records. The latter process is done by submitting the following URL query to the WorldCat Search API for each ISBN:

http://www.worldcat.org/webservices/catalog/content/isbn/[ISBN]%3Fwskey%3D[key]%3Dfull

The whole process is shown in the figure below.

The inference process derives the un-normalized local frequency, normalized local frequency, and global frequency weights for each unique DDC number in the pool. These are then used to compute an overall weight for each DDC number.

A classification hierarchy tree is built to select the DDC number for the document.

The developed ATC system was evaluated using a test corpus of 1000 scientific documents, and the classification results were presented and analysed with the aim of quantifying the prediction performance of the system and identifying the factors influencing its performance. The authors report micro-averaged values of 0.84, 0.78, and 0.81 for the overall precision, recall, and F1 measures of their system, respectively, and provide a relative comparison between its performance and that of similar reported systems.
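As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 value can be recomputed from the other two figures. The function below is the standard formula, not code from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported micro-averaged precision and recall.
print(round(f1_score(0.84, 0.78), 2))  # 0.81, matching the reported F1
```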

A Novel Approach to Visualizing and Navigating Ontologies

Posted on February 17, 2012 by mzarro

Motta, E., Mulholland, P., Peroni, S., d’ Aquin, M., Gomez-Perez, J., Mendez, V., & Zablith, F. (2011). A novel approach to visualizing and navigating ontologies. The Semantic Web–ISWC 2011, 470–486.

This work reports on a visualization tool, KC-Viz, for navigating ontologies in a “middle-out” approach, starting from information-rich nodes. Results from a preliminary empirical study suggest that KC-Viz provides increased performance for users in real-life tasks. It addresses the problem of visualizing a large ontology on relatively limited screen real estate:

On the one hand the information on display needs to be coarse-grained enough to provide an overview of the ontology… On the other hand, an exploration process needs to be supported, where the user can effectively home in on parts of the ontology, thus changing the level of analysis, while at the same time not losing track of the overall organization of the ontology.

Using the key concept extraction (KCE) algorithm, global and local importance of concepts is measured, allowing them to be displayed in the interface. This has the effect of providing a user of the ontology a way to gain an overview of its domain. The functionality can be particularly beneficial for new users of an ontology, a user attempting to choose from several competing ontologies, or a user creating new ontologies (and thus avoid replicating what already exists).

The authors conclude that the KC-Viz system supports “information foraging” in an ontology, and it was found to be particularly helpful for first-time users of an ontology. In a controlled experiment, users were able to learn the domain more efficiently than those who used other ontology navigation systems: the NeOn Toolkit without visualization support, or Protégé with the OWLViz plugin.

The main positive subjective comments about the system were: “flexible support provided by KC-Viz to manipulate the visual displays; the abstraction power enabled by the KCE algorithm; and the value of the subtree summaries provided by KC-Viz.” The main negative comments were: “criticism of the tree layout algorithm used by KC-Viz, which does not allow display rotation and at times generates overlapping labels; the lack of transparency of the KCE algorithm, which does not allow the user to configure it, or to clarify why a node is considered more important than others; and the lack of integration between KC-Viz and reasoning/query support”

As the MCD project will include many thesauri and ontologies, it is likely we will need to generate overviews of the various domains covered. KC-Viz is a candidate technology to further this goal. We should take note of their findings that users benefited from an overview of key-concepts, and take into account the subjective user findings for the user-experience of the MCD user interface.

Posted in Displays, Ontology | Tagged Visualization

User-centered indexing

Posted on November 7, 2011 by mzarro

Fidel, R. (1994). User-centered indexing. Journal of the American Society for Information Science, 45(8), 572-576.

User-centered indexing: requires that we index in a way that reflects the approach users would take to find a document.
Document-centered indexing: indexing, like abstracting, creates surrogates for documents.
Purpose: to represent the content or features of a document:
“Aboutness” what is the document about?
Generally poor inter-indexer agreement.
- Process: content-analysis of the document to select concepts that represent it.
- Translation, expressing the concepts in the indexing language.
  Requires rules (policy):
  a. Sources of terms (controlled vocabulary, other?)
  b. Specificity (how narrow or broad)
  c. Weights (reflect the importance of the concept)
  d. Accuracy (how to translate when there is no equivalent?)
  e. Degree of precombination (decide to use pre or post)
  f. User language (assign terms approximate to the users)
  – and some content analysis policies –
  g. Exhaustivity (how comprehensive)
  h. Indexable matter (what parts of the doc should be represented)

Request-Oriented – Soergel (1985):
Checklist indexing: check each document against the descriptors in a vocabulary (but this is costly and time-consuming). A classified structure for indexing can help improve the efficiency of this approach.

Automated approach: computerized
- dynamic
- objective and consistent
- natural language requests, relevance feedback, ranked output, query expansion
- “indexing and searching are two sides of the same coin”

Posted in Concepts and Theories, Vocabularies | Tagged user, vocabularies

A Faceted Classification as the Basis of a Faceted Terminology: Conversion of a Classified Structure to Thesaurus Format in the Bliss Bibliographic Classification, 2nd Edition

Posted on October 8, 2011 by mzhang

Citation: Vanda Broughton. A Faceted Classification as the Basis of a Faceted Terminology: Conversion of a Classified Structure to Thesaurus Format in the Bliss Bibliographic Classification, 2nd Edition. Axiomathes. 2008, 18(2): 193-210

This paper develops a faceted thesaurus based on a faceted classification system. It describes the creation of a thesaurus using the Bliss Bibliographic Classification Second Edition (BC2) as the starting point. It examines various aspects of the relationship between the two forms of vocabulary and how these might be resolved to create a truly integrated and interdependent structure that can be managed to some extent mechanically.

A regular thesaurus uses relationship tags such as UF (Used For), USE, BT (Broader Term), NT (Narrower Term), and RT (Related Term) to control and represent the relationships between vocabulary terms. To build a faceted thesaurus, term relations must be detected in the hierarchical structure and converted into these tags.

1. For relationships in the Faceted Classification

(1) Broader Term (BT)/ Narrower Term (NT) Relationships

Facet structure provides paradigmatic BT and NT relationships. For BC2, every parent is the BT of each of its children. Every child is the NT of its parent.

(2) Related Terms (RT) Derived from Hierarchy

RTs are derived from terms in the same array (sibling terms), and terms not necessarily siblings, but at a coordinate level in the hierarchy.
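The derivation of BT, NT, and sibling RT links from a hierarchy can be sketched as follows. The input format (a parent-to-children mapping) and the sample terms are assumptions; the sketch covers sibling terms only and does not attempt the "coordinate level" case for non-siblings.

```python
def derive_relations(children):
    """children: dict mapping each parent term to the ordered list of its child terms."""
    relations = {}

    def entry(term):
        return relations.setdefault(term, {"BT": [], "NT": [], "RT": []})

    for parent, kids in children.items():
        entry(parent)
        for kid in kids:
            entry(parent)["NT"].append(kid)   # every child is an NT of its parent
            entry(kid)["BT"].append(parent)   # every parent is the BT of its child
            # terms in the same array (siblings) become related terms
            entry(kid)["RT"].extend(k for k in kids if k != kid)
    return relations


# Hypothetical mini-hierarchy, not taken from the BC2 schedules.
hierarchy = {"Animals": ["Mammals", "Birds"], "Mammals": ["Cats"]}
for term, rels in derive_relations(hierarchy).items():
    print(term, rels)
```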

2. Vocabulary Control in the Faceted Classification

(1) Equivalence Relationships

The conceptual structure of the classification collocates synonyms (equivalent terms within the same class heading or caption). They are converted as equivalence relationships.

For the most part the BC2 schedules observe some stylistic conventions for synonyms. Usually two or more synonymous terms are separated by a comma, and the evident preferred term is listed first.
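The comma convention lends itself to a simple mechanical conversion, sketched below. The function and the sample caption are hypothetical; because the convention holds only "for the most part", a real conversion would need manual review of captions where a comma does not separate synonyms.

```python
def equivalence_entries(caption):
    """Split a class caption on commas; treat the first term as preferred, the rest as synonyms."""
    terms = [t.strip() for t in caption.split(",") if t.strip()]
    if not terms:
        return []
    preferred, non_preferred = terms[0], terms[1:]
    entries = [(preferred, "UF", alt) for alt in non_preferred]    # preferred term: Used For
    entries += [(alt, "USE", preferred) for alt in non_preferred]  # synonym: Use preferred term
    return entries


# Hypothetical caption in the "preferred term first" style.
for entry in equivalence_entries("Adolescents, Teenagers, Youth"):
    print(entry)
```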

(2) Compound Terms

When building a thesaurus, it may be necessary to convert a single term to a compound phrase, because once the term is moved out of the hierarchical structure, more information is needed to identify or understand it. The rules for when compounds may be included in a thesaurus are very complex, but broadly embrace:

a. Adjectival phrases where one term generates a species or subclass of the other;

b. Compounds where the terms in combination mean something other than the sum of their parts, or where the parts are meaningless when separated;

c. The term is conceptually compound but represented by a single term.

Conclusion

It is clear that the faceted structure of the BC2 terminologies supports the generation of a compatible thesaurus in a number of ways. For example, it allows the precise identification of broader/narrower terms, and sibling and coordinate associative terms. Equivalence relationships are acknowledged through collocation of equivalent or near equivalent terms. BC2 has the potential for syntagmatic relationships to be automatically detected in a populated system.

While the structure is excellent for managing relationships, questions of vocabulary control are only now in the process of being addressed, and it is clear that more rigor must be introduced into the formatting of classes if terms are to be handled as effectively as concepts.

Posted in Vocabularies | Tagged FACET, Thesaurus

ThManager: an open source tool for creating and visualizing SKOS

Posted on October 7, 2011 by Catherine

Citation: ThManager: an open source tool for creating and visualizing SKOS. Javier Lacasta, Francisco Javier Lopez-Pellicer, Pedro Rafael Muro-Medrano, Javier Nogueras-Iso, and Francisco Javier Zarazaga-Soria. Information Technology and Libraries. 26.3 (Sept. 2007).

Also: http://thmanager.sourceforge.net/index.html

ThManager is an open source tool for creating and visualizing SKOS RDF vocabularies. SKOS is a W3C initiative for the representation of knowledge organization systems such as thesauri, classification schemes, subject heading lists, taxonomies, and other types of controlled vocabulary. ThManager facilitates the management of thesauri and other controlled vocabularies. The tool has been implemented in Java and has the following features:

Multi-platform (Windows, Unix). As it has been developed in Java and the storage of metadata records is managed directly through the file system, the application can be deployed in any platform with the minimum requirement of having installed a Java virtual machine.
Multilingual. The application has been developed following the Java internationalization methodology. Currently, there are Spanish and English versions. With little effort, other languages could be supported.
Selection and filtering of the thesauri stored in the local repository.
Description of thesauri by means of metadata in compliance with a Dublin Core based application profile for thesaurus. These metadata can be either visualized in HTML or edited through a form.
Visualization of thesaurus concepts. The visualization interface includes the following widgets:
- Alphabetic viewer: It provides the list of thesaurus concepts alphabetically ordered in the selected language.
- Hierarchical viewer: It provides a tree showing the hierarchical structure of thesaurus concepts.
- Concept viewer: For a selected concept it shows all the properties allowing additionally the navigation to the related concepts by means of hyperlinks.
- Search tool: It facilitates search of concepts. The searching process is based on preferred labels allowing the following criteria: “equals”, “starts with” and “contains”.
Editing of thesaurus content. The tool provides an editing interface to modify the content of a thesaurus: creation of concepts, deletion of concepts, and update of concept properties.
Exchange of thesauri according to SKOS format. The export operation includes the export of thesaurus metadata.
Extraction of related concepts in WordNet. It generates an automatic mapping of thesaurus concepts against the concepts of Wordnet lexical database.
On-line help by means of PDF visualization.
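The three search criteria listed for the search tool can be illustrated with a short sketch over a list of preferred labels. This is not ThManager's code (the tool itself is written in Java); the function name and the case-insensitive matching are assumptions.

```python
def search_concepts(pref_labels, query, criterion="contains"):
    """Return the preferred labels that match the query under the chosen criterion."""
    matchers = {
        "equals": lambda label, q: label == q,
        "starts with": lambda label, q: label.startswith(q),
        "contains": lambda label, q: q in label,
    }
    if criterion not in matchers:
        raise ValueError(f"unknown criterion: {criterion!r}")
    match = matchers[criterion]
    q = query.casefold()  # assumption: matching ignores case
    return [label for label in pref_labels if match(label.casefold(), q)]


labels = ["Water", "Waterfall", "Groundwater"]  # hypothetical preferred labels
for criterion in ("equals", "starts with", "contains"):
    print(criterion, search_concepts(labels, "water", criterion))
```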

The Lacasta et al. article provides an overview of many other thesaurus creation and management tools, including Lexico, MultiTes, and TemaTres, and presents the ThManager architecture and functionality.

Posted in Applications, Systems | Tagged Linked data, software, system, Thesaurus

SIS – TMS : A Thesaurus Management System for Distributed Digital Collections

Posted on October 7, 2011 by xliu

Doerr M., Fundulaki I. (1998). SIS-TMS: A thesaurus management system for distributed digital collections. Proc. 2nd European Conference on Digital Libraries (ECDL’98), (C. Nikolaou and C. Stephanidis eds.) Lecture Notes in Computer Science 1513, Springer-Verlag: Berlin, 215-234.

Introduction

The focus of this paper is to present methods and an actual system suited to store, maintain, and provide access to knowledge structures that are in use or needed for the respective auxiliary system interfaces and for the following three tasks:

- Guide the user from his/her naïve request to the use of a set of terms optimal for his purpose and for the characteristics of the target information source.

- Expand naïve user terms or the terms optimal for the purpose of the user into sets of terms optimal for each different information source.

- Classify all information assets of a certain collection with controlled vocabulary from a specific thesaurus.

This paper proposes the following requirements for Thesaurus Management:

(1) interaction with the thesaurus contents except manipulations, (2) maintenance, i.e. the manipulation of the contents and the necessary and desirable support of associated work processes, and finally (3) analysis, i.e. the logical structure needed to support (1), (2), and the thesaurus semantics in the narrower sense.

SIS-TMS

The SIS-TMS is a multilingual thesaurus management system and a terminology server for classification and distributed access to electronic collections following the above analysis. Its distinctive features are its capability to store, develop, display, and access multiple thesauri and their interrelations under one database schema, to create arbitrary graphical views of them, and to dynamically specialize any kind of relation into new ones. It further implements the version control necessary for cooperative development and for data exchange with other applications in the environment.

It originates in the terminology management system (VCS Prototype) developed by ICS-FORTH in cooperation with the Getty Information Institute in the framework of a feasibility study. It was enhanced within the AQUARELLE project, in particular by the support of multilinguality. An earlier version is part of the AQUARELLE product. A full product version was available summer ’98.

The SIS-TMS is an application of the Semantic Index System (SIS), a product of the Institute of Computer Science-FORTH. The SIS is an object-oriented semantic network database used for the storage and maintenance of formal reference information as well as for other knowledge representation applications. It implements an interpretation of the data model of the knowledge representation language TELOS, omitting the evaluation of logical rules.

This paper discusses the thesaurus structure from several perspectives, including assumptions on concepts, modeling thesaurus notions, intrathesaurus relations, representing multiple interlinked thesauri, and interthesaurus relations.

Collection Management Systems (CMS) such as digital libraries, library systems, and museum documentation systems will continuously change. The CMS can also propose new terms to the TMS. The TMS will be updated with new terms from many sides, and old concepts and terms may be renamed, revised and reorganized. The essential problem is to ensure and maintain consistency between the contents of the vocabularies in the underlying CMS and the contents of the Local Thesaurus Management Systems.

The user interacts with the SIS-TMS via its graphical user interface, which provides unconstrained navigation within and between multiple interlinked thesauri. The user can retrieve information from the SIS-TMS knowledge base using a number of predefined, configurable queries and receive the results in either textual or graphical form. Beyond providing graphical representations, an essential feature of the SIS-TMS is its ability to represent in a single graph any combination of relationships in arbitrary depth (see Fig. 1, central window). Updates in the SIS-TMS are performed through the Entry Forms in a task-oriented way (see Fig. 1, right window).
Fig.1. SIS-TMS User Interface, Browser and Data Entry facility.

Conclusion

The authors conclude that integrated terminology services in distributed digital collections are going to become an important subject and that the SIS-TMS provides a valuable contribution to it. It solves a major problem: the consistent maintenance of the necessarily central terminological resources between semiautonomous systems. The terminological bases themselves need not be internally distributed: access needs low bandwidth, read-only copies can easily be sent around at the given low update rates, and term servers can be cascaded. In the near future, the functionality of this system will be further enhanced to make its usability as wide as possible. Whereas there are several standards for thesaurus contents, no one has so far tried to standardize the three component interfaces: (1) Term Server to retrieval tools, (2) TMS to CMS, (3) TMS to Term Server. Because a distributed information system contains many components from many providers, these three interfaces must become open and standardized to make wide use a reality.

Posted in Systems | Tagged distributed, system, Thesaurus

Commercial Controlled Vocabulary Software Evaluation

Posted on October 4, 2011 by mzarro

Comparative evaluation of thesaurus creation software. (Hedden, 2008)

This article compares three commercial thesaurus creation and maintenance tools: MultiTes, Term Tree 2000, and WebChoir TCS-10. The author sets out requirements for thesaurus maintenance, taken from published standards, that the three tools meet: hierarchical relationships, associated term relationships, ‘used-for’ terms, and optional notes for each term. In addition, all tools support the creation of candidate and approved terms, polyhierarchies, and “disallowing illegal relationships (e.g. circular relationships).”

The following six evaluation measures are used to compare the three tools:

Thesaurus display:
- alphabetical and/or hierarchical.
Term editing and display:
- the user interface and controls for creating and maintaining terms.
Searching:
- ability to search for terms in the thesaurus
User-defined relationships and attributes:
- the ability to create relationships between terms, such as “broader,” “narrower,” or “related” term; and “use” or “used for.” Additionally, more advanced relationships may be defined by the user to support ontology creation. Terms used in the thesaurus may be categorized for use in a faceted taxonomy. For example, a category of named-entity terms may be useful.
Rules enforcement:
- the tool should support users and help them follow rules such as prohibiting orphan terms (those that have no broader or narrower terms).
Importing, exporting, and reports:
- does the tool support batch importing of existing thesauri, including their relationships? Does it allow easy exporting to a format that can be imported into other tools? Reports may be useful in multiple formats.
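The two rules mentioned above, prohibiting orphan terms and disallowing circular relationships, can be sketched as simple checks over a table of broader-term links. This is an illustrative sketch, not how any of the reviewed tools work internally; the data layout (each term mapped to a set of broader terms, which also allows polyhierarchies) and the sample terms are assumptions.

```python
def find_orphans(terms, broader):
    """Return terms that have neither a broader term nor any narrower term."""
    has_narrower = {bt for bts in broader.values() for bt in bts}
    return sorted(t for t in terms if not broader.get(t) and t not in has_narrower)


def creates_cycle(broader, term, new_broader):
    """True if making new_broader a broader term of term would create a circular relationship."""
    stack, seen = [new_broader], set()
    while stack:
        current = stack.pop()
        if current == term:
            return True  # term is already an ancestor of new_broader (or is new_broader)
        if current in seen:
            continue
        seen.add(current)
        stack.extend(broader.get(current, ()))
    return False


# Hypothetical thesaurus: Cats < Mammals < Animals, plus an unconnected term.
terms = ["Animals", "Mammals", "Cats", "Minerals"]
broader = {"Cats": {"Mammals"}, "Mammals": {"Animals"}}

print(find_orphans(terms, broader))
print(creates_cycle(broader, "Animals", "Cats"))
print(creates_cycle(broader, "Minerals", "Animals"))
```

A tool that enforces the rules would run the cycle check before accepting a new relationship, and the orphan check before accepting a deletion or as a report.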

Tools:

MultiTes Pro (http://www.multites.com/)

Thesaurus display:
- alphabetical display (hierarchical shown in reports).
Term editing and display:
- terms can be created and related to existing entries simultaneously
- relationships cannot be edited; they must first be deleted and new ones created
Searching:
- advanced search supported (i.e., search within a note, etc.)
User-defined relationships and attributes:
- does not support user-defined attributes
Rules enforcement:
- user can delete a term that has narrower terms, allowing for orphans (tool can report orphans however)
Importing, exporting, and reports:
- imports structured text files, exports and reports in many formats

Term Tree (www.termtree.com.au)

Thesaurus display:
- alphabetical / hierarchical view
Term editing and display:
- new terms can be created from existing
Searching:
- supported
User-defined relationships and attributes:
- addition of new relationships is not supported
- addition of categories/attributes is supported
Rules enforcement:
- user can delete a term that has narrower terms, allowing for orphans (tool can report orphans however)
Importing, exporting, and reports:
- cannot import XML.
- exports to many formats: Excel, CSV, XML
- many report types supported, including KWIC and similar

TCS-10 (www.webchoir.com)

Thesaurus display:
- hierarchical/alphabetical
Term editing and display:
- a unique implementation of “use” and “use for”
Searching:
- guided Boolean search (search within results, etc)
User-defined relationships and attributes:
- supports user-defined relationships/attributes
Rules enforcement:
- allow/disallow duplicates and orphans
Importing, exporting, and reports:
- supports many formats, MARC, XML, ASCII, etc…

Summary

There is no clear winner; all three tools fulfill the basic requirements, and each has pros and cons that depend on the user's needs. The author recommends MultiTes as a good value. Customization does not seem to be possible, and only TCS-10 appears to support ontology-like user-defined relationships and attributes.

Citation:
Hedden, H. (2008). Comparative evaluation of thesaurus creation software. The Indexer, 26(2), 50-59.

Posted in Vocabularies | Tagged software
