Datasets

To improve the effective use of metadata and to demonstrate the applicability and scalability of our proposed techniques, we will apply them to a heterogeneous set of metadata datasets. These datasets represent various domains and were generated by different methods. To provide contrast, we have selected two repositories from the list provided on the DiD Challenge website, the National Science Digital Library (NSDL) and the Digital Library for Earth System Education (DLESE), and will also include two of our own repositories, ipl2 and Intute. This project will link our repositories (ipl2 and Intute) to NSDL and DLESE without creating a formal mapping or crosswalk.

The Internet Public Library (IPL) was developed by Dr. Joe Janes and students at the University of Michigan’s iSchool in 1995; Janes (1998) describes its history. It is available throughout the world as a training tool for library and information science (LIS) programs and as a free resource for the general public, and it plays a major role in preparing a new generation of information professionals. In January 2007, the IPL was moved to Drexel’s College of Information Science and Technology. In 2008, the Librarians’ Internet Index (LII) was migrated to the iSchool at Drexel with the intent of merging the two repositories, and the name was changed to ipl2. ipl2 includes selected high-quality resources across twelve subject areas (including the social sciences and the arts and humanities) that are geared toward the general public and student users. The metadata records have been created by graduate student volunteers who receive some training and oversight, but the repository lacks consistent quality control. The LII included selected high-quality resources with high-quality metadata generated by librarians. As part of the merger, it was decided to crosswalk all existing IPL and LII metadata to Dublin Core. In the process, the implementation team encountered a number of ‘metadata bottleneck’ and associated issues, rooted in the original metadata formats and in the individual sociotechnical and organizational backgrounds of IPL and LII (Khoo and Hall, 2010). Much work was required to understand these backgrounds in order to manipulate and share the metadata.

Intute (http://www.intute.ac.uk/) was formed in 2007 by a merger of the eight former Resource Discovery Network (RDN) hubs. It is a collaborative project, based at Mimas, The University of Manchester, that provides a catalogue of high-quality Web resources for students and researchers in UK higher education. Intute contains high-quality metadata records created by information professionals and covers nineteen subject areas, including the humanities and social sciences. In 2010, representatives of Mimas and the iSchool at Drexel began discussions about a possible merger of ipl2 and Intute. Funding for Intute will cease in July 2011, and staff at Mimas are seeking alternative homes and uses for the data. Mimas is currently engaged in several other projects addressing these technical and organizational problems.

NSDL is a very large repository that provides organized access to high-quality resources for science, technology, engineering, and mathematics (STEM) education at the K-12 and undergraduate levels. NSDL provides a centralized metadata repository that contains mixed-quality metadata harvested from various NSDL collections. For NSDL, metadata creation is a decentralized practice: partner projects create records through both manual and automatic processing, and the records are aggregated through automatic metadata harvesting using OAI-PMH. In contrast, DLESE enforces a unified metadata creation procedure and involves information professionals and trained personnel in creating metadata records. DLESE, a large repository, serves a more focused domain (earth systems) and maintains a relatively high-quality metadata repository, relying on community members to create metadata that is submitted via community tools and quality controlled by metadata experts. Testing our methods on both NSDL and DLESE will help establish their applicability across repositories with different levels of metadata quality. [The DLESE metadata is also available through NSDL.]
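For readers unfamiliar with OAI-PMH harvesting, the following is a minimal sketch of how a harvester might build a ListRecords request and read unqualified Dublin Core fields from the response. It is an illustration only, not the harvesting pipeline used by NSDL or by this project. The base URL, the sample record, and the function names list_records_url and parse_records are hypothetical; the verb, the oai_dc metadata prefix, and the XML namespaces come from the OAI-PMH 2.0 specification.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and the Dublin Core element set.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract identifier, titles, and subjects from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        if dc is None:
            # Deleted records carry a header but no metadata element.
            continue
        records.append({
            "identifier": identifier,
            "title": [e.text for e in dc.findall("dc:title", NS)],
            "subject": [e.text for e in dc.findall("dc:subject", NS)],
        })
    return records


# Hypothetical response used only to exercise the parser.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Plate tectonics</dc:title>
          <dc:subject>Earth science</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://example.org/oai")  # hypothetical endpoint
records = parse_records(SAMPLE)
```

A real harvester would also follow resumptionToken elements to page through large result sets; that step is omitted here for brevity.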

Because this project will use automatic algorithms to detect and establish semantic links among concepts and metadata records, large datasets offer several advantages. The algorithms we plan to use include clustering techniques, latent semantic indexing techniques, and other statistical learning and self-organizing algorithms. These methods estimate associations among terms and records from co-occurrence patterns, so larger repositories give them more evidence to work from and make their results more reliable. Large repositories will also enhance the quality of the recommended tags and linkages that the algorithms generate.
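To make the idea of statistically derived links concrete, the sketch below proposes links between records in two repositories without any crosswalk. It uses plain TF-IDF weighting with cosine similarity, a simpler stand-in for the clustering and latent semantic indexing techniques named above rather than an implementation of them. The record texts, the threshold value, and the function names are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF keeps weights positive for terms found in every record.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_links(source, target, threshold=0.2):
    """For each source record, propose its most similar target record.

    Returns (source_index, target_index, score) tuples for pairs whose
    similarity reaches the (assumed) threshold.
    """
    vectors = tfidf_vectors(source + target)
    src_vecs, tgt_vecs = vectors[:len(source)], vectors[len(source):]
    links = []
    for i, sv in enumerate(src_vecs):
        scores = [cosine(sv, tv) for tv in tgt_vecs]
        if not scores:
            continue
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            links.append((i, j, scores[j]))
    return links


# Invented record descriptions standing in for two repositories.
source_records = [
    "Plate tectonics and earthquakes: an introduction for students",
    "History of Renaissance painting in Italy",
]
target_records = [
    "Earthquakes and plate tectonics classroom activity",
    "Algebra practice problems for undergraduate mathematics",
]

links = suggest_links(source_records, target_records)
```

With these four invented records, only the two earthquake-related descriptions are linked; the Renaissance record shares no terms with either target record and is left unlinked. On such a tiny sample the weights are dominated by chance, which is one reason large repositories matter for this kind of method.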

Table 1 summarizes features of the selected metadata repositories.

Table 1. Features of the selected datasets
