Sunday, July 13, 2008

Summary reaction to: Finding a Catalog: Generating Analytical Catalog Records from Well Structured Digital Texts.

Citation:

David Mimno, Alison Jones, and Gregory Crane. (2005) "Finding a Catalog: Generating Analytical Catalog Records from Well Structured Digital Texts." Proceedings of the 2005 Joint Conference on Digital Libraries, Denver, CO, June 7-11, 2005.

Analytical description is usually a luxury that only a few libraries can afford. Using well-structured XML full-text documents to generate analytical records would address the cost issue and provide greater access. Even with the prospect of so much more access at little or no extra cost, I'm not sure that it is the best method for every item in a collection.

The collection described in this article totaled 55 million words and was cataloged with 60,000 records, an average of roughly one record for every 900 words. Providing this level of access to a particular collection can be very helpful, but I question the balance if digitized collections were described in this way and then integrated into the larger catalog. I wonder whether there would be a negative impact on searching for items with few access points. Adding this level of access retrospectively would multiply the number of records in our catalogs many times over; would such growth negatively affect our current OPACs?

I think there is great potential in XML to help the library community generate cataloging records more efficiently. Still, I am struggling with the idea that extracting data from structured XML files and creating metadata records is cataloging. As the article describes it, someone adds TEI tags to the OCR text and then a fairly standard server processes the job. This process is not cataloging, and neither is the automated generation of subject headings. The article stated that 40,000 records had between 0 and 5 subject headings; I wonder how many had 0. Also, 8,000 records had over 30 subject headings. I am curious to know whether this process resulted in a system that provides better results than simple keyword access.
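
To make concrete what I mean by "extracting data from structured XML files," here is a minimal sketch in Python. It is not the authors' system. The sample document, the element paths, the `analytic_records` function, and the record fields are all my own assumptions for illustration, and the sample leaves out the XML namespace that current TEI documents declare. It walks a small TEI-like file and emits one bare-bones analytic record per section:

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like sample; real TEI files are richer, and TEI P5
# documents declare a namespace that this sketch omits for simplicity.
SAMPLE = """<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Collected Essays</title>
        <author>Jane Doe</author>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="essay">
        <head>On Catalogs</head>
        <p>Catalogs describe what a library holds.</p>
      </div>
      <div type="essay">
        <head>On Indexes</head>
        <p>Indexes point into a single work.</p>
        <p>They are narrower than catalogs.</p>
      </div>
    </body>
  </text>
</TEI>"""


def analytic_records(xml_text):
    """Return one simple record (a dict) per top-level <div> in the body."""
    root = ET.fromstring(xml_text)
    # Host-item details come from the header and are shared by every record.
    host_title = root.findtext("./teiHeader/fileDesc/titleStmt/title", default="")
    author = root.findtext("./teiHeader/fileDesc/titleStmt/author", default="")
    records = []
    for div in root.iterfind("./text/body/div"):
        # Count whitespace-separated words in the section, heading included.
        words = " ".join(div.itertext()).split()
        records.append({
            "title": (div.findtext("head") or "").strip(),
            "author": author.strip(),
            "host_title": host_title.strip(),
            "word_count": len(words),
        })
    return records


if __name__ == "__main__":
    for record in analytic_records(SAMPLE):
        print(record)
```

Everything in these records is copied or counted from the markup. Nothing here involves the judgment a cataloger applies, which is the distinction I am trying to draw.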
