Texas Digital Library Conference System, TCDL 2012

Enhancing a Digital Repository with Objects’ Embedded Metadata
Serhiy Polyakov

Last modified: 2012-04-16


This poster describes techniques for enhancing the discovery and presentation of digital objects using embedded metadata extracted from documents submitted to a digital repository.

The goal of this ongoing project at the Texas Center for Digital Knowledge is to develop a repository solution for the management, discovery, and presentation of heterogeneous digital objects (text documents in various formats, spreadsheets, presentations, images, archives, etc.). The target users are organizational departments at all levels, research project teams, and individual researchers. These users are also the submitters to the repository, and they are not expected to supply rich descriptive metadata with the objects they submit. This makes it important to index not only the full text but also the metadata embedded in the documents, which digital repositories rarely utilize.

We have chosen a set of open source solutions to implement the digital repository system: the Fedora Commons repository platform, the Drupal content management system, and the Islandora framework, together with additional services: the Apache Solr/Lucene search server, the Fedora Generic Search service, and the Apache Tika content extraction toolkit. We have developed a technique that, on submission, extracts the values (where present) of the following embedded metadata elements from the documents: Title, Author, Last Author, Company, Description, Content Type, Image Resolution, Page/Slide/Paragraph/Word Count, Revision Number, Date Created/Modified/Digitized/Printed, Protection Status, and others. This embedded metadata augments the full text, the descriptive metadata supplied by submitters, and the technical metadata in the search and presentation of the digital objects.
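As an illustration of extraction on submission, the following is a minimal sketch and not the project's implementation. The project relies on Apache Tika, which handles many formats; this sketch swaps in the Python standard library and covers only Office Open XML files (.docx, .xlsx, .pptx), reading their embedded property parts. The function names, the index field names, and the `embedded_` prefix are assumptions made for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the OOXML core and extended property parts.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties",
}

# Hypothetical index field name -> (package part, element path within that part).
FIELDS = {
    "title": ("docProps/core.xml", "dc:title"),
    "author": ("docProps/core.xml", "dc:creator"),
    "last_author": ("docProps/core.xml", "cp:lastModifiedBy"),
    "description": ("docProps/core.xml", "dc:description"),
    "revision": ("docProps/core.xml", "cp:revision"),
    "date_created": ("docProps/core.xml", "dcterms:created"),
    "date_modified": ("docProps/core.xml", "dcterms:modified"),
    "company": ("docProps/app.xml", "ep:Company"),
    "page_count": ("docProps/app.xml", "ep:Pages"),
    "word_count": ("docProps/app.xml", "ep:Words"),
}


def extract_embedded_metadata(data: bytes) -> dict:
    """Return only the embedded metadata values actually present in the file."""
    found = {}
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        names = set(package.namelist())
        roots = {}  # parsed property parts, keyed by part name
        for field, (part, path) in FIELDS.items():
            if part not in names:
                continue  # property parts are optional; skip missing ones
            if part not in roots:
                roots[part] = ET.fromstring(package.read(part))
            element = roots[part].find(path, NS)
            # Keep the value only if the element exists and is not blank.
            if element is not None and element.text and element.text.strip():
                found[field] = element.text.strip()
    return found


def build_index_document(submitter_metadata: dict, embedded: dict) -> dict:
    """Augment submitter-supplied metadata with embedded values.

    Embedded values get an "embedded_" prefix (an assumption of this sketch)
    so they never overwrite what the submitter provided.
    """
    document = dict(submitter_metadata)
    for field, value in embedded.items():
        document["embedded_" + field] = value
    return document
```

A submission handler could call `extract_embedded_metadata` on the uploaded bytes and pass the result of `build_index_document` to the search index. Keeping the embedded values in separate fields lets the search and display layers decide how much weight to give them relative to submitter-supplied metadata.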


digital repository; indexing methods; embedded metadata; search process