Texas Digital Library Conference System, TCDL 2010

Scanning to PDFA: Building a digital collection for access AND preservation
Gail Clement, Derek Halling, Nancy Burford, Esther Carrigan, Heather K. Moberly

Building: AT&T Executive Education & Conference Center
Room: Salons A & B
Date: 2010-05-17 06:30 PM – 08:00 PM
Last modified: 2010-04-16

Abstract


The Texas A&M University Medical Sciences Library partnered with Oklahoma State University Libraries to digitize the Index-Catalogue of Medical and Veterinary Zoology, a multilingual periodical published by the US Government Printing Office. This series is a key resource, a historical compendium of the parasitological literature of importance to researchers in re-emerging diseases and global animal health. The compilation of content began in 1892 and resulted in over 100 separate publications comprising over 20,000 pages.

With generous grant support from the National Library of Medicine, the Library has digitized 67 publications as of March 10, 2010. This undertaking is intended as a demonstration project to encourage the digitization and preservation of veterinary grey literature.

Conversion methods involved high-resolution scanning of bound volumes and creation of archival master files in uncompressed TIFF format. Derivative versions of page image files were processed via optical character recognition (OCR) using multiple dictionaries to capture text in English, Spanish, French, German, Dutch, Greek and Russian. Each volume was recompiled as a single PDF file with text behind page image, and saved using the PDF/A-1b profile for archiving. Achieving PDF/A compliance was a challenge given the multiplicity of fonts required to represent the typefaces and character sets comprising this body of content. Specific solutions used to address the challenge of PDF/A compliance will be demonstrated.
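
As an illustration of the multi-dictionary OCR step, the following is a minimal sketch that assembles a command line for the open-source Tesseract OCR engine. The abstract does not name the OCR software the project used, so the choice of Tesseract, the helper name `build_ocr_command`, and the file names are assumptions made only for this example. The sketch builds the command without running it. Its output would be a searchable PDF with text behind the page image; converting that file to PDF/A-1b, including the font embedding the abstract describes as the main challenge, would be a separate step not shown here.

```python
# Sketch only: Tesseract is an assumed stand-in, not the tool named by the authors.
# Tesseract language codes for the seven languages listed in the abstract.
TESSERACT_CODES = {
    "English": "eng",
    "Spanish": "spa",
    "French": "fra",
    "German": "deu",
    "Dutch": "nld",
    "Greek": "ell",
    "Russian": "rus",
}


def build_ocr_command(tiff_path, output_base, languages):
    """Return an argv list asking tesseract to write <output_base>.pdf.

    Hypothetical helper: it only builds the command and does not run it.
    Multiple dictionaries are requested by joining language codes with '+'.
    """
    lang_arg = "+".join(TESSERACT_CODES[name] for name in languages)
    return ["tesseract", tiff_path, output_base, "-l", lang_arg, "pdf"]


if __name__ == "__main__":
    # Hypothetical file names for a single page image.
    cmd = build_ocr_command("page_0001.tif", "page_0001", list(TESSERACT_CODES))
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run` on a machine where Tesseract and the corresponding language data files are installed.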


Keywords


Digitization; Optical Character Recognition; PDF/A

Full Text: PDF