Texas Digital Library Conference System, 2015 Texas Conference on Digital Libraries

Expanding and Improving Access to Early English Books Online (EEBO)
Matthew J Christy, Elizabeth Grumbach, Laura Mandell

Last modified: 2015-03-19

Abstract


The Early-Modern OCR Project (eMOP), currently in its final phase at the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, is a Mellon-funded project tasked with developing open-source tools and techniques to improve Optical Character Recognition (OCR) outcomes for early modern printed documents. The basic premise of eMOP is to 1) use book history to identify the typefaces represented in the collections and the printers that used them; 2) train open-source OCR engines on those typefaces; and 3) OCR early modern document page images using an engine trained on the typefaces specific to those documents. As our dataset, eMOP is using the 45 million page images that make up the proprietary Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO) collections. To test the accuracy of our OCR results, we are comparing them against the approximately 48,000 documents hand-transcribed by the Text Creation Partnership (TCP).
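The abstract does not say how OCR output is scored against the TCP transcriptions. As an illustration only, one common approach is character-level accuracy based on edit distance. The sketch below assumes plain-text input for both the OCR output and the hand transcription; the function names `edit_distance` and `character_accuracy` are hypothetical and are not part of any eMOP tool.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Share of ground-truth characters the OCR got right, floored at 0.
    Assumption: accuracy = 1 - (edit distance / ground-truth length)."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))


# Example: an engine that misreads the long s (ſ) as "f".
score = character_accuracy("Bleffed", "Bleſſed")
print(f"{score:.3f}")
```

In this example two of the seven ground-truth characters are wrong, so the score is 1 - 2/7. A real evaluation would also need to decide how to treat whitespace, line breaks, and normalized spellings before comparing.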

Over the last two years the eMOP team has become intimately familiar with the documents and metadata that make up the ECCO, EEBO, and TCP collections. As members of the IDHMC we also have access to the tools and websites of the Advanced Research Consortium (ARC), housed at the IDHMC, specifically ARC’s 18th-century resource aggregator for scholarship, 18thConnect, and 18thConnect’s crowd-sourced transcription correction tool, TypeWright. As such we are in a unique position to provide access to these scholarly resources in new and exciting ways. We are proposing to develop a poster to share these advancements with the members of the TCDL at this April’s conference. The poster will cover:

  • The addition to 18thConnect of EEBO collection metadata. This will allow EEBO metadata to be searched outside of ProQuest’s EEBO paywall.

  • The addition to 18thConnect of Phase I of the TCP’s hand-transcriptions of EEBO. This will make 18thConnect the only place where scholars can run a full-text search across all of the almost 33,000 transcriptions released with Phase I.

  • The addition to TypeWright of the OCR transcriptions of EEBO created by eMOP. This will be the first time this collection has been OCR’d, and TypeWright will be the only place these transcriptions are available outside of the ProQuest EEBO paywall.

  • The addition to TypeWright of the Phase I TCP transcriptions, allowing scholars to further correct these transcriptions in a public and collaborative way.

  • The ability of scholars to obtain their corrected transcriptions of EEBO documents in text or XML format for use in their scholarship. This is another unique feature of 18thConnect, made available via our contracts with ProQuest (and with Gale for the ECCO collection as well).

We think all of these developments will be of interest to digital librarians, and are eager to share them in April.



Keywords


ocr; transcriptions; early modern; eebo; tcp