Texas Digital Library Conference System, 2015 Texas Conference on Digital Libraries

Beyond the Early Modern OCR Project
Matthew J Christy, Elizabeth Grumbach, Laura Mandell

Last modified: 2015-03-17

Abstract


The Early Modern OCR Project (eMOP) is a Mellon Foundation grant-funded project at the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University. eMOP’s goal is to improve optical character recognition (OCR) output for early modern printed English-language texts by creating and using open-source tools and workflows. In addition to establishing an OCR workflow infrastructure, eMOP has produced several open-source post-processing tools to evaluate and improve the text output of Google’s Tesseract OCR engine. Work on eMOP is expected to conclude this summer, and the team is now looking beyond eMOP towards sharing its accrued knowledge and tools.


As a Mellon Foundation grant-funded project, eMOP is tasked with sharing the results of its work whenever possible. This is in line with the IDHMC’s stated goal of helping Humanities scholars conduct digital research and create digital outcomes of their research. We are therefore pursuing several methods of disseminating the products of our work.

  • We are creating open-source code repositories for all software created by, and for, eMOP.

  • We are creating an open-source repository of all eMOP typeface training created for the Tesseract OCR engine.

  • We are creating a publicly available database of early modern printers, publishers, and booksellers based on the imprint metadata of the entire Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO) proprietary collections.

  • We are making the recently released Phase I hand-transcriptions of EEBO by the Text Creation Partnership (TCP) available for full-text searching via the Advanced Research Consortium’s (ARC’s) 18thConnect website.

  • We are making the first OCR transcriptions ever produced of the entire EEBO catalog available via 18thConnect’s online crowd-sourced transcript correction tool, TypeWright. TypeWright will provide free access to the EEBO transcriptions, and will give a text or XML version of the corrected transcription to anyone who corrects an entire document.


In addition, the eMOP team is committed to continuously improving the accuracy and robustness of our workflow. We are currently discussing, or actively engaged in, partnerships with teams at Notre Dame, Penn State, and the University of Texas to apply eMOP’s workflow to different collections. These partnerships will allow us to improve eMOP by:

  • adding more OCR engines to our workflow alongside Tesseract, the engine currently in use;

  • expanding our collected dictionaries beyond the early modern English dictionaries currently used with eMOP;

  • expanding our database of Google 3-grams beyond the early modern period to aid post-processing OCR correction of documents from other periods;

  • expanding our printers & publishers database to include data from outside of the ECCO and EEBO collections.


We are proud of the work we have done with eMOP and are eager to continue to find ways to build upon what we have accomplished. We feel that much of our work would be of interest to libraries and librarians. We look forward to sharing the outcomes of eMOP and our vision for future work with the participants at TCDL this April.



Keywords


ocr; digitization; early modern; eebo; ecco; tcp; emop

Full Text: Slideshow