Texas Digital Library Conference System, 2014 Texas Conference on Digital Libraries

Flowcharting a Course Through Open-Source Waters, an eMOP guide to OCR
Matthew J Christy

Last modified: 2014-03-14

Abstract


The Early Modern OCR Project (eMOP), a grant project funded by the Andrew W. Mellon Foundation and run out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, intends to use font and book history techniques to train modern Optical Character Recognition (OCR) engines. eMOP’s immediate goal is to make machine-readable, or improve the readability of, 45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). More generally, eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is inefficient for scholarly research (Mandell, 2013).


Now in year two, eMOP is turning towards one of its main goals: to produce a workflow, published in Taverna, for use by individuals and institutions with similar projects. Matthew Christy and Liz Grumbach, eMOP Co-Project Managers for Year Two, will present a series of interconnected workflows that represent the work being done by eMOP and show how that work will benefit the library community and the larger academic community. Our presentation will include flowcharts covering:

  • Wrangling the eMOP data and metadata. Our data set consists of the 45 million pages that make up the Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO) commercial databases, as well as over 46,000 hand-transcribed texts from the Text Creation Partnership (TCP). We have created our own DB and query/download tools to manage and access that data.

  • The eMOP Font History database being created. This DB is based on parsing the natural-language imprint lines of every document in EEBO.

  • Training Tesseract. We have developed our own tools and methods to optimize training of Google’s open source OCR engine Tesseract for work on pre-modern printed texts.

  • The eMOP controller. The controller is a software process that manages the work from OCR’ing through scoring of results.

  • The eMOP post-processing process. This process will score OCR results per page, and then decide which of two post-processes to route each page through. Pages that score well will be routed for further correction. Pages that score badly will be routed to a triage system, which will determine what is causing the page to fail OCR’ing and tag it for appropriate pre-processing to rectify the problems before later re-OCR’ing.

  • The eMOP post-processing scoring method.

  • The process for training eMOP’s triage system’s machine learning applications.
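
The score-then-route step described in the post-processing item above can be illustrated with a minimal sketch. This is not eMOP's code: the names PageResult, route_page, and route_batch, the 0-to-1 score scale, and the 0.5 cutoff are all assumptions made for illustration, since the abstract does not say how scores are computed or where the cutoff lies.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the abstract does not state how pages are scored
# or what score separates "well" from "badly".
SCORE_THRESHOLD = 0.5


@dataclass
class PageResult:
    """OCR outcome for one page (illustrative structure, not eMOP's schema)."""
    page_id: str
    score: float  # assumed scale: higher means better OCR output


def route_page(page: PageResult, threshold: float = SCORE_THRESHOLD) -> str:
    """Return which of the two post-processes a page should be sent to."""
    if page.score >= threshold:
        return "correction"  # page scored well: route for further correction
    return "triage"          # page scored badly: diagnose, then re-OCR later


def route_batch(pages):
    """Group page ids by the post-process each page was routed to."""
    routes = {"correction": [], "triage": []}
    for page in pages:
        routes[route_page(page)].append(page.page_id)
    return routes
```

Calling route_batch on a list of PageResult objects returns a dictionary with two lists of page ids, one per post-process; in a real pipeline the triage branch would also record the diagnosed cause of failure.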


We will conclude with information on where to find out more about eMOP, as well as our open source code and workflows.



Keywords


digitization, ocr, open source, flowchart

Full Text: Slideshow