Texas Digital Library Conference System, TCDL 2013

Crowdsourcing + Machine Learning: Building an Application to Convert Scanned Documents to Text
Nicholas Joel Woodward

Last modified: 2013-03-21

Abstract


Widespread digitization initiatives and the concomitant explosion in digital corpora have redefined the roles of academic libraries in recent years. Many current efforts in the academic community focus on making digital content accessible and legible to mass audiences for a variety of purposes, and the transcription of scanned documents is an integral component of that work. Because of difficulties inherent in optical character recognition, most digital artifacts containing text exist only as scanned images that lack full-text search capabilities. In many cases researchers must resort to either manually entered metadata (generally infeasible with large-scale data) or crowdsourced input from users (only applicable on a per-item basis).

To this end, I have developed an application and workflow for the large-scale transcription of scanned artifacts. It combines limited user input (crowdsourcing) with machine learning on a high-performance computing cluster to recognize patterns of matching words across artifacts and mechanically transcribe them. The application consists of a collection of tightly integrated Java libraries that 1) mechanically segment scanned images into individual words, 2) use one of several pattern recognition algorithms available as open source code to match similar images (i.e., words) across the entire corpus, and 3) transcribe these words using either open source OCR software or an online crowdsourcing tool written in PHP.
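Steps 1 and 2 can be illustrated with a minimal sketch. It is written in Python for brevity rather than Java, and it is not the application's code: the projection-profile segmentation, the pixel-agreement similarity measure, and the names `segment_words`, `crop`, `similarity`, and `min_gap` are all assumptions chosen for illustration. The abstract does not say which segmentation or pattern recognition algorithms the application uses.

```python
from typing import List, Tuple

# A binarized text line: rows of pixels, 1 = ink, 0 = background (assumed format).
Image = List[List[int]]


def segment_words(line: Image, min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges of the words in a binarized text line.

    A run of at least `min_gap` ink-free columns is treated as a word break;
    shorter runs are assumed to be spacing between letters of the same word.
    """
    width = len(line[0]) if line else 0
    # Vertical projection profile: number of ink pixels in each column.
    profile = [sum(row[x] for row in line) for x in range(width)]

    words = []
    start = None      # first column of the word currently being built
    last_ink = None   # most recent column that contained ink
    for x, ink in enumerate(profile):
        if ink == 0:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            # The blank run since the last ink column is wide enough: close the word.
            words.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        words.append((start, last_ink + 1))
    return words


def crop(line: Image, span: Tuple[int, int]) -> Image:
    """Cut one word image out of the line, given its column range."""
    start, end = span
    return [row[start:end] for row in line]


def similarity(a: Image, b: Image) -> float:
    """Fraction of pixels on which two word images agree; 0.0 if shapes differ."""
    if len(a) != len(b) or not a or len(a[0]) != len(b[0]):
        return 0.0
    total = len(a) * len(a[0])
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total
```

Given a line image, `segment_words` yields one column range per word, `crop` extracts the word images, and `similarity` scores a pair of them. A real corpus with variable scanning quality would need size normalization and a more tolerant matcher than exact pixel agreement.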

The Digital Archive of the Guatemalan National Police Historical Archive serves as the test case for the application. The collection has several characteristics that make it ideal for measuring the performance of the transcription process. Its large size (12+ million pages), range of document formats (forms for births, marriages and deaths, typed letters and handwritten journals), and variance in scanning quality all present unique challenges to manual transcription.

The presentation will describe in detail the process of developing the application and workflow, the challenges presented by the test collection and how they were met, and the preliminary results from the transcription process. It will focus on the following:

- Methods for segmenting and matching words

- Challenges to the proposed approach

- Mutually beneficial relationship between user input and machine learning

- Opportunities for future improvements and applications
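One way to picture the relationship between user input and machine learning is that a handful of crowdsourced transcriptions can be propagated to every word image the matcher has grouped with them. The sketch below assumes that matching produces pairs of word indices and that users submit (index, text) transcriptions. The function `propagate_labels`, the union-find grouping, and the majority vote are illustrative assumptions, not a description of the application itself.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def propagate_labels(
    n_words: int,
    matches: Iterable[Tuple[int, int]],
    user_labels: Iterable[Tuple[int, str]],
) -> Dict[int, str]:
    """Spread user transcriptions to all word images in the same match group.

    n_words     -- number of segmented word images, indexed 0..n_words-1
    matches     -- pairs of indices the matcher judged to be the same word
    user_labels -- (index, text) transcriptions supplied by users
    """
    parent: List[int] = list(range(n_words))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge matched word images into groups.
    for a, b in matches:
        parent[find(a)] = find(b)

    # Tally user transcriptions per group.
    votes = defaultdict(Counter)
    for index, text in user_labels:
        votes[find(index)][text] += 1

    # Assign each word the most frequent transcription in its group, if any.
    result = {}
    for i in range(n_words):
        root = find(i)
        if root in votes:
            result[i] = votes[root].most_common(1)[0][0]
    return result
```

Words in groups that no user has transcribed are left out of the result, which marks them as candidates for the next round of crowdsourcing or for OCR.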


Keywords


digital archives; scanned documents; OCR; crowdsourcing