Texas Digital Library Conference System, 2014 Texas Conference on Digital Libraries

Transforming Access to Texts with 18thConnect and TypeWright
Elizabeth Grumbach

Last modified: 2014-03-14

Abstract


18thConnect is a digital aggregator and virtual research environment (VRE) for researchers of the eighteenth century. As part of a larger community of VREs, all organized under the Advanced Research Consortium (ARC) and based on the NINES (Networked Infrastructure for Nineteenth-Century Electronic Scholarship) model for peer review and scholarship, 18thConnect has to tackle issues relevant to its period-specific research community. To that end, the TypeWright application was built for the 18thConnect platform to provide an easily accessible, crowdsourced correction tool for eighteenth-century texts.

The TypeWright tool was designed to address problems with Optical Character Recognition (OCR) of early printed texts, specifically those in Gale/Cengage Learning’s Eighteenth-Century Collections Online (ECCO) subscription database, and so to provide accurate text for full-text searching, data mining, and the creation of digital scholarly editions. Because these texts were photographed, microfilmed, and then digitized over a period of 40 years, the quality of the resulting images degrades the OCR text output. In addition, early printing conventions, especially early typefaces, and poor paper quality cause OCR engines to misrecognize the word images on a page. To foster the sustainability and use of these texts in scholarship, TypeWright enables users to correct texts by hand, then save and share their corrections with the 18thConnect community.

For this poster presentation, I will focus on the following three aspects of the TypeWright tool:

1. Correcting a text in TypeWright, or, briefly explaining the accessible user interface.

When users access the 18thConnect site, they can search for “TypeWright-enabled” texts, which currently comprise the 183,000 documents contained in ECCO. Once a user has selected a text, they are taken into the editing interface, which displays snippets of the page image above a text editing box used for transcription. The editing box already contains the text generated by a previous OCR process, so the user can either edit that text or confirm that it is correct.

2. Liberating a text in TypeWright, or, how users can request full text and XML for a document after completing correction.

After a user, or a group of users working collaboratively, has finished correcting a document, the work is reviewed by TypeWright administrators. If the work passes the evaluation process, the user(s) can receive the corrected text as plain text or as XML/TEI-encoded files. If the work fails evaluation (which is rare), users are instructed to look for common “correction” mistakes and fix them.

3. Using a text after TypeWright correction, or, the benefit of crowdsourcing correction for the academic community.

Once users have received their corrected text files, 18thConnect administrators advise them to use this data in a digital project and then submit that project to 18thConnect for peer review. In addition, per our agreements with Gale/Cengage Learning, the corrected text returns to the ECCO database to improve the searchability of this proprietary product, which constitutes an important resource for the eighteenth-century scholarly community.
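The edit-or-confirm step described under the first aspect, and the completion check that precedes administrator review under the second, can be sketched in a few lines of Python. This is only an illustrative model, not TypeWright's actual implementation: the names `Line`, `confirm`, `edit`, `is_complete`, and `plain_text` are hypothetical, the sample text is invented, and the rule that every line must be edited or confirmed before export is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    """One line of a page: the OCR guess plus the user's decision (hypothetical model)."""
    ocr_text: str
    corrected_text: Optional[str] = None  # None until a user edits or confirms

    def confirm(self) -> None:
        """Accept the OCR text as already correct."""
        self.corrected_text = self.ocr_text

    def edit(self, new_text: str) -> None:
        """Replace the OCR text with the user's transcription."""
        self.corrected_text = new_text

    @property
    def reviewed(self) -> bool:
        """True once the line has been either edited or confirmed."""
        return self.corrected_text is not None


def is_complete(lines: List[Line]) -> bool:
    """Assumed rule: a document is complete once every line has been reviewed."""
    return all(line.reviewed for line in lines)


def plain_text(lines: List[Line]) -> str:
    """Export corrected plain text; refuse if any line is still unreviewed."""
    if not is_complete(lines):
        raise ValueError("document still has unreviewed lines")
    return "\n".join(line.corrected_text for line in lines)


# Invented sample data: a long s misread as "f" is a typical OCR error in early print.
page = [Line("the firft chapter"), Line("of the book")]
page[0].edit("the first chapter")  # the user fixes the misread word
page[1].confirm()                  # the OCR text was already correct
print(plain_text(page))
```

Keeping the OCR text and the corrected text side by side, rather than overwriting one with the other, makes it possible to tell a confirmed line from one that nobody has looked at yet.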

Keywords


archives, optical character recognition, open source, tools, access, virtual research environment, digital edition
