Collection Size Descriptions as Archival Data: The Spectrum of physdesc
Sarah Buchanan, Haoyang Li

Last modified: 2014-03-25


This poster presents insight into the functional vocabulary with which repositories describe the physical extent of their collections. The structured standard Encoded Archival Description (EAD) has provided repositories with an XML basis for representing archival finding aids since its creation and adoption during the 1990s. As one measure of its widespread adoption by collecting repositories, consider that the nationwide corpus of ArchiveGrid currently comprises over 120,000 EAD documents. The public database Texas Archival Resources Online similarly facilitates discovery of historical collections by displaying EAD-structured finding aids contributed by Texas repositories. The current version of EAD consists of 146 elements, each an EAD tag paired with its formal element name, which provide the basis for these structured descriptions of collections. In this research we focus on one component of collection description, the <physdesc> tag, and report on the range of format types that appear in Texas collections. Beyond the colloquial names of box, photograph, and painting exist many outlier terms which present unique challenges and opportunities. The variation within the <physdesc> tag may be painless to the human reader during display, yet it becomes problematic during natural language processing, which requires normalization of collection sizes in order to perform statistical analysis.
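To illustrate the kind of normalization that statistical analysis calls for, the following is a minimal sketch in Python rather than the method used in this study. The alias table, the regular expression, and the function name normalize_extent are all hypothetical; an actual vocabulary would have to be derived from the finding aid corpus itself.

```python
import re

# Hypothetical alias table mapping surface forms to one canonical unit.
# A real table would be built from the terms observed in the corpus.
UNIT_ALIASES = {
    "box": "box", "boxes": "box",
    "photograph": "photograph", "photographs": "photograph",
    "painting": "painting", "paintings": "painting",
    "linear foot": "linear_foot", "linear feet": "linear_foot",
    "linear ft": "linear_foot",
}

# A number followed by one or two lowercase words, e.g. "1.5 linear feet".
EXTENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([a-z]+(?:\s+[a-z]+)?\.?)")


def normalize_extent(text):
    """Return (quantity, unit) pairs found in a free-text extent statement."""
    results = []
    for match in EXTENT_RE.finditer(text.lower()):
        quantity = float(match.group(1))
        words = match.group(2).rstrip(".").split()
        two_words = " ".join(words[:2])
        # Prefer a two-word unit ("linear feet"), then fall back to one word.
        unit = UNIT_ALIASES.get(two_words) or UNIT_ALIASES.get(words[0])
        # Unknown (outlier) terms are kept verbatim so they can be reviewed.
        results.append((quantity, unit or two_words))
    return results


if __name__ == "__main__":
    print(normalize_extent("3 boxes, 12 photographs (1.5 linear feet)"))
    print(normalize_extent("2 Boxes and 1 oversize folder"))
```

Terms missing from the alias table, such as "oversize folder" in the second call, pass through unchanged, which keeps outliers visible for later review instead of silently discarding them. The sketch does not handle numbers written with thousands separators or spelled out as words.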

Through the one element of Physical Description, repositories are charged with summarizing both the materiality and the quantity of the items contained in an entire collection. These descriptions speak to the physical form and enumerative values of all information artifacts in the collection through the use of four optional subelements: dimensions, extent, genre/physical characteristic, and physical facet. We demonstrate the effect of the relative leeway in data structure requirements that is built into the formal definition of this element. Because "the information may be presented as plain text," the end result of this definition is a dataset with wide internal variation that could impede the goal of assessing such collections through actionable data and its reuse in a broader context, such as by repository or region. With the third EAD Revision currently in gamma release (and set to replace EAD 2002 this spring), we consider our study in parallel with the following two developments: the continuation of the <physdesc> element as an unstructured option, and the creation of a new <physdescstructured> element which will formally adopt, rename, and add a fifth subelement to the four optional subelements listed above. In addition to version compatibility, EAD developers and adopters should facilitate integration of the legacy data corpus alongside data requirements to meet the dual goals of analysis and discovery. The Visualizing Archival Data / Augmented Processing Table project, of which this study is a part, aims to understand how such finding aid data can reveal the quality and granularity of collection arrangements, and through this, the layers of historical evidence that are made available to researchers seeking resources on specific topics, people, and organizations.
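The difference between a subelement-tagged and a plain-text Physical Description can be sketched as follows. This is an illustrative reading of a single <physdesc> fragment, assuming the EAD 2002 tag names <dimensions>, <extent>, <genreform>, and <physfacet> and an un-namespaced document; the function name describe_physdesc and the sample fragments are hypothetical and not drawn from the Texas corpus.

```python
import xml.etree.ElementTree as ET

# EAD 2002 tag names for the four optional subelements (assumed here;
# namespaced, schema-based EAD files would need qualified names).
SUBELEMENTS = ("dimensions", "extent", "genreform", "physfacet")


def describe_physdesc(xml_fragment):
    """Split one <physdesc> into tagged subelement values and its full text."""
    node = ET.fromstring(xml_fragment)
    found = {
        tag: [child.text.strip() for child in node.findall(tag) if child.text]
        for tag in SUBELEMENTS
    }
    return {
        # Only subelements that actually occur are reported.
        "structured": {tag: vals for tag, vals in found.items() if vals},
        # All text, tagged or not, with whitespace collapsed.
        "full_text": " ".join("".join(node.itertext()).split()),
    }


if __name__ == "__main__":
    tagged = ("<physdesc><extent>3 boxes</extent> "
              "<genreform>photographs</genreform></physdesc>")
    plain = "<physdesc>3 boxes of photographs</physdesc>"
    print(describe_physdesc(tagged))
    print(describe_physdesc(plain))
```

For the plain-text fragment the "structured" entry is empty, so any quantity or format information has to be recovered from "full_text" alone, which is the situation the variation described above creates for analysis.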


descriptive metadata; structured data; digital libraries

