Texas Digital Library Conference System, 2015 Texas Conference on Digital Libraries

Search as Research: Big Data Infrastructure Visualization Application
Timothy Duguid

Last modified: 2015-03-19

Abstract


Despite the age of the format, sites such as Yahoo, Google, and Bing continue to display their search results as paginated lists of links. Doubtless these companies have conducted usability studies that show the utility of paginated lists, and they have focused their attention on optimizing their search algorithms so that the most relevant results appear at the top or within the first couple of pages. After all, few people will view more than 3-5 pages of Google search returns, let alone the millions of other results from any particular enquiry. This has given rise to Search Engine Optimization companies that work to ensure that their clients are listed at the top of those search results. As a result, the best-funded sites, not necessarily the most relevant ones, appear at the top of many internet searches. Paginated lists are even less helpful for conducting original research or for dealing with questions that may have multiple answers. For original research, it is often the statistically unremarkable result that is most noteworthy. And it is impossible to know whether all the correct answers have been found unless a user is willing to sift through those millions of returns.

Though its dataset is not as substantial as that of Google, Yahoo, or Bing, the Advanced Research Consortium (ARC) has compiled a catalog of humanities-related digital artifacts that presently consists of 1.6 items dating from the medieval period to the twentieth century. Some of these include full-text transcriptions and optical character recognition from books and pamphlets. Thanks to the efforts of the Early Modern OCR Project (eMOP) at Texas A&M University, that number will expand to include full-text transcriptions of the EEBO and ECCO corpora. This dataset has been compiled to facilitate research, and particularly to encourage scholars to develop and evaluate new kinds of research questions. Given that paginated lists are not particularly helpful for this kind of research, ARC has developed a visual interface for its faceted catalog. This poster examines ARC's visual search interface, called the Big Data Infrastructure Visualization Application (BigDIVA), as one method of conducting research on humanities datasets. It shows how BigDIVA optimizes the search process by presenting users with a faceted visualization of all of their results. The poster argues that this is preferable to proprietary searches that rely on site rankings and search algorithms because BigDIVA's mechanisms are eminently transparent and reproducible. Furthermore, BigDIVA allows users to view the big picture and individual results simultaneously, something that traditional search engines cannot do because they are focused on finding the one answer or webpage. This poster shows how researchers can use BigDIVA to formulate theories on large-scale trends while finding specific case studies that support these theories. Finally, the poster argues that, by displaying all the results for a particular query, BigDIVA facilitates the discovery of unexpected results that a detailed, complex search query would have filtered out.


Keywords


data visualization, big data, search, visual research