Technological adventures in the archives

As a doctoral student in history, I have often been frustrated by archival collections. They usually require travel that can be difficult with limited funding and a young child. Fortunately, since my research involves the City University of New York and I live in Brooklyn, many of my archival sources are available closer to home. Yet, even with several library archives less than an hour away from each other by subway, accessing these archives remains a slow and uneven process of navigating (sometimes out-of-date) finding aids and (seemingly unrelenting) bureaucracy.

Over the past few years, my habit has been to scan or photograph as many as I can of the documents I find related to my research questions. I keep a research journal where I comment on the documents I read and pull out the relevant quotes. But I usually gather more than I need. This is partly a result of my fear, perhaps misguided, of overlooking on a first read a piece of evidence that could prove helpful as I read more documents in the archive or re-read them in the future. Moreover, in gathering photographs of archival documents into my own personal research library, I also hope that I can later search them for people, places, and topics that come up as I continue my research or as I write. Only through trial and error have I found tools and a process that work for me in getting quality digital copies of archival documents. Most recently, I have been giving more thought to the quality of text recognition, since that most hampers my ability to search documents.

If I am lucky, an archival collection has already been digitized. In those rare cases, I don’t have to spend hours of my day performing the tedious task of photographing or scanning documents. However, even digitized archival collections are difficult to process. Some digital archives provide an online search interface, but search results can only be as reliable as the quality of text recognition performed on the document scans. I don’t deal much with handwritten documents, but a search interface in that case would be seriously limited in its power as a research tool. And in the case of older typed documents that have seen better days, it’s going to be hard to accurately extract searchable content.

In this post, I will illustrate the sort of errors in optical character recognition (OCR) that I have found in my archival research, along with one workaround I have used with some success: FineReader by Abbyy. The source material I will focus on is a digitized historical newspaper published by students at Baruch College. In the spirit of open digital scholarship, I am publishing my source code at whatsupdoc-analysis. I chose to do this data project in R since it has a number of excellent packages for scraping and transforming data. I described some of the benefits of using R for data analysis the “right way” in Do the R Thing, a workshop I gave in May 2017 at the CUNY Graduate Center.

Scraping the data

Recently, as I was writing an essay on the relationship of CUNY students to the civil rights movement, I attempted to search for mentions of important people and events across the digitized collections I had gathered. One such individual was James Meredith, whose barring from admission by the University of Mississippi was national news in 1962. One of my sources, the Baruch Ticker, a student newspaper published since the 1930s, conveniently has an online search interface. A full-text search for “James Meredith” turned up an editorial on page 3 of the October 10, 1962 issue. But the results proved unreliable.

I wanted to automate the process of searching for key terms to some extent, so I had to figure out how to scrape the information from the web search interface. I used rvest and identified the relevant XPaths for the search form and the search results.

# packages used throughout this post
library(rvest)
library(stringr)
library(dplyr)
library(tibble)
library(readr)

# set user agent string to make sure web server replies with full page
uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
baseurl <- "" 
session <- 
  baseurl %>%
  html_session(httr::user_agent(uastring))
form <- session %>%
  # index page has malformed forms so I cannot directly use html_form on the session
  html_node(xpath = '//*/form') %>%
  html_form()
form <- set_values(form, 
                   'data[Search][keywords]' = "James Meredith",
                   'data[Search][field]' = "fulltext")
results_session <- submit_form(session, form)
results_urls <- 
  results_session %>% 
  # the search result page unfortunately has a complicated structure
  html_nodes(xpath='//*/div[@id="search_results_div"]/div[@class="results"]/div[contains(@class,"result_")]/div[@class="result_title"]/a') %>%
  html_attr("href") %>%
  # remove the query string that is appended to the pdf url
  str_replace("#.*$", "")
results_contexts <-
  results_session %>% 
  html_nodes(xpath='//*/div[@id="search_results_div"]/div[@class="results"]/div/div[@class="context"]') %>%
  html_text()
results <- tibble(url = results_urls,
                  context = results_contexts)

I confirmed that scraping returned the same 13 results I had seen in a web browser. Interestingly, only one of the results actually contained the exact phrase “James Meredith”, since the search function looked for any pages containing both “James” and “Meredith”, not necessarily contiguously. The only other result of the 13 that was relevant to my query was erroneously recognized as “J a m e s Meredith” (with inserted spaces), though “James” was recognized elsewhere on the page.

results %>%
  mutate(
    # extract the phrase and the ten characters before
    extract = str_extract(context, ".{10}Meredith")
  ) %>%
  select(url, extract)

## # A tibble: 13 x 2
##                                    url               extract
##                                  <chr>                 <chr>
##  1 /files/articles/ticker_19630923.pdf    James II. Meredith
##  2 /files/articles/ticker_19830426.pdf     Possibly Meredith
##  3 /files/articles/ticker_19871117.pdf    eason for Meredith
##  4 /files/articles/ticker_19761214.pdf    ICATION a Meredith
##  5 /files/articles/ticker_19980401.pdf     1907 Don Meredith
##  6 /files/articles/ticker_19991201.pdf     I kissed Meredith
##  7 /files/articles/ticker_19770103.pdf    or- James Meredith
##  8 /files/articles/ticker_19870217.pdf    f Burgess Meredith
##  9 /files/articles/ticker_19640421.pdf     i c s by Meredith
## 10 /files/articles/ticker_19691209.pdf    inds of'^ Meredith
## 11 /files/articles/ticker_19621010.pdf    J a m e s Meredith
## 12 /files/articles/ticker_19640225.pdf    APriL The Meredith
## 13 /files/articles/ticker_19761112.pdf "c M a n,\" Meredith"

Note that ticker_19621010.pdf was the document that matched when searching “James Meredith” but only because “James” had occurred elsewhere in the document and not immediately preceding “Meredith”.

We will now more closely examine the text quality of that pdf to understand why this problem occurred.

Problems in extracting text content

I used the pdftools package for R to extract the text from the PDF. Behind the scenes, this package uses the poppler library, commonly used on Linux systems. I suspect other libraries might be better or worse at extracting text, but I have not done such a survey myself yet.

We will focus in on page 3 of ticker_19621010.pdf.

As is apparent from a closer inspection of the above image, any OCR technology would have difficulty with this scan. It has lines across it in random places, and some of the text is so thin as to be unreadable.

We can first confirm that “Meredith” occurs in the extracted text. But the quality of the PDF is such that we find only two of the three occurrences of Meredith: once in the headline and again in the second paragraph. Missing from the results is the “Meredith” immediately following “James” in the first paragraph.

pagetext <- pdftools::pdf_text(pdffile)[3] 
pagetext %>%
  str_extract_all(".{10}Meredith.{10}")

## [[1]]
## [1] "          Meredith          " "      Mr! Meredith.         "

One problem is that there are spurious characters (“!”) in the output. But a more basic problem becomes apparent when we look at a bigger slice of the text content: we quickly find how often spaces (“ ”) are inserted into the extracted text. Any attempt at tokenizing this text would prove difficult because of the poor quality of the source data.

pagetext %>%
  # grab some random window of 2500 characters
  str_sub(10000, 12500) %>%
  str_wrap(80) %>%
  cat()

## His talk concentrated on t w o Blind Students S t e v e R a p p a p o r t *63
## Mike Kreitzer '63 Managing Joe Traum '64 Editor Business Manager s i g n e d b
## y D e a n S a x e , w a s circu-» Leonard T a s h m a n '63 lated throughout t
## h e school: Perhaps this column will give you a n insight a s to tricks that,
## various people Aise to cover UP w h a t they t. H o w e v e r , l e t u s t a k
## e ' t h e c a s e o f o n e m a n . T h e t i m e h e l i v e d in

One solution would be to attempt to repair this text using a set of simple rewrite rules. For instance, if there is a series of single characters separated by spaces, we could check whether removing the spaces would yield a valid word. But the original motivation for my search was to find references to an individual, and with simple rewrite rules we would have no easy way to decide whether the joined string is valid when it is in fact a proper name. Instead, I decided to reconvert the PDF to a searchable PDF using an alternative, hopefully better, OCR technology.
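To make the limitation concrete, here is a minimal sketch of such a rewrite rule (not code from my project): it collapses any run of single word-characters separated by single spaces, without attempting to validate the result.

```r
# naive rewrite rule: find runs of single word-characters separated by
# single spaces and delete the internal spaces; it cannot tell a spaced-out
# word from a sequence of real one-letter words, and it has no way to
# validate a proper name like "Meredith" against a dictionary
collapse_spaced <- function(text) {
  m <- gregexpr("\\b\\w( \\w)+\\b", text)
  regmatches(text, m) <- lapply(regmatches(text, m),
                                function(run) gsub(" ", "", run))
  text
}

collapse_spaced("J a m e s Meredith")  # "James Meredith"
```

The rule happily repairs the spaced-out “J a m e s”, but it would just as happily mangle a legitimate sequence of one-letter words, and it offers no dictionary against which to check a surname.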

Better text recognition

The document we are examining is black and white. What might have been faint marks on a color or grayscale scan become hard lines that cut across the page. The choice of a library to give users access to the black-and-white version of the scans makes some sense, given that grayscale and color PDFs would necessarily be larger. Size certainly matters for the institution hosting the files, whether in the storage consumed by a collection or the network bandwidth used when sending those files. But for academic researchers, who are interested in a subset of all the files an institution hosts, such a decision forces them to use smaller PDFs at the expense of quality. For my own purposes, I had already downloaded all the PDF files from the Ticker website by scraping its list of issues. Though all those files together take up 5 gigabytes, when hard drives are so cheap and increasingly large, researchers would likely opt for bigger files to get better text content. Given that archives go through a lengthy and expensive process to digitize material, I would hope they maintain full-quality color scans that could be used for future re-processing as OCR technology improves.

Given these reservations, I was still pleased with the results from Abbyy. I have been using their desktop application, FineReader, for a while now, and would certainly recommend it to all academic researchers working with digitized primary documents. But here I wanted to explore features that Abbyy only provides through its SDK, which is available either through the FineReader Engine or the Cloud SDK. I chose to give the Cloud SDK a try since there is a trial package of 50 free pages, which was plenty for my purposes in this exercise.

I made use of the abbyyR package to easily call the web API. The developers of the package provide a helpful example of using the library. When I submitted the image of page 3 to be processed, the Cloud SDK created a task that I could monitor until it completed.

processImage(file_path = pngfile)

## Status of the task:  1 
## Task ID:  1

##    .id                                   id     registrationTime
## 1 task e56fe725-bc21-45b3-99c9-7a1d2d77021e 2017-09-01T13:57:25Z
##       statusChangeTime status filesCount credits estimatedProcessingTime
## 1 2017-09-01T13:57:26Z Queued          1       0                       5

Once the processing completed (which took about 90 seconds), the text output could simply be downloaded. We notice immediately that we now get three matches for “Meredith” rather than just two, and that this output does contain the exact phrase “James Meredith”. Great!

resultUrl <- finishedlist$resultUrl %>% as.character()
abbyyFile <- tempfile()
curl::curl_download(resultUrl, abbyyFile)
abbyyText <- read_file(abbyyFile)
abbyyText %>%
  str_extract_all(".{10}Meredith.{10}")

## [[1]]
## [1] "          Meredith          " " to James Meredith's presenc"
## [3] " with Mr. Meredith.! that th"

Using a bigger slice of the text, we also see a vast improvement over the quality of the original extraction.

abbyyText %>%
  # grab some random window of 2500 characters
  str_sub(10000, 12500) %>%
  str_wrap(80) %>%
  cat()

## ity in which h* lived is of little Jeffers. Jeffers, a poet on whom j Charity
## Drive .Vries Editor Assoc. Bus. Man. “A short Memorial Service for desire to
## say. - --------- tiortancf. We would not want this individual to be ari anomaly.
## Levy is a noted authority, was Mike Del Gindie# '64 Jody Bernstein 1 • * • #
## Iking any identification with his counterpart* in society, so we shall “honest.”
## the professor said. _ Alpha Phi Omega, the national the late Anna Eleanor
## Roosevelt Student ConacU representative to one of his clone conatltn. II him
## Mr. Zacaz. # The other, Mark Van Doren. Era!hit* Editor wttt be heftt-fa the
## Auditorium of- I We. must next act the scene. Mr. Znraz return* btfm^Troni work,

There are still many errors in the text output that a human reader could likely fix. Still, I was very happy with the quality of text extracted by the Abbyy Cloud SDK. But before I could do more with the historical newspapers I am researching, I needed to confront one problem that persisted: there was no easy way to extract, from the text output of the Abbyy Cloud SDK, just the contents of the article relevant to my research. I still have no good solution for this, but I will conclude this post by illustrating the problem.

I have so far ignored that the text on the page is laid out in blocks of articles. Lines of two or more articles will be vertically aligned, making it hard for OCR technology to extract sentences and paragraphs reliably. We can see this by going back to our original PDF and extracting a window of text after the heading of the Meredith editorial we examined above. First, let’s crop out the article to take a closer look. The dimensions for such a crop were set by visually approximating where the editorial appears on the page. Of course, this is not a process that would be easily reproducible by a computer program.

library(imager)
load.image(pngfile) %>% 
  imsub(x < (2466 / 3), y > (1473 / 2)) %>% 
  plot()

The text extracted from the original clearly did not distinguish between the text of the editorial and the text of the letters to its right.

pagetext %>% str_extract("Meredith.{100}")

## [1] "Meredith                                                                       n o t t h e a d m i n i s t r"

Did the output of the Abbyy Cloud SDK fare any better with the headline? Not so much.

abbyyText %>% str_extract("Meredith.{100}")

## [1] "Meredith                                                               should have the right to decide, and "

Interestingly, the text output from the desktop FineReader proved far better, with the article seemingly segmented properly.


Reaction to James Meredith’s presence on the Univer­sity of Mississippi campus is still strong. However, we believe that it has reached the peak of cowardice when a dean of the school believes it is ill-advised for white students to eat with Mr? Meredith. w ----------
Last week a group of students at the university had the courage and intelligence to eat -lunch with Mr. Meredith.

What next?

In sum, using the Abbyy Cloud SDK, I was able to dramatically improve the character recognition of documents from my archival research. Why does this matter? With better text extraction, documents can be searched more reliably. At the moment, the search interfaces for digitized archival collections are sorely limited. I don’t mean to knock the efforts archivists and librarians have put into them.

But we should not stop with just search. In the case of an archival collection, machine learning techniques could dramatically improve access for researchers. Imagine if similar documents (or subparts of documents, such as articles from a newspaper) were grouped together so that a researcher who finds one could immediately locate others. The relationship of scholarship and technology is certainly not without pitfalls. But I would argue that opening scholarship to automation and machine learning can increase the researcher’s productivity in a way that frees them to do more rather than restricts them.
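As a toy illustration of that idea (using made-up snippets rather than real Ticker articles), even simple bag-of-words term counts and cosine similarity are enough to rank documents by likeness:

```r
# toy sketch of grouping by similarity: count terms in three made-up
# snippets and compare documents with cosine similarity
docs <- c(a = "meredith enrolls at the university of mississippi",
          b = "students eat lunch with meredith at the university",
          c = "basketball team wins season opener")
tokens <- strsplit(tolower(docs), "\\W+")
vocab <- unique(unlist(tokens))
# term-count matrix: one row per vocabulary word, one column per document
tf <- sapply(tokens, function(t) tabulate(match(t, vocab), nbins = length(vocab)))
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))
cosine(tf[, "a"], tf[, "b"])  # the two Meredith snippets overlap
cosine(tf[, "a"], tf[, "c"])  # the unrelated snippet shares no terms
```

A real system would need weighting (tf-idf), stemming, and much more care, but the principle is the same: once text extraction is reliable, grouping related articles becomes a tractable computation.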

As is evident from the quality of some of the digitized archival collections I have been researching, better use of technology in historical scholarship requires that we first tackle a prior problem: reliably extracting text from archival material. This is not a simple problem, since it involves improving character-level as well as word-level recognition. In this particular case of historical newspapers, it also requires segmenting pages into articles rather than just arbitrary blocks of text without concern for the page layout.

Using the Abbyy SDK, we can also request that the image processing generate an XML document rather than just the extracted text. The XML document provides even more detailed information about the resulting document. It represents the processed images as a nested structure of pages, blocks, regions, rectangles, collections of paragraphs, lines, and ultimately characters. Most importantly, the character-level data includes variants of the character recognized and the confidence the OCR engine had in each variant. Another useful piece of information that can be pulled out of the XML is the set of text blocks on a page. With such structured information on the contents of a page, we could improve the tokenization and segmentation of documents for research.
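As a sketch of what working with that character-level data might look like, the fragment below parses a hand-made snippet shaped like Abbyy’s XML output (real output nests these elements inside page, block, text, par, and line elements and declares a namespace; the charParams element and charConfidence attribute follow my reading of the FineReader XML schema):

```r
library(xml2)

# hand-made fragment in the shape of Abbyy's character-level XML output;
# charConfidence is the engine's 0-100 confidence score for the character
sample_xml <- paste0(
  '<line>',
  '<charParams charConfidence="93">M</charParams>',
  '<charParams charConfidence="41">r</charParams>',
  '<charParams charConfidence="12">?</charParams>',
  '</line>'
)
chars <- xml_find_all(read_xml(sample_xml), "//charParams")
# characters the engine was least sure about are candidates for review
data.frame(
  char = xml_text(chars),
  confidence = as.integer(xml_attr(chars, "charConfidence"))
)
```

Sorting such a table by confidence would surface exactly the spots, like the spurious “?” after “Mr”, where a human reader (or a dictionary-backed post-processor) should intervene.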

If you have been tackling these and related problems, please do get in touch with me!