Under construction
This post shares my experiences with OCR (optical character recognition) and how to improve its success rate, i.e. decrease the workload of manually correcting errors made by the OCR process. In my experience, the following advice very significantly decreased the number of errors made by the OCR.
Program that I used for OCR
First of all, I recommend the latest version of ABBYY FineReader (version 16), simply because it is the first FineReader version that is natively 64-bit. That means it can access more than 4 GB of RAM; the previous 32-bit versions often crashed, likely due to RAM limitations. You can get a cheap licence for ABBYY FineReader for about 70€ a year (or less if you are a team that can order in bulk).
Before running the character recognition
I did a test run using publicly available scans that can be downloaded online, e.g. via the library pages of the Universität Köln or the Universität Münster. For much better results (fewer errors in character recognition), it was crucial to prepare the pages before running the OCR process (which ABBYY calls preprocessing). In particular, there are issues in correctly recognizing ‘Frakturschrift’ (the blackletter typeface we stereotypically associate with Germany before WWII), which was widely used by city directories in Germany before the 1940s.
Problem 1: A common issue of any OCR with Frakturschrift is correctly distinguishing the lower-case letters s and f, as well as, in some cases, n and u. The reason for this is the lack of ‘thick’ bridges in these letters: the connecting strokes are thin and easily lost. For example, an easily recognizable n and u in Frakturschrift may look like:
[image: Fraktur ‘n’] vs [image: Fraktur ‘u’]
However, the issue is the ‘thin line’ in the letter, which may disappear if the scan was made too bright, even if the scan was made at a high DPI (i.e. 300 dpi+). ABBYY FineReader’s OCR (as well as Tesseract) then occasionally sees something like this instead, where the ‘thin lines’ disappear while the ‘thicker lines’ remain:
[image: ‘n’ with the thin stroke missing] vs [image: ‘u’ with the thin stroke missing]
The n and u are then more or less indistinguishable from one another, or may be recognized as ‘ll’ (i.e. two l’s) instead.
The same issue also appears with the letters s and f in Frakturschrift, where the OCR cannot reliably detect the small stroke that distinguishes the s (left) from the f (right):
[image: Fraktur ‘s’] vs [image: Fraktur ‘f’]
However, with the letter ‘k’, I have not experienced any misrecognition issues whatsoever.
Remedy to Problem 1:
As said, the cause of such misrecognition lies in the thin lines that go undetected. One easy way to tremendously improve correct recognition is to decrease the brightness of the image.
Typically, the thin lines are not scanned as solid black, but rather as a brighter grey. Decreasing the brightness converts that brighter grey into a darker grey, which is then recognized as black. A [image: faint ‘n’] may then appear as a better recognizable [image: solid ‘n’], thus better distinguishable from a ‘u’. I then ‘whiten the background’ in ABBYY, although in my experience this neither improves nor worsens the recognition rate.
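If you want to apply this darkening step outside of ABBYY, here is a minimal sketch using the Pillow library. The brightness factor of 0.6, the whitening threshold of 230, and the file names are my own assumptions and should be tuned per scan batch:

```python
from PIL import Image, ImageEnhance

# Load the scanned page as greyscale so we work on a single channel.
page = Image.open("scan_page.png").convert("L")

# Factors below 1.0 darken the image; this pulls the faint grey
# 'thin lines' of Fraktur letters towards black so the OCR keeps them.
darker = ImageEnhance.Brightness(page).enhance(0.6)  # assumed starting point

# Optional: 'whiten the background' by pushing near-white pixels to pure white.
cleaned = darker.point(lambda v: 255 if v > 230 else v)

cleaned.save("scan_page_dark.png")
```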
Some caveats to the remedy:
However, you cannot ‘crank’ the brightness down too much, as it may otherwise falsely create a second ‘thin line’ under an ‘n’ or over a ‘u’, particularly if the letter is in bold, i.e. with too low a brightness we may create this monstrosity of a letter:
[image: bold letter with a spurious extra stroke]
Typically this is only an issue with a bolded n or u, as turning down the brightness would make the already ‘thick’ lines too ‘thick’.
For city directories that use bold letters for last names, e.g. the Düsseldorf city directory, this leads to frequent misrecognition of the last name. For city directories like Münster’s, which do not bold any of their entries (except maybe ‘amenities’ or businesses), this is less of an issue.
a. One way to at least decrease the likelihood of misrecognizing bolded last names is to add these last names to the ABBYY dictionary, as ABBYY’s OCR prioritizes dictionary words when it has to choose between two readings, e.g. between Hermann and Hermauu it most of the time correctly chooses Hermann, since Hermann appears as a name in the dictionary. Of course, if someone’s last name actually is Hermauu (which does not exist as far as I know), then we are out of luck in this particular weird case. (A toy sketch of this dictionary idea follows after this list.)
b. A second, more elegant way is to run the output through a post-OCR correction algorithm. For example, TICCL by Martin Reynaert (https://github.com/martinreynaert/TICCL) basically takes the fully OCR’ed book and corrects misrecognized words (via the Levenshtein distance, https://en.wikipedia.org/wiki/Levenshtein_distance) against the corpus itself. The corpus may consist of books from different years. Although a last name may be misrecognized by a letter or so in one year, it is very unlikely that the same error occurs consistently in the entries for that particular last name in the following book years as well. At least for last names that remain in the books in some form over many years, this effectively decreases the number of errors (see the second sketch below).
c. Furthermore, jobs and addresses can also be used to reduce errors. For example, if we correctly recognized the address of a person of interest across all years, i.e. Königsstraße 12, but in the year 1891 the last name is misrecognized as ‘Hermaun’, while in 1892 it is correctly read as ‘Hermann’, in 1893 as Hermann again, and so on, then applying a Levenshtein-distance correction to this single address’s entries across the years would correctly give us ‘Hermann’ in 1891 (see the third sketch below).
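To make (a) concrete, here is a toy Python sketch of dictionary-prioritized disambiguation. The name list and the candidate readings are made up, and ABBYY’s actual internals are of course more sophisticated than this:

```python
# Toy illustration of dictionary-prioritized disambiguation (not ABBYY's
# actual algorithm): when the OCR wavers between readings, prefer the
# one that appears in a list of known last names.

known_names = {"Hermann", "Müller", "Schmidt"}  # in practice: a large name list

def pick_reading(candidates):
    """Return the first candidate found in the name list, else the first one."""
    for word in candidates:
        if word in known_names:
            return word
    return candidates[0]

print(pick_reading(["Hermauu", "Hermann"]))  # -> "Hermann"
```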
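For (b), TICCL itself is a full pipeline; the following sketch only illustrates the core idea of correcting rare OCR variants against more frequent forms in the corpus. The frequency threshold, the distance cutoff, and the toy counts are assumptions:

```python
from collections import Counter

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[-1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Word frequencies over the whole multi-year corpus (toy numbers).
corpus = Counter({"Hermann": 412, "Hermaun": 3, "Schmidt": 390})

def correct(token, min_freq=10, max_dist=2):
    """Replace a rare token with the most frequent corpus word within max_dist."""
    if corpus[token] >= min_freq:
        return token  # frequent enough: trust the OCR
    candidates = [(freq, word) for word, freq in corpus.items()
                  if freq >= min_freq and levenshtein(token, word) <= max_dist]
    return max(candidates)[1] if candidates else token

print(correct("Hermaun"))  # -> "Hermann"
```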
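And for (c), a minimal sketch of the address-anchored correction, reusing the levenshtein function from the previous sketch. The records are hypothetical and assume the entries have already been parsed into (year, address, last name) tuples:

```python
from collections import Counter, defaultdict

# Hypothetical parsed entries: (year, address, last name) as read by the OCR.
records = [
    (1890, "Königsstraße 12", "Hermann"),
    (1891, "Königsstraße 12", "Hermaun"),  # OCR error
    (1892, "Königsstraße 12", "Hermann"),
    (1893, "Königsstraße 12", "Hermann"),
]

# Collect all last-name readings seen at each address across the years.
readings = defaultdict(list)
for year, address, name in records:
    readings[address].append(name)

corrected = []
for year, address, name in records:
    majority, _ = Counter(readings[address]).most_common(1)[0]
    # Only override readings that are a small edit away from the majority,
    # so genuinely different residents at the same address are left alone.
    if name != majority and levenshtein(name, majority) <= 2:
        name = majority
    corrected.append((year, address, name))

print(corrected)  # the 1891 entry now reads "Hermann"
```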
Some notes on research application
However, for research interested in urban inequality and its drivers, the last name may not be too much of an issue: we are typically not interested in tracking individuals, but rather in the composition of jobs people hold in a neighborhood, which gives a clue about the socioeconomic background of that neighborhood over time.
Correct recognition of last name + job is of more interest when tracking a particular person over time, i.e. whether they moved within or even between cities, e.g. to study the drivers of people’s migration decisions.
Problem 2:
Under construction
Full suggested approach