
Digitizing Historical City Directories (19th/20th Century)


OCR best practices (from my experience)

This topic contains 1 voice and has 0 replies.
  • Author
    Posts
    • #225
      Duc
      Keymaster

      This post shares my experience with OCR (optical character recognition), with the aim of improving its success rate and reducing the work of manually correcting recognition errors.

      Program that I used for OCR
      First of all, I recommend the latest version of ABBYY FineReader (version 16), simply because it is the first FineReader version that is native 64-bit. This means it can access more than 4 GB of RAM. The previous 32-bit version often crashed, likely due to RAM issues. You can get a cheap licence for ABBYY FineReader for about 70€ a year (or less if you are a team that can order in bulk).

      Before running the character recognition
      I did a test run using publicly available scans that can be downloaded online, e.g. via the library page of the Universität Köln or Universität Münster. To get much better results (fewer errors in character recognition), it was crucial to prepare the pages before running the OCR process. In particular, OCR has trouble correctly recognizing ‘Frakturschrift’ (the typeface that we stereotypically associate with Germany before WWII), which was widely used in German city directories before the 1940s.

      1. A common issue for any OCR of Frakturschrift is correctly distinguishing the lower-case letters s and f, and in some cases n and u. This is due to the lack of ‘thick’ bridges between the letters. For example:
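
      As a rough illustration of how s/f and n/u confusions could be caught after the OCR run, here is a minimal Python sketch. It is not part of the ABBYY FineReader workflow described in this post; the function names (`candidates`, `suggest`) and the small word list `KNOWN` are hypothetical and only serve the example. The idea is to generate every spelling that differs from an OCR token only by s/f or n/u swaps and to check which of them appear in a list of known words.

```python
from itertools import product

# Character pairs the post reports as commonly confused in Fraktur OCR.
CONFUSABLE = {"s": "sf", "f": "fs", "n": "nu", "u": "un"}


def candidates(token):
    """Return every spelling reachable by swapping s/f and n/u in token."""
    options = [CONFUSABLE.get(ch, ch) for ch in token]
    return {"".join(combo) for combo in product(*options)}


def suggest(token, known_words):
    """Return the known words that differ from token only by s/f or n/u swaps."""
    if token in known_words:
        return [token]
    return sorted(candidates(token) & known_words)


# Hypothetical mini word list (assumption for this example); a real one
# would come from a dictionary or from entries already checked by hand.
KNOWN = {"strasse", "kaufmann", "schneider", "haus"}

print(suggest("ftraffe", KNOWN))
print(suggest("kanfmauu", KNOWN))
print(suggest("haus", KNOWN))
```

      The sketch only handles lower-case tokens, and the number of candidates doubles with every s, f, n or u in the token, so it is best suited to single words rather than whole lines. Tokens with no match in the word list return an empty list and would still need manual checking.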

      • This topic was modified 1 year, 7 months ago by Duc.
