OCR best practices (from my experience)
27/06/2023 at 10:26 pm #225
Duc (Keymaster)

This post aims to share my experiences with OCR (optical character recognition), with the goal of improving its success rate and reducing the manual work of correcting errors made by the OCR process.
Program that I used for OCR
First of all, I recommend the latest version of ABBYY FineReader (version 16), simply because it is the first FineReader version that is natively 64-bit. That means it can address more than 4 GB of RAM; the previous 32-bit version often crashed, likely because it ran out of memory. A licence for ABBYY FineReader costs about 70€ a year (or less if you are a team that can order in bulk).

Before running the character recognition
I did a test run using publicly available scans that can be downloaded online, e.g. via the library pages of the Universität Köln or the Universität Münster. For much better results (fewer character recognition errors), it was crucial to prepare the pages before running the OCR process; see the preparation sketch further below. Recognition is particularly error-prone for 'Frakturschrift' (the blackletter typeface we stereotypically associate with Germany before WWII), which was widely used by city directories in Germany before the 1940s.

1. A common issue of any OCR with Frakturschrift is correctly distinguishing the lower-case letters s and f, as well as, in some cases, n and u. The reason is that the strokes telling these letters apart are very thin; the f, for instance, differs from the long s (ſ) only by a small crossbar. For example:
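As for preparing the pages mentioned above: the post does not spell out the individual preparation steps, but a minimal, hypothetical sketch in Python with Pillow could look like the following. The threshold value and the file names are my own illustrative assumptions, not settings from the post.

# Minimal preparation sketch (assumes Pillow is installed).
# Converts a scan to grayscale and binarizes it with a fixed
# threshold before handing it to the OCR program. The threshold
# of 160 and the file names are illustrative assumptions.
from PIL import Image

def prepare_page(in_path: str, out_path: str, threshold: int = 160) -> None:
    gray = Image.open(in_path).convert("L")  # 8-bit grayscale
    # Pixels brighter than the threshold become white, the rest black.
    binary = gray.point(lambda p: 255 if p > threshold else 0, mode="1")
    binary.save(out_path)

prepare_page("scan_page_001.png", "scan_page_001_prepared.png")

A clean black-and-white image like this gives the recognition engine much less paper texture and bleed-through to misread; in practice one would tune the threshold per scan batch.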
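Because the f/ſ and n/u confusions follow predictable patterns, many of them can also be repaired after OCR by checking swap variants of an unknown word against a German word list. Below is a minimal, hypothetical Python sketch of that idea; the confusion pairs and the tiny stand-in vocabulary are illustrative assumptions, not anything from the post.

from itertools import product

# Letter pairs that Fraktur OCR frequently mixes up (lower case only).
CONFUSIONS = {"f": "fs", "s": "sf", "n": "nu", "u": "un"}

def candidates(word):
    # Yield every variant of `word` with confusable letters swapped.
    # The number of variants grows exponentially with the count of
    # confusable letters, so this is only practical for short words.
    options = [CONFUSIONS.get(ch, ch) for ch in word]
    for combo in product(*options):
        yield "".join(combo)

def correct(word, vocabulary):
    # Return `word` unchanged if it is known; otherwise return the
    # first confusion variant found in the vocabulary (case is
    # ignored for simplicity); otherwise give the word back as-is.
    lower = word.lower()
    if lower in vocabulary:
        return word
    for variant in candidates(lower):
        if variant in vocabulary:
            return variant
    return word

# Tiny stand-in vocabulary; in practice one would load a full
# German word list from a file.
vocab = {"haus", "kaufmann", "schneider"}
print(correct("Kanfmann", vocab))  # OCR read 'u' as 'n' -> "kaufmann"

This is only meant to illustrate the principle of a custom post-processing pass on the OCR output, on top of whatever corrections the OCR program itself applies.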