What is OCR?
OCR, or Optical Character Recognition, is the use of technology to recognise text in printed or handwritten documents and images and to distinguish the individual alphanumeric characters. That's the technical definition. Let's look at a more practical one.
What do you see in the image below?
Most likely, you would see the capitalised English character "A". Your mind has already done some preprocessing for you to identify light and dark regions, strokes and other features such as the triangle in the middle surrounded by darker regions.
However, this is what the computer sees when it looks at the same image.
A computer simply 'sees' 1s and 0s. It has no understanding of what the patterns of ones and zeros represent to humans. OCR is the technology that converts those patterns into machine-readable data (e.g. ASCII, HTML, JSON).
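To make this concrete, here is a minimal sketch in Python. The 7x5 bitmap is a hypothetical, hand-drawn capital "A" (it is not taken from the image in this article), where 1 is a dark pixel and 0 is a light one. The function names are ours, chosen for illustration.

```python
# Hypothetical 7x5 bitmap of a capital "A": 1 = dark pixel, 0 = light pixel.
LETTER_A = [
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]


def render(bitmap):
    """Draw the bitmap roughly the way a person perceives it."""
    return "\n".join("".join("#" if px else "." for px in row) for row in bitmap)


def as_bits(bitmap):
    """Flatten the bitmap into the plain run of bits a computer works with."""
    return "".join(str(px) for row in bitmap for px in row)


print(render(LETTER_A))   # a shape a person reads as "A"
print(as_bits(LETTER_A))  # the same data as a string of 35 ones and zeros
```

Both printouts contain exactly the same information. Only the first one looks like a letter to a human, and closing that gap is the job of OCR.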
OCR technology helps computers understand printed and handwritten information by converting it to machine-readable data.
How does OCR technology work?
OCR technology has come a long way since the 1990s. Let's take an example. Suppose you're an OCR computer program presented with lots of different letters written in lots of different fonts; how do you pick out all the letter As if they all look slightly different?
You could use a rule like this: If you see two angled lines that meet in a point at the top, in the center, and there's a horizontal line between them about halfway down, that's a letter A.
Apply that rule and you'll recognise most capital letter As, no matter what font they're written in. Instead of recognising the complete pattern of an A, you're detecting the individual component features (angled lines, crossed lines, or whatever) from which the character is made.
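The rule above can be sketched in a few lines of Python. This is a toy illustration only: the bitmaps, the function names and the thresholds are all assumptions made for this example, and no real OCR engine is this simple. It checks three features: an apex at the top center, strokes that spread apart towards the bottom, and a horizontal bar about halfway down.

```python
def dark_columns(row):
    """Return the x positions of the dark (1) pixels in one row."""
    return [x for x, px in enumerate(row) if px]


def looks_like_capital_a(bitmap):
    """Apply the hand-written 'letter A' rule to a bitmap of 0s and 1s."""
    height, width = len(bitmap), len(bitmap[0])
    top, bottom = dark_columns(bitmap[0]), dark_columns(bitmap[-1])
    if not top or not bottom:
        return False
    center = (width - 1) / 2
    # Feature 1: the strokes meet in a point at the top, in the center.
    has_apex = all(abs(x - center) <= 1 for x in top)
    # Feature 2: the two strokes spread apart, leaving a gap between the legs.
    legs_spread = (bottom[-1] - bottom[0]) > (top[-1] - top[0]) and len(bottom) < width
    # Feature 3: an unbroken horizontal bar in the middle third of the letter.
    has_crossbar = False
    for row in bitmap[height // 3 : 2 * height // 3 + 1]:
        cols = dark_columns(row)
        if len(cols) >= 3 and cols[-1] - cols[0] + 1 == len(cols):
            has_crossbar = True
    return has_apex and legs_spread and has_crossbar


# Two hypothetical "fonts" for A, plus an H that shares the crossbar feature.
NARROW_A = [[0, 0, 1, 0, 0], [0, 1, 0, 1, 0], [1, 0, 0, 0, 1], [1, 1, 1, 1, 1],
            [1, 0, 0, 0, 1], [1, 0, 0, 0, 1], [1, 0, 0, 0, 1]]
WIDE_A = [[0, 0, 0, 1, 0, 0, 0], [0, 0, 1, 0, 1, 0, 0], [0, 0, 1, 0, 1, 0, 0],
          [0, 1, 1, 1, 1, 1, 0], [0, 1, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 0, 1],
          [1, 0, 0, 0, 0, 0, 1]]
LETTER_H = [[1, 0, 0, 0, 1], [1, 0, 0, 0, 1], [1, 0, 0, 0, 1], [1, 1, 1, 1, 1],
            [1, 0, 0, 0, 1], [1, 0, 0, 0, 1], [1, 0, 0, 0, 1]]

for name, glyph in [("narrow A", NARROW_A), ("wide A", WIDE_A), ("H", LETTER_H)]:
    print(name, looks_like_capital_a(glyph))
```

Both As pass even though their pixels differ, while the H fails because it has no apex. Writing and tuning a rule like this by hand for every character is impractical for real documents.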
Most modern OCR programs work by feature detection. However, rather than relying on hand-written rules for each letter, they use neural networks to detect features. How neural networks work is much more complicated and beyond the scope of this article. In short, neural networks learn to detect features automatically, provided they are trained on a large number of samples of the characters they are meant to detect.
OCR software tries to recognise characters in an image or document by slicing the image into smaller pieces and then passing each piece through a neural network to check whether it contains a character and, if so, to find the closest matching character. Modern OCR programs such as Google Vision and Tesseract then combine these characters into words based on the spacing between them.
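The last step, combining characters into words based on spacing, can be sketched as follows. This is a simplified illustration and not how Google Vision or Tesseract are actually implemented: it assumes a single line of text, that the recogniser has already produced one box per character, and a hypothetical max_gap threshold in pixels. CharBox and group_into_words are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class CharBox:
    char: str   # character the recogniser matched for this piece of the image
    left: int   # x-coordinate of the box's left edge, in pixels
    right: int  # x-coordinate of the box's right edge, in pixels


def group_into_words(boxes, max_gap=4):
    """Join characters into words, starting a new word at each wide gap."""
    words, current, previous = [], "", None
    for box in sorted(boxes, key=lambda b: b.left):
        # A gap wider than max_gap is treated as a space between words.
        if previous is not None and box.left - previous.right > max_gap:
            words.append(current)
            current = ""
        current += box.char
        previous = box
    if current:
        words.append(current)
    return words


# Hypothetical recogniser output for one line of text.
boxes = [
    CharBox("O", 0, 8), CharBox("C", 10, 18), CharBox("R", 20, 28),
    CharBox("i", 40, 42), CharBox("s", 44, 50),
]
print(group_into_words(boxes))  # ['OCR', 'is']
```

A fixed pixel threshold is the simplest possible choice. A more robust approach would scale the threshold to the font size, since wide letter spacing in a large heading can exceed the word spacing in small body text.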
What are the applications of OCR?
It is quite likely that you have used OCR technology in your life if you have used an app such as CamScanner to take photos of business cards. When you upload photos & PDF files to Google Drive, Google automatically scans them using OCR technology to identify text in them. Other applications of OCR are:
- Extracting data from business documents, for example, bank checks, invoices, bank statements and receipts
- Recognising number plates in footage from traffic cameras & CCTVs
- Extracting data from passports at airports
- Extracting data from business cards
- Extracting key-value pairs and tables from insurance documents
- Making physical books readable online
- Making documents searchable
What are the shortcomings of OCR & where is OCR technology headed?
There are two main shortcomings of OCR technology: accuracy and text categorisation.
The first issue is that OCR output is not always 100% accurate. For example, in the image below, "21.08.2018" could be captured as "2I.O8.2OI8", with the digits 1 and 0 misread as the letters I and O. Hence, you need a second system that validates the output of the OCR engine.
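A validation step of this kind could look like the sketch below. It is only an illustration under stated assumptions: the look-alike table is a small, incomplete sample, the date format DD.MM.YYYY is assumed from the example above, and validate_date is a hypothetical helper, not part of any OCR engine.

```python
from datetime import datetime

# Letters that are easily confused with digits (illustrative, incomplete table).
LOOKALIKES = str.maketrans({"I": "1", "l": "1", "O": "0", "o": "0", "S": "5", "B": "8"})


def validate_date(raw, fmt="%d.%m.%Y"):
    """Repair look-alike characters, then accept the text only if it is a real date."""
    candidate = raw.translate(LOOKALIKES)
    try:
        datetime.strptime(candidate, fmt)
    except ValueError:
        return None  # still not a valid date: flag the field for human review
    return candidate


print(validate_date("2I.O8.2OI8"))  # 21.08.2018
print(validate_date("4I.O8.2OI8"))  # None, because day 41 does not exist
```

The substitution is only safe because the field is known to be a date. Applying the same table to a name or an address would corrupt valid letters, so the validator needs to know what kind of value it is checking.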
The second issue is text categorisation. OCR technology identifies characters and then combines those characters into words. However, for business use, it is important to identify what those words mean. For example, OCR technology will give the output "Invoice No: 12345", where "Invoice No" represents the "invoice_number_key" and "12345" represents the "invoice_number_value". This is where you need intelligence built on top of base OCR technology to make the identified text usable.
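As a rough illustration of the idea, the sketch below uses a regular expression to split the example text into a key and a value. The pattern and the categorise function are assumptions made for this example. A production system would need to handle many label variants and layouts, which is why simple patterns like this are rarely enough on their own.

```python
import re

# Hypothetical pattern: an invoice-number label, a colon, then the value.
INVOICE_NUMBER = re.compile(r"(Invoice\s+(?:Number|No\.?))\s*:\s*(\w+)", re.IGNORECASE)


def categorise(ocr_text):
    """Turn raw OCR text into labelled fields, or an empty dict if nothing matches."""
    match = INVOICE_NUMBER.search(ocr_text)
    if match is None:
        return {}
    return {
        "invoice_number_key": match.group(1),
        "invoice_number_value": match.group(2),
    }


print(categorise("Invoice No: 12345"))
# {'invoice_number_key': 'Invoice No', 'invoice_number_value': '12345'}
```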
At Docsumo, we solve both of these issues. Docsumo automates data extraction from documents and makes the data actionable. Using advanced computer vision and natural language processing, it validates the extracted data so that it can be directly consumed by downstream software.