Use R for prototyping a productivity app
Building OCR tools with the R language
R can be used for numerous applications. Its extensive collection of libraries even includes non-statistical tools, such as optical character recognition (OCR), which is handy for productivity apps, for instance automating classical accounting tasks such as registering bills and their data in a central log.
The use case explored here is extracting information from imperfectly scanned invoices. This includes navigating the scanned 2D image to find the relevant information, as well as pre-processing the image, which involves de-skewing, cropping and resizing.
R or Python
When addressing a computer vision problem, the first question that arises is which tool or language to use. R and Python naturally come to mind, both being widely used in the data analytics domain. But which one to choose?
And this question does not only arise when confronted with a concrete problem; more generally, there is an ongoing debate on whether R or Python is the right tool for data-related tasks.
It is an especially critical one for early-career data scientists who have to decide how to invest their time when learning a new programming language. While Python has become the more prevalent candidate and many companies have adapted their tech stack accordingly, there are still good reasons to use R, first and foremost the strong community that contributes to a growing number of libraries.
The use case presented here relies exactly on those contributions and is a showcase of R's versatility and its strong capabilities when it comes to prototyping new applications.
Automate accounting tasks with OCR
The task at hand is quickly explained. Consider an accounting department that receives invoices from across the company and needs to extract the relevant information and write it into a central database. Instead of doing this manually, the process should now be automated, and a first prototype needs to be built to assess feasibility.
The present example focuses on one type of invoice, but it can easily be extended to recognise the invoice issuer and deploy a tool that handles multiple issuers.
The automation process starts with scanning the invoices; thus, the input data for our prototype are scanned images of three bills.
The technical foundations
The basis for solving this computer vision task is contributed libraries, first of all tesseract, an R wrapper for the famous open-source OCR engine originally developed at Hewlett-Packard Laboratories, complemented by the magick package, an R wrapper for ImageMagick.
The bills that will be scanned and processed are from a German taxi company. An example is shown below.
All private information is greyed out. The relevant piece of information that we seek to extract is the "Bruttobetrag" (gross amount) at the end of the bill. If you want to follow along with the code, you can download the images from a dedicated GitHub repository.
The goal is to write a prototype and configure it with three different bills, so that it can then be used for a production use case, potentially in another language, or simply by using the command-line tools ImageMagick and tesseract standalone.
The advantage of using R is that it is straightforward to code this prototype and configure it until it produces stable results.
Configure the R session
The configuration is straightforward: loading the libraries and setting up the workspace. Notice that the right language data for the recognition task needs to be downloaded, in this case German.
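A minimal session setup might look like this; the engine object name `deu` is an assumption, not taken from the original code:

```r
# Load the OCR engine and the image-processing toolkit
library(tesseract)
library(magick)

# Download the German training data once, if it is not yet available
if (!"deu" %in% tesseract_info()$available) {
  tesseract_download("deu")
}

# Create an OCR engine configured for German
deu <- tesseract(language = "deu")
```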
Useful functions: Calculate geometric overlap
The approach to identifying the invoice amount in this MVP (minimum viable product) is to identify the correct bounding box that contains the amount. So there is a bit of 2D geometry navigation involved.
To be more precise, we will identify the position of the bounding box that contains the "Bruttobetrag" (invoice amount) in one bill and use this position as a reference for all other bills. To calculate the similarity of bounding boxes, the Intersection-over-Union (IOU) metric is employed.
This is implemented in R with the following function:
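The original implementation is not reproduced here, but a sketch could look as follows, assuming tesseract's comma-separated "x1,y1,x2,y2" bounding-box strings as input:

```r
# Intersection-over-Union for two bounding boxes given in tesseract's
# "x1,y1,x2,y2" string format (as returned by ocr_data())
iou <- function(bbox_a, bbox_b) {
  a <- as.numeric(strsplit(bbox_a, ",")[[1]])
  b <- as.numeric(strsplit(bbox_b, ",")[[1]])

  # Corners of the intersection rectangle
  x_left   <- max(a[1], b[1])
  y_top    <- max(a[2], b[2])
  x_right  <- min(a[3], b[3])
  y_bottom <- min(a[4], b[4])

  # The boxes do not overlap at all
  if (x_right < x_left || y_bottom < y_top) return(0)

  intersection <- (x_right - x_left) * (y_bottom - y_top)
  area_a <- (a[3] - a[1]) * (a[4] - a[2])
  area_b <- (b[3] - b[1]) * (b[4] - b[2])

  # Union = sum of both areas minus the doubly counted intersection
  intersection / (area_a + area_b - intersection)
}
```

Identical boxes yield an IOU of 1, disjoint boxes an IOU of 0, and anything in between measures the degree of overlap.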
The function is written so that it deals with the bounding box format returned by tesseract.
Since our approach relies on the exact position of the invoiced amount on the bill, all randomness introduced by scanning, such as skew and size variations, needs to be eliminated. That is what ImageMagick is used for here. The images undergo three transformations: deskew, trim, and resize:
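With the magick package, the three transformations can be chained in one pipeline; the deskew threshold and the target width below are assumed values that would be tuned during configuration:

```r
library(magick)

# Standardise a scanned bill: straighten, crop the border, unify the size
preprocess_image <- function(path) {
  image_read(path) |>
    image_deskew(threshold = 40) |>  # remove the scanning skew
    image_trim() |>                  # crop the white margin
    image_resize("1000x")            # scale to a fixed width of 1000 px
}
```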
tesseract works best with images of at least 300 dpi. This should be taken into account when scanning the images.
Extracting information from the image
Once all scanned images are standardised, the information extraction can begin. This is straightforward with tesseract. The following line provides all relevant information:
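Assuming the pre-processed scan was written to a file such as `receipt_1.png` (the file name is illustrative), the call could look like:

```r
library(tesseract)

# One row per recognised word: the word itself, a confidence value
# and its bounding box as an "x1,y1,x2,y2" string
receipt_data <- ocr_data("receipt_1.png", engine = tesseract("deu"))
```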
receipt_data is a data frame with one row per recognised word, including all bounding boxes.
The data frame also contains a confidence value. The first word, for instance, was not properly recognised: it should read "FAHRER", but "EEREEER" was recognised, and the confidence is accordingly low.
Apart from the recognised word and the confidence, the bounding boxes are also returned from the function call. Notice that the coordinate system follows the standard monitor-graphics convention, where the origin is at the top-left rather than the bottom-left. To make debugging easier, including plotting with standard libraries, this is easily reverted by subtracting the y-coordinates from the total image height, which can be obtained with ImageMagick:
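One way to sketch this flip, assuming the bounding boxes come as "x1,y1,x2,y2" strings and the file name from above:

```r
library(magick)

# Total image height in pixels, obtained via ImageMagick
img_height <- image_info(image_read("receipt_1.png"))$height

# Flip the y-axis of one bounding-box string: the former top edge
# becomes the new bottom edge and vice versa
flip_bbox <- function(bbox, height) {
  b <- as.numeric(strsplit(bbox, ",")[[1]])
  paste(b[1], height - b[4], b[3], height - b[2], sep = ",")
}

receipt_data$bbox <- vapply(receipt_data$bbox, flip_bbox,
                            character(1), height = img_height)
```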
Investigating the outcome / debugging
Once the coordinate system is fixed, the outcome can be investigated. On the left, the bill is shown with a red bounding box around the invoiced amount that we are looking for.
This plot together with the bounding box is created with the following function call:
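The original function call is not preserved here; a sketch using base graphics could look as follows (the helper name `plot_bbox` is an assumption):

```r
# Plot a processed bill and draw a red rectangle around one bounding box
plot_bbox <- function(image_path, bbox) {
  img <- magick::image_read(image_path)
  b   <- as.numeric(strsplit(bbox, ",")[[1]])

  plot(img)  # magick images plot directly onto a raster canvas
  # rect() expects xleft, ybottom, xright, ytop, which matches the
  # flipped (bottom-left origin) coordinates
  rect(b[1], b[2], b[3], b[4], border = "red", lwd = 2)
}
```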
Row 54 of receipt_data contains the information on the bounding box of the "Bruttobetrag":
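Storing that row as the reference could be as simple as this (the row index is taken from the text, the rest is a sketch):

```r
# Inspect the recognised word and its box in row 54
receipt_data[54, c("word", "bbox")]

# Keep this box as the reference position of the "Bruttobetrag"
brutto.box <- receipt_data$bbox[54]
```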
When debugging the solution and storing information about the relevant bounding box, you will typically come across several artefacts that need to be taken care of. In the case above, the invoice structure varies geometrically when a second position appears next to the transportation fee, e.g. a tip ("Trinkgeld" in German). This can easily be addressed by searching the recognised words for this term and adjusting the position of the bounding box accordingly:
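A sketch of such an adjustment; the shift direction and the line height of 35 px are pure assumptions that would be determined while debugging with real bills:

```r
# If a tip line was recognised, the gross amount sits one line lower
# on the bill, so the reference box is shifted accordingly
if (any(grepl("Trinkgeld", receipt_data$word))) {
  b <- as.numeric(strsplit(brutto.box, ",")[[1]])
  line_height <- 35  # assumed vertical line spacing in pixels
  brutto.box <- paste(b[1], b[2] - line_height,
                      b[3], b[4] - line_height, sep = ",")
}
```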
Once all artefacts are addressed in the code, the extraction of the invoice amount is straightforward. The brutto.box in the snippet above serves as the reference: the overlap of every bounding box with this reference is calculated (using the IOU function defined above), and the box with the maximum overlap is assumed to contain the invoice amount:
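For a single bill this boils down to a few lines, assuming the IOU helper is available as iou() and the reference box as brutto.box; looping over all processed bills then fills the billing_amount vector:

```r
# IOU of every recognised word's box with the reference box
overlaps <- vapply(receipt_data$bbox, iou, numeric(1),
                   bbox_b = brutto.box)

# The word whose box overlaps most with the reference is taken
# to be the invoice amount
billing_amount <- receipt_data$word[which.max(overlaps)]
```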
The vector billing_amount then contains the invoiced amounts for all bills that are handed over to the script.
R is used here to configure a productivity tool that extracts information from imperfectly scanned images. Writing the code proves to be very straightforward, thanks to various contributed libraries. So even when confronted with a classical Python task, namely image recognition, it is worth checking out existing R libraries for a rapid implementation of the task at hand.