How we tackled the multi-class classification problem
In a previous post we shared some insights on how we approach information extraction from documents with deep learning techniques. Our information extraction system powers, for example, several automated accounting solutions that free people from tedious work such as manually checking documents and typing in data. The approaches we used before are typical of the NLP domain and treat documents as sequences of text. While this works well, it does not reflect reality appropriately. First off, documents are two-dimensional data. The text does not simply flow from left to right and from top to bottom; there are also separate regions of text.

An example invoice with recipient and sender information on top, order details in the center and further information at the bottom
Often, text only becomes meaningful with respect to other text above or below it, and there is more information than just text. For example, lines drawn on the document may be meaningful separators that build structures like tables. This suggests that graphs may also be an appropriate representation of the words on a document. We have already looked into graph convolutional networks, which are a very promising approach, but will leave the details for a future post.
. . .
We were motivated to rethink how we approach the task and make it reflect the structure of documents better. We did not change anything in the basic problem statement. Given a word on a document, we want to assign it a label for the information type it belongs to, which is a typical multi-class classification problem.
A word, or word box, in this case comprises the text as well as its surrounding rectangle, whose coordinates define its position on the document. For the sake of simplicity, we leave aside the later step of selecting and merging the right word boxes to deliver the final extraction values.
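To make the notion of a word box concrete, here is a minimal sketch of one as a Python data class. The class name `WordBox`, its field names, and the coordinate convention (origin at the top left, values in pixels) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WordBox:
    """One word on a page plus the rectangle that surrounds it (hypothetical layout)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge
    label: Optional[str] = None  # information type, to be filled in by a classifier

    @property
    def center(self):
        """Center point of the rectangle as (x, y)."""
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


box = WordBox(text="Invoice", x0=40, y0=20, x1=120, y1=36)
print(box.center)  # (80.0, 28.0)
```

The classification task then amounts to filling in `label` for every such box on a page.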
We added another task of finding larger boxes on the document, which extends our system to retrieve compound extractions that consist of several fields. Such information is often found in tables or similar structures. One instance of this kind of extraction is the list of bought items on an invoice, often referred to as line items.
Our task not only has similarities with typical NLP problems, it also overlaps with computer vision tasks like object detection. We aim to combine the two by taking state-of-the-art solutions from both. A common network architecture used in computer vision is the so-called encoder-decoder. See this paper for an explanation of that neural network architecture.
We found that a customized network architecture in the style of U-Net yields the best results. On top of this encoder-decoder backbone, we introduce different heads that further process the generated feature map.
One of the tasks is the previously described classification of word boxes. With ROI pooling we select the features for each box from the feature map. Those features are chained into a sequence (roughly corresponding to a reading order from left to right, top to bottom). With self-attention (as introduced in "Attention Is All You Need") we can model how the words in that sequence relate to each other and refine the features with more context information. To take each word's position into account explicitly at this stage, we add positional encodings (also described in "Attention Is All You Need") based on both its x and its y coordinate. Afterwards a classifier is applied, producing a probability distribution over all the classes we want to find.
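The ordering and position encoding steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a description of the actual model: the row bucketing used for the reading order, the choice to encode x in the first half of the vector and y in the second half, and all function names are hypothetical. Only the sinusoidal formula itself follows "Attention Is All You Need".

```python
import math


def reading_order(boxes, row_height=10.0):
    """Sort (text, x, y) tuples roughly top to bottom, then left to right.

    Boxes whose y values fall into the same row bucket count as one line.
    """
    return sorted(boxes, key=lambda b: (int(b[2] // row_height), b[1]))


def sinusoidal_encoding(position, dim):
    """Sinusoidal encoding of one scalar position with `dim` entries (dim must be even)."""
    encoding = []
    for i in range(0, dim, 2):
        frequency = 1.0 / (10000 ** (i / dim))
        encoding.append(math.sin(position * frequency))
        encoding.append(math.cos(position * frequency))
    return encoding


def box_positional_encoding(x, y, dim):
    """Encode x in the first half and y in the second half (dim divisible by 4)."""
    half = dim // 2
    return sinusoidal_encoding(x, half) + sinusoidal_encoding(y, half)


words = [("Total", 30.0, 52.0), ("Invoice", 10.0, 11.0), ("No.", 60.0, 12.0)]
ordered = reading_order(words)
encodings = [box_positional_encoding(x, y, dim=8) for _, x, y in ordered]
print([text for text, _, _ in ordered])  # ['Invoice', 'No.', 'Total']
```

In a model, each encoding would be added to the pooled feature vector of the corresponding box before self-attention is applied, so that two words with similar features but different positions remain distinguishable.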
To detect more complex boxes (like line items), we use an approach similar to state-of-the-art object detectors such as Faster R-CNN and YOLO. We generate predefined anchor boxes, and in our box regression head a stack of convolutional layers regresses the offsets relative to those anchors. We also predict an object probability per box. As object detection approaches naturally predict overlapping boxes, we use non-maximum suppression to filter out redundant predictions and keep the highest-scoring box among each group of overlapping ones.
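A minimal sketch of the two post-processing steps, decoding anchor offsets and non-maximum suppression, might look like this. The offset parameterization shown is the one from the Faster R-CNN paper and is assumed here for illustration; the function names, the anchor format `(cx, cy, w, h)`, and the IoU threshold of 0.5 are likewise hypothetical.

```python
import math


def decode_anchor(anchor, offsets):
    """Apply predicted offsets (tx, ty, tw, th) to an anchor (cx, cy, w, h).

    Returns the resulting box as corner coordinates (x0, y0, x1, y1).
    """
    cx, cy, w, h = anchor
    tx, ty, tw, th = offsets
    new_cx, new_cy = cx + tx * w, cy + ty * h
    new_w, new_h = w * math.exp(tw), h * math.exp(th)
    return (new_cx - new_w / 2, new_cy - new_h / 2,
            new_cx + new_w / 2, new_cy + new_h / 2)


def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, visiting them from highest to lowest score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


anchors = [(5, 5, 10, 10), (6, 6, 10, 10), (25, 25, 10, 10)]
boxes = [decode_anchor(a, (0.0, 0.0, 0.0, 0.0)) for a in anchors]
kept = non_maximum_suppression(boxes, scores=[0.9, 0.8, 0.7])
print(kept)  # [0, 2]
```

In the usage example the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box does not overlap anything and is kept.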
We also introduce an auxiliary task of semantic segmentation to separate the document into meaningful regions. For predicting the segmentation map we use a basic convolutional network.

Schematic depiction of our information extraction model
For the box classification and semantic segmentation tasks we use cross-entropy loss with class weights adjusted to the respective training data set. For the box regression task we define the loss similarly to the one used in the R-CNN papers. All three tasks are trained jointly with SGD. In fact, we found that sharing the underlying encoder-decoder structure between the tasks works very well. It also brings further benefits: the tasks support each other, the model is smaller, and inference is faster.
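As an illustration of class-weighted cross-entropy, the following sketch derives weights from label frequencies and applies them to a small batch. How the weights are actually adjusted to the training data is not specified above, so the inverse-frequency scheme, the normalization by the sum of weights, and the function names are assumptions.

```python
import math


def inverse_frequency_weights(labels, num_classes):
    """Weight each class by n / (num_classes * count); unseen classes get 0 (assumed scheme)."""
    counts = [0] * num_classes
    for label in labels:
        counts[label] += 1
    n = len(labels)
    return [n / (num_classes * c) if c > 0 else 0.0 for c in counts]


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits_batch, targets, class_weights):
    """Weighted mean of per-example cross-entropy, normalized by the sum of weights."""
    loss_sum, weight_sum = 0.0, 0.0
    for logits, target in zip(logits_batch, targets):
        weight = class_weights[target]
        loss_sum += -weight * math.log(softmax(logits)[target])
        weight_sum += weight
    return loss_sum / weight_sum


train_labels = [0, 0, 0, 1]  # class 1 is rare, so it receives a larger weight
weights = inverse_frequency_weights(train_labels, num_classes=2)
loss = weighted_cross_entropy([[2.0, 0.0], [0.0, 0.0]], [0, 1], weights)
```

With such weights, mistakes on rare classes contribute more to the loss than mistakes on frequent ones, which matters when most words on a document belong to no extraction field at all.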
With this model we have built a strong information extraction system that we are constantly improving further. Not only does it allow us to provide compound extractions like line items but it also improves our extraction quality. Stay tuned for further updates!
. . .
If you enjoy mastering machine learning challenges like this one, you should probably check out our open positions. We are always looking for excellent developers to join us!