Is this particular extraction for this particular document correct or not? How a "professor" helps us predict our extraction quality.
At Gini, we are constantly working on improving the quality of our document extractions. As our extractors get better and better, we apply different metrics to estimate their overall quality, such as accuracy, precision, recall, and F1-score.
However, one very important question always remains: Is this particular extraction for this particular document correct or not? Not on average, but exactly on this particular document. This information is very important for our partners: If they can reliably tell that particular extractions are correct, they can skip the tedious process of checking them manually. In that case, we are approaching a fully automated process of information extraction from documents. This is one of the major reasons why we at Gini are so convinced that providing realistic and adequate confidences for our extractions is so important.
But what actually makes a prediction useful? Most of our extractors can provide self-estimated confidences for their extractions. Let’s have a look at an extractor based on a neural network. Neural-network classifiers return probabilities along with their outputs. In theory, we could use these probabilities directly as confidence predictions. However, we cannot be sure that these probabilities meet our requirements, viz. being realistic and adequate.
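To make this concrete: a classifier typically produces its probabilities via a softmax over raw scores. Here is a minimal sketch in plain NumPy with made-up logits; note that nothing in it guarantees that the resulting probability is realistic as a confidence:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn raw network outputs (logits) into a probability distribution."""
    exps = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exps / exps.sum()

# Made-up logits for three candidate extractions of one field
probs = softmax(np.array([4.1, 1.3, 0.2]))
prediction = int(probs.argmax())  # the chosen extraction
confidence = float(probs.max())   # the self-estimated "confidence", about 0.93 here
```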
Let’s employ an analogy to make the problem domain easier to understand and imagine a group of different neural network models as a group of students. The first student, our top student, is good and confident in their answers. The second one is also good but always uncertain about the results: such a neural network returns mostly correct answers, but with low probabilities. The third student is poor and knows it: such a neural network often returns incorrect answers with correspondingly low probabilities. The fourth student is poor but super confident in their answers (we have all met such students!).
As a result, only the first and the third student provide confidences that are adequate and could be used further. Like students, neural network models can be very diverse in their behaviour.

Interestingly, and also very much like students, extractor models are not always consistent in terms of performance vs. confidence.
What do we need in this case? Let’s use our beloved analogy again: We need a teacher who is able to evaluate the students’ work! Look at it like this: All machine learning algorithms use training data as a teacher. In our analogy, training on labelled data-sets corresponds to our “students” attending lectures or lessons given by various teachers, each with their individual data-set. So, in essence, we need a teacher who evaluates every single student’s work after training. Fortunately, we have recently implemented such a teacher and named it “Prof. Confident”. What are his or her outstanding skills?
Our customers often provide feedback on our extractions (are they correct or not?). When our Prof. Confident gets a document with extractions (“a student’s work”), he/she looks for very similar documents with already given feedback, restricted to documents that were processed with the same extractor (“the same student”). So, our teacher does two important things:
- he/she looks for already evaluated, very similar tasks done by a specific student
- he/she expects the same outcome on a new, similar task, since our student did not do any re-training (“new lessons”) in between.
Now let’s apply this concept to our actual problem: If our extractor was correct with the IBAN extraction on ten very similar documents of the same format, then we expect the same quality on a new document of that format (the confidence of the IBAN extraction equals 1.0). If the extractor was wrong on five out of these ten documents, then Prof. Confident estimates a confidence of 0.5 for the IBAN extraction on a new similar document. Of course, after re-training our “student”, we have to collect feedback from scratch. Our research results confirmed that such confidence predictions are almost always correct. This was expected, since a professional teacher almost never makes serious mistakes when grading students’ work. However, our Prof. Confident depends a lot on the feedback provided by our customers. Without enough similar documents with already given feedback, Prof. Confident is not able to return any valuable prediction.
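In code, the core of this estimation is very simple. Here is a minimal sketch, assuming we have already retrieved the similar documents with feedback; the `Feedback` type, its field names, and the `estimate_confidence` function are hypothetical, and the five-document minimum follows the rule described in the next paragraph:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Feedback:
    extractor_version: str  # which "student" produced the extraction
    correct: bool           # customer feedback: was the extraction correct?

MIN_FEEDBACK = 5  # minimum number of similar documents with feedback

def estimate_confidence(similar_docs: list[Feedback],
                        extractor_version: str) -> Optional[float]:
    """Fraction of correct extractions among very similar documents
    processed by the same extractor version (no re-training in between)."""
    relevant = [f for f in similar_docs if f.extractor_version == extractor_version]
    if len(relevant) < MIN_FEEDBACK:
        return None  # not enough feedback: no valuable prediction possible
    return sum(f.correct for f in relevant) / len(relevant)
```

For the IBAN example above, ten similar documents with five incorrect extractions yield a confidence of 0.5.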
Our students are regularly re-trained in order to improve their results. After a re-training, Prof. Confident cannot estimate a student’s work based on their previous results any more. Therefore, new feedback has to be collected after every re-training. Based on our research results, new feedback on at least five similar documents is enough.
Now let’s dive further into the details and answer questions like: How does our Prof. Confident look for very similar documents? To answer this, we decided to apply a specifically designed vectorization to every document. It combines both the lexical and the format features of a document. For this vectorization, we collected a small vocabulary of the most frequent words appearing across many documents. For every document, we then calculate the average position (x, y) of each word from the vocabulary. Finally, we concatenate all these positions into one vector per document. This vectorization characterizes the format of a document since it does not depend on very specific information such as concrete names, addresses, or amounts.
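A minimal sketch of this vectorization could look as follows; the vocabulary shown is purely illustrative, `word_positions` is assumed to map each vocabulary word to the (x, y) coordinates of its occurrences on the page, and mapping absent words to (0, 0) is our own simplification:

```python
import numpy as np

VOCABULARY = ["invoice", "total", "date", "iban"]  # illustrative only

def vectorize(word_positions: dict[str, list[tuple[float, float]]]) -> np.ndarray:
    """Build a 2*N-dimensional format vector: the average (x, y) position
    of every vocabulary word, concatenated in a fixed order."""
    components = []
    for word in VOCABULARY:
        positions = word_positions.get(word, [])
        if positions:
            xs, ys = zip(*positions)
            components += [sum(xs) / len(xs), sum(ys) / len(ys)]
        else:
            components += [0.0, 0.0]  # simplification: absent words map to the origin
    return np.array(components)
```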
If we apply this method to different invoices of the same format originating from the same company, we obtain very similar or even identical vector representations of these invoices. Based on this vectorization, we can calculate similarities between documents as distances between the corresponding vectors and thus find very similar documents. We apply a predefined threshold to the vector similarities: exceeding this threshold means two documents share the same format. Currently, we use the cosine similarity with a threshold of 0.9.
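Given two such format vectors, the similarity check itself is short. A sketch using the cosine similarity and the 0.9 threshold mentioned above (the function name is ours):

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.9

def same_format(v1: np.ndarray, v2: np.ndarray) -> bool:
    """Two documents count as having the same format if the cosine
    similarity of their format vectors exceeds the threshold."""
    cosine = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return cosine > SIMILARITY_THRESHOLD
```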

An illustration of document vectorization and cosine similarity between vectors. We show document vectorization into 2-dimensional vectors based on a single word from the vocabulary. In reality, 2*N-dimensional vectors are used, where N is the vocabulary size.
Let’s go deeper into the technical problem domain. As you can imagine, comparing a vector with all previously seen (and stored) vectors would be very time-consuming. Currently, we keep hundreds of thousands of vectors for all incoming documents from the last 28 days, and we also have to delete the vectors of outdated documents. Executing hundreds of thousands of vector similarity comparisons for every incoming document is obviously a bad idea. To resolve this problem, we chose to apply so-called Locality-Sensitive Hashing based on Random Binary Projections. This approach allows us to compare a single vector with only a small subset of the already stored vectors and hence reduces computational costs a lot.
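A minimal sketch of this idea: every vector is reduced to a short binary hash (one bit per random hyperplane, depending on which side of the plane the vector falls), and an incoming vector is compared only against the stored vectors sharing its hash bucket. The class name and the choice of 16 hyperplanes are illustrative, not our production setup:

```python
import numpy as np
from collections import defaultdict

class RandomProjectionLSH:
    """Locality-Sensitive Hashing via random binary projections: vectors
    with high cosine similarity tend to land in the same bucket."""

    def __init__(self, dim: int, n_planes: int = 16, seed: int = 42):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_planes, dim))  # random hyperplanes
        self.buckets: dict[tuple, list] = defaultdict(list)

    def _hash(self, vector: np.ndarray) -> tuple:
        # One bit per hyperplane: which side of the plane is the vector on?
        return tuple((self.planes @ vector > 0).tolist())

    def add(self, doc_id: str, vector: np.ndarray) -> None:
        self.buckets[self._hash(vector)].append((doc_id, vector))

    def candidates(self, vector: np.ndarray) -> list:
        """Only this small subset needs an exact cosine-similarity check."""
        return self.buckets[self._hash(vector)]
```

In practice, several such hash tables are usually combined to reduce the chance of missing a true neighbour, but the principle stays the same.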
As you can see, Gini’s “Prof. Confident” is able to estimate the confidences of our extractions with almost 100% accuracy, and the professor can work with any kind of extractor, regardless of the algorithms and settings behind it. All Prof. Confident needs is feedback provided by our customers. And the better the coverage and quality of this customer-provided feedback, the better a teacher our professor will become.
. . .
If you enjoy mastering machine learning challenges like this one, you should check out our open positions. We are always looking for excellent developers to join us!
At Gini, we want our posts, articles, guides, white papers and press releases to reach everyone. Therefore, we emphasize that female, male, and other gender identities are all explicitly addressed in them. All references to persons refer to all genders, even when the generic masculine is used in content.