Query and answering in Mantium.
Document Intelligence is a domain of information processing that focuses on structured documents, such as PDFs. Structured documents encode information not only in their text, but also in the layout of their elements, including textual, stylistic, geometric, and graphical elements.
Document Intelligence aims to solve a radically different problem than conventional NLP. In conventional NLP, all relevant information is assumed to be encoded in the semantic and syntactic patterns of the textual data being processed. In contrast, structured documents convey information via element positioning in 2-dimensional space, and through features like text size, font, horizontal and vertical lines, and graphics. As such, modeling a structured document requires a model that can condition on data encoded in these disparate forms of information.
The mortgage and real estate industries are prime examples of sectors that must process many structured and unstructured documents, including contracts, agreements, disclosures, and many other types. Processing these documents involves manual, repetitive work because few solutions handle it from start to end. Mantium accelerates this work with a simple, end-to-end automation process that extracts data from structured and unstructured documents and then enriches that data so that you get the answers you are looking for. We support state-of-the-art systems for optical character recognition (OCR) and document intelligence models.
In this post, we show how to utilize Mantium to build an automation system that processes a structured document and applies machine learning to enhance the data enrichment process. We will use Amazon Textract and a Mantium document intelligence model to perform abstractive question answering. All users need to do is upload their document, enter a query, and let Mantium do the rest.
Amazon Textract is a machine learning-based OCR service that extracts various types of text, including handwritten and printed text. Textract offers different endpoints to support text extraction from various types of documents, such as forms and receipts. Each endpoint offers different features for parsing and interpreting the data.
We will use the query-based extraction feature, which involves passing one or more questions along with a document and receiving an answer to each question. Textract performs a form of extractive question answering on the document. Textract recommends writing natural language questions that use words from the document. This works well; however, there are times when it returns no answer, or an incorrect one, to a query.
To address this, Mantium post-processes the Textract output and uses an enrichment model to automatically fill in answers to the queries that Textract missed.
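The query-based extraction described above can be sketched with boto3, the AWS SDK for Python. This is a minimal illustration, not Mantium's implementation: the helper names (`build_query_request`, `parse_query_answers`, `run_textract_queries`) are hypothetical, and the sketch assumes AWS credentials and a region are already configured. It sends questions through the `QUERIES` feature of `analyze_document` and maps each question to its answer, using `None` where Textract returned nothing.

```python
from typing import Any, Dict, List, Optional


def build_query_request(document_bytes: bytes, questions: List[str]) -> Dict[str, Any]:
    """Build keyword arguments for Textract's analyze_document with the QUERIES feature."""
    return {
        "Document": {"Bytes": document_bytes},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {"Queries": [{"Text": q} for q in questions]},
    }


def parse_query_answers(response: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Map each query text to its answer text, or None if no answer was returned."""
    blocks = response.get("Blocks", [])
    blocks_by_id = {block["Id"]: block for block in blocks}
    answers: Dict[str, Optional[str]] = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        question = block["Query"]["Text"]
        answers[question] = None
        # A QUERY block points at its QUERY_RESULT blocks via an ANSWER relationship.
        for relationship in block.get("Relationships", []):
            if relationship["Type"] != "ANSWER":
                continue
            for result_id in relationship["Ids"]:
                result = blocks_by_id.get(result_id)
                if result and result.get("BlockType") == "QUERY_RESULT":
                    if answers[question] is None:
                        answers[question] = result.get("Text")
    return answers


def run_textract_queries(path: str, questions: List[str]) -> Dict[str, Optional[str]]:
    """Send a local document image to Textract and return query answers."""
    import boto3  # imported here so the helpers above work without the AWS SDK

    client = boto3.client("textract")
    with open(path, "rb") as handle:
        request = build_query_request(handle.read(), questions)
    return parse_query_answers(client.analyze_document(**request))
```

A call such as `run_textract_queries("w2.png", ["What is the state wages?"])` (the file name is a placeholder) would return a dictionary keyed by question. Queries that map to `None` are the ones an enrichment step would need to fill in.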
In the next section, we will describe why and how Mantium performs automatic data enrichment.
The data enrichment step uses the extracted data as input to our document intelligence model, which performs abstractive question answering: given a query and a string-formatted document, the model generates an answer to the query. We handle the preprocessing, such as resizing, normalizing, and ensuring each feature is in the correct format. The reason for automatic data enrichment is to compensate for a limitation of Textract: it sometimes misrepresents the information queried from the document and returns wrong or blank answers.
To understand this more clearly, let’s go through the steps of extracting information from a W-2 form and performing enrichment to compensate for the limitation that Textract presents.
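As a rough sketch of this flow, the code below formats Textract output as a string and asks a second model only for the queries that came back blank. Mantium's model interface is not described in this post, so `answer_fn` is a hypothetical stand-in for any callable that takes a query and a document string and returns an answer. The helper names are illustrative as well.

```python
from typing import Callable, Dict, List, Optional


def document_to_string(blocks: List[dict]) -> str:
    """Join the text of Textract LINE blocks into one newline-separated string."""
    return "\n".join(
        block["Text"]
        for block in blocks
        if block.get("BlockType") == "LINE" and "Text" in block
    )


def enrich_answers(
    extracted: Dict[str, Optional[str]],
    document_text: str,
    answer_fn: Callable[[str, str], str],
) -> Dict[str, str]:
    """Keep non-empty extracted answers; ask answer_fn for the blank ones.

    answer_fn is a hypothetical placeholder for an abstractive
    question-answering model: answer_fn(query, document_text) -> answer.
    """
    final: Dict[str, str] = {}
    for query, answer in extracted.items():
        if answer is None or not answer.strip():
            final[query] = answer_fn(query, document_text)
        else:
            final[query] = answer
    return final
```

This sketch only handles blank answers. Catching answers that are wrong but non-empty would need an additional criterion, such as a confidence threshold, which this post does not specify.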
Below is the image of the original W-2 form before it’s passed through the extraction step.
We will enter the queries below to get answers from the document. Textract then extracts the text information from the form, as described above in the “Data Extraction with Textract” section.
Initially, Mantium uses Textract to return answers to the set of questions, and the answers presented by Textract come from the highlighted regions of the document below.
The fifth query (Number 5), “What is the document name?”, was misinterpreted by Textract as asking for the Locality name, and it returned “MU” as the answer.
Mantium uses its in-house document intelligence model in the enrichment step to provide the correct answer to the query, replacing Textract’s answer. As a user, you will see the answers to your queries in the table below. Mantium has abstracted away the complexity, making it easy for you to get accurate answers.
|Queries|Mantium’s Final Answers|
|What is the employee’s first name?|Jane A|
|What is the employer identification number?|11-2233445|
|What is the state income tax?|1,535|
|What is the state wages?|50,000|
|What is the document name?|W-2 Wage and Tax Statement|
On the next page, provide a name for your notebook instance and select a suitable instance type (an AWS Free Tier option is okay). For the IAM role, you can create a new role or use an existing one.
Complete the creation steps by accepting the remaining default settings.
This article explained how Mantium enriches your Textract extraction with its in-house document intelligence model. It also discussed how and why we added an automatic extraction process to provide a robust and accurate AI automation process for documents.
At Mantium, we are building an AI automation platform for documents with powerful capabilities to support different business use cases in different industries.