Intelligent Document Processing
The challenge with information extraction
Manually extracting information is a cumbersome process that consumes valuable human resources, which translates into time and money. Reading thousands of complex documents is not only labour-intensive but also difficult to scale without incurring extra overheads or increasing the risk of errors.
Yet the automated alternative, Intelligent Document Processing (IDP), has its limitations too. Many enterprises still rely on manual extraction because IDP systems lack the ‘intelligence’ to extract information accurately, particularly from unstructured documents, which is the real challenge.
Most Machine Learning-based IDP systems depend on thousands of samples to perform well on each document subtype; even so, these systems are often limited to more structured documents. To successfully extract information from highly unstructured documents, organisations need to invest significant IT spend and configuration effort.
What is intelligent document processing?
Simply put, IDP converts semi-structured and unstructured data into structured data that can be fed into downstream systems for further use.
An IDP solution should be able to handle a wide range of document formats, including PDFs, emails, any kind of text and even scanned images. This makes it valuable for businesses of all sizes that want to automate the processing of invoices, claims, loan applications, legal documents like powers of attorney, cheques, identification documents and more.
Features of intelligent document processing
An enterprise-grade IDP solution goes through the following phases (see Figure 1).
Pre-processing: The process starts with converting the document from an image into a text format and improving data quality ahead of further processing. Techniques include optical character recognition (OCR) and image enhancement such as de-skewing and de-noising, which help make the text output as accurate as possible.
Classification: Documents can contain multiple pages in different formats, i.e. different types or subtypes. Intelligent document classification uses AI-based technologies to automatically classify and separate multipage documents, pulling out the relevant pages of information before extraction.
For example, a mortgage application typically includes multiple pages with documents such as a loan application, pay stubs, bank statements, ID documents, and others that need to be accurately identified and classified.
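As a rough illustration of the classification step, the sketch below sorts the pages of a multipage document by type. It is a simple keyword-counting stand-in, not the AI-based classification described above; the `KEYWORDS` table and the function names `classify_page` and `split_document` are hypothetical and chosen only for this example.

```python
# Hypothetical keyword lists per page type. A production system would
# learn these signals from labelled samples rather than hard-code them.
KEYWORDS = {
    "loan_application": ["loan amount", "applicant", "property address"],
    "pay_stub": ["gross pay", "net pay", "pay period"],
    "bank_statement": ["opening balance", "closing balance", "account number"],
}


def classify_page(text: str) -> str:
    """Return the page type whose keywords occur most often, or 'unknown'."""
    lowered = text.lower()
    scores = {
        label: sum(lowered.count(keyword) for keyword in keywords)
        for label, keywords in KEYWORDS.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label if scores[best_label] > 0 else "unknown"


def split_document(pages: list[str]) -> dict[str, list[int]]:
    """Group page indices by predicted page type."""
    groups: dict[str, list[int]] = {}
    for index, page in enumerate(pages):
        groups.setdefault(classify_page(page), []).append(index)
    return groups
```

Calling `split_document` on the text of each page returns a mapping from page type to page indices, so only the relevant pages need to be passed on to extraction.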
Extraction: This step extracts specific data from the selected documents. Extraction is conducted using baseline extraction models, which are pre-trained to read specific document types. For example, an invoice extraction model trained on thousands of invoice samples can extract data points such as the invoice number, date and line items from invoices received in a real business setting.
Typically, IDP solutions include a library of pre-trained extraction models, which are pre-populated with the right fields for extraction. The relevant information is extracted from the document(s) before it is validated for accuracy in the next step.
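To make the idea of a model "pre-populated with the right fields" concrete, here is a minimal sketch that pulls three fields from invoice text. It uses regular expressions as a stand-in for a trained extraction model; the field names, the patterns and the assumed date format (YYYY-MM-DD) are illustrative assumptions, not part of any particular IDP product.

```python
import re

# Hypothetical field definitions for an invoice. A trained extraction
# model would replace these hand-written patterns.
INVOICE_FIELDS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:Number|No\.?|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "invoice_date": re.compile(
        r"\bDate\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE
    ),
    "total": re.compile(
        r"\bTotal\s*[:\-]?\s*\$?([\d,]+\.\d{2})", re.IGNORECASE
    ),
}


def extract_invoice_fields(text: str) -> dict:
    """Return each field's first match in the text, or None if not found."""
    result = {}
    for name, pattern in INVOICE_FIELDS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result
```

Fields that cannot be found come back as `None`, which gives the validation step a clear signal that a value is missing.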
Validation: Once data is extracted, it goes through a series of validation rules and AI-driven techniques to improve the extraction results. These could be predefined rules applied automatically, optionally supplemented by human supervision. A human in the loop can further validate the data, which allows the process to continuously learn and improve over time.
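One way to picture how automated rules and human review fit together is the sketch below. It assumes each extracted field carries a confidence score between 0 and 1, and it sends a field to a reviewer when it fails a rule or falls below a threshold. The `ExtractedField` type, the two rules and the 0.85 threshold are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed: a score in [0, 1] reported by the extractor


def is_valid_date(value: str) -> bool:
    """Check that the value is a real calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def is_valid_amount(value: str) -> bool:
    """Check that the value parses as a non-negative number."""
    try:
        return float(value.replace(",", "")) >= 0
    except ValueError:
        return False


# Hypothetical predefined rules, keyed by field name.
RULES = {"invoice_date": is_valid_date, "total": is_valid_amount}
REVIEW_THRESHOLD = 0.85  # assumed cut-off; tune per field in practice


def needs_human_review(field: ExtractedField) -> bool:
    """Flag a field that breaks its rule or has low extraction confidence."""
    rule = RULES.get(field.name)
    fails_rule = rule is not None and not rule(field.value)
    return fails_rule or field.confidence < REVIEW_THRESHOLD


def triage(fields: list[ExtractedField]):
    """Split fields into (auto-accepted, routed to a human reviewer)."""
    accepted = [f for f in fields if not needs_human_review(f)]
    review = [f for f in fields if needs_human_review(f)]
    return accepted, review
```

Corrections made by reviewers on the second list could then be fed back as additional training samples, which is one way the process can improve over time.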
After verifying that the extraction is accurate, structured data is ready to be used for further downstream applications.
Why intelligent document processing matters for organisations
What IDP aims to do is automate a manual process that consumes valuable human resources, which translates into time and money. Manually reading thousands of complex documents is not only labour-intensive but also difficult to scale without incurring extra overheads or increasing the risk of errors.
With IDP, enterprises seek to reduce costs and gain efficiency and productivity by freeing employees to work on higher-value strategic tasks. Employees no longer need to ferret out specific data points in a sea of text. Instead, their task is reduced to performance validation: verifying, when required, that the IDP solution is extracting data points accurately.
Being able to automate data extraction also means higher-quality content that’s regularly updated and accurately labelled as structured data fields. When this information is fed into downstream systems like a content management system or customer relationship management (CRM) system, employees can access a bird’s-eye view of a specific domain or customer with a few clicks. When this is done right, IDP has spillover benefits for various workplace processes and for customer service.
Intelligent document processing use cases and their respective document types
- Banking and finance: invoices, financial statements, letters of credit, bills of lading, ISDA Master Agreements
- Insurance: claims forms, identification documents, transaction reports, cheques
- Human resources, training and development: resumes and CVs, work certificates, payslips, claims forms
- Legal: powers of attorney, constitutional documents, registration letters
- Mortgage: loan forms, bank statements, property documents
- Healthcare: medical certificates, medical invoices, medical reports
- Travel: hotel booking, flight booking
What is Omnitive Extract?
Omnitive Extract is TAIGER’s package-based intelligent document processing solution. It applies advanced natural and semantic language processing algorithms, combined with Machine Learning, to automatically identify, extract, clean, validate and store key information from unstructured and semi-structured documents.
At a glance: features of Omnitive Extract
Omnitive Extract is able to:
- Ingest documents from various sources
- Automatically classify document types
- Pre-process multiple document formats using OCR and cleansers
- Extract data points with Machine Learning and Natural Language Processing algorithms
- Support data normalisation using custom business logic
- Calculate extraction confidence and identify data point accuracy threshold level
- Allow convenient human-in-the-loop extraction validation through a user-friendly post-processing GUI
- Consolidate similar data points across various documents
- Perform advanced document search
- Produce detailed system usage and extraction performance reports
- Support API integration for data export