80% of data is unstructured. What can AI do about it?


Wisia Neo


Unstructured data is the fuel for tomorrow’s growth, yet today it remains locked away. Here’s how AI-powered data cleaning, Machine Learning document extraction, NLP and ontologies can unlock its value.


Imagine you need to do a cost-benefit analysis for a client’s existing business. You start, and immediately realise how large a trove of data you’ll need to access and make sense of: reports, emails, client records, conversations, and the list continues.

Not long after, you start to encounter a series of difficulties that you’ll need to handle manually and repeatedly, from finding accurate and validated information to assessing current analyses. Multiplied across analysts, departments and offices, this is a serious roadblock. It’s one that replays itself daily, from SMEs to MNCs, anywhere in the world.

The key here is that most of these difficulties are tied to data—or more specifically, unstructured data. To explain this further, let’s first break down the three kinds of data organisations deal with.


Structured data vs unstructured data vs semi-structured data: differences and challenges

Unstructured data forms the bulk of the data we see today. It is rich in information and easy to create, such as with a scribble of a pen or a quick email sent to a colleague. Unstructured data has no clear style or pattern, quite unlike structured data, which sits neatly in rows and columns. Within organisations, structured data is much rarer and much harder to create. Its advantage, however, is that applications can easily integrate with it to perform multiple complex tasks.

There’s also a third category that falls in between: semi-structured data. It contains elements of both of the other data types. Like unstructured data, semi-structured data can seem easy for humans to read and decipher, but is a technological feat for traditional programmes to process.

Structured data
Structured data refers to clearly defined numbers and strings stored in a formatted relational database. It is highly organised in rows and columns of a table, making it easy to manipulate and light to store.
Example: Relational SQL databases

Semi-structured data
Semi-structured data is not rigid enough to be formatted into a relational database, but has some structural consistency and quantitative elements. It may contain meta tags that describe the data (e.g. sender email address), though the actual contents will vary.
Examples: JSON, XML, CSV etc.

Unstructured data
Unstructured data is not organised into a clear data model, and its attributes take various forms, often qualitative and text-heavy. It appears intuitive for humans to read, but is challenging for traditional programmes to understand.
Examples: Open text, freeform documents, contracts, contents in an email body, social media posts etc.
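
To make the difference concrete, here is a minimal Python sketch that holds the same fact (an invoice amount) in all three forms. The table name, field names, invoice number and email wording are all invented for illustration. Notice that the structured and semi-structured versions are read by name, while the unstructured version needs a hand-written pattern that only works for this particular phrasing.

```python
import json
import re
import sqlite3

# Structured: a typed row in a relational table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (invoice_no TEXT, amount REAL)")
conn.execute("INSERT INTO invoices VALUES (?, ?)", ("INV-001", 1200.0))
structured_amount = conn.execute(
    "SELECT amount FROM invoices WHERE invoice_no = ?", ("INV-001",)
).fetchone()[0]

# Semi-structured: fields are tagged, but no fixed schema is enforced.
record = json.loads(
    '{"invoice_no": "INV-001", "amount": 1200.0, "notes": "pay by Friday"}'
)
semi_structured_amount = record["amount"]

# Unstructured: free text. A traditional programme needs a brittle,
# hand-written pattern to find the amount, and it breaks if the wording changes.
email_body = "Hi, please settle invoice INV-001 for 1,200.00 dollars by Friday. Thanks!"
match = re.search(r"([\d,]+\.\d{2})", email_body)
unstructured_amount = float(match.group(1).replace(",", "")) if match else None

print(structured_amount, semi_structured_amount, unstructured_amount)
```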

How much unstructured data is there?

Findings from IBM, Gartner and IDC, along with various other analysts and vendors, put the share of enterprise data that is unstructured at around 80%. And yes, this 80% is a ballpark figure that has been quoted a thousand times over, for decades.

But the message is that unstructured data is important. It’s present, and it must be managed. According to IDC, data that’s unstructured and unanalysed represented a US$430 billion opportunity cost in 2020. That’s five times the size of the global AI, RPA and data analytics markets combined.

“Exploiting unstructured sources, primarily by applying text analytics, is an essential component of any comprehensive business-intelligence program, a key element in ensuring enterprise competitiveness.” - Seth Grimes, analyst for NLP, text analytics, and sentiment analysis

How can AI help unlock the value of unstructured data?

Data is the foundation on which everyday operations thrive. It’s also an essential ingredient fed into information systems and AI models. Unfortunately, all of this presumes data that is high in quality, massive in volume, and manageable in structure.

What about getting this data in the first place? Can AI do anything about it?

There’s a high wall of barriers to unlocking the value of data, ranging from the amount of education required to business process re-wiring and technology development. The good news, however, is that artificial intelligence (AI) today can alleviate part of the problem and get the basics of data right.

4 ways AI can be used on unstructured data

1. AI can improve data quality for better data management

In an ideal world, high quality data would be found in clearly and consistently labelled fields, clean and ready for use in downstream applications. More often, however, data arrives raw: not readily understandable or readable.

Raw data can be strewn all over documents in unpredictable formats, or spread across separate documents or repositories with a great deal of inconsistency. Unstructured information is also likely to come ‘noisy’. This noise includes typos, signatures, HTML tags, confidentiality notices and other non-essential entities. While humans read past them effortlessly, such noise has a significant impact on the machine-readable quality of the data.

Acquiring accurate and comprehensive data sets, identifying anomalies or fraud, and detecting patterns can all demand high manual effort. Yet ensuring data quality is an essential first step in preparing data for further processing and analysis.

This is where AI can play a critical role in ‘data cleaning’, preparing data for applications across the data life cycle. Various AI-based strategies, such as those that employ Machine Learning techniques, can improve data quality in data acquisition, maintenance, protection and usage, as seen below.
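
As a rough illustration of what removing ‘noise’ means in practice, the sketch below strips HTML tags, a confidentiality notice and a signature block from an email body. It is a simple rule-based stand-in, not a Machine Learning approach: the function name `clean_text` and both regular-expression patterns are assumptions made for this example, and real notices and sign-offs vary far more than these patterns allow. It also leaves typos untouched.

```python
import html
import re

def clean_text(raw: str) -> str:
    """Remove common email noise. Hypothetical rules, for illustration only."""
    text = re.sub(r"<[^>]+>", " ", raw)   # drop HTML tags
    text = html.unescape(text)            # decode entities such as &amp;
    # Drop a trailing confidentiality notice (assumed wording).
    text = re.sub(r"(?is)this e-?mail is confidential.*", " ", text)
    # Drop a signature block that starts with a common sign-off (assumed wording).
    text = re.sub(r"(?is)\b(best regards|kind regards|regards),.*", " ", text)
    return re.sub(r"\s+", " ", text).strip()

raw_email = (
    "<p>Payment from Smith &amp; Co of USD 500 is due on 3 May.</p><br>"
    "Best regards,<br>Jane<br>"
    "<i>This email is confidential and intended only for the recipient.</i>"
)
cleaned = clean_text(raw_email)
print(cleaned)
```

Rules like these are quick to write but brittle, which is part of the appeal of learned approaches that generalise across formats.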

Unstructured Data Management

Source: CC CDQ

Improving data quality is a priority in many organisations, whatever their level of maturity in metadata management. According to Gartner, metadata management software is growing at 22.1%, far faster than the overall infrastructure software market growth rate of 9%. Data quality will be a clear differentiating factor for any organisation’s data insights.


2. AI can convert information into structured formats through document processing

Taking data quality management further is data processing—transforming all kinds of information into structured, usable data. 

But before diving into the uses of AI for document processing, keep in mind that the data capture world uses different terminology for unstructured information. Information still ranges from structured to unstructured, as explained above, but it is referred to as documents instead of data.

Structured, semi-structured and unstructured documents have slightly different definitions and examples from their data counterparts.

Structured documents
Structured documents are fixed forms where the document layout and formatting are completely static across pages.
Examples: surveys, questionnaires, application forms

Semi-structured documents
Semi-structured documents contain fixed information like names and dates, but also unstructured information such as descriptions of items purchased.
Examples: ID, bills of lading, invoices, purchase orders, medical certificates

Unstructured documents
Unstructured documents contain information in a free format.
Examples: letters, contracts, articles, powers of attorney, resume

As seen in the examples above, semi-structured and unstructured documents are everywhere. Document-intensive operations, from customer onboarding and trading to employee recruitment and claims management, deal with a high volume of such documents.

Digging through all this information with the human eye is a painful process that incurs substantial overhead costs. As you’ve probably guessed, there are AI-powered tools today that automate such document-intensive operations.

Within enterprises, the data capture technology that transforms documents into structured information is called intelligent document processing (IDP).

Think of IDP as an essential step in making massive volumes of data usable once the data has been acquired and compiled. By extracting data points from unstructured documents, it helps enterprises shorten manual, paper-driven processes to reduce costs and increase productivity.

Enterprise Grade IDP Solution TAIGER

IDP consists of a range of technologies that convert unstructured documents into structured formats. The process can be rather complicated, so let’s break down the essential components of an enterprise-grade IDP solution:

  1. Pre-processing: The process starts with converting the document from an image into a text format. This improves data quality to prepare data for further processing, as explained earlier. Techniques include optical character recognition (OCR) technology and image enhancement features like de-skewing or de-noising, to ensure that the text output is highly accurate.

  2. Classification: Uploaded documents can contain multiple pages with different types or subtypes. For example, a mortgage application typically includes multiple pages with different documents such as a loan application, pay stubs, bank statements, ID documents, and others that need to be accurately identified and classified. Intelligent document classification uses AI-based technologies to automatically classify and separate such multipage documents to prepare for extraction.

  3. Extraction: The third step begins the information extraction process proper. Typically, IDP solutions include a library of pre-trained extraction models, each designed to extract a list of data points from a specific type of document. For example, an invoice extraction model which has been trained using thousands of invoice samples can be used to extract data points such as the invoice number, date, and line items within invoices received in a real business setting.

  4. Post-processing: Once data is extracted, it goes through a series of validation rules and AI-driven techniques to improve the extraction results. These could be predefined rules validated in an automated fashion or enhanced by using human supervision. After verifying that the extraction is accurate, structured data is ready to be used for further downstream applications.
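
The four stages above can be sketched as a small Python pipeline. This is a toy under stated assumptions, not how any particular IDP product works: OCR is assumed to have already produced text, keyword and regular-expression rules stand in for the trained classification and extraction models, and every function name, field name and pattern here is hypothetical.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Result:
    doc_type: str
    fields: dict
    errors: list = field(default_factory=list)

def preprocess(text: str) -> str:
    # Stand-in for OCR and image clean-up: only normalises spaces and tabs.
    return re.sub(r"[ \t]+", " ", text).strip()

def classify(text: str) -> str:
    # Stand-in for a trained classifier: a single keyword rule.
    return "invoice" if "invoice" in text.lower() else "unknown"

def extract(text: str, doc_type: str) -> dict:
    # Stand-in for a pre-trained extraction model: one regex per data point.
    if doc_type != "invoice":
        return {}
    patterns = {
        "invoice_number": r"Invoice\s+No\.?\s*:\s*(\S+)",
        "date": r"Date\s*:\s*(\d{4}-\d{2}-\d{2})",
        "total": r"Total\s*:\s*([\d.]+)",
    }
    found = {}
    for name, pattern in patterns.items():
        m = re.search(pattern, text, flags=re.IGNORECASE)
        if m:
            found[name] = m.group(1)
    return found

def postprocess(fields: dict) -> list:
    # Validation rule: every expected data point must be present.
    return [f"missing {k}" for k in ("invoice_number", "date", "total") if k not in fields]

def process(text: str) -> Result:
    clean = preprocess(text)
    doc_type = classify(clean)
    if doc_type == "unknown":
        return Result(doc_type, {}, ["unclassified document"])
    fields = extract(clean, doc_type)
    return Result(doc_type, fields, postprocess(fields))

sample = "INVOICE\nInvoice No: INV-2041\nDate: 2021-03-15\nTotal:   980.50\n"
result = process(sample)
print(result)
```

A non-empty `errors` list is where a real system would route the document to human review, matching the human supervision described in the post-processing step.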

On this note, there are plenty of IDP solutions in the market, but not all are built the same, particularly with regard to how information extraction models are trained to extract data points from documents. Most extraction models use only Machine Learning (ML) as a technique to annotate the key values. This requires a substantial number of samples and considerable annotation effort from users.

While the number of samples required varies, estimates run into the thousands for machine learning algorithms, translating to several months of programming effort.

“You need thousands of examples.
No fewer than hundreds.
Ideally, tens or hundreds of thousands for “average” modeling problems.
Millions or tens-of-millions for “hard” problems like those tackled by deep learning.”
- Machine Learning Mastery

Today, advanced solutions combine Machine Learning, to deal with the structured or semi-structured content, with Natural Language Processing, to read and understand the meaning of words and sentences and so extract key values from unstructured text. This reduces the need for a large sample size, and model creation can be much more effective (i.e. higher accuracy).

Key Value Extraction

3. AI can connect the dots through a knowledge mind map

We’ve discussed various Machine Learning approaches that make sense of data by learning from volumes of training samples. But there’s another approach that takes a different tack—Ontology-based AI.

Instead of inferring patterns from massive training sets, this approach understands data, from structured to unstructured, by looking at its fundamental concepts and relationships.

Ontology is actually a branch of philosophy that deals with the study of being and existence. Linking ontology-based AI back to its philosophical origins can help us understand how it is used in information systems:

“An Ontology refers to a set of concepts and categories in a subject area or domain that shows their properties and the relations between them.”

- Oxford Living Dictionary

In AI, you can think of Ontology as a knowledge map that reflects the associations among different data points. Because these knowledge representations are machine-readable, ontologies give systems an understanding of the content and relationships between data elements, to enable systems to make inferences that mimic how a human brain works.

Ontology AI


To further illustrate this, imagine a large maritime company trading with a renowned bank like Citibank. Across its various business units, data repositories and spreadsheets, there can be many variations of ‘Citibank’: perhaps CITI, or CSL (Citibank Singapore Limited), and so on.

Reconciling all these entities through AI that can periodically scan the data and understand how concepts are linked (by equivalence, hierarchy or association) can give a dramatic improvement in accessibility. In other words, instead of learning that the variations of ‘Citibank’ are equivalent, an ontology can formally define their relationship. Suddenly, the shipping company can instantly understand its entire exposure to Citibank.
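
A minimal sketch of the idea, using plain Python dictionaries in place of a real ontology language: equivalence and hierarchy relations are declared once, and every record is then resolved against them. The relation tables, the assumption that Citibank Singapore Limited sits under Citibank, and all exposure figures are invented for illustration.

```python
# Hypothetical mini-ontology. Names, relations and amounts are illustrative only.
EQUIVALENT_TO = {
    "CITI": "Citibank",
    "CSL": "Citibank Singapore Limited",
}
PART_OF = {
    "Citibank Singapore Limited": "Citibank",  # assumed hierarchy
}

def canonical(name: str) -> str:
    """Resolve an alias to its canonical entity name (equivalence relation)."""
    name = name.strip()
    return EQUIVALENT_TO.get(name, name)

def belongs_to(name: str, parent: str) -> bool:
    """True if the entity is the parent itself or sits below it (hierarchy relation)."""
    entity = canonical(name)
    seen = set()
    while entity is not None and entity not in seen:
        if entity == parent:
            return True
        seen.add(entity)
        entity = PART_OF.get(entity)
    return False

# Records as they might appear across different spreadsheets (amounts in millions).
records = [("CITI", 2.0), ("CSL", 1.5), ("Citibank", 3.0), ("Other Bank", 4.0)]
exposure = sum(amount for name, amount in records if belongs_to(name, "Citibank"))
print(exposure)
```

Because the relationships are stated explicitly rather than learned, adding a new alias is a one-line change and needs no retraining.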

4. AI can make data accessible through enterprise search

With a knowledge graph in place, ontology-based AI has huge spillovers to data accessibility. This is particularly valuable amidst expanding data silos which, if they are hard to reach or hard to read, can make decision making extremely cumbersome.

Think about when you had to schedule meetings with multiple teams to sync up on a piece of work, or scroll through layers of folders within your shared workspace to manually locate nuggets of information, not knowing where to look first. This is as much a managed process as it is a technology-driven one.

Adding metadata automatically, using an ontology as explained above, can make various formats of data searchable in a modern data management process. This supports taxonomic resolution, creating a common understanding of data across various domains. It also allows collaborative utilisation, where users across departments can capture and visualise metadata, including the ability to tag, rate and rank it.

At TAIGER, metatags are assigned through our Automatic Entity Recognition tool. It recognises and extracts entities based on the knowledge model (ontology) and classifies them into categories such as persons, organisations or locations.
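
To give a feel for knowledge-model-driven tagging in general, here is a minimal dictionary-lookup sketch. It is not TAIGER’s tool or its method: the `KNOWLEDGE_MODEL` contents, the person name and the `tag_entities` function are all hypothetical, and a production recogniser would handle aliases, ambiguity and unseen names that simple lookup cannot.

```python
import re

# Hypothetical knowledge model mapping categories to known entity names.
KNOWLEDGE_MODEL = {
    "person": ["Jane Tan"],
    "organisation": ["Citibank"],
    "location": ["Singapore"],
}

def tag_entities(text: str) -> list:
    """Return (entity, category) pairs in the order they appear in the text."""
    hits = []
    for category, names in KNOWLEDGE_MODEL.items():
        for name in names:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                hits.append((m.start(), name, category))
    return [(name, category) for _, name, category in sorted(hits)]

tags = tag_entities("Jane Tan met the Citibank team in Singapore.")
print(tags)
```

The resulting pairs are the kind of metatags that can then be attached to a document to make it searchable by entity.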

Omnitive Search Automatic Entity Recognition Tool

AI can also be applied to search for other keywords, their synonyms, and even vague concepts that you might not have a keyword for. For the latter, semantic search capabilities are the answer. Tools that leverage Natural Language Processing and Knowledge Representation, such as TAIGER’s semantic search engine, Omnitive Search, help to access these tricky sources of information with human-like intelligence.
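
The simplest step beyond exact keyword matching is synonym expansion, sketched below. This is only a small stand-in for the semantic search described above, which relies on much richer language understanding. The synonym table, document texts and function names are invented for the example.

```python
import re

# Hypothetical synonym table. A real system would draw on an ontology or language model.
SYNONYMS = {
    "invoice": {"bill", "statement"},
    "contract": {"agreement"},
}

def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))

def expand(query: str) -> set:
    """Add known synonyms to every term in the query."""
    terms = tokenize(query)
    for term in list(terms):
        terms |= SYNONYMS.get(term, set())
    return terms

def search(query: str, documents: dict) -> list:
    """Return ids of documents sharing at least one term with the expanded query."""
    terms = expand(query)
    return [doc_id for doc_id, text in documents.items() if terms & tokenize(text)]

documents = {
    "doc1": "Please pay the attached bill by Friday",
    "doc2": "Minutes of the quarterly meeting",
    "doc3": "Invoice for March services",
}
matches = search("invoice", documents)
print(matches)
```

A plain keyword search for "invoice" would miss `doc1`, which only mentions a bill. The expanded query finds it.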


AI as the key to unlock digital transformation, one data point at a time

With the explosion of information faced by enterprises today, particularly from customers, it’s hard to overstate the value of AI in unlocking unstructured information. AI’s ability to organise, extract and interpret data makes it a key tool for the broader solutions that larger domains and industries need, and one that can dramatically improve everything from customer experiences to commercial strategies.

And so, perhaps the question going forward is less whether AI can unlock the value of unstructured data, and more where AI has not yet been applied.

About TAIGER’s Omnitive Extract and Omnitive Search

Omnitive Extract and Omnitive Search are TAIGER’s solutions for unlocking unstructured data. Deployed across verticals within global financial institutions and the public sector, Omnitive copes with highly unstructured information to increase productivity and returns on investment.

Speak with one of our solutions managers today about how the Omnitive solution suite can unlock better data management, extraction and discoverability within your organisation.

