Structured data vs unstructured data vs semi-structured data: differences and challenges

Unstructured data forms the bulk of the data we see today. It is rich in information and easy to create: a scribble of a pen, or a quick email sent to a colleague. Unstructured data has no clear style or pattern, quite unlike its structured counterpart, which sits neatly in rows and columns. Within organisations, structured data is far rarer and much harder to create, but the advantage is that applications can easily integrate with it to perform complex tasks.

There is also a third category that falls in between: semi-structured data, which contains elements of both. Like unstructured data, it can seem easy for humans to read and decipher, yet it remains a challenge for traditional programmes.
| Structured data | Semi-structured data | Unstructured data |
| --- | --- | --- |
| Structured data refers to clearly defined numbers and strings stored in a formatted relational database. It is highly organised in rows and columns of a table, making it easy to manipulate and light to store. | Semi-structured data is not rigid enough to fit into a relational database, but has some structural consistency and quantitative elements. It may contain meta tags that describe the data (e.g. sender email address), though the actual contents vary. | Unstructured data is not organised into a clear data model, and its attributes take various forms, often qualitative and text-heavy. It appears intuitive for humans to read, but is challenging for traditional programmes to understand. |
| Example: relational SQL databases | Examples: JSON, XML, CSV, etc. | Examples: open text, freeform documents, contracts, email body contents, social media posts, etc. |
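To make the distinction concrete, here is a sketch of the same hypothetical record in each of the three forms. The field names, email text, and invoice number are invented for illustration:

```python
import json
import re

# Structured: fixed fields, like a row in a relational table
structured = {"name": "Alice Tan", "email": "alice@example.com", "amount": 120.50}

# Semi-structured: JSON with meta tags (sender, subject), but a free-form body
semi = json.loads(
    '{"sender": "alice@example.com", "subject": "Invoice", '
    '"body": "Hi, please find attached invoice INV-0042 for $120.50."}'
)

# Unstructured: free text; a "field" must be recovered, e.g. with a regex
unstructured = "Hi, please find attached invoice INV-0042 for $120.50."
invoice_no = re.search(r"INV-\d+", unstructured).group(0)

print(invoice_no)
```

The structured record can be queried directly; the semi-structured one exposes its meta tags but hides data inside the body; the unstructured text yields nothing without pattern matching or smarter extraction.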
How much unstructured data is there?
Findings from IBM, Gartner and IDC, along with various other analysts and vendors, estimate that around 80% of enterprise data is unstructured. And yes, this 80% is a ballpark figure that has been quoted a thousand times over, for decades.
But the message is clear: unstructured data matters, and it must be managed. According to IDC, unstructured and unanalysed data represented a US$430 billion opportunity cost in 2020. That's five times the size of the global AI, RPA and data analytics markets combined.
“Exploiting unstructured sources, primarily by applying text analytics, is an essential component of any comprehensive business-intelligence program, a key element in ensuring enterprise competitiveness.” – Seth Grimes, analyst for NLP, text analytics, and sentiment analysis
How can AI help unlock the value of unstructured data?
Data is the foundation on which everyday operations thrive. It is also an essential ingredient for information systems and AI models. Unfortunately, this entire premise rests on high-quality data, in massive volumes, with a manageable amount of structure.
What about getting this data in the first place? Can AI do anything about it?
There is a high wall of barriers to unlocking its value: the education, business-process re-wiring and technology development required. However, the good news is that artificial intelligence (AI) today can alleviate part of the problem and get the basics of data right.
- AI can improve data quality for better data management.
- AI can convert information into structured formats through document processing.
- AI can connect the dots through a knowledge mind map.
- AI can make data accessible through enterprise search.
1. AI can improve data quality for better data management
In an ideal world, high-quality data would sit in clearly and consistently labelled fields, clean and ready for use in downstream applications. Often, however, data is not understandable or readable: it is raw data.
Raw data can be strewn across documents in unpredictable formats, or scattered over separate documents and repositories with a great deal of inconsistency. Unstructured information is also likely to arrive 'noisy': typos, signatures, HTML tags, confidentiality notices and other non-essential elements. While humans read past such noise effortlessly, it significantly degrades the machine-readable quality of the data.

Acquiring accurate and comprehensive data sets, identifying anomalies or fraud, and detecting patterns can demand high manual effort. Yet ensuring data quality is the essential first step in preparing data for further processing and analysis.
This is where AI can play a critical role in 'data cleaning', preparing data for applications across the data life cycle. Various AI-based strategies, employing machine learning techniques for example, can improve data quality across data acquisition, maintenance, protection and usage.
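As a minimal sketch of the cleaning step, the following function strips the kinds of noise mentioned above with simple rules. Real pipelines would add ML-driven steps such as spell correction or anomaly detection; the sample email and the `--` signature marker are illustrative assumptions:

```python
import re

def clean_text(raw: str) -> str:
    """Remove common 'noise' from raw text: HTML tags, an email
    signature block, and inconsistent whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)       # strip HTML tags
    text = re.split(r"(?m)^--\s*$", text)[0]  # drop everything after a "--" signature line
    text = re.sub(r"\s+", " ", text).strip()  # normalise whitespace
    return text

raw = "<p>Payment of  $500 received.</p>\n-- \nJohn, Sales Dept"
print(clean_text(raw))  # "Payment of $500 received."
```

Rules like these handle the predictable noise; the unpredictable variation across documents is where learned models earn their keep.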
2. AI can convert information into structured formats through document processing

| Structured documents | Semi-structured documents | Unstructured documents |
| --- | --- | --- |
| Structured documents are fixed forms where the document layout and formatting are completely static across pages. | Semi-structured documents contain fixed information like names and dates, but also unstructured information such as descriptions of items purchased. | Unstructured documents contain information in a free format. |
| Examples: surveys, questionnaires, application forms | Examples: IDs, bills of lading, invoices, purchase orders, medical certificates | Examples: letters, contracts, articles, powers of attorney, resumes |
As seen in the examples above, semi-structured and unstructured documents are everywhere. Document-intensive operations from customer onboarding and trading, to employee recruitment and claims management deal with a high volume of such documents.
Digging through all this information with the human eye is a painful process that incurs substantial overhead costs. As you've probably guessed, there are AI-powered tools today that automate such document-intensive operations.
Within enterprises, the data capture tool to transform documents into structured information is called intelligent document processing (IDP).
Think of IDP as an essential step to make massive volumes of data usable once data is acquired and compiled. By extracting data points from unstructured documents, it helps enterprises cut short manual paper-driven processes to reduce costs and increase productivity.
- Pre-processing: The process starts with converting the document from an image into a text format. This improves data quality to prepare the data for further processing, as explained earlier. Techniques include optical character recognition (OCR) and image enhancement features such as de-skewing or de-noising, which ensure that the text output is highly accurate.
- Classification: Uploaded documents can contain multiple pages with different types or subtypes. For example, a mortgage application typically includes multiple pages with different documents such as a loan application, pay stubs, bank statements, ID documents, and others that need to be accurately identified and classified. Intelligent document classification uses AI-based technologies to automatically classify and separate such multipage documents to prepare for extraction.
- Extraction: The third step begins the information extraction process proper. Typically, IDP solutions include a library of pre-trained extraction models, each designed to extract a list of data points from a specific type of document. For example, an invoice extraction model which has been trained using thousands of invoice samples can be used to extract data points such as the invoice number, date, and line items within invoices received in a real business setting.
- Post-processing: Once data is extracted, it goes through a series of validation rules and AI-driven techniques to improve the extraction results. These could be predefined rules validated in an automated fashion or enhanced by using human supervision. After verifying that the extraction is accurate, structured data is ready to be used for further downstream applications.
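The four stages above can be sketched as a toy pipeline. This is not how a production IDP product is built (real systems use OCR engines and trained extraction models rather than rules); the document text, field names and regular expressions are illustrative assumptions:

```python
import re
from datetime import datetime

def preprocess(raw: str) -> str:
    """Stage 1: in lieu of OCR and de-skewing, just normalise whitespace."""
    return re.sub(r"\s+", " ", raw).strip()

def classify(text: str) -> str:
    """Stage 2: keyword-based routing to a document type."""
    return "invoice" if "invoice" in text.lower() else "other"

def extract(text: str) -> dict:
    """Stage 3: pull data points with patterns (a trained model in practice)."""
    return {
        "invoice_number": re.search(r"Invoice\s+No\.?\s*(\S+)", text).group(1),
        "date": re.search(r"Date:\s*(\d{4}-\d{2}-\d{2})", text).group(1),
    }

def postprocess(fields: dict) -> dict:
    """Stage 4: validate extracted values before downstream use."""
    datetime.strptime(fields["date"], "%Y-%m-%d")  # raises if the date is invalid
    return fields

doc = "Invoice No. INV-0042\n  Date: 2020-06-30\n  Total: $120.50"
text = preprocess(doc)
if classify(text) == "invoice":
    print(postprocess(extract(text)))
```

Swapping each rule-based stage for a learned model is essentially what distinguishes intelligent document processing from classic template-based data capture.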
“You need thousands of examples. No fewer than hundreds. Ideally, tens or hundreds of thousands for ‘average’ modeling problems. Millions or tens-of-millions for ‘hard’ problems like those tackled by deep learning.” – Machine Learning Mastery
3. AI can connect the dots through a knowledge mind map

“An Ontology refers to a set of concepts and categories in a subject area or domain that shows their properties and the relations between them.” – Oxford Living Dictionary
In AI, you can think of an ontology as a knowledge map that reflects the associations among different data points. Because these knowledge representations are machine-readable, ontologies give systems an understanding of the content and relationships between data elements, enabling them to make inferences that mimic how a human brain works.
To further illustrate this, imagine a large maritime company trading with a renowned bank like Citibank. Across various business units, data repositories and spreadsheets, 'Citibank' can appear in many variations: perhaps as CITI, or CSL (Citibank Singapore Limited), and so on.
Reconciling all these entities with AI that periodically scans the data and understands how concepts are linked (by equivalence, hierarchy or association) can dramatically improve accessibility. In other words, rather than each application separately learning that the variations of 'Citibank' are equivalent, an ontology formally defines their relationship. Suddenly, the shipping company can instantly see its entire exposure to Citibank.
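A toy version of this reconciliation can be sketched as follows. The alias table is a hypothetical stand-in for a formal ontology (which would be expressed in a language like OWL/RDF), and the exposure figures are invented:

```python
# Hypothetical knowledge model: equivalence and hierarchy relations
ontology = {
    # equivalence: surface forms that denote the same entity
    "equivalent": {
        "CITI": "Citibank",
        "Citibank Singapore Limited": "Citibank",
        "CSL": "Citibank",
    },
    # hierarchy: subsidiary -> parent
    "parent": {"Citibank": "Citigroup"},
}

def resolve(name: str) -> str:
    """Map any known variant to its canonical entity."""
    return ontology["equivalent"].get(name, name)

# Invented exposure figures scattered under different surface forms
exposures = {"CITI": 1_000_000, "CSL": 250_000, "Citibank": 500_000}

total = {}
for name, amount in exposures.items():
    canonical = resolve(name)
    total[canonical] = total.get(canonical, 0) + amount

print(total)  # exposure aggregated under one canonical entity
```

The point is that the equivalences live in one shared model, so every application that consults the ontology aggregates consistently.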
4. AI can make data accessible through enterprise search
With a knowledge graph in place, ontology-based AI has huge spillovers into data accessibility. This is particularly valuable amidst expanding data silos, which, if hard to reach or hard to read, can make decision-making extremely cumbersome.
Think about when you had to schedule meetings with multiple teams to sync up on a piece of work, or scroll through layers of folders in your shared workspace to manually locate nuggets of information, not knowing where to look first. This is as much a management process as it is a technology-driven one.
Adding metadata automatically, using an ontology as explained above, can make various formats of data searchable in a modern data management process. This supports taxonomic resolution, creating a common understanding of data across domains. It also allows collaborative use: users across departments can capture and visualise metadata, with the ability to tag, rate and rank it.
At TAIGER, metatags are assigned through our Automatic Entity Recognition tool. It recognises and extracts entities based on the knowledge model (ontology) and classifies them into categories such as person, organisation or location.
AI can also be applied to search for other keywords, their synonyms, and even vague concepts you might not have a keyword for. For the latter, semantic search capabilities are the answer. Tools that leverage Natural Language Processing and Knowledge Representation, such as TAIGER's semantic search engine, Omnitive Search, help access these tricky sources of information with human-like intelligence.
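A first step toward such capabilities can be sketched as keyword search with synonym expansion. The synonym table and sample documents below are hypothetical stand-ins for what an ontology or embedding model would supply:

```python
# Hypothetical synonym table; in practice an ontology or embedding
# model would provide these relations.
SYNONYMS = {"contract": {"agreement", "deal"}, "employee": {"staff", "personnel"}}

def expand(query: str) -> set:
    """Expand the query's terms with their known synonyms."""
    terms = set(query.lower().split())
    for term in list(terms):
        terms |= SYNONYMS.get(term, set())
    return terms

def search(query: str, documents: dict) -> list:
    """Rank documents by how many expanded query terms they contain."""
    terms = expand(query)
    scored = []
    for doc_id, text in documents.items():
        score = len(terms & set(text.lower().split()))
        if score:
            scored.append((score, doc_id))
    return [doc_id for score, doc_id in sorted(scored, reverse=True)]

docs = {
    "d1": "Signed agreement with the vendor",
    "d2": "Staff onboarding checklist",
    "d3": "Quarterly sales figures",
}
print(search("contract", docs))  # finds d1 even without the exact keyword
```

True semantic search goes further, matching on meaning rather than word lists, but the expansion step shows why a query can hit documents that never mention the query term.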
AI as the key to unlock digital transformation, one data point at a time
With the explosion of information faced by enterprises today, particularly from customers, it is hard to overstate the value of AI in unlocking unstructured information. AI's ability to organise, extract and interpret data makes it a key enabler of broader solutions across domains and industries, improving everything from customer experience to commercial strategy.
And so, perhaps the question going forward is less whether AI can unlock the value of unstructured data, and more: where has AI not been applied?
About TAIGER’s Omnitive Extract and Omnitive Search
Omnitive Extract and Omnitive Search are TAIGER's solutions to unlocking unstructured data. Deployed across verticals in global financial institutions and the public sector, Omnitive successfully handles highly unstructured information to increase productivity and return on investment.
Speak with one of our solutions managers today on how the Omnitive solution suite can unlock better data management, extraction and discoverability within your organisation.