This post was created by Shelf with Insider Studios.
We’ve all heard the explanations for why AI projects fail: the models aren’t advanced enough, they don’t remember past interactions, they hallucinate answers — the list goes on.
However, those explanations overlook AI’s number one enemy: the messy, constantly changing, and poorly managed documents and files that companies feed into their agentic, RAG, agent-assist, or other applications.
Most executives are unaware that the majority of their data — approximately 90%, according to IDC — is unstructured. It lives in emails, PDFs, slide decks, product manuals, knowledge bases, policies, contracts, and more.
Unstructured data is messy, constantly changing, and has been neglected for years. And it’s the biggest blocker to AI initiatives.
The problem of quality and context
Unstructured data suffers from two major problems: poor quality and a lack of context.
On the quality side, files are often outdated, duplicative, or inconsistent. They carry compliance risks when old policies circulate alongside new ones. And they contain redundant, obsolete, and trivial content (known as ROT) that clogs GenAI and prevents accurate outcomes.
On the context side, documents lack metadata, definitions of key business terms, and any effective way to process their free-flowing text. Industry jargon, product names, customer references, and even employee names often go undefined. Free-flowing text that isn’t tethered to a straightforward “who, what, when, and where” can’t be trusted.
Feed this kind of data into AI, and the outcome is inevitable: garbage in, garbage out.
Why ‘unstructured’ is radically different from ‘structured’
Structured data is simple. It’s binary: an event happened or it didn’t. Unstructured data is fundamentally different because it represents thought. And thoughts aren’t black and white. They’re nuanced, fluid, and dynamic. They don’t fit neatly into rows and columns.
Consider a 20-page policy document. Today, it may be 100% accurate, but tomorrow, perhaps one paragraph’s information needs updating. Three months later, another section is out of date. Suddenly, the document is misleading, even though 99% of the text itself remains correct.
That dynamic, qualitative nature is what makes unstructured data so difficult. It requires continuous quality assurance, not one-time checks. And unlike structured data, it can’t be verified with a simple quantitative measure. It must be repeatedly evaluated for accuracy, context, and currency.
Why workarounds don’t work
When confronted with unstructured data chaos, many IT teams default to working on the retrieval mechanism, because it can be tweaked and adjusted more readily than the data itself. Here are some of the usual techniques (their shared failure mode is sketched after the list):
- Context engineering focuses on improving retrieval, rather than the underlying data. If the content is bad, this simply accelerates the retrieval of bad answers.
- RAG/Graph RAG creates relationships between data but disregards whether the content itself has been quality-assured. Without that assurance and context, it still produces incorrect results, just more quickly.
- Rewriting and structuring content into templates already failed once: enterprises tried it with last-generation chatbots, attempting to predict every possible Q&A pair. It doesn’t scale.
- Unstructured data ETL processes only the metadata and ignores the free text, context, and semantics, leaving most of the meaning untouched.
These are all sound strategies once the data itself is fixed. But until then, they simply optimize the retrieval of bad information.
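To see the shared failure mode, consider a deliberately simplified, hypothetical retriever in Python. The documents and the similarity function below are made up for illustration; because the retriever ranks purely on similarity, a superseded policy and its current replacement score almost identically, and no amount of retrieval tuning can tell them apart:

```python
# Toy illustration only: a similarity-only retriever has no notion of which
# document version is current, duplicated, or quality-assured.
from difflib import SequenceMatcher  # crude stand-in for vector similarity

corpus = [
    {"text": "Refunds are issued within 30 days of purchase.", "updated": "2021-04-02"},
    {"text": "Refunds are issued within 14 days of purchase.", "updated": "2024-06-18"},
]

def similarity(query: str, text: str) -> float:
    """Lexical similarity as a placeholder for embedding search."""
    return SequenceMatcher(None, query.lower(), text.lower()).ratio()

def retrieve(query: str) -> dict:
    # Ranks purely on similarity; nothing marks the 2021 policy as superseded,
    # so the stale version can surface just as easily as the current one.
    return max(corpus, key=lambda doc: similarity(query, doc["text"]))

print(retrieve("How many days do customers have to request a refund?"))
```

Swapping in a better embedding model, a graph index, or smarter chunking changes how the stale paragraph is found, not whether it should be found at all. That judgment has to live in the data.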
The agentic barrier, and what needs to change
The next frontier in enterprise AI is agentic systems. However, ask an agent to reason over bad data that hasn’t been quality-assured and contextualized, and it will yield poor results. It won’t know which version of a document is the current one or understand company-specific jargon, and it definitely won’t connect tasks across departments. The outcome is hallucinations, bad recommendations, and wasted resources. Cue the headlines: “AI doesn’t work.”
According to Teresa Tung, global AI and data lead at Accenture, “We’ve done over 2,300 GenAI projects in the last year. About 1 in 2 cannot scale due to data readiness.”
To address the underlying content, enterprises must stop treating it as structured data and begin handling it as the complex, dynamic content it truly is. That means taking a few concrete steps (sketched in code after the list):
- Breaking content into its parts: metadata, concepts, and free-flowing text.
- Quality-assuring each part separately for accuracy, compliance, and updates.
- Implementing change management so content remains current.
- Enhancing with context — adding the who, what, when, where, and why.
- Processing semantically.
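As a rough illustration of what that decomposition can look like, here is a minimal Python sketch. The class and field names are assumptions made up for this example, not Shelf’s actual data model; the point is simply that metadata, concepts, and free-flowing text become separate, individually quality-assured parts:

```python
# Illustrative sketch only: hypothetical field names, not Shelf's schema.
# Each part of a document is separated out and quality-assured on its own.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class EnrichedDocument:
    # Metadata: the "who, what, when, where" that tethers the free text
    source: str                      # e.g. "refund-policy.pdf"
    owner: str                       # accountable team or person
    effective_date: date             # when this version became current
    review_due: date                 # change management: when to re-verify
    # Concepts: business terms and jargon resolved to definitions
    glossary: dict[str, str] = field(default_factory=dict)
    # Free-flowing text, kept separate so it can be processed semantically
    body: str = ""
    # Quality-assurance flags checked on every review cycle
    is_current: bool = True
    superseded_by: str | None = None

doc = EnrichedDocument(
    source="refund-policy.pdf",
    owner="Customer Care",
    effective_date=date(2024, 6, 18),
    review_due=date(2025, 6, 18),
    glossary={"RMA": "Return merchandise authorization"},
    body="Refunds are issued within 14 days of purchase...",
)
```

With the parts separated, each one can be checked on its own cadence: metadata validated when ownership changes, the glossary reviewed as terminology shifts, and the body re-verified whenever the review date comes due.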
This isn’t enrichment in the sense of filling in a few missing fields. It’s enhancement: adding metadata, context, and validation so that unstructured data becomes usable for AI.
Could companies fix unstructured data themselves? In theory, yes, but most have millions of files. Manually breaking those files apart, quality-assuring them, and enhancing them takes time that they don’t have.
Fix the data, fix the AI
“Adding context and quality assurance to your documents and files doesn’t have to be a huge drain of time and resources,” says Sedarius Tekara Perrotta, CEO of Shelf. “Shelf automates cleanup, contextualization, and quality assurance at scale, turning messy documents and files into Smart Data that AI can finally use with confidence.”
The companies that address their unstructured data issues see results that speak for themselves. A major coffee brand improved chatbot accuracy to 93% simply by refining its content. A global airline rescued a struggling RAG initiative in which answers were correct only half the time: by automating content quality assurance across thousands of files, it saw accuracy rates jump to nearly 90% overnight.
“You can’t have highly performant enterprise AI if you don’t have highly performant enterprise unstructured data,” says Perrotta. “We’ve seen companies that invest in fixing the underlying data fed into GenAI experience over 50% accuracy improvements in less than a month.”
Unstructured data doesn’t have to remain enemy number one, but companies that keep feeding AI messy, outdated, and poorly contextualized content will see more pilots stall, more budgets wasted, and more employees and customers lose trust. By addressing unstructured data head-on, companies can turn stalled initiatives around and pick up some quick, low-hanging wins with GenAI.