Large language models have an impressive ability to generate human-like content, but they also run the risk of generating confusing or inaccurate responses. In some cases, LLM responses can be harmful, biased, or even nonsensical. A common cause? Poor data quality. 

According to a poll of IT leaders by Gartner, poor data quality is the top obstacle companies face in their generative AI initiatives. This is especially true for organizations that use LLMs to interact with internal sources of knowledge, like customer data, financial transactions, or private healthcare information. 

Fortunately, there are a number of solutions to poor data quality. In this article, we examine one critical strategy: data enrichment. This is the process of adding contextualizing data to the knowledge source behind your retrieval-augmented generation (RAG) system so the LLM can answer more accurately. 

For instance, suppose your organization uses an acronym that the LLM either doesn’t recognize or interprets as something else. In these cases, the LLM will misread users’ queries and respond inappropriately. With robust data enrichment, you can provide the context the LLM needs to answer questions properly. 

Let’s explore four techniques to enrich your unstructured data with contextualizing information. 

1. Named Entity Recognition (NER)

Unstructured data, such as free-form text, is challenging to analyze, so you have to create some structure yourself. Named Entity Recognition is a form of data enrichment that transforms this unstructured data into a structured format, making it more accessible and analyzable for AI models.

NER involves identifying key elements of a text and classifying them into predefined categories, such as the names of people, organizations, and locations, as well as time expressions, quantities, monetary values, and percentages. These entities are tagged with metadata from your controlled knowledge source that gives the LLM better context. 

NER helps in extracting the “who,” “where,” and “when” from text, which is essential for providing context to LLMs. By understanding the entities in a conversation or text, LLMs can generate more relevant and accurate responses or analyses.

With entities identified, it’s easier to index documents and enhance the searchability of the content. This is particularly beneficial for RAG, as it enables more efficient and accurate retrieval of relevant information.

For AI systems that interact with users, like chatbots or virtual assistants, NER is critical because users will refer to these entities by name and expect the system to recognize them. A system that understands the entities in a query can respond more appropriately to user questions or comments. 

Examples of Named Entity Recognition

Customer Feedback Analysis: In customer reviews, NER can identify product names, customer names, attributes, and locations, helping your LLM to quickly understand customer sentiments about specific products or services.

News Aggregation: For news articles, NER can extract entities like the names of people, organizations, locations, dates, and events. This facilitates the categorization and clustering of news stories for better navigation and retrieval.

Healthcare: In medical records, NER can identify drug names, medical conditions, patient names, and treatment procedures, aiding in the organization and analysis of patient information for better healthcare outcomes.

Practical Application in RAG

In the context of RAG, enriching unstructured data with NER gives the model a more structured knowledge source to draw on. Using a tool like spaCy, a popular Python library for NLP, you can extract entities from your data and enrich them with additional information. With the new metadata, the LLM can better understand the entities and their relevance.

When a query or input is received, the RAG system identifies the relevant entities within it and then retrieves the most pertinent information or documents from your knowledge source based on these entities. 

For example, if your text mentions “Apple,” spaCy can use the surrounding context to help determine whether it refers to the fruit or the technology company. Providing such clarity enhances the LLM’s comprehension and its responses or decisions based on that data.

This process enhances the model’s ability to generate informed, context-aware responses, thereby improving the overall user experience and the utility of the AI system.
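The enrich-then-retrieve flow described above can be sketched in a few lines. This is a minimal illustration only: it swaps spaCy's statistical NER for a plain dictionary lookup (a gazetteer) so that it runs with the standard library alone, and the glossary entries, the `Chunk` class, and the function names are all hypothetical.

```python
import re
from dataclasses import dataclass, field

# Hypothetical controlled vocabulary: surface form -> (entity type, description).
# In practice this would come from your own glossary or knowledge source.
GAZETTEER = {
    "ARR": ("METRIC", "Annual Recurring Revenue"),
    "Project Falcon": ("PROJECT", "Internal data platform migration"),
    "Acme Corp": ("ORG", "Enterprise customer"),
}

@dataclass
class Chunk:
    text: str
    entities: list = field(default_factory=list)

def tag_entities(text):
    """Return (surface form, type, description) for every gazetteer hit."""
    found = []
    for term, (label, description) in GAZETTEER.items():
        # Whole-word, case-sensitive match so "ARR" does not match "arrive".
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            found.append((term, label, description))
    return found

def enrich(texts):
    """Attach entity metadata to each chunk before it is indexed."""
    return [Chunk(text=t, entities=tag_entities(t)) for t in texts]

def retrieve(query, chunks):
    """Return chunks that share at least one entity with the query."""
    query_terms = {term for term, _, _ in tag_entities(query)}
    return [c for c in chunks
            if query_terms & {term for term, _, _ in c.entities}]

chunks = enrich([
    "ARR grew 12% after Acme Corp renewed.",
    "Project Falcon is scheduled for Q3.",
    "The team will arrive at a decision soon.",
])
hits = retrieve("What is our ARR?", chunks)
```

Here `hits` contains only the first chunk, and its metadata carries the expansion of the acronym, which can be passed to the LLM alongside the text. A statistical model such as spaCy's would replace `tag_entities` when entities cannot be listed in advance.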

2. Keyword Generation

Keywords serve as a bridge connecting users’ queries to the most relevant content. They act as anchors that categorize and segment data, and can provide a quick snapshot of the text’s content. This helps LLMs understand a document’s topic or themes.

Keyword generation is another technique for giving structure to unstructured data. It is the process of identifying and extracting the significant words or phrases that best represent the content and context of that data. 

One option for keyword extraction is KeyBERT, a library that uses a transformer-based model (BERT) to extract keywords. These terms help RAG systems better understand the context of the unstructured data. 

Additionally, by tagging texts with accurate keywords, the LLM can efficiently locate and present the most pertinent information in response to a search query. Ultimately, this enhances the accessibility and usefulness of your RAG model.

Examples of Keyword Generation

Legal Document Retrieval: When analyzing legal documents or case law, keywords such as “contract dispute,” “intellectual property,” or “antitrust litigation” can be generated. These keywords enable RAG systems to quickly retrieve relevant legal precedents or articles from the knowledge source. This helps lawyers or legal researchers find pertinent information without sifting through extensive legal databases manually.

Healthcare and Medical Research: In a healthcare setting, RAG solutions could use keywords that pertain to medical research papers or patient records. These keywords allow the model to retrieve and synthesize relevant studies, treatment protocols, or patient outcomes, aiding healthcare professionals in making informed decisions.

Customer Support and FAQ Automation: Keywords generated from customer queries or product manuals, such as “battery life,” “software update,” or “warranty period,” can help retrieve the most relevant sections from a knowledge base or FAQ repository. This enables the system to provide precise, contextually relevant answers to customer inquiries.

Market Research and Competitive Analysis: RAG systems can use keywords generated from industry reports or news articles, like “market share,” “emerging trends,” or “consumer preferences.” These keywords help the model to fetch and compile relevant information from various sources, offering businesses insights into market dynamics, competitive landscapes, or consumer behaviors.

Practical Application in RAG

In the context of RAG, keyword generation plays a critical role in data enrichment. When a query is submitted, the model can leverage generated keywords to quickly sift through vast amounts of data and pinpoint relevant information or documents. 

This targeted retrieval based on keywords substantially enhances the LLM’s ability to generate responses or content that is pertinent, accurate, and contextually aligned with the user’s request.

By integrating keyword generation into the data preprocessing pipeline, you can significantly amplify the efficiency and effectiveness of your RAG system. Ultimately, your application will deliver more nuanced, relevant, and context-aware responses or analyses.
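As a rough illustration of keyword tagging in a preprocessing pipeline, the sketch below scores terms with plain TF-IDF and then matches query terms against each document's keywords. It deliberately uses TF-IDF rather than KeyBERT so that it runs with the standard library alone; the stopword list, sample documents, and function names are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Small illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "be", "can", "for", "how", "in", "is",
             "of", "on", "the", "to", "under", "within"}

def tokenize(text):
    """Lowercase, keep alphabetic words, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def extract_keywords(docs, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(docs)
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty documents
        scores = {
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Sort by score, then alphabetically, so the output is deterministic.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_n])
    return results

def search(query, keywords):
    """Return indexes of documents whose keywords overlap the query terms."""
    query_terms = set(tokenize(query))
    return [i for i, kws in enumerate(keywords) if query_terms & set(kws)]

docs = [
    "The warranty period for the battery is two years. "
    "Battery replacement is covered under warranty.",
    "Install the software update to improve battery life.",
    "Return policy: items can be returned within thirty days.",
]
keywords = extract_keywords(docs)
matches = search("How long is the warranty?", keywords)
```

In this toy corpus, "warranty" ranks first for the first document, so the warranty question resolves to that document only. An embedding-based extractor such as KeyBERT would slot in where `extract_keywords` is, and would also surface multi-word phrases that this word-level sketch cannot.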

3. Topic Modeling

Topic modeling is a technique used to discover the underlying thematic structure in a collection of texts. It’s a form of unsupervised learning that identifies topics in a text corpus, without needing any prior annotations or labeling. This method is pivotal in organizing, understanding, and summarizing large datasets of unstructured text.

Topic modeling algorithms can uncover latent topics within texts, which provides insight into their main themes. This is particularly useful for dealing with large volumes of text where manual analysis is impractical.

By categorizing texts into topics, RAG systems can more efficiently retrieve information related to a specific theme or subject. This enhances the relevance and precision of the retrieved content.

For instance, suppose a RAG system retrieves four pages of a book. In this case, the LLM lacks enough context to understand the pages. But if those pages were bundled with the chapter title, book title, headings, image metadata, and other details, the LLM would have a better understanding of the retrieved content. 

(Interestingly, this is similar to how human minds work. More data points create a clearer picture.)
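To make the book example concrete, here is a minimal sketch of bundling a retrieved passage with its surrounding metadata before it is handed to the LLM. The `Passage` fields, the bracketed header format, and the sample values are assumptions chosen for illustration, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    book_title: str
    chapter_title: str
    heading: str
    topic: str  # hypothetical label, e.g. assigned by a topic model

def to_context(passage):
    """Prefix the passage text with the metadata the LLM would otherwise lack."""
    header = (
        f"[Book: {passage.book_title} | Chapter: {passage.chapter_title} | "
        f"Section: {passage.heading} | Topic: {passage.topic}]"
    )
    return header + "\n" + passage.text

passage = Passage(
    text="Rotate credentials every 90 days and revoke unused keys.",
    book_title="Example Security Handbook",
    chapter_title="Access Management",
    heading="Key Rotation",
    topic="security policy",
)
context = to_context(passage)
```

The resulting `context` string starts with a one-line header naming the book, chapter, section, and topic, followed by the passage itself, so the model sees where the text came from as well as what it says.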

Topic modeling also enables personalized content recommendations by matching user preferences with topics identified in content, enhancing user engagement and satisfaction.

Examples of Topic Modeling

News Aggregation: Topic modeling can categorize news articles into topics such as politics, sports, entertainment, etc. For RAG, this categorization allows for the targeted retrieval of news content based on a user’s query or interest, improving the relevance of the information presented.

Academic Research: In academic databases, topic modeling can help categorize research papers into different fields or topics, like machine learning, quantum physics, or medieval history. This assists researchers in finding relevant literature and enables RAG to provide more focused and pertinent information retrieval.

Customer Feedback Analysis: By applying topic modeling to customer reviews and feedback, businesses can identify prevalent themes, such as product quality, customer service, or pricing. This insight helps companies address specific areas of concern and allows RAG to extract and summarize relevant feedback themes more effectively.

Social Media Monitoring: Topic modeling can analyze social media posts to identify trending topics or sentiment about specific subjects, even when users don’t use the same keywords. This provides organizations with real-time insights into public opinion, enabling RAG systems to tap into current trends for generating content or responses.

Practical Application in RAG

In the context of RAG, topic modeling enriches the retrieval process by allowing the model to understand the broader themes within the data. BERTopic, a transformer-based topic modeling technique, is useful here. 

When a query is received, the model can use the identified topics to fetch relevant documents or data. This ensures that the information retrieved is not just keyword-specific but also contextually aligned with the overall theme of the query. 

Essentially, this means the user doesn’t have to use precise keywords to find information or otherwise interact with the LLM. The LLM gains the ability to search and converse by meaning. 

Ultimately, this leads to a more nuanced understanding and generation of responses, making RAG more effective in delivering accurate and contextually relevant information.
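A simplified sketch of topic-based retrieval follows. It assumes the topics and their representative terms were already produced by a topic model such as BERTopic; the topic table, documents, and function names here are hypothetical, and the query is matched to a topic by simple term overlap, whereas a production system would more likely compare embeddings.

```python
import re

# Hypothetical output of a topic model: topic label -> representative terms.
TOPIC_TERMS = {
    "pricing": {"price", "cost", "expensive", "cheap", "subscription", "fee"},
    "support": {"support", "agent", "help", "ticket", "response", "wait"},
}

# Hypothetical documents, each already assigned a topic during enrichment.
DOCS = [
    {"text": "The monthly fee went up again this year.", "topic": "pricing"},
    {"text": "I waited two days for an agent to reply.", "topic": "support"},
    {"text": "Too expensive for what it offers.", "topic": "pricing"},
]

def query_topic(query):
    """Pick the topic whose representative terms overlap the query most."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    best = max(TOPIC_TERMS, key=lambda t: len(words & TOPIC_TERMS[t]))
    # Return None when the query matches no topic at all.
    return best if words & TOPIC_TERMS[best] else None

def retrieve_by_topic(query):
    """Return the text of every document assigned to the query's topic."""
    topic = query_topic(query)
    return [d["text"] for d in DOCS if d["topic"] == topic]

results = retrieve_by_topic("What do customers say about cost?")
```

Both pricing documents are returned even though neither contains the word "cost", which is the point of retrieving by topic rather than by exact keyword.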

4. Link Contextualization

Link contextualization refers to the practice of providing detailed descriptions for hyperlinks within the content of your knowledge source. This gives your RAG system insight into the context and content of the linked material, enabling a more nuanced understanding beyond the surface level of texts.

Ultimately, link contextualization creates a richer, interconnected data environment, allowing RAG systems to make more informed associations and retrieve more relevant content in response to queries. This creates a better experience for users. 

Without link contextualization, the LLM is limited to the link’s anchor text. In some cases, the anchor text is only a few words. In other cases, it is just a raw URL made up of a string of meaningless characters (such as a Google Doc link), which leaves the LLM totally ignorant of the destination page’s content. 

This technique is particularly useful in enhancing the model’s ability to perform tasks like summarization, content generation, and question answering based on linked resources. Semantic search also benefits greatly from this approach, as a richer understanding of linked content improves the relevance and depth of search results.

Examples of Link Contextualization

Research and Academic Papers: In academic papers, link contextualization can help RAG systems understand the relevance of citations and references. This provides insights into the background and supporting materials that influence the main content.

News Articles: For news content, contextualizing links to previous articles, related events, or biographical information can enrich the model’s comprehension of current events, historical context, and the relationships between various news elements.

Corporate Knowledge Bases: In organizational knowledge bases, contextualizing links to internal reports, policy documents, or project pages can aid RAG in delivering more comprehensive and relevant corporate information to employees.

E-Learning Platforms: On educational websites, link contextualization can assist RAG in understanding the connections between different learning modules, background materials, and supplementary content, enhancing the educational experience.

Practical Application in RAG

Incorporating link contextualization into RAG systems allows them to leverage a deeper understanding of the content and context of hyperlinks, not just isolated pieces of text. It also makes a text corpus with robust interlinking (such as an internal knowledge base) more valuable, because the LLM gains contextual understanding of every linked page. 

When processing a query, a RAG system can use the rich descriptions of links to fetch and integrate more nuanced information, ensuring that the generated responses are not only relevant but also deeply informed by the interconnected content.
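One simple way to apply this during preprocessing is to rewrite each document so that every known link carries a short description of its destination. The sketch below assumes those descriptions already exist (written by hand or generated by summarizing each linked page); the URL, the description table, and the "(links to: ...)" format are all hypothetical.

```python
import re

# Hypothetical descriptions of link destinations, keyed by URL.
LINK_DESCRIPTIONS = {
    "https://docs.example.com/d/1aB9xZ":
        "Q3 travel expense policy, including per-diem limits",
}

URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")

def contextualize_links(text):
    """Append a description after every URL that has one on record."""
    def add_description(match):
        raw = match.group(0)
        # Keep sentence punctuation that directly follows the URL out of the lookup.
        url = raw.rstrip(".,;:!?")
        trailing = raw[len(url):]
        description = LINK_DESCRIPTIONS.get(url)
        if description is None:
            return raw  # unknown links are left untouched
        return f"{url} (links to: {description}){trailing}"
    return URL_PATTERN.sub(add_description, text)

enriched = contextualize_links(
    "See https://docs.example.com/d/1aB9xZ before booking."
)
```

After this step, the opaque document link is followed by a plain-language summary of the page it points to, so both the retriever and the LLM have something meaningful to work with.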

Beyond Data Enrichment

Data enrichment techniques such as Named Entity Recognition, keyword generation, topic modeling, and link contextualization are indispensable for enhancing the performance and effectiveness of RAG. 

By providing contextualizing information and structuring unstructured data, these strategies enable RAG systems to generate more accurate, context-aware responses, thereby improving user experiences and the utility of AI systems. 

As organizations harness the power of generative AI, data enrichment methodologies will be essential tools for unlocking the full potential of these technologies and driving innovation across various domains.