While large language models excel at generating human-like content, they also risk producing confusing or erroneous responses, a problem that often stems from poor data quality. 

Poor data quality is the primary hurdle for companies embarking on generative AI projects, according to Gartner’s study of IT leaders. This challenge is especially problematic for organizations that leverage LLMs to engage with internal knowledge sources (i.e., a knowledge base).

Fortunately, we have a number of strategies at our disposal to address poor data quality. In this article, we explore one critical strategy: identifying unstructured data risks before they’re fed into your LLM. 

Let’s walk through four techniques for mitigating risks in your unstructured data. Your job is to identify the risky data in your knowledge source and address it before your language model uses it to generate inappropriate content. 

1. Noise Detection

Noise detection involves identifying and filtering out irrelevant, misleading, or erroneous information that can distort the learning process and compromise the accuracy of AI-generated outputs. Filtering out this noise ensures that only high-quality, relevant information is utilized by RAG, thereby enhancing the reliability and effectiveness of the model’s responses.

For instance, suppose you have a document with page numbers at the end of each page. During processing, those numbers would be mingled with the document’s content and ultimately confuse the LLM. It’s best to strip these page numbers out before processing to ensure the document’s meaning remains intact. 
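
As a minimal sketch of this kind of cleanup (assuming page numbers appear as standalone lines, optionally prefixed with "Page"), a regular expression can strip them before the document is chunked and embedded:

```python
import re

# Matches lines that contain only an optional "Page" label and digits.
PAGE_NUMBER_LINE = re.compile(r"^\s*(?:page\s+)?\d+\s*$", re.IGNORECASE | re.MULTILINE)

def strip_page_numbers(text: str) -> str:
    """Remove standalone page-number lines so they aren't mingled with content."""
    return PAGE_NUMBER_LINE.sub("", text)

print(strip_page_numbers("End of chapter one.\nPage 12\nChapter two begins."))
```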

Examples of Noise in Unstructured Data

Spelling errors and typos: Incorrectly spelled words or typographical errors can introduce noise into the dataset, leading to misinterpretation by RAG solutions. 

Irrelevant information: Extraneous details or unrelated content within the dataset can distract the model and skew its understanding of the input. This includes disclaimers, page numbers, inline advertisements, irrelevant dates, and footnotes (a filtering sketch follows this list).

Inconsistent formatting: Variations in formatting styles or structures across different documents can create inconsistencies in the dataset, making it challenging for the model to extract meaningful insights.

Ambiguous or misleading content: Text containing ambiguous language, double meanings, or misleading information can introduce noise and affect the accuracy of the model’s outputs. This ambiguous language can be eliminated or adjusted for clarity. 
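
To make this concrete, here is a rough sketch of a rule-based noise filter; the patterns are illustrative assumptions, and a real pipeline would tune them to its own corpus:

```python
import re

# Hypothetical noise patterns; adapt these to the quirks of your own documents.
NOISE_PATTERNS = [
    re.compile(r"^\s*disclaimer[:\s]", re.IGNORECASE),  # boilerplate disclaimers
    re.compile(r"^\s*\[\d+\]\s"),                       # footnote markers like "[3] "
    re.compile(r"\badvertisement\b", re.IGNORECASE),    # inline ads
]

def filter_noise(text: str) -> str:
    """Drop lines that match any known noise pattern."""
    return "\n".join(
        line for line in text.splitlines()
        if not any(p.search(line) for p in NOISE_PATTERNS)
    )
```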

How Noise Detection Fits into RAG

Incorporating noise detection into the preprocessing pipeline of a RAG model helps ensure that the model only uses clean, high-quality data. This, in turn, improves the model’s ability to generate accurate and contextually relevant responses. Good data in, good response out, as they say. 

Furthermore, noise detection helps streamline the data, providing the model with a more coherent and reliable input. It also reduces the volume of data that must be processed downstream, which reduces your time and monetary investment.

2. Duplicates Identification

Duplicates refer to multiple instances of the same or highly similar content. Over-representation of this data through repetition can skew the model’s training process and lead to biased or inaccurate outputs.

Duplicates identification involves the use of algorithms to detect and remove duplicate entries from the dataset, ensuring that each data point is unique and contributes meaningfully to the model’s learning process.
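
For exact and trivially reformatted duplicates, content hashing is a common approach. Here is a minimal sketch (normalizing case and whitespace before hashing is an assumption; stricter or looser normalization may suit your data better):

```python
import hashlib

def deduplicate(documents: list[str]) -> list[str]:
    """Keep only the first occurrence of each duplicate document."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        # Normalize case and whitespace so trivial variations hash identically.
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```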


Examples of Duplicates in Unstructured Data

  1. Replicated articles or documents: Multiple copies of the same article or document may exist within the dataset, leading to redundancy and potentially biasing the model’s understanding of certain topics or themes.
  2. Near-duplicate content: Slightly modified versions of the same content, such as paraphrased text or variations in formatting, should also be identified and removed. Semantic near-duplicates can be caught by measuring the cosine similarity between document vectors (see the sketch after this list). 
  3. Data entry errors: Inconsistent data entry practices or inadvertent duplications during data collection can result in duplicate entries within the dataset.
  4. Scraped content: When collecting data from multiple sources, there may be instances where identical or highly similar content is scraped and included in the dataset, leading to duplication.
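
As referenced above, near-duplicates can be surfaced by comparing document vectors with cosine similarity. Here is a minimal sketch using scikit-learn’s TF-IDF vectors (the 0.9 threshold is an assumption to tune; swapping in sentence embeddings would catch paraphrases that share little vocabulary):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_near_duplicates(documents: list[str], threshold: float = 0.9) -> list[tuple[int, int]]:
    """Return index pairs of documents whose cosine similarity exceeds the threshold."""
    vectors = TfidfVectorizer().fit_transform(documents)
    similarities = cosine_similarity(vectors)
    return [
        (i, j)
        for i in range(len(documents))
        for j in range(i + 1, len(documents))
        if similarities[i, j] >= threshold
    ]
```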

How Duplicates Identification Fits into RAG

Before feeding unstructured data into the RAG system, duplicates identification algorithms can be applied to detect and remove duplicate entries, ensuring that RAG has access to only clean data, free from redundant or biased information. 

This makes the model more efficient and focused, allowing it to better generalize patterns and relationships within the data. This, in turn, improves the model’s ability to generate accurate and contextually relevant responses to user queries.

Additionally, removing duplicate content helps you avoid processing redundant data. Processing data takes time and comes at a cost, so limiting it as much as possible is important. 

Ongoing duplicates identification can help maintain the integrity of the data and prevent the accumulation of redundant information over time. This iterative approach ensures that the RAG system remains adaptable and responsive to changes in the data landscape.


3. Link Health Assessment

The third step to identifying and minimizing data risks is an assessment of all of the links within your knowledge source. 

Link health assessment is a proactive strategy aimed at evaluating the quality and reliability of hyperlinks within unstructured data before integrating them into RAG. 

Hyperlinks play a crucial role in providing additional context and reference points within textual data. However, broken or misleading links can introduce noise and inaccuracies into the dataset, compromising the model’s performance and the reliability of generated outputs.

Link health assessment involves analyzing the health and validity of hyperlinks by checking for factors such as:

Accessibility: Ensuring that hyperlinks lead to accessible and functioning web pages or resources. Broken links can disrupt the flow of information and hinder the model’s ability to retrieve relevant content.

Relevance: Assessing the relevance and credibility of linked resources to ensure that they align with the overall context and objectives of the dataset. Irrelevant or misleading links can introduce noise and bias into the model’s training data.

Authority: Evaluating the authority and trustworthiness of linked sources to mitigate the risk of propagating false or unreliable information. Links to reputable and authoritative sources enhance the credibility of the dataset and the reliability of model outputs.

Recency: Checking the recency of linked content to ensure that it reflects up-to-date information and aligns with the temporal context of the dataset. Outdated or obsolete links can lead to inaccuracies and undermine the relevance of the model’s responses.

Examples of Link Health Assessment in Unstructured Data

Link health assessment algorithms typically take three steps (a sketch of the first step follows the list):

  1. Verifying the validity of hyperlinks in textual documents by programmatically checking their HTTP status codes to detect broken or inaccessible links.
  2. Assessing the credibility of linked sources by analyzing factors such as domain authority, publication reputation, and user ratings.
  3. Monitoring the recency of linked content by comparing publication dates or timestamps to ensure that the information remains current and relevant.
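
Here is a minimal sketch of the first step, using the requests library to flag broken links (the five-second timeout and the HEAD-then-GET fallback are assumptions; credibility and recency checks would layer on top of this):

```python
import re
import requests

URL_PATTERN = re.compile(r"https?://[^\s)\"'>]+")

def check_link_health(text: str, timeout: float = 5.0) -> dict[str, bool]:
    """Map each hyperlink found in the text to whether it resolved successfully."""
    results: dict[str, bool] = {}
    for url in set(URL_PATTERN.findall(text)):
        try:
            # HEAD is cheaper; some servers reject it, so fall back to GET.
            response = requests.head(url, timeout=timeout, allow_redirects=True)
            if response.status_code >= 400:
                response = requests.get(url, timeout=timeout, allow_redirects=True)
            results[url] = response.status_code < 400
        except requests.RequestException:
            results[url] = False
    return results
```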

How Link Health Assessment Fits into RAG

Incorporating link health assessment into the preprocessing pipeline of a RAG system helps ensure the reliability and accuracy of the model’s training data by filtering out unreliable or outdated hyperlinks.

Before integrating unstructured data containing hyperlinks into RAG, link health assessment algorithms evaluate the quality and validity of linked resources. This mitigates the risk of propagating false or misleading information and enhances the credibility of the model’s outputs.

By proactively assessing link health, RAG systems can make more informed decisions about the relevance and reliability of linked content, leading to more accurate and contextually relevant responses to user queries.


4. Private Information Detection

Private information detection is a crucial strategy aimed at safeguarding sensitive or confidential data within unstructured datasets before incorporating them into RAG. Unintentional exposure of private information can lead to privacy violations, legal consequences, and erosion of user trust.

Private information detection involves identifying and redacting sensitive information, which can include anything you don’t want your LLM to interact with. Here are some common examples:

Personally Identifiable Information (PII): Personal data elements such as names, addresses, social security numbers, birthdates, ages, email addresses, and phone numbers.

Financial Information: Sensitive financial data such as credit card numbers, bank account details, and transaction records.

Health Information: Medical information including diagnoses, treatment records, and patient identifiers protected by regulations such as HIPAA.

Confidential Corporate Data: Proprietary business information, trade secrets, and internal communications.

Examples of Private Information Detection in Unstructured Data

To detect and eliminate private information within your datasets, consider using these methods (a pattern-matching sketch follows the list): 

  • Natural Language Processing (NLP) techniques to extract and classify entities within text data, flagging instances of PII, financial data, and health information.
  • Pattern recognition algorithms to identify structured formats such as credit card numbers or social security numbers within textual data.
  • Machine learning models trained on labeled datasets to automatically detect and redact sensitive information based on contextual cues and patterns.
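
As an illustration of the pattern-recognition approach, here is a sketch with deliberately simplified regexes; production detectors need broader coverage, validation (e.g., Luhn checks for card numbers), and locale awareness:

```python
import re

# Simplified, US-centric patterns for illustration only.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[-. ]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"(?<!\d)\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace matched PII with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text

print(redact_pii("Call 555-867-5309 or email jane.doe@example.com."))
```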

How Private Information Detection Fits into RAG

Integrating private information detection into the preprocessing pipeline of RAG helps ensure compliance with privacy regulations and protects sensitive data from unauthorized access or disclosure.

Before feeding unstructured data containing potentially sensitive information into the RAG system, the detection algorithms outlined above can be applied to identify and redact sensitive data elements. This helps mitigate the risk of privacy breaches and ensures that the model’s outputs do not inadvertently expose confidential information.

spaCy’s Named Entity Recognition (NER) capabilities can be used to scan text data for potential PII, such as names, locations, phone numbers, or other identifiers. Once identified, these pieces of information can be flagged, redacted, or handled according to the necessary privacy protocols, ensuring that your LLM training or analysis respects privacy constraints and complies with legal standards.
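
Here is a minimal sketch of that workflow, assuming the small English model (en_core_web_sm) is installed; which entity labels count as PII is a policy decision, not a spaCy default:

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Entity labels treated as PII here; adjust to your own privacy policy.
PII_LABELS = {"PERSON", "GPE", "LOC", "ORG", "DATE"}

def redact_entities(text: str) -> str:
    """Replace flagged named entities with their label to mask identities."""
    doc = nlp(text)
    redacted = text
    # Work from the end of the text so character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in PII_LABELS:
            redacted = redacted[:ent.start_char] + f"[{ent.label_}]" + redacted[ent.end_char:]
    return redacted

print(redact_entities("Dr. Jane Smith visited Boston on March 3rd."))
```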

By proactively detecting and redacting private information, RAG systems can generate responses that respect user privacy and comply with data protection regulations. This enhances user trust and confidence in the model’s ability to handle sensitive information responsibly.

As you can imagine, private information depends on context. For instance, a location may be private in one case (such as healthcare records) but relevant in others (like a customer’s shipping address). Before connecting a RAG system to a knowledge source, it’s up to the owner of the knowledge base to decide what constitutes private information for their use case. 


Don’t Be Limited by Poor Data Quality

Addressing data risks in RAG is paramount for ensuring its effectiveness, reliability, and ethical use. Strategies such as noise detection, duplicates identification, link health assessment, and private information detection play pivotal roles in mitigating the potential pitfalls of unstructured data. 

By integrating these proactive measures into the preprocessing pipeline, RAG solutions can operate on high-quality, trustworthy datasets, leading to more accurate, contextually relevant outputs while safeguarding against privacy breaches and maintaining user trust. 

As the field of generative AI continues to evolve, prioritizing data risk management will be essential for fostering responsible innovation and maximizing the benefits of these powerful technologies.

Read more of this series to learn more strategies to address poor data quality and make LLMs perform optimally. 

  • Part 1: Unstructured Data Enrichment
  • Part 3: Filtering Out High Risk Data
  • Part 4: Fixing Data at the Source
  • Part 5: Monitoring for Risk and Improvements