Hallucinations and ungrounded results are a significant challenge in Content Processing systems. When AI-generated content contains statements that are inconsistent with the input data or knowledge base, it can lead to the spread of misinformation and erode trust in the system.

Microsoft Azure’s Groundedness service aims to address this issue by detecting hallucinations, but how well does it actually perform?

In this blog post, we’ll analyze the service’s effectiveness using a permuted Q&A dataset.

Azure Groundedness Service Overview

The Azure Groundedness service is designed to identify whether the information in AI-generated content is grounded in the input data. It analyzes the output content and compares it against the provided context to determine if any parts are inconsistent or hallucinated.

Here’s a simplified overview of how it works:

  1. The service takes the generated content and input context as inputs
  2. It breaks the content into smaller units (e.g., sentences or phrases) and, for each unit, searches for supporting evidence in the input context
  3. If no clear support is found, that unit is flagged as potentially ungrounded
  4. The service returns an overall groundedness score and highlights the ungrounded parts
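
To make this concrete, here is a minimal sketch of calling the groundedness detection endpoint in Azure AI Content Safety from Python. Treat the api-version, request fields, and response fields as assumptions based on the preview REST API; check them against your own deployment.

    import requests

    # Assumed Azure AI Content Safety resource endpoint and preview api-version.
    ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
    URL = f"{ENDPOINT}/contentsafety/text:detectGroundedness?api-version=2024-09-15-preview"

    payload = {
        "domain": "Generic",
        "task": "QnA",
        "qna": {"query": "<the user question>"},            # question the answer responds to
        "text": "<the generated answer to check>",          # content being evaluated
        "groundingSources": ["<the retrieved context>"],    # input context to ground against
        "reasoning": False,
    }

    response = requests.post(
        URL,
        headers={"Ocp-Apim-Subscription-Key": "<your-key>"},
        json=payload,
    )
    result = response.json()
    print(result["ungroundedDetected"])   # True if any part of the answer was flagged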

By surfacing hallucinations, the Groundedness service aims to improve the reliability and factuality of AI content generation. However, as we’ll see, it has significant limitations.

Dataset Structure

    AdversarialQA is a question answering dataset collected using an adversarial human-and-model-in-the-loop approach. The dataset consists of three subsets, each collected using a different model in the annotation loop:

    • BiDAF: Questions collected with a BiDAF model in the loop
    • BERT: Questions collected with a BERT model in the loop
    • RoBERTa: Questions collected with a RoBERTa model in the loop

    Each subset contains 10,000 training, 1,000 validation, and 1,000 test examples. The questions were composed by human annotators with the goal of finding examples that the model in the loop could not answer correctly. The context passages were sourced from the SQuAD1.1 dataset.

    The questions in each subset become progressively more challenging as the model in the loop becomes stronger (BiDAF < BERT < RoBERTa).

    This is reflected in the question types, with the RoBERTa subset containing more questions that require complex reasoning skills compared to the BiDAF subset.
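
    For reference, each example pairs an adversarially written question with a SQuAD context passage and its gold answer(s). A quick way to inspect a record is sketched below; the `adversarial_qa` dataset name and its `dbidaf`/`dbert`/`droberta` config names on the Hugging Face Hub are assumptions, not something the original pipeline depends on.

    from datasets import load_dataset

    # Load the BiDAF-in-the-loop subset and look at one validation example.
    dbidaf = load_dataset("adversarial_qa", "dbidaf", split="validation")
    example = dbidaf[0]
    print(example["question"])
    print(example["context"][:200])
    print(example["answers"]["text"])   # list of gold answer strings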


    Dataset Description

    To test the Azure Groundedness service, we used a question answering dataset modified to introduce hallucinations. The original data, sourced from AdversarialQA with questions of mixed complexity, contains question-answer pairs along with the relevant context passages.
    The following script was used to transform the data:

    
    import numpy as np
    import pandas as pd

    def prepare_data_mix(row_dict, rows=1000):
        # Build a DataFrame and sample up to `rows` QA pairs, all labelled Original.
        df = pd.DataFrame(row_dict)
        df_sample = df.sample(min(len(df), rows))
        df_sample['q_a_type'] = 'Original'
        # Select 25% of the sampled rows and shuffle their answers among themselves,
        # so each selected question ends up with an answer from another question.
        random_rows = df_sample.sample(int(rows * 0.25), random_state=42)
        random_rows['answer'] = np.random.permutation(random_rows['answer'])
        df_sample.loc[random_rows.index, 'answer'] = random_rows['answer']
        df_sample.loc[random_rows.index, 'q_a_type'] = 'Permuted_Answer'
        df_sample.reset_index(drop=True, inplace=True)
        return df_sample
    

    This function does the following:

    1. Samples a subset of the original QA pairs
    2. Marks them as Original in the q_a_type column
    3. Randomly selects 25% of these rows
    4. Permutes the answers among this subset
    5. Marks the permuted rows as Permuted_Answer in q_a_type

    The end result is a dataset where 25% of the answers are hallucinated by design. These permuted answers should be incompatible with their corresponding questions and context passages.
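
    To tie this together, a usage sketch is shown below: it loads the three AdversarialQA subsets from the Hugging Face Hub, flattens them into the row_dict format the function expects, and applies prepare_data_mix. The dataset and config names, and the choice to keep only the first gold answer per example, are assumptions for illustration.

    from datasets import load_dataset

    # Assumed Hugging Face config names for the three AdversarialQA subsets.
    subsets = [load_dataset("adversarial_qa", cfg, split="train")
               for cfg in ("dbidaf", "dbert", "droberta")]

    row_dict = {
        "question": [q for s in subsets for q in s["question"]],
        "context": [c for s in subsets for c in s["context"]],
        # Keep the first gold answer string for each example.
        "answer": [a["text"][0] for s in subsets for a in s["answers"]],
    }

    mixed_df = prepare_data_mix(row_dict, rows=1000)
    mixed_df["q_a_type"].value_counts()   # expect roughly 750 Original / 250 Permuted_Answer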

    Analysis of Results

    Let’s dive into the results of evaluating the Azure Groundedness service on this mixed dataset.

    First, we can see that the dataset has 750 original examples and 250 permuted ones, matching the 25% permutation rate:

    
    df.value_counts('q_a_type')
    q_a_type
    Original           750
    Permuted_Answer    250 
    Name: count, dtype: int64
    

    However, when we look at how many were flagged as containing hallucinations (ungroundedDetected), the numbers are concerning:

    
    df.value_counts('ungroundedDetected')  
    ungroundedDetected
    False    879
    True     121
    Name: count, dtype: int64
    

    Only 121 out of the 1,000 total examples were marked as ungrounded, even though we know 250 contain permuted answers.

    Digging deeper, we can check how many of the permuted examples were correctly identified:

    
    df[(df['ungroundedDetected'] == True) & (df['q_a_type'] != 'Original')].shape
    (88, 10)
    

    So out of the 250 examples that definitely contain hallucinations, only 88 were detected. That’s a dismal 35.2% recall.

    On the flip side, let’s look at how many of the original examples were wrongly flagged as ungrounded (false positives):

    
    df[(df['ungroundedDetected'] == True) & (df['q_a_type'] == 'Original')].shape
    (33, 10)  
    

    The service marked 33 of the 750 original, non-permuted examples as containing hallucinations. While not as bad as the false negatives, this still indicates a precision issue.
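
    Putting the same counts into a confusion matrix, with Permuted_Answer as the positive class, makes the trade-off explicit; the expected array is derived directly from the figures above:

    from sklearn.metrics import confusion_matrix

    y_true = df["q_a_type"] == "Permuted_Answer"   # ground-truth hallucination flag
    y_pred = df["ungroundedDetected"]              # service verdict

    confusion_matrix(y_true, y_pred)
    # array([[717,  33],    <- grounded examples: 717 passed, 33 false alarms
    #        [162,  88]])   <- permuted examples: 162 missed, 88 correctly flagged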

    To quantify the overall performance, we can calculate precision, recall and F1 scores, treating Permuted_Answer as the positive class:

    
    from sklearn.metrics import precision_recall_fscore_support
    
    df["true_hallucination"] = df["q_a_type"] == "Permuted_Answer"
    df["pred_hallucination"] = df['ungroundedDetected']
    
    precision_recall_fscore_support(df["true_hallucination"], 
                                    df["pred_hallucination"],
                                    average='binary')
    (0.7272727272727273, 0.352, 0.4743935309973045, None)  
    

    The results are disappointing across the board:

    • Precision of 0.727 means over 1 in 4 flagged examples were not actually hallucinated
    • Recall of 0.352 means almost 2 out of 3 hallucinated examples were missed
    • F1 score of 0.474 is a far cry from effective hallucination detection
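
    As a sanity check, the same F1 score follows directly from the raw counts:

    # Harmonic mean of precision and recall, computed from the counts above.
    precision = 88 / 121   # flagged examples that were truly permuted
    recall = 88 / 250      # permuted examples that were flagged
    f1 = 2 * precision * recall / (precision + recall)
    print(round(precision, 3), round(recall, 3), round(f1, 3))   # 0.727 0.352 0.474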

    Comparison to RAGTruth Benchmark

    The RAGTruth paper (Niu et al., 2024) provides a useful point of comparison for the Azure Groundedness service results. RAGTruth is a large-scale dataset specifically designed to evaluate hallucination detection in retrieval-augmented generation (RAG) settings, making it highly relevant to our analysis.
    The authors benchmarked several hallucination detection methods on RAGTruth, including prompt-based approaches using GPT-3.5 and GPT-4, as well as a fine-tuned LLaMA model.

    At the response level, their results, with our Azure Groundedness measurement added for comparison, were as follows:

    Method                  F1 Score
    GPT-4 Prompt            63.4%
    Fine-tuned LLaMA-13B    78.7%
    Azure Groundedness      47.4%

    In our experiments, Azure’s Groundedness service attained a much lower F1 of 47.4% compared to the GPT-4 prompt (63.4%) and fine-tuned LLaMA-13B (78.7%).

    For span-level detection, the benchmark reported:

    Method                  F1 Score
    GPT-4 Prompt            28.3%
    Fine-tuned LLaMA-13B    52.7%

    While we did not evaluate Azure at the span level, its poor response-level performance suggests it would struggle to precisely localize hallucinations. These results further underscore the shortcomings of Azure’s Groundedness service compared to state-of-the-art methods.

    The RAGTruth benchmark demonstrates that significantly better hallucination detection is possible, both using prompts with more capable models like GPT-4, and through fine-tuning on a high-quality dataset.

    Microsoft could potentially boost its service by pursuing similar approaches. However, even the best-performing models in the RAGTruth evaluation left substantial room for improvement, especially in span-level detection.

    This highlights the difficulty of the hallucination detection task and the need for continued research and development of more robust methods. Datasets like RAGTruth will play a key role in measuring progress.

    Limitations in Azure Groundedness

    Our analysis, along with the RAGTruth benchmark results, reveals serious limitations in the Azure Groundedness service’s ability to reliably detect hallucinations in the context of retrieval-augmented generation.

    On our permuted Q&A dataset, Azure achieved only a 47.4% F1 score at the response level, far behind the 63.4% of a GPT-4 prompt and 78.7% of a fine-tuned LLaMA model on RAGTruth. The low recall is particularly concerning, as the vast majority of hallucinated responses slipped through undetected.

    RAGTruth also highlighted the difficulty of precise span-level detection, where even the best model attained just 52.7% F1. While we did not evaluate Azure at this granularity, its poor response-level showing suggests it would fare even worse at localizing specific hallucinations.

    These shortcomings underscore the need for significant improvements before the Azure Groundedness service can be considered a robust solution for surfacing factual inconsistencies. The RAGTruth results demonstrate that leveraging more advanced language models, either through prompting or fine-tuning, is a promising path forward that Microsoft could potentially emulate.

    However, hallucination detection remains an open challenge, with ample room for progress even among state-of-the-art methods. Continued research and development of novel approaches, spurred by high-quality benchmark datasets like RAGTruth, will be essential to realizing the full potential of trustworthy retrieval-augmented generation.

    As language models become increasingly deployed in high-stakes applications, the ability to reliably surface hallucinations and prevent the spread of misinformation is paramount. Our analysis suggests that Azure’s Groundedness service, in its current form, falls short of this crucial goal. Substantial advancements are necessary to close the gap between its performance and the demands of real-world use cases.