LettuceDetect: A Hallucination Detection Framework for RAG Applications

Originally published on HuggingFace

TL;DR

We present LettuceDetect, a lightweight hallucination detector for Retrieval-Augmented Generation (RAG) pipelines. It is an encoder-based model built on ModernBERT, released under the MIT license with ready-to-use Python packages and pretrained models.

  • What: LettuceDetect is a token-level detector that flags unsupported segments in LLM answers. 🥬
  • How: Trained on RAGTruth (18k examples), leveraging ModernBERT for context lengths up to 4k tokens. 🚀
  • Why: It addresses (1) the context-window limits in prior encoder-only models, and (2) the high compute costs of LLM-based detectors. ⚖️
  • Highlights:
    • Beats prior encoder-based models (e.g., Luna) on RAGTruth. ✅
    • Surpasses fine-tuned Llama-2-13B [1] at a fraction of the size, and is highly efficient at inference. ⚡️
    • Entirely open-source with an MIT license. 🔓

LettuceDetect keeps your RAG framework fresh by spotting rotten parts of your LLM’s outputs. 😊


Why LettuceDetect?

Large Language Models (LLMs) such as GPT-4 [4], the Llama-3 models [5], and Mistral [6] have made considerable advances across NLP tasks. Despite this success, hallucinations remain a key obstacle to deploying LLMs in high-stakes domains such as healthcare or law [7,8].

Retrieval-Augmented Generation (RAG) attempts to mitigate hallucinations by grounding an LLM’s responses in retrieved documents, providing external knowledge that the model can reference [9]. Although RAG is a powerful way to reduce hallucinations, LLMs still hallucinate in these settings [1]. Hallucinations are pieces of information in the output that are nonsensical, factually incorrect, or inconsistent with the retrieved context [8]. Ji et al. [10] categorize hallucinations into:

  • Intrinsic hallucinations: Stemming from the model’s preexisting internal knowledge.
  • Extrinsic hallucinations: Occurring when the answer conflicts with the context or references provided.

While RAG approaches can mitigate intrinsic hallucinations, they are not immune to extrinsic hallucinations. Sun et al. [11] showed that models tend to prioritize their intrinsic knowledge over the external context. Because LLMs remain prone to hallucinations, their use in critical domains such as medicine or law can still be flawed.

Current solutions for hallucination detection

Current solutions for hallucination detection fall into three categories, based on the approach they take:

  1. Prompt-based detectors: These methods (e.g., RAGAS, TruLens, ARES) typically leverage zero-shot or few-shot prompts to detect hallucinations. They often rely on large LLMs (like GPT-4) and employ strategies such as SelfCheckGPT [12], LM vs. LM [13], or Chainpoll [14]. While often effective, they can be computationally expensive due to repeated LLM calls.
  2. Fine-tuned LLM detectors: Large models (e.g., Llama-2, Llama-3) can be fine-tuned for hallucination detection [1,15]. This can yield high accuracy (as shown by the RAGTruth authors using Llama-2-13B or the RAG-HAT work on Llama-3-8B) but is resource-intensive to train and deploy. Inference costs also tend to be high because of the models' size and slower speeds.
  3. Encoder-based detectors: Models like Luna [2] rely on a BERT-style encoder (often limited to 512 tokens) for token-level classification. These methods are generally more efficient than running a full LLM at inference but are constrained by short context windows and attention mechanisms optimized for smaller inputs.

ModernBERT for long context

ModernBERT [3] is a drop-in replacement for BERT and a state-of-the-art encoder-only transformer architecture. It incorporates several modern design improvements over the original BERT model: Rotary Positional Embeddings (RoPE) to handle sequences of up to 8,192 tokens, an unpadding optimization that eliminates wasted computation on padding tokens, GeGLU activation layers for enhanced expressiveness, and alternating attention for more efficient attention computation.

LettuceDetect capitalizes on ModernBERT’s extended context window to build a token-level classifier for hallucination detection. This approach sidesteps many limitations of older BERT-based models (e.g., short context bounds) and avoids the inference overhead of large LLM-based detectors. Our experiments show that LettuceDetect outperforms other encoder-based systems while remaining competitive with fine-tuned LLM detectors at a fraction of their computational cost.

Data

RAGTruth is the first large-scale open-source (MIT License) benchmark specifically designed to evaluate hallucination detection in Retrieval-Augmented Generation (RAG) settings. It contains 18,000 annotated examples spanning multiple tasks:

  • Question answering (QA): Sampled from the MS MARCO dataset, where up to three documents are retrieved for each question, and an LLM is prompted to produce an answer.
  • Data-to-text generation: Based on the Yelp Open Dataset, where LLMs generate reviews for sampled businesses.
  • News summarization: Uses random documents from CNN/DailyMail, prompting LLMs to provide summaries.

Each example in RAGTruth includes:

  1. A question and prompt,
  2. One or more reference passages or context documents,
  3. An LLM-generated response,
  4. Human annotations at the span level indicating whether any part of the answer is factually unsupported or contradictory to the reference passages.

Below is an illustrative sample from RAGTruth, showing an instance of hallucination:


QUESTION

How to explain quartiles?

REFERENCE

Passage 2:
Second quartile (Q2) which is more commonly known as median splits the data in half (50%).
Median divides the data into a lower half and an upper half.
Third quartile (Q3), also known as upper quartile, splits lowest 75% (or highest 25%) of data.
(…)

RESPONSE

The first quartile (Q1) splits the lowest 25% of the data, while the second quartile (Q2) splits the data into two equal halves, with the median being the middle value of the lower half. Finally, the third quartile (Q3) splits the highest 75% of the data.
(…)

ANNOTATION

  • Span: “highest 75%”
  • Type: Evident Conflict
  • Reason:
    • Original text: “lowest 75% (or highest 25%)”
    • Generative text: “highest 75%”

This example shows a RAGTruth annotation that marks specific spans in the answer conflicting with the reference passages, and provides both the type of hallucination (e.g., Evident Conflict) and a human-readable rationale. Our method uses only the binary annotations (hallucinated or not) and ignores the hallucination-type labels.
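As a rough sketch of that reduction to binary annotations, the sample above can be collapsed to plain character offsets over the response. The record layout and the helper name `to_binary_spans` are illustrative assumptions, not RAGTruth's actual schema.

```python
# Hypothetical record layout, for illustration only (not RAGTruth's real schema).
sample = {
    "response": (
        "The first quartile (Q1) splits the lowest 25% of the data, while the "
        "second quartile (Q2) splits the data into two equal halves, with the "
        "median being the middle value of the lower half. Finally, the third "
        "quartile (Q3) splits the highest 75% of the data."
    ),
    "annotations": [{"span": "highest 75%", "type": "Evident Conflict"}],
}


def to_binary_spans(record):
    """Drop the hallucination type and keep only (start, end) character offsets."""
    spans = []
    for ann in record["annotations"]:
        start = record["response"].find(ann["span"])
        if start != -1:  # skip spans that cannot be located in the response
            spans.append((start, start + len(ann["span"])))
    return spans


binary_spans = to_binary_spans(sample)
start, end = binary_spans[0]
print(sample["response"][start:end])  # the text covered by the binary span
```

Everything inside a span counts as hallucinated and everything outside as supported, which is all the token classifier needs.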

Method

Diagram of LettuceDetect
Image by Author

A high-level depiction of LettuceDetect. Here, an example Question, Context, and Answer triplet is processed. First, the text is tokenized, after which LettuceDetect performs token-level classification. Tokens from both the question and context are masked (indicated by the red line in the figure) to exclude them from the loss function. Each token in the answer receives a probability indicating whether it is hallucinated or supported. For span-level detection, we merge consecutive tokens with hallucination probabilities above 0.5 into a single predicted span.


We train ModernBERT-base and ModernBERT-large variants as token-classification models on the RAGTruth dataset. The input to the model is a concatenation of Context, Question, and Answer segments, with the special tokens [CLS] (for the context) and [SEP] (as separators). We limit the sequence length to 4,096 tokens for computational feasibility, though ModernBERT can theoretically handle up to 8,192 tokens.

Tokenization and data processing

  • Tokenizer: We employ AutoTokenizer from the Transformers library to handle subword tokenization, inserting [CLS] and [SEP] appropriately.
  • Labeling:
    • Context/question tokens are masked (i.e., assigned a label of -100 in PyTorch) so that they do not contribute to the loss.
    • Each answer token receives a label of 0 (supported) or 1 (hallucinated).
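The labeling scheme above can be sketched in a few lines. This is a minimal illustration, not the package's implementation: whitespace splitting stands in for the subword tokenizer, and the function name `label_tokens` and the example texts are assumptions made for the sketch.

```python
import re

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def label_tokens(prompt, answer, hallucinated_spans):
    """Return (tokens, labels) for a prompt followed by an answer.

    Whitespace splitting stands in for the subword tokenizer (an assumption
    made to keep the sketch dependency-free). hallucinated_spans holds
    (start, end) character offsets into the answer.
    """
    tokens, labels = [], []
    for word in prompt.split():
        tokens.append(word)
        labels.append(IGNORE_INDEX)  # context/question: excluded from the loss
    for match in re.finditer(r"\S+", answer):
        overlaps = any(
            match.start() < end and match.end() > start
            for start, end in hallucinated_spans
        )
        tokens.append(match.group())
        labels.append(1 if overlaps else 0)  # 1 = hallucinated, 0 = supported
    return tokens, labels


prompt = "The population of France is 67 million. What is the population of France?"
answer = "The population of France is 69 million."
tokens, labels = label_tokens(prompt, answer, [(28, 30)])  # offsets of "69"
print(list(zip(tokens, labels))[-7:])  # the answer tokens with their labels
```

With a real subword tokenizer the same overlap test would run on the character offsets the tokenizer reports for each token.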

Model architecture

Our models build on Hugging Face’s AutoModelForTokenClassification, using ModernBERT as the encoder and a classification head on top. Unlike some previous encoder-based approaches (e.g., ones pre-trained on NLI tasks), our method uses only ModernBERT with no additional pretraining stage.
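To make the role of the classification head concrete, here is a dependency-free sketch of how per-token outputs could be turned into a hallucination probability and a masked training loss. It assumes the head emits two logits per token ordered [supported, hallucinated], which matches the 0/1 labels described above but is not stated explicitly in the text; the function names are hypothetical.

```python
import math

IGNORE_INDEX = -100  # masked positions (context and question tokens)


def hallucination_prob(logits):
    """Softmax over two logits [supported, hallucinated]; returns P(hallucinated)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[1] / sum(exps)


def masked_cross_entropy(token_logits, labels):
    """Mean negative log-likelihood over tokens whose label is not IGNORE_INDEX."""
    losses = []
    for logits, label in zip(token_logits, labels):
        if label == IGNORE_INDEX:
            continue  # masked tokens contribute nothing to the loss
        p_hall = hallucination_prob(logits)
        p_true = p_hall if label == 1 else 1.0 - p_hall
        losses.append(-math.log(p_true))
    return sum(losses) / len(losses) if losses else 0.0


token_logits = [[5.0, -5.0], [2.0, 0.0], [0.0, 0.0]]
labels = [IGNORE_INDEX, 0, 1]
loss = masked_cross_entropy(token_logits, labels)
print(round(loss, 4))
```

Changing the logits of the first (masked) token leaves the loss untouched, which is the point of assigning -100 to context and question tokens.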

Training configuration

  • Optimizer: AdamW, with a learning rate of 1 × 10^-5 and weight decay of 0.01.
  • Hardware: Single NVIDIA A100 GPU.
  • Epochs: 6 total training epochs.
  • Batching:
    • Batch size of 8,
    • Data loading with PyTorch DataLoader (shuffling enabled),
    • Dynamic padding via DataCollatorForTokenClassification to handle variable-length sequences efficiently.

During training, we monitor token-level F1 scores on a validation split, saving checkpoints using the safetensors format. Once training is complete, we upload the best-performing models to Hugging Face for public access.

At inference time, the model outputs a probability of hallucination for each token in the answer. We aggregate consecutive tokens exceeding a 0.5 threshold to produce span-level predictions, indicating exactly which segments of the answer are likely to be hallucinated. The figure above illustrates this workflow.
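The aggregation step can be sketched as follows. The 0.5 threshold comes from the text; the function name `merge_spans`, the whitespace-level tokens, and the choice to report the maximum token probability as the span confidence are assumptions made for illustration.

```python
def merge_spans(token_offsets, probs, answer, threshold=0.5):
    """Merge consecutive tokens whose hallucination probability exceeds threshold.

    token_offsets: (start, end) character offsets of each answer token.
    probs: per-token hallucination probabilities, aligned with token_offsets.
    """
    spans, current = [], None
    for (start, end), prob in zip(token_offsets, probs):
        if prob > threshold:
            if current is None:
                current = {"start": start, "end": end, "probs": [prob]}
            else:
                current["end"] = end  # extend the open span
                current["probs"].append(prob)
        elif current is not None:
            spans.append(current)  # a supported token closes the span
            current = None
    if current is not None:
        spans.append(current)
    return [
        {
            "start": s["start"],
            "end": s["end"],
            # Assumption: span confidence is the max token probability.
            "confidence": max(s["probs"]),
            "text": answer[s["start"]:s["end"]],
        }
        for s in spans
    ]


answer = "Paris has 69 million people."
offsets = [(0, 5), (6, 9), (10, 12), (13, 20), (21, 28)]
probs = [0.02, 0.10, 0.97, 0.88, 0.30]
merged = merge_spans(offsets, probs, answer)
print(merged)
```

Here the two adjacent tokens above the threshold collapse into a single span covering "69 million".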

Next, we provide a more detailed evaluation of the model’s performance.

Results

We evaluate our models on the RAGTruth test set across all task types (Question Answering, Data-to-Text, and Summarization). For each example, RAGTruth includes manually annotated spans indicating hallucinated content.

Example-level results

We first assess the example-level question: Does the generated answer contain any hallucination at all? Our large model (lettucedetect-large-v1) attains an overall F1 score of 79.22%, surpassing:

  • GPT-4 (63.4%),
  • Luna (65.4%) (the previous state-of-the-art encoder-based model),
  • Fine-tuned Llama-2-13B (78.7%) as presented in the RAGTruth paper.

It is second only to the fine-tuned Llama-3-8B from the RAG-HAT paper [15] (83.9%), but LettuceDetect is significantly smaller and faster to run. Meanwhile, our base model (lettucedetect-base-v1) remains highly competitive while using fewer parameters.
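For readers who want to reproduce this kind of number on their own data, the example-level decision can be sketched as below. It assumes an answer is flagged whenever at least one hallucinated span is predicted for it; the function name and the toy inputs are illustrative and are not taken from the evaluation code.

```python
def example_level_f1(predicted_spans_per_example, gold_has_hallucination):
    """F1 for the question: does the answer contain any hallucination at all?

    An answer counts as flagged when at least one span is predicted for it
    (an assumption about how token-level output maps to an example-level label).
    """
    preds = [len(spans) > 0 for spans in predicted_spans_per_example]
    tp = sum(p and g for p, g in zip(preds, gold_has_hallucination))
    fp = sum(p and not g for p, g in zip(preds, gold_has_hallucination))
    fn = sum(g and not p for p, g in zip(preds, gold_has_hallucination))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: four answers, each with a list of predicted (start, end) spans.
predicted = [[(3, 9)], [], [(0, 4), (10, 12)], []]
gold = [True, True, True, False]
f1 = example_level_f1(predicted, gold)
print(round(f1, 2))
```

In this toy case the detector misses one hallucinated answer and raises no false alarms, so precision is 1.0 and recall is 2/3.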

Comparison table illustrating how LettuceDetect aligns against both prompt-based methods (e.g., GPT-4) and alternative encoder-based solutions (e.g., Luna)
Image by Author

Above is a comparison table showing how LettuceDetect compares against both prompt-based methods (e.g., GPT-4) and alternative encoder-based solutions (e.g., Luna). Overall, lettucedetect-large-v1 and lettucedetect-base-v1 both perform strongly while remaining efficient at inference.

Span-level results

Beyond detecting if an answer contains hallucinations, we also examine LettuceDetect’s ability to identify the exact spans of unsupported content. Here, LettuceDetect achieves state-of-the-art results among models that have reported span-level performance, substantially outperforming the fine-tuned Llama-2-13B model from the RAGTruth paper [1] and other baselines.

Image by Author

Most methods, like RAG-HAT [15], do not report span-level metrics, so we do not compare to them here.
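One common way to score span-level predictions is character-level overlap between predicted and annotated spans. The sketch below follows that idea under stated assumptions: the function names are hypothetical, and the exact scoring protocol is defined by the benchmark rather than by this snippet.

```python
def char_set(spans):
    """Expand (start, end) spans into the set of character positions they cover."""
    return {i for start, end in spans for i in range(start, end)}


def span_overlap_scores(pred_spans, gold_spans):
    """Character-level precision, recall and F1 between two lists of spans."""
    pred, gold = char_set(pred_spans), char_set(gold_spans)
    overlap = len(pred & gold)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# The prediction covers characters 10..19, the annotation covers 15..24.
scores = span_overlap_scores([(10, 20)], [(15, 25)])
print(scores)
```

Scoring at the character level rewards partial matches, so a prediction that is slightly too wide or too narrow still earns credit for the part it gets right.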

Inference efficiency

Both lettucedetect-base-v1 and lettucedetect-large-v1 have far fewer parameters than typical LLM-based detectors (e.g., GPT-4 or Llama-3-8B) and can process 30–60 examples per second on a single NVIDIA A100 GPU. This makes them practical for industrial workloads, real-time user-facing systems, and resource-constrained environments.

Overall, these results show that LettuceDetect strikes a good balance: it achieves near state-of-the-art accuracy at a fraction of the size and cost of large LLM-based judges, while offering precise, token-level hallucination detection.

Get going

Install the package:

pip install lettucedetect

Then, you can use the package as follows:

from lettucedetect.models.inference import HallucinationDetector

# For a transformer-based approach:
detector = HallucinationDetector(
    method="transformer", model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1"
)

contexts = ["France is a country in Europe. The capital of France is Paris. The population of France is 67 million.",]
question = "What is the capital of France? What is the population of France?"
answer = "The capital of France is Paris. The population of France is 69 million."

# Get span-level predictions indicating which parts of the answer are considered hallucinated.
predictions = detector.predict(context=contexts, question=question, answer=answer, output_format="spans")
print("Predictions:", predictions)

# Predictions: [{'start': 31, 'end': 71, 'confidence': 0.9944414496421814, 'text': ' The population of France is 69 million.'}]
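Once you have span predictions in the format printed above, a small helper can mark them in the answer for display. This is a sketch that does not call the package: `highlight` and the `[[ ]]` markers are hypothetical, the prediction is copied from the sample output, and spans are assumed not to overlap.

```python
def highlight(answer, predictions, open_mark="[[", close_mark="]]"):
    """Wrap each predicted span in markers.

    Spans are processed right to left so earlier offsets stay valid after
    each insertion. Assumes the spans do not overlap.
    """
    marked = answer
    for span in sorted(predictions, key=lambda s: s["start"], reverse=True):
        start, end = span["start"], span["end"]
        marked = marked[:start] + open_mark + marked[start:end] + close_mark + marked[end:]
    return marked


answer = "The capital of France is Paris. The population of France is 69 million."
# Copied from the sample output above rather than computed here.
predictions = [
    {
        "start": 31,
        "end": 71,
        "confidence": 0.9944414496421814,
        "text": " The population of France is 69 million.",
    }
]
marked = highlight(answer, predictions)
print(marked)
```

The `start` and `end` values are character offsets into the answer string, so slicing the answer with them returns the same text as the `text` field.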

Conclusion

We introduced LettuceDetect, a lightweight and efficient framework for hallucination detection in RAG systems. By utilizing ModernBERT’s extended context capabilities, our models achieve strong performance on the RAGTruth benchmark while retaining high inference efficiency. This work lays the groundwork for future research directions, such as expanding to additional datasets, supporting multiple languages, and exploring more advanced architectures. Even at this stage, LettuceDetect demonstrates that effective hallucination detection can be achieved using lean, purpose-built encoder-based models.

Citation

If you find this work useful, please cite it as follows:

@misc{Kovacs:2025,
      title={LettuceDetect: A Hallucination Detection Framework for RAG Applications}, 
      author={Ádám Kovács and Gábor Recski},
      year={2025},
      eprint={2502.17125},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.17125}, 
}

Also, if you use our code, please don’t forget to give us a star ⭐ on our GitHub repository.

References

[1] Niu et al., 2024, RAGTruth: A Dataset for Hallucination Detection in Retrieval-Augmented Generation

[2] Luna: A Simple and Effective Encoder-Based Model for Hallucination Detection in Retrieval-Augmented Generation

[3] ModernBERT: A Modern BERT Model for Long-Context Processing

[4] GPT-4 report

[5] Llama-3 report

[6] Mistral 7B

[7] Kaddour et al., 2023, Challenges and Applications of Large Language Models

[8] Huang et al., 2025, A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

[9] Gao et al., 2024, Retrieval-Augmented Generation for Large Language Models: A Survey

[10] Ji et al., 2023, Survey of Hallucination in Natural Language Generation

[11] Sun et al., 2025, ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability

[12] Manakul et al., 2023, SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

[13] Cohen et al., 2023, LM vs LM: Detecting Factual Errors via Cross Examination

[14] Friel et al., 2023, Chainpoll: A high efficacy method for LLM hallucination detection

[15] Song et al., 2024, RAG-HAT: A Hallucination-Aware Tuning Pipeline for {LLM} in Retrieval-Augmented Generation

[16] Devlin et al., 2019, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Shape
Shape
Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy,  bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

Shape

IBM, Arm team up to bring Arm software to IBM Z mainframes

That gap is precisely what the collaboration is intended to address, said Rachita Rao, senior analyst at Everest Group. “This is a mainframe adjacency play,” she said. “The intent is to extend IBM Z and LinuxONE environments by enabling Arm-compatible workloads to run closer to systems of record. While hyperscalers

Read More »

Microsoft facing CMA probe of its business software portfolio

Smith added that Microsoft recognizes that the CMA “will continue to review and assess additional issues relating to our products and services, including in the business software market. We are committed to working quickly and constructively to address these issues, including by providing all the information the CMA needs to

Read More »

Energy Department Initiates Additional Strategic Petroleum Reserve Emergency Exchange to Stabilize Global Oil Supply

WASHINGTON—The U.S. Department of Energy (DOE) issued a Request for Proposal (RFP) today for an emergency exchange of 10-million-barrels from the Strategic Petroleum Reserve (SPR). This action is part of the coordinated release of 400-million-barrels from IEA member nations’ strategic reserves President Trump previously announced. The United States continues to deliver on its 172-million-barrel release commitment.  The crude oil will originate from the Strategic Petroleum Reserve’s (SPR) Bryan Mound site. Today’s action builds on the initial phase of the Emergency Exchange, which moved quickly to award 45.2 million barrels from the Bayou Choctaw, Bryan Mound, and West Hackberry SPR sites. The 10-million-barrel exchange leverages the full capabilities of the SPR, alongside the President’s limited Jones Act waiver, to accelerate critical near-term oil flows into the market.  “Today’s action furthers the United States’ efforts to move oil quickly to the market and mitigate short-term supply disruptions,” said DOE Assistant Secretary of the Hydrocarbons and Geothermal Energy Office Kyle Haustveit. “Thanks to President Trump, America is managing our national security assets responsibly again. Through this exchange, we will continue to refill the Strategic Petroleum Reserve by bringing additional barrels back at a later date through this pragmatic exchange structure, strengthening its long-term readiness and all at no cost to the American taxpayer.”  Under DOE’s exchange authority, participating companies will return the borrowed 10 million barrels with additional premium barrels by next year. This exchange delivers immediate crude to refiners and the market while generating additional barrels for the American people at no cost to taxpayers.   Bids for the solicitation are due no later than 11:00 A.M. CT on Monday, April 6, 2026.    For more information on the SPR, please visit DOE’s website.   

Read More »

Trump Administration Keeps Colorado Coal Plant Open to Ensure Affordable, Reliable and Secure Power in Colorado

WASHINGTON—U.S. Secretary of Energy Chris Wright today issued an emergency order to keep a Colorado coal plant operational to ensure Americans maintain access to affordable, reliable and secure electricity. The order directs Tri-State Generation and Transmission Association (Tri-State), Platte River Power Authority, Salt River Project, PacifiCorp, and Public Service Company of Colorado (Xcel Energy), in coordination with the Western Area Power Administration (WAPA) Rocky Mountain Region and Southwest Power Pool (SPP), to take all measures necessary to ensure that Unit 1 at the Craig Station in Craig, Colorado is available to operate. Unit 1 of the coal plant was scheduled to shut down at the end of 2025, but on December 30, 2025, Secretary Wright issued an emergency order directing Tri-State and the co-owners to ensure that Unit 1 at the Craig Station remains available to operate. “The last administration’s energy subtraction policies threatened America’s energy security and positioned our nation to likely experience significantly more blackouts in the coming years—thankfully, President Trump won’t let that happen,” said Energy Secretary Wright. “The Trump Administration will continue taking action to ensure we don’t lose critical generation sources. Americans deserve access to affordable, reliable, and secure energy to power their homes all the time, regardless of whether the wind is blowing or the sun is shining.” Thanks to President Trump’s leadership, coal plants across the country are reversing plans to shut down. In 2025, more than 17 gigawatts (GW) of coal-powered electricity generation were saved. On April 1, once Tri-State and the WAPA Rocky Mountain Region join the SPP RTO West expansion, SPP is directed to take every step to employ economic dispatch to minimize costs to ratepayers. According to DOE’s Resource Adequacy Report, blackouts were on track to potentially increase 100 times by 2030 if the U.S. continued to take reliable

Read More »

NextDecade contractor Bechtel awards ABB more Rio Grande LNG automation work

NextDecade Corp. contractor Bechtel Corp. has awarded ABB Ltd. additional integrated automation and electrical solution orders, extending its scope to Trains 4 and 5 of NextDecade’s 30-million tonne/year (tpy) Rio Grande LNG (RGLNG) plant in Brownsville, Tex. The orders were booked in the third and fourth quarters of 2025 and build on ABB’s Phase 1 work with Trains 1-3, totaling 17 million tpy. The scope for RGLNG Trains 4 and 5 includes deployment of an integrated control and safety system consisting of a distributed control system, emergency shutdown, and fire and gas systems. An electrical controls and monitoring system will provide unified visibility of the plant’s electrical infrastructure. These two overarching solutions will provide a common automation platform. ABB will also supply medium-voltage drives, synchronous motors, transformers, motor controllers and switchgear. The orders also include local equipment buildings (two for Train 4 and one for Train 5) housing critical control and electrical systems in prefabricated modules to streamline installation and commissioning on site. The solutions being delivered to Bechtel use ABB adaptive execution, a methodology for capital projects designed to optimize engineering work and reduce delivery timelines. Phase 1 of RGLNG is under construction and expected to begin operations in 2027. Operations at Train 4 are expected in 2030 and Train 5 in 2031. ABB’s senior vice-president for the Americas, Scott McCay, confirmed to Oil & Gas Journal at CERAWeek by S&P Global in Houston that the company is doing similar work through Tecnimont for Argent LNG’s planned 25-million tpy plant in Port Fourchon, La., comprising a 10-million tpy Phase 1 and a 15-million tpy Phase 2. Argent is targeting 2030 completion for its plant.

Read More »

Persistent oil flow imbalances drive Enverus to increase crude price forecast

Citing impacts from the Iran war, near-zero flows through the Strait of Hormuz, accelerating global stock draws, and expectations for a muted US production response despite higher prices, Enverus Intelligence Research (EIR) raised its Brent crude oil price forecast. EIR now expects Brent to average $95/bbl for the remainder of 2026 and $100/bbl in 2027, reflecting what it described as a persistent global oil flow imbalance that continues to draw down inventories. “The world has an oil flow problem that is draining stocks,” said Al Salazar, director of research at EIR. “Whenever that oil flow problem is resolved, the world is left with low stocks. That’s what drives our oil price outlook higher for longer.” The outlook assumes the Strait of Hormuz remains largely closed for 3 months. EIR estimates that each month of constrained flows shifts the price outlook by about $10–15/bbl, underscoring the scale of the disruption and uncertainty around its duration. Despite West Texas Intermediate (WTI) prices of $90–100/bbl, EIR does not expect US producers to materially increase output. The firm forecasts US liquids production growth of 370,000 b/d by end-2026 and 580,000 b/d by end-2027, citing drilling-to-production lags, industry consolidation, and continued capital discipline. Global oil demand growth for 2026 has been reduced to about 500,000 b/d from 1.0 million b/d as higher energy prices and anticipated supply disruptions weigh on economic activity. Cumulative global oil stock draws are estimated at roughly 1 billion bbl through 2027, with non-OECD inventories—particularly in Asia—absorbing nearly half of the impact. A 60-day Jones Act waiver may provide limited short-term US shipping flexibility, but EIR said the measure is unlikely to materially affect global oil prices given broader market forces.
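EIR’s rule of thumb of about $10–15/bbl per month of constrained flows can be turned into a quick sensitivity check. The sketch below is illustrative only: it assumes the shift scales linearly around the 3-month closure baseline and applies it to the $95/bbl average for the remainder of 2026. Neither the linearity nor the function name `brent_range` comes from EIR.

```python
# Figures quoted in the article
BASE_BRENT_USD = 95.0               # EIR average for the remainder of 2026
BASE_CLOSURE_MONTHS = 3             # closure duration assumed in EIR's outlook
SHIFT_PER_MONTH_USD = (10.0, 15.0)  # stated per-month sensitivity range


def brent_range(closure_months: float) -> tuple[float, float]:
    """Return (low, high) Brent estimates for a hypothetical closure duration.

    Assumption (not from EIR): the per-month shift scales linearly
    around the 3-month baseline.
    """
    delta = closure_months - BASE_CLOSURE_MONTHS
    a = BASE_BRENT_USD + delta * SHIFT_PER_MONTH_USD[0]
    b = BASE_BRENT_USD + delta * SHIFT_PER_MONTH_USD[1]
    return (min(a, b), max(a, b))


if __name__ == "__main__":
    for months in (2, 3, 4):
        low, high = brent_range(months)
        print(f"{months} months closed: ${low:.0f}-${high:.0f}/bbl")
```

Under these assumptions, a closure one month longer than the baseline would lift the 2026 average to roughly $105–110/bbl, and one month shorter would lower it to roughly $80–85/bbl.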

Read More »

Equinor begins drilling $9-billion natural gas development project offshore Brazil

Equinor has started drilling the Raia natural gas project in the Campos basin presalt offshore Brazil. The $9-billion project is Equinor’s largest international investment, its largest project under execution, and the deepest-water operation in its portfolio. The drilling campaign, which began Mar. 24 with the Valaris DS‑17 drillship, includes six wells in the Raia area 200 km offshore in water depths of around 2,900 m. The area is expected to hold recoverable natural gas and condensate reserves of over 1 billion boe. Raia’s development concept is based on production through wells connected to a 126,000-b/d floating production, storage and offloading unit (FPSO), which will treat produced oil/condensate and gas. Natural gas will be transported through a 200‑km pipeline from the FPSO to Cabiúnas, in the city of Macaé, Rio de Janeiro state. Once in operation, expected in 2028, the project will have the capacity to export up to 16 million cu m/day of natural gas, which could represent 15% of Brazil’s natural gas demand, the company said in a release Mar. 24. “While drilling takes place, integration and commissioning activities on the FPSO are progressing well putting us on track towards a safe start of operations in 2028,” said Geir Tungesvik, executive vice-president, projects, drilling and procurement, Equinor. The Raia project is operated by Equinor (35%), in partnership with Repsol Sinopec Brasil (35%) and Petrobras (30%).

Read More »

Woodfibre LNG receives additional modules as construction advances

Woodfibre LNG LP has received two major modules within a week for its under‑construction, 2.1‑million tonne/year (tpy) LNG export plant near Squamish, British Columbia, advancing construction to about 65% complete. The deliveries include the liquefaction module (the project’s heaviest and most critical process unit) and the powerhouse module, which will serve as the plant’s central power and control hub. The liquefaction module, delivered aboard the heavy cargo vessel Red Zed 1, is the 15th of 19 modules scheduled for installation at the site, the company said in a Mar. 24 release. Weighing about 10,847 metric tonnes and occupying a footprint roughly equivalent to a football field, it is among the largest modules fabricated for the project. Once installed and commissioned, the liquefaction module will cool natural gas to about –162°C, converting it into LNG for export. Shortly after the liquefaction module’s arrival, Woodfibre LNG received the powerhouse module, the 16th module delivered to site. Weighing more than 4,200 metric tonnes, the powerhouse module will function as a power and control system, receiving electricity from BC Hydro and managing and distributing power to the plant’s electric‑drive compressors. The Woodfibre LNG project is designed as the first LNG export plant to use electric‑drive motors for liquefaction, replacing conventional gas‑turbine‑driven compressors. The Siemens electric‑drive system will be powered by renewable hydroelectricity from BC Hydro, eliminating the largest operational source of greenhouse gas emissions typically associated with liquefaction, the company said. The project is being built near the community of Squamish on the traditional territory of the Sḵwx̱wú7mesh Úxwumixw (Squamish Nation) and is regulated in part by the Indigenous government. All 19 modules are expected to arrive on site by spring 2026. Construction is scheduled for completion in 2027. Woodfibre LNG is owned by Woodfibre LNG Ltd. Partnership, which is 70% owned by Pacific Energy Corp.

Read More »

Data Center Jobs: Engineering, Construction, Commissioning, Sales, Field Service and Facility Tech Jobs Available in Major Data Center Hotspots

Each month Data Center Frontier, in partnership with Pkaza, posts some of the hottest data center career opportunities in the market. Here’s a look at some of the latest data center jobs posted on the Data Center Frontier jobs board, powered by Pkaza Critical Facilities Recruiting. Looking for Data Center Candidates? Check out Pkaza’s Active Candidate / Featured Candidate Hotlist. Power Applications Engineer, Pittsburgh, PA. This position is also available in: Denver, CO and Andrews, SC. Our client is a leading provider and manufacturer of industrial electrical power equipment used in industrial applications for mission critical operations. They help their customers save money by reducing energy and operating costs and provide solutions for modernizing their customers’ existing electrical infrastructure. This company provides cooling solutions to many of the world’s largest organizations and government facilities and enterprise clients, colocation providers and hyperscale companies. This career-growth-minded opportunity offers exciting projects with leading-edge technology and innovation as well as competitive salaries and benefits. Electrical Commissioning Engineer, Ashburn, VA. This traveling position is also available in: New York, NY; White Plains, NY; Dallas, TX; Richmond, VA; Montvale, NJ; Charlotte, NC; Atlanta, GA; Hampton, GA; New Albany, OH; Cedar Rapids, IA; Phoenix, AZ; Salt Lake City, UT; Kansas City, MO; Omaha, NE; Chesterton, IN or Chicago, IL. *** ALSO looking for LEAD EE and ME CxA Agents and CxA PMs. *** Our client is an engineering design and commissioning company that has a national footprint and specializes in MEP critical facilities design. They provide design, commissioning, consulting and management expertise in the critical facilities space. They have a mindset to provide reliability, energy efficiency, sustainable design and LEED expertise when providing these consulting services for enterprise, colocation and hyperscale companies. This career-growth-minded opportunity offers exciting projects with leading-edge technology and innovation as well as competitive

Read More »

No joke: data centers are warming the planet

The researchers also made use of a database provided by the International Energy Agency (IEA) that, the authors pointed out, contains more than 11,000 locations worldwide, of which 8,472 were found to lie outside highly dense urban areas. The latter locations were then used to “quantify the effect of data centers on the environment in terms of the LST gradient that could be measured on the areas surrounding each data center.” Asking the wrong question Asked if AI data centers are really causing local warming, or if this phenomenon is overstated, Sanchit Vir Gogia, chief analyst at Greyhound Research, said, “the signal is real, but the industry is asking the wrong question. The research shows a consistent rise in land surface temperature of around 2°C following the establishment of large data centre facilities.” The debate, however, “has quickly shifted to causality: whether this is driven by operational heat from compute, or by land transformation during construction. That distinction matters scientifically, but it does not change the strategic implication.” Land surface temperature, said Gogia, is not the same as air temperature, and that gap will be used to challenge the findings. “But dismissing the signal on that basis would be a mistake,” he noted. “Data centers concentrate energy use, replace natural surfaces with heat-retaining materials, and continuously reject heat into the environment. Those are known drivers of thermal change.” He added, “the uncomfortable truth is this: Even if the exact mechanism is debated, the outcome aligns with first principles. Infrastructure at this scale alters its surroundings. The industry does not yet have a clean way to separate construction impact from operational impact, and that ambiguity makes the risk harder to model, not easier. This is not overstated, it is under-interpreted.” Location strategy must change But will the findings change

Read More »

Schneider Electric Maps the AI Data Center’s Next Design Era

The coming shift to higher-voltage DC That internal power challenge led Simonelli to one of the most consequential architectural topics in the interview: the likely transition toward higher-voltage DC distribution at very high rack densities. He framed it pragmatically. At current density levels, the industry knows how to get power into racks at 200 or 300 kilowatts. But as densities rise toward 400 kilowatts and beyond, conventional AC approaches start to run into physical limits. Too much cable, too much copper, too much conversion equipment, and too much space consumed by power infrastructure rather than GPUs. At that point, he said, higher-voltage DC becomes attractive not for philosophical reasons, but because it reduces current, shrinks conductor size, saves space, and leaves more room for revenue-generating compute. “It is again a paradigm shift,” Simonelli said of DC power at these densities. “But it won’t be everywhere.” That is probably right. The transition will not be universal, and the exact thresholds will evolve. But his underlying point is powerful. As rack densities climb, electrical architecture starts to matter not only for efficiency and reliability, but for physical space allocation inside the rack. Put differently, power distribution becomes a compute-enablement issue. Distance between accelerators matters, too. The closer GPUs and TPUs can be kept together, the better they perform. If power infrastructure can be compacted, more of the rack can be devoted to dense compute, improving the economics and performance of the system. That is a strong example of how AI is collapsing traditional boundaries between facility engineering and compute architecture. The two are no longer cleanly separable. Gas now, renewables over time On onsite power, Simonelli was refreshingly direct. If the goal is dispatchable onsite generation at the scale now being contemplated for AI facilities, he said, “there really isn’t an alternative

Read More »

SoftBank’s 10 GW Ohio Campus Marks a Turning Point for AI Infrastructure

Renewables can reduce carbon intensity, but they cannot independently meet the need for continuous, multi-gigawatt firm capacity without large-scale storage and balancing resources. For developers targeting guaranteed availability within this decade, natural gas remains the most readily deployable option, despite the political and environmental tradeoffs it introduces. AEP and the Cost Allocation Model If the generation plan explains the engineering logic, the AEP structure speaks to the political one. At the center is one of the most contested questions in the data center market: who pays for the transmission and grid upgrades required to serve large new loads? Utilities, regulators, consumer advocates, and large-load customers are increasingly divided on this issue. Data center developers point to economic development benefits, including jobs and tax revenue. Consumer advocates counter that residential ratepayers should not subsidize infrastructure built primarily to serve hyperscale demand. The Ohio arrangement is being positioned as a response to that conflict. DOE states that SB Energy and AEP Ohio are partnering on $4.2 billion in new transmission infrastructure, with SB Energy committing to fund those investments rather than passing costs through to ratepayers. AEP has echoed that position, indicating the structure is intended to avoid upward pressure on transmission rates for Ohio customers. Whether that outcome holds will depend on regulatory review and execution. But the structure itself is significant. It frames a model in which large-load developers directly fund the transmission infrastructure required to support their projects, rather than relying on broader cost recovery mechanisms. That makes the project more than a construction milestone. It positions it as a potential policy template. If validated, this approach could influence how utilities and regulators across the U.S. address cost allocation for AI-scale infrastructure, particularly as similar disputes intensify in constrained grid regions. Why 765-kV Transmission Signals Scale AEP says the

Read More »

Q1 Executive Roundtable Recap

Matt Vincent is Editor in Chief of Data Center Frontier, where he leads editorial strategy and coverage focused on the infrastructure powering cloud computing, artificial intelligence, and the digital economy. A veteran B2B technology journalist with more than two decades of experience, Vincent specializes in the intersection of data centers, power, cooling, and emerging AI-era infrastructure. Since assuming the EIC role in 2023, he has helped guide Data Center Frontier’s coverage of the industry’s transition into the gigawatt-scale AI era, with a focus on hyperscale development, behind-the-meter power strategies, liquid cooling architectures, and the evolving energy demands of high-density compute, while working closely with the Digital Infrastructure Group at Endeavor Business Media to expand the brand’s analytical and multimedia footprint. Vincent also hosts The Data Center Frontier Show podcast, where he interviews industry leaders across hyperscale, colocation, utilities, and the data center supply chain to examine the technologies and business models reshaping digital infrastructure. Since its inception, he has served as Head of Content for the Data Center Frontier Trends Summit. Before becoming Editor in Chief, he served in multiple senior editorial roles across Endeavor Business Media’s digital infrastructure portfolio, with coverage spanning data centers and hyperscale infrastructure, structured cabling and networking, telecom and datacom, IP physical security, and wireless and Pro AV markets. He began his career in 2005 within PennWell’s Advanced Technology Division and later held senior editorial positions supporting brands such as Cabling Installation & Maintenance, Lightwave Online, Broadband Technology Report, and Smart Buildings Technology.
Vincent is a frequent moderator, interviewer, and keynote speaker at industry events including the HPC Forum, where he delivers forward-looking analysis on how AI and high-performance computing are reshaping digital infrastructure. He graduated with honors from Indiana University Bloomington with a B.A. in English Literature and Creative Writing and lives in southern New Hampshire with

Read More »

Executive Roundtable: The AI Infrastructure Credibility Test

For the fourth installment of DCF’s Executive Roundtable for the First Quarter of 2026, we turn to a question that increasingly sits alongside power and capital as a defining constraint: credibility. As AI-driven data center development accelerates, public scrutiny is rising in parallel. Communities, regulators, and policymakers are taking a closer look at the industry’s footprint in terms of its energy consumption, its land use, and its broader impact on local infrastructure and ratepayers. What was once a relatively low-profile sector has become a visible and, at times, contested presence in regional economies. This shift reflects the sheer scale of the current build cycle. Multi-hundred-megawatt and gigawatt campuses are no longer theoretical in any sense. They are actively being proposed and constructed across key markets. With that scale comes heightened expectations around transparency, accountability, and tangible community benefit. At the same time, the industry faces a more complex regulatory and political landscape. Questions around grid capacity, rate structures, environmental impact, and economic incentives are increasingly being debated in public forums, from state utility commissions to local zoning boards. In this environment, the ability to secure approvals is no longer assured, even in historically favorable markets. The concept of a “social license to operate” has therefore moved to the forefront. Beyond technical execution, developers and operators must now demonstrate that AI infrastructure can be deployed in a way that aligns with community priorities and delivers shared value. In this roundtable, our panel of industry leaders explores what will define that credibility in the years ahead and what the data center industry must do to sustain its momentum in an era of growing public scrutiny.

Read More »

Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn’t the only one that is ramping up its investments into AI-enabled data centers. Rival cloud service providers are all investing in either upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs). In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple to devote $200 billion between them to capex in 2025, up from $110 billion in 2023. Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft’s capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith’s claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are far higher than Microsoft’s 2020 capital expenditure of “just” $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads. Separately, last October Amazon CEO Andy Jassy said his company planned total capex spend of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.

Read More »

John Deere unveils more autonomous farm machines to address skilled labor shortage

Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction and commercial landscaping. The Moline, Illinois-based John Deere has been in business for 187 years, yet it has become a regular at the big tech trade show in Las Vegas as a non-tech company showing off technology, and it is back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech. The message from the company is that there aren’t enough skilled farm laborers to do the work that its customers need. It’s been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will come this fall and after that. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually, and the agricultural workforce continues to shrink. (This is my hint to the anti-immigration crowd). John Deere’s autonomous 9RX Tractor. Farmers can oversee it using an app. While each of these industries experiences its own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can’t find labor to fill open positions, he said. “They have to figure out how to do

Read More »

2025 playbook for enterprise AI success, from agents to evals

2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. This has made it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize for their AI strategy this year. 1. Agents: the next generation of automation AI agents are no longer theoretical. In 2025, they’re indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs. At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved. “Let me put it this way,” said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for companies, and that recently reviewed the 48 agents it built last year. “Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better.” Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail. Models are getting better and hallucinating less, and they’re also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we’ll cover below), companies can use three or more models to

Read More »

OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams’ advanced capabilities in two areas: multi-step reinforcement and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability and safety of AI models in these two techniques and more. The first paper, “OpenAI’s Approach to External Red Teaming for AI Models and Systems,” reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model because in-house testing techniques may have missed them. In the second paper, “Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning,” OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks. Going all-in on red teaming pays practical, competitive dividends It’s encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, OpenAI, and even the U.S.’s National Institute of Standards and Technology (NIST), which all had released red teaming frameworks. Investing heavily in red teaming yields tangible benefits for security leaders in any organization. OpenAI’s paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see if knowledgeable external teams can defeat models’ security perimeters and find gaps in their security, biases and controls that prompt-based testing couldn’t find.
What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle

Read More »