LettuceDetect: A Hallucination Detection Framework for RAG Applications


Originally published on HuggingFace

TL;DR

We present LettuceDetect, a lightweight hallucination detector for Retrieval-Augmented Generation (RAG) pipelines. It is an encoder-based model built on ModernBERT, released under the MIT license with ready-to-use Python packages and pretrained models.

  • What: LettuceDetect is a token-level detector that flags unsupported segments in LLM answers. đŸ„Ź
  • How: Trained on RAGTruth (18k examples), leveraging ModernBERT for context lengths up to 4k tokens. 🚀
  • Why: It addresses (1) the context-window limits in prior encoder-only models, and (2) the high compute costs of LLM-based detectors. ⚖
  • Highlights:
    • Beats prior encoder-based models (e.g., Luna) on RAGTruth. ✅
    • Surpasses fine-tuned Llama-2-13B [2] at a fraction of the size, and is highly efficient at inference. âšĄïž
    • Entirely open-source with an MIT license. 🔓

LettuceDetect keeps your RAG framework fresh by spotting rotten parts of your LLM’s outputs. 😊

Why LettuceDetect?

Large Language Models (LLMs) such as GPT-4 [4], the Llama-3 models [5], and Mistral [6] (among many others) have made considerable advances on NLP tasks. Despite this success, hallucinations remain a key obstacle to deploying LLMs in high-stakes scenarios (such as healthcare or law) [7,8].

Retrieval-Augmented Generation (RAG) attempts to mitigate hallucinations by grounding an LLM’s responses in retrieved documents, providing external knowledge that the model can reference [9]. Yet even though RAG is a powerful way to reduce hallucinations, LLMs still hallucinate in these settings [1]. Hallucinations are content in the output that is nonsensical, factually incorrect, or inconsistent with the retrieved context [8]. Ji et al. [10] categorize hallucinations into:

  • Intrinsic hallucinations: Stemming from the model’s preexisting internal knowledge.
  • Extrinsic hallucinations: Occurring when the answer conflicts with the context or references provided.

While RAG approaches can mitigate intrinsic hallucinations, they are not immune to extrinsic ones. Sun et al. [11] showed that models tend to prioritize their intrinsic knowledge over the external context. Because LLMs remain prone to hallucinations, their use in critical domains such as medicine or law remains risky.

Current solutions for hallucination detection

Current solutions for hallucination detection fall into three broad categories:

  1. Prompt-based detectors: These methods (e.g., RAGAS, Trulens, ARES) typically leverage zero-shot or few-shot prompts to detect hallucinations. They often rely on large LLMs (like GPT-4) and employ strategies such as SelfCheckGPT [12], LM vs. LM [13], or Chainpoll [14]. While often effective, they can be computationally expensive due to repeated LLM calls.
  2. Fine-tuned LLM detectors: Large models (e.g., Llama-2, Llama-3) can be fine-tuned for hallucination detection [1,15]. This can yield high accuracy (as shown by the RAGTruth authors using Llama-2-13B or the RAG-HAT work on Llama-3-8B) but is resource-intensive to train and deploy. Inference costs also tend to be high due to model size and slower speeds.
  3. Encoder-based detectors: Models like Luna [2] rely on a BERT-style encoder (often limited to 512 tokens) for token-level classification. These methods are generally more efficient than running a full LLM at inference but are constrained by short context windows and attention mechanisms optimized for smaller inputs.

ModernBERT for long context

ModernBERT [3] is a drop-in replacement for BERT [16]: a state-of-the-art encoder-only Transformer architecture that incorporates several modern design improvements over the original model. It uses Rotary Positional Embeddings (RoPE) to handle sequences of up to 8,192 tokens, an unpadding optimization to eliminate wasted computation on padding tokens, GeGLU activation layers for enhanced expressiveness, and alternating attention for more efficient attention computation.

LettuceDetect capitalizes on ModernBERT’s extended context window to build a token-level classifier for hallucination detection. This approach sidesteps many limitations of older BERT-based models (e.g., short context bounds) and avoids the inference overhead of large LLM-based detectors. Our experiments show that LettuceDetect outperforms other encoder-based systems while remaining competitive with fine-tuned LLM detectors at a fraction of their computational cost.

Data

RAGTruth is the first large-scale open-source (MIT License) benchmark specifically designed to evaluate hallucination detection in Retrieval-Augmented Generation (RAG) settings. It contains 18,000 annotated examples spanning multiple tasks:

  • Question answering (QA): Sampled from the MS MARCO dataset, where up to three documents are retrieved for each question, and an LLM is prompted to produce an answer.
  • Data-to-text generation: Based on the Yelp Open Dataset, where LLMs generate reviews for sampled businesses.
  • News summarization: Uses random documents from CNN/DailyMail, prompting LLMs to provide summaries.

Each example in RAGTruth includes:

  1. A question and prompt,
  2. One or more reference passages or context documents,
  3. An LLM-generated response,
  4. Human annotations at the span level indicating whether any part of the answer is factually unsupported or contradictory to the reference passages.
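For illustration, one such example could be represented as a Python dict along these lines (the field names are our own shorthand, not RAGTruth's exact schema):

```python
# Shorthand sketch of one RAGTruth-style example; field names are
# illustrative and may differ from the dataset's actual schema.
example = {
    "question": "How to explain quartiles?",
    "references": [
        "Third quartile (Q3), also known as upper quartile, "
        "splits lowest 75% (or highest 25%) of data."
    ],
    "response": "Finally, the third quartile (Q3) splits the highest 75% of the data.",
    "annotations": [
        {
            "span": "highest 75%",       # flagged substring in the response
            "type": "Evident Conflict",  # hallucination type (unused by our method)
        }
    ],
}

# Training only needs the binary signal: does the response contain any
# annotated hallucination span?
hallucinated = len(example["annotations"]) > 0
```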

Below is an illustrative sample from RAGTruth, showing an instance of hallucination:


QUESTION

How to explain quartiles?

REFERENCE

Passage 2:
Second quartile (Q2) which is more commonly known as median splits the data in half (50%).
Median divides the data into a lower half and an upper half.
Third quartile (Q3), also known as upper quartile, splits lowest 75% (or highest 25%) of data.
(…)

RESPONSE

The first quartile (Q1) splits the lowest 25% of the data, while the second quartile (Q2) splits the data into two equal halves, with the median being the middle value of the lower half. Finally, the third quartile (Q3) splits the highest 75% of the data.
(…)

ANNOTATION

  • Span: “highest 75%”
  • Type: Evident Conflict
  • Reason:
    • Original text: “lowest 75% (or highest 25%)”
    • Generative text: “highest 75%”

This example showcases a RAGTruth annotation in which specific spans of the answer conflict with the reference passages, providing both the hallucination type (e.g., Evident Conflict) and a human-readable rationale. Our method uses only the binary annotations and does not use the hallucination-type labels.

Method

Diagram of LettuceDetect
Image by Author

A high-level depiction of LettuceDetect. Here, an example Question, Context, and Answer triplet is processed. First, the text is tokenized, after which LettuceDetect performs token-level classification. Tokens from both the question and context are masked (indicated by the red line in the figure) to exclude them from the loss function. Each token in the answer receives a probability indicating whether it is hallucinated or supported. For span-level detection, we merge consecutive tokens with hallucination probabilities above 0.5 into a single predicted span.


We train ModernBERT-base and ModernBERT-large variants as token-classification models on the RAGTruth dataset. The input to the model is a concatenation of the Context, Question, and Answer segments, with the special tokens [CLS] (prepended to the context) and [SEP] (as segment separators). We limit the sequence length to 4,096 tokens for computational feasibility, though ModernBERT can theoretically handle up to 8,192 tokens.
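The segment layout can be sketched as follows (a simplification for illustration only: in practice the Hugging Face tokenizer inserts the special tokens itself and operates on subwords, not raw strings):

```python
# Illustrative segment order only; the real tokenizer adds [CLS]/[SEP]
# and handles subword splitting internally.
def build_input(context, question, answer):
    return f"[CLS] {context} [SEP] {question} [SEP] {answer}"

text = build_input(
    "Paris is the capital of France.",
    "What is the capital of France?",
    "The capital of France is Paris.",
)
# The answer segment always comes last, so answer tokens sit at the
# end of the sequence, after the second [SEP].
```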

Tokenization and data processing

  • Tokenizer: We employ AutoTokenizer from the Transformers library to handle subword tokenization, inserting [CLS] and [SEP] appropriately.
  • Labeling:
    • Context/question tokens are masked (i.e., assigned a label of -100 in PyTorch) so that they do not contribute to the loss.
    • Each answer token receives a label of 0 (supported) or 1 (hallucinated).
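A minimal sketch of this labeling scheme (our own illustration of the masking convention, not LettuceDetect's actual preprocessing code):

```python
IGNORE_INDEX = -100  # PyTorch's cross-entropy loss skips positions with this label

def build_labels(num_prompt_tokens, answer_labels):
    """Mask every context/question token with -100 so only answer tokens
    contribute to the loss; answer_labels hold 0 (supported) or
    1 (hallucinated) per answer token."""
    return [IGNORE_INDEX] * num_prompt_tokens + list(answer_labels)

labels = build_labels(4, [0, 0, 1, 1, 0])
print(labels)  # [-100, -100, -100, -100, 0, 0, 1, 1, 0]
```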

Model architecture

Our models build on Hugging Face’s AutoModelForTokenClassification, using ModernBERT as the encoder and a classification head on top. Unlike some previous encoder-based approaches (e.g., ones pre-trained on NLI tasks), our method uses only ModernBERT with no additional pretraining stage.

Training configuration

  • Optimizer: AdamW, with a learning rate of 1e-5 and weight decay of 0.01.
  • Hardware: Single NVIDIA A100 GPU.
  • Epochs: 6 total training epochs.
  • Batching:
    • Batch size of 8,
    • Data loading with PyTorch DataLoader (shuffling enabled),
    • Dynamic padding via DataCollatorForTokenClassification to handle variable-length sequences efficiently.

During training, we monitor token-level F1 scores on a validation split, saving checkpoints using the safetensors format. Once training is complete, we upload the best-performing models to Hugging Face for public access.

At inference time, the model outputs a probability of hallucination for each token in the answer. We aggregate consecutive tokens exceeding a 0.5 threshold to produce span-level predictions, indicating exactly which segments of the answer are likely to be hallucinated. The figure above illustrates this workflow.
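The aggregation step described above can be sketched in plain Python (a simplified illustration; the released implementation also maps token indices back to character offsets via the tokenizer):

```python
def merge_spans(token_probs, threshold=0.5):
    """Merge consecutive above-threshold tokens into (start, end, confidence)
    tuples over token indices; confidence is the max token probability."""
    spans, current = [], None
    for i, p in enumerate(token_probs):
        if p > threshold:
            if current is None:
                current = [i, i + 1, p]      # open a new span
            else:
                current[1] = i + 1           # extend the open span
                current[2] = max(current[2], p)
        elif current is not None:
            spans.append(tuple(current))     # close the span
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans

# Tokens 2-4 exceed the 0.5 threshold and merge into one predicted span.
print(merge_spans([0.1, 0.2, 0.9, 0.8, 0.7, 0.3]))  # [(2, 5, 0.9)]
```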

Next, we provide a more detailed evaluation of the model’s performance.

Results

We evaluate our models on the RAGTruth test set across all task types (Question Answering, Data-to-Text, and Summarization). For each example, RAGTruth includes manually annotated spans indicating hallucinated content.

Example-level results

We first assess the example-level question: Does the generated answer contain any hallucination at all? Our large model (lettucedetect-large-v1) attains an overall F1 score of 79.22%, surpassing:

  • GPT-4 (63.4%),
  • Luna (65.4%), the previous state-of-the-art encoder-based model,
  • Fine-tuned Llama-2-13B (78.7%) as presented in the RAGTruth paper.

It is second only to the fine-tuned Llama-3-8B from the RAG-HAT paper [15] (83.9%), but LettuceDetect is significantly smaller and faster to run. Meanwhile, our base model (lettucedetect-base-v1) remains highly competitive while using fewer parameters.

Comparison table illustrating how LettuceDetect aligns against both prompt-based methods (e.g., GPT-4) and alternative encoder-based solutions (e.g., Luna)
Image by Author

Above is a comparison table illustrating how LettuceDetect aligns against both prompt-based methods (e.g., GPT-4) and alternative encoder-based solutions (e.g., Luna). Overall, lettucedetect-large-v1 and lettucedetect-base-v1 deliver strong accuracy while remaining efficient at inference.

Span-level results

Beyond detecting if an answer contains hallucinations, we also examine LettuceDetect’s ability to identify the exact spans of unsupported content. Here, LettuceDetect achieves state-of-the-art results among models that have reported span-level performance, substantially outperforming the fine-tuned Llama-2-13B model from the RAGTruth paper [1] and other baselines.

Image by Author

Most methods, like RAG-HAT [15], do not report span-level metrics, so we do not compare to them here.

Inference efficiency

Both lettucedetect-base-v1 and lettucedetect-large-v1 require fewer parameters than typical LLM-based detectors (e.g., GPT-4 or Llama-3-8B) and can process 30–60 examples per second on a single NVIDIA A100 GPU. This makes them practical for industrial workloads, real-time user-facing systems, and resource-constrained environments.

Overall, these results show that LettuceDetect strikes a good balance: near state-of-the-art accuracy at a fraction of the size and cost of large LLM-based judges, with precise, token-level hallucination detection.

Get going

Install the package:

pip install lettucedetect

Then, you can use the package as follows:

from lettucedetect.models.inference import HallucinationDetector

# For a transformer-based approach:

detector = HallucinationDetector(

    method="transformer", model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1"

)

contexts = ["France is a country in Europe. The capital of France is Paris. The population of France is 67 million.",]

question = "What is the capital of France? What is the population of France?"

answer = "The capital of France is Paris. The population of France is 69 million."

# Get span-level predictions indicating which parts of the answer are considered hallucinated.

predictions = detector.predict(context=contexts, question=question, answer=answer, output_format="spans")

print("Predictions:", predictions)

# Predictions: [{'start': 31, 'end': 71, 'confidence': 0.9944414496421814, 'text': ' The population of France is 69 million.'}]
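Since `start` and `end` are character offsets into the answer string, the flagged text can be recovered with a plain slice (the confidence value below is illustrative):

```python
# Offsets index into the answer string from the quickstart example above.
answer = "The capital of France is Paris. The population of France is 69 million."
prediction = {"start": 31, "end": 71, "confidence": 0.99}  # illustrative values
flagged = answer[prediction["start"]:prediction["end"]]
print(flagged)  # -> " The population of France is 69 million."
```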

Conclusion

We introduced LettuceDetect, a lightweight and efficient framework for hallucination detection in RAG systems. By utilizing ModernBERT’s extended context capabilities, our models achieve strong performance on the RAGTruth benchmark while retaining high inference efficiency. This work lays the groundwork for future research directions, such as expanding to additional datasets, supporting multiple languages, and exploring more advanced architectures. Even at this stage, LettuceDetect demonstrates that effective hallucination detection can be achieved using lean, purpose-built encoder-based models.

Citation

If you find this work useful, please cite it as follows:

@misc{Kovacs:2025,
      title={LettuceDetect: A Hallucination Detection Framework for RAG Applications}, 
      author={Ádåm Kovåcs and Gåbor Recski},
      year={2025},
      eprint={2502.17125},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.17125}, 
}

Also, if you use our code, please don’t forget to give us a star ⭐ on our GitHub repository.

References

[1] Niu et al., 2024, RAGTruth: A Dataset for Hallucination Detection in Retrieval-Augmented Generation

[2] Belyi et al., 2024, Luna: A Simple and Effective Encoder-Based Model for Hallucination Detection in Retrieval-Augmented Generation

[3] Warner et al., 2024, Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference (ModernBERT)

[4] OpenAI, 2023, GPT-4 Technical Report

[5] Llama Team, AI @ Meta, 2024, The Llama 3 Herd of Models

[6] Jiang et al., 2023, Mistral 7B

[7] Kaddour et al., 2023, Challenges and Applications of Large Language Models

[8] Huang et al., 2025, A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

[9] Gao et al., 2024, Retrieval-Augmented Generation for Large Language Models: A Survey

[10] Ji et al., 2023, Survey of Hallucination in Natural Language Generation

[11] Sun et al., 2025, ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability

[12] Manakul et al., 2023, SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

[13] Cohen et al., 2023, LM vs LM: Detecting Factual Errors via Cross Examination

[14] Friel et al., 2023, Chainpoll: A high efficacy method for LLM hallucination detection

[15] Song et al., 2024, RAG-HAT: A Hallucination-Aware Tuning Pipeline for LLM in Retrieval-Augmented Generation

[16] Devlin et al., 2019, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Shape
Shape
Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy,  bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

Shape

Quinas readies UltraRam, flash memory with DRAM speed

For starters, the memory is built on what is called a” III-V technology,” a class of semiconductor materials that are composed of elements from groups III and V of the periodic table, the company stated. These materials have unique properties that make them ideal for use in electronic devices such

Read More »

7 Wi-Fi certifications to bolster wireless networking skills

Organization: Certified Wireless Network Professionals (CWNP) Price: $149.99 for the CWTS-102 exam How to prepare: CWNP offers several resources to prepare including: a live training class, a self-paced training kit, a study and reference guide, an electronic practice test, an eLearning module, an eLearning bundle, and a test and go

Read More »

Saudi Arabia’s $5.5B Bond Sets Course for Record Issuance

Saudi Arabia sold $5.5 billion of international bonds on Tuesday to help plug its budget deficit, putting it on course for a record year of issuance as it continues to spend heavily on Crown Prince Mohammed bin Salman’s economic-diversification projects. The two-part Sukuk, or Islamic debt, sale was made up of a $2.25 billion five-year note and a $3.25 billion 10-year bond. The shorter tranche priced with a spread of 65 basis points over US Treasuries, while the longer one was sold at 75 basis points. Investors had placed around $17.5 billion of orders, underscoring the strong demand for Saudi debt even as the Gulf nation ramps ups issuance in the face of lower oil prices and high spending that’s squeezing the government’s finances. Saudi Arabia has now sold almost $20 billion in dollar- and euro-denominated debt this year, according to data compiled by Bloomberg. That cements its status as one of the busiest issuers in emerging markets. It is also well above the 2024 full-year tally and within a whisker of the annual record of $21.5 billion set in 2017. The latest sale adds to a pick up in syndicated loan activity and a fresh wave of local bank issuance as Saudi companies, the government and sovereign wealth fund come up with extra financing to back the crown prince’s Vision 2030 agenda. Those financing needs are increasingly pressing, with Brent oil prices down about 8% this year to around $69 a barrel. With prices subdued, the Saudi government projects a fiscal shortfall of about 2.3% of gross domestic product this year. It has widely telegraphed its intention to issue bonds to fill the hole. That’s in addition to measures such as privatizing some assets. While the kingdom’s ratio of debt to GDP is low by global standards at under 30%, the International Monetary Fund sees

Read More »

It’s time for customer-oriented approaches to generator interconnection

Travis Kavulla is vice president of regulatory affairs at NRG Energy and a past president of the National Association of Regulatory Utility Commissioners. Eric Blank is chairman of the Colorado Public Utilities Commission. Hardly a week goes by these days without recriminations arising from a new era of tight supply conditions relative to soaring demand in the power sector. One of the hottest points of contention is the broken process by which new power plants are interconnected to the power grid. To gain grid access, a would-be power plant files an interconnection request with a transmission utility and, having staked that claim, takes a place in a line that has grown and grown: interconnecting projects now wait years for grid access, and the number of projects queued in every market has swelled to many multiples of that market’s total demand. These Gold Rush rules were appropriate at the origins of the industry’s restructuring, when substantial latent capacity in the power grid was being underutilized, or where additional grid capacity could be opened up for only modest incremental cost. The transformational “open access” orders issued by the Federal Energy Regulatory Commission during this era unleashed a wave of investment in power generators by standardizing utility transmission tariffs and making it clear that anyone who wanted to make a go of it in the power-generation business could do so. But like most Gold Rushes, this early bonanza that was meant to quickly tap a resource has ended up in a bureaucratic snarl. The grid today is largely tapped out, yet there are too many stakes in the ground that will never prove out. The messy backlog of the generator interconnection queue that has been left in

Read More »

US DOE Earmarks $35MM to Support Emerging Energy Tech

The United States Department of Energy (DOE) will channel more than $35 million toward developing emerging energy technologies. The DOE said in a media release that the funds will be divided among 42 projects related to grid security, artificial intelligence, nuclear energy, and advanced manufacturing, located at DOE national laboratories, plants, and sites. The selected projects will leverage over $21 million in cost share from private and public partners, bringing total funding to more than $57.5 million, according to DOE. The funds are provided through DOE’s Technology Commercialization Fund (TCF) program, managed through the Office of Technology Commercialization’s Core Laboratory Infrastructure for Market Readiness (CLIMR) Lab Call. The program, according to DOE, strengthens America’s economic and national security by supporting public-private partnerships that maximize taxpayer investments, advance American innovation, and ensure the U.S. stays ahead in global competitiveness. “The Energy Department’s National Labs play an important role in ensuring the United States leads the world in innovation”, DOE Secretary Chris Wright said. “These projects have the potential to accelerate technological breakthroughs that will define the future of science and help secure America’s energy future”. This year’s selections span 19 DOE national labs, plants, and sites, DOE said, highlighting Lawrence Berkeley National Laboratory’s launch of America’s Cradle to Commerce (AC2C), which builds on the Cradle to Commerce (C2C) program and provides wraparound support for lab-to-market innovation. In just 18 months, C2C has proven impact with more than $15M raised by participating startups and five commercial pilots launched, DOE said. Pacific Northwest National Laboratory plans to enhance and broaden the free Visual Intellectual Property Search (VIPS) tool with the VIPS 2.0 project.
The new platform will enable smooth searches across a wide range of National Lab innovations available for licensing or open-source sharing, DOE said. Meanwhile, Argonne National Laboratory

Read More »

California Geothermal Lease Sales Net Over $2.7MM

The U.S. Bureau of Land Management announced, in a statement recently posted on its website, that its geothermal lease sales in California netted over $2.7 million. The Bureau noted in that statement that it accepted winning bids on 13 parcels across 22,685 public acres in Imperial, Lassen, and Modoc counties for $2,711,858 in total receipts for a geothermal lease sale. The Bureau said in the statement that it may issue leases once review and payment are complete. “The sale generated an average of $117 per acre offered, supporting American prosperity by increasing potential for domestic energy production,” the Bureau stated. “For each parcel leased, 50 percent of the bid, rental receipts, and subsequent royalties will go to the state of California, 25 percent will go to the county where the lease is located, and the remaining 25 percent will go to the U.S. Treasury,” it added. “Geothermal lease sales support domestic energy production and American energy independence, while contributing to the nation’s economic and military security,” the Bureau continued. “Consistent with Executive Order 14154, ‘Unleashing American Energy’, the BLM’s geothermal lease sales help meet the energy needs of U.S. citizens and solidify the nation as a global energy leader long into the future and achieve American Energy Dominance,” it went on to state. The Bureau noted in the statement that leasing is the first step in the process to develop federal geothermal resources. The organization added that it ensures geothermal development meets the requirements set forth by the National Environmental Policy Act of 1969 and other applicable legal authorities. In its statement, the Bureau described geothermal as “an abundant resource, especially in the West, where the BLM has authority to manage geothermal resource leasing, exploration, and development on approximately 245 million surface acres of public lands and the 700 million acres

Read More »

Texas Critical Data Centers and Thunderhead Ink Preliminary Power Deal

Texas Critical Data Centers LLC (TCDC), a 50-50 venture between New Era Energy & Digital Inc. and Sharon AI Inc., has signed a non-binding term sheet with Thunderhead Energy Solutions LLC for a natural gas-fired generation facility with a capacity of about 250 megawatts. Thunderhead will fund, construct and operate the facility using a hybrid deployment of reciprocating engines and turbines. The facility will serve “as the energy backbone for TCDC’s high-performance, AI-optimized compute campus”, a joint statement said. The parties expect to start construction of the power facility this year, targeting completion over the next 18 months. Planned to rise in Ector County, Texas, TCDC would be scalable to up to one gigawatt, according to New Era. In July TCDC completed the acquisition of 235 acres from Grow Odessa near the City of Odessa. It has entered into a letter of intent with the same seller for the purchase of an additional 203 contiguous acres. “The agreement with Thunderhead is one more major milestone in our buildout and reinforces our vision of delivering energy-resilient, AI-native infrastructure”, said New Era chief executive E. Will Gray II. “It also ensures TCDC will provide robust, SB6-compliant power to support the next wave of AI growth in West Texas”. This is the first agreement announced by New Era Energy & Digital since rebranding from New Era Helium Inc. to reflect its shift into a vertically integrated energy supplier. The rebranded New Era aims to develop “next-generation digital infrastructure and integrated power assets, including powered land and powered shells”, it said in a statement August 12. “The company delivers turnkey solutions that will enable hyperscale, enterprise and edge operators to accelerate data center deployment, optimize total cost of ownership and future-proof their infrastructure investments”. The Midland, Texas-based company “projects generational AI infrastructure demand will grow exponentially

Read More »

Plains to Become Majority Owner of EPIC Crude Pipeline System

Diamondback Energy Inc. and Kinetik Holdings Inc. have signed agreements to sell each of their 27.5 percent stakes in EPIC Crude Holdings LP to Plains for around $1.57 billion. Plains will become the majority owner with a 55 percent interest in EPIC Crude Holdings, owner of the EPIC Crude Oil Pipeline. Ares Management Corp.’s EPIC Midstream Holdings LP will retain an operating stake of 45 percent. Stretching 800 miles, the pipeline system carries Delaware Basin and Midland Basin supply from locations near Crane, Midland, Orla and Wink, Texas, and Eagle Ford supply from locations near Gardendale and Hobson, Texas. The pipeline system delivers the oil to EPIC Crude Holdings’ 3.4-million-barrel Robstown Terminal near Corpus Christi, according to EPIC Midstream. The pipeline system, which became fully operational in April 2020, has a nameplate capacity of 600,000 barrels per day (bpd), expandable up to one million bpd, and nearly seven million barrels of operational storage, according to EPIC Midstream. The assets boost Plains’ Permian wellhead-to-water strategy, Plains said in a statement on its website, noting the pipeline system is “underpinned by long-term minimum volume commitments from high-quality customers”. “This transaction strengthens our position as the premier crude oil midstream provider, complements our asset footprint and enhances our customer offering”, said Plains chair, chief executive and president Willie Chiang. “The combination of our stake in EPIC Crude Holdings coupled with our existing integrated Permian and Eagle Ford assets enhances our commitment to offering a high level of connectivity and flexibility for our customers. “By further linking our Permian and Eagle Ford gathering systems to Corpus Christi, we are enhancing market access and ensuring our customers have reliable, cost-effective routes to multiple demand centers”. Plains agreed to pay Diamondback and Kinetik an additional $193 million should an expansion of the pipeline system to a

Read More »

SAP data sovereignty service lets customers run cloud workloads inside their data centers

A range of developments, primarily geopolitical in nature, has transformed this outlook. Now, sovereignty is bound up with the growing sense that operational, political, and even technological independence is essential, especially for EU-based enterprises. SAP has embraced this concern. “The digital resilience of Europe depends on sovereignty that is secure, scalable and future-ready,” said Martin Merz, president, SAP Sovereign Cloud. “SAP’s full-stack sovereign cloud offering delivers exactly that, giving customers the freedom to choose their deployment model while helping ensure compliance up to the highest standards.” This reflects the company’s commitment to supporting the EU’s “digital autonomy,” he said. The company has made digital sovereignty a strategic priority, and will invest €20 billion ($23.3 billion) to develop new digital sovereignty products for the EU as well as for other territories. A decade ago, the idea of cloud services promoted the notion of a single global infrastructure market. Now it looks just as likely that there will be a balkanization of global cloud infrastructure into geographical domains. “For decades, enterprises have handed over too much power to their cloud providers – power over infrastructure, power over availability, and most importantly, power over their own data,” commented Garima Kapoor, co-founder and co-CEO of US AI object storage company, MinIO. “CIOs are realizing that outsourcing control to a public cloud provider is no longer an option. The concept of sovereignty is evolving. It’s no longer just a means of maintaining compliance with data regulations but is now viewed as a strategic and architectural imperative for enterprises that want to own their digital destiny,” she said.

Read More »

Alibaba Cloud tweaks software for networking efficiency gains

Alibaba Cloud said that it has been using ZooRoute in AliCloud for the last 18 months, where it has reduced outage time by 92.71%.

Nezha for network performance in high-demand VMs

Another software upgrade is helping Alibaba Cloud maintain network performance for high-demand virtual machines (VMs) without spending more on SmartNIC-accelerated virtual switches (vSwitches). Nezha, a distributed vSwitch load-sharing system, identifies idle SmartNICs and uses them to create a remote resource pool for high-demand virtual NICs (vNICs). Alibaba has tested the system in its data centers for a year and said in the paper that “Nezha effectively resolves vSwitch overloads and removes it as a bottleneck.” With the number of concurrent flows improved by up to 50x, and the number of vNICs by up to 40x, the bottleneck is now the VM kernel stack, the researchers wrote. Forrester’s Dai said that Nezha’s stateless offloading and cluster-wide pooling design is superior to solutions being pursued by rival cloud service providers. Separately, Alibaba’s cloud computing division has also been working on another software update that will enable it to provide better network performance for AI workloads.
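As a rough illustration of the pooling idea described above (all names, thresholds, and data structures here are hypothetical, not Alibaba's actual Nezha implementation), an idle-SmartNIC pool with a least-loaded placement policy might be sketched as:

```python
# Hypothetical sketch of vSwitch load sharing: collect SmartNICs with
# spare capacity into a remote pool, then offload vNICs from an
# overloaded local NIC to the least-loaded pool member.
from dataclasses import dataclass, field

@dataclass
class SmartNic:
    name: str
    load: float                                  # fraction of vSwitch capacity in use
    remote_vnics: list = field(default_factory=list)

def build_idle_pool(nics, idle_threshold=0.3):
    """Collect SmartNICs with spare vSwitch capacity into a remote pool."""
    return [n for n in nics if n.load < idle_threshold]

def offload_vnic(vnic_name, local_nic, pool, overload_threshold=0.9):
    """If the local vSwitch is overloaded, place the vNIC's flow
    processing on the least-loaded NIC in the remote pool."""
    if local_nic.load < overload_threshold or not pool:
        return local_nic                         # no offload needed or possible
    target = min(pool, key=lambda n: n.load)
    target.remote_vnics.append(vnic_name)
    return target

nics = [SmartNic("nic-a", 0.95), SmartNic("nic-b", 0.10), SmartNic("nic-c", 0.25)]
pool = build_idle_pool(nics[1:])
chosen = offload_vnic("vm42-eth0", nics[0], pool)
print(chosen.name)  # prints nic-b, the least-loaded idle NIC
```

The point of the sketch is the division of labor the article describes: placement is a cluster-wide decision over pooled idle capacity rather than a per-host one.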

Read More »

AI networking success requires deep, real-time observability

Most research participants also told us they need to improve visibility into their data center network fabrics and WAN edge connectivity services. (See also: 10 network observability certifications to boost IT operations skills)

The need for real-time data

Observability of AI networks will require many enterprises to optimize how their tools collect network data. For instance, most observability tools rely on SNMP polling to pull metrics from network infrastructure, and these tools typically poll devices at five-minute intervals. Shorter polling intervals can adversely impact network performance and tool performance. Sixty-nine percent of survey participants told EMA that AI networks require real-time infrastructure monitoring that SNMP simply cannot support. Real-time telemetry closes visibility gaps. For instance, AI traffic bursts that create congestion and packet drops may last only seconds, an issue that a five-minute polling interval would miss entirely. To achieve this level of metric granularity, network teams will have to adopt streaming network telemetry. Unfortunately, support for such technology is still uneven among network infrastructure and network observability vendors due to a lack of industry standardization and a perception among vendors that customers simply don’t need it. Well, AI is about to create a lot of demand for it. In parallel to the need for granular infrastructure metrics, 51% of respondents told EMA that they need more real-time network flow monitoring. In general, network flow technologies such as NetFlow and IPFIX can deliver data nearly in real time, with delays of seconds or a couple of minutes depending on the implementation. However, other technologies are less timely. In particular, the VPC flow logs generated by cloud providers do not offer the same data granularity. Network teams may need to turn to real-time packet monitoring to close cloud visibility gaps.
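A toy calculation makes the polling-granularity point concrete. The numbers below are illustrative, not from the EMA study: a 10-second burst on a link is obvious in per-second telemetry but nearly vanishes once SNMP counters are averaged over a five-minute polling interval.

```python
# Toy illustration (not vendor code): why five-minute SNMP polling can
# miss a seconds-long AI traffic burst that per-second telemetry catches.
POLL_INTERVAL = 300   # seconds, a typical SNMP polling interval
BASELINE_GBPS = 10.0
BURST_GBPS = 95.0

# One hour of per-second link utilization with a single 10-second burst.
series = [BASELINE_GBPS] * 3600
for t in range(1800, 1810):
    series[t] = BURST_GBPS

# Streaming telemetry: per-second samples see the burst directly.
streaming_peak = max(series)

# SNMP counters yield an average over each polling interval, so the
# burst is diluted across 300 seconds.
snmp_samples = [
    sum(series[i:i + POLL_INTERVAL]) / POLL_INTERVAL
    for i in range(0, len(series), POLL_INTERVAL)
]
snmp_peak = max(snmp_samples)

print(streaming_peak)        # 95.0 -- the burst is fully visible
print(round(snmp_peak, 2))   # 12.83 -- the burst almost disappears
```

A congestion event that nearly saturates the link registers as a barely perceptible bump in the polled data, which is exactly the visibility gap streaming telemetry is meant to close.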
Smarter analysis for smarter networks

Network teams also need their network

Read More »

Equinix Bets on Nuclear and Fuel Cells to Meet Exploding Data Center Energy Demand

A New Chapter in Data Center Energy Strategy

Equinix’s strategic investments in advanced nuclear and fuel cell technologies mark a pivotal moment in the evolution of data center energy infrastructure. By proactively securing power sources like Oklo’s fast reactors and Radiant’s microreactors, Equinix is not merely adapting to the industry’s growing energy demands but is actively shaping the future of sustainable, resilient power solutions. This forward-thinking approach is mirrored across the tech sector. Google, for instance, has partnered with Kairos Power to develop small modular reactors (SMRs) in Tennessee, aiming to supply power to its data centers by 2030. Similarly, Amazon has committed to deploying 5 gigawatts of nuclear energy through partnerships with Dominion Energy and X-energy, underscoring the industry’s collective shift towards nuclear energy as a viable solution to meet escalating power needs. The urgency of these initiatives is underscored by projections from the U.S. Department of Energy, which anticipates data center electricity demand could rise to 6.7%–12% of total U.S. production by 2028, up from 4.4% in 2023. This surge, primarily driven by AI technologies, is straining existing grid infrastructure and prompting both public and private sectors to explore innovative solutions. Equinix’s approach, i.e., investing in both immediate and long-term energy solutions, sets a precedent for the industry. By integrating fuel cells for near-term needs and committing to advanced nuclear projects for future scalability, Equinix exemplifies a balanced strategy that addresses current challenges while preparing for future demands. As the industry moves forward, the collaboration between data center operators, energy providers, and policymakers will be crucial.
The path to a sustainable, resilient energy future for data centers lies in continued innovation, strategic partnerships, and a shared commitment to meeting the digital economy’s power needs responsibly.

Read More »

Evolving to Meet AI-Era Data Center Power Demands: A Conversation with Rehlko CEO Brian Melka

On the latest episode of the Data Center Frontier Show Podcast, we sat down with Brian Melka, CEO of Rehlko, to explore how the century-old mission-critical power provider is reinventing itself to support the new realities of AI-driven data center growth. Rehlko, formerly known as Kohler Energy, rebranded a year ago but continues to draw on more than a century of experience in power generation and backup systems. Melka emphasized that while the name has changed, the mission has not: delivering reliable, scalable, and flexible energy solutions to support always-on digital infrastructure.

Meeting Surging AI Power Demands

Asked how Rehlko is evolving to support the next wave of data center development, Melka pointed to two major dynamics shaping the market: unprecedented capacity needs driven by AI training and inference, and new, “spiky” usage patterns that strain traditional backup systems. “Power generation is something we’ve been doing longer than anyone else, starting in 1920,” Melka noted. “As we look forward, it’s not just about the scale of backup power required — it’s about responsiveness. AI has very large short-duration power demands that put real strain on traditional systems.” To address this, Rehlko is scaling its production capacity fourfold over the next three to four years, while also leveraging its global in-house EPC (engineering, procurement, construction) capabilities to design and deliver hybrid systems. These combine diesel or gas generation with battery storage and short-duration modulation, creating a more responsive power backbone for AI data centers. “We’re the only ones out there that can deliver that breadth of capability on a full turnkey basis,” Melka said. “It positions us to support customers as they navigate these new patterns of energy demand.”

Speed to Power Becomes a Priority

In today’s market, “speed to power” has become the defining theme. Developers and operators are increasingly considering

Read More »

Data Center Chip Giants Negotiate Political Moves, Tariffs, and Corporate Strategies

And with the current restrictions being placed on US manufacturers selling AI parts to China, reports say NVIDIA is developing a Blackwell-based China chip, more capable than the current H20 but still structured to comply with U.S. export rules. Reuters reported that it would be a single-die design (roughly half the compute of the dual-die B300), with HBM and NVLink, sampling as soon as next month. A second compliant workstation/inference product (RTX6000D) is also in development. Chinese agencies have reportedly discouraged use of NVIDIA H20 in government work, favoring Huawei Ascend. However, there have been reports describing AI training on the Ascend as “challenging”, forcing some AI firms to revert to NVIDIA for large-scale training while using Ascend for inference. This keeps China demand alive for compliant NVIDIA/AMD parts—hence the U.S. interest in revenue-sharing. Meanwhile, AMD made its announcements at June’s “Advancing AI 2025” to set MI350 (CDNA 4) expectations and a yearly rollout rhythm that’s designed to erase NVIDIA’s time lead as much as fight on absolute perf/Watt. If the MI350 systems ramp aligns with major cloud designs in 2026, AMD’s near-term objective is defending MI300X momentum while converting large customers to multi-vendor strategies (often pairing MI clusters with NVIDIA estates for redundancy and price leverage). The 15% China license fee will shape how AMD prices MI-series export SKUs and whether Chinese hyperscalers still prefer them to the domestic alternative (Huawei Ascend), which continues to face software/toolchain challenges. If Chinese buyers balk or Beijing discourages purchases, the revenue-share may be moot; if they don’t, AMD has a path to keep seats warm in China while building MI350 demand elsewhere. Beyond China export licenses, the U.S. and EU recently averted a larger trade war by settling near 15% on certain sectors, which included semiconductors, as opposed to the far more

Read More »

Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn’t the only one that is ramping up its investments in AI-enabled data centers. Rival cloud service providers are all investing in either upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs). In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple to devote a combined $200 billion to capex in 2025, up from $110 billion in 2023. Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft’s capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith’s claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are way higher than Microsoft’s 2020 capital expenditure of “just” $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads. Separately, last October Amazon CEO Andy Jassy said his company planned total capex spend of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.

Read More »

John Deere unveils more autonomous farm machines to address skill labor shortage

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction and commercial landscaping. The Moline, Illinois-based John Deere has been in business for 187 years, yet it’s been a regular at the big tech trade show in Las Vegas as a non-tech company showing off technology, and it’s back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech. The message from the company is that there aren’t enough skilled farm laborers to do the work that its customers need. It’s been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will come this fall and after that. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually, and the agricultural workforce continues to shrink. (This is my hint to the anti-immigration crowd.)

John Deere’s autonomous 9RX Tractor. Farmers can oversee it using an app.

While each of these industries experiences its own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can’t find labor to fill open positions, he said. “They have to figure out how to do

Read More »

2025 playbook for enterprise AI success, from agents to evals

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More 2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. This has made it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize for their AI strategy this year. 1. Agents: the next generation of automation AI agents are no longer theoretical. In 2025, they’re indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs. At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved. “Let me put it this way,” said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for enterprises and recently reviewed the 48 agents it built last year. “Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better.” Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail. Models are getting better and hallucinating less, and they’re also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we’ll cover below), companies can use three or more models to
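The "LLM as a judge" pattern mentioned above can be sketched in a few lines. The judge functions below are stand-in stubs, not real model calls; in practice each would invoke a different model's API and the voting logic would stay the same:

```python
# Hedged sketch of the LLM-as-judge pattern: several cheap models each
# grade a candidate answer, and a majority vote decides the verdict.
from collections import Counter

def majority_judgment(answer: str, judges) -> str:
    """Return the verdict ('pass' or 'fail') most judges agree on."""
    votes = Counter(judge(answer) for judge in judges)
    return votes.most_common(1)[0][0]

# Stub judges standing in for three different models.
judge_a = lambda ans: "pass" if "http" not in ans else "fail"  # flags raw URLs
judge_b = lambda ans: "pass" if len(ans) > 20 else "fail"      # flags terse replies
judge_c = lambda ans: "pass"                                   # lenient baseline

answer = "The refund was issued and will post within five business days."
print(majority_judgment(answer, [judge_a, judge_b, judge_c]))  # prints pass
```

Using three or more judges is what makes cheaper models attractive here: no single model's quirks decide the outcome.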

Read More »

OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams’ advanced capabilities in two areas: multi-step reinforcement and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability and safety of AI models in these two techniques and more. The first paper, “OpenAI’s Approach to External Red Teaming for AI Models and Systems,” reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model because in-house testing techniques may have missed them. In the second paper, “Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning,” OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks. Going all-in on red teaming pays practical, competitive dividends It’s encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, OpenAI, and even the U.S.’s National Institute of Standards and Technology (NIST), which all had released red teaming frameworks. Investing heavily in red teaming yields tangible benefits for security leaders in any organization. OpenAI’s paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see if knowledgeable external teams can defeat models’ security perimeters and find gaps in their security, biases and controls that prompt-based testing couldn’t find. 
What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle

Read More »

The connected customer

In partnership with NiCE

As brands compete for increasingly price-conscious consumers, customer experience (CX) has become a decisive differentiator. Yet many struggle to deliver, constrained

Read More »