Large language models (LLMs) are increasingly a primary source of information across diverse use cases, so it’s important that their responses are factually accurate.
To keep improving performance on this industry-wide challenge, we need to better understand the types of use cases where models struggle to respond accurately, and to better measure factuality in those areas.
The FACTS Benchmark Suite
Today, we’re teaming up with Kaggle to introduce the FACTS Benchmark Suite. It extends our previous work developing the FACTS Grounding Benchmark with three additional factuality benchmarks:
- A Parametric Benchmark that measures a model’s ability to accurately access its internal knowledge in factoid question use cases.
- A Search Benchmark that tests a model’s ability to use Search as a tool to retrieve information and synthesize it correctly.
- A Multimodal Benchmark that tests a model’s ability to answer prompts related to input images in a factually correct manner.
We are also updating the original FACTS Grounding Benchmark with Grounding Benchmark – v2, an extended benchmark that tests a model’s ability to provide answers grounded in the context of a given prompt.
Each benchmark was carefully curated, producing a total of 3,513 examples that we are making publicly available today. As with our previous release, we follow standard industry practice and keep a held-out evaluation set private. The FACTS Benchmark Suite Score (or FACTS Score) is calculated as the average accuracy across the public and private sets of all four benchmarks. Kaggle will manage the FACTS Benchmark Suite, including owning the private held-out sets, testing leading LLMs on the benchmarks, and hosting the results on a public leaderboard. More details about the FACTS evaluation methodology can be found in our tech report.
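To make the aggregation concrete, here is a minimal sketch of how an average-accuracy score of this kind could be computed, assuming equal weighting of every benchmark and split. The function name, benchmark keys, and accuracy values are illustrative placeholders, not part of the official FACTS tooling.

```python
# Illustrative sketch only: aggregates accuracy over the public and private
# splits of four benchmarks by simple averaging. The benchmark names and
# accuracy values below are hypothetical, not official FACTS results.

def facts_style_score(accuracies: dict[str, dict[str, float]]) -> float:
    """Average accuracy over every (benchmark, split) pair."""
    per_split = [
        acc
        for splits in accuracies.values()
        for acc in splits.values()
    ]
    return sum(per_split) / len(per_split)

# Hypothetical per-benchmark accuracies on the public and private splits.
example = {
    "parametric": {"public": 0.71, "private": 0.69},
    "search":     {"public": 0.78, "private": 0.75},
    "multimodal": {"public": 0.64, "private": 0.62},
    "grounding":  {"public": 0.83, "private": 0.81},
}

print(f"FACTS-style score: {facts_style_score(example):.3f}")
```

Under this equal-weighting assumption, no single benchmark or split dominates the final score; a model has to perform well across all four factuality settings to rank highly.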
Benchmark overview
Parametric Benchmark
The FACTS Parametric benchmark assesses a model’s ability to accurately answer factual questions without the aid of external tools like web search. All questions in the benchmark are “trivia-style” questions driven by user interest that can be answered via Wikipedia (a standard source for LLM pretraining). The resulting benchmark consists of a 1,052-item public set and a 1,052-item private set.