Receive daily AI-curated summaries of engineering articles from top tech companies worldwide.
Endigest AI Core Summary
Google introduces the FACTS Benchmark Suite, a comprehensive evaluation framework for measuring factuality of large language models across four distinct benchmarks.
• Parametric Benchmark: 2,104 items (public + private) testing models' ability to answer trivia-style factual questions using internal knowledge without external tools
• Search Benchmark: 1,884 items testing models' ability to use a shared web search tool to retrieve and synthesize information, often requiring multi-step retrieval
• Multimodal Benchmark: 1,522 items evaluating factually accurate text generation in response to image-based questions requiring visual grounding and parametric knowledge
• Grounding Benchmark v2: an updated version of the original benchmark testing context-grounded answering
Gemini 3 Pro achieved the highest FACTS Score of 68.8%, reducing error rates by 55% on Search and 35% on Parametric compared to Gemini 2.5 Pro; all models scored below 70%.
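A note on what "reducing error rates by 55%" means: the reduction is relative, so it measures how much of the remaining error was eliminated, not a drop in percentage points. A minimal sketch, assuming error rate is simply 1 minus the benchmark score (the article does not specify the exact sub-scores, so the numbers below are hypothetical):

```python
def relative_error_reduction(old_score: float, new_score: float) -> float:
    """Relative drop in error rate, treating error as 1 - score."""
    old_err = 1.0 - old_score
    new_err = 1.0 - new_score
    return (old_err - new_err) / old_err

# Hypothetical scores for illustration (not from the article):
# a model improving from 0.60 to 0.82 cuts its error rate
# from 0.40 to 0.18 -- a 55% relative reduction.
print(round(relative_error_reduction(0.60, 0.82), 2))  # → 0.55
```

Under this reading, a 55% reduction can correspond to a much smaller gain in raw accuracy, which is why all models can still score below 70% overall.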
This summary was automatically generated by AI based on the original article and may not be fully accurate.