
News | 10 Sep 2025
AI benchmarking: Nine challenges and a way forward

A recent JRC paper explores AI benchmarks, which are considered an essential tool for evaluating the performance, capabilities and risks of AI models. Through a comprehensive literature review, the paper identifies key shortcomings of AI benchmarking, as well as policy approaches that could mitigate them.

Benchmarks are a common approach to evaluating the performance of software and hardware systems by comparing them against a standard or reference point. In AI development, they are used to facilitate cross-model comparisons, measure performance and track model progress, and they have emerged as essential evaluation tools for AI developers and regulators alike.
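To make the idea concrete, here is a minimal, illustrative sketch (not drawn from the paper) of how a static benchmark scores a model: each prompt is run through the model and the answer is compared against a reference, yielding an aggregate accuracy. The `ask_model` callable is a hypothetical stand-in for any model API.

```python
# Minimal illustration of a static benchmark: run each prompt through
# the model, compare against a reference answer, report accuracy.
from typing import Callable

def run_benchmark(ask_model: Callable[[str], str],
                  items: list[tuple[str, str]]) -> float:
    """Return exact-match accuracy over (prompt, reference) pairs."""
    correct = 0
    for prompt, reference in items:
        prediction = ask_model(prompt).strip().lower()
        if prediction == reference.strip().lower():
            correct += 1
    return correct / len(items)

if __name__ == "__main__":
    demo_items = [("What is 2 + 2?", "4"), ("Capital of France?", "Paris")]
    dummy_model = lambda prompt: "4"   # trivial placeholder "model"
    print(f"Accuracy: {run_benchmark(dummy_model, demo_items):.2f}")
```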

However, as the influence of AI benchmarks grows, concerns have been raised about their limitations and side effects, particularly when they are used to assess sensitive topics such as high-impact capabilities, safety and systemic risks.

In a paper to be presented at AIES 2025,¹ JRC researchers carried out an interdisciplinary meta-review of approximately 110 studies to identify key shortcomings in AI benchmarking practices. Focusing on software-oriented benchmarks executed without direct human intervention, they identified a range of limitations, from issues in the design and application of individual benchmarks to broader sociotechnical issues and systemic flaws. They also considered policy approaches that could help mitigate the challenges created by current benchmarking practices.

Key benchmarking issues

The paper presents a taxonomy of nine reasons to be cautious in the use of AI benchmarks, identified through the meta-review. 

Figure: Proposed categorisation of current interlinked AI benchmarking issues (© European Commission).

One of the issues relates to construct validity and epistemological claims. Many benchmarks fail to measure what they claim to measure, and some lack a clear definition of what they are attempting to assess, which makes it impossible to judge whether they succeed. For concepts such as “fairness” and “bias”, it is particularly difficult to establish a clear and stable ground truth, and benchmarks that claim to evaluate them may provide a false sense of certainty.

The paper also points out that the roots of benchmark tests are often commercial. Benchmarks used to showcase AI capabilities to a customer audience may discourage thorough self-critique. So-called “SOTA-chasing”, or the “benchmark effect”, has encouraged a competitive culture in which benchmark scores may be valued more highly than the deeper insights and evaluations they were originally intended to foster.

The rapid development of AI is also a challenge, as benchmarks struggle to keep up with the increasing capabilities of AI models. In some cases, models achieve such high accuracy scores that the benchmark is rendered ineffective, and the slow implementation of benchmark frameworks can make it challenging to flag AI model risks in a timely way.
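As a rough, back-of-the-envelope illustration (not from the paper) of why saturated benchmarks stop being informative: once top models score near the ceiling, the remaining headroom becomes comparable to the statistical noise of the benchmark itself. The sketch below assumes a hypothetical 1,000-item benchmark and uses a simple binomial standard error.

```python
# Illustration of benchmark saturation: near the ceiling, score
# differences between models fall within measurement noise.
import math

def score_std_error(accuracy: float, n_items: int) -> float:
    """Binomial standard error of an accuracy estimate over n_items questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)

n_items = 1000                      # hypothetical benchmark size
for acc in (0.97, 0.98, 0.99):
    half_width = 1.96 * score_std_error(acc, n_items)
    print(f"accuracy={acc:.2f}  ±{half_width:.3f} (95% interval)")

# With 1,000 items, 97% vs 98% differ by one percentage point,
# roughly the half-width of a single model's confidence interval.
```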

The paper provides more detail on these and the five other issues identified. The authors also recommend ways to mitigate the challenges: for benchmarks to be relied upon, they need to:

  • be well-documented and transparent;
  • include clearly defined tasks, metrics, and performance evaluation mechanisms to prevent capabilities misrepresentation;
  • evaluate diversity and inclusivity in benchmark design, accounting for various perspectives and cultural contexts;
  • target multimodal and real-world capabilities;
  • continuously assess potential misuse while integrating dynamic benchmarks to prevent gaming, sandbagging and data contamination (a simple contamination check is sketched after this list);
  • establish rigorous evaluation protocols to validate and update benchmark results in line with rapid model improvements;
  • evaluate errors and unintended consequences alongside performance and capabilities.
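As one hedged illustration of the kind of check implied by the contamination point above (not a method from the paper), the sketch below flags benchmark items whose 8-gram word sequences also occur in a sample of training documents. Real contamination audits are considerably more involved; this only conveys the basic idea.

```python
# Simple data-contamination check: flag benchmark items sharing an
# n-gram of words with a sample of the training corpus.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(benchmark_items: list[str],
                      training_sample: list[str],
                      n: int = 8) -> list[str]:
    """Return benchmark items that share at least one n-gram with the sample."""
    corpus_ngrams: set[tuple[str, ...]] = set()
    for doc in training_sample:
        corpus_ngrams |= ngrams(doc, n)
    return [item for item in benchmark_items
            if ngrams(item, n) & corpus_ngrams]
```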

The paper concludes that an important task for policymakers going forward will be to help assess which benchmarks can be trusted, based on the conditions outlined. This will support the uptake of trustworthy AI across Europe, a mission to which the JRC is committed.

  1. See here for a pre-print of the full paper, or here for a summary paper presented at the ICML TAIG Workshop in July 2025.