Voices Clash in Public Over Controversial AI Benchmark Results from Elon Musk’s Company, xAI

Published: 22 Feb 2025

The world of artificial intelligence is being rocked by a public dispute between prominent AI firms over how benchmark results are reported.

The usually insular world of artificial intelligence is getting a jolt of transparency, with a dramatic public dispute over AI performance benchmarks. At the center of the dispute are two leading firms: Elon Musk’s xAI and industry giant OpenAI. The conflict concerns the benchmark results xAI published for its latest AI model, Grok 3. An OpenAI employee has publicly accused xAI of presenting those results in a misleading way, casting doubt on xAI’s integrity.

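xAI had published a graph showing Grok 3 outperforming OpenAI’s best currently available model, o3-mini-high, on AIME 2025, a set of competition math problems widely used as an AI benchmark. But the devil, as they say, is in the details. Critics have said xAI’s graph omitted o3-mini-high’s AIME 2025 score at “cons@64”, short for “consensus@64”: a setting that gives a model 64 attempts at each problem and takes its most frequent answer as the final one. Because that setting tends to raise scores considerably, leaving it out of a comparison can make one model appear to beat another when it does not.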
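To see why the scoring setting matters, consider a minimal sketch of the two ways a benchmark like this can be scored: on a model's first attempt only ("@1"), or on the most common answer across several attempts ("cons@k"). The function names, the answer key, and the sampled answers below are all hypothetical, invented purely for illustration; they are not xAI's or OpenAI's evaluation code or data, and the example uses 5 attempts per problem instead of 64 to stay readable.

```python
from collections import Counter


def consensus_answer(samples):
    """Return the most frequent answer among the sampled attempts.

    Ties go to the answer that appeared first (Counter ordering, Python 3.7+).
    """
    return Counter(samples).most_common(1)[0][0]


def score_at_1(attempts_per_problem, answer_key):
    """Fraction of problems solved when only the first attempt counts."""
    correct = sum(
        attempts[0] == truth
        for attempts, truth in zip(attempts_per_problem, answer_key)
    )
    return correct / len(answer_key)


def score_cons_at_k(attempts_per_problem, answer_key, k):
    """Fraction of problems solved when the majority answer of k attempts counts."""
    correct = sum(
        consensus_answer(attempts[:k]) == truth
        for attempts, truth in zip(attempts_per_problem, answer_key)
    )
    return correct / len(answer_key)


# Hypothetical data: 4 problems, 5 sampled answers each (not real model output).
answer_key = [204, 17, 588, 70]
attempts = [
    [204, 204, 110, 204, 204],  # first attempt right, majority right
    [23, 17, 17, 17, 40],       # first attempt wrong, majority right
    [588, 12, 588, 588, 9],     # first attempt right, majority right
    [81, 81, 70, 81, 70],       # first attempt wrong, majority wrong
]

print(f"@1:     {score_at_1(attempts, answer_key):.2f}")          # @1:     0.50
print(f"cons@5: {score_cons_at_k(attempts, answer_key, 5):.2f}")  # cons@5: 0.75
```

In this toy case the same set of sampled answers scores 50% or 75% depending only on the scoring rule, which is why a chart that mixes the two settings, or omits one of them, can change who appears to be ahead.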

Defending his company, xAI co-founder Igor Babuschkin shot back at critics, alleging that OpenAI has published similarly ambiguous benchmark charts in the past. He maintained that xAI’s aim was to show Grok 3’s potential and achievements, not to mislead.

This saga has highlighted a vital aspect of the AI industry: benchmarks. Clear, consistent and honest benchmark data is crucial for gauging AI capability, potential, and progress. In this era of rapid AI advancement, it’s imperative that tech firms stay transparent, both to advance the technology and to maintain public trust. This dispute may spark a renewed push for universal benchmark standards in the AI industry. Only time will tell whether it leads to any reform.

• Did xAI lie about Grok 3’s benchmarks? techcrunch.com, 22-02-2025