Is xAI’s Grok 3 Claim Under Fire? OpenAI Challenges Benchmark Results

AI industry faces credibility crisis as benchmark reporting methods come under scrutiny

The tech industry loves gossip more than any soap opera, and the latest gossip is about AI benchmarks. Benchmarks double as advertisements for performance: they are the statistics everyone consults when facing a binary choice about a technology. Adopt it or pass on it, yes or no. Laugh if you like, but much of the tech industry makes adoption decisions on the strength of these numbers.

This week, two frontrunners are at the centre of this heated debate: Elon Musk’s xAI and employees of OpenAI. The dispute is over benchmark reporting, and it exposes deeper issues in how AI companies present their achievements to the public. OpenAI employees accuse xAI of publishing misleading benchmark results for its latest AI model, Grok 3. Igor Babuschkin, one of xAI’s co-founders, denies the claims and insists his company’s numbers are right. The truth lies somewhere among three critical issues plaguing the AI industry: the benchmark transparency problem, increasingly aggressive marketing tactics, and the hidden costs behind performance metrics.

Benchmark Transparency Problem

Benchmarks are standardized sets of tasks used to compare and evaluate the performance of AI models on specific problems. They measure skill, not intelligence, yet everyone reads them as intelligence markers, which they are not. In a recent blog post, xAI published graphs showing Grok 3’s performance on the American Invitational Mathematics Examination (AIME) 2025, a benchmark consisting of 30 challenging math questions from the exam mathletes take while competing for a place on the US team at the Math Olympiad.

According to the data published on xAI’s blog, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, two variants of xAI’s Grok model, beat OpenAI’s best-performing model, o3-mini-high. But OpenAI employees quickly pointed out on X that xAI had not published o3-mini-high’s AIME 2025 score at “cons@64”.

cons@64 is short for “consensus@64”: the model is given 64 attempts at each question, and the answer it produces most frequently across those attempts is taken as its final answer. Consensus reliably lifts a model’s score on any graph, so leaving it out for one model while plotting it for another paints a misleading impression of superiority. Notably, DeepSeek, the latest eye-candy for global investors, has yet to publish its cons@64 results.
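In code terms, cons@64 is a majority vote. Here is a minimal Python sketch of the idea; the toy_model function is an invented stand-in for a single stochastic model call, not any vendor’s actual evaluation harness.

from collections import Counter
import random

def cons_at_k(sample_answer, k=64):
    # consensus@k: draw k answers and keep the most frequent one
    answers = [sample_answer() for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

random.seed(0)

def toy_model():
    # invented stand-in: answers "42" 40% of the time, guesses otherwise
    return random.choices(["42", "17", "5"], weights=[40, 30, 30])[0]

print(toy_model())           # one attempt, which is how "@1" grades a question
print(cons_at_k(toy_model))  # very likely "42": consensus smooths out the noise

Because consensus averages away a model’s noise, the same model almost always scores higher at cons@64 than at “@1”, which is exactly why plotting one model’s consensus score next to another’s single-attempt score skews the comparison.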

Transparency Crisis in AI Benchmarking

Every time a company releases its metrics, cherry-picking season opens. Without standardized reporting methods, the same cat-and-mouse loop will repeat, and the debate will stay centered on selective benchmark reporting.

This latest dispute merely highlights a problem the industry has faced for a long time: anyone can design a benchmark, run their own tests against it, and publish the results, and the same argument plays out every time. The industry needs a way out of this crisis, and the time has come to build a meaningful, standardized mechanism for reporting AI benchmarks to investors, researchers, and the general public.

Marketing Wars between AI Giants

AI has entered its marketing phase. The current battle is over the AIME 2025 reasoning score at “@1”, the score a model earns on its first attempt at each problem. Both Grok 3 variants post “@1” scores, but they fall below OpenAI’s o3-mini-high’s. Yet xAI is marketing Grok 3 as the world’s smartest AI model. Only a comparison that reports every model’s cons@64 score alongside its “@1” score can give a clear picture of this intensifying competition.
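To see why mixing the two metrics can flip a leaderboard, consider the hypothetical Python simulation below; the answer distributions are invented for illustration and do not correspond to Grok 3, o3-mini-high, or any real model.

import random
from collections import Counter

random.seed(0)
CORRECT = "42"

# Invented answer distributions for one AIME-style question.
# Model A is right less often, but its wrong answers are scattered.
# Model B is right more often, but its errors pile onto one wrong answer.
MODEL_A = (["42", "17", "5", "9", "13"], [40, 15, 15, 15, 15])
MODEL_B = (["42", "17"], [45, 55])

def sample(model):
    answers, weights = model
    return random.choices(answers, weights=weights)[0]

def score_at_1(model, trials=1000):
    # fraction of single attempts that are correct, i.e. the "@1" score
    return sum(sample(model) == CORRECT for _ in range(trials)) / trials

def score_cons_64(model, trials=200, k=64):
    # fraction of trials where the majority answer over k samples is correct
    hits = sum(
        Counter(sample(model) for _ in range(k)).most_common(1)[0][0] == CORRECT
        for _ in range(trials)
    )
    return hits / trials

for name, model in [("A", MODEL_A), ("B", MODEL_B)]:
    print(name, f"@1={score_at_1(model):.2f}", f"cons@64={score_cons_64(model):.2f}")

In this toy setup, Model B wins at “@1” while Model A wins at cons@64, so a chart that reports different metrics for different models can crown whichever contender its author prefers.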

Hidden Cost Behind AI Performance

The most concerning part is that the computational and monetary costs behind these benchmark scores remain hidden. We still do not know what it cost each model to achieve its best result. This cements the case that the tech industry now needs standardized reporting requirements, similar to fuel-efficiency standards in the automotive industry. Until such mechanisms are in place, the confusion will only grow.
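As a back-of-the-envelope illustration of the scale involved: sampling every question 64 times multiplies the inference bill by 64. Every number in the sketch below is an assumption invented for the example, not any model’s real token usage or pricing.

# Hypothetical cost of one AIME 2025 run at "@1" versus cons@64.
QUESTIONS = 30               # AIME 2025 has 30 problems
TOKENS_PER_ATTEMPT = 8_000   # assumed average tokens per reasoning attempt
PRICE_PER_M_TOKENS = 10.0    # assumed dollars per million output tokens

def run_cost(samples_per_question):
    tokens = QUESTIONS * samples_per_question * TOKENS_PER_ATTEMPT
    return tokens / 1_000_000 * PRICE_PER_M_TOKENS

print(f"@1 run:      ${run_cost(1):,.2f}")   # 30 generations
print(f"cons@64 run: ${run_cost(64):,.2f}")  # 1,920 generations, 64x the cost

Under these assumed numbers, a single-attempt run costs $2.40 while a cons@64 run costs $153.60, and none of that multiplier shows up in the headline bar chart.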

Conclusion

One motto is worth embracing: look behind the headline claims. Asking more questions about what it took to achieve a result should become the norm. We must understand what each benchmark measures, and what it omits, if comparisons are to be meaningful and skepticism is to stay healthy rather than total. The xAI and OpenAI rivalry should be a turning point, one that pushes us to look beyond headline technical achievements when choosing the best AI model.
