GPT-5.5 at the Top of the AI Rankings: A Triumph with a Troubling Catch

Technologies
BB.LV
Publication date: 24.04.2026 09:45

OpenAI's latest model, GPT-5.5, has confidently topped the Intelligence Index compiled by Artificial Analysis, outperforming its closest competitors by three points and breaking the tie among the leading AI developers. Its leadership, however, is overshadowed by one very concerning figure.

Artificial Analysis awarded GPT-5.5 first place in its prestigious Intelligence Index, where the new OpenAI model's three-point margin broke the previous tie between giants OpenAI, Anthropic, and Google.

Experts at Artificial Analysis received exclusive early access to the model, allowing them to thoroughly test all five levels of its reasoning: xhigh, high, medium, low, and non-reasoning. However, despite the impressive results, the report contains a significant caveat.

On the AA-Omniscience benchmark, which assesses factual knowledge and the tendency to "hallucinate," the GPT-5.5 xhigh version demonstrated the best accuracy, answering 57% of the exceptionally difficult questions correctly, an outstanding figure.

However, the hallucination rate for this model was shockingly high — 86%. In comparison, Claude Opus 4.7 max had a rate of only 36%, while Gemini 3.1 Pro Preview had 50%.

It is important to understand that 86% does not mean GPT-5.5 "hallucinates" in most of its answers. Under the Artificial Analysis methodology, the "hallucination rate" is the percentage of incorrect answers among all the cases where the model failed to provide a fully correct answer.

This includes cases where the AI made a mistake, answered only partially, or outright refused to answer. Simply put, this metric vividly demonstrates how often the model prefers to confidently err rather than honestly admit its lack of knowledge.
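The metric described above can be sketched in a few lines of code. The grading labels ("correct", "incorrect", "refused") are illustrative assumptions; the article does not give Artificial Analysis's exact grading categories. Note how a model can be 57% accurate and still show an 86% hallucination rate, matching the figures reported for GPT-5.5 xhigh:

```python
def hallucination_rate(grades):
    """Share of incorrect answers among all answers that were not
    fully correct (i.e., incorrect, partial, or refused).
    Grade labels are hypothetical, for illustration only."""
    not_correct = [g for g in grades if g != "correct"]
    if not not_correct:
        return 0.0
    incorrect = sum(1 for g in not_correct if g == "incorrect")
    return incorrect / len(not_correct)

# Example: out of 100 questions, 57 correct, 37 wrong, 6 refusals.
# Accuracy is 57%, yet the hallucination rate is 37/43, about 86%,
# because refusals barely dilute the pool of confident mistakes.
grades = ["correct"] * 57 + ["incorrect"] * 37 + ["refused"] * 6
print(round(hallucination_rate(grades) * 100))  # → 86
```

The example shows why the two numbers are not contradictory: the denominator counts only non-correct answers, so a model that rarely refuses will have almost all of its failures counted as hallucinations.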

The AA-Omniscience benchmark was specifically designed to identify and measure this critical issue. The test includes 6,000 questions covering 42 topics from six broad areas of knowledge.

Among them are: business, humanities and social sciences, health, law, software engineering, as well as science, technology, and mathematics. The models answer these questions without access to search or any external tools.

The scoring system rewards only correct answers while penalizing incorrect ones. Importantly, the model is not penalized for refusing to answer when it is unsure of its knowledge.

OpenAI itself states in its System Card that GPT-5.5 has become noticeably more accurate than GPT-5.4. This was observed on a sample of ChatGPT dialogues that users had previously flagged as containing factual errors.
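The scoring scheme described above can be sketched as follows. The exact point values (+1 for a correct answer, -1 for an incorrect one, 0 for a refusal) are assumptions; the article only states that correct answers are rewarded, wrong ones penalized, and refusals left unpunished:

```python
def omniscience_score(grades):
    """Assumed scoring: reward correct answers, penalize incorrect
    ones, and leave refusals unpunished. Point values are illustrative."""
    points = {"correct": 1, "incorrect": -1, "refused": 0}
    return sum(points[g] for g in grades)

# A model that always guesses can score worse than one that abstains
# when unsure, even with higher raw accuracy:
print(omniscience_score(["correct"] * 57 + ["incorrect"] * 43))  # → 14
print(omniscience_score(["correct"] * 50 + ["refused"] * 50))    # → 50
```

Under such a scheme, honest abstention is the rational strategy for questions the model cannot answer, which is exactly the calibration behavior the benchmark is designed to reward.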

In these specific cases, individual statements were correct 23% more often, and the number of factual errors in responses decreased by 3%. However, OpenAI emphasizes that this is not a representative sample of all traffic, but rather specially selected scenarios that are particularly challenging for factual accuracy.

As a result, we are presented with a rather paradoxical picture. GPT-5.5, according to the independent ranking, appears to be a powerful universal model that truly surpasses competitors in knowledge according to AA-Omniscience.

However, it demonstrates a poorer ability to calibrate its own confidence in answers. For critically important tasks such as fact-checking, research, and preparing legal or medical documents, this aspect may be just as significant as its overall score in the ranking. A convincing but potentially erroneous answer from such a strong model still requires thorough verification. This is especially relevant if the AI operates without access to external sources and tools.
