How to find the smartest AI

The Economist | 26 Aug 2025

Developers are building fiendish tests. Only the best models can pass them.

The dizzying jumble of letters on one of Jonathan Roberts's visual-reasoning questions looks like a word search compiled by a sadist. Test-takers are given the task not only of finding the hidden words in the image, but of spotting a question written in the shape of a star and then answering it (see below). The point of Mr Roberts's bundle of a hundred questions is not to help people pass the time on the train. Rather, it is to give the latest artificial-intelligence (AI) models, such as o3-pro, OpenAI's top-of-the-line release from June, a test worthy of their abilities.

There is no shortage of tests for AI models. Some seek to measure general knowledge; others are subject-specific. Still others aim to judge everything from puzzle-solving and creativity to conversational ability. But not all of these so-called benchmarks measure what they claim to. Many were compiled in a hurry, with flaws and omissions; leaked into AI models' training data, making them easy to game; or were simply too easy for today's "frontier" systems.

ZeroBench, the challenge launched by Mr Roberts and his colleagues at the University of Cambridge, is a prominent alternative. It is aimed at large multimodal models (systems that can take images as well as text as inputs) and is intended to be easy(ish) for the typical person and impossible for current models. For now, no large language model (LLM) can score a single point. If one day a model does better, that will be quite an achievement.

ZeroBench is not alone. EnigmaEval is a collection of more than a thousand multimodal puzzles compiled by Scale AI, an AI-data firm. Unlike ZeroBench, EnigmaEval makes no attempt to be easy for anyone. The puzzles, drawn from a variety of existing online quiz resources, start at the difficulty of a fiendish cryptic crossword and get harder from there. When advanced AI systems are pitted against the hardest of these problems, their median score is zero. A frontier model from Anthropic, an AI laboratory, is the only one to have got a single one of these questions right.

Other question sets try to probe more specific abilities. METR, an AI-safety group, for example, tracks how long humans would take to complete the individual tasks that AI models are now capable of (a model from Anthropic is the first to break the hour mark). Another benchmark, grandly named "Humanity's Last Exam", tests knowledge, rather than intelligence, with questions from the frontier of human knowledge sourced from nearly a thousand academic experts.

One reason for the new tests is a desire to avoid the mistakes of the past. Older benchmarks are riddled with sloppy phrasing, incorrect answers or unfair questions. ImageNet, an early image-recognition dataset, is a notorious example: a model that describes a photo of a mirror in which fruit is reflected is penalised for saying that the picture is of a mirror, but rewarded for identifying a banana. And it is impossible to ask models to sit corrected versions of these tests without compromising researchers' ability to compare them with models that took the flawed versions. Newer tests, produced in an era in which AI research is flush with resources, can be scrutinised for such errors before release.
The second reason for the rush to devise new tests is that models have learned the old ones. It has proved hard to keep any widely used benchmark out of the data that labs use to train their models, leading to systems that perform better on the tests than they do on everyday tasks.

The third, and most urgent, issue motivating the creation of new tests is saturation, whereby models come close to scoring full marks. On one selection of 500 high-school-level maths problems, for example, o3-pro is likely to achieve a near-perfect score. But because o1-mini, released nine months earlier, already scored 98.9%, the results give observers little real sense of progress in the field.

This is where ZeroBench and its peers come in. Each tries to probe abilities at a level that approaches, or exceeds, the limits of what people can manage. Humanity's Last Exam, for instance, set out to devise intimidatingly hard general-knowledge questions (its name comes from its status as the hardest such test it is possible to set), asking for anything from the number of tendons supported by a particular hummingbird bone to a translation of a piece of Palmyrene script found on a Roman tombstone. In a future where many AI models can score full marks on such a test, benchmark-builders may have to move away from knowledge-based questions altogether.

But even evaluations that are supposed to stand the test of time can be overturned overnight. ARC-AGI, a non-verbal reasoning quiz, was introduced in 2024 with the aim of being difficult. Within six months OpenAI announced a model, o3, that achieved 91.5%.

For some AI developers, existing benchmarks miss the point. OpenAI's boss, Sam Altman, alluded to the difficulty of quantifying the ineffable when the firm released its GPT-4.5 model in February. The system "won't crush benchmarks", he tweeted. Instead, he added, before posting a short story the model had written: "there is a magic to it I haven't felt before."

Some try to quantify the magic anyway. Chatbot Arena, for example, lets users hold blind conversations with pairs of LLMs before asking them to choose which is "better", however they define the term. Models that win the most match-ups float to the top of the leaderboard. This less rigid approach seems to capture some of the ineffable "magic" that other ranking systems cannot. It can also be gamed, however, with more sycophantic models climbing higher by flattering their human users.
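To make the mechanism concrete, here is a minimal sketch of how pairwise "which answer is better?" votes can be turned into a ranking using an Elo-style update, the scheme long used in chess and commonly applied to chatbot leaderboards. It is illustrative only: the model names, starting rating and K-factor are invented, and it is not Chatbot Arena's actual code or methodology.

```python
from collections import defaultdict

K = 32              # step size: how much a single vote can move a rating (assumed value)
BASE_RATING = 1000  # every model starts here (assumed value)

ratings = defaultdict(lambda: float(BASE_RATING))

def expected_score(r_a: float, r_b: float) -> float:
    """Probability, under the Elo model, that a model rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(winner: str, loser: str) -> None:
    """Update both ratings after one blind pairwise comparison."""
    e_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e_win)  # an unexpected win moves the rating more
    ratings[loser] -= K * (1.0 - e_win)

# Hypothetical votes: each pair is (winner, loser) from one blind side-by-side chat
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]
for winner, loser in votes:
    record_vote(winner, loser)

# Print the resulting leaderboard, best-rated first
for name, rating in sorted(ratings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rating:.0f}")
```

The appeal of rating updates like this, rather than a raw tally of wins, is that beating a strong opponent lifts a model further up the table than beating a weak one.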
Others, echoing an argument familiar to anyone with school-age children, question whether any test can reveal much about an AI model beyond how good it is at passing tests. Simon Willison, an independent AI researcher in California, encourages users to note the queries that existing AI systems fail at and then put them to those systems' successors. That way, users can choose models that perform well on the tasks that matter to them, rather than chasing a high score on a benchmark ill-suited to their needs.

All of this assumes that AI models are giving the tests their best shot. Sandbagging, in which models deliberately underperform on tests to conceal their true capabilities (and so avoid, for example, being shut down), has been observed in a growing number of models. In a report published in May, researchers at MATS, an AI-safety group, found that top LLMs could identify when they were being tested almost as well as the researchers themselves could. That, too, complicates the search for trustworthy benchmarks.

Even so, the value to AI businesses of simple rankings that their products can top means the race to build better benchmarks will continue. ARC-AGI-2 was released in March and still stumps today's top systems. But, aware of how quickly that can change, work on ARC-AGI-3 has already begun.