If you’re looking for a new reason to be nervous about artificial intelligence, try this: Some of the smartest humans in the world are struggling to create tests that A.I. systems can’t pass.
For years, A.I. systems were measured by giving new models a variety of standardized benchmark tests. Many of these tests consisted of challenging, S.A.T.-caliber problems in areas like math, science and logic. Comparing the models’ scores over time served as a rough measure of A.I. progress.
But A.I. systems eventually got too good at those tests, so new, harder tests were created — often with the types of questions graduate students might encounter on their exams.
Those tests aren’t in good shape, either. New models from companies like OpenAI, Google and Anthropic have been getting high scores on many Ph.D.-level challenges, limiting those tests’ usefulness and leading to a chilling question: Are A.I. systems getting too smart for us to measure?
This week, researchers at the Center for AI Safety and Scale AI are releasing a possible answer to that question: A new evaluation, called “Humanity’s Last Exam,” that they claim is the hardest test ever administered to A.I. systems.
Humanity’s Last Exam is the brainchild of Dan Hendrycks, a well-known A.I. safety researcher and director of the Center for AI Safety. (The test’s original name, “Humanity’s Last Stand,” was discarded for being overly dramatic.)
The questions on Humanity’s Last Exam went through a two-step filtering process. First, submitted questions were given to leading A.I. models to solve.
If the models couldn’t answer them (or if, in the case of multiple-choice questions, the models did worse than random guessing), the questions were given to a set of human reviewers, who refined them and verified the correct answers.
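For readers who want to picture that pipeline, here is a minimal sketch of the filtering logic in Python. The `Question` class, the `model.answer` call and the `review` step are hypothetical stand-ins; the article does not describe the researchers’ actual tooling, only the two-step rule it implements.

```python
from dataclasses import dataclass

@dataclass
class Question:
    prompt: str
    answer: str
    choices: list[str] | None = None  # None means free-response


def model_fails(q: Question, model, n_trials: int = 5) -> bool:
    """Decide whether a single model 'fails' a question.

    Free-response: the model never produces the correct answer.
    Multiple-choice: the model scores below the random-guessing baseline.
    """
    correct = sum(model.answer(q.prompt) == q.answer for _ in range(n_trials))
    if q.choices:
        return (correct / n_trials) < 1.0 / len(q.choices)
    return correct == 0


def passes_first_filter(q: Question, models) -> bool:
    # Step 1: keep a question only if every frontier model fails it.
    return all(model_fails(q, m) for m in models)


def build_exam(submitted, models, review):
    hard = [q for q in submitted if passes_first_filter(q, models)]
    # Step 2: human expert reviewers refine the wording and verify the answer key.
    return [review(q) for q in hard]
```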
There are other tests trying to measure advanced A.I. capabilities in certain domains, such as FrontierMath, a test developed by Epoch AI, and ARC-AGI, a test developed by the A.I. researcher François Chollet.
But Humanity’s Last Exam is aimed at determining how good A.I. systems are at answering complex questions across a wide variety of academic subjects, giving us what might be thought of as a general intelligence score.
Once the list of questions had been compiled, the researchers gave Humanity’s Last Exam to six leading A.I. models, including Google’s Gemini 1.5 Pro and Anthropic’s Claude 3.5 Sonnet. All of them failed miserably. OpenAI’s o1 system scored the highest of the bunch, with a score of 8.3 percent.
Mr. Hendrycks said he expected those scores to rise quickly, and potentially to surpass 50 percent by the end of the year. At that point, he said, A.I. systems might be considered “world-class oracles,” capable of answering questions on any topic more accurately than human experts. And we might have to look for other ways to measure A.I.’s impacts, like looking at economic data or judging whether it can make novel discoveries in areas like math and science.
“You can imagine a better version of this where we can give questions that we don’t know the answers to yet, and we’re able to verify if the model is able to help solve it for us,” said Summer Yue, Scale AI’s director of research and an organizer of the exam.
Part of what’s so confusing about A.I. progress these days is how jagged it is. We have A.I. models capable of diagnosing diseases more effectively than human doctors, winning silver medals at the International Math Olympiad and beating top human programmers on competitive coding challenges.
But these same models sometimes struggle with basic tasks, like arithmetic or writing metered poetry. That has given them a reputation as astoundingly brilliant at some things and totally useless at others, and it has created vastly different impressions of how fast A.I. is improving, depending on whether you’re looking at the best or the worst outputs.
That jaggedness has also made measuring these models hard. I wrote last year that we need better evaluations for A.I. systems. I still believe that. But I also believe that we need more creative methods of tracking A.I. progress that don’t rely on standardized tests, because most of what humans do — and what we fear A.I. will do better than us — can’t be captured on a written exam.
Mr. Zhou, a theoretical particle physics researcher who submitted questions to Humanity’s Last Exam, told me that while A.I. models were often impressive at answering complex questions, he didn’t consider them a threat to him and his colleagues, because their jobs involve much more than spitting out correct answers.
“There’s a big gulf between what it means to take an exam and what it means to be a practicing physicist and researcher,” he said. “Even an A.I. that can answer these questions might not be ready to help in research, which is inherently less structured.”