Computers ace IQ tests but still make dumb mistakes. Can different tests help?

AI researchers are creating novel “benchmarks” to help models avoid real-world stumbles.

Trained on billions of words from books, news articles, and Wikipedia, artificial intelligence (AI) language models can produce uncannily human prose. They can generate tweets, summarize emails, and translate dozens of languages. They can even write tolerable poetry. And like overachieving students, they quickly master the tests, called benchmarks, that computer scientists devise for them.

That was Sam Bowman’s sobering experience when he and his colleagues created a tough new benchmark for language models called GLUE (General Language Understanding Evaluation). GLUE gives AI models the chance to train on data sets containing thousands of sentences and confronts them with nine tasks, such as deciding whether a test sentence is grammatical, assessing its sentiment, or judging whether one sentence logically entails another. After completing the tasks, each model is given a single score, averaged across all nine.

At first, Bowman, a computer scientist at New York University, thought he had stumped the models. The best ones scored less than 70 out of 100 points (a D+). But in less than 1 year, new and better models were scoring close to 90, outperforming humans. “We were really surprised with the surge,” Bowman says. So in 2019 the researchers made the benchmark even harder, calling it SuperGLUE. Some of the tasks required the AI models to answer reading comprehension questions after digesting not just sentences, but paragraphs drawn from Wikipedia or news sites. Again, humans had an initial 20-point lead. “It wasn’t that shocking what happened next,” Bowman says. By early 2021, computers were again beating people.
