A new study of the testing procedures behind common AI models has reached some worrying conclusions.

The joint investigation by U.S. and U.K. researchers examined data from more than 440 benchmark tests used to measure AI models' problem-solving abilities and safety. The researchers reported flaws in these tests that undermine the credibility of the claims built on them.

According to the study, the flaws stem from benchmarks being built on unclear definitions or weak analytical methods, making it difficult to accurately assess a model's abilities or to gauge AI progress.

“Benchmarks underpin nearly all claims about advances in AI,” said Andrew Bean, lead author of the study. “But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to.”
