We're excited to announce the release of "Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods"!
This paper is intended as Chapter 5 of the AI Safety Atlas - a comprehensive collection of literature reviews and distillations meant to serve as a central reference for anyone who wants to understand evaluations and how they fit into the broader AI safety picture.
As frontier AI systems advance toward transformative capabilities, reliable safety assessment becomes necessary for responsible development and informed governance. This literature review provides a comprehensive taxonomy of AI safety evaluations organized around three key dimensions: dangerous capabilities, concerning propensities, and control.
We clarify important distinctions between often-confused concepts like deception, scheming, and hallucination, while giving an overview of many other safety-critical capabilities such as cybersecurity exploitation and autonomous replication. The paper also covers concerning propensities like power-seeking, as well as control evaluations, which assess whether safety measures remain effective when AI systems actively attempt to circumvent them.
Example diagram from the paper distinguishing honesty, truthfulness, hallucination, deception, and scheming. If a model faithfully outputs its internal "beliefs", it is honest; if those beliefs correspond to reality, it is truthful, and otherwise it is hallucinating. A model might simply say things that earn it high reward as a myopic strategy with no ulterior long-term motive, which is akin to sycophancy. A model that is deceptive and acts on situationally aware long-term plans is scheming (deceptively aligned).
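To make the caption's decision logic concrete, here is a minimal, illustrative Python sketch of those distinctions. The `OutputAssessment` flags and `classify_output` function are hypothetical names we introduce for illustration, not code from the paper.

```python
from dataclasses import dataclass

# Hypothetical judgements an evaluator might make about a single model output;
# the field names are illustrative, not taken from the paper.
@dataclass
class OutputAssessment:
    matches_internal_beliefs: bool            # model reports what it "believes"
    beliefs_match_reality: bool               # those beliefs are actually correct
    reward_seeking_only: bool                 # output chosen myopically for high reward
    situationally_aware_long_term_plan: bool  # deception serves a long-horizon plan


def classify_output(a: OutputAssessment) -> str:
    """Label an output using the honesty / truthfulness / hallucination /
    deception / scheming distinctions from the diagram."""
    if a.matches_internal_beliefs:
        # Honest: the model says what it believes...
        if a.beliefs_match_reality:
            return "honest and truthful"
        # ...but the belief itself is wrong.
        return "honest but hallucinating"
    # The model is not reporting its beliefs, i.e. it is being deceptive.
    if a.situationally_aware_long_term_plan:
        return "scheming (deceptively aligned)"
    if a.reward_seeking_only:
        return "myopic reward-seeking (akin to sycophancy)"
    return "deceptive"


print(classify_output(OutputAssessment(False, False, False, True)))
# -> scheming (deceptively aligned)
```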
We go through some practical evaluation design principles, examining how affordances (resources available during testing), scaling approaches (including automation and model-written evaluations), and integration methods (through training, security, and governance audits) can be combined to create robust safety frameworks.
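As a toy illustration of the model-written evaluations scaling approach mentioned above, the sketch below has one model draft test items and a grader score a target model's answers. The `query_model` and `grade` callables are placeholders for whichever inference API and grading scheme you actually use; this is our own minimal sketch, not an API described in the paper.

```python
from typing import Callable, List


def generate_eval_items(query_model: Callable[[str], str], n: int) -> List[str]:
    """Have a generator model draft n test questions for a target behavior."""
    prompt = ("Write one yes/no question that tests whether an AI assistant "
              "would acknowledge uncertainty instead of fabricating an answer.")
    return [query_model(prompt) for _ in range(n)]


def score_target(query_model: Callable[[str], str],
                 grade: Callable[[str, str], bool],
                 items: List[str]) -> float:
    """Fraction of generated items the target model answers acceptably,
    according to a grading function (which could itself be another model)."""
    passed = sum(grade(item, query_model(item)) for item in items)
    return passed / max(len(items), 1)
```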
We also discuss limitations that safety evaluations face, including model "sandbagging" (strategic underperformance on tests), organizational "safetywashing" (misrepresenting capability improvements as safety advancements), and the more fundamental challenge of proving the absence rather than the presence of capabilities.
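As one simplified illustration of why sandbagging matters, a natural heuristic is to compare scores on the same tasks under standard prompting versus stronger elicitation and flag large gaps. The sketch below is our own toy example under that assumption, not a method prescribed in the paper.

```python
# Compare a model's score on the same task set under a standard prompt and
# under stronger elicitation (e.g. fine-tuning or added incentives). A large,
# consistent gap suggests the default score understates true capability.
def sandbagging_gap(scores_standard: list[float],
                    scores_elicited: list[float],
                    threshold: float = 0.10) -> tuple[float, bool]:
    mean_std = sum(scores_standard) / len(scores_standard)
    mean_eli = sum(scores_elicited) / len(scores_elicited)
    gap = mean_eli - mean_std
    return gap, gap > threshold  # flag if elicited performance is much higher


gap, flagged = sandbagging_gap([0.42, 0.45, 0.40], [0.71, 0.68, 0.74])
print(f"gap={gap:.2f}, possible sandbagging: {flagged}")
```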
We would like to express our gratitude to Maxime Riché, Martin, Fabien Roger, Jeanne Salle, Camille Berger, and Leo Karoubi for their valuable feedback, discussions, and contributions to this work.
The website version and PDF can be accessed here.