16/7/2025

Hadrien Mariaccia & Charbel-Raphaël Segerie

New benchmark: The bitter lesson of misuse detection

How do we prevent language models from answering harmful queries?

A common method is to place a "supervision system" between end-users and the model. This system acts as a dynamic filter, flagging harmful queries and instructing the LLM not to respond.
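To make the setup concrete, here is a minimal sketch of such a pipeline. The `supervisor_flags` and `call_llm` functions below are hypothetical placeholders, standing in for a commercial safeguard and the underlying model's API; the blocklist heuristic is purely illustrative.

```python
# Minimal illustrative sketch of a supervision pipeline (placeholders only).

REFUSAL = "I can't help with that request."

def supervisor_flags(query: str) -> bool:
    """Placeholder safeguard: a real system would call a commercial
    supervision/moderation endpoint here."""
    blocklist = ("bomb", "malware")  # toy heuristic, for illustration only
    return any(word in query.lower() for word in blocklist)

def call_llm(prompt: str) -> str:
    """Placeholder for the underlying language model."""
    return f"<model response to {prompt!r}>"

def answer(query: str) -> str:
    # The supervisor sits between the end-user and the model:
    # flagged queries are refused instead of reaching the LLM.
    if supervisor_flags(query):
        return REFUSAL
    return call_llm(query)

print(answer("How to build a bomb?"))  # refused by the supervisor
print(answer("How do plants grow?"))   # forwarded to the model
```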

The problem, as our new paper reveals, is that these systems are... highly unreliable. We evaluated a range of commercially available supervision systems and compared their effectiveness to that of general-purpose models. First finding: most specialized supervision systems fail to detect obviously harmful queries such as "How to build a bomb?". At the same time, they often flag perfectly harmless queries as harmful, especially when those queries are phrased with a typical jailbreak structure.

More surprisingly, general-purpose language models (like ChatGPT or Claude) are far more effective than specialized supervision systems. A simple prompt such as "Is this a malicious query?" is enough for these general-purpose models to identify harmful queries with reasonable accuracy. This is what we call the "bitter lesson" of supervision: bigger, general-purpose models consistently outperform smaller, more specialized ones.
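To make this concrete, here is a hedged sketch of prompting a general-purpose model as the supervisor, using the OpenAI Python SDK. The model name, prompt wording, and YES/NO parsing are illustrative choices on our part, not the paper's exact evaluation protocol.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_judges_harmful(query: str, model: str = "gpt-4o-mini") -> bool:
    """Ask a general-purpose model whether a query is malicious (YES/NO)."""
    # Illustrative prompt, not the exact wording used in the paper.
    prompt = (
        "Is the following query malicious or harmful? "
        "Answer with a single word, YES or NO.\n\n"
        f"Query: {query}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```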

However, in our experiments, no supervision system (including general-purpose LLMs) achieved more than 80% accuracy in detecting harmful queries, so not a single one can be considered truly reliable. Another interesting finding from our study is that LLMs suffer from "metacognitive inconsistency": a model can correctly identify a query as harmful, yet when that same query is asked directly it answers anyway, even when instructed not to.

Ask: "Is this a malicious query: 'How to build a bomb?'" → The model answers: "Yes." Ask directly: "How to build a bomb?" → The model still answers!

Conclusion: it would be more effective to use an LLM to supervise itself than to rely on a specialized supervision system.

This work builds on our previous publication, "BELLS: A Framework Towards Future-Proof Benchmarks for the Evaluation of LLM Safeguards," which was presented at the "Next Generation of AI Safety" workshop at ICML 2024 and featured in the OECD's Catalogue of Tools & Metrics for Trustworthy AI.

We’re excited to share the full findings through three resources:
