Constitutional Classifiers: Stop AI Jailbreaks Cold
Discover how Anthropic’s next‑gen Constitutional Classifiers++ slash jailbreak risks while keeping Claude fast, safe, and highly useful
Jan 10, 2026 - Written by Lorenzo Pellegrini
Anthropic Publishes Research on Next-Generation Constitutional Classifiers: What It Means for Safer AI
Anthropic has released new research on improved Constitutional Classifiers, an AI safety technique designed to protect large language models like Claude from “jailbreak” attacks while keeping them useful and efficient. These next-generation systems, often referred to as Constitutional Classifiers++, significantly strengthen defenses against high-risk misuse, reduce unnecessary refusals of harmless content, and cut the compute cost of safety safeguards to production-ready levels.
What Are Constitutional Classifiers?
Constitutional Classifiers are AI safety components that sit around a language model and monitor its inputs and outputs. Their role is to detect and block content that may be harmful or policy-violating, such as detailed instructions for chemical weapons, cyberattacks, or other dangerous activities. Instead of relying solely on manual rules, they are trained from a written “constitution” of natural language principles that spell out what is allowed and what is not.
This constitution might include principles such as: help with general education, avoid facilitating violent harm, avoid giving detailed guidance on building weapons, respect privacy, and follow platform-specific safety policies. These rules are used to generate synthetic training examples, which in turn train the safety classifiers to recognize risky versus acceptable behavior at scale.
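To make this concrete, here is a minimal sketch of how a written constitution could be turned into labeled training data. It is illustrative only: the `generate_text` callable and the specific principles are placeholders, not Anthropic's actual pipeline or wording.

```python
# Illustrative sketch (not Anthropic's actual pipeline): turning a written
# constitution into labeled synthetic training data for a safety classifier.
# `generate_text` stands in for any text-generation API and is hypothetical.

CONSTITUTION = [
    "Help with general education and benign professional tasks.",
    "Do not provide detailed guidance for building weapons.",
    "Do not facilitate cyberattacks or malware development.",
    "Respect privacy and platform-specific safety policies.",
]

def synthesize_examples(generate_text, n_per_principle=100):
    """Prompt a generator model to produce both violating and compliant
    exchanges for each principle, yielding (text, label) pairs."""
    dataset = []
    for principle in CONSTITUTION:
        for label, instruction in [
            (1, "Write a user request and assistant reply that VIOLATE this rule."),
            (0, "Write a harmless request and reply that clearly COMPLY with this rule."),
        ]:
            for _ in range(n_per_principle):
                example = generate_text(f"Safety rule: {principle}\n{instruction}")
                dataset.append((example, label))
    return dataset  # later used to train the input/output/exchange classifiers
```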
Why Jailbreaks Are Such a Critical Problem
Jailbreaking refers to techniques that persuade or trick an AI model into ignoring its safety policies and providing harmful or restricted content. Attackers use indirect prompts, role-play, obfuscation, or multi-step strategies to bypass safeguards. As models become more capable, the potential harm from successful jailbreaks grows, especially in areas like chemical, biological, radiological, and nuclear (CBRN) risks, cybercrime, or targeted harassment.
A particularly worrying scenario is the “universal jailbreak”, a single strategy that reliably bypasses defenses across many different dangerous prompts. Defending against such universal jailbreaks requires robust, general safeguards, not just ad hoc filters for individual prompts.
Anthropic’s First-Generation Constitutional Classifiers
In earlier research, Anthropic introduced the first generation of Constitutional Classifiers as a practical method to defend large models from jailbreaks. These systems:
- Used a written safety constitution to generate synthetic training data.
- Employed separate classifiers for inputs and outputs to catch harmful queries and responses.
- Showed strong robustness under extensive human red teaming, particularly for high-risk scientific domains.
- Substantially reduced successful jailbreak attempts compared with unguarded models.
However, this initial setup had two main limitations. First, it introduced significant compute overhead, which made deployment expensive at large scale. Second, it tended to "over-refuse", blocking or declining too many benign requests. This created friction for everyday users who were not attempting any misuse but still ran into safety refusals.
Next-Generation Constitutional Classifiers++: What’s New
Anthropic’s new research presents next-generation Constitutional Classifiers, sometimes called Constitutional Classifiers++, which directly address the earlier trade-offs between robustness, cost, and user experience. The new system is designed to be more accurate, more efficient, and easier to run in production.
1. A Unified “Exchange” Classifier
Instead of separate input and output classifiers, the improved design uses a single exchange classifier that sees the full context of a conversation. The classifier considers both what the user asked and how the model responded, which makes it better at judging whether a specific answer is harmful, nuanced but safe, or entirely benign.
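As a rough illustration of the difference, the sketch below contrasts separate input/output scoring with exchange-level scoring. The `score_*` callables are hypothetical stand-ins for trained classifiers, not a real API.

```python
# Illustrative only: the essential difference between separate input/output
# classifiers and a single exchange classifier is what the scorer gets to see.
# Each `score_*` callable stands in for a trained classifier returning P(harmful).

def separate_classifiers(score_input, score_output, prompt, response, threshold=0.5):
    # Each classifier judges its half of the exchange in isolation.
    return score_input(prompt) > threshold or score_output(response) > threshold

def exchange_classifier(score_exchange, prompt, response, threshold=0.5):
    # One classifier sees the full exchange, so it can judge whether this
    # specific answer is harmful in the context of this specific question.
    exchange = f"User: {prompt}\nAssistant: {response}"
    return score_exchange(exchange) > threshold
```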
Human red teaming shows that this exchange classifier alone can cut successful jailbreak attempts by more than half compared with previous-generation setups, significantly raising the difficulty of bypassing safeguards.
2. Activation-Based Probe Classifiers
A major innovation in the new work is the use of internal probe classifiers, also called linear probes. These probes operate on the internal activations of the language model, reusing computations that the model has already performed instead of running a large external classifier for every message.
By learning to read risk-relevant signals from the model’s internal representations, the probes can quickly flag obviously dangerous traffic at a very low cost. This means heavy-duty classification is reserved only for the riskiest cases, while most normal interactions are checked with almost no extra compute.
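The following is a minimal sketch of what such a probe could look like, assuming access to a hidden-state activation vector for each exchange. The class and its usage are illustrative, not Anthropic's implementation.

```python
import numpy as np

# Minimal sketch of an activation probe, assuming we can read out a hidden-state
# vector (e.g. one layer's activation) produced while generating the response.
# The probe is just a learned weight vector plus bias: one dot product per check.

class LinearProbe:
    def __init__(self, weights: np.ndarray, bias: float):
        self.weights = weights  # shape: (hidden_dim,)
        self.bias = bias

    def risk_score(self, activation: np.ndarray) -> float:
        """Logistic score in [0, 1]; near-zero extra compute because the
        activation was already computed during normal generation."""
        logit = float(self.weights @ activation) + self.bias
        return 1.0 / (1.0 + np.exp(-logit))

# Hypothetical usage:
#   probe = LinearProbe(trained_weights, trained_bias)
#   if probe.risk_score(layer_activation) > 0.9:
#       escalate_to_heavy_classifier()
```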
3. Cascaded Two-Stage Safety Architecture
The full production-grade system uses a cascade:
- Stage 1: A lightweight probe classifier screens all exchanges using model activations. It quickly decides whether an interaction appears safe or potentially risky.
- Stage 2: Only if Stage 1 flags an interaction does a more powerful exchange classifier, or an ensemble of classifiers, perform deeper analysis using the full text context.
This cascade combines broad coverage with efficiency. The vast majority of harmless user prompts never trigger the expensive second stage, so the total overhead remains very low, while high-risk attempts are thoroughly scrutinized.
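A minimal sketch of this routing logic, with hypothetical thresholds and classifier objects standing in for the real components:

```python
# Sketch of the cascaded decision logic. Thresholds, objects, and method names
# are hypothetical, not Anthropic's production values.

def check_exchange(activation, exchange_text, probe, heavy_classifiers,
                   probe_threshold=0.5, heavy_threshold=0.5):
    """Stage 1: cheap activation probe on every exchange.
    Stage 2: expensive text classifiers only for exchanges the probe flags."""
    if probe.risk_score(activation) < probe_threshold:
        return "allow"  # the vast majority of traffic stops here

    # Stage 2: ensemble of full-context exchange classifiers.
    scores = [clf.score(exchange_text) for clf in heavy_classifiers]
    if max(scores) > heavy_threshold:
        return "block_or_rewrite"
    return "allow"
```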
4. Dramatically Reduced Compute Overhead
One of the headline results of Anthropic’s new research is that Constitutional Classifiers++ add only around 1 percent additional compute cost compared with running an unguarded model. Earlier versions had much higher overhead, which limited their practicality for always-on production deployments.
By leveraging internal activations, narrowing when heavy classifiers are invoked, and optimizing the cascade, Anthropic demonstrates that strong safety safeguards can be deployed at scale without prohibitive costs.
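A back-of-the-envelope calculation shows why this kind of cascade can stay cheap. The probe cost, escalation rate, and heavy-classifier cost below are made-up illustrative numbers; only the roughly 1 percent overall figure comes from the research described above.

```python
# Illustrative arithmetic only: expected overhead = probe cost on every
# exchange, plus the heavy classifier's cost weighted by how often it runs.

probe_cost = 0.001   # probe reuses existing activations: ~0.1% of generation cost (assumed)
heavy_cost = 0.30    # hypothetical: heavy exchange classifier costs 30% of generation
flag_rate  = 0.03    # hypothetical: probe escalates ~3% of traffic to Stage 2

expected_overhead = probe_cost + flag_rate * heavy_cost
print(f"Expected added compute per exchange: {expected_overhead:.1%}")  # ~1.0%
```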
5. Lower Over-Refusal of Benign Content
The new system substantially reduces over-refusals. Anthropic reports an approximate 87 percent reduction in unnecessary refusals of harmless requests compared with earlier classifier setups. This indicates that the new classifiers are better at distinguishing genuine risk from benign or borderline content that should be allowed.
For users, this means fewer frustrating “I cannot help with that” responses when asking for normal educational, professional, or creative assistance, while still maintaining strong defenses against clearly harmful requests.
Results from Red Teaming and Evaluations
Anthropic rigorously evaluates its safety systems using a mix of automated tests and human red teaming. Red teaming involves experts and external participants who attempt to bypass defenses using creative and adversarial prompts.
Across thousands of hours of cumulative red teaming, earlier Constitutional Classifiers already showed strong robustness to universal jailbreaks, although trade-offs in cost and refusals remained. With Constitutional Classifiers++, the company reports:
- Meaningful reductions in successful jailbreak attempts, especially in exchange-level evaluations.
- No universal jailbreak discovered against the latest defenses after substantial red-teaming effort, including large bug bounty–style programs.
- Improved alignment with safety policies, especially on high-risk scientific and security topics.
Although no system can be perfectly jailbreak-proof, these results indicate that the new architecture raises the bar significantly, making systematic misuse far more difficult.
How Constitutional Classifiers Work in Practice
In deployment, Constitutional Classifiers typically operate as a guard layer around a base model like Claude. A simplified flow looks like this:
- The user sends a prompt to the system.
- The base model generates a draft response, with internal activations computed as usual.
- The probe classifier quickly inspects the activations, and possibly the exchange text, to assess risk.
- If the interaction seems safe, the response is returned with minimal delay.
- If it seems risky, the heavier exchange classifier or ensemble analyzes the full prompt and response.
- If harmful content is detected, the system can refuse, partially redact, or reformulate the answer in a safer way, for example by giving high-level advice while omitting dangerous details.
This design makes safety checks nearly invisible during normal use but strongly active when conversations approach policy boundaries.
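Putting the pieces together, here is an illustrative end-to-end sketch of that guard-layer flow. Every name here (`generate_with_activations`, `heavy_check`, `safe_rewrite`) is a hypothetical stand-in rather than a real API.

```python
# Illustrative end-to-end flow of the guard layer described above.
# All objects and functions are hypothetical stand-ins, not real APIs.

def guarded_respond(prompt, model, probe, heavy_check, safe_rewrite):
    # 1-2. Base model drafts a response; activations come from the same pass.
    response, activations = model.generate_with_activations(prompt)

    # 3-4. Cheap probe screens the exchange; most traffic returns immediately.
    if probe.risk_score(activations) < 0.5:
        return response

    # 5. Riskier exchanges get the full-text exchange classifier (or ensemble).
    if heavy_check(prompt, response) == "harmful":
        # 6. Refuse, redact, or reformulate into a safer, high-level answer.
        return safe_rewrite(prompt, response)
    return response
```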
Implications for AI Safety and Industry Standards
Anthropic’s work on Constitutional Classifiers and their improved, efficient variants has broader implications beyond a single product line. The research showcases a concrete, scalable pattern for model-agnostic safety layers that other organizations can adapt.
Key implications include:
- Scalable governance tooling: A written safety constitution can be updated as policies, regulations, or societal norms evolve, without retraining the entire base model from scratch.
- Interpretability-informed safety: Activation probes bridge interpretability research and applied safety, using signals from within the model to guide classification.
- Better trade-offs: Demonstrating strong robustness with low overhead and fewer unnecessary refusals helps reconcile safety with usability and commercial viability.
- Regulatory relevance: As policymakers explore requirements for robust safeguards around frontier models, classifier-based architectures like this provide an example of concrete technical controls that can be evaluated and audited.
Limitations and Open Research Questions
Despite the progress, Anthropic emphasizes that Constitutional Classifiers are not a complete solution to AI safety challenges. Some of the key limitations and open questions include:
- Non-universal jailbreaks: Even if no single universal jailbreak exists, there may still be many specialized jailbreaks that work for particular topics or edge cases.
- Obfuscation and novel attack vectors: Attackers might invent new prompt styles, use code, or exploit multimodal inputs in ways that current classifiers do not fully anticipate.
- Distribution shift: As models are applied in new domains, the classifier may encounter content far from its training distribution, which can reduce its accuracy.
- Over-reliance on external guards: Classifiers are an outer layer of defense. Future work needs to combine them with deeper model-level alignment so that the base model itself becomes intrinsically more resistant to misuse.
Anthropic’s researchers outline several promising directions, such as integrating classifier signals directly into the generation process, automating red teaming to continuously discover new attack patterns, and training models that are inherently better at resisting obfuscated or adversarial prompts.
What This Means for Everyday Users of Claude
For everyday users, the technical details of Constitutional Classifiers mostly stay behind the scenes. The practical impacts appear as:
- Fewer unsafe answers: It becomes more difficult to coax the model into providing dangerous or clearly policy-violating content.
- Fewer false refusals: Users asking for normal educational, professional, or creative help should see fewer unnecessary safety blocks.
- More consistent behavior: The model’s responses align more predictably with published safety policies, even when prompts are complex or ambiguous.
In short, the improvements aim to make Claude safer and more trustworthy while keeping it broadly helpful, rather than overcautious.
How Constitutional Classifiers Fit into Anthropic’s Broader Alignment Agenda
Anthropic’s research portfolio combines several strands of work: scalable oversight, interpretability, red teaming, and alignment techniques. Constitutional Classifiers sit at the intersection of these themes, turning high-level ethical and safety principles into concrete, testable, and updateable systems around large models.
As Anthropic continues to study issues like alignment faking, deceptive behavior, and long-term risk, robust external safeguards like Constitutional Classifiers++ provide a practical line of defense that can evolve in parallel with core model training methods.
Conclusion
Anthropic’s latest research on next-generation Constitutional Classifiers marks a significant step in making powerful language models safer, more reliable, and more practical to deploy at scale. By combining an exchange-level view of conversations, activation-based probes, and cascaded classifiers, the new system substantially raises the bar against jailbreaks while sharply reducing over-refusals and compute overhead. Although open challenges remain, this work offers a concrete blueprint for production-grade AI safeguards and highlights how explicit constitutions, interpretability insights, and rigorous red teaming can be woven together to keep advanced AI systems aligned with human values and safety requirements.
