Security

Microsoft Unveils Tool to Detect Hidden Backdoors in Open AI Models

Microsoft has developed a lightweight security scanner designed to detect hidden backdoors in open-weight large language models (LLMs), marking a significant step toward strengthening trust in AI systems. The tool, built by Microsoft’s AI Security team, can identify malicious tampering without requiring prior knowledge of how the backdoor was implanted or retraining the model.

Open-weight LLMs are increasingly popular but vulnerable to manipulation. Attackers can poison a model during training by embedding “sleeper agent” behaviors into its weights. These backdoors remain dormant under normal use and activate only when specific trigger inputs are encountered, making them difficult to detect with traditional testing.

Microsoft’s scanner relies on three observable signals that reliably indicate model poisoning while keeping false positives low. First, when exposed to trigger phrases, backdoored models show a distinctive “double-triangle” attention pattern, sharply focusing on the trigger and producing unusually deterministic outputs. Second, poisoned models tend to memorize the malicious training data, which can be extracted using memory-leak techniques. Third, even if a trigger is slightly altered, the backdoor can still activate through approximate or “fuzzy” variations.

The scanning process extracts memorized content from a model, analyzes it to isolate suspicious substrings, and scores them using loss functions aligned with the three indicators. The result is a ranked list of potential trigger candidates, enabling security teams to flag compromised models at scale. Importantly, the approach works across common GPT-style architectures and does not require additional training.

However, Microsoft acknowledges limitations. The scanner requires access to model weights, making it unsuitable for proprietary or closed models. It also works best for trigger-based backdoors that generate deterministic responses and cannot detect every form of malicious behavior.

This development aligns with Microsoft’s broader effort to expand its Secure Development Lifecycle to address AI-specific risks such as prompt injection, data poisoning, and unsafe model updates. As AI systems blur traditional security boundaries, Microsoft says collaborative research and shared defenses will be essential to securing the next generation of AI.