AI SecurityMay 12, 2026

Defense at AI speed: Microsoft’s new multi-model agentic security system tops leading industry benchmark

Microsoft announces MDASH, an AI agentic system that discovered 16 new Windows vulnerabilities including four Critical

Summary

Microsoft's Autonomous Code Security team has developed MDASH, a multi-model agentic scanning harness that orchestrates over 100 specialized AI agents to discover and validate exploitable vulnerabilities. The system identified 16 new vulnerabilities in Windows networking and authentication components, including four Critical remote code execution flaws in the Windows kernel TCP/IP stack and IKEv2 service. MDASH achieved 88.45% on the public CyberGym benchmark with zero false positives on internal testing, demonstrating production-grade AI-powered vulnerability discovery at enterprise scale.

Full text

Share Link copied to clipboard! Today Microsoft announced a major step forward in AI-powered cyber defense: our new agentic security system helped researchers find 16 new vulnerabilities across the Windows networking and authentication stack—including four Critical remote code execution flaws in components such as the Windows kernel TCP/IP stack and the IKEv2 service. They used the new Microsoft Security multi-model agentic scanning harness (codename MDASH) which was built by Microsoft’s Autonomous Code Security team. Unlike single-model approaches, the harness orchestrates more than 100 specialized AI agents across an ensemble of frontier and distilled models to discover, debate, and prove exploitable bugs end-to-end. Learn more and sign up to join the preview The results speak for themselves: 21 of 21 planted vulnerabilities found with zero false positives on a private test driver; 96% recall against five years of confirmed Microsoft Security Response Center (MSRC) cases in clfs.sys and 100% in tcpip.sys; and an industry-leading 88.45% score on the public CyberGym benchmark of 1,507 real-world vulnerabilities—the top score on the leaderboard, roughly five points ahead of the next entry. The strategic implication is clear: AI vulnerability discovery has crossed from research curiosity into production-grade defense at enterprise scale, and the durable advantage lies in the agentic system around the model rather than any single model itself. Codename MDASH is being used by Microsoft security engineering teams and tested by a small set of customers as part of a limited private preview. This post explains how codename MDASH works, what we shipped today, what we learned along the way, and how you can sign up for the private preview. AI-powered vulnerability discovery at hyper-scale The Microsoft Autonomous Code Security (ACS) team was assembled to take AI-powered vulnerability research from a research curiosity to production engineering at enterprise scale. Several members of this team came to Microsoft from Team Atlanta, the team that won the $29.5 million DARPA AI Cyber Challenge by building an autonomous cyber-reasoning system that found and patched real bugs in complex open-source projects. The lessons from that work, especially the level of engineering required to make the frontier language models perform professional-level security auditing, are what our new multi-model agentic scanning harness (codename MDASH) is built around. Microsoft’s code base is challenging for security auditing for a few reasons: Massive proprietary surface. Windows, Hyper-V, Azure, and the device-driver and service ecosystems around them are private Microsoft codebases—not part of any commodity language model’s training corpus, and genuinely hard to reason about: kernel calling conventions, IRP and lock invariants, IPC trust boundaries, and component-internal idioms do not yield to pattern matching. On this surface, a model has to actually reason. DevSecOps at scale. Every finding has a real owner, a triage process, and a Patch Tuesday to land on. There is no quiet drawer for speculative findings; if a tool produces noise, the noise is everyone’s problem. High-value targets. Windows, Hyper-V, Xbox, and Azure serve billions of users. The payoff for finding a single hard bug is unusually high—and so is the cost of a false positive in a tier-one component. The findings in this post are the result of close collaboration between ACS, Microsoft Offensive Research & Security Engineering (MORSE), and Microsoft Windows Attack Research and Protection (WARP). WARP and MORSE own the deep, hard end of Windows offensive research; ACS brings the AI-powered discovery and validation pipeline. Together, the teams have collaborated to build a mature harness. Codename: MDASH—Microsoft Security’s new multi-model agentic scanning harness Codename MDASH is, at its core, an agentic vulnerability discovery and remediation system. The model is one input. The system is the product. A useful mental model is to think of it as a structured pipeline that takes a code base and emits validated, proven findings: Prepare stage: Ingests the source target, builds language-aware indices, and then draws the attack surface and threat models by analyzing the past commits. Scan stage: Runs specialized auditor agents over candidate code paths, emitting candidate findings with hypotheses and evidence. Validate stage: Runs a second cohort of agents—debaters—that argue for and against each finding’s reachability and exploitability. Dedup stage: Collapses semantically equivalent findings (for example, patch-based grouping). Prove stage: Constructs and executes triggering inputs where the bug class admits it. The prove stage validates the pre-condition dynamically and formulates the bug-triggering inputs to prove existence of vulnerability (for example, ASan in C/C++). Three properties make this work in practice: An ensemble of diverse models that are effectively managed by codename MDASH. No single model is best at every stage. The multi-model agentic scanning harness runs a configurable panel of models. That includes SOTA models as the heavy reasoner, distilled models as a cost-effective debater for high-volume passes, and a second separate SOTA model as an independent counterpoint. Disagreement between models is itself a signal: when an auditor flags something as suspect and the debater can’t refute it, that finding’s posterior credibility goes up. Specialized agents. An auditor does not reason like a debater, which does not reason like a prover. Each pipeline stage has its own role, prompt regime, tools, and stop criteria. We don’t expect one prompt to do everything; we don’t expect one agent to recognize, validate, and exploit a bug in a single pass. Codename MDASH has more than 100 specialized agents, constructed through deep research with past common vulnerabilities and exposures (CVEs) and their patches, working independently to discover the bugs, and their auditing results will be ensembled as a single report. End-to-end pipeline with extensible plugins. The pipeline is opinionated, but it is not closed. Plugins let domain experts inject context the foundation models can’t see on their own—kernel calling conventions, IRP rules, lock invariants, IPC trust boundaries, codec state machines. The CLFS proving plugin we describe below is one such example: a domain plugin that knows how to construct a triggering log file given a candidate finding. For example, the Windows team extended reasoning with custom code analysis database, or CodeQL database can be also leveraged. The payoff for this architecture is portability across model generations. The pipeline’s targeting, validation, dedup, and prove stages are model agnostic by construction, which allows the harness to get the best of what any model has to offer. When a new model lands, A/B testing it against the current panel is one configuration flip. When a model improves, the customer’s prior investment—scope files, plugins, configurations, calibrations—all carry over, allowing customers to ride the frontier of security value. Using codename MDASH for security research To evaluate bug-finding capabilities of the multi-model agentic scanning harness you need to first ground on code that has never been seen by a model. This eliminates the possibility that a model “learned the answers to the test.” We scanned StorageDrive, a sample device driver used in Microsoft interviews for offensive security researchers. The driver contains 21 deliberately injected vulnerabilities, including kernel use-after-frees (UAFs), integer handling issues, IOCTL validation gaps, and locking errors. Because StorageDrive is a private codebase that has never been published, we can safely assume it was not included in the training data of modern language models. We ran the harness on StorageDrive using its default configuration. The results were striking: all 21 ground-truth vulnerabilities

Entities

Microsoft (vendor)Windows (product)MDASH (Microsoft Security multi-model agentic scanning harness) (product)Agentic AI systems (technology)Hyper-V (product)Azure (product)