AI Security · Jun 9, 2026

XBOW tests Anthropic's Mythos Preview for offensive security

XBOW evaluates Anthropic's Mythos Preview AI model for offensive security capabilities.

Summary

XBOW has tested Anthropic's Mythos Preview AI model, finding it to be a significant advancement for offensive security, particularly in source code analysis for vulnerability discovery. While impressive in identifying potential weaknesses and providing technically precise analysis, its effectiveness diminishes in live-site penetration testing where human expertise remains crucial for validation and exploitability.

Full text

Sponsored by XBOW, June 9, 2026, 12:16 PM

We received early access to Mythos Preview for capability testing a few weeks back. Below are the details on how we tested Mythos Preview, what we found, and what it means.

About three months ago, Anthropic invited us to help them assess a new model they thought represented a significant shift in capability. So we put it through our security gauntlet: benchmarks, workflows, interactive use, and integrations. Today, we can finally share details on how we tested Mythos Preview, what we found, and what it means.

Spoilers: This model is a major advance. It is substantially better than prior models at finding vulnerability candidates, especially when source code is available. It communicates with unusual technical precision, reasons well about code, and shows strong promise in complex domains such as native-code analysis and reverse engineering.

Our takeaway: Mythos Preview is a powerful tool for generating strong vulnerability leads and technically precise analysis. It is especially adept at analyzing source code with a security mindset. It's not magic, though: a model is a brain without a body. While source code audits are mostly a brain activity, live-site pentests like the ones XBOW performs very much need a body whose skill and control can match the brain's power.

Testing methodology

The first thing we did was assemble a diverse team of 10 experts from different parts of the company who could assess the model from different directions. We test all models with the same internal benchmarking system we have used to analyze Opus 4.7 and GPT 5.5. In this system, we take open source applications where vulnerabilities were previously discovered, freeze them at the vulnerable version, and run our agents against them.
But this time, we expanded our testing to analyze other angles as well:

- The model’s judgment with regard to threat modeling, vulnerability validation, and safety
- The model’s ability to read source code versus interact with live systems
- Its ability to find exploits we’re not yet looking for in our standard assessments, e.g., native app vulnerabilities

A note on terminology: When people say “Mythos,” they sometimes refer to the raw model. In this evaluation, we explored Mythos Preview both inside Claude Code and as a raw model, using it via its API as an engine for XBOW’s agents. We separate those cases because orchestration, tools, prompting, and live-site access materially affect outcomes.

Results

Our testers who tried out Mythos Preview in interactive use were quite impressed. “This is a lot closer to `just go and find something` than anything I’ve seen so far,” said one of them. We tried giving it our own source code, and it found weaknesses: nothing truly terrible, thankfully, but there were several items we wanted to repair. We tried it on open source software, and at the end of week one, we had quite a few new vulnerabilities we had to disclose.

Our testers who tried out Mythos Preview on benchmarks were also quite impressed, but their appreciation was of a slightly different kind: impressed _with data_. Their results also laid bare the difference between areas where the model was runaway powerful and areas where it presented only a modest advance.

Mythos Preview Benchmark Performance

Our key takeaways after analyzing Mythos Preview include:

- It’s extremely powerful for source code audits.
- It’s good, but less powerful, at validating exploits.
- Its judgment is mixed. It can be too literal and conservative, and also tends to overstate the practical relevance of its findings.
- It’s strong in native-code vulnerability discovery and reverse engineering.

Next-level vulnerability discovery

Mythos Preview presents a significant step up over all existing models, regardless of provider, on XBOW’s web exploit benchmark. This benchmark is designed to test whether a model can help XBOW find validated, actionable vulnerabilities in live website environments. A case is counted as passed only when the system finds a validated way to act on the vulnerability (PoC||GTFO) after a series of 80 “actions,” where an action might be a shell or a Python script using standard commands or XBOW’s suite of attack tools.

Note: We haven't included Opus 4.7 in this chart because that model interacts with our system in a unique way, making this particular stat less relevant for it. We’ve written up the full story here.

Compared to the newest model at the time (Opus 4.6), this was a strong increase: the number of false negatives was cut by 42%. In a variation where we gave both models the site’s source code, the cut was even larger, at 55%. This was the first instance of a theme that would surface again and again: Mythos Preview is impressive at writing code, but even more impressive at reading it.

Below are the pass rates of Mythos Preview, Opus 4.6, and GPT 5.5 as a function of the allowed number of actions (executed scripts).

Mythos Preview finds vulnerabilities in significantly fewer iterations than Opus 4.6, although the difference from GPT 5.5 is less pronounced. That difference becomes clearer when adding two considerations:

- Models could choose many small steps or few large steps (more details here), and that shouldn’t matter so much. Instead of giving a budget of actions, let’s consider a budget of output tokens.
- Instead of mean pass rate, i.e., the probability of finding a vulnerability, it’s often more instructive to look at the odds of discovery, i.e., the ratio at which you would bet on the model getting a discovery right. Computationally, that's the hit rate divided by the miss rate.

Under these considerations, the picture becomes much clearer: token for token, Mythos Preview homes in on the vulnerability with unprecedented precision.

Live-site validation is the hard part

Mythos Preview is excellent at source-code reasoning, but our evaluation reinforced a practical truth: many exploitable issues do not appear as obvious defects in application source code. They emerge from configuration, dependencies, deployment choices, or the way otherwise safe components are combined. For instance, a dependency on its own could be safe, and the source code on its own could be safe, yet the source code uses the dependency in an unsafe way and creates a vulnerability. As Gary McGraw famously declared, you won’t find the majority of defects by “staring at code” alone.

That’s of particular interest to us. XBOW performs pentests, where our target is a live site (the way an attacker sees it), whereas Mythos Preview as used, for example, by Project Glasswing excels at auditing source code (the way a developer sees it). Interacting with the live site can be very powerful, but it brings a completely new, very delicate dimension into the mix. Does Mythos Preview change the balance here?

Due to the way we harvest our web benchmark set, you can actually find the vulnerability from the code alone on that set. So it’s fair to ask: for these benchmarks, can Mythos Preview find an exploit without being allowed to interact with the live site?

It turns out that even for these benchmarks, where the vulnerability is purely in the code, removing access to the live site hurts performance more than removing access to source code. In many ways, live-site access matters more than source-code access.
That, of course, is the XBOW value proposition: it gives frontier models a safe, structured way to interact with real application behavior and prove which findings are actually exploitable. The results of XBOW powered by Mythos Preview are shown below. We now have a solid answer to the question, “Can a model find someth
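
As an aside on the odds metric from the benchmark discussion above (hit rate divided by miss rate), here is a minimal Python sketch of how a cut in false negatives translates into odds. It is an illustration only, not XBOW's benchmarking code: the function names and the 50% baseline pass rate are assumptions made up for the example, and only the 42% reduction comes from the article.

```python
def odds(pass_rate: float) -> float:
    """Odds of discovery: hit rate divided by miss rate."""
    if not 0.0 <= pass_rate < 1.0:
        raise ValueError("pass_rate must be in [0, 1)")
    return pass_rate / (1.0 - pass_rate)


def pass_rate_after_miss_reduction(baseline_pass_rate: float, reduction: float) -> float:
    """Pass rate once the miss (false-negative) rate is cut by `reduction`."""
    miss_rate = 1.0 - baseline_pass_rate
    return 1.0 - miss_rate * (1.0 - reduction)


# ASSUMPTION: a 50% baseline pass rate, chosen purely for illustration.
# The article does not report absolute pass rates.
baseline = 0.50

# The 42% cut in false negatives is the figure quoted in the article.
improved = pass_rate_after_miss_reduction(baseline, 0.42)

print(f"baseline: pass rate {baseline:.2f}, odds {odds(baseline):.2f}")
print(f"improved: pass rate {improved:.2f}, odds {odds(improved):.2f}")
```

With this hypothetical 50% baseline, the pass rate moves from 0.50 to 0.71 while the odds move from 1.0 to about 2.45. This is one way to see why odds can separate strong models more visibly than raw pass rates do.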

Entities

Mythos Preview (product); Anthropic (vendor); Claude Code (product); Opus 4.7 (product); GPT 5.5 (product)