The announcement by Anthropic in April about its Claude Mythos model certainly created a stir, but now that the dust is settling, NCC Group has observed notable patterns in how organizations are preparing to apply the latest frontier LLMs to enterprise vulnerability management.
With security leaders hungry for insight and opinions, and some of the hype dying down around Mythos and Glasswing (we’ve seen good hands-on analyses emerge from Davi Ottenheimer and Cloudflare for example), we’ve observed organizations broadly falling into two camps:
- “First movers” embracing AI-enhanced vulnerability identification, paralleling what was described in the Mythos announcement (and has now been superseded by Anthropic’s Cyber Verification Program, CVP, OpenAI’s Trusted Access for Cyber, TAC, and other programs), focused on supplanting human expertise, driving efficiency improvements, and flattening or slashing pen testing budgets
- “Second movers” taking a more hands-on-tools, data-driven approach that emphasizes human expertise in the right places and identifies the best leverage points for efficiency; these organizations are considering investing above-budget in the short term to build more sustainable, future-resilient programs
We’re working with security leaders in both camps who, while acknowledging the reality of headline-driven, top-down “adopt AI now!” directives, are mainly leaning towards a “wait and see” approach given the ongoing pace of change in the underlying technology. Here’s what we’ve learned, and how we can apply it for your organization:
- AI output is highly compelling at first glance, but using AI effectively and evaluating its output turn out to be much more nuanced once true experts triage the results. This process actually creates a greater need for human triage initially, as even agentic AI workflows are not yet fully competent at intuiting human expectations for outcomes. Errors, hallucinations, drift, and other issues with AI output remain a concern. Effective AI-driven vulnerability discovery incorporates human expert oversight of testing, validation of output, and risk prioritization/remediation processes. For example, we see huge volumes of what initially appear to be CVSS 9+ issues that ultimately get deprioritized due to a lack of real-world reachability or exploitability, effective compensating controls, or even outright hallucination. Culling false positives spares significant downstream remediation effort.
- Future resiliency patterns become more observable in these engagements, which is critical given the likely continued increase in the power and capability of frontier/open-weight models and the skills/plugins wrapped around them (e.g., Claude Code, CoWork). Consider: if we found 3 Critical findings using Mythos, how many more would we find using the next release, Opus 4.7? Reframing expectations around this new reality is as important as the output itself.
- Cost, including token usage, is low in the published examples, which cover handfuls of vulnerabilities, but it starts to become a consideration at an enterprise scale of tens of thousands or millions of issues, especially when integrated into end-to-end vulnerability discovery processes.
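To make the cost point above concrete, here is a minimal back-of-the-envelope sketch of how token spend scales with the number of issues. The function name and every number in it (tokens consumed per issue, price per million tokens) are hypothetical placeholders chosen for illustration, not measured figures or vendor pricing; substitute your own.

```python
def estimate_token_cost(num_issues: int,
                        tokens_per_issue: int,
                        usd_per_million_tokens: float) -> float:
    """Return a rough spend in USD for running an LLM over `num_issues` issues.

    All three inputs are assumptions supplied by the caller; this is simple
    linear scaling and ignores retries, caching, and scaffolding overhead.
    """
    total_tokens = num_issues * tokens_per_issue
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical inputs: 50,000 tokens per issue at 10 USD per million tokens.
TOKENS_PER_ISSUE = 50_000
USD_PER_MILLION = 10.0

for issues in (10, 10_000, 1_000_000):
    cost = estimate_token_cost(issues, TOKENS_PER_ISSUE, USD_PER_MILLION)
    print(f"{issues:>9,} issues -> ~${cost:,.0f}")
```

Under these illustrative inputs the spend is negligible for a handful of issues and grows linearly from there, which is why cost only shows up as a planning item at enterprise scale.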
We’re not surprised by these results: we published our first research on adversarial ML way back in 2017 and have been working on advancing AI security ever since. Here’s how we’ve applied these learnings for clients via an AI-augmented consulting engagement that we’ve now performed multiple times:
- Scaffolding Tuning. As AISLE commented, “the moat is the system into which deep security expertise is built, not the model itself.” This means that the scaffolding built around powerful frontier and open-weight LLMs is what truly extracts value. NCC Group will evaluate your current state and quickly help you develop (from scratch if needed) highly evolved scaffolding around inputs to and outputs from your chosen LLM. We’ll cover key areas like prompting, skills/plugins, data pipelining, scripting, and feeding to/from external tools, with a focus on pre/post-processing and presentation. We can also quickly deploy our own AI-augmented solutions, including network pen testing and code review, where more “from scratch” deployment is required. You can expect a dramatically higher signal-to-noise ratio in the tuned findings output.
- Prioritization Refactoring. Even with tuned outputs, the increased volume of findings produced by AI-augmented assessments can be overwhelming (we saw one vendor claim to have discovered 45 million latent vulnerabilities on an enterprise network!). This glut of discovery is driving many organizations to adopt a “radical prioritization” approach that focuses on reachability, exploitability, compensating controls, and other factors to drastically filter findings to what is truly worth investing time and money to remediate.
- Remediation Reinforcements. All the focus on AI-driven vulnerability discovery misses a critical issue: remediation likely costs several times more than discovery! Additionally, AI automation fails to fully solve the problem due to (best practice) human-in-the-loop change management processes. Once again, in the short term, human expertise will be required to walk the complex path from opened tickets to in-production fixes, which inevitably requires deft navigation of not just technology, but also people and processes. NCC Group can deploy dozens of consultants on short notice to uplift organizations to AI-speed risk management.
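As an illustration of the “radical prioritization” idea described above, the following is a minimal sketch of a filter that keeps only findings that are validated, reachable, exploitable, and not already mitigated by a compensating control. The `Finding` fields and the `radical_prioritize` function are hypothetical names invented for this example; they do not represent NCC Group tooling, and in practice each flag would come from human expert review or other tooling rather than being set by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    title: str
    cvss: float
    validated: bool             # False if review found it was hallucinated
    reachable: bool             # attacker can actually reach the component
    exploitable: bool           # a practical exploit path exists
    compensating_control: bool  # an effective control already mitigates it


def radical_prioritize(findings: list[Finding]) -> list[Finding]:
    """Keep only findings worth remediating, highest CVSS first."""
    kept = [
        f for f in findings
        if f.validated
        and f.reachable
        and f.exploitable
        and not f.compensating_control
    ]
    return sorted(kept, key=lambda f: f.cvss, reverse=True)


# Hypothetical sample data for illustration only.
sample = [
    Finding("Unauthenticated RCE in internal API", 9.8, True, True, True, False),
    Finding("Deserialization bug in unused module", 9.1, True, False, True, False),
    Finding("SQL injection behind WAF rule", 8.6, True, True, True, True),
    Finding("Nonexistent endpoint flagged by model", 9.4, False, True, True, False),
    Finding("Privilege escalation via misconfigured role", 7.5, True, True, True, False),
]

for finding in radical_prioritize(sample):
    print(f"{finding.cvss:.1f}  {finding.title}")
```

A hard filter like this is deliberately blunt; a real program might instead down-weight findings with compensating controls rather than dropping them, so that they resurface if the control changes.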
Reach out to us today if NCC Group can help your organization navigate the myths and reality of AI cyber security, and elevate your vulnerability management with AI before the threat actors do.