CIO Influence

Simbian Publishes World’s First Cyber Defense Benchmark; Finds Frontier LLMs Alone Do a Poor Job at Attack Discovery

LLMs Find and Exploit Vulnerabilities but Fail at Defense Out-of-the-Box without a Sophisticated Harness

Simbian®, the self-improving SecOps company, announced formation of the Simbian Research Lab and released the Simbian Cyber Defense Benchmark to test large language models (LLMs) on detecting MITRE ATT&CK chains in complex realistic scenarios.

Frontier models are good at finding and exploiting software vulnerabilities, but when it comes to cyber defense, none of the tested models earned a passing score. Anthropic Claude Opus 4.6, the best of the 11 models tested, detected an average of 46% of attack evidence per MITRE tactic, and every model effectively missed entire attack categories. For a complete summary of results, see Simbian’s blog post published today; the full research is also available on arXiv.

Simbian Research Lab developed the Cyber Defense Benchmark to represent realistic but advanced attacks – the kind that, if solved by LLMs, would mark a fundamental pivot point in Security Operations. Creating a rigorous benchmark for a task is often considered the first step toward enabling LLMs to solve it.

Existing cyber benchmarks ask models to answer questions about attacks. The Cyber Defense Benchmark is the first to use real attack telemetry in an agentic investigation format. Models from Anthropic, OpenAI, and Google, along with leading open-weight models from Alibaba, MiniMax, DeepSeek, and Moonshot AI, were each run in a simple ReAct loop and asked to identify the attacker and its tactics. Anthropic Opus 4.6 found 3x more flags than Google Gemini 3 Flash, but at roughly 100x the cost.
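A ReAct loop of the kind described above alternates between the model reasoning about what it has seen, taking an action (such as querying telemetry), and observing the result. The following is a minimal sketch of that pattern, assuming a toy in-memory telemetry store and a scripted stand-in for the LLM; all names, log lines, and the `react_investigate` harness are hypothetical illustrations, not Simbian's actual benchmark code.

```python
from dataclasses import dataclass

# Toy telemetry store standing in for real attack telemetry (hypothetical data).
TELEMETRY = {
    "auth": [
        "failed login admin from 10.0.0.7 (x50)",
        "success login admin from 10.0.0.7",
    ],
    "process": [
        "powershell -enc <base64> spawned by winword.exe",
    ],
}

@dataclass
class Finding:
    tactic: str     # MITRE ATT&CK tactic name
    evidence: str   # the telemetry that supports it

def scripted_model(observations):
    """Stand-in for an LLM: maps observations so far to the next action."""
    if not observations:
        return ("query", "auth")   # Thought: start with authentication logs
    if any("powershell" in o for o in observations):
        # Thought: suspicious child process found -> report a finding
        return ("report", Finding("Execution", "encoded PowerShell from winword.exe"))
    if any("failed login" in o for o in observations):
        return ("query", "process")  # Thought: brute force seen, pivot to process logs
    return ("report", Finding("Unknown", "no clear evidence"))

def react_investigate(model, max_steps=5):
    """Reason -> act -> observe until the model reports or steps run out."""
    observations, findings = [], []
    for _ in range(max_steps):
        action, arg = model(observations)
        if action == "query":          # Act: pull a slice of telemetry
            observations.extend(TELEMETRY.get(arg, []))  # Observe
        elif action == "report":       # Final answer
            findings.append(arg)
            break
    return findings
```

Running `react_investigate(scripted_model)` walks auth logs, pivots to process telemetry after spotting the brute-force pattern, and reports an Execution finding, mirroring the query-then-conclude shape of the benchmark's investigation format.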

“Our research shows you can’t throw an LLM dart in the dark and expect to hit the cyber defense bullseye,” said Ambuj Kumar, Founder and CEO of Simbian. “The same frontier models that perform strongly during cyberattacks struggle on the defense side. Defense is fundamentally harder than offense as it requires reasoning across noisy, partial evidence rather than executing against a known target. The LLMs must be accompanied by outside intelligence in the form of a sophisticated harness. Simbian has been able to get 95% accuracy in production enterprise environments on cyber defense SecOps following some of these techniques.”

“We know the large models can do amazing things, but can we measure their efficacy in analyzing machine logs for security events?” said Richard Stiennon, Chief Research Analyst at cybersecurity industry analyst firm IT-Harvest. “This benchmark answers that question. In contrast to existing AI security benchmarks, this benchmark was designed to be difficult to game. It uses real telemetry rather than curated questions, mutates context to prevent memorization, enforces deterministic scoring against ground truth, and tracks detection cost alongside accuracy.”
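Deterministic scoring against ground truth, as described in the quote above, can be illustrated with a simple set comparison: each reported (tactic, flag) pair either matches a ground-truth pair or it does not, and per-tactic detection rates average into an overall score like the “46% of attack evidence per MITRE tactic” figure. This is a hypothetical sketch of that idea, not Simbian's scoring code.

```python
def score(reported, ground_truth):
    """Deterministically score reported (tactic, flag) pairs against ground truth.

    Returns (overall, per_tactic) where per_tactic maps each ground-truth
    tactic to the fraction of its flags the model found, and overall is the
    unweighted average across tactics.
    """
    truth = set(ground_truth)
    hits = truth & set(reported)  # only exact matches count; no partial credit
    per_tactic = {}
    for tactic, _ in truth:
        total = sum(1 for t, _ in truth if t == tactic)
        found = sum(1 for t, _ in hits if t == tactic)
        per_tactic[tactic] = found / total
    overall = sum(per_tactic.values()) / len(per_tactic)
    return overall, per_tactic
```

Because scoring is a pure function of the reported pairs and the ground truth, two runs over the same transcript always yield the same score, which is what makes the benchmark hard to game.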
