CTI-REALM: A new benchmark for end-to-end detection rule generation with AI agents

Source: Microsoft Security Blog

Author: Arjun Chakraborty

URL: https://www.microsoft.com/en-us/security/blog/2026/03/20/cti-realm-a-new-benchmark-for-end-to-end-detection-rule-generation-with-ai-agents/

ONE SENTENCE SUMMARY:

CTI-REALM, Microsoft’s open-source benchmark, evaluates how well AI agents turn threat intelligence reports into validated detection rules, end to end, across Linux, AKS, and Azure cloud environments.

MAIN POINTS:

CTI-REALM benchmarks real-world detection engineering, not memorization of threat-intelligence trivia.
Agents must read CTI reports, explore telemetry, iterate on KQL queries, and generate Sigma rules.
Ground-truth scoring validates outputs across Linux endpoints, AKS, and Azure cloud environments.
Benchmark extends prior investigation-focused evals by targeting detection rule generation workflows.
Dataset includes 37 curated public CTI reports suitable for sandboxed telemetry simulation.
Checkpoint scoring measures intermediate steps like technique mapping and data-source identification.
Tooling mirrors analyst environments: CTI repositories, schema explorers, a Kusto query engine, and MITRE ATT&CK and Sigma rule databases.
Business value comes from objective proof of AI impact on detection coverage and analyst productivity.
On CTI-REALM-50, Claude leads; GPT-5 with medium reasoning outperforms GPT-5 with high reasoning.
Removing CTI-specific tools reduces performance notably, especially final detection rule quality.
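
The checkpoint and ground-truth scoring described in the points above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not CTI-REALM's actual scoring code: the checkpoint names, the weights, the use of Jaccard overlap for technique mapping, and the use of F1 over event IDs for the final rule are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class CheckpointResult:
    name: str      # hypothetical checkpoint label
    score: float   # normalized to [0, 1]
    weight: float  # hypothetical relative importance


def jaccard(predicted: set, expected: set) -> float:
    """Overlap between predicted and expected labels (e.g. ATT&CK technique IDs)."""
    if not predicted and not expected:
        return 1.0
    return len(predicted & expected) / len(predicted | expected)


def f1_against_ground_truth(flagged: set, malicious: set) -> float:
    """F1 of the event IDs a rule flags versus the labelled malicious events."""
    true_positives = len(flagged & malicious)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(flagged)
    recall = true_positives / len(malicious)
    return 2 * precision * recall / (precision + recall)


def weighted_total(results: list) -> float:
    """Weighted average of checkpoint scores."""
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        raise ValueError("at least one checkpoint needs a non-zero weight")
    return sum(r.score * r.weight for r in results) / total_weight


# Example run with made-up agent output and ground truth.
results = [
    CheckpointResult("technique_mapping", jaccard({"T1059", "T1053"}, {"T1059"}), 1.0),
    CheckpointResult("data_source_identification", jaccard({"Syslog"}, {"Syslog"}), 1.0),
    CheckpointResult(
        "detection_rule",
        f1_against_ground_truth({"e1", "e2", "e3"}, {"e1", "e2", "e4"}),
        2.0,
    ),
]
print(round(weighted_total(results), 2))  # prints 0.71
```

Weighting the final rule more heavily than the intermediate checkpoints is one possible design choice; it keeps partial credit for correct reasoning steps while still rewarding a rule that actually fires on the labelled malicious events.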

TAKEAWAYS:

Effective security agents must operationalize CTI into detections, not just classify TTPs.
Intermediate workflow metrics reveal whether failures stem from comprehension, queries, or specificity.
Cloud detection tasks remain substantially harder than Linux and AKS scenarios.
Human-authored workflow guidance can meaningfully improve smaller models’ performance.
Open-sourcing enables shared benchmarking, safer adoption decisions, and community-driven improvements.
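
To illustrate how intermediate metrics can localize a failure, the following hedged Python sketch walks the workflow stages in order and reports the first one that falls below a threshold. The stage names follow the takeaway above (comprehension, queries, specificity); the 0.5 threshold and the score dictionary are assumptions made for the example, not values from the benchmark.

```python
from typing import Optional

# Hypothetical stage order, mirroring the workflow: understand the report,
# write working queries, then produce a specific detection rule.
STAGES = ("comprehension", "queries", "specificity")


def first_failing_stage(scores: dict, threshold: float = 0.5) -> Optional[str]:
    """Return the earliest stage scoring below the threshold, or None if all pass.

    A stage missing from `scores` is treated as a score of 0.0.
    """
    for stage in STAGES:
        if scores.get(stage, 0.0) < threshold:
            return stage
    return None


# Made-up scores: the agent understood the report but its queries were weak.
run = {"comprehension": 0.9, "queries": 0.3, "specificity": 0.1}
print(first_failing_stage(run))  # prints queries
```

Reporting the earliest failing stage, rather than the lowest score, reflects the idea that a weak query stage usually drags down everything after it.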