Source: Microsoft Security Blog
Author: Arjun Chakraborty
URL: https://www.microsoft.com/en-us/security/blog/2026/03/20/cti-realm-a-new-benchmark-for-end-to-end-detection-rule-generation-with-ai-agents/
CTI-REALM: A new benchmark for end-to-end detection rule generation with AI agents
ONE SENTENCE SUMMARY:
CTI-REALM, Microsoft’s open-source benchmark, evaluates how well AI agents turn threat intelligence reports into validated detection rules end to end, across Linux, AKS, and Azure cloud environments.
MAIN POINTS:
- CTI-REALM benchmarks real-world detection engineering, not memorization of threat-intelligence trivia.
- Agents must read CTI reports, explore telemetry, iterate on KQL queries, and generate Sigma rules.
- Ground-truth scoring validates outputs across Linux endpoints, AKS, and Azure cloud environments.
- The benchmark extends prior investigation-focused evaluations by targeting detection rule generation workflows.
- Dataset includes 37 curated public CTI reports suitable for sandboxed telemetry simulation.
- Checkpoint scoring measures intermediate steps like technique mapping and data-source identification.
- Tooling mirrors analyst environments: CTI repositories, schema explorers, a Kusto query engine, and MITRE ATT&CK and Sigma rule databases.
- Business value comes from objective proof of AI impact on detection coverage and analyst productivity.
- Results on CTI-REALM-50 show Claude leading; GPT-5 with medium reasoning effort outperforms GPT-5 with high reasoning effort.
- Removing CTI-specific tools reduces performance notably, especially final detection rule quality.
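
The checkpoint scoring described in the points above can be illustrated with a minimal Python sketch. This is not CTI-REALM's actual scoring code: the checkpoint names, the F1-based set comparison, the weights, and the `ProcessEvents` table name are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


def f1(predicted: set[str], expected: set[str]) -> float:
    """F1 overlap between a predicted set and a ground-truth set."""
    if not predicted or not expected:
        return 0.0
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


@dataclass
class AgentOutput:
    techniques: set[str]    # ATT&CK technique IDs mapped from the report
    data_sources: set[str]  # telemetry tables chosen for the detection
    rule_score: float       # final rule quality in [0, 1], scored elsewhere


# Hypothetical weights; the summary does not state the benchmark's weighting.
WEIGHTS = {"techniques": 0.25, "data_sources": 0.25, "rule": 0.5}


def checkpoint_score(output: AgentOutput, truth: AgentOutput) -> dict[str, float]:
    """Score each intermediate checkpoint, then combine into a weighted total."""
    scores = {
        "techniques": f1(output.techniques, truth.techniques),
        "data_sources": f1(output.data_sources, truth.data_sources),
        "rule": output.rule_score,
    }
    scores["total"] = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return scores


# Example: the agent finds one of two techniques and the right table,
# but its final rule earns only half credit.
truth = AgentOutput({"T1059.004", "T1105"}, {"ProcessEvents"}, 1.0)
agent = AgentOutput({"T1059.004"}, {"ProcessEvents"}, 0.5)
result = checkpoint_score(agent, truth)
```

Keeping per-checkpoint scores alongside the total is the point of the design: two agents with the same total can fail at different stages, and only the breakdown shows which.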
TAKEAWAYS:
- Effective security agents must operationalize CTI into detections, not just classify TTPs.
- Intermediate workflow metrics reveal whether failures stem from comprehension, queries, or specificity.
- Cloud detection tasks remain substantially harder than Linux and AKS scenarios.
- Human-authored workflow guidance can meaningfully improve smaller models’ performance.
- Open-sourcing enables shared benchmarking, safer adoption decisions, and community-driven improvements.
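
As a rough illustration of how intermediate metrics can separate failure modes, the sketch below compares the event IDs a detection matched against ground-truth malicious event IDs. The function name, the failure labels, and the 0.8 threshold are hypothetical and do not come from the benchmark.

```python
def diagnose_detection(matched: set[str], malicious: set[str],
                       threshold: float = 0.8) -> str:
    """Classify a detection result by comparing matched event IDs to ground truth."""
    if not matched:
        # The query returned nothing, e.g. wrong table or an over-tight filter.
        return "no_matches"
    hits = len(matched & malicious)
    precision = hits / len(matched)
    recall = hits / len(malicious) if malicious else 0.0
    if recall < threshold:
        return "low_coverage"     # the rule misses malicious events
    if precision < threshold:
        return "low_specificity"  # the rule also fires on benign events
    return "ok"


# A rule that catches the one malicious event but also three benign ones.
noisy = diagnose_detection({"e1", "e2", "e3", "e4"}, {"e1"})
# A rule that catches only one of three malicious events.
narrow = diagnose_detection({"e1"}, {"e1", "e2", "e3"})
```

Here `noisy` is classified as a specificity problem and `narrow` as a coverage problem, even though both rules matched the same malicious event `e1`.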