There's a thought experiment I often return to when thinking about security operations. Imagine a SOC with perfect visibility: every packet, every process, every file modification across your entire infrastructure, all observable in real time. This omniscient SOC would never miss an attack. It would see everything, always.
Of course, this is impossible. The sheer volume of data generated by modern infrastructure makes comprehensive observation a fantasy. We're talking about a lattice of information so vast that even storing it would bankrupt most organizations, let alone analyzing it in real time.
So we've built something else entirely.
Modern SOCs operate on a fundamental compromise. Since we can't observe everything, we create detection rules — smart filters that surface potentially interesting events from the ocean of noise. An unusual login here, a suspicious process there, an anomalous network connection.
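To make that concrete, here's a deliberately simplified sketch of what such a rule looks like: a filter over a stream of events. The event fields and the threshold are hypothetical, and real detections are usually written in a SIEM's query language rather than Python, but the shape is the same.

```python
from collections import Counter

def failed_login_burst(events: list[dict], threshold: int = 10) -> list[dict]:
    """Toy detection rule: flag accounts with many failed logins.

    Event fields ("type", "outcome", "user") are illustrative, not a real schema.
    """
    failures = Counter(
        e.get("user")
        for e in events
        if e.get("type") == "auth" and e.get("outcome") == "failure"
    )
    return [
        {"rule": "failed_login_burst", "user": user, "count": count}
        for user, count in failures.items()
        if count >= threshold
    ]
```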
But here's what we rarely acknowledge: these alerts aren't answers. They're questions. When a detection rule fires, it's not saying "here's a threat." It's saying "start looking here." The alert is merely an entry point into that vast lattice of information, a suggested starting position for an investigation that must ripple outward.
This is where the real work begins: systematic exploration. Check authentication logs. Examine network connections. Review process history. Verify file modifications. Investigate related systems. Each step potentially reveals new threads to follow.
Mature SOCs codify this exploration into playbooks — not because analysts are incompetent, but because the space of possible checks is vast and human attention is finite. Even the best analyst might forget something critical in the heat of investigation.
Yet playbooks don't scale. Each alert type demands its own custom choreography. A suspicious login requires different checks than malware detection, which differs from data exfiltration. You quickly end up managing hundreds of constantly evolving documents. The solution to human limitations becomes its own operational nightmare.
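For a feel of the sprawl, imagine encoding just three playbooks as data (the alert types and check names below are invented). Multiply this by every alert type in your environment, then keep every list current as the environment changes.

```python
# Hypothetical playbook library: each alert type gets its own ordered
# choreography of checks, and each list evolves independently.
PLAYBOOKS: dict[str, list[str]] = {
    "suspicious_login": [
        "pull_auth_history",
        "check_source_ip_reputation",
        "review_mfa_events",
        "list_other_sessions_from_same_ip",
    ],
    "malware_detection": [
        "lookup_file_hash_reputation",
        "review_process_tree",
        "check_persistence_mechanisms",
        "scan_related_hosts",
    ],
    "data_exfiltration": [
        "summarize_outbound_volume",
        "check_destination_reputation",
        "review_user_data_access",
        "correlate_with_dlp_events",
    ],
    # ...hundreds more, each a living document in code form.
}
```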
SOC work involves three fundamentally different cognitive tasks:
Information retrieval: "What's the hash of this file?" Simple lookups where the answer exists; you just need to find it.
Reasoning about evidence: "Given these network connections and this process behavior, what's happening?" Pattern recognition, narrative building, and connecting disparate facts.
Systematic coverage: "Have I checked all necessary places?" This isn't about intelligence; it's about methodical completeness.
Modern AI excels at the first two. LLMs retrieve information with stunning accuracy and reason about complex scenarios better than many humans. Show an LLM a malware sample, and it will explain what it does, how it works, and its likely objectives.
But systematic coverage exposes a fundamental architectural limitation.
I gave o3 (OpenAI's most advanced reasoning model) a realistic investigation challenge. Not a checklist to follow, but a decision problem: here's a security alert and 30 possible investigation actions. Decide what to check.
Hidden in those 30 actions were 10 critical ones — the core of what an experienced analyst would recognize as essential. The other 20? Valid paths that might yield interesting information but aren't necessary for a thorough investigation.
Analysts face dozens of possible paths. But only some are critical. LLMs don't know which.
Across numerous runs, o3's behavior was revealing. It would select 8-12 actions each time, showing reasonable investigation instincts. But the specific actions varied wildly between runs. More concerning: critical checks were randomly omitted. In one run investigating a suspicious IP connection, it didn't check whether that IP was a known threat indicator. In another run, with a potential phishing alert, it forgot to verify whether any users had actually visited the malicious URL.
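You can reproduce the flavor of this failure with a toy model. The numbers below are invented (a 90% chance of selecting each critical action, 20% for each non-critical one), and real model behavior is far more structured than independent coin flips, but the structural point survives: when the set of checks is sampled rather than enumerated, covering the whole critical set only happens some of the time.

```python
import random

N_ACTIONS, N_CRITICAL = 30, 10          # mirrors the experiment above
P_CRITICAL, P_OTHER = 0.90, 0.20        # assumed inclusion probabilities

def simulate_investigation() -> set[int]:
    """One simulated run: each action is independently selected or skipped."""
    return {
        a for a in range(N_ACTIONS)
        if random.random() < (P_CRITICAL if a < N_CRITICAL else P_OTHER)
    }

runs = [simulate_investigation() for _ in range(10_000)]
critical = set(range(N_CRITICAL))
covered = sum(critical <= run for run in runs) / len(runs)

sizes = sorted(len(run) for run in runs)
print(f"median actions selected: {sizes[len(sizes) // 2]}")     # ~13
print(f"runs covering all 10 critical checks: {covered:.0%}")   # ~35%
```

Each simulated run looks like a defensible investigation on its own; the omissions only show up when you compare runs.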
The model wasn't wrong; its choices were defensible. But defensible isn't sufficient when you're the last line of defense.
Here's the architectural truth: LLMs generate text by sampling from probability distributions. Each decision about what to check next is probabilistic, not deterministic.
Sampling can never guarantee exhaustive coverage. Perfect memory, chain-of-thought prompting, even temperature zero don't change that: the model still commits to one plausible trajectory through the investigation instead of enumerating every required check.
When faced with 30 possible actions, an LLM doesn't systematically evaluate which subset guarantees coverage. It selects actions that seem relevant in the moment, building a plausible investigation narrative. This isn't an oversight; it's the design.
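Even if you push the per-check odds much higher, completeness compounds against you. Assuming (generously, and purely for illustration) that each of the ten critical checks is independently included 97% of the time:

```python
p_per_check = 0.97   # assumed probability that a given critical check is made
n_critical = 10      # critical checks, as in the experiment above

p_full_coverage = p_per_check ** n_critical
print(f"{p_full_coverage:.0%}")   # ~74%: roughly one run in four misses something
```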
The real goal isn’t speed or efficiency. It’s correctness.
When an analyst closes a case with “no compromise found,” that statement needs to rest on full, exhaustive analysis — not a plausible guess, not a smart shortcut. You need guarantees. Every time.
LLMs can help prioritize. They can help interpret. But when it comes to coverage — ensuring every critical check has been made — probabilistic outputs don’t cut it. You can’t secure an enterprise on “likely.”
In security, missing one path is enough. 74% coverage isn’t a weaker guarantee; it’s no guarantee at all.
At Qevlar, we don’t use LLMs to orchestrate investigations. We built a system from first principles to guarantee mechanical completeness. Every critical path. Every time.
This isn’t about smarter AI — it’s about different AI. One that separates reasoning from coverage, because coverage demands structure, not inference.
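As a sketch of what that separation means architecturally (an illustrative toy, not Qevlar's implementation): coverage becomes a plain loop over a required-check list that cannot skip anything, and the LLM is only consulted to interpret what the checks return.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    name: str
    run: Callable[[dict], dict]   # gathers evidence for one required check

def investigate(alert: dict, required: list[Check], interpret: Callable) -> dict:
    # Coverage: deterministic iteration -- every required check runs, every time.
    findings = {check.name: check.run(alert) for check in required}
    # Reasoning: the probabilistic part is applied to the evidence,
    # never trusted to decide which evidence to gather.
    return {"findings": findings, "assessment": interpret(alert, findings)}
```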
LLMs will transform how we interpret what we find. But ensuring we’ve looked in the right places, every time? That’s a different problem. That’s a job for boring, deterministic machinery. And boring still wins.