LLM API Abuse: The CVE Pattern Nobody Patches
There’s a class of vulnerability targeting AI systems that doesn’t have a CVE number, won’t get a Patch Tuesday fix, and isn’t tracked by your SIEM. It’s called LLM API abuse, and in 25 years of enterprise security, I’ve rarely seen a threat category mature this fast while flying this far under the radar.
This isn’t theoretical. In 2024 and into 2025, security researchers and red teams documented a repeatable attack pattern: adversaries targeting the APIs that front large language models, not to steal model weights, but to manipulate model behavior, extract sensitive data from retrieval-augmented generation (RAG) systems, and burn through inference budgets at scale. The attack surface is the API itself: authentication, rate limiting, input validation, and context handling. Every weakness enterprise teams spent the last decade engineering out of web APIs exists here, plus a new layer of vulnerabilities unique to probabilistic systems.
Let me break down what’s actually happening and what a real response looks like.
TL;DR
LLM API abuse is a vulnerability class — not a single CVE — where attackers exploit weak authentication, missing rate limits, and unvalidated inputs against AI inference endpoints. Unlike traditional software vulnerabilities, these don’t get patched by a vendor. Your team owns the control layer. This post covers the attack patterns and the controls that actually stop them.
What LLM API Abuse Actually Looks Like
The attack pattern breaks down into three distinct sub-classes. They’re often chained.
1. Unauthenticated or Weakly Authenticated Inference
This is the most common entry point. A team ships an internal AI tool, adds an API key for the LLM provider, and forgets that the internal endpoint accepting user prompts has no authentication of its own. The attacker doesn’t need to steal your Anthropic or OpenAI key — they just need your application’s exposed endpoint. From there, every request they send runs on your key, your billing, and your context window.
I’ve seen this in enterprise deployments where the team treated the LLM integration like an internal microservice — assuming network perimeter controls were sufficient. They weren’t. Cloud-native AI deployments rarely sit cleanly inside a network perimeter.
2. Context Window Extraction via Crafted Inputs
RAG-augmented systems are particularly vulnerable here. The attack works like this: a user sends a carefully structured prompt designed to make the model reveal what’s in its context window — system prompts, retrieved documents, conversation history, or injected data from a tool call. This isn’t prompt injection in the traditional sense (that’s a separate post). This is specifically targeting the context layer to extract data that was never meant to be user-visible.
In a regulated environment — financial services, healthcare, legal — the context window often contains data the user querying the model has no authorization to see. You built row-level security in your database. Did you build equivalent controls on what gets retrieved and injected into the model’s context?
3. Inference Budget Exhaustion
Less glamorous but operationally damaging: attackers deliberately trigger expensive, long-running inference requests to exhaust usage limits or spike costs. Unlike a traditional DDoS, this doesn’t require massive traffic volume. A small number of well-crafted requests — long contexts, multi-step chain-of-thought triggers, recursive tool calls — can generate disproportionate compute and billing impact. One red team exercise I’m aware of generated $4,000 in inference costs with fewer than 200 requests.
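The red team figure above works out to more than $20 per request on average. To see how a handful of requests gets there, here’s a rough cost model for a request that fans out into several model calls. It assumes each step re-sends the accumulated context, which is how agent and tool-call loops commonly behave. The function name, token counts, and per-token prices are all placeholders I made up for illustration, not any provider’s real rates.

```python
def multi_step_cost_usd(
    context_tokens: int,
    output_tokens_per_step: int,
    steps: int,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> float:
    """Estimate the cost of one request that fans out into several model calls.

    Assumes every step re-sends the accumulated context (original context plus
    all earlier outputs) as billable input.
    """
    total_input = 0
    total_output = 0
    context = context_tokens
    for _ in range(steps):
        total_input += context              # the whole context is billed again
        total_output += output_tokens_per_step
        context += output_tokens_per_step   # output is appended for the next step
    return (total_input / 1000) * usd_per_1k_input + (total_output / 1000) * usd_per_1k_output


# Placeholder prices and sizes, chosen only to show the amplification effect.
single = multi_step_cost_usd(100_000, 2_000, 1, usd_per_1k_input=0.01, usd_per_1k_output=0.03)
chained = multi_step_cost_usd(100_000, 2_000, 10, usd_per_1k_input=0.01, usd_per_1k_output=0.03)
print(f"single call: ${single:.2f}, ten-step chain: ${chained:.2f}")
```

With these made-up numbers, one long-context call costs about $1.06 and the same prompt driving a ten-step chain costs about $11.50. The request count is identical; the bill is roughly ten times larger.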
Why This Class of Vulnerability Has No CVE
The CVE system assumes a specific, patchable software flaw in a specific product version. LLM API abuse doesn’t fit that model, for three reasons.
First, the “vulnerability” is often a configuration or architectural decision, not a code defect. There’s no patch for “you didn’t rate limit your inference endpoint.” Second, the attack exploits the intended behavior of the model — asking an LLM to summarize what’s in its context isn’t a bug, it’s a feature. The risk emerges from how that feature is deployed. Third, the attack surface varies entirely by implementation. Two companies using the same model provider with different application architectures have completely different exposure profiles.
This means your vulnerability management program — however mature — won’t track this automatically. No feed, no CVE, no CVSS score. You have to own the assessment yourself.
The Security Controls That Actually Block These Attacks
I’ll skip the generic advice. Here’s what I’ve found actually works at the architecture level.
Authentication on every inference-adjacent endpoint, full stop. Not network perimeter controls. Actual auth at the application layer. Every endpoint that accepts a prompt or triggers an LLM call should require a valid, scoped token. Service accounts making internal calls should use short-lived credentials rotated by your secrets manager, not static API keys.
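As a minimal sketch of what “valid, scoped token” can mean in practice, here’s a standard-library example that issues and verifies short-lived HMAC-signed tokens. Everything in it is an assumption for illustration: the function names, the claim layout, the scope string `inference:invoke`, and the hard-coded key. In a real deployment you’d more likely use your identity provider’s tokens and pull the key from a secrets manager.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import List, Optional

# Hypothetical signing key; in practice this would come from a secrets manager.
SIGNING_KEY = b"replace-with-key-from-your-secrets-manager"


def _sign(payload: bytes) -> bytes:
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest().encode()


def issue_token(subject: str, scopes: List[str], ttl_seconds: int = 300,
                now: Optional[float] = None) -> str:
    """Issue a short-lived token carrying the caller's identity and scopes."""
    issued_at = time.time() if now is None else now
    claims = {"sub": subject, "scopes": scopes, "exp": issued_at + ttl_seconds}
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    return payload.decode() + "." + _sign(payload).decode()


def verify_token(token: str, required_scope: str,
                 now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if the token is authentic, unexpired, and in scope; else None."""
    try:
        payload_text, signature = token.rsplit(".", 1)
    except ValueError:
        return None  # no signature separator at all
    payload = payload_text.encode()
    # Constant-time comparison to avoid leaking signature bytes via timing.
    if not hmac.compare_digest(_sign(payload), signature.encode()):
        return None
    try:
        claims = json.loads(base64.urlsafe_b64decode(payload))
    except ValueError:
        return None
    current = time.time() if now is None else now
    if claims.get("exp", 0) <= current:
        return None  # expired
    if required_scope not in claims.get("scopes", []):
        return None  # authenticated, but not allowed to call this endpoint
    return claims
```

The prompt-handling endpoint would call `verify_token(...)` before it does anything else and reject the request when it gets `None`. The short default lifetime is the point: a leaked token is only useful for a few minutes.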
Rate limiting by user identity, not just by IP. IP-based rate limiting is trivially bypassed via distributed sources. Tie rate limits to authenticated user identity. Set hard caps on request frequency, context size per request, and daily inference budget per user or role. Most API gateway products support this. Most teams haven’t configured it for their AI endpoints specifically.
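If your gateway can’t do this, the logic is small enough to sketch. The example below keys all three caps (request frequency, context size per request, daily budget) on an authenticated user ID rather than an IP. The class names, the default thresholds, and the in-memory storage are assumptions for illustration; a production version would keep its counters in a shared store so every instance sees the same state.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Limits:
    # Hypothetical defaults; tune per user or role.
    max_requests_per_minute: int = 20
    max_context_tokens: int = 8_000
    max_daily_tokens: int = 200_000


class IdentityRateLimiter:
    """Enforces per-identity caps. State is in memory for the sake of the sketch."""

    def __init__(self, limits: Limits) -> None:
        self.limits = limits
        self._recent = defaultdict(deque)  # user_id -> timestamps of accepted requests
        self._daily = defaultdict(int)     # (user_id, day number) -> tokens consumed

    def check(self, user_id: str, context_tokens: int, now: float) -> Tuple[bool, str]:
        """Return (allowed, reason). Accepted requests are counted; rejected ones are not."""
        if context_tokens > self.limits.max_context_tokens:
            return False, "context_too_large"

        # Sliding 60-second window of accepted requests for this identity.
        window = self._recent[user_id]
        while window and now - window[0] >= 60:
            window.popleft()
        if len(window) >= self.limits.max_requests_per_minute:
            return False, "rate_limited"

        day = int(now // 86_400)
        if self._daily[(user_id, day)] + context_tokens > self.limits.max_daily_tokens:
            return False, "daily_budget_exceeded"

        window.append(now)
        self._daily[(user_id, day)] += context_tokens
        return True, "ok"
```

The `user_id` here should come from the verified token, never from a client-supplied header, or the limiter is as easy to bypass as the IP-based one.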
Context access controls that mirror your data authorization model. If a user at permission level X can’t see a document in your document management system, that document should not be retrievable into a context window when that user queries your RAG system. This requires building your retrieval layer to be authorization-aware — filtering retrieved chunks based on the querying user’s permissions before injection. This is not a feature most off-the-shelf RAG frameworks give you for free. You build it.
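Here’s roughly what “authorization-aware” looks like at the point where retrieved chunks are about to be injected. This is a sketch under stated assumptions: the `Chunk` type, the `allowed_groups` field, and the group-based permission model are hypothetical, and your real ACLs would come from whatever system already governs the source documents.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    # Groups permitted to read the source document (hypothetical ACL model).
    allowed_groups: FrozenSet[str]


def authorized_chunks(retrieved: Iterable[Chunk], user_groups: Iterable[str]) -> List[Chunk]:
    """Keep only chunks the querying user may read.

    Fails closed: a chunk with no ACL metadata matches no group and is dropped.
    """
    groups = frozenset(user_groups)
    return [chunk for chunk in retrieved if chunk.allowed_groups & groups]


def build_context(retrieved: Iterable[Chunk], user_groups: Iterable[str],
                  max_chunks: int = 5) -> str:
    """Filter first, then truncate, so unauthorized text never reaches the prompt."""
    visible = authorized_chunks(retrieved, user_groups)[:max_chunks]
    return "\n\n".join(chunk.text for chunk in visible)


chunks = [
    Chunk("hr-1", "salary bands", frozenset({"hr"})),
    Chunk("kb-1", "vpn setup", frozenset({"hr", "engineering"})),
    Chunk("tmp-1", "unlabeled export", frozenset()),
]
print(build_context(chunks, {"engineering"}))
```

The design choice that matters is filtering before injection instead of asking the model to withhold things. Anything that reaches the context window should be treated as potentially visible to the user.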
System prompt confidentiality hardening. You can’t fully prevent a determined adversary from inferring your system prompt through repeated probing. You can make it significantly harder by structuring prompts to avoid self-referential instructions, using output guardrails that flag responses revealing system-level instructions, and monitoring for patterns consistent with extraction attempts. Log and alert when model output length suddenly spikes for a given user — it’s a weak signal but worth capturing.
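One simple form of output guardrail is a verbatim-overlap check: flag any response that shares a long enough run of words with the system prompt. The sketch below is deliberately crude, and the function names and the six-word threshold are my assumptions. It won’t catch paraphrased or translated leaks, so treat it as one signal among several and not as a complete control.

```python
import re
from typing import Set, Tuple


def _word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercased word n-grams; returns an empty set if the text has fewer than n words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def leaks_system_prompt(system_prompt: str, model_output: str, n: int = 6) -> bool:
    """True if the output repeats any run of n consecutive words from the system prompt."""
    return bool(_word_ngrams(system_prompt, n) & _word_ngrams(model_output, n))


SYSTEM_PROMPT = (
    "You are the claims assistant. "
    "Never disclose internal escalation codes to customers."
)

leaky = "Sure! My instructions say: never disclose internal escalation codes to customers."
benign = "Your claim was approved yesterday."

print(leaks_system_prompt(SYSTEM_PROMPT, leaky), leaks_system_prompt(SYSTEM_PROMPT, benign))
```

A flagged response can be blocked, redacted, or just logged against the user’s identity so repeated extraction attempts show up as a pattern.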
Inference cost monitoring with automated circuit breakers. Set billing alerts and automated spend caps at the provider level. More importantly, monitor at the application level: requests per user per hour, average tokens per request, and requests with anomalously large context sizes. A request arriving with a 40,000-token context from a user whose average is 800 tokens is worth investigating. The circuit breaker is whatever throttles or blocks traffic automatically once one of those thresholds is crossed, before a human has looked at the alert.
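A minimal sketch of the application-level half: track each user’s running average, flag requests far above it, and trip a breaker when total usage in the current hour crosses a cap. The class name, the 10x anomaly factor, the minimum-history rule, and the hourly cap are assumptions for illustration, not recommended values.

```python
from collections import defaultdict


class InferenceMonitor:
    """Tracks per-user token usage and a global hourly budget (in-memory sketch)."""

    def __init__(self, anomaly_factor: float = 10.0, hourly_token_cap: int = 500_000,
                 min_history: int = 5) -> None:
        self.anomaly_factor = anomaly_factor
        self.hourly_token_cap = hourly_token_cap
        self.min_history = min_history
        self._count = defaultdict(int)        # user_id -> requests recorded
        self._total = defaultdict(int)        # user_id -> tokens recorded
        self._hour_tokens = defaultdict(int)  # hour number -> tokens across all users

    def is_anomalous(self, user_id: str, tokens: int) -> bool:
        """True if this request is far above the user's own average so far."""
        seen = self._count[user_id]
        if seen < self.min_history:
            return False  # not enough history to judge
        average = self._total[user_id] / seen
        return tokens > self.anomaly_factor * average

    def record(self, user_id: str, tokens: int, now: float) -> None:
        """Count a completed request toward the user's history and the hourly total."""
        self._count[user_id] += 1
        self._total[user_id] += tokens
        self._hour_tokens[int(now // 3600)] += tokens

    def circuit_open(self, now: float) -> bool:
        """True once usage in the current hour has reached the cap; callers should refuse work."""
        return self._hour_tokens[int(now // 3600)] >= self.hourly_token_cap


monitor = InferenceMonitor(anomaly_factor=10.0, hourly_token_cap=10_000, min_history=5)
for second in range(5):
    monitor.record("alice", 800, now=float(second))
print(monitor.is_anomalous("alice", 40_000))
```

With an 800-token average and a 10x factor, the 40,000-token request from the example above gets flagged because it exceeds the 8,000-token threshold. Check `is_anomalous` before calling the model and `circuit_open` before accepting any request at all.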
What a Realistic Threat Model Looks Like
Not every organization deploying AI is equally exposed. Here’s how I’d tier it.
Highest risk: Organizations with externally accessible AI endpoints, RAG systems indexing sensitive internal data, and no application-layer auth on inference calls. If this is you, this is a P1 gap.
High risk: Organizations with internal-only AI tooling but weak identity controls on those tools — shared service accounts, static keys in application configs, no rate limiting. The blast radius is limited by the perimeter, but insider threat and credential theft scenarios are wide open.
Moderate risk: Organizations with proper auth and rate limiting in place but no monitoring on context access patterns or inference cost anomalies. The controls exist but the detection layer is missing.
Lower risk: Organizations that have auth, rate limiting, context access controls, and monitoring in place, and that have actually tested these controls via red team or structured review. This is a small percentage of teams deploying AI today.
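If you want to run this tiering across an inventory of AI deployments, it can be encoded as a small function. This is one possible reading of the tiers above, not a scoring standard: the `Deployment` fields and the exact decision order are my assumptions, and edge cases (for example, an external endpoint with auth but no rate limiting) are judgment calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    # Hypothetical inventory fields, one per control discussed above.
    externally_accessible: bool
    app_layer_auth: bool
    identity_rate_limiting: bool
    authorization_aware_retrieval: bool
    anomaly_monitoring: bool
    controls_tested: bool


def risk_tier(d: Deployment) -> str:
    """Map a deployment's controls to a tier; the worst missing control wins."""
    if not d.app_layer_auth:
        # No application-layer auth: the perimeter is the only thing limiting blast radius.
        return "highest" if d.externally_accessible else "high"
    if not d.identity_rate_limiting:
        return "high"
    if not (d.authorization_aware_retrieval and d.anomaly_monitoring and d.controls_tested):
        return "moderate"
    return "lower"


public_chatbot = Deployment(True, False, False, False, False, False)
print(risk_tier(public_chatbot))
```

The ordering reflects the argument of this post: missing authentication dominates everything else, and detection only counts once the preventive controls exist.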
Key Takeaways
- LLM API abuse is a vulnerability class — authentication weaknesses, missing rate limits, unvalidated context access — not a patchable CVE. Your team owns the control layer.
- RAG systems are specifically exposed to context window extraction when retrieval isn’t authorization-aware. Most aren’t, by default.
- Inference budget exhaustion is a real operational risk, not just a theoretical one. A small number of crafted requests can generate significant cost impact.
- The remediation stack is well-understood: application-layer auth, identity-scoped rate limiting, authorization-aware retrieval, system prompt hardening, and cost monitoring with circuit breakers.
- If your AI deployment isn’t on your vulnerability management radar because it has no CVE, that’s the gap to fix first.
Traditional vulnerability management was built for a world where vendors ship patches and you apply them. AI deployment doesn’t work that way. The attack surface is largely in your architecture — and that means security teams need to own the assessment, not wait for a feed to tell them what’s broken.
The organizations getting this right are the ones treating AI inference endpoints with the same rigor they’d apply to any externally exposed API. That’s not a new discipline. It’s just new scope.