← all writing

AI Security

Threat Modeling Generative AI: What 11,658 Incidents and the Research Actually Show

Most real-world harm from generative AI does not come from prompt injection. Analysis of 11,658 incidents shows output handling and misinformation dominate.

Threat modeling generative AI: what 11,658 incidents and the research actually show#

Read the security press and you would think generative AI has one problem: prompt injection. Conference talks, vendor pitches, and every refreshed OWASP list keep returning to the same fear. An attacker slips instructions into the context window. The model does something it should not.

Prompt injection is real. I have found it in production. I have reported it in red-team work. But if it dominates your threat model, you will miss the failures that actually hurt people.

This article rests on two things. First, I analyzed the emmanuelgjr/genai-incidents dataset on Hugging Face: 11,658 documented AI incidents covering real-world failures, vulnerability disclosures, research demonstrations, threat reports, and red-team findings. Second, I checked those patterns against recent academic work on LLM security, AI incident analysis, and generative AI misuse. The dataset and the literature mostly agree. Where they differ, the difference is useful.

The answer is not what the hype says. Most real-world harm from generative AI does not come from clever jailbreaks or adversarial suffixes. It comes from plain output handling, plain misinformation, and plain trust in a machine that produces plausible text. The cases everyone remembers, Samsung engineers leaking code into ChatGPT, the Chevrolet chatbot offering a $1 SUV, Air Canada’s chatbot inventing a refund policy, and the NYC chatbot telling businesses to break the law, have little to do with prompt injection. They have everything to do with how organizations deploy systems they do not fully control.

I walk through what the record and the research show, where they clash with industry assumptions, and how to turn that into a threat model you can use.

Methodology: reading 11,658 incidents and the papers around them#

The emmanuelgjr/genai-incidents dataset is a public corpus of AI security and safety incidents. At the time of writing it held 11,658 records. Each includes a title, description, date, severity, confidence level, category, attack vector, affected sector, and mappings to MITRE ATLAS tactics, OWASP LLM Top 10 risks, NIST AI RMF functions, and the OWASP Agentic Security Initiative.

I care most about these categories:

  • Real-world incidents: failures that hit actual users, organizations, or societies.
  • Vulnerability disclosures: security issues found before exploitation.
  • Research demonstrations: proof-of-concept attacks.
  • Threat reports: intelligence on AI-enabled threats.

The split matters. A supply-chain vulnerability in a model hub is a pre-ship finding. A chatbot giving dangerous advice to a vulnerable user is a live incident. They point to different controls.

The academic work I use falls into four groups. Surveys such as Fung et al.’s Security Concerns for Large Language Models: A Survey (2025) and the ACM survey Unique Security and Privacy Threats of Large Language Models (2025) organize threats into inference-time attacks, training-time attacks, misuse, and agentic risks. Empirical studies such as An Empirical Study of Production Incidents in Generative AI Cloud Services (arXiv:2504.08865) analyze real production failures. Misuse and taxonomy research such as Marchal et al.’s Generative AI Misuse: A Taxonomy of Tactics and Insights from Real-World Data and the Standardized Threat Taxonomy for AI Security, Governance, and Regulatory Compliance (arXiv:2511.21901v1) classify real-world harms. Disclosure studies such as the analysis of AI vendor vulnerability policies (arXiv:2509.06136) show which failure modes the industry is willing to call bugs.

The dataset gives scale. The literature gives rigor. I use both.

The dataset at a glance#

Of 11,658 records, about 53% are real-world incidents and 43% are vulnerability disclosures. The rest are research, threat reports, red-team findings, and regulatory items. That split alone should shape how you think about risk. For every vulnerability found before ship, there is a real-world failure in the record.

Distribution of documented AI incidents by category

Incidents are also accelerating. Before 2015 the record is thin. From 2015 to 2019 it grows. From 2020 onward it climbs fast. Some of that is better reporting, but the directional signal is hard to ignore: the surface is expanding faster than most security programs can adapt.

Documented AI security incidents by year

Severity is lopsided. Among real-world incidents, 42.8% are Medium, 37.2% High, and 20.0% Critical. Only 0.1% are Low. Generative AI systems sit in high-leverage roles such as customer support, content generation, decision support, and code assistance, so even a wrong sentence can carry legal, financial, or safety weight.

Incident severity by category

What the research agrees on#

Before the findings, a quick check: do the dataset’s patterns show up in academic work? Yes, with one caveat about vocabulary.

The Standardized Threat Taxonomy (arXiv:2511.21901v1) validated 133 AI Incident Database records from mid-2025 to late 2025 and found 61% misuse, 27% unreliable outputs, and 5% supply chain. The Google/Jigsaw Generative AI Misuse study looked at roughly 200 misuse incidents and identified opinion manipulation, monetization and profit, scam and fraud, harassment, and reach as the main attacker goals. The Empirical Study of Production Incidents in Generative AI Cloud Services found that production GenAI incidents cluster around reliability and output quality more than adversarial input attacks.

Our dataset, mapped to OWASP LLM Top 10, says the same thing in different words. Output handling and misinformation together cover about 77% of real-world incidents. Supply chain is small in live incidents but huge in disclosures. Prompt injection is real but not dominant.

Real-world generative AI harm: dataset findings vs. academic corroboration

The convergence matters. It means the pattern is not an artifact of one dataset’s labels. It is how generative AI systems actually fail.

Finding 1: prompt injection is over-indexed#

Here is the correction the dataset makes most forcefully. Prompt injection is treated as the defining threat of the LLM era. It is technically interesting, adversarially elegant, and fits a familiar story: attacker input bypasses control.

In the real-world corpus, prompt injection is not the dominant failure mode. Among 6,181 real-world incidents, it shows up in about 2.3% of cases. Jailbreaks are under 1%. Adversarial inputs are about 0.5%. The whole class of “attacker manipulates the model through input” is a single-digit share of documented harm.

OWASP LLM Top 10 risks: real-world incidents vs. vulnerability disclosures

This does not mean prompt injection is unimportant. It means threat models should be built from the incident record, not from the attack class that red teamers find most interesting.

The research confirms the technical importance while showing the operational gap. Greshake et al.’s 2023 ACM AISec paper demonstrated indirect prompt injection in Bing Chat and other integrated applications. Liu et al.’s USENIX Security 2024 work formalized and benchmarked the attack. Debenedetti et al.’s Agentdojo (NeurIPS 2024) built an evaluation environment for agent prompt injection. Fung et al. and the ACM survey both list prompt injection as a major inference-time threat.

The distinction is between technical importance and operational frequency. Prompt injection matters because it breaks the instruction-data boundary that LLMs assume. It is operationally rare because most production failures do not need an attacker. They need only a model that produces plausible but wrong text and a user who trusts it.

The dataset contains real prompt-injection cases. OpenHands was hijacked by indirect prompt injection to download and execute remote code, turning it into a compromised “ZombAI.” Early Bing Chat testers extracted its initial prompts through prompt injection. Anthropic’s Claude.ai was found vulnerable to silent data exfiltration and malicious redirects via prompt injection combined with API misuse and open redirects. These are serious. They are also not the majority.

Stop treating prompt injection as the center of the LLM threat model. Treat it as one input-vector risk among several, and spend at least as much effort on what happens after the model generates output.

Finding 2: output handling is the biggest source of real-world harm#

The largest category of real-world generative AI harm is OWASP LLM05: Improper Output Handling. It appears in 42.0% of real-world incidents. The system said something wrong, harmful, or actionable, and someone treated it as true.

This is where the famous cases live.

Air Canada’s chatbot made up a bereavement fare policy. The airline tried to argue in court that the chatbot was “a separate legal entity responsible for its own actions.” The court disagreed and held Air Canada liable. A federal judge sanctioned lawyers who submitted briefs with hallucinated case citations in Whiting v. City of Athens. Google’s Bard and Gemini were sued for generating defamatory claims linking an activist to sexual assault and extremism. Cursor’s AI support agent invented a single-device login policy and told multiple users about it confidently, triggering cancellations and backlash until humans stepped in.

None of these are prompt-injection attacks. No attacker crafted input. The system produced bad output, and a human, workflow, or downstream system acted on it. The threat is not in the context window. It is in the gap between generation and verification.

The literature uses different words for the same thing. The Standardized Threat Taxonomy calls it “Unreliable Outputs” and finds it in 27% of validated AIID incidents. The Empirical Study of Production Incidents in Generative AI Cloud Services finds output-related symptoms and hallucinations among the most common production issues. Li et al.’s taxonomy of GenAI failure modes, cited in What People See (and Miss) About Generative AI Risks, classifies these as design-related, development-related, and use-related failures. In other words, sociotechnical failures, not pure technical vulnerabilities.

Classical application security thinking helps here. We have known for decades that untrusted output must be encoded, validated, and never acted on without authorization. We built output encoding to stop XSS. We built canonicalization to stop path traversal. We built schema validation to stop deserialization bugs. Generative AI needs the same discipline applied to natural language: every LLM output is untrusted until proven otherwise, especially when it drives a workflow, a support decision, or a legal claim.

The problem also appears in quieter forms. A coding assistant suggests a deprecated library. A documentation generator misstates an API behavior. A support chatbot gives a refund policy that does not exist. Each is a small output-handling failure. Each becomes a security or liability incident when trusted and acted on.

If your threat model starts and ends with prompt injection, you will build input filters and miss the larger problem: the model can lie, hallucinate, or confabulate on its own, and your organization is not ready to catch it.

Finding 3: misinformation is a security outcome#

The second major real-world category is OWASP LLM09: Misinformation. It appears in 35.2% of real-world incidents. False, misleading, or contextually wrong content, amplified at machine scale.

Traditional security teams struggle to own this because it does not look like a vulnerability. There is no CVE for “the model generated a convincing falsehood.” There is no patch. But it is a security outcome because it erodes trust, drives harmful decisions, and can be weaponized.

Top specific attack vectors in real-world generative AI incidents

Deepfakes are the most visible form. The record includes voice-cloning frauds, synthetic videos of politicians, fabricated endorsements, and AI-generated news pages that rewrite real stories with synthetic images to farm engagement. During India’s 2024 general elections, deepfakes of deceased politicians were used to influence voters. A YouTube crime page was found to be entirely AI-generated. A UK firm’s CEO was impersonated by AI voice technology in a €220,000 fraud.

The academic work is extensive. Marchal et al.’s Generative AI Misuse taxonomy analyzes roughly 200 observed misuse incidents and identifies opinion manipulation, monetization, scam and fraud, harassment, and reach as the main attacker goals. The Standardized Threat Taxonomy collapses much of this under “Misuse” and finds it in 61% of validated incidents. The framing is broader than OWASP’s “Misinformation” because it includes intentional abuse, but the pattern is the same: generated content becomes the attack vector.

Misinformation does not require malice. Grok falsely suggested police misrepresented London far-right rally footage. A Bing Chat demo contained hallucinated financial statements for Gap and Lululemon. The NYC chatbot told businesses they could break the law. These are systems producing confident wrong answers in high-stakes contexts, then trusted because they looked authoritative.

You cannot treat generative AI outputs like a SQL result. A SQL result is correct or incorrect relative to the database. An LLM summary is probabilistically plausible. Your threat model must include the output consumer: who acts on it, what authorization they have, what verification exists, and what happens when the output is wrong.

If you are building a RAG-based internal search tool, the important question is not “can someone inject a prompt?” It is “what happens when the retrieved chunk is out of context and the model confidently mis-summarizes it?” That failure mode is far more likely to hurt you than a jailbreak.

Finding 4: disclosure and real-world harm are inverted#

The split between vulnerability disclosures and real-world incidents tells almost opposite stories about where risk lives.

In disclosures, OWASP LLM03, Supply Chain, dominates at 97.6%. Researchers spend their time on model hubs, notebooks, plugins, adapters, and training pipelines, and they find supply-chain issues everywhere: poisoned models on Hugging Face, typosquatted repositories, backdoored MCP servers on npm, malicious LoRA adapters, compromised training pipelines.

In real-world incidents, supply chain is about 6.2%. The risks that actually materialize are output handling (42.0%), misinformation (35.2%), system prompt leakage (11.4%), sensitive information disclosure (11.0%), and excessive agency (10.9%). Prompt injection is 2.3%.

This inversion matters. If your goal is to find bugs before release, weight supply chain heavily. If your goal is to prevent operational incidents, weight output handling and misinformation heavily. Most organizations need both, but they currently over-invest in the first and under-invest in the second.

The gap also explains why vendor research and practitioner experience feel disconnected. Vendors show impressive supply-chain findings because researchers find them. Practitioners live through output-handling failures because users trigger them. A good threat model triangulates both.

The literature adds a third lens. A 2025 study of AI vendor vulnerability disclosure policies (arXiv:2509.06136) found that 36% of AI vendors provide no disclosure channel at all, and only 18% explicitly mention AI security in their policies. More importantly, hallucinations are accepted as in-scope by only 17% of vendors, and jailbreaking by only 27%, while traditional security objectives like confidentiality and integrity are accepted by 90%. The industry is structurally less willing to treat the failures that dominate real-world incidents as bugs worth reporting.

If the people building the systems do not classify output-handling failures and misinformation as security issues, it is no surprise that threat models underweight them.

Finding 5: classic AppSec still kills AI systems#

One of the humbling findings from the record is how often generative AI systems fail through ordinary application security issues. The model is new. The surrounding code often is not.

A production LLM application is not just a model. It is a web service, an API gateway, a retrieval pipeline, a vector database, a tool-calling layer, a memory store, and usually several third-party integrations. Each has the same attack surface as any other web application. The novelty of the model distracts teams from securing the plumbing.

The dataset contains XSS in chatbot frontends, authentication bypass, path traversal, SSRF in tool-calling layers, deserialization in model artifacts, SQL injection through agent-generated queries, and over-permissive SharePoint indexing in Copilot deployments. The Microsoft 365 Copilot data exposure, where confidential SharePoint files surfaced to employees through inherited permissions, is not an LLM attack. It is a data-governance failure wearing an LLM interface.

I tell clients to threat model the AI system as two things at once: a machine-learning artifact and a distributed web application. STRIDE applies to the latter. PASTA applies to the latter. The OWASP Top 10 applies to the latter. If you hire an LLM security specialist who does not understand application security, you will find prompt injection and miss the SSRF in the tool-calling layer.

The surveys by Fung et al. and the ACM survey both stress that LLM-integrated applications inherit traditional web security problems plus new ML-specific risks. The Empirical Study of Production Incidents in Generative AI Cloud Services finds that root causes of production GenAI incidents often include misconfigurations, dependency failures, and resource exhaustion, classic cloud operations issues, alongside model-specific failures.

Your threat model should treat the wrapper with the same suspicion as the model inside it.

Finding 6: supply chain is the sleeping giant#

The supply-chain finding deserves its own section because the gap between disclosure and real-world incident is so large.

In disclosed vulnerabilities, supply chain risk is enormous. Researchers showed how to upload a model that looks like EleutherAI but answers factually wrong questions. PoisonGPT demonstrated LLM supply-chain disinformation through a Hugging Face typosquat: a modified GPT-J variant uploaded to ‘EleuterAI’ instead of ‘EleutherAI’ was downloaded over 40 times before takedown. A malicious MCP server backdoor on npm contained dual reverse shells, one at install time and one at runtime, giving persistent remote access to agent environments. OpenClaw, an open-source AI agent with 135,000 GitHub stars, became the first major AI agent security crisis of 2026 with critical vulnerabilities, malicious marketplace exploits, and over 21,000 exposed instances.

These are serious, credible, demonstrated attacks.

In real-world incidents, supply chain is still a small fraction. That does not mean it is overblown. It means it is latent. The ratio will flip as AI agents gain the ability to fetch, install, and execute code on behalf of users. When an agent can browse to a package manager and install a tool because a user asked for a feature, every supply-chain attack from the past decade becomes an LLM attack surface.

I treat supply chain as a high-confidence future risk. Assume any model, adapter, embedding, tool, or plugin the system consumes could be compromised. Controls should include artifact verification, model signing, sandboxed execution, least-privilege tool access, and inventory of third-party AI components. The fact that it has not yet produced a wave of public breaches is not a reason to deprioritize it. It is a reason to prepare before the wave arrives.

Finding 7: excessive agency is the multiplier#

OWASP LLM06, Excessive Agency, appears in about 10.9% of real-world incidents. That sounds small, but it is the multiplier that turns every other failure category into a breach.

A hallucination is embarrassing. A hallucination followed by an automated email to a customer, a database write, or a payment instruction is a security incident. A prompt injection against a read-only chatbot is a research finding. A prompt injection against an agent that can read email, call APIs, and post to Slack is a breach waiting to happen.

This is the lesson of EchoLeak and the broader class of agentic attacks. In June 2025, Microsoft patched a CVSS 9.3 vulnerability in Microsoft 365 Copilot that let an attacker exfiltrate sensitive corporate data, emails, SharePoint files, and Teams messages with a single crafted email. No clicks. No malware. No code execution. The payload was plain natural language hidden in an ordinary-looking business document.

The breach did not happen because the model was tricked. It happened because the model was trusted with action. The root cause was not a missing input filter. It was the combination of retrieval, reasoning, and action in one chain, with no human authorization at the action step.

The research on agentic risk is expanding fast. Fung et al.’s survey emphasizes autonomous LLM agents, goal misalignment, strategic deception, and “sleeper agent” behaviors. The Systematic Analysis of MCP Security and related work on agent tool markets show that the tool layer is already becoming a supply-chain and agency problem. The Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents documents how benchmarks struggle to keep up with multi-agent and memory-injection risks.

The principle is Least Agency: constrain not just what an agent can access, but what it can do without human confirmation. Every tool call, database write, outbound request, and email should be classified by blast radius and require approval above a threshold. This is the agentic equivalent of least privilege, and it belongs at the center of any generative AI threat model.

Finding 8: harm concentrates in specific sectors#

Real-world AI incidents are not evenly distributed. The dataset tags each incident by sector, and the concentration is clear.

Most affected sectors in real-world AI incidents

The most affected sector is media, entertainment, sports, and arts, followed by politics, automotive, technology, education, health, and banking. This reflects where generative AI has been deployed most aggressively and where the consequences of bad outputs are most visible.

Media and politics are dominated by deepfakes and misinformation. Automotive is dominated by autonomous driving incidents and adversarial-input failures. Health is dominated by unsafe advice, bias, and privacy violations. Banking is dominated by fraud, impersonation, and data exposure. A sector-aware threat model asks better questions than a generic one.

In health, the threat is not primarily prompt injection. It is the model giving unsafe advice to a vulnerable user, as in the reported case of a 16-year-old who allegedly received suicide method guidance from ChatGPT-4o before his death. In politics, the threat is synthetic media used to manipulate elections. In banking, it is voice-cloning fraud and unauthorized transactions. A threat model that treats every deployment as a chatbot-security problem will miss the risks that matter in the domain.

Finding 9: confidence and severity do not line up neatly#

The dataset includes a confidence field: high, medium, or low. Among real-world incidents, high-confidence records dominate at about 82%. Medium-confidence is about 11%, and low-confidence about 7%.

Confidence vs. severity in real-world AI incidents

Confidence does not correlate cleanly with severity. High-severity and critical-severity incidents appear across all confidence levels. You cannot dismiss a reported incident just because the initial sourcing is thin. Some of the most consequential failures, especially in politics, health, and law, start as single reports before they are confirmed.

Treat incident intelligence as a signal, not a court case. You do not need absolute attribution to ask: “If this happened to us, what would break?” That question is the point of threat modeling.

Finding 10: the incident surface is accelerating#

The temporal trend in the dataset is clear. Before 2015, documented AI incidents are sparse. From 2015 to 2019 the count grows. From 2020 onward it climbs sharply. Better reporting plays a role, but the directional signal is hard to ignore: the surface is expanding faster than security maturity.

This has two implications. First, historical baselines understate future risk. If you calibrate your threat model to 2022 incidents, you will miss the agentic and supply-chain risks emerging in 2025 and 2026. The dataset already contains agentic incidents, MCP supply-chain compromises, and autonomous AI agent crises that did not exist as categories a few years ago.

Second, the window for proactive threat modeling is closing. Organizations that build AI threat models now, while the record is still readable, will have an advantage over those that wait for a failure to force reactive work. The cost of being early is small. The cost of being late is a public incident.

Finding 11: MITRE ATLAS tactics point to impact#

Mapping real-world incidents to MITRE ATLAS tactics confirms that generative AI attacks are outcome-oriented. The most common tactics are Impact, Defense Evasion, Initial Access, Exfiltration, and ML Model Theft, in that order.

MITRE ATLAS tactics observed in real-world AI incidents

The dominance of “Impact” fits the OWASP finding that output handling and misinformation drive most harm. These are not attacks that linger in the network or escalate privileges in the traditional sense. They produce an immediate effect: a wrong decision, a leaked secret, a defamatory statement, a fraudulent transfer.

“Defense Evasion” ranks high because many generative AI failures hide in plain sight. A hallucinated legal citation looks normal until someone checks it. A deepfake video looks normal until it is forensically examined. A chatbot’s wrong answer looks normal until it is compared against policy. The evasion is semantic, not technical.

“Initial Access” and “Exfiltration” remind us that generative AI systems are also targets for traditional adversaries. A model endpoint, vector database, memory store, and RAG pipeline all contain data worth stealing. Attackers treat these as assets.

Do not only ask “how does someone break in?” Ask “what does a successful outcome look like for the attacker?” The answer will often be a generated artifact, not a compromised host.

Finding 12: NIST AI RMF mappings show thin controls#

The dataset maps incidents to NIST AI RMF functions: Govern, Map, Measure, and Manage. Most real-world incidents map cleanly to Map and Manage, where organizations are supposed to understand their systems and respond to failures. Relatively few map cleanly to Govern and Measure.

Organizations are failing at the operational end of AI risk management more often than at the policy end. Policies exist. Measurement and governance are weaker. A company can have an AI acceptable-use policy and still deploy a chatbot that gives unsafe advice, because no one measured the failure modes or governed the deployment against the policy.

Controls must be operational, not just documented. It is not enough to write “we will not deploy unsafe AI.” You need a Map function that finds where unsafe outputs can occur, a Measure function that tests for them, and a Manage function that responds when they happen. Threat modeling is the bridge between policy and operations.

What this data is good for, and what it misses#

No dataset is complete. This one has real value, but you should know its edges before you bet a security program on it.

The dataset is strong on breadth. Eleven thousand records across categories, sectors, frameworks, and years is enough to see patterns that a smaller sample would hide. The mappings to MITRE ATLAS, OWASP LLM Top 10, and NIST AI RMF let you ask structured questions. The split between real-world incidents and vulnerability disclosures is especially useful; it keeps pre-ship research findings separate from live failures, and the two corpora point to different controls.

It is also strong on recency. The record includes incidents from 2025 and 2026 that would not appear in older academic datasets, so it captures agentic risks, MCP supply-chain issues, and open-weight model crises as they emerge.

But the dataset has gaps.

Reporting bias. Public, severe, and interesting incidents are over-reported. Quiet operational failures, internal incidents, and small-scale harms are under-reported. A customer-support chatbot that gives ten bad answers a day is probably not in the dataset unless one of those answers becomes a lawsuit or a viral post.

Category looseness. Some labels are broad. The “other” attack vector is the largest single bucket in real-world incidents, which means a significant share of the corpus resists clean classification. The “rce” label in particular seems to capture a wide range of incidents, not all of which are classical remote code execution. The OWASP and ATLAS mappings are useful but not always precise.

Attribution uncertainty. The confidence field helps, but not every high-severity incident is equally well-sourced. Some records rely on single news reports or preliminary disclosures.

Geographic and sector skew. The record is weighted toward English-language media, Western organizations, and consumer-visible sectors like media, politics, and technology. Incidents in enterprise back-office systems, manufacturing, and non-English-speaking regions are likely undercounted.

No counterfactual. The dataset tells you what happened. It does not tell you what was prevented by good controls, nor does it tell you the base rate of deployments. A category with few incidents might be rare, or it might just be rarely reported.

Future-dated and projected entries. Some records extend to 2026 because the dataset includes future-dated incidents and threat reports. I focused temporal analysis on 2015, 2025 to avoid skew.

The right way to use this data is as directional intelligence, not as a precise frequency table. The patterns are robust: output handling and misinformation dominate real-world harm, supply chain dominates disclosures, prompt injection is real but smaller than the discourse suggests. The exact percentages should be read with the usual incident-database caveats.

The academic literature has its own limits. Surveys are retrospective and lag the newest attacks. Empirical incident studies rely on publicly reported cases and face the same reporting biases. Taxonomy papers disagree on category boundaries, making direct percentage comparisons approximate. I have noted where mappings are loose.

Correlation is not causation. The fact that output handling appears in 42% of real-world incidents does not mean it causes 42% of all AI harm. It means that, among documented incidents, output handling is the most common failure category. That is still the right place to focus, because it is the most observable and reproducible failure mode.

Common mistakes in generative AI threat modeling#

After working through the record, the literature, and client engagements, I see the same mistakes repeatedly.

Treating the LLM as the only threat surface. The model is a component. The system is the surface. Threat model only the model and you will miss the API, the RAG pipeline, the tool layer, the memory store, and the downstream consumers.

Starting with prompt injection. It is real but not the main source of real-world harm. Start with output handling and misinformation, then work back to input risks.

Confusing research demonstrations with operational risk. A jailbreak paper is interesting. A chatbot giving a wrong refund policy is expensive. Both belong in the threat model, but they need different controls and priorities.

Ignoring the consumer of the output. The model does not cause harm by generating text. It causes harm when someone acts on that text. Your threat model must include the human, workflow, or system downstream.

Under-investing in supply chain because it has not broken yet. The disclosed-vulnerability corpus and the literature both warn that this is coming. Prepare now.

Applying STRIDE without adaptation. STRIDE works for deterministic systems. It is not enough for agentic AI, where the breach crosses a chain of legitimate actions. Use STRIDE for the wrapper, and add layer-crossing and agency analysis for the AI-specific parts.

Measuring only security, not safety. A model can be secure against prompt injection and still produce harmful outputs. Your threat model must include misinformation, unsafe advice, bias, and confabulation.

A four-layer threat model#

From these findings, I use a four-layer model with clients. It does not replace STRIDE or PASTA. It is a lens you hold up to those methods to make sure you ask the right questions.

Layer 1: the wrapper#

The application around the model: web UI, API, authentication, network boundaries, secrets management, logging. Threat model it like any other web application. The OWASP Top 10 applies. STRIDE applies. Your existing SDLC controls apply.

Key questions: Where is input validated? Where is output encoded? What can an unauthenticated caller do? What can an authenticated low-privilege caller do? Are model endpoints exposed to the internet? Where are secrets stored?

Layer 2: the model interface#

The boundary where user input, retrieved context, system prompts, and tool responses meet. This is where prompt injection, indirect prompt injection, jailbreaks, and system prompt leakage live. Important, but not the whole game.

Key questions: What data enters the context window? Which of it is attacker-influenced? Is there a clear separation between system instructions and untrusted input? Can an attacker poison retrieved context? Can they exfiltrate the system prompt? What happens if they succeed?

Layer 3: the output channel#

This is where most real-world harm originates. The model produces text, code, a summary, a recommendation, or a decision. A human, workflow, or another system consumes it. This layer is about truthfulness, harm, and authorization.

Key questions: Who acts on the output? What verification happens before action? Can the output trigger irreversible operations? Is there a human in the loop for high-stakes decisions? How do you detect and recover from hallucination or misinformation?

Layer 4: the action layer#

This is where agency multiplies risk. Tools, plugins, function calls, database writes, API invocations, agent-to-agent communication. This is the layer that turns a bad output into a breach.

Key questions: What tools can the model invoke? What authorization does each tool have? Is there approval for high-impact actions? How are tool responses validated? Can one compromised tool escalate to others? How do you attribute an action to an agent versus a user?

A complete threat model traces attacks across all four layers. A prompt injection at Layer 2 becomes exfiltration at Layer 4. A poisoned document at Layer 1 becomes misinformation at Layer 3. A backdoored tool at Layer 1 becomes excessive agency at Layer 4. Threats do not stay in their boxes.

What this means for your threat model#

If you are threat modeling a generative AI system this month, here is how I would apply the evidence.

Do not start with prompt injection. Start with output handling. Ask what happens when the model is wrong, not just when it is attacked. The highest-frequency real-world failure is a bad output being trusted, not a clever input bypassing a filter.

Model misinformation as a security threat, not a quality issue. If your system generates content that influences decisions, contracts, public statements, or safety-critical operations, you need controls comparable to those for any other high-risk output: verification, provenance, human approval, and rollback.

Secure the wrapper with the same rigor as the model. The incident record is full of AI systems compromised through ordinary AppSec failures. Do not let the novelty of the model distract you from the familiarity of the surrounding code.

Treat supply chain as inevitable. It has not produced the same volume of public breaches as output handling, but the research findings are clear. Any system that can fetch and execute models, adapters, or tools on behalf of users needs artifact verification, sandboxing, and least-privilege execution.

Apply Least Agency. The most dangerous generative AI systems are not the ones that say wrong things. They are the ones that do wrong things without asking. Classify every action by blast radius and require human approval for high-impact operations.

Calibrate by sector. A healthcare chatbot, a political-media monitoring tool, a coding assistant, and an autonomous vehicle perception system have different dominant failure modes. Generic LLM threat models will miss the risks that matter most in the domain.

A threat modeling checklist for generative AI#

Here is the checklist I use when reviewing a generative AI threat model.

Design stage#

  • Have you identified every consumer of model output?
  • Have you classified outputs by blast radius if they are wrong?
  • Have you defined human-in-the-loop requirements for high-impact actions?
  • Have you specified which tools the model can invoke and under what conditions?
  • Have you documented the trust boundaries between system prompt, user input, retrieved context, and tool responses?

Build stage#

  • Are input validation and output handling applied to all model interfaces?
  • Are model artifacts, adapters, and tools sourced from trusted, verifiable origins?
  • Are tool calls sandboxed and least-privilege?
  • Are secrets separated from model context?
  • Is the wrapper application covered by your existing AppSec program?

Deploy stage#

  • Is there monitoring for hallucination, misinformation, and unsafe outputs?
  • Is there an incident response plan for AI-specific failures?
  • Are model actions attributable to a user, a session, or an agent identity?
  • Is there a rollback path for bad model behavior?
  • Have you tested the system with adversarial inputs and out-of-distribution scenarios?

Operate stage#

  • Are output channels periodically audited for trust assumptions?
  • Are new tool integrations reviewed for agency expansion?
  • Is the threat model updated when the model, tools, or use cases change?
  • Are real-world incidents from your sector reviewed for relevance?

If you cannot check most of these boxes, your threat model has gaps. That is normal. The point is to know where the gaps are and prioritize them.

What you can do this week#

Theory is useful only if it becomes action. Three concrete things for the next five working days.

Day one: inventory your output channels. For every generative AI system your organization runs, list where the output goes. Does a human read it? Does a workflow act on it? Does it write to a database, send an email, or call an API? Mark each channel with a trust level. Any channel that drives action without verification is your highest priority.

Day two: run one misinformation scenario. Pick your most business-critical AI-generated output. Construct a scenario where the model produces a plausible but wrong answer in a high-stakes context. Trace what happens next. Who sees it? Who acts on it? What controls catch it? Most teams discover the answer is “no one and nothing” until a customer or regulator complains.

Day three: map the four layers. Take your architecture diagram and overlay the four layers from this article. Mark every trust boundary. Mark every place where external data enters the context. Mark every tool the model can call. Mark every action that happens without human approval. The gaps you find are your roadmap.

Conclusion#

Generative AI security is not one new problem. It is a bundle of old and new problems spread across an architecture that most organizations do not yet understand. The incident record and the academic literature agree on the central fact: the failures that actually happen are, by volume, failures of output handling, misinformation, and excessive trust in generative systems. The failures researchers most like to demonstrate, prompt injection, jailbreaks, adversarial suffixes, are real and important, but they are not the whole story.

A good threat model must be grounded in what breaks, not in what is interesting. The evidence points to a simple principle: treat the model as a source of plausible, unverified text; treat the surrounding system as a web application; treat every action as a privilege; and treat every output as a potential falsehood until proven otherwise.

The organizations that get this right will not be the ones with the most exotic red-team findings. They will be the ones that built their threat models from the incident record and the research, and then did the boring work of controlling what the system is allowed to say and do.


References#

  1. emmanuelgjr/genai-incidents dataset, Hugging Face. A curated corpus of 11,658 documented AI incidents with mappings to MITRE ATLAS, OWASP LLM Top 10, NIST AI RMF, and OWASP Agentic Security Initiative frameworks.

  2. Fung, B. C. M. et al. Security Concerns for Large Language Models: A Survey. arXiv:2505.18889v5, 2025.

  3. Unique Security and Privacy Threats of Large Language Models: A Comprehensive Survey. ACM Computing Surveys, 2025. doi:10.1145/3764113.

  4. An Empirical Study of Production Incidents in Generative AI Cloud Services. arXiv:2504.08865v2, 2025.

  5. Marchal, N., Xu, R., Elasmar, R., Gabriel, I., Goldberg, B., & Isaac, W. Generative AI Misuse: A Taxonomy of Tactics and Insights from Real-World Data. arXiv:2406.13843v1, 2024.

  6. Standardized Threat Taxonomy for AI Security, Governance, and Regulatory Compliance: A Unified Taxonomy of Threat Vectors in Generative and Agentic AI and Machine Learning Systems. arXiv:2511.21901v1, 2025.

  7. Standardised Schema and Taxonomy for AI Incident Databases in Critical Digital Infrastructure. arXiv:2501.17037v1, 2025.

  8. Seeking Human Security Consensus: A Unified Value Scale for Generative AI Value Safety. arXiv:2601.09112v1, 2026.

  9. What People See (and Miss) About Generative AI Risks: Perceptions of Failures, Risks, and Who Should Address Them. arXiv:2604.22654v1, 2026.

  10. AI Vulnerability Disclosure: A Study of Vendor Policies and Practices. arXiv:2509.06136, 2025.

  11. Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., & Fritz, M. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, 2023, pp. 79, 90.

  12. Liu, Y., Jia, Y., Geng, R., Jia, J., & Gong, N. Z. Formalizing and benchmarking prompt injection attacks and defenses. 33rd USENIX Security Symposium, 2024.

  13. Debenedetti, E., Zhang, J., Balunović, M., Beurer-Kellner, L., Fischer, M., & Tramèr, F. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. NeurIPS 37, 2024, pp. 82895, 82920.

  14. Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents. arXiv:2605.16282v1, 2026.

  15. Systematic Analysis of MCP Security. arXiv:2508.12538, 2025.

esc

Type to search. to navigate, to open.