Red Teaming Generative AI: Why Language Is the New Exploit Vector

In 2025, the UK AI Safety Institute and Gray Swan AI ran 1.8 million adversarial attacks against 22 of the world’s most advanced AI models. Every single one broke. Not some. Not most. All of them.

That finding should reshape how the cybersecurity community thinks about generative AI. We’ve spent decades building security around code, networks, and endpoints. Now, the fastest-growing attack surface in enterprise technology runs on something we never had to defend before: natural language.

I seldom speak publicly; however, I recently had the opportunity to speak at Maharshi Markandeshwar University on red teaming generative AI. This article distils the core ideas from that session, which I had limited time to speak on, not as a tutorial but as a framework for how cybersecurity practitioners should think about this rapidly evolving threat landscape.

The Scale of the Problem

The statistics tell a story of an industry that adopted generative AI at breakneck speed while security capabilities lagged far behind.

This image has an empty alt attribute; its file name is image-1-1024x426.avif

Perhaps most telling: Adversa AI’s 2025 report highlights that 35% of real-world AI security incidents were caused by simple prompts, not sophisticated exploits, not zero-days, but carefully crafted sentences. Some of those incidents caused losses exceeding $100,000 each. Netskope’s 2026 Cloud & Threat Report documented an average of 223 GenAI data policy violations per month per organisation. We are dealing with a threat class where the exploit is a paragraph.

CrowdStrike, through its Pangea acquisition, now tracks over 150 distinct prompt injection techniques using a dual taxonomy: IM-series codes for injection methods (delivery channels: email, web, file) and PT-series codes for prompting techniques (manipulation styles: roleplay, encoding, logic traps). The sheer volume of catalogued techniques highlights that this is not a single vulnerability class; it is an expanding attack discipline.

Your Security Skills Already Transfer

One of the most important reframes I offer to cybersecurity professionals encountering AI security for the first time: you already know more than you think. The attack patterns are not new; the medium is.

SQL injection exploits the failure to separate code from data in database queries. Prompt injection exploits the same failure in the AI context windows. Cross-site scripting executes attacker-controlled code inside a victim’s trusted browser context. Indirect prompt injection executes attacker-controlled instructions inside a trusted AI context. Reverse engineering decompiles an application to expose its logic. System prompt extraction reveals an AI application’s business rules, guardrails, and access to tools.

The attacker mindset, the kill-chain methodology, and the structured approach to vulnerability assessment all carry over. What changes is that the “code” is language, the “vulnerability” is context confusion, and the “exploit” is a well-crafted sentence. For security professionals willing to make this conceptual leap, AI red teaming is not a foreign discipline; it’s an extension of what they already do.

Five Layers of Exposure

A generative AI system is not a monolith. It’s a stack with at least five distinct layers, each presenting unique attack opportunities.

At the model layer, the weights, training data, and fine-tuning pipeline are targets. Data poisoning and backdoor insertion attack the model’s “DNA” before it ever reaches production. At the prompt layer, system prompts define the application’s behaviour, and extracting them is equivalent to decompiling the app. The context layer where RAG systems pull in external documents and vector databases introduces a supply-chain risk: whoever controls what is retrieved controls what the model “knows.”

The integration layer is where models connect to real-world tools via APIs, similar to how most applications in traditional infrastructure also use plugins and protocols such as MCP (Model Context Protocol). Every integration is a potential path for lateral movement. And at the agent layer, where autonomous workflows with multi-step planning, persistent memory, and tool access operate, the attack surface explodes. OWASP recognised this by releasing a dedicated Top 10 for Agentic Applications in December 2025.

The fundamental vulnerability is architectural: LLMs cannot separate instructions from data. There is no memory boundary, no privilege ring, no kernel/userspace distinction. Instructions and content share the same channel in the context window. This is not a bug that can be patched. Instead, the path forward is to model threats with this inherent property in mind and focus on compensating controls that work with, rather than against, the shared channel. Approaches such as robust input validation, output monitoring, context segmentation, and strict instruction hierarchies offer practical ways to reduce risk even when the channel itself cannot be fully separated. This means the impact can be unprecedented, not only in the IT domain but also in the real world.

The Intelligence Paradox

Among the most counterintuitive findings in AI security research is the intelligence paradox: more capable models are more susceptible to adversarial manipulation, not less.

A 2025 academic study (CHATS Lab) presented a taxonomy of 40 persuasion techniques for LLM jailbreaking. Persuasion-based attacks (PAP) reached a 92% success rate across aligned frontier models, including GPT-4. An arXiv study analysing over 1,400 adversarial prompts found that roleplay dynamics had an attack success rate of 89.6%, logic traps had an attack success rate of 81.4%, and encoding tricks had an attack success rate of 76.2%. For defenders, these high success rates highlight the importance of actively allocating a portion of red-team testing specifically to roleplay and logic-trap scenarios, rather than focusing solely on technical or syntactic exploits. Integrating persuasion-based and social-engineering-style prompts into test suites ensures that teams evaluate how well their guardrails withstand manipulation through language, not just code.

The reason is structural. Better contextual understanding, the very thing that makes frontier models useful, makes them easier to manipulate with well-crafted language. Authority appeals, emotional manipulation, reciprocity, and urgency frame the same psychological principles that underpin social engineering against humans work against AI systems. The difference is that AI systems process these at scale, without fatigue, and often without logging the manipulation.

This has profound implications for security strategy. You cannot simply wait for models to get “smarter” and assume that safety will improve. The attack surface grows with capability.

Indirect Prompt Injection: The XSS of the AI Era

If direct prompt injection is the user typing malicious input, indirect prompt injection is far more insidious: the attacker never interacts with the AI system at all. Instead, they embed instructions in content that the AI will later consume, such as a website, email, document, or code repository.

The attack chain follows a pattern: plant malicious instructions in external content, wait for the AI assistant to ingest that content during normal operations, the AI executes the hidden instructions because it cannot distinguish data from commands, and sensitive data is exfiltrated, or unauthorised actions are taken without the user ever seeing the attack happen. What makes this even more concerning is the asymmetry of effort involved: a defender must safeguard a sprawling ecosystem, while an attacker only needs to slip a single well-crafted prompt into any accessible content. Planting these hidden instructions rarely requires expert technical skill or significant resources—just a basic understanding of how AI systems parse language and a few minutes to embed a phrase in a document, website, or email. This low barrier widens the threat, enabling almost anyone to launch such an attack.

Real-world examples have moved well beyond proof of concept. In 2025, a security researcher demonstrated that a hidden prompt embedded in a phishing email could rewire Microsoft Outlook Copilot into exfiltrating MFA codes to an attacker-controlled server, triggered simply by Copilot summarising the email. In January 2026, Varonis disclosed a single-click data exfiltration attack against Microsoft Copilot Personal via a legitimate Microsoft link. GitHub Copilot Chat was assigned CVE-2025-53773 with a CVSS score of 9.6, remote code execution via prompt injection through a poisoned code repository. Another exfiltration vector uses Markdown image rendering: CVE-2025-32711 (dubbed EchoLeak) demonstrated zero-click data exfiltration from Microsoft Copilot by tricking the model into generating a hidden image tag that silently sent sensitive data to an attacker-controlled URL.

The analogy to XSS is precise. In cross-site scripting, untrusted input is executed as code in a trusted browser context. In indirect prompt injection, untrusted data is executed as instructions in a trusted AI context. Same architectural failure, new medium and arguably broader impact, because AI assistants often have access to email, calendars, files, and enterprise APIs simultaneously.

When Agents Enter the Picture

The rapid deployment of agentic AI systems, with autonomous workflows that can browse the web, send emails, execute code, and make multi-step decisions, has fundamentally expanded the blast radius of these vulnerabilities.

MITRE’s October 2025 ATLAS update, developed in collaboration with Zenity Labs, added 14 new attack techniques specifically targeting AI agents. These include context poisoning (manipulating the information an agent uses to make decisions), memory manipulation (altering long-term memory so malicious changes persist across sessions), thread injection, and RAG credential harvesting using the LLM itself to search for credentials inadvertently stored in the retrieval database.

The real-world incidents are equally striking. Rogue MCP servers have been documented as injecting malicious code into IDE environments such as Cursor. CVE-2025-59944describes a low-interaction RCE vulnerability in MCP-integrated development environments that requires no user interaction. And a novel attack class unique to generative AI has emerged: package hallucination, where an LLM confidently recommends a software package that does not exist, an attacker registers that package name with malware, and developers install it, trusting the AI’s recommendation.

The supply chain dimension cannot be ignored either. The DeepSeek crisis of January 2026, in which exposed databases revealed user data and API keys, leading multiple governments to ban the model from government systems, is a reminder that who built your model and what they left exposed matter. AI supply chain risk is traditional supply chain risk amplified by the opacity of model training.

Agentic AI does not just add new vulnerabilities; it amplifies existing ones. A prompt injection that merely generates harmful text in a chatbot becomes a prompt injection that triggers unauthorised financial transactions, data exfiltration, or code execution when the same vulnerability exists in an agent with tool access.

Frameworks Are Catching Up

The good news is that the framework landscape has matured significantly. MITRE ATLAS now catalogues 15 tactics, 66 techniques, 46 sub-techniques, 26 mitigations, and 33 real-world case studies structured identically to MITRE ATT&CK, making it immediately familiar to any security team that already uses the ATT&CK framework for threat modelling.

OWASP has been particularly prolific, releasing the Top 10 for LLM Applications (2025 edition), a dedicated Gen AI Red Teaming Guide in January 2025, and the Top 10 for Agentic Applications in December 2025. The NIST AI Risk Management Framework, supplemented by IR 8596 (the Cyber AI Profile), provides governance-level guidance. And MITRE’s SAFE-AI initiative maps ATLAS threats directly to NIST SP 800-53 controls, identifying 100 controls as affected by AI.

Regulatory pressure is accelerating the urgency. The EU AI Act mandates adversarial testing for high-risk AI systems by August 2026, with penalties of up to €35 million or 7% of global annual turnover. This is not advisory guidance; it is law, and it makes red teaming a compliance requirement for any organisation deploying AI in the European market.

A practical insight: roughly 70% of ATLAS mitigations map to existing security controls. SOC knowledge, SIEM experience, and incident response playbooks all apply. The gap is not in capability, it’s in knowing where to apply established controls in an AI-specific context. To make this actionable, teams can start by:

1. Tuning SIEM alert logic to recognise GenAI-specific events and context anomalies.

2. Updating SOC playbooks to include prompt injection and agent misuse scenarios in routine investigations.

3. Running incident response tabletop exercises with simulated AI exploitation to rehearse cross-team coordination.  

Consciously mapping your existing controls to these concrete starting points accelerates practical coverage and builds AI fluency into your current security posture.

This image has an empty alt attribute; its file name is image-5-1024x568.avif

The Defence Landscape Is Consolidating Fast

The AI security tooling ecosystem has exploded in the last 18 months, and an equally significant consolidation wave is underway. On the open-source side, NVIDIA NeMo Guardrails offers programmable topic control, PII detection, and RAG grounding. Meta’s PurpleLlama suite includes Llama Guard 4 (now multimodal), Prompt Guard 2, and CyberSecEval 4, the most comprehensive open-source defence toolkit available. LLM Guard provides 35 MIT-licensed scanners deployable as a single API call. Guardrails AI has built the largest community-driven validator library and launched the Guardrails Index in February 2025, benchmarking 24 guardrail implementations across six categories.

For teams approaching this crowded tooling landscape, the right selection hinges on aligning core security requirements with product capabilities. Cost, latency overhead, integration depth, and regulatory fit are primary axes to consider. Lightweight scanners like LLM Guard can be rapidly integrated as a first line of defence with minimal latency impact and no commercial licensing barriers, making them ideal for rapid pilots. More complex solutions, such as NeMo Guardrails or PurpleLlama, can be customised to fit enterprise-grade needs, but require greater effort to tailor policy controls and manage updates. Where deep RAG integration or sophisticated prompt validation is needed, these modular frameworks may offer the flexibility to align with complex use cases. Comparing tools through the lens of ongoing support, extensibility, and compatibility with your existing API stack turns the catalogue into a practical procurement shortlist rather than a list of features.

On the enterprise side, Protect AI’s ModelScan provides the industry’s first dedicated ML model scanner, like npm audit for machine learning artefacts. HiddenLayer was selected for the US Missile Defence Agency’s SHIELD contract in December 2025, providing airgapped AI security for classified environments, a signal of how seriously defence agencies take AI supply chain threats. Prompt Security introduced the first GenAI authorisation framework in March 2025, essentially IAM for AI features and content. Arthur Shield monitored over one billion tokens across 2025 production deployments.

All three major cloud providers now offer built-in AI security as a first-class feature: AWS Bedrock Guardrails (blocking 88% of harmful content with 99% accuracy), Azure AI Content Safety with Prompt Shields, and Google Vertex AI’s Gemini-powered jailbreak detection. AI security has moved from an afterthought to a platform differentiator.

But the most telling signal of the market’s maturity is the M&A activity. In the last year alone, Lakera Guard was acquired by Check Point, Robust Intelligence by Cisco, CalypsoAI by F5 Networks, and Protect AI’s Guardian product by Palo Alto Networks. The pattern is unmistakable: traditional cybersecurity giants are absorbing AI-specific security startups. AI defence capabilities are being embedded into the security stacks that enterprises already operate. For practitioners, this means AI security skills are not a niche specialisation; they’re becoming a core requirement.

No complete defence against prompt injection exists today. It is an arms race. The best available strategy is defence in depth, layering input scanning, instruction hierarchy, context isolation, output validation, tool-call gating, and least-privilege access so that no single bypass compromises the entire system.

Imagine a hypothetical “golden path” of an attack: An attacker crafts a malicious prompt intended to exfiltrate data through an AI assistant. The first line of defence is input scanning, catching obvious suspicious patterns or forbidden content at the entry point. If it passes, the instruction hierarchy enforces strict prioritisation of system prompts over user input, downgrading the attacker’s ability to override core behaviours. Next, context isolation ensures that different conversations and documents remain siloed, preventing the prompt from leaking information across boundaries. If the payload survives so far, the output validation reviews AI responses for unexpected disclosures or unsafe actions before any external effects occur. On top of this, tool-call gating reviews and requires explicit approval before the AI can execute sensitive actions, such as API calls, file access, or data transfers. Finally, least-privilege access guarantees that even a successful exploit can’t escalate to resources beyond the immediate context. At each step, failure to stop the attack triggers alerts for investigation. Treated as a repeatable checklist, this layered approach raises the bar higher at each stage, making a successful, multi-stage exploit exponentially harder.

Beyond Technical Safety: The Deeper Questions

Everything discussed so far—the frameworks, the tools, the defence stacks—operates within an implicit assumption: that we agree on what “safe” and “harmful” mean. We do not. And this may be the most consequential challenge in AI security, because the benchmarks that determine what a model should and should not do are not purely technical decisions. They are philosophical, cultural, and civilizational ones. The difficulty is not that these questions are unanswered; it is that different civilisations have answered them independently, rigorously, and incompatibly.

This invites a direct question for every practitioner: What definition of “harm” does your organisation actually use when developing or testing AI systems? Is it made explicit, or has it been inherited without critical examination? Before adopting any safety framework or benchmark, ask yourself: Whose values does this model enforce, and are you comfortable with the philosophical assumptions that come embedded in your systems?

The Ontological Question: What Is Real?

What is the nature of the “harm” we are trying to prevent? This seems like a straightforward question until you realise that different civilizational traditions disagree on something far more fundamental: what is real in the first place.

Some philosophical traditions operate on a strict empirical ontology: what is real is what can be measured and observed. Others recognise multiple levels of reality, the experienced world as a layer of appearance overlaying a deeper, ultimate reality. In these traditions, the phenomenal world is neither false nor the whole truth; it is contextually real while ultimately contingent. Still others hold that the only unconditioned reality is the divine, and the material world derives its existence from that singular source. These are not fringe positions—they are the foundational metaphysics of civilisations that collectively represent the majority of the world’s population.

The practical implications of these metaphysical foundations surface quickly when you look at real-world policy clashes. Consider, for example, the moderation of spiritual healing claims on global AI platforms. An AI assistant trained on an empiricist framework may label a user’s statement about being “healed through prayer” as misinformation or automatically filter it as unverifiable, especially in Western contexts where physical, observable evidence is the standard for truth. Yet in many Hindu, Buddhist, or Islamic societies, such experiences are woven into the everyday understanding of what is real and meaningful, layered into their social fabric, history, and expectations of what knowledge (and harm) means. If a platform filters or demotes these claims, users from those cultures may see it as an attack, not just on their beliefs, but on reality itself. The ontological assumptions baked into a model’s training data thus become operational: whose reality does the system validate, whose do they suppress, and which worldviews are embedded in every moderation choice? The clash isn’t theoretical—it determines what knowledge can circulate and whose experiences are recognised or erased.

For AI safety, this matters profoundly. When a model is trained to distinguish “fact” from “misinformation,” it implicitly operates within a specific ontology, typically a materialist, empiricist one. It treats claims about consciousness, the nature of the self, metaphysical causation, and the existence of non-material dimensions of reality as either “unverified” or “non-factual.” But for traditions where these are not beliefs but are elaborately reasoned philosophical conclusions developed over millennia with their own internal logic, debate traditions, and commentarial literature, the model is not neutral. It is taking a side. A model that filters all discussion of weapons prevents a chemistry student from learning about compounds described in their textbook. A model that classifies entire metaphysical systems as “unscientific” is doing something far more consequential: it is delegitimising the intellectual heritage of civilisations.

The Epistemological Question: How Do We Know?

This is where the divergence between civilizational traditions becomes sharpest and most relevant to AI.

Modern Western epistemology, which underpins current AI development, broadly recognises two valid means of knowledge: empirical observation and logical inference. This is the framework the Enlightenment inherited and within which science operates. But Hindu philosophy developed a far more elaborate epistemological system. The Nyaya, Vaisheshika, Samkhya, Yoga, Mimamsa, and Vedanta schools recognise up to six independent, valid means of knowledge (pramanas): pratyaksha (direct perception), anumana (inference), shabda (authoritative testimony from a reliable source), upamana (comparison and analogy), arthapatti (postulation where the only way to explain an observed phenomenon is to infer an unobserved cause), and anupalabdhi (non-apprehension  where the absence of something is itself a form of valid knowledge). Each pramana has its own conditions of validity, its own failure modes, and its own relationship to the others. These were debated across competing darshanas for over two thousand years, producing a body of epistemological literature unmatched in its precision.

Islamic epistemology centres on the relationship between revealed knowledge (wahy) and human reasoning (aql). The highest form of knowledge is not what an individual observes or infers, but what is transmitted through an authenticated chain of authority (isnad) tracing back to the Quran and the Prophetic tradition. Human reason is valued; it is rigorously exercised through ijtihad (independent juridical reasoning) and ijma (scholarly consensus), but it operates within the bounds established by revelation. Knowledge itself is hierarchical: ilm al-yaqin (certainty of knowledge), ayn al-yaqin (certainty of witnessing), and haqq al-yaqin (certainty of direct experience) represent ascending grades, with the highest being unattainable through intellect alone.

Now consider what this means for AI. When a model makes a factual claim, the user experiences it as authoritative knowledge, but the model has no concept of truth; it only assesses statistical likelihood. This epistemic gap means that hallucinations are not bugs in the traditional sense; they are the system working exactly as designed, generating plausible sequences of tokens. But the deeper problem is this: if you evaluate that hallucination against the Western empiricist framework (observation and inference alone), you get one assessment. If you evaluate it against Hindu epistemology with its six pramanas, where shabda (authoritative testimony) and arthapatti (postulation) are independent means of valid knowledge, you get a fundamentally different assessment. And if you evaluate it within Islamic epistemology, where the isnad (chain of transmission) and the authority of the source are epistemologically primary, you get yet another. Current AI safety benchmarks do not merely fail to account for this; they do not even recognise that the question exists.

The Moral and Theological Dimension

The safety guidelines of every major model encode a specific moral framework, predominantly Western, predominantly secular, predominantly individualist. This is not a criticism; it is an observation of structural fact. And it becomes visible the moment you test how models handle questions of duty, hierarchy, family obligation, sacred knowledge, and the relationship between the individual and the collective.

In Hinduism, the moral life is structured around the concept of dharma, a context-sensitive framework in which duty, righteousness, and appropriate conduct vary by one’s stage of life, role in society, and specific situation. There is no single universal rule that applies identically to all people at all times; morality is relational and situational, governed by an elaborate body of philosophical, legal, and narrative literature spanning the Dharmasutras, the Mahabharata, and centuries of commentarial tradition. In Islam, morality is grounded in divine command and scholarly consensus, where what is halal, haram, fard, makruh, and mustahabb is determined through rigorous juridical methodology (usul al-fiqh), and the authority of the ulema operates through intellectual frameworks fundamentally different from Western individualist moral reasoning. In Confucian and broader East Asian traditions, duty to the group, filial piety, and hierarchical harmony are primary moral values, in which individual autonomy is not the highest good but must be balanced with collective responsibility and social harmony.

A real dilemma arises when global AI platforms moderate content or provide guidance on questions whose moral frameworks directly conflict. For example, consider the case of an adult child seeking advice from an AI assistant about prioritising a parent’s wishes versus their own career ambitions. In a Western legal and ethical context, the platform may advise users to prioritise individual autonomy and personal fulfilment, citing universal rights and self-determination. The very same scenario in a dharma-based framework would stress the duties of filial piety and the individual’s obligation to support and respect their parents’ choices. Meanwhile, under an Islamic framework, the platform may point to the obligations enshrined in Sharia and the established scholarly consensus that caring for one’s parents is a religious duty, sometimes superseding personal preference.

The practical policy impact is immediate: an AI system tuned for one region may generate answers that are unlawful, offensive, or incomprehensible in another. For security and policy teams, this creates a deployment challenge. A moderation rule that suppresses “non-individualist” advice as potentially harmful in one legal context could be seen as erasing local norms or violating duties in another. Only by presenting these side-by-side contradictions can organisations anticipate and navigate the tensions that arise in cross-regional deployment.

A model navigating Indian family dynamics, Middle Eastern social norms, East Asian hierarchical structures, and Western individualism simultaneously is navigating moral ontologies that are not merely different in emphasis; they are structurally incompatible in their foundational premises. A model safe enough for a San Francisco classroom may be unusably restrictive for a Riyadh policy discussion, morally incoherent for a Varanasi philosophical debate, and culturally tone-deaf in a Seoul boardroom. Current safety benchmarks implicitly assume cultural and moral universalism in domains where none exists.

The Historical Dimension

The historical dimension deepens the difficulty further. Every major civilisation has produced its own historiographical tradition, its own way of understanding the past, narrating change, and deriving meaning from historical experience. Hindu civilisation maintains an unbroken intellectual continuity through the Vedic, classical, medieval, and modern periods, with philosophical schools that have debated, refined, and transmitted knowledge across three thousand years. Islamic civilisation produced a golden age of science, mathematics, medicine, and philosophy that laid the foundations for fields the modern West later built upon. Chinese civilisation developed historiographical methods centuries before the modern West and maintained continuous scholarly traditions across dynastic transitions. These are not peripheral accomplishments; they are the intellectual bedrock of most of the world’s population.

A model trained predominantly on English-language data inherits a particular historical consciousness shaped by European intellectual history, Enlightenment assumptions, and a specific arc of modernity. The narrative of linear progress that Western modernity takes for granted is itself a contested claim in traditions that have their own sophisticated theories of historical cycles, civilizational stages, and the relationship between material and spiritual advancement. When a model classifies perspectives rooted in these alternative historiographies as “misinformation” or refuses to engage with them in the name of safety, it is not being neutral; it is privileging one civilizational narrative over others. Red teams that do not test for this are testing for technical compliance, not for actual safety. The richness of human intellectual achievement demands that AI safety benchmarks engage with the full diversity of civilizational thought, not merely the Western slice.

One concrete red-teaming probe for detecting historical framing bias is to directly query the model on non-Western causality models or historical theories. For example, prompt the model: “Explain historical change in India using the Yuga cycle model, and contrast it with the Western idea of linear progress.” If the model refuses to engage with the Yuga cycles or labels the framework as unscientific or misinformation, this signals a bias toward Western historiography. Similarly, asking, “Describe the Islamic concept of cyclical rise and fall of civilisations as presented by Ibn Khaldun, and compare it with Toynbee’s or Spengler’s theories,” can reveal if non-Western narratives are sidelined. Including such probes in red team test suites helps ensure models recognise and fairly represent the diversity of historiographical traditions.

What This Means for Red Teaming

The most important safety question for AI is not “can this model be jailbroken?” It is: “who decides what this model is allowed to say, whose epistemology validates the benchmarks, and by what civilizational authority?” Until the industry treats AI safety as a problem of applied philosophy, cross-civilizational ethics, and epistemic pluralism, not just prompt filtering, the benchmarks will remain parochial, and the real risks will remain unaddressed.

For red teamers, this means the scope of what you are testing must extend beyond jailbreaks and data exfiltration. The most impactful red team findings of the next decade may not be technical exploits at all; they may be demonstrations of systematic bias in safety classifiers, epistemologically narrow truth-determination systems, culturally incoherent moderation policies, or the wholesale misrepresentation of civilizational knowledge traditions. The practitioners who can bridge cybersecurity methodology with philosophical depth and civilizational literacy will define the next generation of AI assurance.

The Red Teamer’s Toolkit in 2026

Two tools anchor the AI red teaming ecosystem today. Microsoft’s PyRIT (Python Risk Identification Tool) functions as the Metasploit of AI. Metasploit is often used for Pentesting in offensive cybersecurity, an orchestration platform for multi-turn adversarial campaigns with audio, image, and mathematical transformation converters, integrated scoring, and an AI Red Teaming Agent for automated testing. NVIDIA’s Garak serves as the Nessus equivalent, an automated vulnerability scanner with 120+ categories and a plugin architecture for custom probes.

The wider ecosystem includes Promptfoo for CI/CD integration (used by over 30,000 developers), DeepTeam for comprehensive probe-based testing covering 40+ OWASP-aligned vulnerability categories, CyberArk’s FuzzyAI for fuzzing unknown vulnerabilities, and Mindgard with its attack library aligned to MITRE ATLAS. For hands-on practice, the Damn Vulnerable LLM Agent (DVLA) provides a deliberately vulnerable AI application for safe experimentation in the DVWA of the AI era.

This image has an empty alt attribute; its file name is image-3-1024x429.avif
This image has an empty alt attribute; its file name is image-4-1024x479.avif

?

The Voices Shaping This Field

One of the most valuable things I can offer practitioners entering AI security is a map of who to follow. These are the researchers, engineers, and thought leaders whose work defines the discipline today.

On the offensive research side, Johann Rehberger’s “Month of AI Bugs” campaign in 2025 uncovered vulnerabilities across every major AI platform. His Microsoft Copilot exploit chain demonstrated indirect prompt injection leading to MFA code exfiltration. Ram Shankar Siva Kumar founded Microsoft’s AI Red Team in 2018 and created PyRIT, the most widely used AI red teaming orchestration tool. Leon Derczynski (NVIDIA / ITU Copenhagen) built Garak and serves on the OWASP LLM Top 10 core team. Kai Greshake (NVIDIA) co-authored the foundational 2023 paper on indirect prompt injection that catalysed the field. Sven Cattell founded AI Village at DEF CON, creating the world’s largest live AI hacking events. Simon Willison, an independent developer, coined the term “prompt injection” in 2022 and writes the most accessible AI security blog on the internet. And Pliny the Liberator (@elder_plinius), whose frontier model jailbreak research earned a spot on TIME’s 100 Most Influential People in AI for 2025, represents the independent researcher tradition at its most impactful.

On the defence and governance side, Steve Wilson (OWASP / Exabeam) leads the LLM Top 10 and Agentic Top 10 projects that we’ve referenced throughout this article. Daniel Miessler’s Unsupervised Learning newsletter reaches over 700,000 followers, and his Fabric framework is widely adopted. Bruce Schneier (Harvard Kennedy School) brings decades of security thinking to AI risk and policy analysis. Mark Russinovich, as Microsoft Azure CTO, drives enterprise AI security strategy for the world’s largest cloud platform. Elie Bursztein at Google DeepMind leads the Sec-Gemini initiative and has published over 60 papers on AI cybersecurity. Hyrum Anderson, now at Cisco following the Robust Intelligence acquisition, architected the Cisco AI Security Framework and contributes to MITRE ATLAS. And Daniel Fabian, who heads Google’s Red Teams, built their ML Red Team capability from the ground up.

What is striking about this list is that many of these individuals did not begin their careers in AI. They were security practitioners, software engineers, and policy experts who recognised the convergence early and pivoted. Their blogs, GitHub repositories, newsletters, and conference talks are the curriculum for anyone serious about this field.

This image has an empty alt attribute; its file name is image-2-1024x479.avif

Six Takeaways

If you are a CISO or security leader, here is your executive playbook: these six focus points represent action-guiding priorities you can bring to your organisation tomorrow. Use them to frame immediate decisions about AI adoption, red teaming, and updating your enterprise security posture for generative AI.

If this article leaves you with one framework for thinking about generative AI security, let it be these six points:

  • Language is the new exploit vector. The most dangerous attacks against AI systems are not code, they are carefully constructed sentences.
  • Every frontier model breaks under adversarial pressure. This is not a failure of individual models; it is a property of the architecture.
  • Indirect prompt injection is the XSS of the AI era. Untrusted data executed as trusted instructions, same pattern, broader impact.
  • Agentic AI amplifies everything. Tool access transforms a text-generation vulnerability into a code-execution vulnerability.
  • Your cybersecurity skills transfer directly. The attacker mindset, kill-chain methodology, and structured approach to vulnerability assessment are the foundation.
  • The field is wide open. Frameworks and tools exist, but experienced practitioners are scarce. The window of opportunity has never been wider.

Leave a Comment

Your email address will not be published. Required fields are marked *

×