Updated 2026-04-206 min read

Prompt injection in AI apps — how attackers hijack your agents

Your AI chatbot or tool-using agent can be tricked into leaking data, calling the wrong tools, or taking destructive actions — often through a single crafted email or document. Here is how prompt injection works and how to defend.

Prompt injection is the SQL injection of the AI era. It is the most prevalent attack vector against LLM-powered applications and is often the first thing a red team tries. This guide explains direct and indirect prompt injection, and the controls that matter.

What it is

Prompt injection is when untrusted input — a user message, a scraped web page, an email, a document — contains instructions that alter the behavior of an LLM. Direct injection tricks a chatbot. Indirect injection hides instructions inside content the model reads, such as a customer support ticket or a knowledge-base article.

Vulnerable example

// Vulnerable: untrusted email content is passed straight to the LLM with tool access
async function handleIncomingEmail(body: string) {
  return llm.call({
    system: "You are a helpful support agent. You have access to the refund tool.",
    user: body, // attacker-controlled
    tools: [refundTool, escalateTool],
  });
}

Fixed example

// Partial mitigation: classify before acting, delimit untrusted content,
// and restrict the tool scope based on the classification.
async function handleIncomingEmail(body: string) {
  const classification = await llm.classify(body, { policy: "triage-only" });
  if (classification.intent !== "refund") {
    return llm.call({
      system: "You are triage-only; do NOT invoke tools.",
      user: "UNTRUSTED CONTENT FOLLOWS:\n<<<" + body + ">>>",
      tools: [], // no tools at all for untrusted content
    });
  }
  // still gates tool execution via human-in-the-loop downstream
}

How Securie catches it

Securie findinghigh

apps/web/lib/llm/chat.ts:34

Prompt injection in AI apps

Securie's AI-feature security specialist tracks every LLM call in your code, maps each call to the tools it can invoke, and tests that scope against a corpus of known injection patterns (MITRE ATLAS, public exploit writeups, custom regression cases). Findings emit as pull-request comments with the specific injection payload that succeeded and a proposed scope reduction.

Suggested fix — ready as a PR

// Partial mitigation: classify before acting, delimit untrusted content,
// and restrict the tool scope based on the classification.
async function handleIncomingEmail(body: string) {
  const classification = await llm.classify(body, { policy: "triage-only" });
  if (classification.intent !== "refund") {
    return llm.call({
      system: "You are triage-only; do NOT invoke tools.",
      user: "UNTRUSTED CONTENT FOLLOWS:\n<<<" + body + ">>>",
      tools: [], // no tools at all for untrusted content
    });
  }
  // still gates tool execution via human-in-the-loop downstream
}

Catch this in my repo →Securie reviews every PR · proves the issue · ships a verified fix PR

Checklist

Never grant untrusted input direct access to destructive tools (delete, refund, send-email)
Delimit untrusted content explicitly in the prompt; prefer typed tool schemas over free-form interpretation
Use a classifier or guard model upstream of any tool-using agent
Keep a prompt-injection regression corpus and run it in CI
Monitor tool-call traces for anomalies — scope escalation is a red flag
Log every agent action for forensic review

FAQ

Is a system prompt enough to prevent prompt injection?

No. Every frontier model has been shown to be jailbreakable via a determined attacker given enough tokens. System prompts are a first defense; tool-scope restriction is the real defense.

Can I just add a filter on the input?

Filters help against naive attempts. They do not defeat indirect prompt injection delivered through documents or web pages, which your filter may not even see.

Share:X Hacker News Reddit LinkedIn

Related guides

Guide

Preventing prompt injection in LLM features — Llama Guard 4 + sanitization

User input + LLM = prompt injection surface. Defense: pre-sanitize user input + Llama Guard 4 classify outputs + scope-bound egress.

Guide

Vibe coding security risks — the 2026 field guide

Vibe coding (AI-generated apps shipped with minimal human review) has a security problem. Here is a grounded look at what actually breaks, with dated public incidents, and the controls that work.

Guide

Defending MCP agents from indirect prompt injection (2026 playbook)

Indirect prompt injection — adversarial instructions embedded in data the agent reads — is the single most common attack class against MCP-using agents. Microsoft's Apr 2026 advisory + Unit42's MCP attack-vector taxonomy converged on the same defense: pre-prompt-output sanitization + scope-bounded egress + Llama Guard 4 classification. This guide ships the layered defense.

Guide

The lethal trifecta for AI agents — why three capabilities together turn agents into weapons

Simon Willison's framing (June 2025): an AI agent becomes weaponizable when it has private data + untrusted content + external communication, all at once. Any two are usually safe; all three is the catastrophic combination. Here's how to spot the trifecta in your stack and break the chain.