6 min read

Prompt injection in AI apps — how attackers hijack your agents

Your AI chatbot or tool-using agent can be tricked into leaking data, calling the wrong tools, or taking destructive actions — often through a single crafted email or document. Here is how prompt injection works and how to defend.

Prompt injection is the SQL injection of the AI era. It is the most prevalent attack vector against LLM-powered applications and is often the first thing a red team tries. This guide explains direct and indirect prompt injection, and the controls that matter.

What it is

Prompt injection is when untrusted input — a user message, a scraped web page, an email, a document — contains instructions that alter the behavior of an LLM. Direct injection tricks a chatbot. Indirect injection hides instructions inside content the model reads, such as a customer support ticket or a knowledge-base article.

Vulnerable example

// Vulnerable: untrusted email content is passed straight to the LLM with tool access
async function handleIncomingEmail(body: string) {
  return llm.call({
    system: "You are a helpful support agent. You have access to the refund tool.",
    user: body, // attacker-controlled
    tools: [refundTool, escalateTool],
  });
}

Fixed example

// Partial mitigation: classify before acting, delimit untrusted content,
// and restrict the tool scope based on the classification.
async function handleIncomingEmail(body: string) {
  const classification = await llm.classify(body, { policy: "triage-only" });
  if (classification.intent !== "refund") {
    return llm.call({
      system: "You are triage-only; do NOT invoke tools.",
      user: "UNTRUSTED CONTENT FOLLOWS:\n<<<" + body + ">>>",
      tools: [], // no tools at all for untrusted content
    });
  }
  // still gates tool execution via human-in-the-loop downstream
}

How Securie catches it

Securie's AI-feature security specialist tracks every LLM call in your code, maps each call to the tools it can invoke, and tests that scope against a corpus of known injection patterns (MITRE ATLAS, public exploit writeups, custom regression cases). Findings emit as pull-request comments with the specific injection payload that succeeded and a proposed scope reduction.

Checklist

  • Never grant untrusted input direct access to destructive tools (delete, refund, send-email)
  • Delimit untrusted content explicitly in the prompt; prefer typed tool schemas over free-form interpretation
  • Use a classifier or guard model upstream of any tool-using agent
  • Keep a prompt-injection regression corpus and run it in CI
  • Monitor tool-call traces for anomalies — scope escalation is a red flag
  • Log every agent action for forensic review

FAQ

Is a system prompt enough to prevent prompt injection?

No. Every frontier model has been shown to be jailbreakable via a determined attacker given enough tokens. System prompts are a first defense; tool-scope restriction is the real defense.

Can I just add a filter on the input?

Filters help against naive attempts. They do not defeat indirect prompt injection delivered through documents or web pages, which your filter may not even see.