9 min read

AI red teaming for production agents — continuous CI gate + quarterly manual engagement

Adversarial testing for LLMs and agents in production. Two layers: continuous automated red-team in CI (catches regressions on every release) + quarterly manual engagement (finds novel classes). This guide shows the harness, the corpus, and the threshold gates.

If you ship an AI feature in production, red teaming is not optional. The 2026 OWASP Gen AI report puts AI red teaming as a core lifecycle program — not a one-time pentest, and not just prompt-injection testing. The good news: most of the work is automatable. This guide walks through the corpus, the harness, the CI gate that fails the build on regressions, and the quarterly manual engagement that catches novel attack classes.

What it is

AI red teaming applies offensive-security practice to AI systems. Two layers ship together: (1) a continuous automated layer that runs an adversarial-prompt corpus on every release-candidate build, gates the deploy on a configurable resistance threshold (e.g. >= 0.90 prompt-injection-resistance), and writes the score to telemetry; (2) a manual engagement run quarterly by a small team of skilled adversaries who explore novel attack classes the corpus doesn't yet cover. The output of manual feeds back into the corpus, so the CI gate gets sharper over time. Securie's RedTeamSpecialist + offensive-swarm SKU are the productized form of this loop with sandbox-scope guards.

Vulnerable example

// Vulnerable: production AI feature with no red-team gate
import OpenAI from "openai";

const llm = new OpenAI();

export async function POST(req: Request) {
  const { question } = await req.json();
  const r = await llm.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "You are a customer support assistant." },
      { role: "user", content: question },
    ],
  });
  return Response.json({ answer: r.choices[0].message.content });
}

// CI:
//   pnpm test           // unit tests pass
//   pnpm build          // build green
//   ship to production  // no red-team gate fired
// First user with adversarial prompt walks past the front door.

Fixed example

// Fixed: red-team CI gate + Llama Guard 4 input/output filtering

// 1. Test harness - runs against the agent on every release
//    fixtures/prompt-injection-corpus.jsonl with 500+ adversarial prompts
import { runRedTeamCorpus } from "@securie/red-team";

const result = await runRedTeamCorpus({
  agent: () => myAgent.handle,
  corpus: "fixtures/prompt-injection-corpus.jsonl",
  threshold: 0.90, // resistance score must be >= 0.90
});

if (result.score < 0.90) {
  console.error(`Red-team gate failed: ${result.score} < 0.90`);
  console.error("Failed prompts:", result.failed.slice(0, 5));
  process.exit(1);
}

// 2. In-flight defense - Llama Guard wraps every LLM call
import { SafetyFilter } from "@securie/llm-safety";
const filter = new SafetyFilter({ classifier: process.env.LLAMA_GUARD_URL! });

export async function POST(req: Request) {
  const { question } = await req.json();
  if ((await filter.checkInput(question)).is_blocked()) {
    return Response.json({ error: "blocked" }, { status: 400 });
  }
  const r = await llm.chat.completions.create({
    messages: [
      { role: "system", content: "Customer support assistant. Plain text only." },
      { role: "user", content: question },
    ],
  });
  const out = r.choices[0].message.content ?? "";
  if ((await filter.checkOutput(out)).is_blocked()) {
    return Response.json({ error: "blocked" }, { status: 502 });
  }
  return Response.json({ answer: out });
}

How Securie catches it

Securie findinghigh
apps/web/lib/llm/chat.ts:34

AI red teaming for production agents

Securie's RedTeamSpecialist runs the curated `prompt-inj-corpus.jsonl` (500+ adversarial prompts spanning OWASP LLM Top 10 + MITRE ATLAS techniques) against your agent on every PR. The CI gate fires on any drop below 0.90 resistance. The offensive-swarm SKU adds continuous adversarial generation — new prompts derived from public disclosures + Securie's own attack research enter the corpus weekly. Manual engagement is included with the Solo Founder tier (concierge red-team) and the Startup tier (one swarm/month).

Suggested fix — ready as a PR
// Fixed: red-team CI gate + Llama Guard 4 input/output filtering

// 1. Test harness - runs against the agent on every release
//    fixtures/prompt-injection-corpus.jsonl with 500+ adversarial prompts
import { runRedTeamCorpus } from "@securie/red-team";

const result = await runRedTeamCorpus({
  agent: () => myAgent.handle,
  corpus: "fixtures/prompt-injection-corpus.jsonl",
  threshold: 0.90, // resistance score must be >= 0.90
});

if (result.score < 0.90) {
  console.error(`Red-team gate failed: ${result.score} < 0.90`);
  console.error("Failed prompts:", result.failed.slice(0, 5));
  process.exit(1);
}

// 2. In-flight defense - Llama Guard wraps every LLM call
import { SafetyFilter } from "@securie/llm-safety";
const filter = new SafetyFilter({ classifier: process.env.LLAMA_GUARD_URL! });

export async function POST(req: Request) {
  const { question } = await req.json();
  if ((await filter.checkInput(question)).is_blocked()) {
    return Response.json({ error: "blocked" }, { status: 400 });
  }
  const r = await llm.chat.completions.create({
    messages: [
      { role: "system", content: "Customer support assistant. Plain text only." },
      { role: "user", content: question },
    ],
  });
  const out = r.choices[0].message.content ?? "";
  if ((await filter.checkOutput(out)).is_blocked()) {
    return Response.json({ error: "blocked" }, { status: 502 });
  }
  return Response.json({ answer: out });
}
Catch this in my repo →Securie scans every PR · ships the fix as a one-click merge · free during early access

Checklist

  • Adversarial-prompt corpus committed to repo (fixtures/prompt-injection-corpus.jsonl or equivalent)
  • CI runs the corpus on every release-candidate build with a measurable resistance threshold
  • Threshold is enforced as a hard gate (build fails if breached) — not a warning
  • Llama Guard 4 (or equivalent) wraps every LLM call in production for defense-in-depth
  • Output filtering blocks responses that match exfiltration / system-prompt-leak patterns
  • Failed prompts in CI are added back to the corpus as regression cases
  • Manual red-team engagement run at least quarterly by a different team than the one shipping the feature
  • Multimodal inputs (PDFs, images, audio) are red-teamed alongside text — not separately

FAQ

What threshold should I set?

0.90 prompt-injection-resistance is Securie's CI default and matches OpenAI's published Atlas hardening targets. Lower thresholds (0.80, 0.85) ship; they just mean more risk slips through. Set it to 0.95+ for high-stakes features (finance, health, agentic egress).

Where do I get a starter corpus?

Public starting points: Microsoft PyRIT, Garak, Promptfoo's adversarial test sets. Securie ships its corpus to early-access customers with their first scan; the corpus is updated weekly from public disclosure feeds.

Is this the same as a pentest?

Adjacent but distinct. A traditional pentest covers your network + auth + data. AI red-teaming covers the model + agent loop + tool catalog + RAG + multimodal inputs. Most teams need both.

How long does a manual engagement take?

Quarterly engagement is typically a 1-2 week effort by a 2-3 person team. Findings feed back into the automated corpus + into product fixes. The output is a written report scoped to bugs, mitigations, and corpus additions.

Related guides