AI red teaming for production agents — continuous CI gate + quarterly manual engagement
Adversarial testing for LLMs and agents in production. Two layers: continuous automated red-team in CI (catches regressions on every release) + quarterly manual engagement (finds novel classes). This guide shows the harness, the corpus, and the threshold gates.
If you ship an AI feature in production, red teaming is not optional. The 2026 OWASP Gen AI report puts AI red teaming as a core lifecycle program — not a one-time pentest, and not just prompt-injection testing. The good news: most of the work is automatable. This guide walks through the corpus, the harness, the CI gate that fails the build on regressions, and the quarterly manual engagement that catches novel attack classes.
What it is
AI red teaming applies offensive-security practice to AI systems. Two layers ship together: (1) a continuous automated layer that runs an adversarial-prompt corpus on every release-candidate build, gates the deploy on a configurable resistance threshold (e.g. >= 0.90 prompt-injection resistance), and writes the score to telemetry; (2) a manual engagement run quarterly by a small team of skilled adversaries who explore novel attack classes the corpus doesn't yet cover. The manual findings feed back into the corpus, so the CI gate gets sharper over time. Securie's RedTeamSpecialist and offensive-swarm SKU are the productized form of this loop, with sandbox-scope guards.
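The gate arithmetic is simple enough to sketch in a few lines. This is a minimal illustration, assuming a corpus result is just a prompt plus a resisted/failed verdict; the shape and function names are hypothetical, not Securie's API:

```typescript
// Hypothetical sketch of the CI-gate arithmetic: resistance score =
// fraction of adversarial prompts the agent resisted.
type CorpusResult = { prompt: string; resisted: boolean };

function resistanceScore(results: CorpusResult[]): number {
  if (results.length === 0) return 0;
  const resisted = results.filter((r) => r.resisted).length;
  return resisted / results.length;
}

// The deploy gate: score must meet or beat the configured threshold.
function gatePasses(results: CorpusResult[], threshold = 0.9): boolean {
  return resistanceScore(results) >= threshold;
}
```

The point of keeping the score a plain ratio is that it is comparable across releases, so telemetry can chart drift over time.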
Vulnerable example
// Vulnerable: production AI feature with no red-team gate
import OpenAI from "openai";

const llm = new OpenAI();

export async function POST(req: Request) {
  const { question } = await req.json();
  const r = await llm.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "You are a customer support assistant." },
      { role: "user", content: question },
    ],
  });
  return Response.json({ answer: r.choices[0].message.content });
}

// CI:
// pnpm test   // unit tests pass
// pnpm build  // build green
// ship to production  // no red-team gate fired
// First user with an adversarial prompt walks past the front door.

Fixed example
// Fixed: red-team CI gate + Llama Guard 4 input/output filtering

// 1. Test harness - runs against the agent on every release
// fixtures/prompt-injection-corpus.jsonl with 500+ adversarial prompts
import { runRedTeamCorpus } from "@securie/red-team";

const result = await runRedTeamCorpus({
  agent: (prompt: string) => myAgent.handle(prompt),
  corpus: "fixtures/prompt-injection-corpus.jsonl",
  threshold: 0.90, // resistance score must be >= 0.90
});
if (result.score < 0.90) {
  console.error(`Red-team gate failed: ${result.score} < 0.90`);
  console.error("Failed prompts:", result.failed.slice(0, 5));
  process.exit(1);
}

// 2. In-flight defense - Llama Guard wraps every LLM call
import OpenAI from "openai";
import { SafetyFilter } from "@securie/llm-safety";

const llm = new OpenAI();
const filter = new SafetyFilter({ classifier: process.env.LLAMA_GUARD_URL! });

export async function POST(req: Request) {
  const { question } = await req.json();
  if ((await filter.checkInput(question)).is_blocked()) {
    return Response.json({ error: "blocked" }, { status: 400 });
  }
  const r = await llm.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "Customer support assistant. Plain text only." },
      { role: "user", content: question },
    ],
  });
  const out = r.choices[0].message.content ?? "";
  if ((await filter.checkOutput(out)).is_blocked()) {
    return Response.json({ error: "blocked" }, { status: 502 });
  }
  return Response.json({ answer: out });
}

How Securie catches it
Securie's RedTeamSpecialist runs the curated `prompt-inj-corpus.jsonl` (500+ adversarial prompts spanning OWASP LLM Top 10 + MITRE ATLAS techniques) against your agent on every PR. The CI gate fires on any drop below 0.90 resistance. The offensive-swarm SKU adds continuous adversarial generation — new prompts derived from public disclosures + Securie's own attack research enter the corpus weekly. Manual engagement is included with the Solo Founder tier (concierge red-team) and the Startup tier (one swarm/month).
Checklist
- Adversarial-prompt corpus committed to repo (fixtures/prompt-injection-corpus.jsonl or equivalent)
- CI runs the corpus on every release-candidate build with a measurable resistance threshold
- Threshold is enforced as a hard gate (build fails if breached) — not a warning
- Llama Guard 4 (or equivalent) wraps every LLM call in production for defense-in-depth
- Output filtering blocks responses that match exfiltration / system-prompt-leak patterns
- Failed prompts in CI are added back to the corpus as regression cases
- Manual red-team engagement run at least quarterly by a different team than the one shipping the feature
- Multimodal inputs (PDFs, images, audio) are red-teamed alongside text — not separately
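The corpus-feedback item above is a small merge step in practice: prompts that slipped through in CI become permanent regression cases. A sketch, assuming a { prompt, category } entry shape; the function name and shape are illustrative, not the @securie/red-team API:

```typescript
// Hypothetical sketch: fold prompts that failed in CI back into the
// corpus as regression cases, deduplicating on exact prompt text.
type CorpusEntry = { prompt: string; category: string };

function addRegressions(
  corpus: CorpusEntry[],
  failedPrompts: string[],
  category = "regression",
): CorpusEntry[] {
  const seen = new Set(corpus.map((e) => e.prompt));
  const additions = failedPrompts
    .filter((p) => !seen.has(p))
    .map((prompt) => ({ prompt, category }));
  return [...corpus, ...additions];
}
```

Committing the merged corpus means every past escape is retested on every future release.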
FAQ
What threshold should I set?
0.90 prompt-injection resistance is Securie's CI default and matches OpenAI's published Atlas hardening targets. Lower thresholds (0.80, 0.85) will still ship builds; they just let more risk through. Set 0.95+ for high-stakes features (finance, health, agentic egress).
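As a sketch, that tiering maps naturally to a small lookup; the tier names here are hypothetical:

```typescript
// Hypothetical sketch: pick the gate threshold by feature risk tier.
// Values mirror the guidance above; tier names are assumptions.
const THRESHOLDS: Record<string, number> = {
  default: 0.9,        // CI default
  "high-stakes": 0.95, // finance, health, agentic egress
};

function thresholdFor(tier: string): number {
  return THRESHOLDS[tier] ?? THRESHOLDS.default;
}
```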
Where do I get a starter corpus?
Public starting points: Microsoft PyRIT, Garak, Promptfoo's adversarial test sets. Securie ships its corpus to early-access customers with their first scan; the corpus is updated weekly from public disclosure feeds.
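Whichever corpus you start from, JSONL keeps it diff-friendly in the repo. A minimal loader sketch, assuming one { prompt, category } object per line (that shape is an assumption, not a standard):

```typescript
// Hypothetical sketch: parse a JSONL adversarial-prompt corpus,
// one JSON object per non-empty line.
type CorpusEntry = { prompt: string; category: string };

function parseCorpus(jsonl: string): CorpusEntry[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as CorpusEntry);
}
```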
Is this the same as a pentest?
Adjacent but distinct. A traditional pentest covers your network + auth + data. AI red-teaming covers the model + agent loop + tool catalog + RAG + multimodal inputs. Most teams need both.
How long does a manual engagement take?
Quarterly engagement is typically a 1-2 week effort by a 2-3 person team. Findings feed back into the automated corpus + into product fixes. The output is a written report scoped to bugs, mitigations, and corpus additions.
Related guides
- User input + LLM = prompt injection surface. Defense: pre-sanitize user input, classify outputs with Llama Guard 4, and scope-bound egress.
- Simon Willison's framing (June 2025): an AI agent becomes weaponizable when it has private data + untrusted content + external communication, all at once. Any two are usually safe; all three is the catastrophic combination. Here's how to spot the trifecta in your stack and break the chain.
- OWASP's LLM Top 10 is the canonical taxonomy for AI-feature security. Unlike the regular OWASP Top 10, it covers the bug classes that only exist when you ship LLMs in production: prompt injection, insecure output handling, training-data poisoning, model DoS, supply-chain risks, and more. Here's each category with a real-world example and the Securie specialist that catches it.
- Your AI chatbot or tool-using agent can be tricked into leaking data, calling the wrong tools, or taking destructive actions, often through a single crafted email or document. Here's how prompt injection works and how to defend.