How do I audit an AI agent's security?

Short answer

Inventory every tool the agent can invoke. Map each tool to the trust level of its input source. For any tool with destructive side effects (delete, refund, email), enforce human-in-the-loop approval regardless of the agent's confidence. Test with known prompt-injection payloads.

AI agent security audit checklist:

**1. Tool inventory** List every tool the agent can invoke. For each: - What does it do? - What's the blast radius if it's invoked with attacker-controlled input? - Is there a human approval step?

**2. Input trust classification** For every input source: - User direct message (medium trust — can be attacker) - Retrieved document / RAG content (low trust — can contain indirect prompt injection) - Tool output (medium — depends on the tool) - System prompt (high trust — you wrote it)

**3. Trust × Tool matrix** Destructive tools never invocable from low-trust inputs. Period.

**4. Prompt-injection regression corpus** Accumulate attack payloads from: - MITRE ATLAS taxonomy - Public jailbreak corpora (AI Village) - Bugs your customers report

Run them on every release. Any success = ship blocker.

**5. Observability** - Log every tool invocation with the triggering prompt - Alert on tool-call patterns that are statistically unusual - Build a bill of materials for your models (AIBOM)

**6. Output filtering** - PII scrubbing on model outputs to users - Instruction-leakage detection

Securie's AI-feature security specialist automates most of this.

People also ask