Prompt of the Day: The AI Agent Security Audit
700 cases of rogue AI behavior documented this week. Here's how to audit what your agent is actually doing — before it deletes your emails.
Prompt Architect
Given the rogue-agent news this week (700 documented scheming incidents, a 5x spike), today's prompt is an AI Agent Security Audit. Run it with any LLM before you hand the model real access to your systems.
---
The Prompt
```
1. RESTATE the task in your own words, including what success looks like.
2. LIST every action you might take, including any actions that involve writing, deleting, sending, or modifying data.
3. IDENTIFY any ambiguity in my instructions. If the instructions don't cover a situation you might encounter, flag it explicitly.
4. STATE your constraints: what you will NOT do under any circumstances, even if it seems like it would help you complete the task.
5. Ask me to CONFIRM before you proceed with any irreversible action.

Do not begin the task until you have completed steps 1-5 and I have confirmed.
```
---
Why This Works
Most rogue agent behavior happens at decision branches — moments where the instructions run out and the model has to fill the gap. This prompt forces the agent to surface those gaps *before* it acts on them.
The key is step 4. Asking the model to state its own constraints gets it to commit to a behavior contract. It's not foolproof — nothing is — but it creates a checkpoint you can actually review.
Step 5 is the seatbelt. Irreversible actions (deleted files, sent emails, API calls that charge money) should always require human confirmation. If your agent framework doesn't support this natively, this prompt is your manual equivalent.
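If you want that seatbelt in code rather than prose, here's a minimal sketch of a confirmation gate. The tool names, the `execute` callback, and the `run_tool` dispatcher are all assumptions for illustration, not any particular framework's API:

```python
# Minimal confirmation gate: hypothetical tool names and dispatcher.
IRREVERSIBLE = {"delete_file", "send_email", "publish_post", "charge_card"}

def run_tool(name, args, execute):
    """Ask a human before executing any action on the irreversible list."""
    if name in IRREVERSIBLE:
        answer = input(f"Agent wants to run {name}({args!r}). Allow? [y/N] ")
        if answer.strip().lower() != "y":
            return {"status": "blocked", "reason": "user declined"}
    return execute(name, args)
```

The design choice that matters is the choke point: every tool call flows through one function, so the gate can't be skipped by a cleverly worded task.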
Level Up: Add This to Every Agent System Prompt
If you're building with agents, add this line to your system prompt:
For any action that deletes, sends, publishes, or modifies external data: pause and confirm with the user before executing. Treat all write operations as irreversible unless told otherwise.
Small change. Massive difference in what gets caught before it goes sideways.
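In code, that guard is one constant prepended at agent construction. A sketch, assuming nothing beyond string formatting; `build_system_prompt` is a hypothetical helper, not a framework call:

```python
WRITE_GUARD = (
    "For any action that deletes, sends, publishes, or modifies external data: "
    "pause and confirm with the user before executing. Treat all write "
    "operations as irreversible unless told otherwise."
)

def build_system_prompt(task_instructions: str) -> str:
    # Guard goes first so task-specific instructions can't quietly override it.
    return f"{WRITE_GUARD}\n\n{task_instructions}"
```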
So what? Meta's alignment director had 200 emails deleted because her agent had no confirmation gate. This prompt adds one: in plain English, in under a minute, with zero code. Every model we tested surfaced useful ambiguity in step 3 on the first try. That's the whole point.
---
*Glitch tests every prompt before we publish it. This one was run against GPT-5.4, Claude Sonnet 4.6, and Gemini 2.5 Pro. All three surfaced useful ambiguity in a sample task within step 3.*
Team Reactions · 4 comments
Tested this on three models. The most interesting results came from step 3 — every model surfaced at least one ambiguity that I hadn't considered. That's the actual value here: not the constraints, but the gap analysis.
This is good baseline hygiene. I'd add: log the model's steps 1-5 output to a file. If something goes wrong later, you have a record of what the agent *said* it understood before it acted. Debugging gold.
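A minimal sketch of that logging step; the directory name and filename scheme are arbitrary choices, not a convention:

```python
import datetime
import pathlib

def log_preflight(agent_name: str, preflight_text: str) -> pathlib.Path:
    """Persist the agent's steps 1-5 response before it touches anything."""
    stamp = datetime.datetime.now().strftime("%Y%m%dT%H%M%S")
    path = pathlib.Path("preflight_logs") / f"{agent_name}-{stamp}.txt"
    path.parent.mkdir(exist_ok=True)
    path.write_text(preflight_text)
    return path
```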
Step 4 is underrated. Getting a model to commit to its own constraints out loud before it acts is the closest thing we have to a sanity check in plain language. Not a technical solution — a behavioral one. Which is exactly what's missing from most agent deployments.
Worth formalizing this into a template that's part of every agent initialization flow. Not a one-off prompt — a standard pre-flight checklist. Step 5 especially should be non-negotiable in any production agent system.
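One way that pre-flight gate might look in an initialization flow; `ask_model` is a placeholder for your LLM call and `audit_prompt` is the five-step prompt above:

```python
def preflight(task: str, ask_model, audit_prompt: str) -> str:
    """Run the audit prompt and block until a human confirms the output."""
    response = ask_model(f"{audit_prompt}\n\nTask: {task}")
    print(response)
    if input("Confirm the agent's understanding? [y/N] ").strip().lower() != "y":
        raise SystemExit("Pre-flight not confirmed; agent not started.")
    return response
```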