Prompt injection is an attack technique where malicious instructions are inserted into an AI agent's input to manipulate its behavior, bypass safety controls, or exfiltrate data.
Prompt injection is the most prevalent security vulnerability in AI agent systems. It occurs when an attacker crafts input that causes an AI model to interpret malicious instructions as part of its system prompt or task instructions. For AI agents like those built with OpenClaw, prompt injection is especially dangerous because agents have access to tools — a successful injection could cause the agent to delete files, send emails, or access restricted APIs.
Direct injection — The attacker directly provides malicious instructions in the prompt input. Indirect injection — Malicious instructions are hidden in data the agent processes — web pages, documents, emails, or database records. The agent reads the poisoned content and follows the embedded instructions.
No defense is perfect, but layered approaches reduce risk: - Input sanitization — Strip known injection patterns - Output filtering — Validate agent actions before execution - Least-privilege permissions — Use SKILL.md to restrict capabilities - Human-in-the-loop — Require approval for sensitive actions - Guardrail models — Use a secondary model to classify inputs as safe/unsafe
Yes. If a skill processes untrusted input (web content, user messages, documents), prompt injection is a risk. Limiting skill permissions via SKILL.md reduces the potential damage.
No. Prompt injection is an inherent challenge of language model architecture. The best approach is defense-in-depth: input filtering, output validation, permission restrictions, and human oversight for critical actions.