Does prompt injection matter to AutoGPT?

Executive summary: If you use AutoGPT, you need to be aware of prompt injection. This is a serious problem that can cause your AutoGPT agent to perform unexpected and unwanted tasks. Unfortunately, there isn't a perfect solution to this problem available yet.

Prompt injection can derail agents

If you set up an AutoGPT agent to perform task A, a prompt injection could 'derail' it into performing task B instead. Task B could be anything. Even something unwanted like deleting your personal files or sending all your bitcoins to some crooks wallet.

Docker helps limit the file system access that agents have. Measures like this are extremely useful. It's important to note that the agent can still be derailed.

How it works

In band signalling. This can happen if the agent uses web search to find things, and it hits upon a document that includes a prompt injection. GPT has to decide by itself whether text is instructions or not, and sometimes it will misinterpret untrusted user input from the internet as part of its instructions.

Proof of Concept

I'm not providing an example PoC. It may be useful for one to be privately shared with the developers to help them experiment with and address this problem but for now there is none.

Defenses

It isn't yet known how to protect against prompt injection.

Escaping approach

SQL injection can be solved by escaping the untrusted user input. Unfortunately there is no equivalent to escaping untrusted user input.

The LLM approach

Why not add a second LLM that tries to detect prompt injection before passing the text to the LLM?

As an analogy: If you think of prompt injection like a knife and LLMs like a paper bag, the knife can stab through a paper bag. It can also stab through a second paper bag. To protect against a knife you need something different that's tough against a knife like armor.

We can't solve a vulnerability by adding more layers of vulnerable technology.

Arms race approach

Another idea is to keep blocking every prompt injection that is found. This doesn't address the root cause of the problem. And it means that people will be occasionally victim to attacks.

rain-1/Prompt Injection and AutoGPT.md

Does prompt injection matter to AutoGPT?

Prompt injection can derail agents

How it works

Proof of Concept

Defenses

Escaping approach

The LLM approach

Arms race approach

simonw commented May 7, 2023

rain-1 commented May 12, 2023

ato2 commented May 15, 2023 •

edited

Loading

rain-1 commented May 17, 2023

rain-1/Prompt Injection and AutoGPT.md

Does prompt injection matter to AutoGPT?

Prompt injection can derail agents

How it works

Proof of Concept

Defenses

Escaping approach

The LLM approach

Arms race approach

simonw commented May 7, 2023

rain-1 commented May 12, 2023

ato2 commented May 15, 2023 • edited Loading

rain-1 commented May 17, 2023

ato2 commented May 15, 2023 •

edited

Loading