Prompt injection attacks in LLMs and how to defend against them

In recent months, large language models (LLMs) have gained great popularity thanks to their ability to generate human-like text and code, and tools based on them are being implemented in more and more systems. However, with such impressive capabilities comes the potential for abuse and misuse: for instance, when Microsoft originally released Bing Chat, an AI-powered chatbot somewhat similar to OpenAI’s ChatGPT, it didn’t take long until users found ways to break it. By using custom-built prompts (i.e., the textual instruction that is given as input to the model), users managed for instance to have the LLM invent and defend conspiracy theories, and leak sensitive details about their inner workings and hidden instructions. Similarly, some users found ways to have the LLMs leak information about some of the interactions they had with other users, including sensitive information. This vulnerability is referred to as prompt injections and as text-generating AI becomes more common in the services, apps and websites that we use every day, so will these attacks. For this reason it is important that any developer working with LLMs and user input and integrating them into the tools they develop are aware of this and of the possible mitigation techniques which, unfortunately, are not always effective right now.


Understanding prompt injection

Prompt injection refers to the exploitation of vulnerabilities in LLMs by manipulating or injecting misleading or malicious prompts and, in a way, it bears striking similarities to the well-known SQL injection attacks that have plagued databases. Indeed, similarly to how SQL injection attacks target databases by leveraging unvalidated user input, prompt injection attacks target LLMs by feeding them with deceptive or harmful prompts. However, there are some major differences, which affect the countermeasures that can be put in place to defend against prompt injection attacks.

Digging a bit deeper into the similarities with SQL injection, we can identify three of them.

Lack of input Validation: both in prompt injection and in SQL injection attacks, the absence or inadequate validation of user input is a key vulnerability. In SQL injection, unsanitized user input can alter the structure or behavior of database queries. Similarly, prompt injection relies on unvalidated input to manipulate the behavior of language models, potentially leading to biased or misleading outputs.

Malicious Intent: SQL injection attacks are often carried out with malicious intent, aiming to gain unauthorized access, manipulate data, or disrupt the functioning of the database. Similarly, prompt injection can be employed to generate harmful content, spread misinformation, gain access to sensitive information, and in general influence the outputs of language models for malicious purposes.

Direct impact on Outputs: In both cases, the injected input directly affects the output. In SQL injection, an attacker can modify the logic of a query to retrieve unauthorized data or alter existing data. Similarly, prompt injection can significantly impact the responses generated by language models, potentially leading to false information, biased outputs, or even the generation of harmful content.



Differently from other attacks such as SQL injection, there are no countermeasures that are clearly immediately effective. For instance, in SQL injection attacks it is often sufficient to perform escaping and remove specific characters from the text and treat it as “plain text” to avoid the database considering it as a legitimate SQL query. In LLMs, this is not possible as the input text is expected to be plain text and can be anything (code, essays, poetry, etc…). Still, there are some countermeasures that can be put in place to defend against it, and they are already used by companies such as Microsoft and OpenAI to prevent their AI from responding in undesirable ways (regardless of these being due to adversarial prompts or not).

Checks on the model output: the first defense is to have an approach to filter the text generated by the model, in order to check whether it produced undesirable output. For instance, manually created rules that block the model output when it is reproducing the instructions that it was given. Alternatively, this could be performed by one or more machine learning models, which detect whether the output satisfies a series of quality criteria.

Improvement to the model: this is not really a countermeasure but it is an effective way to combat prompt injection attacks. By performing RLHF (reinforcement learning with human feedback) the models can be continuously improved to become more robust towards known attacks and limitations. However, this will be an arms race, as it often happens in cyber-security: as malicious users try to break the LLM, the approaches used by them will gain attention, and the owners of the model will improve them to prevent known attacks.


What can you do about it, as a developer using LLMs?

Prompt injection poses a challenge for developers who build applications on LLMs. Typically, the process involves writing a human-readable description of the desired task, such as "translate this from English to French". The problem arises when the developer combines this description with user inputs and directly feeds the entire input to the model, as malicious input would lead the model to behave in unexpected ways. This is why it is important to be aware of these risks and some of the best approaches to face it. Even though, when implementing an LLM in an app or service under development, developers cannot directly modify the model in use, they can still adopt some preventive measures to minimize the damages from prompt injection attacks.

Input Sanitization: implement thorough input validation techniques to ensure that prompts are properly validated and sanitized before being fed to language models. This includes filtering out malicious or deceptive inputs and adhering to strict guidelines for acceptable prompts.

Adversarial Testing: regularly conduct adversarial testing to identify vulnerabilities and weaknesses in the LLM you are using and the sanitization approach you implemented. Explore techniques like prompt engineering and stress testing to assess the model's responses to manipulated prompts and ensure the system's robustness.

Prompt injection in LLMs highlights the need for robust security measures and responsible usage. Some countermeasures can only be implemented by the owner of the models, but there are some steps you can take when relying on a third-party LLM to minimize the risk of prompt injection attacks.