Prompt-based Attack on LLMs • Salika Dave

A paper review summary part of my coursework in IST597: Trustworthy Machine Learning

Ignore Previous Prompt: Attack Techniques For Language Models

Fábio Perez, Ian Ribeiro

Summary

This paper is about the vulnerabilities of transformer-based large language models, specifically GPT-3, and how they can be exploited by malicious users. The authors explore two types of attacks, prompt injection and prompt manipulation, and provide examples of how these attacks can be used to generate malicious outputs from the language model. The authors propose a framework for composing adversarial prompt scenarios and evaluation methods to measure the effectiveness of different attacking techniques, which can help enhance the common understanding of language model capabilities when faced with intentional misalignment.

Results

The paper presents the results of experiments on the vulnerabilities of GPT-3 to prompt injection and prompt manipulation attacks. The authors found that even low-aptitude, but sufficiently ill-intentioned agents, can easily exploit GPT-3’s stochastic nature, creating long-tail risks. They also found that attack prompts, delimiters, temperature, harmful rogue strings, stop sequences, and prompt structure all affect the success rates of these attacks. Additionally, they found that prompt leaking is harder than goal hijacking, and that the text-davinci-002 model is the most susceptible to attacks.

Strengths

The paper highlights the need for enhanced robustness evaluation heuristics and testing methods to ensure the safe and responsible use of language models in product applications. The paper is well-written and easy to follow, with clear explanations of the concepts and techniques used. The authors provide detailed results and discussion, which are supported by tables and figures, making it easy for readers to understand the key findings and implications of the research. The paper provides a detailed analysis of the factors that affect the success rates of prompt injection and prompt manipulation attacks, including attack prompts, delimiters, temperature, harmful rogue strings, stop sequences, and prompt structure.

Possible directions for future work

While the paper stands strong in its claims, the framework relies on the assumption that the identified attacks, such as goal hijacking and prompt leaking, are comprehensive and cover the full spectrum of potential misalignment scenarios. However, it is possible that new and unforeseen attack vectors may emerge, rendering the framework incomplete in addressing all potential misalignment risks. Additionally, the framework’s effectiveness in measuring the robustness of LLMs against prompt injection attacks may be limited by the specific attack scenarios considered. It is important to ensure that the framework is adaptable and can accommodate evolving attack strategies to maintain its relevance and effectiveness in enhancing the safety of LLM applications.