
3 min read

Data extraction from LLMs

A paper review summary part of my coursework in IST597: Trustworthy Machine Learning

Extracting Training Data from Large Language Models

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel


Summary

This paper demonstrates that an adversary can mount a training data extraction attack against a publicly available large language model and recover individual training examples through query access alone. The attack is demonstrated on GPT-2: by querying the model, the authors extract hundreds (>600) of verbatim text sequences from its training data. The paper also analyzes the kinds of sensitive data extracted, including individuals' contact information, leaked email conversations, UUIDs, and web content that has since been removed from the internet. The results show that LLMs memorize specific training examples even when they exhibit little overfitting on average. The paper argues that training data extraction attacks are a practical threat against large neural language models and discusses defenses such as differential privacy, along with their tradeoffs.
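To make the attack pipeline more concrete, the sketch below shows what the first step, generating a large pool of candidate sequences by sampling from the model, might look like. This is a minimal illustration only, assuming the Hugging Face transformers library and the public gpt2 checkpoint; the paper's full pipeline additionally uses Internet-text prompts and other sampling refinements that are omitted here.

```python
# Minimal sketch of candidate generation, assuming the Hugging Face
# transformers library and the public "gpt2" checkpoint.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Start from the end-of-text token and sample continuations with
# top-k sampling, mirroring the "generate many candidates" idea.
input_ids = torch.tensor([[tokenizer.eos_token_id]])
with torch.no_grad():
    samples = model.generate(
        input_ids,
        do_sample=True,
        top_k=40,
        max_length=256,
        num_return_sequences=8,
        pad_token_id=tokenizer.eos_token_id,
    )
candidates = [tokenizer.decode(s, skip_special_tokens=True) for s in samples]
```

In the actual attack, many thousands of such candidates are generated and then ranked by a membership inference metric to pick out likely memorized text.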

Results

The paper demonstrates a training data extraction attack on GPT-2 in which 604 of the 1,800 manually inspected candidate samples turned out to be unique memorized training examples (roughly 33.5% precision). The best attack configuration reached 67% precision, with 67 memorized examples among its top 100 candidates. Comparing model sizes, GPT-2 XL memorized about 18x more examples than GPT-2 Small. Regarding the types of training data extracted, memorization was found to be context dependent: different kinds of prompts elicit different amounts of memorized content. The attack also recovered web content that has since been deleted, showing that models can inadvertently archive removed data.

Attack Threat Model

The adversary has black-box input-output access to a large language model, which allows it to compute the probability of arbitrary sequences and obtain next-token predictions; it cannot inspect the model's weights or hidden states. The objective is to extract memorized training data from the model. The strength of the attack is gauged by how long the extracted sequences are and by how rarely they occur in the training data (lower values of k in the paper's k-eidetic memorization definition, where a string is k-eidetic memorized if it appears in at most k training documents).
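As a rough illustration of the assumed capability, the snippet below scores an arbitrary sequence by its perplexity under GPT-2. It is a sketch only, using the local Hugging Face gpt2 checkpoint as a stand-in for query access; the attack itself never needs weights or hidden states, only this kind of probability query.

```python
# Sketch of the black-box capability the threat model assumes:
# scoring an arbitrary sequence by its perplexity under the model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text`; lower values suggest the model has seen it."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes the model return the average
        # next-token cross-entropy over the sequence.
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

print(perplexity("my favorite color is blue"))
```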

Strengths

The authors provide a thorough analysis of the security risks associated with large language models and of the methods adversaries could use to extract training data from them. The paper includes a detailed evaluation of several membership inference strategies on a large language model, which gives valuable insight into their effectiveness and the factors that affect their success. It systematically evaluates different sampling and detection techniques, showing that some identify memorized text with over 60% precision, and the analysis points to the most effective combinations. Additionally, the paper studies the impact of model size: by attacking differently sized versions of GPT-2, it shows that larger models memorize more training data, with the largest model memorizing roughly 18x more than the smallest.
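For illustration, the sketch below implements one detection heuristic in the spirit of the paper's metrics: ranking candidates by the ratio of the model's log-perplexity to the sequence's zlib compression entropy, so that text the model finds surprisingly easy (relative to how compressible it is) is flagged as likely memorized. The `perplexity` helper from the earlier sketch is assumed; the exact metrics and thresholds used in the paper are not reproduced here.

```python
# Hedged sketch of a zlib-ratio detection heuristic.
# Assumes the `perplexity` function defined in the previous sketch.
import math
import zlib

def zlib_entropy(text: str) -> int:
    # Number of bits after DEFLATE compression: a crude redundancy measure.
    return 8 * len(zlib.compress(text.encode("utf-8")))

def rank_candidates(candidates: list[str]) -> list[str]:
    # Lower log-perplexity / zlib-entropy ratio => more suspicious of
    # memorization; candidates are returned most-suspicious first.
    return sorted(
        candidates,
        key=lambda t: math.log(perplexity(t)) / zlib_entropy(t),
    )
```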

Possible directions for future work

This paper makes a valuable contribution by comprehensively evaluating different attack strategies and showing that they can uncover highly sensitive information in practice. However, the results would be better validated if the proposed attack were also tested on other language models, as the experimental study only considers GPT-2. The authors note that there may be an overlap between the model's training data and the data collected for the attack (e.g., through use of the Common Crawl dataset), which could bias the experimental results; this concern is examined only for GPT-2, and little is said about the possibility of such overlap for other LLMs. Secondly, the authors do not report the computational cost of replicating the results, especially for the sampling process.

References

USENIX Security ’21 - Extracting Training Data from Large Language Models from https://www.youtube.com/watch?v=A_P_9mmTuGA