Backdoor Attack Transferability • Salika Dave

A paper review summary part of my coursework in IST597: Trustworthy Machine Learning

Backdoor Pre-trained Models Can Transfer to All

Lujia Shen, Shouling Ji, Xuhong Zhang, Jinfeng Li, Jing Chen, Jie Shi, Chengfang Fang, Jianwei Yin, and Ting Wang

Summary

The paper presents a backdoor attack on pre-trained NLP models without the need for task-specific labels. It maps the input containing triggers directly to a predefined output representation [POR] of a target token and is transferable to any downstream task. The authors show that the trigger effectiveness is influenced by factors such as the poisoned sample percentage, fine-tuning dataset size, fine-tuning epochs, and number of trigger insertions. They also produce metrics to measure the effectiveness and stealthiness of the backdoor attack. The results show that the injected trigger does not significantly affect the normal capability of the model on clean samples, indicating that the backdoor can remain undetected. Additionally, the paper discusses the impact of trigger selection and provides insights on possible defenses against backdoor attacks in NLP. The proposed attack has been evaluated on BERT, XLNet, DeBERTa. Additionally, it outperforms other SOTA attacks RIPPLES and NeuBA.

Results

The authors conduct experiments on various pre-trained models, including BERT, XLNet, and BART, and evaluate the attack performance on different downstream tasks such as classification and named entity recognition. The results show that the injected trigger does not significantly affect the normal capability of the model on clean samples, indicating that the backdoor can remain undetected. The study also investigates the impact of various factors on the attack performance. The authors analyze trigger embedding and the corresponding pattern of response (POR), finding that the encoder (transformer layers) plays a crucial role in generating the POR. They examine the attention scores on the trigger in both the backdoor model and the clean model, revealing the importance of the attention mechanism in the transformer.The authors analyze trigger embedding and the corresponding pattern of response (POR), finding that the encoder (transformer layers) plays a crucial role in generating the POR. The paper also discusses the impact of the poisoned sample percentage, fine-tuning dataset size, fine-tuning epochs, and number of trigger insertions on trigger effectiveness . Larger fine-tuning datasets and more epochs generally lead to better performance, while triggers with a high effectiveness value (C value above 10) are considered good triggers.

Strengths

The paper proposes a new approach to backdoor attacks on pre-trained NLP models, which maps inputs containing triggers directly to a predefined output representation. This approach allows the backdoor to be transferred to a wide range of downstream tasks without prior knowledge. The paper introduces new metrics to measure the effectiveness and stealthiness of backdoor attacks in NLP. These metrics provide a quantitative evaluation of the attack performance and help in comparing different trigger types and settings. The paper provides insights into the factors that affect the attack performance, such as trigger embedding, fine-tuning dataset size, and appearance frequency of triggers. It also discusses possible defenses against backdoor attacks in NLP, contributing to the understanding of this security issue.

Possible directions for future work

One possible weakness of the paper is the limited exploration of potential countermeasures or mitigation strategies against the proposed backdoor attack. Another possible weakness is the lack of evaluation on the robustness of the proposed backdoor attack method against existing defense mechanisms. The paper primarily focuses on the attack itself and its effectiveness, but does not extensively discuss possible ways to detect or prevent such attacks. Future research could benefit from investigating robust defense mechanisms or countermeasures to mitigate the impact of backdoor attacks on pre-trained NLP models. Additionally, while the paper presents experimental results on the effectiveness of the proposed attack method, it does not provide a comprehensive analysis of the attack’s impact on the model’s performance on clean samples. While the paper mentions that the clean accuracy of the backdoor models is close to the accuracy of the clean model, further analysis and comparison of the model’s performance on clean samples before and after the backdoor attack would provide a more thorough understanding of the attack’s impact.