
3 min read

Neural Cleanse

A paper review summary part of my coursework in IST597: Trustworthy Machine Learning

Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks

Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y. Zhao


Summary

The paper proposes Neural Cleanse, a novel defense mechanism for detecting and mitigating backdoor attacks in deep neural networks, with two mitigation methods: unlearning and neuron pruning. The key contributions are a detection method that identifies backdoor triggers by analyzing the vulnerability of each output label, and mitigation methods that remove the backdoor trigger by retraining the model on clean data. The defense achieves a high true positive rate of 99.7% when evaluated on the MNIST and GTSRB datasets, and it is shown to be robust against existing attack strategies such as gradient masking, label leaking, and Trojan attacks.
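
To make the detection step concrete, below is a minimal sketch of the trigger reverse-engineering idea, assuming a PyTorch image classifier `model` that returns logits and a `clean_loader` of clean samples; the function name, hyperparameters, and tensor shapes are my own illustrative choices, not the authors' reference implementation. For a candidate target label, a small mask and pattern are optimized so that stamping them onto clean inputs flips the prediction to that label, while an L1 penalty keeps the mask small.

```python
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, clean_loader, target_label,
                             image_shape=(3, 32, 32), steps=1000,
                             lam=0.01, lr=0.1, device="cpu"):
    """Optimize a (mask, pattern) pair that flips clean inputs to target_label.

    An L1 penalty keeps the mask small; a genuinely backdoored label tends
    to admit a much smaller mask than clean labels do.
    """
    model.eval()
    # Unconstrained parameters; sigmoid keeps mask and pattern inside [0, 1].
    mask_param = torch.zeros(1, *image_shape[1:], device=device, requires_grad=True)
    pattern_param = torch.zeros(image_shape, device=device, requires_grad=True)
    opt = torch.optim.Adam([mask_param, pattern_param], lr=lr)

    data_iter = iter(clean_loader)
    for _ in range(steps):
        try:
            x, _ = next(data_iter)
        except StopIteration:
            data_iter = iter(clean_loader)
            x, _ = next(data_iter)
        x = x.to(device)

        mask = torch.sigmoid(mask_param)         # (1, H, W), broadcast over channels
        pattern = torch.sigmoid(pattern_param)   # (C, H, W)
        x_adv = (1 - mask) * x + mask * pattern  # stamp the candidate trigger

        target = torch.full((x.size(0),), target_label,
                            dtype=torch.long, device=device)
        # Push predictions toward target_label while keeping the mask sparse.
        loss = F.cross_entropy(model(x_adv), target) + lam * mask.abs().sum()

        opt.zero_grad()
        loss.backward()
        opt.step()

    return torch.sigmoid(mask_param).detach(), torch.sigmoid(pattern_param).detach()
```

Repeating this for every output label yields one trigger-mask L1 norm per label, which is the raw material for the outlier analysis discussed under Results.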

Results

The paper uses two datasets for evaluation: MNIST (handwritten digit recognition) and GTSRB (traffic sign recognition). The proposed defense identifies backdoor triggers with high accuracy, achieving an average true positive rate of 99.7% and a false positive rate of 0.1%, and it is robust against various attack strategies, including gradient masking, label leaking, and Trojan attacks. The mitigation method removes the backdoor without significantly affecting the model’s performance on normal inputs: average classification accuracy on normal inputs remains high, ranging from 95.69% to 98.54% across the datasets. The paper also evaluates the defense on a range of attack scenarios, covering both targeted and untargeted attacks as well as single-label and multi-label attacks, and it achieves high detection and mitigation accuracy in all of them.
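
The detection decision itself can be framed as outlier analysis over the per-label trigger sizes recovered above. The snippet below is a hedged sketch of a median-absolute-deviation (MAD) anomaly index in the spirit of the paper; the threshold of 2, the helper name, and the example norms are assumptions made purely for illustration.

```python
import numpy as np

def anomaly_indices(l1_norms):
    """MAD-based outlier score over per-label trigger-mask L1 norms.

    Labels whose reverse-engineered trigger is an unusually *small*
    outlier are backdoor suspects.
    """
    norms = np.asarray(l1_norms, dtype=float)
    median = np.median(norms)
    mad = 1.4826 * np.median(np.abs(norms - median))  # consistency constant for Gaussian data
    return np.abs(norms - median) / mad

# Hypothetical usage: flag labels with anomaly index > 2 whose trigger is smaller than the median.
norms = [95.2, 102.7, 88.9, 14.3, 99.8]   # made-up L1 norms for a 5-class model
scores = anomaly_indices(norms)
suspects = [i for i, (n, s) in enumerate(zip(norms, scores))
            if s > 2 and n < np.median(norms)]
print(suspects)  # -> [3]
```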

Strengths

The paper convincingly shows that the proposed defense can identify backdoor triggers with high accuracy, and that the mitigation methods nullify backdoor triggers without affecting the model’s performance on normal inputs. The mechanism appears robust against various attack strategies, including gradient masking, label leaking, and Trojan attacks. The paper also provides a detailed analysis of the computational cost of the defense, highlighting the advantages of the unlearning method over neuron pruning; a rough sketch of the unlearning step is given below.
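
As a purely illustrative sketch of the unlearning idea (not the authors' code), the function below fine-tunes the model on clean batches in which a random fraction of samples are stamped with the reverse-engineered trigger while keeping their original labels, so the model learns to ignore the trigger. The `mask` and `pattern` arguments are assumed to come from the reverse-engineering step sketched earlier; the patch fraction, learning rate, and epoch count are arbitrary placeholders.

```python
import torch
import torch.nn.functional as F

def unlearn_backdoor(model, clean_loader, mask, pattern,
                     patch_fraction=0.2, epochs=1, lr=1e-3, device="cpu"):
    """Fine-tune so inputs stamped with the reversed trigger keep their true labels."""
    model.train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mask, pattern = mask.to(device), pattern.to(device)

    for _ in range(epochs):
        for x, y in clean_loader:
            x, y = x.to(device), y.to(device)
            # Patch a random fraction of the batch with the reverse-engineered trigger.
            patched = torch.rand(x.size(0), device=device) < patch_fraction
            x = x.clone()
            x[patched] = (1 - mask) * x[patched] + mask * pattern

            loss = F.cross_entropy(model(x), y)  # labels stay correct
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```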

Possible directions for future work

While the paper proposes a strong defense mechanism against trigger-based backdoor attacks, it is currently limited to classification tasks in the vision domain. Adapting the method to other domains would require formulating the backdoor attack process and designing a metric to measure the vulnerability of specific labels. The space of potential countermeasures that an attacker could use is large. The paper studies five different countermeasures that specifically target different components/assumptions of the defense, but further exploration of other potential countermeasures remains part of future work. Finally, the method assumes that the attacker has injected a single backdoor trigger into the model. If the attacker uses multiple triggers or more sophisticated attack strategies, the detection and mitigation methods presented in the paper may not be effective.