
3 min read

Shielding DL Models from Adversaries

A paper review summary written as part of my coursework in IST597: Trustworthy Machine Learning

Towards Deep Learning Models Resistant to Adversarial Attacks

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu

PDF | Code (MNIST Challenge) | Code (CIFAR Challenge)

Summary

The paper studies how to make neural networks resistant to adversarial attacks by casting robust training as a saddle-point (min-max) optimization problem. Two adversaries are used as benchmarks: the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD). On MNIST, the authors show that a PGD adversary fits this framework well. They first focus on what it takes to guarantee that adversarially trained models are robust, using the saddle-point / min-max formulation to reason about robustness against adversarial attacks, and they examine empirical risk minimization and its shortcomings when applied directly to adversarial training. A notable finding is that, for a network to reliably withstand a strong adversarial attack, it needs substantially larger capacity than it would for benign samples alone, so that it can cope with the perturbations in adversarial examples.
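For context, the saddle-point formulation at the heart of the paper can be written as follows (notation as in the paper: θ are the network parameters, D the data distribution, S the set of allowed perturbations, and L the loss):

```latex
\min_{\theta} \; \rho(\theta),
\qquad
\rho(\theta) \;=\; \mathbb{E}_{(x, y) \sim \mathcal{D}}
\left[ \max_{\delta \in \mathcal{S}} L(\theta,\, x + \delta,\, y) \right]
```

Below is a minimal sketch of the PGD adversary that approximately solves the inner maximization. This is my own PyTorch illustration rather than the authors' released code, and the default epsilon, step size, and iteration count are assumptions roughly in line with the paper's MNIST setup.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.3, alpha=0.01, steps=40):
    """Projected Gradient Descent adversary under an l_inf ball of radius eps."""
    # Random start inside the l_inf ball around the clean input.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()

    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend along the gradient sign, then project back onto the eps-ball.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)

    return x_adv.detach()
```

Adversarial training then amounts to feeding the network `x_adv` instead of `x` when computing the training loss, which is exactly the outer minimization in the formula above.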

Results

The methods described are applied to two datasets, MNIST and CIFAR10. As the graphical results suggest, the MNIST model trained with adversarial samples achieves an accuracy of more than 89% and is claimed to withstand white-box attacks from an iterative adversary. The CIFAR10 model achieves an accuracy of 46% against simulated attacks of the same nature. Against weaker black-box/transfer attacks, the networks achieve accuracies of 95% (MNIST) and 64% (CIFAR10).

Strengths

The techniques proposed in this paper demonstrate that deep learning models can be made robust against first-order adversarial attacks and, by extension, zeroth-order attacks. Additionally, the authors present their results precisely and coherently, both on how the variation in the loss function differs between standard and adversarial training and on how the decision boundaries differ based on the varied inputs.

Possible directions for future work

While the paper stands strong in the claims made, one factor possibly not considered is the computational cost incurred: the inner maximization involves multiple optimization steps for every training sample (see the sketch below). Additionally, the results are validated only on two datasets, MNIST and CIFAR10. It would be helpful to see these claims validated on other datasets popularly used for benchmarking, such as ImageNet, COCO, or Pascal VOC.
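To make the cost point concrete, here is a rough sketch of one adversarial training step that reuses the hypothetical `pgd_attack` above (again my own PyTorch illustration, not the authors' implementation). With a 40-step PGD adversary, every batch pays for roughly 40 extra forward/backward passes before the single parameter update.

```python
def adversarial_training_step(model, optimizer, x, y):
    # Inner maximization: ~`steps` extra forward/backward passes per batch.
    model.eval()                      # keep normalization statistics fixed while attacking
    x_adv = pgd_attack(model, x, y)
    model.train()

    # Outer minimization: one ordinary update, computed on the adversarial batch.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```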

Key Takeaways

This paper shows encouraging results for robust CIFAR10 models, though with still much room for improvement. The paper argues that the optimization-based view of adversary power, in which adversaries rely on first-order attacks, is the more suitable one. It also provides an important insight: increasing network capacity and strengthening the adversary we train against improves resistance to transfer attacks.