Robustness of Deep Neural Networks Against Adversarial Perturbations

Deep neural networks are notoriously brittle to small perturbations of their input data. These perturbations can be “natural variations”, such as cropping, rotation, scaling, downsampling or lossy compression (JPEG, for instance), or “adversarial variations”, where the inputs have been carefully modified in order to fool the model. Fortunately, several methods have been developed to increase the robustness of models to these variations, that is, their ability to remain insensitive to such perturbations. Data augmentation is one widely used method: it is easy to deploy and effective against natural variations, because those are cheap to compute (we are talking about rotations, scaling, brightness tweaking, etc.). Unfortunately, it doesn’t quite do the job for adversarial variations, because adversarial examples are costly to compute: the attack methods are often iterative and/or require heavy computations on the model. Today, we will focus on two other approaches that have proven effective at increasing the robustness of deep neural networks against adversarial attacks.

Adversarial attacks

First, let’s take a closer look at how adversarial attacks actually work, so we know what we are talking about. We will focus on a deep neural network used for image classification, because that is more visual. Papernot et al. [1] proposed a general framework for adversarial attacks, on which we will base our explanations:

In this framework, given an input (an image in our case), adding carefully chosen noise to it in order to fool the classifier is an iterative two-step process. First, a computation is performed on the model to determine the pixels that are most vulnerable for our purpose. Then, a perturbation is chosen and applied to those pixels. We can then check whether this was enough to fool the model, and repeat until it is (or until we exceed a maximum number of iterations fixed beforehand). Some of the most widely used methods can be analyzed under this general framework.
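The two-step loop described above can be sketched in a few lines. This is only an illustrative skeleton: `craft_adversarial`, `find_direction` and the toy classifier are hypothetical names and stand-ins, not code from the cited papers, and pixel values are assumed to lie in [0, 1].

```python
import numpy as np

def craft_adversarial(x, predict, find_direction, step_size, max_iters=50):
    """Generic attack loop: query the model for a direction, perturb, re-check."""
    original_label = predict(x)
    x_adv = np.array(x, dtype=float)
    for iteration in range(max_iters):
        # Stop as soon as the model's prediction has changed.
        if predict(x_adv) != original_label:
            return x_adv, True, iteration
        # Step 1: computation on the model to find where to perturb.
        direction = find_direction(x_adv)
        # Step 2: apply the perturbation, staying in the valid pixel range.
        x_adv = np.clip(x_adv + step_size * direction, 0.0, 1.0)
    return x_adv, bool(predict(x_adv) != original_label), max_iters

# Toy stand-ins (assumptions): class 1 if the pixel sum exceeds 1, else class 0.
def toy_predict(v):
    return int(v.sum() > 1.0)

def toy_direction(v):
    return -np.ones_like(v)  # lowering every pixel pushes toward class 0

x_adv, fooled, steps = craft_adversarial(
    np.array([0.75, 0.75]), toy_predict, toy_direction, step_size=0.25
)
```

FGSM and JSMA differ mainly in what plays the role of `find_direction` and in how large and how sparse the applied perturbation is.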

Fast Gradient Sign Method (FGSM)

The Fast Gradient Sign Method was proposed by Goodfellow et al. in 2015 [3]. It was originally a non-iterative method, and thus quite cheap to compute, but it can be turned into an iterative method for better results at a higher computational cost. It works by perturbing all pixels at once by a small amount, so the noise added to an input image is small in the L^∞ metric. To know how to perturb the input pixels, the attack computes the gradient of the loss function with respect to the input and shifts each pixel by a small step in the direction of the sign of that gradient. An example can be seen below:
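Alongside that example, the one-step update can be sketched in code. To keep the gradient analytic, the sketch assumes a tiny logistic-regression "model" with hypothetical weights `w` and bias `b`; a real attack would obtain the input gradient from the network by backpropagation. The step size `epsilon` bounds the L^∞ norm of the noise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, epsilon):
    """One FGSM step against a logistic-regression classifier (toy assumption)."""
    # Gradient of the cross-entropy loss with respect to the input x.
    grad_x = (sigmoid(w @ x + b) - y) * w
    # Move every pixel by epsilon in the direction that increases the loss.
    return np.clip(x + epsilon * np.sign(grad_x), 0.0, 1.0)

# Hypothetical weights and input, with pixel values in [0, 1].
w = np.array([2.0, -3.0, 1.0])
b = 0.0
x = np.array([0.5, 0.2, 0.4])   # classified as class 1 (probability above 0.5)

x_adv = fgsm(x, y=1.0, w=w, b=b, epsilon=0.2)
print(sigmoid(w @ x + b) > 0.5, sigmoid(w @ x_adv + b) > 0.5)
```

Every pixel moves by at most `epsilon`, which is what makes the perturbation small in the L^∞ sense even though all pixels are touched.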

Jacobian-based Saliency Map Attack (JSMA)

JSMA is another widely used method, proposed in 2015 by Papernot et al. [5]. The idea is to repeatedly find the most sensitive pair of pixels in the input image and perturb them by a large amount. The objective is to perturb as few pixels as possible, with little to no limit on the magnitude of those perturbations. In other words, the noise is optimised for the L⁰ metric (which counts the number of pixels that were modified). Some examples on MNIST can be seen below:
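To make the "few pixels, large changes" idea concrete, here is a deliberately simplified sketch in the spirit of JSMA. It is not the algorithm from [5]: it changes one feature at a time instead of a pair, uses only the gradient of the target class probability as its saliency score, and assumes a hypothetical linear softmax classifier (`W`, `b`) so that the gradient can be written by hand.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def saliency_attack(x, W, b, target, max_changes=10):
    """Greedy sparse attack: saturate the most salient untouched feature each round."""
    x_adv = np.array(x, dtype=float)
    untouched = np.ones(x_adv.shape, dtype=bool)
    for _ in range(max_changes):
        probs = softmax(W @ x_adv + b)
        if probs.argmax() == target:
            break
        # Gradient of the target-class probability w.r.t. each input feature.
        grad_target = probs[target] * (W[target] - probs @ W)
        score = np.abs(grad_target) * untouched
        i = int(score.argmax())
        if score[i] == 0.0:
            break  # no remaining feature influences the target class
        # Large perturbation: push the feature to the bound that helps the target.
        x_adv[i] = 1.0 if grad_target[i] > 0 else 0.0
        untouched[i] = False
    success = bool(softmax(W @ x_adv + b).argmax() == target)
    return x_adv, success, int(np.count_nonzero(x_adv != x))

# Hypothetical two-class model on four "pixels" in [0, 1].
W = np.array([[1.0, 0.5, 0.0, 0.0],
              [-1.0, -0.5, 0.0, 0.0]])
b = np.array([0.0, 0.2])
x = np.array([1.0, 1.0, 0.5, 0.5])   # classified as class 0

x_adv, success, n_changed = saliency_attack(x, W, b, target=1)
```

In this toy case only two of the four features are modified, each by its full range, which is the opposite trade-off from FGSM.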

Increasing the robustness

Now that we’ve seen how these attacks work, let’s talk about how to defend against them, that is, how to increase the robustness of our models and reduce their vulnerability. But first of all, why would we want to do that? Because deep neural networks are now used in various critical areas, such as autonomous vehicles and medical imaging. Say someone with malicious intentions got their hands on the inputs of the model. That person would want to make the model unusable by adding noise to the inputs in a way that is “invisible” to a human observer: you would classify the images just as easily as if nothing had happened, and you wouldn’t notice anything. But the model would perform really badly, and you wouldn’t necessarily understand why at first glance, which can be problematic. This has been done in a real-life experiment [4]: small colored stickers placed on traffic signs consistently fooled the models trying to read them, while posing no difficulty for human understanding.

So adversarial attacks can be critical for some systems, which is why we need ways to make those systems resilient. In the introduction, we talked about data augmentation and why it isn’t well suited to this particular purpose, given the computational cost of generating adversarial images. Another problem with that approach is that a model trained on an augmented dataset only learns to counter the specific attacks the dataset was augmented with. Yet there is a wide variety of attack methods beyond the two presented above, and you can’t always know beforehand which attack your model will have to deal with. So people have come up with methods that avoid this problem. The two approaches we will look at today can be summarised as follows: train the model differently (with what is called distillation), or teach it to detect adversarial samples so it can avoid them.



Defensive distillation

In Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks (2016) [1], Papernot et al. proposed using distillation to increase the robustness of models. This approach consists in performing the training in two steps. First, a deep neural network is trained on the dataset of interest as usual. Then, a second model (the distilled model) is trained, with the particularity that the labels of the images it trains on are soft labels instead of hard labels (one-hot vectors): these soft labels are the outputs of the first model on each example. For that purpose, the last layer of the first model should be a softmax layer, so that it outputs vectors representing a probability distribution (they are normalized to sum to 1). The whole process is controlled by a hyperparameter called the distillation temperature, which acts on this softmax layer. Its influence can be seen in the equation below:
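For reference, a softmax with temperature T is usually written as follows. The symbols are assumptions chosen to match the surrounding text: z_i(X) is the i-th logit (the input to the softmax layer) for input X, and N is the number of classes.

```latex
F_i(X) = \frac{e^{z_i(X)/T}}{\sum_{l=0}^{N-1} e^{z_l(X)/T}}, \qquad i \in \{0, \dots, N-1\}
```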

A regular softmax would be obtained with T=1, and we can see that as T increases, the probabilities tend to be more and more uniform (in the limit case, they are all 1/N).
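The effect of the temperature is easy to check numerically. The sketch below is a minimal NumPy implementation; the function name and the example logits are made up for illustration.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()          # shifting by a constant does not change the result
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.0]     # hypothetical logits for N = 3 classes
for T in (1.0, 10.0, 100.0):
    print(T, softmax_with_temperature(logits, T))
```

As T grows, the printed vectors get closer and closer to (1/3, 1/3, 1/3), which is the 1/N limit mentioned above.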

The whole process is represented in the picture below:

Why does it work?

The idea behind training a second “distilled” model on soft labels produced by a first, traditional model is that those labels encode more information than the hard labels the first model was trained on. In particular, they can convey that some classes are close together. For example, on MNIST, if for images of class 1 the predicted probability of class 7 is often significantly higher than for the other classes, it means that 1’s and 7’s look quite alike, and the distilled model can learn that. This allows it to learn a better representation of the input space and thus to be more resilient to small variations in the input images.
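A small numerical sketch may help to see what a soft label carries that a hard label does not. Everything here is illustrative: the teacher logits are invented, only three classes (0, 1 and 7) are shown, and `soft_label_loss` is a hypothetical helper computing the cross-entropy between the distilled model's output and the soft labels.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_label_loss(student_logits, soft_labels, temperature=1.0):
    """Mean cross-entropy between soft labels and the student's softmax output."""
    probs = softmax(student_logits, temperature)
    return float(-(soft_labels * np.log(probs)).sum(axis=-1).mean())

# Invented teacher logits for one image of a "1", over the classes (0, 1, 7).
teacher_logits = np.array([[0.0, 6.0, 4.0]])
hard_label = np.array([[0.0, 1.0, 0.0]])              # says nothing about class 7
soft_label = softmax(teacher_logits, temperature=5.0)  # class 7 gets more mass than class 0

print(hard_label, soft_label)
```

The soft label still puts most of its mass on class 1, but it also tells the second model that this image looks more like a 7 than like a 0, which is the extra information discussed above.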


Results

The authors tested the JSMA attack on regular and distilled models trained on both MNIST and CIFAR-10, and measured its success rate. The results can be seen in the chart below:

We can see that models trained with a high distillation temperature are significantly more resilient to the attack, and thus more robust than the base models. Equally interesting, the authors also showed that “adversarial gradients”, the same gradients that are used to compute adversarial examples with JSMA, are reduced by several orders of magnitude on the distilled models.

Detection of adversarial samples


In Detecting Adversarial Samples for Deep Neural Networks through Mutation Testing (2018) [2], Wang et al. outlined a new approach that allows deep neural networks to directly detect adversarial samples, which is a different way of increasing the robustness of models: if they detect an adversarial sample, they can discard it, alert an administrator or take any other action, but they will not be blindly fooled by it. For that purpose, a quantity called the “sensitivity” is computed for each sample; if it is higher than a certain threshold, the sample is considered adversarial. The expression of sensitivity can be seen below:
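One way to write this quantity, consistent with the description that follows, is given here. The notation is an assumption: f is the classifier, X_m(x) is the set of mutated samples generated from x, and K(x) is the sensitivity.

```latex
K(x) = \frac{\left|\{\, x_m \in X_m(x) \;:\; f(x_m) \neq f(x) \,\}\right|}{\left|X_m(x)\right|}
```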

What this means is that the sensitivity of a sample x is the percentage of mutated samples, among a given set of mutated samples, that are classified differently from x. This set is a small set of images obtained from x by applying some mutations. We will see what this means later.
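As a rough sketch, the sensitivity can be estimated by generating mutants and counting label changes. The classifier and the mutation used here are toy assumptions (a threshold on the mean pixel value, and small uniform noise); the paper's actual mutations are not reproduced.

```python
import numpy as np

def sensitivity(x, predict, mutate, n_mutants=100, rng=None):
    """Fraction of mutants of x whose predicted label differs from that of x."""
    rng = np.random.default_rng(0) if rng is None else rng
    original_label = predict(x)
    changed = sum(predict(mutate(x, rng)) != original_label for _ in range(n_mutants))
    return changed / n_mutants

# Toy stand-ins (assumptions, not from the paper).
def toy_predict(v):
    return int(v.mean() > 0.5)

def toy_mutate(v, rng):
    noise = rng.uniform(-0.05, 0.05, size=v.shape)
    return np.clip(v + noise, 0.0, 1.0)

far_from_boundary = np.full(16, 0.9)     # comfortably inside class 1
near_boundary = np.full(16, 0.501)       # barely inside class 1

print(sensitivity(far_from_boundary, toy_predict, toy_mutate))
print(sensitivity(near_boundary, toy_predict, toy_mutate))
```

The sample that sits right next to the decision boundary changes label under many mutations, while the one far from it never does, which is the behaviour the detector relies on.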

Mutation testing: Algorithm

The algorithm for mutation testing can be described as follows. It uses the SPRT (Sequential Probability Ratio Test), which is a hypothesis test. The hypothesis to be tested is whether the sensitivity of a given sample, K(x), is significantly higher than the normal sensitivity for the dataset, K_nor. The algorithm is tuned so that it outputs a result with 95% confidence. It iteratively generates mutated samples and asks the model to classify them, until the sequence of labels obtained allows it to accept or reject the hypothesis with the desired confidence. We will see later that it commonly takes about 50 mutated samples before the algorithm can decide, depending on the dataset and the attack. But why does it work? Adversarial attacks target “weak spots” in the representation learned by the model, so adversarial samples sit in a sort of unstable equilibrium around the original sample, which is what sensitivity detects, as represented in the picture below:
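Setting the picture aside, the sequential test described above can be sketched with a textbook Wald SPRT on a stream of "label changed / label unchanged" observations. This is a generic sketch, not the authors' exact procedure: `threshold` (the sensitivity level separating normal from adversarial, derived from K_nor), the indifference margin `delta`, and the error rates `alpha` and `beta` are all parameter names and default values assumed for illustration.

```python
import math

def sprt_is_adversarial(label_changed, threshold, delta=0.05,
                        alpha=0.05, beta=0.05, max_mutations=500):
    """Wald SPRT: decide whether the label-change rate is above or below threshold.

    label_changed: zero-argument callable that generates one mutant, classifies
    it, and returns True if its label differs from the original sample's label.
    Returns (decision, mutants_used); decision is None if no decision is reached.
    """
    p_high = threshold + delta   # change rate assumed for adversarial samples
    p_low = threshold - delta    # change rate assumed for normal samples
    if not 0.0 < p_low < p_high < 1.0:
        raise ValueError("threshold +/- delta must lie strictly between 0 and 1")
    accept_adversarial = math.log((1.0 - beta) / alpha)
    accept_normal = math.log(beta / (1.0 - alpha))
    log_ratio = 0.0
    for n in range(1, max_mutations + 1):
        if label_changed():
            log_ratio += math.log(p_high / p_low)
        else:
            log_ratio += math.log((1.0 - p_high) / (1.0 - p_low))
        if log_ratio >= accept_adversarial:
            return True, n
        if log_ratio <= accept_normal:
            return False, n
    return None, max_mutations

# Extreme toy streams: every mutant changes label, or none does.
print(sprt_is_adversarial(lambda: True, threshold=0.1))
print(sprt_is_adversarial(lambda: False, threshold=0.1))
```

The test stops as soon as the evidence is strong enough in either direction, which is why the number of mutants needed varies from sample to sample instead of being fixed in advance.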

This is verified experimentally by measuring the sensitivity of true examples from the dataset, and then of adversarial samples generated with different attack methods. The results are in the table below:


Results

For both MNIST and CIFAR-10, and for several attack methods (FGSM, JSMA and others), the tables below show the proportion of adversarial samples detected, along with other metrics:

We can see that adversarial samples are detected with fairly high accuracy (often more than 90%), while very few regular samples are flagged as adversarial. Also, as mentioned before, few mutations are generally necessary, independently of the attack! That is a significant result. Even if we could argue that generating this many mutations costs about as much as the additional computation induced by data augmentation, the latter only protects against the attacks used to augment the dataset, whereas here the defense does not depend on any particular attack, since the mutations are independent of the attacks (we’re mostly talking about natural mutations here).


Conclusion

We have seen in this article that robustness is a critical concern for today’s deep learning systems, in order to reduce their vulnerability to adversarial attacks, whose inputs are generated by purposely modifying original inputs so that the model misclassifies them. Such samples represent a serious threat to deep learning models, and the variety of attack methods available makes it almost impossible to simply train these models with data augmentation based on adversarial samples. However, several defense strategies have been proposed, notably defensive distillation and mutation testing. The first consists in training a second, “distilled” model from a first one, which gives it a higher generalization ability. The second opts for detecting adversarial samples directly by evaluating their sensitivity to mutations: a high sensitivity may indicate that the sample is adversarial, because of the inherent instability of these samples with respect to the model. Both methods show great success on various datasets and against various attacks, which is promising and encouraging for building deep learning models that are resilient to adversarial attacks.


References

[1] Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks, Papernot et al., 2016

[2] Detecting Adversarial Samples for Deep Neural Networks through Mutation Testing, Wang et al., 2018

[3] Explaining and Harnessing Adversarial Examples, Goodfellow et al., 2015

[4] Robust Physical-World Attacks on Deep Learning Visual Classification, Eykholt et al., 2018

[5] The Limitations of Deep Learning in Adversarial Settings, Papernot et al., 2015
