One-pixel attack, or how to trick a neural network

Let's get acquainted with an attack on neural networks that causes classification errors with minimal external interference. Imagine for a moment that you are the neural network: while sipping a cup of aromatic coffee, you classify images of cats with more than 90 percent accuracy, never suspecting that a "one-pixel attack" has turned all your "cats" into trucks.

Now let's pause, set the coffee aside, import the libraries we need, and analyze how such one-pixel attacks work.

The purpose of this attack is to make the algorithm (the neural network) give an incorrect answer. Below we will see this with several different convolutional neural network models. Using differential evolution, a method of multidimensional mathematical optimization, we find a special pixel that changes the image so that the neural network starts to misclassify it, even though the algorithm previously "recognized" the same image correctly and with high confidence.

Import the libraries:

# Python Libraries
%matplotlib inline
import pickle
import numpy as np
import pandas as pd
import matplotlib
from keras.datasets import cifar10
from keras import backend as K

# Custom Networks
from networks.lenet import LeNet
from networks.pure_cnn import PureCnn
from networks.network_in_network import NetworkInNetwork
from networks.resnet import ResNet
from networks.densenet import DenseNet
from networks.wide_resnet import WideResNet
from networks.capsnet import CapsNet

# Helper functions
from differential_evolution import differential_evolution
import helper

matplotlib.style.use('ggplot')

For our experiment, we will load the CIFAR-10 dataset containing real-world images divided into 10 classes.

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
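As a quick sanity check of what we just loaded, the standard Keras CIFAR-10 arrays have the following shapes:

# 32x32 RGB images; labels are integers 0-9 stored as a column vector
print(x_train.shape, y_train.shape)  # (50000, 32, 32, 3) (50000, 1)
print(x_test.shape, y_test.shape)    # (10000, 32, 32, 3) (10000, 1)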

Let's look at an image by its index. For example, this horse.

image_id = 99 # Image index in the test set
helper.plot_image(x_test[image_id])



We will have to search for that one powerful pixel that can change the network's response, so it's time to write a function that modifies one or more pixels of an image.

def perturb_image(xs, img):
    # If this function is passed just one perturbation vector,
    # pack it in a list to keep the computation the same
    if xs.ndim < 2:
        xs = np.array([xs])
    
    # Copy the image n == len(xs) times so that we can 
    # create n new perturbed images
    tile = [len(xs)] + [1]*(xs.ndim+1)
    imgs = np.tile(img, tile)
    
    # Make sure to floor the members of xs as int types
    xs = xs.astype(int)
    
    for x,img in zip(xs, imgs):
        # Split x into an array of 5-tuples (perturbation pixels)
        # i.e., [[x,y,r,g,b], ...]
        pixels = np.split(x, len(x) // 5)
        for pixel in pixels:
            # At each pixel's x,y position, assign its rgb value
            x_pos, y_pos, *rgb = pixel
            img[x_pos, y_pos] = rgb
    
    return imgs

Let's check it out! We change one pixel of our horse, at coordinates (16, 16), to yellow.

image_id = 99 # Image index in the test set
pixel = np.array([16, 16, 255, 255, 0]) # pixel = x,y,r,g,b
image_perturbed = perturb_image(pixel, x_test[image_id])[0]

helper.plot_image(image_perturbed)
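Note that perturb_image also accepts a whole batch of perturbation vectors, one per row, and returns one perturbed copy of the image per vector; this is exactly how differential evolution will call it later. A quick sketch:

# Two perturbation vectors at once -> two independently perturbed copies
pixels = np.array([[16, 16, 255, 255, 0],    # yellow pixel at (16, 16)
                   [ 5, 20,   0,   0, 255]]) # blue pixel at (5, 20)
images_perturbed = perturb_image(pixels, x_test[image_id])
print(images_perturbed.shape)  # (2, 32, 32, 3)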



To demonstrate the attack, we need to load neural network models pre-trained on the CIFAR-10 dataset. We will use two models, lenet and resnet, but you can use others in your experiments by uncommenting the corresponding lines of code.

lenet = LeNet()
resnet = ResNet()

models = [lenet, resnet]
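If the pre-trained weights of the other imported architectures are present in your copy of the repository, the corresponding lines might look like this (commented out here so that only the two models above are loaded):

# Optional models to attack, assuming their pre-trained weights are available:
# pure_cnn = PureCnn()
# net_in_net = NetworkInNetwork()
# densenet = DenseNet()
# wide_resnet = WideResNet()
# capsnet = CapsNet()
# models = [lenet, pure_cnn, net_in_net, resnet, densenet, wide_resnet, capsnet]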

After loading the models, we evaluate each of them on the test images to make sure that we attack only images that are classified correctly. The code below displays the accuracy and number of parameters of each model.

network_stats, correct_imgs = helper.evaluate_models(models, x_test, y_test)

correct_imgs = pd.DataFrame(correct_imgs, columns=['name', 'img', 'label', 'confidence', 'pred'])

network_stats = pd.DataFrame(network_stats, columns=['name', 'accuracy', 'param_count'])

network_stats
Evaluating lenet
Evaluating resnet

Out[11]:

   name     accuracy   param_count
0  lenet    0.748      62006
1  resnet   0.9231     470218


All such attacks fall into two classes: white-box and black-box. In the white-box case, we know everything about the algorithm and the model we are dealing with. In the black-box case, all we have access to is the input (an image) and the output (the probabilities of belonging to each class). The one-pixel attack is a black-box attack.
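To make the black-box setting explicit: the attack only ever interacts with the model through its predictions. A hypothetical wrapper that exposes nothing else would be enough (illustration only; the article's models already provide predict and predict_one):

# Illustration of black-box access: images in, class probabilities out.
class BlackBoxModel:
    def __init__(self, model):
        self.name = model.name      # used only for bookkeeping in the results
        self._model = model         # weights and architecture are never inspected

    def predict(self, imgs):
        return self._model.predict(imgs)       # batch of images -> (n, 10) probabilities

    def predict_one(self, img):
        return self._model.predict_one(img)    # single image -> 10 class probabilities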

In this article we consider two variants of the one-pixel attack: untargeted and targeted. In the untargeted case it does not matter which class the neural network assigns our cat to, as long as it is not the cat class. A targeted attack applies when we want our cat to become a truck, and only a truck.

But how do we find the very pixels whose change flips the class of the image? How do we find a pixel that makes the one-pixel attack possible and successful? Let's formulate this as an optimization problem, in very simple terms: in an untargeted attack we minimize the confidence in the true class, and in a targeted attack we maximize the confidence in the target class.
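To make this concrete, the two objectives can be sketched as explicit functions of the perturbation vector (illustration only; the predict_classes function defined below folds both cases into a single minimization):

# x is a NumPy array [x_pos, y_pos, r, g, b], as in the examples above.

def untargeted_objective(x, img, true_class, model):
    # Lower confidence in the true class => better untargeted attack
    return model.predict(perturb_image(x, img))[0, true_class]

def targeted_objective(x, img, target_class, model):
    # Higher confidence in the target class => better targeted attack,
    # written as minimization of the complement
    return 1 - model.predict(perturb_image(x, img))[0, target_class]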

When carrying out such attacks, it is difficult to optimize this function using gradients, since we only have black-box access to the model. We need an optimization algorithm that does not rely on the smoothness of the function.

Recall that our experiment uses the CIFAR-10 dataset, which contains real-world images, 32 x 32 pixels in size, divided into 10 classes. This means the pixel coordinates are discrete integers from 0 to 31 and the color intensities range from 0 to 255, so the function we optimize is not expected to be smooth but rather jagged, as shown below:



That is why we use the differential evolution algorithm.
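If differential evolution is new to you, here is a tiny self-contained example on a deliberately jagged one-dimensional function, using the stock SciPy implementation (the attack itself relies on the slightly modified copy bundled with the repository, which we imported above):

# Toy run: DE keeps a population of candidate solutions and improves them by
# mutation and recombination, with no gradients involved.
from scipy.optimize import differential_evolution as scipy_de

def jagged(v):
    x = int(v[0])                      # discretize, just like pixel coordinates
    return abs(x - 20) + 3 * (x % 5)   # step-like, non-smooth, minimum at x = 20

result = scipy_de(jagged, bounds=[(0, 32)], maxiter=100, popsize=20, seed=0)
print(int(result.x[0]), result.fun)    # should land on (or near) x = 20, value 0.0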

Back to the code: let's write a function that returns the model's confidence (predicted probability) for a given class. When the target class is the true class, we want to minimize this function so that the model becomes confident in some other, incorrect class.

def predict_classes(xs, img, target_class, model, minimize=True):
    # Perturb the image with the given pixel(s) x and get the prediction of the model
    imgs_perturbed = perturb_image(xs, img)
    predictions = model.predict(imgs_perturbed)[:,target_class]
    # This function should always be minimized, so return its complement if needed
    return predictions if minimize else 1 - predictions

image_id = 384
pixel = np.array([16, 13,  25, 48, 156])
model = resnet

true_class = y_test[image_id, 0]
prior_confidence = model.predict_one(x_test[image_id])[true_class]
confidence = predict_classes(pixel, x_test[image_id], true_class, model)[0]

print('Confidence in true class', class_names[true_class], 'is', confidence)
print('Prior confidence was', prior_confidence)
helper.plot_image(perturb_image(pixel, x_test[image_id])[0])

Confidence in true class bird is 0.00018887444
Prior confidence was 0.70661753



We will need the next function to check the attack success criterion; it returns True when the perturbation was enough to fool the model.

def attack_success(x, img, target_class, model, targeted_attack=False, verbose=False):
    # Perturb the image with the given pixel(s) and get the prediction of the model
    attack_image = perturb_image(x, img)
    confidence = model.predict(attack_image)[0]
    predicted_class = np.argmax(confidence)
    
    # If the prediction is what we want (misclassification or 
    # targeted classification), return True
    if verbose:
        print('Confidence:', confidence[target_class])
    if ((targeted_attack and predicted_class == target_class) or
        (not targeted_attack and predicted_class != target_class)):
        return True
    # NOTE: return None otherwise (not False), due to how Scipy handles its callback function

Let's see the success-criterion function in action. For this demonstration we assume an untargeted attack.

image_id = 541
pixel = np.array([17, 18, 185, 36, 215])
model = resnet

true_class = y_test[image_id, 0]
prior_confidence = model.predict_one(x_test[image_id])[true_class]
success = attack_success(pixel, x_test[image_id], true_class, model, verbose=True)

print('Prior confidence', prior_confidence)
print('Attack success:', success == True)
helper.plot_image(perturb_image(pixel, x_test[image_id])[0])
Confidence: 0.07460087
Prior confidence 0.50054216
Attack success: True



It's time to put all the puzzle pieces together. We will use a slightly modified version of SciPy's differential evolution implementation.

def attack(img_id, model, target=None, pixel_count=1, 
           maxiter=75, popsize=400, verbose=False):
    # Change the target class based on whether this is a targeted attack or not
    targeted_attack = target is not None
    target_class = target if targeted_attack else y_test[img_id, 0]
    
    # Define bounds for a flat vector of x,y,r,g,b values
    # For more pixels, repeat this layout
    bounds = [(0,32), (0,32), (0,256), (0,256), (0,256)] * pixel_count
    
    # Population multiplier, in terms of the size of the perturbation vector x
    popmul = max(1, popsize // len(bounds))
    
    # Format the predict/callback functions for the differential evolution algorithm
    def predict_fn(xs):
        return predict_classes(xs, x_test[img_id], target_class, 
                               model, target is None)
    
    def callback_fn(x, convergence):
        return attack_success(x, x_test[img_id], target_class, 
                              model, targeted_attack, verbose)
    
    # Call Scipy's Implementation of Differential Evolution
    attack_result = differential_evolution(
        predict_fn, bounds, maxiter=maxiter, popsize=popmul,
        recombination=1, atol=-1, callback=callback_fn, polish=False)

    # Calculate some useful statistics to return from this function
    attack_image = perturb_image(attack_result.x, x_test[img_id])[0]
    prior_probs = model.predict_one(x_test[img_id])
    predicted_probs = model.predict_one(attack_image)
    predicted_class = np.argmax(predicted_probs)
    actual_class = y_test[img_id, 0]
    success = predicted_class != actual_class
    cdiff = prior_probs[actual_class] - predicted_probs[actual_class]

    # Show the best attempt at a solution (successful or not)
    helper.plot_image(attack_image, actual_class, class_names, predicted_class)

    return [model.name, pixel_count, img_id, actual_class, predicted_class, success, cdiff, prior_probs, predicted_probs, attack_result.x]

It's time to share the results of the attack and watch how changing only one pixel turns a frog into a dog, a cat into a frog, and a car into an airplane. And the more pixels we are allowed to change, the higher the likelihood of a successful attack on any given image.



Let's demonstrate a successful attack on a frog image using the resnet model. We should see the confidence in the true class decline over several iterations.

image_id = 102
pixels = 1 # Number of pixels to attack
model = resnet

_ = attack(image_id, model, pixel_count=pixels, verbose=True)

Confidence: 0.9938618
Confidence: 0.77454716
Confidence: 0.77454716
Confidence: 0.77454716
Confidence: 0.77454716
Confidence: 0.77454716
Confidence: 0.53226393
Confidence: 0.53226393
Confidence: 0.53226393
Confidence: 0.53226393
Confidence: 0.4211318



Those were examples of an untargeted attack; now we will conduct a targeted attack and choose the class we want the model to assign to the image. This task is much harder than the previous one, since we will make the neural network classify an image of a ship as an automobile and a horse as a cat.



Below we will try to get lenet to classify an image of a ship as an automobile.

image_id = 108
target_class = 1 # Integer in range 0-9
pixels = 3
model = lenet

print('Attacking with target', class_names[target_class])
_ = attack(image_id, model, target_class, pixel_count=pixels, verbose=True)
Attacking with target automobile
Confidence: 0.044409167
Confidence: 0.044409167
Confidence: 0.044409167
Confidence: 0.054611664
Confidence: 0.054611664
Confidence: 0.054611664
Confidence: 0.054611664
Confidence: 0.054611664
Confidence: 0.054611664
Confidence: 0.054611664
Confidence: 0.054611664
Confidence: 0.054611664
Confidence: 0.054611664
Confidence: 0.054611664
Confidence: 0.054611664
Confidence: 0.081972085
Confidence: 0.081972085
Confidence: 0.081972085
Confidence: 0.081972085
Confidence: 0.1537778
Confidence: 0.1537778
Confidence: 0.1537778
Confidence: 0.22246778
Confidence: 0.23916133
Confidence: 0.25238588
Confidence: 0.25238588
Confidence: 0.25238588
Confidence: 0.44560355
Confidence: 0.44560355
Confidence: 0.44560355
Confidence: 0.5711696



Having dealt with individual attacks, let's collect statistics: for each model we attack a sample of images, changing 1, 3, or 5 pixels per image. In this article we show only the final results for the ResNet architecture, without walking the reader through every iteration, since running them all takes a lot of time and computational resources.

def attack_all(models, samples=500, pixels=(1,3,5), targeted=False, 
               maxiter=75, popsize=400, verbose=False):
    results = []
    for model in models:
        model_results = []
        valid_imgs = correct_imgs[correct_imgs.name == model.name].img
        img_samples = np.random.choice(valid_imgs, samples, replace=False)
        
        for pixel_count in pixels:
            for i, img_id in enumerate(img_samples):
                print('\n', model.name, '- image', img_id, '-', i+1, '/', len(img_samples))
                targets = [None] if not targeted else range(10)
                
                for target in targets:
                    if targeted:
                        print('Attacking with target', class_names[target])
                        if target == y_test[img_id, 0]:
                            continue
                    result = attack(img_id, model, target, pixel_count, 
                                    maxiter=maxiter, popsize=popsize, 
                                    verbose=verbose)
                    model_results.append(result)
                    
        results += model_results
        helper.checkpoint(results, targeted)
    return results

untargeted = attack_all(models, samples=100, targeted=False)

targeted = attack_all(models, samples=10, targeted=True)

This is how we test whether a network can be discredited: we run the attack algorithm on many images and measure how it degrades the prediction quality of the recognition models.

Let's see the final results.

untargeted, targeted = helper.load_results()

columns = ['model', 'pixels', 'image', 'true', 'predicted', 'success', 'cdiff', 'prior_probs', 'predicted_probs', 'perturbation']

untargeted_results = pd.DataFrame(untargeted, columns=columns)
targeted_results = pd.DataFrame(targeted, columns=columns)

The tables below show that for the ResNet model, whose accuracy is 0.9231, changing just a few pixels of an image yields a considerable share of successful attacks (attack_success_rate): even the harder targeted attack succeeds in 14-22% of cases, and the untargeted attack in up to 79%.

helper.attack_stats(targeted_results, models, network_stats)
Out[26]:
   model   accuracy  pixels  attack_success_rate
0  resnet  0.9231    1       0.144444
1  resnet  0.9231    3       0.211111
2  resnet  0.9231    5       0.222222

helper.attack_stats(untargeted_results, models, network_stats)
Out[27]:
   model   accuracy  pixels  attack_success_rate
0  resnet  0.9231    1       0.34
1  resnet  0.9231    3       0.79
2  resnet  0.9231    5       0.79
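helper.attack_stats comes with the companion repository; a rough equivalent, assuming only the result columns defined above, could be computed directly with pandas (a hypothetical re-implementation, not the helper's actual code):

# Success rate per model and pixel count; the helper additionally merges in
# the model accuracy from network_stats.
rough_stats = (untargeted_results
               .groupby(['model', 'pixels'])['success']
               .mean()
               .reset_index()
               .rename(columns={'success': 'attack_success_rate'}))
print(rough_stats)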

In your own experiments you are free to try other neural network architectures; there are a great many of them today.



Neural networks have wrapped the modern world in invisible threads. Services where AI restyles users' photos in the manner of great artists have existed for a long time, and today algorithms themselves can paint pictures, compose musical masterpieces, and write books and even film scripts.

Fields such as computer vision, face recognition, self-driving vehicles, and disease diagnosis make important decisions and have no right to make mistakes; interference with the operation of their algorithms can lead to disastrous consequences.

The one-pixel attack is one such spoofing technique. To test how easily a network can be discredited, we implemented the attack and measured its effect on the prediction quality of the recognition models. The results show that the convolutional neural network architectures we used are vulnerable to the one-pixel attack, which replaces a single pixel in order to fool the recognition algorithm.

The article was prepared by Alexander Andronic and Adrey Cherny-Tkach as part of an internship at Data4.
