Abstract
Through the addition of humanly imperceptible noise to an image classified as belonging to a category $c_a$, targeted adversarial attacks can lead convolutional neural networks (CNNs) to classify a modified image as belonging to any predefined target class $c_t \neq c_a$. To achieve a better understanding of the inner workings of adversarial attacks, this study analyzes the adversarial images created by two completely opposite attacks against 10 ImageNet-trained CNNs. A total of $2 \times 437$ adversarial images are created by $\mathrm{EA}_{\mathrm{target},C}$, a black-box evolutionary algorithm (EA), and by the basic iterative method (BIM), a white-box, gradient-based attack. We inspect and compare these two sets of adversarial images from different perspectives: the behavior of CNNs at smaller image regions, the image noise frequency, the adversarial image transferability, the image texture change, and penultimate CNN layer activations. We find that texture change is a side effect rather than a means for the attacks, and that $c_t$-relevant features only build up significantly from image regions of size $56 \times 56$ onwards. In the penultimate CNN layers, both attacks increase the activation of units that are positively related to $c_t$ and of units that are negatively related to $c_a$. In contrast to the white-noise nature of $\mathrm{EA}_{\mathrm{target},C}$, BIM predominantly introduces low-frequency noise. BIM affects the original $c_a$ features more than $\mathrm{EA}_{\mathrm{target},C}$ does, thus producing slightly more transferable adversarial images. However, the transferability of both attacks is low, since the attacks' $c_t$-related information is specific to the output layers of the targeted CNN. We find that the adversarial images are actually more transferable at regions of size $56 \times 56$ than at full scale.