A fascinating paper was put up on arXiv last month, “One pixel attack for fooling deep neural networks”.
Many computer scientists and casual users might agree that the most delightful success of deep neural networks (DNNs) thus far has been image recognition. Though I am personally not much of a photographer or social-media autobiographer, I have heard colleagues and friends rave about the capabilities of apps like Google Photos. Apparently, without any human help, Photos lets you search through your pics for arbitrary queries. There are other cool features, but I think this one already packs enough punch! Is ML finally reaching the elusive AI distinction, where it can “understand” abstract concepts such as ‘sunsets’ or ‘hiking’ and find the right context in images despite the myriad variations of locations, positions, lighting, quality and people?
A lot of research has been pumped into understanding what makes DNNs (or convolutional neural networks, CNNs, as are used for images) such a powerful learning structure. There have been a few interesting papers which attempt to extract human-readable explanations of the predictions made by a neural network. I personally find interpretability one of the most interesting parts of AI/ML research, if not the most interesting, so I will delve into that tangent in future posts. Suffice it to say here that the CNNs/DNNs currently being used in cutting-edge research are very hard to interpret.
Unsurprisingly, “tech” companies have few qualms about using things they don’t fully understand. And for good reason. Somehow, DNNs just work! The physicist in me demands, but do they? How do we know? One good way to test these inscrutable beasts is to probe them with data and measure the response. ML practitioners might recognize this exercise as testing on a holdout data set.
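To make that exercise concrete, here is a minimal sketch of holdout testing, with a synthetic dataset and a plain logistic regression standing in for the much larger data and models discussed in this post:

```python
# Minimal holdout-testing sketch: train on one slice of the data, then
# measure accuracy on a slice the model never saw during training.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in data: 1000 synthetic examples with 20 features each.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out 20% of the examples; the model never trains on them.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Throw data at the model and measure what bounces back.
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```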
In my opinion, as a philosophy of testing, cross validation or testing on a holdout set is far from satisfactory. The origins of cross validation lie in statistics, from the straightforward rationale that predictions made by a model built on one set of data can be tested on a different set to check whether it generalizes. However, models in statistics are highly reductionist and always interpretable. Sure, cross-validation was a good test, but the true test of a statistical model is always in what it is able to explain. The statistical model is often simply a means to quantify a cause-and-effect relationship where both the cause and the effect are independently well understood.
Classic ML models such as logistic regression and SVMs were born in statistics journals, with highly interpretable parameters. Yet once the emphasis shifted from explanation to prediction, it made sense to create arbitrary features and parameters with scant justification as long as they improved the predictive accuracy. With the explosion of so-called black-box models like Random Forests and DNNs in the last few years, the last explainable scaffolds were dropped and we took a collective leap of faith, putting all our eggs in the data-testing basket. In fact, when I began learning about ML a few years ago, many top practitioners were proclaiming that as long as you had a sufficiently large and accurate data set and a sufficiently complex and trainable model, almost anything could be learnt. As a theorist studying the math at the time, I agreed wholeheartedly. Now that I am building practical models myself, I find that “sufficient” is easier said than done.
To draw an analogy from physics, if we are treating the model as an object that we need to understand, testing with a validation data set is akin to throwing data at the model and measuring what bounces back. In that sense it is not very different from scattering or tomography techniques such as CT scans. Now if your model has the complexity of a flat wall, throwing a few objects at a few different positions and speeds and measuring what comes back will tell you everything there is to know about the wall – its height, width, texture and strength. But if your model has millions of parameters and looks more like the Taj Mahal, then understanding the complete shape and structure of the model is not easy. “Sufficient” is not a helpful quantification of the number of objects you need to throw at the Taj Mahal to understand it. Most statistical models are simple, like walls, while DNNs are anything but.
The answer is ‘one’
And that is quite apparent in this paper. The authors show that changing a single pixel in a 1024-pixel (32×32) image can fool the VGG-16 model.
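To give a feel for how such an attack might work, here is a rough sketch in the spirit of the paper’s differential-evolution search, run against a small stand-in classifier rather than the real VGG-16; the toy model, the function names and the search settings are my own illustrative choices, not the authors’:

```python
# Rough one-pixel attack sketch: search for a single (x, y, r, g, b)
# perturbation that lowers the classifier's confidence in the original class.
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(0)
n_classes, h, w = 10, 32, 32

# Stand-in "classifier": a fixed random linear map followed by a softmax.
W = rng.normal(size=(n_classes, h * w * 3))

def toy_predict(image):
    """Return class probabilities for a (32, 32, 3) image with values in [0, 1]."""
    logits = W @ image.ravel()
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

image = rng.random((h, w, 3))
true_class = int(np.argmax(toy_predict(image)))

def apply_pixel(image, z):
    """Overwrite one pixel; z encodes (x, y, r, g, b)."""
    perturbed = image.copy()
    x, y = int(round(z[0])), int(round(z[1]))
    perturbed[y, x] = np.clip(z[2:], 0.0, 1.0)
    return perturbed

def objective(z):
    # Untargeted attack: minimize confidence in the original class.
    # (A targeted attack would instead maximize a chosen target class.)
    return toy_predict(apply_pixel(image, z))[true_class]

bounds = [(0, w - 1), (0, h - 1), (0, 1), (0, 1), (0, 1)]
result = differential_evolution(objective, bounds, maxiter=30, popsize=20, seed=0)

adversarial = apply_pixel(image, result.x)
probs = toy_predict(adversarial)
print("original class:", true_class,
      "-> predicted class:", int(np.argmax(probs)),
      "with confidence", round(float(probs.max()), 3))
```

The targeted variant, which I return to below, simply maximizes the probability of a chosen target class instead of minimizing the probability of the original one.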
Three specific results of this paper stand out to me –
- The authors were able to fool the model for more than 66% of the images. To be completely fair, in a few of the above pictures I can almost see why the model might be confused, especially the misclassifications between dog and cat. However, some mistakes don’t seem to make any sense from a human’s perspective, e.g. horse to automobile and airplane to dog.
- On average the model was more than 97% sure of its predictions when it was being fooled. This fact undoes a lot of the credit I extended in the previous point. A single pixel can almost never convince a human to change a prediction so decisively. Even when fooled, most humans would attach a much higher degree of uncertainty to their assessment.
- Some images can be made to look like any class. In the example below, I can hardly make out the original image, which is supposed to be a dog. But apparently VGG can be convinced with near 100% certainty that this image is an airplane, bird, truck or frog, all with just one well-placed pixel (the targeted variant of the attack sketched earlier). Examples like this one are baffling and really bring into focus the question, “What exactly has the model learnt?”
So do we AI or not?
So at this point, some might wonder, is it time to dismiss all AI/ML as smoke and mirrors? A more nuanced question is: are many neural networks horrendously overfit, to the point where they have simply memorized massive datasets? I don’t want to profess any authority here; I personally have not worked with VGG or the other state-of-the-art image recognition models, so I do not have a very detailed understanding of how well these models generalize. Yet even in my experience, CNNs and DNNs are much more fragile than I would like. The models are incredibly powerful, achieving very high accuracy over millions of examples using seemingly robust validation and testing protocols. Nevertheless, a relatively small systematic change in the input data can potentially render the model almost entirely useless. In many cases it is easy to include the new data in the training examples so that the model then performs well over the space of both types of data. Yet the opacity of the learning, and the inability to transfer learning without retraining, do not inspire a lot of confidence in the model’s ability to represent human-level abstractions.
Now this could be interpreted as a classic case of the much-touted “moving goalposts” problem in Artificial Intelligence. Chess was once thought to be the line that separated true intelligence from mere calculation. However, we eventually realized that the chessmaster program was not very intelligent, just a very good calculator, even if no human could beat it at chess. We have already heard that image recognition models are now at nearly human-level capability. Am I claiming that image recognition could also turn out to be a problem that can be solved with programming rather than anything resembling intelligence?
No, I am actually claiming that image recognition is not solved, at least not with the finality of chess. The world of the chessmaster program is finite: 64 squares, 32 pieces and a dozen rules. The world of images is not finite. As this paper handily demonstrates, one likely reason we think we have reached human-level image recognition is that our testing data sets are narrow and biased. Our understanding of the model is limited to its response to our testing data set, and the bizarre gaps between human and machine responses are revealed when the model is presented with unusual or adversarial data. We presume we know which images are the hardest to understand and test our model on those, whilst not realizing that a single defective pixel might already be too much to handle. And at this level of model complexity, I am not sure it is even possible to detect all the various failure modes of a model simply by throwing cleverly designed testing data at it. Fortunately, it seems to me that some top AI researchers are also moving away from the claim that intelligence can be completely validated with “reasonable” amounts of data, though there is nothing close to a consensus.
I don’t think the goalposts have moved yet; solving image recognition, along with all the other advances in language processing, sequential learning and reinforcement learning, could still put us on the right track to understanding intelligence. The VGG model is no longer the state of the art; many other (deeper) architectures perform much better. However, the metric of their success is still accuracy on a testing data set. I think we need a more robust method of determining whether an intelligence problem has been “solved”, and my guess is that the “solution” will require more ingenuity than just deeper neural networks.