'Typographic attack': pen and paper fool AI into thinking apple is an iPod

Photograph: Alexander Weickart/Alamy

As artificial intelligence systems go, it is pretty smart: show Clip a picture of an apple and it can recognise that it is looking at a fruit. It can even tell you which one, and sometimes go as far as differentiating between varieties.

But even the cleverest AI can be fooled with the simplest of hacks. If you write out the word “iPod” on a sticky label and paste it over the apple, Clip does something odd: it decides, with near certainty, that it is looking at a mid-00s piece of consumer electronics. In another test, pasting dollar signs over a picture of a dog caused it to be recognised as a piggy bank.

OpenAI, the machine learning research organisation that created Clip, calls this weakness a “typographic attack”. “We believe attacks such as those described above are far from simply an academic concern,” the organisation said in a paper published this week. “By exploiting the model’s ability to read text robustly, we find that even photographs of handwritten text can often fool the model. This attack works in the wild … but it requires no more technology than pen and paper.”

Like GPT-3, the last AI system made by the lab to hit the front pages, Clip is more a proof of concept than a commercial product. But both have made huge advances in what was thought possible in their domains: GPT-3 famously wrote a Guardian comment piece last year, while Clip has shown an ability to recognise the real world better than almost all similar approaches.

While the lab’s latest discovery raises the prospect of fooling AI systems with nothing more complex than a T-shirt, OpenAI says the weakness is a reflection of some underlying strengths of its image recognition system. Unlike older AIs, Clip is capable of thinking about objects not just on a visual level, but also in a more “conceptual” way. That means, for instance, that it can understand that a photo of Spider-man, a stylised drawing of the superhero, or even the word “spider” all refer to the same basic thing – but also that it can sometimes fail to recognise the important differences between those categories.
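To make the idea concrete, here is a toy sketch in Python. It is not OpenAI's code, and every number in it is invented for illustration. It assumes a made-up three-dimensional "concept space" and assumes that written text pushes an image's representation towards the concept the text names, which is one simple way to picture why a handwritten label can outweigh what the photo actually shows.

```python
import math

# Hypothetical concept space: [apple-ness, iPod-ness, piggy-bank-ness].
# Real models use far larger representations; these values are invented.
LABELS = {
    "apple": [1.0, 0.0, 0.0],
    "iPod": [0.0, 1.0, 0.0],
    "piggy bank": [0.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores, temperature=0.1):
    """Turn similarity scores into probabilities that sum to 1."""
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_vector):
    """Score an image vector against every label and return probabilities."""
    names = list(LABELS)
    sims = [cosine(image_vector, LABELS[name]) for name in names]
    return dict(zip(names, softmax(sims)))


def top_label(image_vector):
    """Return the label with the highest probability."""
    probs = classify(image_vector)
    return max(probs, key=probs.get)


# Assumed representation of a plain photo of an apple.
plain_apple = [0.9, 0.1, 0.1]

# Assumption: a handwritten "iPod" note adds a strong push along the
# iPod direction, because the model reads the text as well as the picture.
written_ipod = [0.0, 2.0, 0.0]
labelled_apple = [p + w for p, w in zip(plain_apple, written_ipod)]

if __name__ == "__main__":
    print("plain apple ->", top_label(plain_apple))
    print("apple with note ->", top_label(labelled_apple))
```

In this sketch the plain apple comes out as "apple" and the apple with the note comes out as "iPod", simply because the assumed text contribution is larger than the visual one. How strongly real text influences the real model is not something this toy can show.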

“We discover that the highest layers of Clip organise images as a loose semantic collection of ideas,” OpenAI says, “providing a simple explanation for both the model’s versatility and the representation’s compactness”. In other words, much as human brains are thought to work, the AI thinks about the world in terms of ideas and concepts, rather than purely visual structures.

But that shorthand can also lead to problems, of which “typographic attacks” are just the top level. The “Spider-man neuron” in the neural network can be shown to respond to the collection of ideas relating to Spider-man and spiders, for instance; but other parts of the network group together concepts that may be better separated out.

“We have observed, for example, a ‘Middle East’ neuron with an association with terrorism,” OpenAI writes, “and an ‘immigration’ neuron that responds to Latin America. We have even found a neuron that fires for both dark-skinned people and gorillas, mirroring earlier photo tagging incidents in other models we consider unacceptable.”

As far back as 2015, Google had to apologise for automatically tagging images of black people as “gorillas”. In 2018, it emerged the search engine had never actually solved the underlying issues with its AI that had led to that error: instead, it had simply manually intervened to prevent it ever tagging anything as a gorilla, no matter how accurate, or not, the tag was.