The scariest cat you’ll see this Halloween
The Cat Sìth is a fairy creature from Celtic mythology, said to resemble a large black cat with a white spot on its chest. During Samhain, the Gaelic festival that the Catholic church later turned into Halloween, it roams the land. Leave a saucer of milk out and your house will be blessed; don't, and your cows' udders will dry up.
Halloween and cats have been connected ever since. The internet is of course especially well suited to the delivery of cat videos, so with all the chatter around Artificial Intelligence, I wondered: what would make a good Halloween cat video? The rest of this post explains how to use a pre-trained image neural net to generate recursive movies, but that's not what you came for, so without further ado, here it is:
Let's retrace how we got here. It all begins with computer vision, the area where deep learning first produced outsized results. Researchers used to toil just to make computers recognize handwritten digits; now a Python script on a laptop can tell cats from dogs, or indeed hot dogs from not hot dogs.
Deep learning worked on images because of that first word: deep. Neural networks had been around for a long time, but new hardware made it possible to add many more layers (and the internet made it possible to train on millions of images). The way these layers organize themselves is somewhat similar to how we think the visual cortex works: the lowest layer takes in the raw visual data, the pixels, does some processing, and hands the results to the next layer. That layer does its own processing and passes the results on, and so on up the stack.
With each layer, the level of abstraction increases. A neuron on the lowest layer might be activated by a certain texture; a neuron at the highest level might fire when the system sees a (hot) dog.
A few years ago, Google published a nice library called Lucid that allows us to visualize what these neurons do by figuring out an image that would maximally trigger a specific neuron. I copied their accompanying notebook and added the tiling to create the image below — you can find it here.
In the image above, each row contains images sampled from one layer of the network, and each square renders an image that would maximally activate one neuron in that layer. It's a good way to explore what makes a network like this tick. The lowest layers, as expected, detect textures, while the highest look for more complicated patterns, with reflections of eyes and faces popping up, and everything in between.
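Lucid's actual API is more involved, but conceptually this kind of feature visualization is just gradient ascent on the input pixels. A toy numpy sketch, where the "network" is a single hypothetical linear neuron rather than a real vision model:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "neuron": a fixed linear filter over a flattened 8x8 image.
w = rng.standard_normal(64)

def activation(x):
    return float(w @ x)

# Feature visualization by gradient ascent: start from faint noise and
# repeatedly nudge the pixels in the direction that raises the activation.
x = rng.standard_normal(64) * 0.01
start = activation(x)
for _ in range(100):
    grad = w                                 # for a linear neuron, d(w @ x)/dx = w
    x = np.clip(x + 0.1 * grad, -1.0, 1.0)   # keep pixel values in a valid range

print(activation(x) > start)  # True: the image now strongly excites the neuron
```

For a real network the gradient comes from backpropagation rather than a closed form, but the loop is the same: optimize the image, not the weights.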
A way to summarize this is to say that the lower layers express textures, the middle layers style, and the highest layers content. Gatys et al. explored in their 2015 paper A Neural Algorithm of Artistic Style how this insight lets us take the style from one image and apply it to a different image. For example, we can take a picture of the Neckar river in the city of Tübingen and render it as if Van Gogh had painted it:
The algorithm described in the paper combines several loss components, but a simplified version is this: start with a bunch of noise, then change that noise so that the lower-to-middle-level neurons activate as if they were looking at a Van Gogh painting, while the middle-to-higher-level neurons activate as if they were looking at a picture of Tübingen. So you get the style of Van Gogh, but the contents of Tübingen.
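The "activate as if looking at a painting" part is usually measured with Gram matrices of layer activations, which is what Gatys et al. use. A minimal numpy sketch of that style loss:

```python
import numpy as np

def gram_matrix(features):
    # features: (channels, height, width) activations from one layer
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    # Channel-by-channel correlations; the spatial layout is averaged
    # away, which is why Gram matrices capture style, not content.
    return f @ f.T / (h * w)

def style_loss(features_a, features_b):
    ga, gb = gram_matrix(features_a), gram_matrix(features_b)
    return float(np.mean((ga - gb) ** 2))

rng = np.random.default_rng(1)
a = rng.standard_normal((16, 8, 8))
b = rng.standard_normal((16, 8, 8))
print(style_loss(a, a), style_loss(a, b) > 0)  # 0.0 True
```

In a full implementation this loss is summed over several layers, and the optimizer changes the input image to push the loss down.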
So that's what powers that Snapchat filter. Another way to put what is happening here: each layer in the network captures the essence of the picture at a particular level of abstraction, and by taking the essence of the highest levels from the content picture and the essence of the lowest levels from the style picture, we achieve style transfer. So what happens if we take the essence of all layers from just one picture? Would that simply reproduce that picture?
It turns out it doesn’t. Even at the highest level of abstraction, the layers of the network still only learn patterns that are fairly local. For example, if we take a picture of a pile of leaves:
It already looks fairly regular, but if we ask a network to generate something that makes all neurons fire the same way they do for this image, we get:
This definitely captures some of the leaf-ness of the original picture, but nothing really stands out. One trick Google used in their Lucid library is to start with a small image and, while optimizing, slowly upsample it until it reaches the target size. This gives the network a chance to lay out the overall structure of the image in the first iterations and to fill in the details as the image expands. If we apply this trick to our leaves image, we get something a bit better:
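That grow-while-optimizing loop can be sketched as follows; the `optimize` function here is a hypothetical stand-in for the real style optimization step:

```python
import numpy as np

def upsample(img, factor=2):
    # Nearest-neighbour upsampling: repeat every pixel factor x factor times
    return np.kron(img, np.ones((factor, factor)))

def optimize(img, steps):
    # Stand-in for the real optimization loop that matches activations
    return img + 0.01 * np.random.default_rng(0).standard_normal(img.shape)

# Start small and alternate optimizing and upsampling up to the target size,
# so early iterations lay out structure and later ones fill in detail.
img = np.random.default_rng(2).standard_normal((16, 16))
while img.shape[0] < 128:
    img = optimize(img, steps=50)
    img = upsample(img)

print(img.shape)  # (128, 128)
```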
At first glance this does look like a stack of leaves. Only when you look more closely and try to follow the outlines of the leaves do you see that these aren't actually leaves, just patterns continuous enough to create the illusion of distinct objects. We can capitalize on this effect by continuing to zoom in and appending each new frame to a GIF. The image below is a detail of a movie generated that way:
You can apply this process to any image. In general, the more the input image resembles a tessellation, the smoother the resulting animation. But sometimes you don't want a smooth and soothing animation. And that brings us back, of course, to the Halloween cat this post opened with.
The network identifies catness and can reproduce catness, but not smoothly; there is no smooth way to render a cat. There are eyes and whiskers and fur and all kinds of bits and bobs that add up to something, well, scary!
How it technically works
The code used to generate the animations in this post can be found on GitHub as gramzoom.
The gramzoom implementation is based on a somewhat older style transfer implementation in Keras. A traditional style transfer implementation has three loss components: local cohesiveness (total variation), content loss, and style loss. The first makes sure the image doesn't become too blurry, the second that the image matches the content of the target, and the third that the style matches.
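The local cohesiveness term is the simplest of the three; a numpy sketch of a total variation loss:

```python
import numpy as np

def total_variation(img):
    # Sum of squared differences between neighbouring pixels: zero for a
    # flat image, large for a noisy one, so minimizing it keeps the
    # generated image locally coherent.
    dy = np.diff(img, axis=0) ** 2
    dx = np.diff(img, axis=1) ** 2
    return float(dy.sum() + dx.sum())

flat = np.ones((8, 8))
noisy = np.random.default_rng(4).standard_normal((8, 8))
print(total_variation(flat), total_variation(flat) < total_variation(noisy))
```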
In gramzoom we use only the style loss and start from a random image. With each iteration the random image looks a bit more like the target, but only at the style level: textures, colors, and small elements like eyes and bits of fur, but nothing big, let alone the entire picture.
The other thing we do is zoom in on the picture after a certain number of steps. The zooming tends to create line artifacts, mostly horizontal and vertical, sometimes diagonal; the code contains some trickery to suppress them, but that only seems to work partially.
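That zoom step amounts to a centre crop followed by a resize back to the original size. A hypothetical sketch, not gramzoom's actual code:

```python
import numpy as np

def zoom(img, factor=1.05):
    # Centre-crop to 1/factor of the size, then resize back up
    h, w = img.shape
    ch, cw = int(h / factor), int(w / factor)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    crop = img[y0:y0 + ch, x0:x0 + cw]
    # Nearest-neighbour resize via index mapping; a real implementation
    # would interpolate, and that resampling is one source of artifacts
    ys = np.linspace(0, ch - 1, h).astype(int)
    xs = np.linspace(0, cw - 1, w).astype(int)
    return crop[np.ix_(ys, xs)]

img = np.random.default_rng(3).standard_normal((64, 64))
frames = []
for _ in range(10):
    # in gramzoom, a few optimization steps would run between zooms
    img = zoom(img)
    frames.append(img.copy())

print(len(frames), frames[-1].shape)  # 10 (64, 64)
```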
The command line version then saves the result as a movie, an animated GIF, or an image; when saving as an image, only the last frame is stored.
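The saving step is mostly bookkeeping: normalize the float frames to bytes and hand them to an encoder. A sketch of the normalization; the imageio calls in the comments are illustrative, not gramzoom's actual code:

```python
import numpy as np

def to_uint8(frame):
    # Scale a float frame into the 0..255 range expected by image encoders
    lo, hi = frame.min(), frame.max()
    return ((frame - lo) / (hi - lo + 1e-8) * 255).astype(np.uint8)

frames = [np.random.default_rng(i).standard_normal((32, 32)) for i in range(5)]
encoded = [to_uint8(f) for f in frames]

# With a library like imageio installed, writing out could then look like:
#   imageio.mimsave("out.gif", encoded)      # animated GIF: all frames
#   imageio.imwrite("out.png", encoded[-1])  # image: the last frame only
print(encoded[-1].dtype, len(encoded))
```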