Wednesday, December 8, 2021

My PhD Thesis Title...

Yesterday, I posted a piece of AI-generated art on Twitter and LinkedIn. The image was generated from my PhD thesis title (which is otherwise irrelevant to this post). Today, I will share the story behind the "AI" software that produced that stunning image.

If you are on Twitter, you have likely seen a deluge of such AI-generated images all over your timeline lately. These pictures are being made with a new app called Dream (wombo.art), which lets anyone create an AI-generated artistic image by simply typing a brief description of what they would like the image to depict. A quick search on Twitter will surface many examples of what people have already generated with the app. Many academics on Twitter have been doing what I eventually did too: they fed in their PhD thesis titles to generate their own art and shared the results. It has become something of a craze, and I fell for it too.

Software that generates such images is not entirely new, though; DALL·E and VQGAN+CLIP came before it. The Dream app takes things further with its speed, quality, ease of use, and probably some tweaks to the algorithm itself. It is available as a mobile app on Android and iOS, and also on the web. The app is developed by a Canadian startup, Wombo.

The algorithm behind wombo.art could well be VQGAN+CLIP, which stands for the rather verbose "Vector Quantized Generative Adversarial Network" plus "Contrastive Language–Image Pre-training". If I were to explain this to a layperson, or someone not in the field, it is simply a piece of software that takes words as input and generates pictures based on the datasets it was trained on.

VQGAN+CLIP, as the "+" indicates, is a combination of two deep learning models, both released earlier this year. VQGAN is a type of generative adversarial network (GAN) to which you can pass a vector, or code, and it outputs an image!

VQGAN has a continuous, traversable latent space, which means that vectors with similar values will generate similar images, and following a smooth path from one vector to another will produce a smooth interpolation from one image to another.
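What "traversable latent space" means can be shown with a toy sketch. A real VQGAN latent is a grid of learned codebook entries, not a flat random vector, but the idea of walking a smooth path between two latents is the same:

```python
import numpy as np

# Toy stand-ins for two latent vectors (a real VQGAN latent is a
# grid of codebook entries; plain vectors suffice to show the idea).
rng = np.random.default_rng(0)
z_a = rng.normal(size=256)
z_b = rng.normal(size=256)

def interpolate(z1, z2, steps=10):
    """Walk a straight line through latent space from z1 to z2."""
    return [z1 + t * (z2 - z1) for t in np.linspace(0.0, 1.0, steps)]

path = interpolate(z_a, z_b)
# Decoding each point on this path with the VQGAN decoder would yield
# a smooth morph from image A to image B.
```

Because the space is continuous, each intermediate vector decodes to a plausible image, not noise, which is what makes the optimization described below possible.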

CLIP is a model released by OpenAI that measures the similarity between a piece of input text and an image.
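Under the hood, CLIP embeds both the text and the image into a shared vector space and scores the pair by cosine similarity. Here is a minimal sketch of that scoring step, with small placeholder vectors standing in for CLIP's actual encoder outputs (which are assumptions here; real CLIP embeddings are 512-dimensional or larger):

```python
import numpy as np

def cosine_similarity(a, b):
    """Score a text/image pair by the cosine of the angle between
    their embeddings in a shared vector space, as CLIP does."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings standing in for CLIP's text and image encoders.
text_emb = np.array([0.2, 0.9, 0.1])
image_emb = np.array([0.1, 0.8, 0.3])

score = cosine_similarity(text_emb, image_emb)  # closer to 1.0 = more similar
```

The key design point is that a single scalar score connects two very different modalities, and that score is differentiable, so it can drive an image generator.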

So in VQGAN+CLIP, we start with an initial image generated by VQGAN from a random vector, along with the input text provided by the user (e.g., my PhD thesis title!). CLIP then provides a similarity score between the input text and the generated image. Through optimization (typically gradient ascent on that score), the algorithm iteratively adjusts the image to maximize the CLIP similarity.

In effect, CLIP guides the initial image toward a nuanced version of itself that is as "close" to the input text as possible.
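The loop above can be sketched with a toy gradient-ascent example. Here the "image" is just a vector and the "text" is another vector, both assumptions standing in for the real pipeline, which backpropagates through the VQGAN decoder and CLIP's image encoder; but the shape of the optimization is the same:

```python
import numpy as np

rng = np.random.default_rng(1)
text_emb = rng.normal(size=16)   # stands in for CLIP's embedding of the prompt
z = rng.normal(size=16)          # stands in for the VQGAN latent vector

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

start_score = cos_sim(z, text_emb)
for _ in range(300):
    zn = np.linalg.norm(z)
    tn = np.linalg.norm(text_emb)
    cos = np.dot(z, text_emb) / (zn * tn)
    # Analytic gradient of cosine similarity with respect to z; a real
    # VQGAN+CLIP implementation obtains this via automatic differentiation.
    grad = text_emb / (zn * tn) - cos * z / zn**2
    z += 0.5 * grad              # gradient ascent step on the score

final_score = cos_sim(z, text_emb)
```

After a few hundred steps the latent has rotated to align with the "prompt", just as the generated image is nudged, step by step, toward whatever CLIP judges closest to the user's text.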

Of course, Wombo has not said that they are using the VQGAN+CLIP algorithm specifically. They have clearly added a few bells and whistles, but the basic concept likely remains the same.

So, try inputting any text (your PhD thesis title, a paper title, your dream destination) and let wombo.art generate some aesthetic art for you!
