[Everyone’s AI] AI drawing, DALL-E

In this article, I will introduce DALL-E, an AI model published by OpenAI that can generate images from text. Let’s take a look at what data DALL-E was trained on and what process it went through. Also, just to clarify, OpenAI’s DALL-E has not been publicly released, so I’ll try out DALL-E through DALLE-pytorch, the PyTorch version released by Phil Wang.

If you want to check out the project right away, please refer to the following link!

Demo : https://main-dalle-client-scy6500.endpoint.ainize.ai/

API : scy6500/DALLE-server

Github : GitHub - lucidrains/DALLE-pytorch: Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

Everyone must have experienced drawing a picture or a portrait in school at least once. Drawing a picture, or creating a portrait from a photo or a model, has until now been something only humans could do. However, this changed entirely when OpenAI unveiled a model called DALL-E.

DALL-E is an AI model that combines natural language processing and computer vision to generate images from text, and it is said to be able to create an image for any given text. The name DALL-E was inspired by the surrealist painter Salvador Dalí and WALL-E, the robot from the animated film of the same name.

Salvador Dali and WALL-E

Initially, there were various approaches to generating images from text, including stacking Generative Adversarial Network (GAN) models. However, these approaches still produced unnatural results, such as object distortion and illogical object placement. More recently, generative models based on autoregressive transformers like GPT-3 have been trained successfully, and Text-to-Image generation models were studied using this method as well. DALL-E is one such model.


  • Conceptual Captions dataset

The DALL-E model can be broadly divided into two stages. But before we get into the model, let’s take a look at the dataset used for training. The training data consists of 250,000,000 image-text pairs, reportedly collected from three datasets. The first is the Conceptual Captions dataset released by Google, which consists of approximately 3,000,000 image-text pairs.

Conceptual Captions dataset

  • YFCC100M

The second dataset is YFCC100M, a dataset of 99,200,000 images and 800,000 videos.

YFCC100M dataset

  • Wikipedia images and caption

Finally, images from Wikipedia and their captions are said to have been used as a dataset.

Not all of this data was used. Only data that met certain conditions passed through several filters and was used for training. To filter the data, the caption had to be in English and not too short, and the aspect ratio of the image had to be between 1/2 and 2.
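The filtering conditions above can be sketched as a simple predicate. Note that the exact thresholds OpenAI used are not public; the minimum caption length and the English check below are illustrative assumptions.

```python
MIN_CAPTION_WORDS = 3  # assumption: stand-in for "not too short"

def is_english(caption: str) -> bool:
    # Crude stand-in for a real language detector (assumption).
    return caption.isascii()

def keep_pair(caption: str, width: int, height: int) -> bool:
    """Return True if an image-text pair passes the filters described above."""
    aspect_ratio = width / height
    return (
        is_english(caption)
        and len(caption.split()) >= MIN_CAPTION_WORDS
        and 1 / 2 <= aspect_ratio <= 2
    )

print(keep_pair("a dog playing in the park", 640, 480))  # True
print(keep_pair("dog", 640, 480))                        # caption too short -> False
print(keep_pair("a very tall banner image", 100, 400))   # ratio 0.25 -> False
```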

Stage 1

Now let’s talk about the model. Can we train the model by feeding in images pixel by pixel? No! That would require flattening the image into one dimension. A 256 * 256 image flattened to one dimension has a length of 65,536 (256 * 256), and considering the 3 RGB channels, the length becomes about 200,000 (65,536 * 3 = 196,608).
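The arithmetic above can be checked directly:

```python
# Sequence length of a 256x256 RGB image flattened pixel by pixel.
height, width, channels = 256, 256, 3

per_channel = height * width          # positions in one channel
full_length = per_channel * channels  # all RGB values

print(per_channel)   # 65536
print(full_length)   # 196608 (~200,000)
```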

If you put this into the model as it is, it will probably not train properly, and it may not even fit into memory. DALL-E uses a VQ-VAE (Vector Quantized Variational AutoEncoder) to solve this problem.

The VQ-VAE feeds the input image into a CNN encoder and extracts a 32 * 32 feature map. Each position of the feature map is mapped to its nearest vector in a learned embedding space (codebook), and the embedding space is updated during training. The decoder then reconstructs the image from these embedding vectors. As a result, a 256 * 256 image is converted into 32 * 32 image tokens, each represented by a vector in the embedding space.
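The quantization step can be sketched in a few lines of NumPy: each spatial position of the 32 * 32 feature map is replaced by the index of its nearest codebook vector. The 32 * 32 grid and the 8192-entry vocabulary come from the text; the feature dimension and random values below are illustrative assumptions, and this omits the encoder, decoder, and training losses of a real VQ-VAE.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 64                                   # assumed feature dimension
codebook = rng.normal(size=(8192, embed_dim))    # embedding space (codebook)
features = rng.normal(size=(32, 32, embed_dim))  # stand-in for CNN encoder output

# For every position, pick the closest codebook vector by squared L2 distance,
# using ||a - b||^2 = ||a||^2 - 2ab + ||b||^2 to avoid a huge broadcast.
flat = features.reshape(-1, embed_dim)           # (1024, 64)
dists = (
    (flat ** 2).sum(1, keepdims=True)
    - 2 * flat @ codebook.T
    + (codebook ** 2).sum(1)
)                                                # (1024, 8192)
tokens = dists.argmin(axis=1).reshape(32, 32)

# The 256x256 image is now a 32x32 grid of integer tokens in [0, 8192).
print(tokens.shape)
```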

VQ-VAE structure

This process reduces the spatial resolution by a factor of 8, which may distort or lose some information, such as object borders, textures, and thin lines. To minimize this loss, a large vocabulary size of 8192 is used. As you can see in the photo below, the overall image remains recognizable. (Top: original image, bottom: VQ-VAE result)

Original image and VQ-VAE result

Stage 2

After Stage 1, the image caption is encoded into up to 256 tokens through BPE encoding, concatenated with the image tokens created earlier, and fed into a Transformer decoder for training.

Concatenated Text Tokens and Image Tokens


As mentioned earlier, the DALL-E released by OpenAI is not publicly available at this time, so we will use DALL-E through DALLE-pytorch, the PyTorch version released by Phil Wang.

  • Using the DALL-E Pytorch Demo

Let’s try DALL-E Pytorch using the demo provided by Ainize.

First, enter a description of the image you want to create in the input box, adjust the number of images to generate with the slider, and click Generate to get the result. (It takes about 20 seconds to create one image.) The demo is available at the link above.

  • Using the DALL-E Pytorch API

This time, let’s use DALL-E Pytorch through the API provided by Ainize. Information about the API can be found at the API link above.
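A request to such a hosted API might be built as below. The endpoint URL and field names are hypothetical assumptions, not the actual DALLE-server interface; check the API page linked above for the real parameters. Only the request construction is shown; the sending step is left commented out.

```python
import json

# Hypothetical endpoint (assumption) -- replace with the real one from the API page.
API_URL = "https://<your-api-endpoint>/generate"

def build_request(text: str, num_images: int) -> dict:
    """Build the JSON body for an image-generation request (field names assumed)."""
    return {"text": text, "num_images": num_images}

payload = build_request("an armchair in the shape of an avocado", 2)
body = json.dumps(payload)
print(body)

# Sending the request (not executed here) could look like:
# import urllib.request
# req = urllib.request.Request(API_URL, data=body.encode(),
#                              headers={"Content-Type": "application/json"})
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```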

By using the API provided by Ainize, I was able to learn about DALL-E. Due to DALL-E’s achievements, some people believe that AI will now be able to replace humans in areas that require creativity, such as art and design. I have to admit that this is a little scary: as time goes on, AI seems to be encroaching on realms once thought to be unique to humans, such as creativity.