Image-to-Text Synthesizer: Comparing Diffusion, CNNs, and GANs

Image-to-text conversion has become increasingly popular, from answering questions about images to extracting specific data, and it can be transformative in fields like medicine and education. This project delves into three fundamental deep learning approaches to image-to-text synthesis: Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), and Diffusion Models (integrated with Vision Transformers and GPT-2).

Dataset Note

For this project, we used the well-known CIFAR-10 dataset, which comprises 60,000 32x32 color images across 10 classes (airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks). Although the low resolution limits fine-grained detail extraction, CIFAR-10 was chosen for its simplicity, fast training, and its suitability for testing model robustness under limited hardware capabilities.
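To give a concrete picture of the setup, here is a minimal sketch of loading CIFAR-10, assuming PyTorch and torchvision; the normalization values and batch size are illustrative, not the project's exact settings.

```python
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

# Standard CIFAR-10 loading: 50,000 training and 10,000 test images, 10 classes, 32x32 RGB.
transform = T.Compose([T.ToTensor(), T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, batch_size=64, shuffle=False)
print(train_set.classes)  # ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
```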


CNNs: The Backbone of Image Classification

Convolutional Neural Networks are among the earliest and most widely used deep learning approaches for image recognition.

How CNNs Work

A CNN applies stacks of convolutional filters that detect local patterns (edges, textures, shapes) and pooling layers that downsample the resulting feature maps. After passing through several convolution and pooling layers, the feature maps are flattened and fed into fully connected layers that perform the final classification or feature extraction, as sketched below.
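As an illustration of this layout, here is a minimal CNN for 32x32 CIFAR-10 inputs, assuming PyTorch; the layer sizes are illustrative and not necessarily the exact architecture used in this project.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Minimal CNN for 32x32 CIFAR-10 images: conv/pool blocks followed by fully connected layers."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # 32x32 -> 32x32
            nn.ReLU(),
            nn.MaxPool2d(2),                               # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1),   # 16x16 -> 16x16
            nn.ReLU(),
            nn.MaxPool2d(2),                               # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                  # flatten the feature maps
            nn.Linear(64 * 8 * 8, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),                   # final classification layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Example forward pass on a dummy CIFAR-10-sized batch
logits = SmallCNN()(torch.randn(4, 3, 32, 32))
print(logits.shape)  # torch.Size([4, 10])
```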

Figures: Confusion Matrix, Performance Per Label, Prediction Output

Even with limited training (around 10 epochs on CIFAR-10), the confusion matrix showed solid classification performance, approximately 65% accuracy averaged over all classes. Certain classes (e.g., airplanes and cars) were recognized accurately, while others (like birds, cats, and deer) lagged behind, as shown in the per-label performance bar chart.
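For reference, the confusion matrix and per-label accuracies can be computed along these lines, assuming PyTorch and scikit-learn; `model` and `loader` stand in for the trained classifier and the CIFAR-10 test loader.

```python
import torch
from sklearn.metrics import confusion_matrix

@torch.no_grad()
def evaluate(model, loader, device="cpu"):
    """Collect predictions over the test set, then build a confusion matrix and per-class accuracy."""
    model.eval()
    y_true, y_pred = [], []
    for images, labels in loader:
        logits = model(images.to(device))
        y_pred.extend(logits.argmax(dim=1).cpu().tolist())
        y_true.extend(labels.tolist())
    cm = confusion_matrix(y_true, y_pred, labels=list(range(10)))
    per_class_acc = cm.diagonal() / cm.sum(axis=1)   # accuracy (recall) per CIFAR-10 class
    return cm, per_class_acc
```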

GANs: Generative Adversarial Networks

Generative Adversarial Networks (GANs) are a powerful deep learning framework for creating realistic images by utilizing two competing neural networks: a generator and a discriminator. The generator starts by producing images from random noise, while the discriminator assesses whether an image is real (from actual data) or fake (created by the generator). As training progresses, the generator refines its ability to mimic real images, while the discriminator continuously improves at distinguishing real from fake.

How GANs Work

GANs operate with two distinct but interlinked neural networks: a generator that maps random noise to synthetic images, and a discriminator that scores whether an image is real or generated.

Training is performed as a minimax game: the discriminator improves its ability to differentiate real from fake images, while the generator simultaneously learns to produce images that can fool the discriminator.
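To make the minimax game concrete, here is a simplified, DCGAN-style training step in PyTorch; the network sizes, learning rate, and loss formulation are illustrative assumptions rather than the exact configuration used in this project.

```python
import torch
import torch.nn as nn

# Illustrative generator and discriminator for 32x32 CIFAR-10 images (simplified DCGAN style).
G = nn.Sequential(
    nn.ConvTranspose2d(100, 128, 4, 1, 0), nn.ReLU(),   # 1x1 noise -> 4x4
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),    # 4x4 -> 8x8
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),     # 8x8 -> 16x16
    nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),      # 16x16 -> 32x32 image
)
D = nn.Sequential(
    nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),       # 32x32 -> 16x16
    nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),     # 16x16 -> 8x8
    nn.Flatten(), nn.Linear(128 * 8 * 8, 1),            # real/fake score
)

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))

def gan_step(real_images: torch.Tensor):
    """One minimax update: train D to separate real from fake, then train G to fool D."""
    b = real_images.size(0)
    fake_images = G(torch.randn(b, 100, 1, 1))

    # Discriminator step: push real images toward 1 and generated images toward 0.
    d_loss = bce(D(real_images), torch.ones(b, 1)) + bce(D(fake_images.detach()), torch.zeros(b, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: make the discriminator label generated images as real.
    g_loss = bce(D(fake_images), torch.ones(b, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

Keeping the two losses roughly balanced across updates is what makes this loop so sensitive to hyperparameters and training time.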

Figures: Output Performance 1 to 4

GANs are commonly used to generate realistic images from noisy data, though training them can be challenging due to issues such as the large number of training samples required and the need to balance the generator and discriminator losses. Training the GAN proved particularly challenging in this project, as shown in the images above. While GANs are generally expected to surpass CNNs in performance, in this particular instance the GAN underperformed due to the limited training size and hardware capacity. Looking at the confusion matrix, trained on three sets of image labels as shown, the image-labeling results appear close to random, which indicates that the GAN requires additional training for more effective results.

GANs are also designed to work with continuous data (like images) where gradients can be smoothly propagated. Text generation, however, involves discrete tokens (words) and sequences, which makes it difficult to propagate gradients through the generator and discriminator.

Diffusion Models

Diffusion models, often famed for creating visually stunning AI-generated art, have now expanded their utility into text extraction and recognition. By gradually adding noise to an image and learning to reverse this process, they effectively reconstruct missing or blurry details—enabling richer textual synthesis.
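As a rough sketch of the forward (noising) half of this process, a DDPM-style corruption of a clean image at timestep t can be written as follows; the linear beta schedule and number of timesteps are illustrative assumptions, and the reverse model is trained to predict the added noise.

```python
import torch

# Forward (noising) process: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule (illustrative)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of (1 - beta)

def add_noise(x0: torch.Tensor, t: int):
    """Corrupt a clean image x0 to timestep t; the reverse (denoising) model learns to undo this."""
    noise = torch.randn_like(x0)
    xt = alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * noise
    return xt, noise

# Example: noise a dummy 32x32 CIFAR-10-sized image at a middle timestep
xt, eps = add_noise(torch.randn(3, 32, 32), t=500)
```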

Vision Transformers (ViTs) and GPT-2 Integration

In our application, we combined a Vision Transformer (ViT) with GPT-2 to create an advanced image-to-text synthesizer: the ViT encoder splits each image into patches and maps them to a sequence of embeddings, and GPT-2 acts as a language decoder that generates a free-text description conditioned on those embeddings.
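A minimal sketch of such a pipeline uses Hugging Face's VisionEncoderDecoderModel with the public `nlpconnect/vit-gpt2-image-captioning` checkpoint; that specific checkpoint is an assumption for illustration and may differ from the exact weights trained in this project.

```python
import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

# ViT encodes the image into patch embeddings; GPT-2 decodes them into a caption.
# The checkpoint below is a public ViT+GPT-2 captioner, used here only as an example.
ckpt = "nlpconnect/vit-gpt2-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(ckpt)
processor = ViTImageProcessor.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

def describe(image_path: str) -> str:
    """Generate a free-text description for a single image."""
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(describe("example_image.png"))  # prints a free-form caption for the image
```

Because the decoder is a language model, the output is a free-form sentence rather than one of the ten CIFAR-10 class labels, which is what makes the descriptions context-rich.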

Figure: Output Performance

This diffusion-based approach not only provided superior accuracy on CIFAR-10, but also generated comprehensive descriptions that enriched the understanding of the scene—a significant advantage in fields like medical diagnostics and educational content creation.

Final Thoughts

Convolutional Neural Networks (CNNs) are highly effective for image classification, delivering good results with efficient training. They excel at extracting local features and classifying images, even at the low resolution used in this project.

Generative Adversarial Networks (GANs), on the other hand, offer a creative approach to image reasoning by generating realistic images. They are well suited to handling noisy inputs, but they demand careful tuning and extensive training and are not well suited to image-to-text synthesis. One of the biggest challenges in training GANs is balancing the generator and discriminator to minimize the overall loss, which requires extensive training and careful parameter tuning. To classify images into predefined labels, I employed a simple classifier to generate outputs conditioned on specific image labels; however, without sufficient training the model could not perform well.

The Diffusion Model pipeline (ViT + GPT-2) delivered the best context-rich textual descriptions, which are even more descriptive than the dataset's labels. A key recommendation is to integrate traditional CNN or GAN models with large language models (LLMs) like GPT. This hybrid approach may combine the efficiency of CNNs and GANs with the contextual richness of LLM-based image reasoning and understanding.