Analysis-Paralysis

A comparative analysis of text-to-image models, setting the stage for working on text-to-scene generation.

Kaveesh Khattar

Dec 30, 2023

The Problem Statement

To identify the advantages and disadvantages of various text-to-image generation techniques and, by investigating their architectural designs, understand the underlying mechanisms that give them their image-synthesis capabilities.

Datasets

The datasets used in our comparative analysis encompass a diverse range of applications in computer vision and multimedia research.

YFCC100M, a massive collection of 100 million Flickr images and videos, supports advancements in visual perception, while MS-COCO, with its rich annotations for object detection and segmentation, has fueled cutting-edge research in visual comprehension.

The CUB dataset focuses on fine-grained bird species recognition, and the Oxford-102 Flowers dataset aids in the fine-grained classification of floral species. Together, these datasets enable significant progress in object detection, classification, and attribute prediction tasks.

KTH Action Recognition: Evaluates human action recognition, featuring six activities like walking, running, and boxing, with spatio-temporal analysis focus.

UCF Sports: Highlights sports activity recognition in videos, with diverse action classes like basketball and diving, aiding sports video analysis.

Architectures

Our work emphasized deeper experimentation with GAN-based approaches, given their impact on image synthesis.

Key architectures included Multi-Stage AttnGAN, which improves text-to-image synthesis using attention mechanisms; CycleGAN + BERT, blending style transfer with text embeddings for nuanced transformations; and DF-GAN, which simplifies GAN training for text-to-image tasks. We also examined MirrorGAN, which leverages captions for bidirectional text-image consistency; LSTM+GAN, which integrates sequential modeling for dynamic image generation; and DALL·E, OpenAI's model for generating high-quality images from text prompts.
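To make the attention idea behind AttnGAN concrete, here is a minimal sketch of word-level attention: each image region attends over the caption's word embeddings and receives a word-context vector. The function names, shapes, and the plain dot-product scoring are illustrative assumptions, not AttnGAN's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def word_attention(region_feats, word_feats):
    """For each image region, attend over word embeddings and
    return a per-region word-context vector.

    region_feats: (regions, d) image region features
    word_feats:   (words, d)   word embeddings for the caption
    """
    # (regions, words) similarity scores between regions and words
    scores = region_feats @ word_feats.T
    # normalize over words so each region's weights sum to 1
    weights = softmax(scores, axis=1)
    # (regions, d) word-context vectors fed to the next generator stage
    return weights @ word_feats
```

In AttnGAN proper, these word-context vectors are concatenated with the image features and passed to the next generator stage, so different regions can be refined by different words of the caption.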

Metrics

We explored several metrics used to evaluate AI-generated images. The Inception Score (IS) measures image quality and diversity by comparing class probabilities of generated images. The Fréchet Inception Distance (FID) quantifies the similarity between real and generated image distributions, with lower FID indicating better quality.
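Both metrics reduce to short formulas once classifier outputs and feature statistics are in hand: IS is exp of the average KL divergence between per-image class distributions and the marginal, and FID is the Fréchet distance between two Gaussians fitted to real and generated features. The sketch below, assuming precomputed probabilities and feature means/covariances, illustrates the arithmetic (in practice both are computed from an Inception network's outputs):

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)
    return (v * np.sqrt(w)) @ v.T

def inception_score(probs, eps=1e-12):
    """probs: (N, classes) classifier probabilities for N generated images.
    IS = exp( mean_i KL( p(y|x_i) || p(y) ) ); higher is better."""
    marginal = probs.mean(axis=0)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """FID between Gaussians fitted to real and generated feature sets.
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)); lower is better."""
    diff = mu1 - mu2
    s1_half = _sqrtm_psd(sigma1)
    # Tr((S1 S2)^(1/2)) computed via the symmetric form S1^(1/2) S2 S1^(1/2)
    cov_trace = np.trace(_sqrtm_psd(s1_half @ sigma2 @ s1_half))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * cov_trace)
```

Note the sanity checks implied by the formulas: uniform class probabilities give IS = 1 (no confident, diverse predictions), and identical real/generated statistics give FID = 0.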

For subjective evaluation, the Mean Opinion Score (MOS) involved human ratings of image fidelity on a numerical scale, with higher scores reflecting greater realism. These metrics collectively ensured a balanced assessment of quality, diversity, and user perception.
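The MOS computation itself is just an average of the collected ratings per model. A minimal sketch, with entirely illustrative model names and ratings:

```python
def mean_opinion_score(ratings):
    """MOS: average of human ratings for a set of generated images,
    e.g. on a 1-5 fidelity scale; higher means more realistic."""
    return sum(ratings) / len(ratings)

# hypothetical ratings collected per model on a 1-5 scale
ratings_by_model = {
    "AttnGAN": [4, 3, 4, 5],
    "DF-GAN": [3, 3, 4, 4],
}
mos_by_model = {m: mean_opinion_score(r) for m, r in ratings_by_model.items()}
```

In practice the interesting part is the study design (number of raters, rating instructions, image sampling), not the arithmetic.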

The Final Result

We submitted our paper to ACI 2023 for publication, and it was successfully published!