Text-to-Image Generation
300 papers with code • 11 benchmarks • 19 datasets
Text-to-Image Generation is a task spanning computer vision and natural language processing in which the goal is to generate an image that corresponds to a given textual description. This typically involves encoding the input text into a meaningful representation, such as an embedding vector, and then conditioning an image generator on that representation so the output matches the description.
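The two-stage structure described above (text → embedding → conditioned image synthesis) can be illustrated with a deliberately minimal toy sketch. Nothing here is a real model: `embed_text` is a hypothetical hash-based token embedding and `generate_image` is just a fixed random projection into pixel space, standing in for a learned generator such as a GAN or diffusion decoder.

```python
import numpy as np

def embed_text(text, dim=64):
    """Toy text encoder: hash each token to seed a random vector, then average.
    A real system would use a learned encoder (e.g. a transformer or CLIP)."""
    vec = np.zeros(dim)
    tokens = text.lower().split()
    for token in tokens:
        rng = np.random.default_rng(abs(hash(token)) % (2**32))
        vec += rng.standard_normal(dim)
    return vec / max(len(tokens), 1)

def generate_image(embedding, height=32, width=32):
    """Toy conditional generator: a fixed random linear map from the text
    embedding to pixel space, normalized to [0, 1]. A real generator would be
    a trained network conditioned on the embedding."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((height * width * 3, embedding.size))
    img = W @ embedding
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)
    return img.reshape(height, width, 3)

img = generate_image(embed_text("a red bird on a branch"))
print(img.shape)  # (32, 32, 3)
```

The point of the sketch is only the data flow: the description is reduced to a fixed-size vector, and the image is produced as a function of that vector, which is the shared skeleton behind the autoregressive, GAN-based, and diffusion-based approaches listed below.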
Most implemented papers
LAFITE: Towards Language-Free Training for Text-to-Image Generation
One of the major challenges in training text-to-image generation models is the need for a large number of high-quality image-text pairs.
Vector Quantized Diffusion Model for Text-to-Image Synthesis
Our experiments indicate that the VQ-Diffusion model with the reparameterization is fifteen times faster than traditional AR methods while achieving a better image quality.
Exploration into Translation-Equivariant Image Quantization
This is an exploratory study showing that current image quantization (vector quantization) methods do not satisfy translation equivariance in the quantized space due to aliasing.
ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation
To explore the landscape of large-scale pre-training for bidirectional text-image generation, we train a 10-billion-parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs. The model achieves state-of-the-art performance on both text-to-image and image-to-text tasks, obtaining an FID of 7.9 on MS-COCO for text-to-image synthesis and the best results on COCO-CN and AIC-ICC for image captioning.
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models
In this work, we investigate the visual reasoning capabilities and social biases of different text-to-image models, covering both multimodal transformer language models and diffusion models.
CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP
Once trained, the transformer can generate coherent image tokens conditioned on the text embedding extracted from CLIP's text encoder for a given input text.
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge.
Diffusion Models: A Comprehensive Survey of Methods and Applications
This survey aims to provide a contextualized, in-depth look at the state of diffusion models, identifying the key areas of focus and pointing to potential areas for further exploration.
Character-Centric Story Visualization via Visual Planning and Token Alignment
This task requires machines to 1) understand long text inputs and 2) produce a globally consistent image sequence that illustrates the contents of the story.
ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts
Recent progress in diffusion models has revolutionized the popular technology of text-to-image generation.