Generative artificial intelligence (AI) has impressed many with its ability to create realistic images from simple text prompts.
However, these AI models, such as Stable Diffusion, Midjourney, and DALL-E, have a well-known problem: they struggle to generate non-square images, often resulting in strange distortions, such as people with extra fingers or objects that look oddly stretched.
A new method developed by computer scientists at Rice University aims to solve these issues and improve the quality of AI-generated images, even at different aspect ratios.
The research, led by Moayed Haji Ali, a Ph.D. student at Rice University, introduces a technique called ElasticDiffusion.
This method was recently presented at the Institute of Electrical and Electronics Engineers (IEEE) 2024 Conference on Computer Vision and Pattern Recognition (CVPR) in Seattle.
It could prove especially useful for applications that need images in varied sizes and shapes, such as widescreen monitors or smartwatch displays.
Haji Ali explained that diffusion models, a type of AI used in image generation, work by adding random noise to images during training and then learning how to remove that noise to create new images.
However, these models are typically trained on square images, which creates problems when they are asked to generate images in other shapes, such as the 16:9 widescreen aspect ratio.
The result is often visual errors, such as repeated or distorted features.
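To make the noise-adding step concrete, here is a minimal Python sketch. It assumes the standard formulation used by many diffusion models, in which a clean image is blended with Gaussian noise according to a "signal kept" fraction (called `alpha_bar` here). The article does not specify the exact schedule, so the function names, the toy 2x2 image and the value 0.5 are illustrative assumptions only.

```python
import math
import random


def add_noise(image, alpha_bar, rng):
    """Blend an image with Gaussian noise.

    alpha_bar is the fraction of signal variance kept:
    1.0 returns the image unchanged, values near 0.0 give almost pure noise.
    """
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in image]
    noisy = [
        [signal_scale * px + noise_scale * n for px, n in zip(row, noise_row)]
        for row, noise_row in zip(image, noise)
    ]
    return noisy, noise


def remove_known_noise(noisy, noise, alpha_bar):
    """Invert add_noise when the noise is known exactly.

    A trained model only estimates the noise; here the true noise is reused
    to show what a perfect estimate would recover.
    """
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [
        [(y - noise_scale * n) / signal_scale for y, n in zip(row, noise_row)]
        for row, noise_row in zip(noisy, noise)
    ]


rng = random.Random(0)             # fixed seed so the run is repeatable
image = [[0.0, 0.5], [1.0, -1.0]]  # toy 2x2 "image" (hypothetical data)
noisy, noise = add_noise(image, alpha_bar=0.5, rng=rng)
restored = remove_known_noise(noisy, noise, alpha_bar=0.5)
```

In training, the model sees only the noisy image and learns to predict the noise. Generating a new image amounts to starting from pure noise and repeatedly applying that prediction.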
“Diffusion models like Stable Diffusion can create amazing results, but they are limited to square images,” Haji Ali said.
“When asked to generate images in different aspect ratios, the models struggle, leading to odd visual issues like people with extra fingers or distorted objects.”
One reason for this problem, according to Haji Ali and his advisor, Professor Vicente Ordóñez-Román, is the way these models are trained. A model trained only on square images becomes very good at generating images like those in its training data but struggles to adapt to other shapes and sizes, a problem known as overfitting. Training the model on a wider variety of images could help, but that requires massive amounts of computing power, far more than most researchers can afford.
ElasticDiffusion, the new method developed by Haji Ali, takes a different approach. Instead of training the model to handle different image shapes, ElasticDiffusion separates the image’s global and local signals, making it easier to generate images with non-square aspect ratios. The local signal contains details like the shape of a person’s eye or the texture of fur, while the global signal contains the overall structure of the image, such as the outline of a person or an animal.
Typically, diffusion models package both signals together, which causes problems when generating non-square images. ElasticDiffusion avoids this by separating the signals and applying them in a more structured way.
First, it handles the global information to understand what the image should look like overall, then it fills in the local details one section at a time. This prevents the model from repeating or distorting parts of the image, resulting in a cleaner, more consistent final product.
“This approach uses the model’s intermediate steps to ensure that the global structure of the image stays intact while allowing the local details to be added without errors,” Ordóñez-Román explained.
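The two-stage idea described above, global structure first and local detail one section at a time, can be illustrated with a toy Python sketch. This is not the ElasticDiffusion algorithm, which operates on a diffusion model's intermediate denoising steps. Here the "global signal" is just a small coarse grid stretched to the target shape, and `local_detail` is a hypothetical stand-in for what a model trained on fixed-size square patches might contribute.

```python
def upsample_nearest(coarse, height, width):
    """Stretch a coarse grid to (height, width) by nearest-neighbour lookup."""
    ch, cw = len(coarse), len(coarse[0])
    return [
        [coarse[y * ch // height][x * cw // width] for x in range(width)]
        for y in range(height)
    ]


def local_detail(y, x):
    """Hypothetical per-pixel detail produced inside one square patch."""
    return 0.1 if (y + x) % 2 == 0 else -0.1


def compose(coarse, height, width, patch):
    """Lay out the global structure first, then add detail patch by patch."""
    canvas = upsample_nearest(coarse, height, width)
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            for y in range(top, min(top + patch, height)):
                for x in range(left, min(left + patch, width)):
                    # Detail depends only on the position inside the patch,
                    # so the patch step never has to know the canvas shape.
                    canvas[y][x] += local_detail(y - top, x - left)
    return canvas


coarse = [[0.0, 1.0], [1.0, 0.0]]  # toy global layout (hypothetical data)
image = compose(coarse, height=4, width=8, patch=4)  # 4x8, non-square
```

The point of the split is that the patch-level step always works at the square size it was built for, while the overall layout is decided separately for whatever canvas shape is requested.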
While ElasticDiffusion produces better results than traditional diffusion models on non-square images, it has one drawback: it is slower. Currently, ElasticDiffusion takes six to nine times longer to create an image than other models such as Stable Diffusion. Haji Ali hopes to reduce that time so that it matches the speed of current AI models.
“Where I want this research to go is to figure out why diffusion models struggle with these repetitive issues and develop a framework that can generate images in any aspect ratio, without additional training and at the same speed as other models,” Haji Ali said.
ElasticDiffusion represents a promising step forward in improving AI-generated images and could help eliminate many of the common issues users experience when working with current models.
By solving the problem of non-square image generation, this method could open up new possibilities for AI in areas such as digital art, video production, and virtual reality, where image quality and consistency are critical.