Recently, Google introduced Parti, an autoregressive text-to-image generation model capable of high-fidelity, photorealistic output and of synthesizing images with complex compositions and rich world knowledge.
For example, text prompts such as “a raccoon in a formal suit, holding a cane and a garbage bag” and “a tiger wearing a train conductor’s hat and holding a skateboard” each produce images matching the description.
Beyond vivid detail, Parti is also versed in many styles, and can generate images in the manner of Van Gogh, abstract cubism, Egyptian tomb hieroglyphs, illustration, statuary, woodcuts, children’s crayon drawings, Chinese ink painting, and more.
On June 22, 2022, the accompanying research paper, “Scaling Autoregressive Models for Content-Rich Text-to-Image Generation,” was submitted to arXiv (editor’s note: an online repository of scientific preprints).
“Generating images with Parti is treated as a sequence-to-sequence modeling problem, similar to machine translation,” the researchers said in Google’s official blog post. “It can therefore benefit from advances in large language models, especially capabilities unlocked by scaling data and model size. In this case, the target output is a sequence of image tokens rather than text tokens in another language. Parti uses the image tokenizer ViT-VQGAN to encode images as sequences of discrete tokens, which can be reconstructed into high-quality, stylistically diverse images.”
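The two-stage pipeline the researchers describe can be sketched in miniature. The snippet below is an illustrative toy only: the vocabulary size, sequence length, and the stand-in “decoder” are all hypothetical placeholders, not Parti’s actual components (a real ViT-VQGAN codebook and image-token grid are orders of magnitude larger, and the logits would come from a trained transformer).

```python
import numpy as np

VOCAB_SIZE = 16  # hypothetical; real ViT-VQGAN codebooks are far larger
SEQ_LEN = 8      # hypothetical; real image-token grids are much longer

def toy_next_token_logits(text_tokens, image_prefix):
    """Stand-in for the transformer decoder: returns logits for the next
    image token, conditioned (only superficially here) on the text prompt
    and the image tokens generated so far."""
    seed = hash((tuple(text_tokens), tuple(image_prefix))) % (2**32)
    return np.random.default_rng(seed).normal(size=VOCAB_SIZE)

def generate_image_tokens(text_tokens, rng):
    """Autoregressively sample discrete image tokens one position at a
    time, as sequence-to-sequence text-to-image models do."""
    tokens = []
    for _ in range(SEQ_LEN):
        logits = toy_next_token_logits(text_tokens, tokens)
        probs = np.exp(logits - logits.max())   # softmax over the codebook
        probs /= probs.sum()
        tokens.append(int(rng.choice(VOCAB_SIZE, p=probs)))
    # A real system would hand these token ids to the ViT-VQGAN decoder
    # to reconstruct pixels; here they are just returned.
    return tokens

rng = np.random.default_rng(0)
tokens = generate_image_tokens([1, 2, 3], rng)
print(tokens)
```

The key property this sketch preserves is that each image token is sampled conditioned on the prompt and on all previously generated tokens, which is what distinguishes the autoregressive approach from diffusion models discussed below.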
It is worth noting that Imagen, another text-to-image generation model Google launched about a month earlier, also performed very well on research benchmarks. Parti and Imagen are an autoregressive model and a diffusion model, respectively: different but complementary approaches representing two directions of exploration at Google.
In addition, the researchers explore and highlight the limitations of the Parti model, giving key examples of areas for further improvement.
They also trained four versions of Parti, with 350 million, 750 million, 3 billion, and 20 billion parameters, and compared them in detail; models with more parameters showed substantial improvements in capability and output image quality. Comparing the 3-billion- and 20-billion-parameter versions, the latter proved notably better at abstract prompts.
Below is how the four models rendered the prompt “a green sign with the words ‘Very Deep Learning’ at the edge of the Grand Canyon, with white clouds floating in the sky.”
Handling long, complex prompts requires Parti to accurately reflect world knowledge, adhere to specific image formats and styles, and compose numerous actors and objects with fine-grained detail and interaction into high-quality images. Even so, the model has limitations that lead to some failure cases.
For example, given the prompt “A portrait of a statue of Anubis wearing a yellow T-shirt with a drawing of a space shuttle, with a white brick wall in the background,” the shuttle ends up on the wall rather than on the T-shirt, and the colors bleed somewhat.
It is also worth mentioning that the researchers introduced a new benchmark, PartiPrompts (P2 for short), which measures a model’s capabilities across a variety of categories and challenge dimensions.
The researchers note that generating images from text is fascinating, allowing us to create scenes that have never been seen or may not even exist. But alongside its many benefits come risks and potential impacts concerning bias and safety, visual communication, disinformation, and creativity and the arts.
Some of the potential risks relate to how the models themselves are developed, particularly the training data. Models like Parti are typically trained on noisy image-text datasets, which are known to contain biases against people of different backgrounds, leading such models to reproduce stereotypes. Further risks and concerns arise when these models are applied to visual communication, for example, helping people with low literacy produce images.
Text-to-image models open up many new possibilities, essentially acting as a paintbrush for creating unique and beautiful images, and can help increase human creativity and productivity. But a model’s output range depends on its training data, which can skew toward Western imagery and prevent the model from exhibiting entirely new artistic styles.
For these reasons, the researchers will not release the Parti model’s code or data for public use until further safeguards are in place, and they have added a “Parti” watermark to all generated images.
Next, the research team will focus on further research into measuring and mitigating model bias, through strategies such as prompt filtering, output filtering, and model recalibration.
They also see promise in using text-to-image generative models to understand bias in large image-text datasets at scale, by explicitly probing them for a set of known bias types and potentially surfacing other hidden forms of bias. In addition, the researchers plan to work with artists to adapt the power of high-performance text-to-image generation models to their practice.
Finally, compared with DALL·E 2, released by OpenAI some time earlier, and Google’s own Imagen (both diffusion models), the researchers note that Parti demonstrates that autoregressive models remain powerful and broadly applicable.