Researchers at the Shanghai AI Laboratory have developed a specialized text-to-image model for photorealistic generation of human images. Thanks to a massive dataset and a new training method, CosmicMan achieves impressive results.

Scientists at the Shanghai AI Laboratory present CosmicMan, a novel text-to-image foundation model that specializes in generating high-quality images of people.

Unlike current image foundation models, which often struggle to generate detailed human images that match the text description, CosmicMan produces photorealistic results with precise text-image alignment. Users can even specify small details in their prompt, such as a different color for a hat.

Data production as a feedback loop between humans and AI

According to the researchers, led by Shikai Li and Jianglin Fu, CosmicMan's success rests on two pillars: a huge, high-quality dataset and a novel framework for training the AI model.

For CosmicMan, the scientists developed a new approach to producing training data that they call "Annotate Anyone." It works as a feedback loop between humans and the AI and aims to provide high-quality, continuously updated data at low cost. In this approach, the AI first generates detailed labels, which humans then review and refine.

The researchers built a massive dataset of human images cooperatively labeled by AI and humans. | Image: Li et al.
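
The article describes Annotate Anyone only at a high level. The toy sketch below illustrates the general pattern of a human-in-the-loop labeling step: a model drafts attribute labels, a human corrects them, and the corrections are collected as feedback that could later be used to improve the labeler. It is not the researchers' pipeline; the function names, the attribute format, and the stub labeler and reviewer are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical label format: attribute name -> value, e.g. {"hat_color": "red"}
Attributes = dict[str, str]


@dataclass
class ReviewedSample:
    image_id: str
    attributes: Attributes  # final, human-approved labels
    corrections: int        # number of AI labels the human changed or added


def annotate_with_review(
    image_ids: list[str],
    auto_label: Callable[[str], Attributes],
    human_review: Callable[[str, Attributes], Attributes],
) -> tuple[list[ReviewedSample], list[tuple[str, str, str, str]]]:
    """Toy human-in-the-loop labeling: the AI drafts, a human reviews and corrects."""
    samples: list[ReviewedSample] = []
    # (image_id, attribute, ai_value, human_value) records that could be fed
    # back to improve the automatic labeler in a later round
    feedback: list[tuple[str, str, str, str]] = []
    for image_id in image_ids:
        draft = auto_label(image_id)                 # AI proposes labels
        final = human_review(image_id, dict(draft))  # human edits a copy of the draft
        changed = [k for k in final if draft.get(k) != final[k]]
        feedback.extend((image_id, k, draft.get(k, ""), final[k]) for k in changed)
        samples.append(ReviewedSample(image_id, final, len(changed)))
    return samples, feedback


# Usage with stub functions standing in for the model and the human reviewer
def fake_labeler(image_id: str) -> Attributes:
    return {"hat_color": "red", "top": "jacket"}


def fake_reviewer(image_id: str, draft: Attributes) -> Attributes:
    draft["hat_color"] = "blue"  # the reviewer fixes one wrong label
    return draft


samples, feedback = annotate_with_review(["img_001"], fake_labeler, fake_reviewer)
```

Passing a copy of the draft to the reviewer keeps the original AI output intact, so the loop can record exactly which labels the human changed.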

Using this method, the team created the "CosmicMan-HQ 1.0" dataset, which contains six million images of humans at an average resolution of 1488 x 1255 pixels. The images are annotated with precise text descriptions derived from 115 million attributes of varying levels of detail.

Focusing on the human

The second element is the "Decomposed-Attention-Refocusing" framework (Daring), which, simply put, sorts the words in a prompt into groups corresponding to parts of the human body, such as "head," "arms," "legs," and so on.

This allows the AI model to focus on drawing each part of the person individually, rather than trying to draw everything at once, which leads to more accurate results that are easier to customize.

The human-specialized AI model can focus on individual body parts. This offers higher accuracy and flexibility. | Image: Li et al.
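
To make the idea of sorting prompt words by body part more concrete, here is a toy Python sketch. It is only an illustration of the grouping step described in the article, not the Daring implementation (which, per its name, works on the model's attention rather than on a keyword table). The body-part vocabulary, the color list, and the function name `group_prompt_by_body_part` are hypothetical.

```python
from collections import defaultdict

# Hypothetical vocabulary mapping clothing/appearance words to body-part groups
BODY_PART_KEYWORDS = {
    "head": {"hat", "cap", "hair", "glasses", "face"},
    "upper_body": {"shirt", "jacket", "t-shirt", "sweater"},
    "arms": {"gloves", "bracelet", "watch", "sleeves"},
    "legs": {"jeans", "trousers", "skirt", "shorts"},
    "feet": {"shoes", "boots", "sneakers"},
}

# Hypothetical list of modifiers to keep attached to the item they describe
COLORS = {"red", "blue", "green", "black", "white", "yellow"}


def group_prompt_by_body_part(prompt: str) -> dict[str, list[str]]:
    """Assign each recognized prompt word (plus a preceding color) to a body-part group."""
    groups: dict[str, list[str]] = defaultdict(list)
    words = [w.strip(".,;:!?").lower() for w in prompt.split()]
    for i, word in enumerate(words):
        for part, vocab in BODY_PART_KEYWORDS.items():
            if word in vocab:
                # Keep a color such as "red" together with the item it modifies
                if i > 0 and words[i - 1] in COLORS:
                    groups[part].append(f"{words[i - 1]} {word}")
                else:
                    groups[part].append(word)
    return dict(groups)


groups = group_prompt_by_body_part("A woman with a red hat, blue jeans and white sneakers")
```

Here `groups` maps "head" to "red hat", "legs" to "blue jeans", and "feet" to "white sneakers". Grouping the prompt this way is what would let a model treat each body region separately, for example changing only the hat color without affecting the rest of the image.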

In various experiments, CosmicMan shows promising results, outperforming current state-of-the-art models in both quantitative metrics and perceived visual quality, the researchers say. They see great potential for using CosmicMan in various applications, such as the entertainment industry, e-commerce, or avatar creation for virtual worlds.

The CosmicMan-HQ 1.0 dataset will be released soon. The team plans to keep improving the model and is already working on the next version.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • Researchers at the Shanghai AI Laboratory have developed CosmicMan, a specialized text-to-image model that generates photorealistic images of humans that match the text description down to small details.
  • CosmicMan's success rests on a huge, high-quality dataset called "CosmicMan-HQ 1.0" with six million images and on a novel training framework that sorts prompt words by body part so the model can focus on individual body parts.
  • In experiments, CosmicMan outperforms current models in both quantitative metrics and perceived visual quality. The developers see potential for use in a variety of applications, including entertainment, e-commerce, and the creation of avatars for virtual worlds.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.