
Researchers from Nvidia and Tel Aviv University have developed ComfyGen, a new AI method that automates the creation of text-to-image workflows from simple prompts. The system independently selects suitable models, formulates prompts, and refines the generated images with additional tools such as upscalers to achieve the desired result.


Traditional text-to-image generation typically relies on a single model to convert text into images. However, experienced prompt engineers often use complex, multi-stage workflows combining various components, including base models, LoRAs, prompt extensions, and upscaling models.

The image compares two AI image-generation approaches: the "standard approach" with a single T2I model on top, and the "ComfyGen" approach below. Both generate images from the prompt "Imagine a breathtaking waterfall cascading into a crystal-clear pool. The pool is home to elegant swans..." The ComfyGen approach additionally involves a large language model, a workflow in JSON, and ComfyUI. The resulting images each show idyllic waterfall scenes with lush vegetation and clear water.
Comparison between the standard and ComfyGen approaches for generating a waterfall scene. Image: Gal et al.

Component choice depends on the prompt's content and the desired image style: photorealistic images call for different models than anime graphics, and corrections to faces and hands need dedicated components. ComfyGen uses a large language model such as Claude 3.5 Sonnet to automatically assemble a suitable workflow from a concise text prompt.
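
To make that concrete, here is a minimal sketch of what such an LLM call could look like. The instruction wording, the JSON-only convention, and the response handling are illustrative assumptions on our part; the paper does not publish its prompting code.

```python
# Hedged sketch: asking an LLM to compile a ComfyUI workflow for a prompt.
# The instruction text and JSON-only convention are assumptions, not the
# paper's actual implementation.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def compile_workflow(user_prompt: str) -> dict:
    """Ask the LLM for a ComfyUI workflow (JSON) suited to the prompt."""
    msg = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": (
                "Return only a ComfyUI workflow in JSON format that is "
                f"well suited to rendering this prompt: {user_prompt}"
            ),
        }],
    )
    return json.loads(msg.content[0].text)
```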

User interface of ComfyUI with its various settings and parameters. Image: Gal et al.

ComfyGen is based on the popular open-source tool ComfyUI, which allows users to define and share workflows in a structured JSON format. ComfyUI's widespread use in the Stable Diffusion community provides researchers with many human-created workflows for training data.
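
For readers unfamiliar with the format, the sketch below shows what a minimal workflow of this kind looks like in ComfyUI's API-style JSON (expressed as a Python dict), along with how such a graph is queued on a local ComfyUI server via its standard /prompt endpoint. Node IDs, the checkpoint filename, and all parameter values are placeholders for illustration, not taken from the paper.

```python
# A minimal text-to-image graph in ComfyUI's API JSON format. Each node has
# a class_type and inputs; ["1", 0] means "output slot 0 of node 1".
import json
import urllib.request

workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "sd_xl_base_1.0.safetensors"}},
    "2": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["1", 1],
                     "text": "a waterfall cascading into a crystal-clear pool"}},
    "3": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["1", 1], "text": "blurry, low quality"}},
    "4": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
    "5": {"class_type": "KSampler",
          "inputs": {"model": ["1", 0], "positive": ["2", 0],
                     "negative": ["3", 0], "latent_image": ["4", 0],
                     "seed": 42, "steps": 30, "cfg": 7.0,
                     "sampler_name": "euler", "scheduler": "normal",
                     "denoise": 1.0}},
    "6": {"class_type": "VAEDecode",
          "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
    "7": {"class_type": "SaveImage",
          "inputs": {"images": ["6", 0], "filename_prefix": "comfygen"}},
}

def queue_workflow(flow: dict, host: str = "http://127.0.0.1:8188") -> dict:
    """Queue a workflow on a local ComfyUI server via its /prompt endpoint."""
    req = urllib.request.Request(
        f"{host}/prompt",
        data=json.dumps({"prompt": flow}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # includes the queued prompt_id
```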


To test workflow performance, researchers collected 500 popular prompts and generated images using different workflows. They evaluated the results using aesthetic predictors and human preference estimation models.
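
A rough sketch of that scoring loop follows, assuming the image generator and the scoring predictors are supplied as callables; these are hypothetical stand-ins for ComfyUI execution and the aesthetic and human-preference models the researchers used.

```python
# Sketch of the evaluation described above: run every candidate workflow on
# a set of prompts and average the predictor scores. The generate and scorer
# callables are hypothetical placeholders.
from statistics import mean
from typing import Callable, Dict, List

def build_score_table(
    workflows: Dict[str, dict],
    prompts: List[str],
    generate: Callable[[dict, str], bytes],
    scorers: List[Callable[[bytes, str], float]],
) -> Dict[str, float]:
    """Map each workflow ID to its average predicted quality over the prompts."""
    table = {}
    for flow_id, flow in workflows.items():
        per_prompt = [
            mean(score(generate(flow, p), p) for score in scorers)
            for p in prompts
        ]
        table[flow_id] = mean(per_prompt)
    return table
```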

In-context learning or fine-tuning

The final model takes a prompt and a target score as input and generates a JSON workflow to achieve the desired score. The researchers tested two approaches: in-context learning and fine-tuning.

In-context learning uses an existing LLM such as Claude 3.5 Sonnet. It is given a table of workflows and average scores for different prompt categories and selects the most appropriate workflow for new prompts. Fine-tuning trains an LLM (such as Llama-3.1-8B and -70B) to predict appropriate workflows for given prompts and target scores.
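
The sketch below illustrates the difference between the two setups. The table layout, prompt wording, and training-record format are assumptions on our part; the paper conditions the fine-tuned model on a prompt plus a target score but does not publish the exact template.

```python
# Hedged sketch of the two variants. All formats are illustrative assumptions.
from typing import Dict

# In-context learning: show the LLM a table of workflows with their average
# scores per prompt category and ask it to pick the best one for a new prompt.
def build_ic_prompt(user_prompt: str,
                    score_table: Dict[str, Dict[str, float]]) -> str:
    rows = "\n".join(
        f"{flow_id}: " + ", ".join(f"{cat}={s:.2f}" for cat, s in cats.items())
        for flow_id, cats in score_table.items()
    )
    return (
        "Average scores of known workflows per prompt category:\n"
        f"{rows}\n\n"
        f"New prompt: {user_prompt}\n"
        "Select the workflow best suited to this prompt and return its ID."
    )

# Fine-tuning: each training example pairs (prompt, observed score) with the
# workflow JSON that produced that score, so at inference time the model can
# be asked for a workflow that should reach a high target score.
def make_ft_record(prompt: str, score: float, workflow_json: str) -> dict:
    return {
        "input": f"score: {score:.2f}\nprompt: {prompt}",
        "output": workflow_json,
    }
```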

The following three images are all based on the prompt "A photo of a cake and a stop sign":

Cake with blurred lettering and a stop sign on a wooden table, with a road and a car in the background, generated with SDXL.
SDXL. Image: Gal et al.
Cake and stop sign against a minimalist, reddish background, generated with ComfyGen-IC.
ComfyGen with in-context learning. Image: Gal et al.
Cake with strawberries and a stop sign on a wooden table against a plant backdrop, generated with ComfyGen-FT.
ComfyGen with fine-tuning. Image: Gal et al.

Experiments comparing ComfyGen with monolithic models such as Stable Diffusion XL and with fixed, popular workflows showed that ComfyGen performed best in both automatic metrics and user studies. The fine-tuned variant slightly outperformed the in-context learning approach.


An analysis of the selected workflows showed that the model choice often matched the prompt category: facial upscaling models were selected more frequently for "people" prompts, while models focused on correct anatomy were selected for "anime" prompts.

Promising approach with room for improvement

The advantage of ComfyGen is that it builds directly on existing workflows and scoring models created by the community, making it adaptable to new developments. However, this currently limits the variety and originality of generated workflows, as the system primarily selects known workflows from training data.

The researchers plan to extend the method so that it can generate entirely new workflows, and to apply it to image-to-image tasks. In the future, prompt-driven workflows like ComfyGen could lower the barrier to entry for beginners while improving image quality.

The team suggests combining this approach with agent-based methods, where LLMs iteratively refine workflows through user dialog, as a topic for future research. The researchers have not yet released code or a demo of ComfyGen.

Summary
  • Nvidia and Tel Aviv University researchers created ComfyGen, an AI system that automatically builds text-to-image workflows by selecting models, crafting prompts, and applying tools like upscalers.
  • ComfyGen uses large language models to create JSON workflows from brief text prompts, drawing on popular Stable Diffusion community workflows.
  • In tests, ComfyGen outperformed monolithic models like Stable Diffusion XL and fixed workflows, with its fine-tuned version slightly edging out the in-context learning approach.