Content
summary Summary

Deepseek has completely overhauled its multimodal AI system Janus. The new version, Janus Pro, improves on its predecessor through refined training methods, expanded datasets, and larger model sizes.

Ad

The team made several key improvements to their training approach for more efficient data use. They also significantly expanded their training data, adding about 90 million new examples for multimodal understanding from various sources, including YFCC caption datasets and specialized collections for understanding tables, diagrams, memes, and documents.

For image generation, they incorporated about 72 million synthetic training examples, some created using Midjourney. This brought the ratio of real to synthetic data to an even 1:1, according to their paper.

One of the biggest changes is the introduction of a larger model size. While the original 1B version remains available, there's now a 7B version that shows notably better performance in both understanding and generating images.

Ad
Ad

In benchmark testing, the larger Janus Pro-7B scored 79.2 on MMBench for multimodal understanding, well above its predecessor's 69.4. While this marks significant progress, some similar-sized competing models still perform better.

Higher quality in image generation

The most noticeable improvements show up in prompt following. Where the earlier version struggled with brief instructions and inconsistent image quality, Janus Pro can now create detailed, consistent images from short prompts and handle complex instructions more effectively, the team says.

These improvements show up in the numbers: Janus Pro scored 80% accuracy on GenEval, compared to its predecessor's 61%. It even outperformed DALL-E 3 (67%) and Stable Diffusion 3 Medium (74%) on this metric - though these benchmarks don't tell the whole story about image quality, where both competitors still typically produce better results and better models are available.

One significant limitation holds Janus Pro back: both input and output images are restricted to 384 x 384 pixels. This affects quality, especially for fine details like faces, and makes it harder for the system to understand text in images. The team suggests future versions with higher resolution could solve these issues.

While Deepseek hasn't confirmed whether these improvements will appear in a future Janus release, it seems likely. The company recently gained attention with its R1 model - some called it the AI "Sputnik moment" for the US - and attracted new customers. A capable multimodal model could help them compete more effectively with OpenAI's ChatGPT, assuming they have the necessary computing infrastructure.

Recommendation

More information, the code, and the model are available on GitHub and HuggingFace. There is also a demo available.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • Deepseek has launched Janus Pro, a major upgrade to its multimodal AI system, featuring improved training methods, expanded datasets, and larger model sizes.

  • The new version incorporates 90 million additional examples for multimodal understanding from various sources, as well as 72 million synthetic training examples for image generation, bringing the ratio of real to synthetic data to 1:1.

  • Janus Pro introduces a larger 7B model size, which outperforms its predecessor in both understanding and generating images.

Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.