Alibaba has released a new language model called Qwen3-Next, built on a customized mixture-of-experts (MoE) architecture. The company says the model runs significantly faster than its predecessors without sacrificing performance.


The earlier Qwen3 model used 128 experts and activated 8 of them at each inference step. Qwen3-Next expands the MoE layer to 512 experts but activates only 10 of them plus one shared expert per token. According to Alibaba, this setup delivers more than 10 times the inference speed of Qwen3-32B, especially on long inputs beyond 32,000 tokens.
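For illustration, here is a minimal PyTorch sketch of sparse routing with a shared expert, using the 512-expert, top-10 figures from the article. The layer sizes, module names, and routing details are assumptions for readability, not Alibaba's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy sparse MoE block: each token goes to its top_k experts plus one shared expert."""

    def __init__(self, d_model=64, d_ff=128, n_experts=512, top_k=10):
        super().__init__()
        self.top_k = top_k
        # One small feed-forward network per routed expert.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # Shared expert that every token passes through, regardless of routing.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.router = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x):                          # x: (tokens, d_model)
        logits = self.router(x)                    # (tokens, n_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # normalize over the selected experts only
        routed = []
        for t in range(x.size(0)):                 # naive per-token loop, for readability
            parts = [w * self.experts[int(i)](x[t]) for w, i in zip(weights[t], indices[t])]
            routed.append(torch.stack(parts).sum(dim=0))
        return self.shared_expert(x) + torch.stack(routed)

tokens = torch.randn(4, 64)
layer = SparseMoELayer()
print(layer(tokens).shape)  # torch.Size([4, 64])
```

Because only 10 of the 512 experts (plus the shared one) run per token, the compute per step stays close to that of a much smaller dense model, which is where the claimed speedup on long inputs comes from.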

The architecture also includes several tweaks to stabilize training. These help avoid issues like uneven expert use, numerical instability, or initialization errors. Among them are normalized initialization for router parameters and output gating in attention layers.
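The article does not spell out how these tweaks are implemented, so the following is only a hedged sketch of what "normalized router initialization" and "attention output gating" could look like; the function and module names are hypothetical.

```python
import torch
import torch.nn as nn

def init_router_normalized(router: nn.Linear) -> None:
    # Assumption: draw the router weights, then rescale each expert's row to unit norm,
    # so no expert starts out with systematically larger routing logits than the others.
    with torch.no_grad():
        nn.init.normal_(router.weight, std=0.02)
        router.weight /= router.weight.norm(dim=-1, keepdim=True)

class GatedAttentionOutput(nn.Module):
    # Assumption: a learned sigmoid gate on the attention output before it is projected
    # and added back to the residual stream, bounding the magnitude of the update.
    def __init__(self, d_model=64):
        super().__init__()
        self.out_proj = nn.Linear(d_model, d_model)
        self.gate_proj = nn.Linear(d_model, d_model)

    def forward(self, attn_out, hidden):
        return torch.sigmoid(self.gate_proj(hidden)) * self.out_proj(attn_out)

router = nn.Linear(64, 512, bias=False)
init_router_normalized(router)
print(router.weight.norm(dim=-1)[:3])  # each expert's row now has unit norm
```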

In addition to the base model, Alibaba introduced two specialized versions: Qwen3-Next-80B-A3B-Instruct for general-purpose tasks and Qwen3-Next-80B-A3B-Thinking for reasoning-heavy problems. The company says the smaller Instruct model performs nearly on par with its flagship Qwen3-235B-A22B-Instruct, particularly with long contexts up to 256,000 tokens. The Thinking model reportedly beats Google's closed Gemini 2.5 Flash Thinking on several benchmarks and comes close to Alibaba's own top-tier Qwen3-235B-A22B-Thinking in key metrics.

Image: Qwen.ai

The models are available on Hugging Face, ModelScope, and the Nvidia API Catalog. For running them on private servers, the team recommends inference frameworks such as SGLang or vLLM. The context window currently extends to 256,000 tokens and, with specialized techniques, up to one million.
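A minimal serving sketch with vLLM might look like the following; the Hugging Face repo id, GPU count, and sampling settings are assumptions here, so check the official model card for exact requirements and supported vLLM versions.

```python
from vllm import LLM, SamplingParams

# Hypothetical deployment settings; adjust tensor_parallel_size to your GPU count.
llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct", tensor_parallel_size=4)
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Summarize the Qwen3-Next architecture in two sentences."], params)
print(outputs[0].outputs[0].text)
```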

Summary
  • Alibaba has released Qwen3-Next, a new language model built on a customized mixture-of-experts design that runs over 10 times faster than Qwen3-32B on long inputs while maintaining performance.
  • The model expands to 512 experts but activates only 10 plus a shared expert at each step, with added design tweaks like normalized router initialization and output gating to stabilize training and prevent uneven expert use.
  • Alibaba also introduced two specialized versions: Qwen3-Next-80B-A3B-Instruct, which rivals its larger Qwen3-235B in handling long contexts, and Qwen3-Next-80B-A3B-Thinking, which outperforms Google's Gemini 2.5 Flash Thinking on some benchmarks. Both are available through Hugging Face, ModelScope, and Nvidia's API Catalog, and are compatible with private deployment frameworks.