Alibaba has released a new language model called Qwen3-Next, built on a customized mixture-of-experts (MoE) architecture. The company says the model runs significantly faster than its predecessors without sacrificing output quality.
The earlier Qwen3 model used 128 experts, activating 8 at each inference step. Qwen3-Next expands that layer to 512 experts but activates only 10 of them plus one shared expert per token. According to Alibaba, this setup delivers more than ten times the inference throughput of Qwen3-32B, especially for long inputs beyond 32,000 tokens.
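To make the routing pattern concrete, here is a minimal PyTorch sketch of how such a sparse MoE layer typically works: a router scores all 512 experts for each token, only the top 10 are actually run and their outputs mixed, and a shared expert processes every token regardless. The class name, layer sizes, and gating details are illustrative assumptions, not Alibaba's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative sparse MoE layer: 512 routed experts, top-10 active per
    token, plus one always-on shared expert. Sizes are placeholders."""

    def __init__(self, d_model=1024, d_ff=512, n_experts=512, top_k=10):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):                              # x: (num_tokens, d_model)
        scores = self.router(x)                        # one logit per expert per token
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        gates = F.softmax(top_scores, dim=-1)          # weights over the 10 chosen experts
        out = self.shared_expert(x)                    # the shared expert sees every token
        for t in range(x.size(0)):                     # naive per-token loop for readability
            for gate, e in zip(gates[t], top_idx[t]):
                out[t] = out[t] + gate * self.experts[int(e)](x[t])
        return out
```

Because only 10 of the 512 experts plus the shared expert run per token, the bulk of the 80 billion parameters sits idle at each step, which is where the claimed speed advantage over the dense Qwen3-32B comes from.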
The architecture also includes several tweaks to stabilize training. These help avoid issues such as uneven expert utilization, numerical instability, and poorly scaled initializations. Among them are a normalized initialization for the router parameters and output gating in the attention layers.
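The sketch below illustrates the general idea behind those two named tweaks, not Alibaba's exact formulation: the router weights are rescaled at initialization so no expert starts out favored, and a learned sigmoid gate rescales the attention output before it re-enters the residual stream. All sizes and the module name `GatedAttentionOutput` are assumptions for illustration.

```python
import torch
import torch.nn as nn

d_model, n_experts = 1024, 512   # placeholder sizes

# Normalized router initialization (general idea): small, uniformly scaled
# router weights so no expert receives outsized logits at step zero.
router = nn.Linear(d_model, n_experts, bias=False)
with torch.no_grad():
    nn.init.normal_(router.weight, std=0.02)
    router.weight /= router.weight.norm(dim=-1, keepdim=True)  # unit-norm rows

# Attention output gating (general idea): an element-wise sigmoid gate,
# computed from the hidden state, damps the attention output before the
# residual connection, which helps keep activations numerically stable.
class GatedAttentionOutput(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, attn_out, hidden_states):
        gate = torch.sigmoid(self.gate_proj(hidden_states))  # values in (0, 1)
        return self.out_proj(gate * attn_out)
```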
In addition to the base model, Alibaba introduced two specialized versions: Qwen3-Next-80B-A3B-Instruct for general-purpose tasks and Qwen3-Next-80B-A3B-Thinking for reasoning-heavy problems. The company says the smaller Instruct model performs nearly on par with its flagship Qwen3-235B-A22B-Instruct, particularly with long contexts up to 256,000 tokens. The Thinking model reportedly beats Google's closed Gemini 2.5 Flash Thinking on several benchmarks and comes close to Alibaba's own top-tier Qwen3-235B-A22B-Thinking in key metrics.

The models are available on Hugging Face, ModelScope, and the Nvidia API Catalog. For self-hosted deployment, the team recommends inference frameworks such as SGLang or vLLM. The native context window goes up to 256,000 tokens, and with additional techniques it can be extended to as much as one million tokens.
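For a quick local test, the checkpoints can also be loaded through Hugging Face transformers. The snippet below is a minimal sketch, assuming a transformers release that already includes Qwen3-Next support and enough GPU memory for the 80B checkpoint; for production serving, the SGLang or vLLM route recommended above is the more realistic path.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"  # Hugging Face repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # shard weights across available GPUs (requires accelerate)
)

messages = [{"role": "user",
             "content": "Give a one-paragraph summary of mixture-of-experts models."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```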