- Added FP8 releases
Update from September 23, 2025:
Alibaba has added two new models to its Qwen3-Next line, both built around FP8 precision. Qwen3-Next-80B-A3B-Instruct-FP8 and Qwen3-Next-80B-A3B-Thinking-FP8 use the FP8 (8-bit floating point) format, which is designed to speed up inference. Both models work out of the box with frameworks like Transformers, vLLM, and SGLang.
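As a rough sketch of what "out of the box" means in practice, here is a standard Transformers text-generation flow. It assumes a Transformers version recent enough to include Qwen3-Next support; the model ID matches the Hugging Face release, but the loading options and prompt are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct-FP8"  # Hugging Face ID from the release

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # pick up the FP8 weight config from the checkpoint
    device_map="auto",   # spread the 80B parameters across available GPUs
)

messages = [{"role": "user", "content": "Summarize the Qwen3-Next architecture."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```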
These FP8 models are aimed at situations where speed really matters, such as serving AI applications in real time. Compared to more common formats, FP8 halves the memory and bandwidth needs of FP16 while keeping a floating-point representation that INT8 lacks, offering a stronger balance of throughput and energy use with just a small tradeoff in response accuracy.
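Some back-of-the-envelope arithmetic shows why the precision matters at this scale (weights only; KV cache, activations, and runtime overhead come on top):

```python
# Weights-only footprint of an 80B-parameter checkpoint at different precisions.
params = 80e9
for fmt, bytes_per_param in [("FP16", 2), ("FP8", 1)]:
    print(f"{fmt}: {params * bytes_per_param / 1e9:.0f} GB")
# FP16: 160 GB
# FP8:  80 GB
```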
Both versions are already available on Hugging Face and ModelScope. The Instruct model is geared toward general chatbot and assistant tasks, while the Thinking model is tuned for more complex, logic-heavy jobs.
Original article from September 14, 2025:
Alibaba has released a new language model called Qwen3-Next, built on a customized Mixture-of-Experts (MoE) architecture. The company says the model runs much faster than its predecessors without losing performance.
The earlier Qwen3 model used 128 experts, activating 8 at each inference step. Qwen3-Next expands that layer to 512 experts but only activates 10 of them plus a shared expert. According to Alibaba, this setup delivers more than 10 times the speed of Qwen3-32B, especially with long inputs over 32,000 tokens.
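In a sparse MoE layer, a lightweight router scores every expert per token and only the top-scoring few actually run, which is why activating 10 of 512 experts keeps per-token compute low. A toy sketch of that routing with the article's numbers; the dimensions and expert internals are illustrative assumptions, not Qwen3-Next's actual implementation:

```python
import torch
import torch.nn.functional as F

# Toy sparse-MoE forward pass: 512 routed experts, top-10 selection per
# token, plus one always-on shared expert. Tiny dims keep the demo light.
NUM_EXPERTS, TOP_K, D_MODEL, D_FF = 512, 10, 128, 64

def make_expert():
    return torch.nn.Sequential(
        torch.nn.Linear(D_MODEL, D_FF), torch.nn.GELU(),
        torch.nn.Linear(D_FF, D_MODEL))

router = torch.nn.Linear(D_MODEL, NUM_EXPERTS, bias=False)
experts = torch.nn.ModuleList([make_expert() for _ in range(NUM_EXPERTS)])
shared = make_expert()  # the shared expert runs for every token

@torch.no_grad()
def moe_forward(x):                        # x: (num_tokens, D_MODEL)
    gate = router(x)                       # router scores, (tokens, 512)
    top_w, top_i = gate.topk(TOP_K, -1)    # pick the 10 best experts per token
    top_w = F.softmax(top_w, dim=-1)       # renormalize their weights
    out = shared(x)                        # shared expert contribution
    for t in range(x.size(0)):             # naive per-token dispatch, for clarity
        for w, i in zip(top_w[t], top_i[t].tolist()):
            out[t] += w * experts[i](x[t])
    return out

tokens = torch.randn(4, D_MODEL)
print(moe_forward(tokens).shape)           # torch.Size([4, 128])
```

Only 11 of the 513 expert networks run per token, so compute per token stays roughly constant even as total parameters grow.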
The architecture also includes several tweaks to stabilize training. These help avoid issues like uneven expert use, numerical instability, or initialization errors. Among them are normalized initialization for router parameters and output gating in attention layers.
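One plausible reading of "output gating in attention layers": the attention output is scaled element-wise by a learned sigmoid gate before the residual connection, which can damp unstable activations. A minimal sketch under that assumption; the exact placement and form in Qwen3-Next may differ:

```python
import torch

# Illustrative output gating on a self-attention block: a learned gate in
# (0, 1) modulates the attention output before the residual add.
D_MODEL = 128
attn = torch.nn.MultiheadAttention(D_MODEL, num_heads=4, batch_first=True)
gate_proj = torch.nn.Linear(D_MODEL, D_MODEL)

def gated_attention(x):                    # x: (batch, seq, D_MODEL)
    attn_out, _ = attn(x, x, x)            # standard self-attention
    gate = torch.sigmoid(gate_proj(x))     # per-dimension gate in (0, 1)
    return x + gate * attn_out             # gated residual connection

x = torch.randn(2, 8, D_MODEL)
print(gated_attention(x).shape)            # torch.Size([2, 8, 128])
```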
In addition to the base model, Alibaba introduced two specialized versions: Qwen3-Next-80B-A3B-Instruct for general-purpose tasks and Qwen3-Next-80B-A3B-Thinking for reasoning-heavy problems. The company says the smaller Instruct model performs nearly on par with its flagship Qwen3-235B-A22B-Instruct, particularly with long contexts up to 256,000 tokens. The Thinking model reportedly beats Google's closed Gemini 2.5 Flash Thinking on several benchmarks and comes close to Alibaba's own top-tier Qwen3-235B-A22B-Thinking in key metrics.

The models are available on Hugging Face, ModelScope, and the Nvidia API Catalog. For running them on private servers, the team recommends specialized inference frameworks like SGLang or vLLM. The native context window goes up to 256,000 tokens, and with context-extension techniques, as high as one million.
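A minimal vLLM offline-inference run might look like the following. The model ID comes from the release; the parallelism and context-length settings are assumptions that depend on available hardware:

```python
from vllm import LLM, SamplingParams

# Illustrative vLLM setup; tune tensor_parallel_size and max_model_len
# to your GPU count and memory.
llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    tensor_parallel_size=4,   # assumption: four GPUs
    max_model_len=262144,     # ~256K-token native context window
)
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain mixture-of-experts routing in two sentences."], params)
print(outputs[0].outputs[0].text)
```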