Microsoft's new Large Action Model can perform some tasks in Word

Microsoft researchers have developed what they're calling a "Large Action Model" (LAM) - an AI that can operate Windows programs on its own. This marks a shift from AIs that just talk about doing things to AIs that can actually do them.

Unlike traditional language models like GPT-4o, which mainly process and generate text, LAMs can turn user requests into real actions - whether that's operating software or controlling robots. While the concept isn't entirely new, this is the first model specifically trained to work with Microsoft Office products.

Comparison diagram: LLM gives text instructions, LAM actively performs steps to purchase a jacket on a shopping website, with screenshots of the actions. — The difference between LLM and LAM becomes clear when you look at something like online shopping. While regular LLMs can tell you how to shop online, LAMs can actually navigate the interface and do the shopping themselves. | Image: Microsoft

LAMs can understand what users want from various types of input - text, voice, or images - and turn those requests into detailed step-by-step plans. What's more, they can adjust their approach based on what's happening in real-time.
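
To make the plan, act, and observe idea concrete, here is a minimal toy sketch in Python. It is not Microsoft's implementation: the `ToyShop` environment, the action names, and the hand-written rules in `next_action` are all hypothetical stand-ins for a real interface and a trained model. The point is only that each step is chosen from what the agent currently observes, not from a fixed script.

```python
from dataclasses import dataclass, field


@dataclass
class ToyShop:
    """Hypothetical stand-in for a shopping site; not a real interface."""
    page: str = "home"
    cart: list = field(default_factory=list)

    def apply(self, action: str, arg: str = "") -> str:
        """Carry out one action and return the page the agent sees next."""
        if action == "search":
            self.page = f"results:{arg}"
        elif action == "add_to_cart" and self.page.startswith("results:"):
            self.cart.append(self.page.split(":", 1)[1])
            self.page = "cart"
        elif action == "checkout" and self.cart:
            self.page = "confirmation"
        return self.page


def next_action(request: str, observation: str):
    """Choose the next step from the current observation.

    Hand-written rules stand in for a trained model here (assumption).
    """
    if observation == "home":
        return ("search", request)
    if observation.startswith("results:"):
        return ("add_to_cart", "")
    if observation == "cart":
        return ("checkout", "")
    return None  # nothing left to do


def run(request: str, env: ToyShop, max_steps: int = 10) -> list:
    """Loop: observe, pick an action, act, and record what happened."""
    trace = []
    observation = env.page
    for _ in range(max_steps):
        step = next_action(request, observation)
        if step is None:
            break
        observation = env.apply(*step)
        trace.append((step[0], observation))
    return trace


if __name__ == "__main__":
    for action, page in run("jacket", ToyShop()):
        print(f"{action:12s} -> {page}")
```

An LLM in the same situation would stop at describing these steps in text; the loop above actually changes the state of the (toy) environment.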

Training an LLM to take action

Building a LAM involves four main steps. First, the model learns to break tasks into logical steps. Then it learns from more advanced AI like GPT-4o to turn those plans into actions. After that, it explores new solutions on its own, even tackling problems that stumped other AI systems. Finally, the system gets fine-tuned through reward-based training.

Flowchart: Four-phase LAM training pipeline with task plan pretraining, expert learning, self-boosting, and reward model training. — The four-phase LAM training pipeline combines task plan pre-training, expert learning, self-reinforcing exploration, and reward model-based optimization. | Image: Microsoft

For their test case, the researchers built a LAM based on Mistral-7B and let it loose in a Word test environment. It completed tasks successfully 71% of the time, ahead of the 63% success rate GPT-4o achieved without visual information.

The LAM was also much faster, needing only 30 seconds per task compared to GPT-4o's 86 seconds. However, when GPT-4o got visual information, it was more accurate than the LAM, with a 75.5% success rate.
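
For readers who want the comparison in one place, the short Python sketch below works through the figures quoted above. The dictionary keys and variable names are illustrative choices; only the numbers come from the article.

```python
# Figures as reported in the article; names are illustrative only.
success_pct = {
    "lam_text_only": 71.0,
    "gpt4o_text_only": 63.0,
    "gpt4o_with_vision": 75.5,
}
seconds_per_task = {
    "lam_text_only": 30.0,
    "gpt4o_text_only": 86.0,
}

# Accuracy gap in percentage points, text-only setting.
gap_text_only = success_pct["lam_text_only"] - success_pct["gpt4o_text_only"]

# How far GPT-4o with visual input is ahead of the LAM.
gap_vision = success_pct["gpt4o_with_vision"] - success_pct["lam_text_only"]

# How many times faster the LAM finishes a task.
speedup = seconds_per_task["gpt4o_text_only"] / seconds_per_task["lam_text_only"]

print(f"LAM lead over text-only GPT-4o: {gap_text_only:.1f} points")
print(f"GPT-4o with vision lead over LAM: {gap_vision:.1f} points")
print(f"LAM speedup per task: {speedup:.2f}x")
```

In other words, the LAM is about 8 points ahead of text-only GPT-4o and close to three times as fast, while GPT-4o with visual input stays 4.5 points ahead on accuracy.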

Building the training data

The team started with 29,000 task-plan pairs collected from Microsoft documentation, wikiHow articles, and Bing searches. To expand this further, they used GPT-4o to transform simple tasks into more complex ones. A basic task like "Create a drop-down list" evolved into "Create a dependent drop-down list where the first selection filters the options in the second list."

This "data evolving" strategy helped them grow their dataset to 76,000 pairs, roughly 2.6 times the original size. From all these examples, about 2,000 successful action sequences made it into the final training set.
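
The dataset growth can be checked with a few lines of arithmetic. The sketch below uses the rounded figures quoted in this section; the exact counts in the underlying paper may differ slightly, so treat the result as approximate.

```python
# Rounded figures from the article; exact counts may differ.
initial_pairs = 29_000
evolved_pairs = 76_000

# Size of the evolved dataset relative to the starting set.
growth_factor = evolved_pairs / initial_pairs

# Relative increase: new pairs added, as a percentage of the starting set.
increase_pct = (evolved_pairs - initial_pairs) / initial_pairs * 100

print(f"Growth factor: {growth_factor:.2f}x")
print(f"Relative increase: {increase_pct:.0f}%")
```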

Process diagram: LAM development pipeline with 5 phases - data preparation, model training, integration, execution and evaluation. — The LAM development pipeline combines data preparation, training, and real-world implementation into one continuous process. The agent links the trained LAM to specific Windows actions and uses feedback to make ongoing improvements. | Image: Microsoft

The system still faces several hurdles: there are valid concerns about AI actions going wrong, regulatory questions that need answers, and technical limitations that make it difficult to scale up or adapt to different applications.

Despite these challenges, the researchers say Large Action Models mark a significant step toward artificial general intelligence (AGI). Instead of AI systems that just understand and generate text, we might soon have AI assistants that can actively help us complete real-world tasks.
