Microsoft researchers have developed what they're calling a "Large Action Model" (LAM) - an AI that can operate Windows programs on its own. This marks a shift from AIs that just talk about doing things to AIs that can actually do them.
Unlike traditional language models like GPT-4o, which mainly process and generate text, LAMs can turn user requests into real actions - whether that's operating software or controlling robots. While the concept isn't entirely new, this is the first model specifically trained to work with Microsoft Office products.
LAMs can understand what users want from various types of input - text, voice, or images - and turn those requests into detailed step-by-step plans. What's more, they can adjust their approach based on what's happening in real-time.
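To make the plan-then-adjust idea concrete, here is a minimal sketch of such a loop. It is an illustration only, not Microsoft's implementation: `Step`, `run_task`, and the `plan` and `execute` callbacks are hypothetical names standing in for the model and the application it controls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One planned action, e.g. 'open the Insert tab' (hypothetical structure)."""
    description: str


def run_task(
    request: str,
    plan: Callable[[str, List[str]], List[Step]],
    execute: Callable[[Step], bool],
    max_replans: int = 2,
) -> List[str]:
    """Plan steps for a request, execute them, and re-plan when a step fails.

    `plan` and `execute` are placeholders (assumptions) for the model and the
    application environment; neither reflects a real Microsoft API.
    """
    history: List[str] = []  # descriptions of the steps that succeeded so far
    for _ in range(max_replans + 1):
        # Ask for the remaining steps, given what has already been done.
        steps = plan(request, history)
        failed = False
        for step in steps:
            if execute(step):
                history.append(step.description)
            else:
                failed = True  # the action failed or the environment changed
                break
        if not failed:
            return history
    raise RuntimeError(f"could not complete: {request!r}")
```

The point of passing `history` back into `plan` is that a failed step does not restart the whole task; the planner only has to propose what is left.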
Training an LLM to take action
Building a LAM involves four main steps. First, the model learns to break tasks into logical steps. Next, it learns from more capable models such as GPT-4o how to turn those plans into actions. After that, it explores new solutions on its own, even tackling problems that stumped other AI systems. Finally, the system is fine-tuned through reward-based training.
For their test case, the researchers built a LAM based on Mistral-7B and let it loose in a Word test environment. It completed tasks successfully 71% of the time, ahead of the 63% success rate GPT-4o achieved when working without visual information.
The LAM was also much faster, needing only 30 seconds per task compared to GPT-4o's 86 seconds. However, when GPT-4o was given visual information, it was more accurate than the LAM, with a 75.5% success rate.
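For readers who want the comparison in relative terms, the short calculation below works from the figures quoted above. The variable names are made up for illustration, and the numbers are the rounded values from the article rather than raw benchmark data.

```python
# Figures as reported in the article (LAM vs. GPT-4o without visual input).
lam_success, gpt4o_success = 71.0, 63.0   # percent of tasks completed
lam_seconds, gpt4o_seconds = 30.0, 86.0   # time per task in seconds

success_gap = lam_success - gpt4o_success  # difference in percentage points
speedup = gpt4o_seconds / lam_seconds      # how many times faster the LAM is

print(f"Success gap: {success_gap:.1f} percentage points")
print(f"Speedup: {speedup:.2f}x")
```

On these rounded numbers the LAM is ahead by 8 percentage points and is a little under three times as fast per task.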
Building the training data
The team started with 29,000 task-plan pairs collected from Microsoft documentation, wikiHow articles, and Bing searches. To expand this further, they used GPT-4o to transform simple tasks into more complex ones. A basic task like "Create a drop-down list" evolved into "Create a dependent drop-down list where the first selection filters the options in the second list."
This "data evolving" strategy helped them grow their dataset to 76,000 pairs - roughly 2.6 times its original size. From all these examples, about 2,000 successful action sequences made it into the final training set.
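The expansion step can be pictured roughly as follows. This is a sketch under stated assumptions, not the researchers' pipeline: `evolve_dataset` is a hypothetical helper, and the `evolve` callback stands in for whatever GPT-4o prompt the team used to rewrite a simple task into harder variants.

```python
from typing import Callable, List, Tuple

TaskPlan = Tuple[str, str]  # (task description, plan text)


def evolve_dataset(
    seeds: List[TaskPlan],
    evolve: Callable[[TaskPlan], List[TaskPlan]],
) -> List[TaskPlan]:
    """Return the seed pairs plus any new pairs produced by `evolve`.

    `evolve` is an assumption: a placeholder for a model call that turns one
    task-plan pair into more complex variants.
    """
    seen = {task for task, _ in seeds}
    result = list(seeds)
    for pair in seeds:
        for new_task, new_plan in evolve(pair):
            if new_task not in seen:  # skip tasks already in the dataset
                seen.add(new_task)
                result.append((new_task, new_plan))
    return result
```

A toy `evolve` that maps "Create a drop-down list" to the dependent drop-down variant would turn a one-pair seed set into two pairs; a variant that merely repeats an existing task is dropped.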
The system still faces several hurdles: there are valid concerns about AI actions going wrong, regulatory questions that need answers, and technical limitations that make it difficult to scale up or adapt to different applications.
Despite these challenges, the researchers say LAMs mark a significant step toward artificial general intelligence (AGI). Instead of AI systems that just understand and generate text, we might soon have AI assistants that can actively help us complete real-world tasks.