Oppo open-sources Android AI agent X-OmniClaw that uses your camera, screen, and voice without leaving the phone
Oppo's Multi-X team released X-OmniClaw, an open-source agent that taps into the camera, screen, and voice to get things done in real Android apps, all without routing through a cloud copy of your phone.
In the technical report, Oppo's AI Center draws a clear line between its approach and cloud phone platforms like RedFinger, Alibaba's Wuying, and Tencent Cloud Phone. Those services run agents inside virtualized Android instances in a data center. That means they can't touch local sensors, cameras, or private data.
X-OmniClaw takes the opposite route. It runs directly on the physical Android device. The core logic for perception, control, and app interaction lives on the phone itself. A cloud language model only gets called in as "fuel" for higher-level reasoning when needed, the report says. It doesn't name the specific local models involved, but it does list components like an on-device grounding model and OCR for detecting tappable UI elements.

Camera, screen, and voice feed into a single pipeline
The agent bundles three perception channels into one pipeline. A vision-language model first interprets the scene along with the user's request before triggering any action.

In the researchers' example, a user asks "How much does this cost on Taobao?" while pointing the camera at a product. The system rephrases that internally to "price of Evian spray on Taobao" and only then hands the structured intent off for execution.
Photo gallery becomes searchable memory
For long-term memory, X-OmniClaw condenses local data into semantic entries. During idle time, gallery photos get processed into compact descriptions of objects, scenes, and events, then stored in a Markdown file.

Every entry runs through a filter designed to strip out sensitive info before it's saved. The report flags upload risks tied to cloud-based vision models. Moving to on-device models is the next step, the report says, so raw images never have to leave the phone.
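
To make the idea concrete, here is a minimal Python sketch of a filtered memory entry. It is an illustration, not Oppo's implementation: the report does not describe how its filter works, so the regex patterns, the entry layout, and the function names `redact`, `to_markdown_entry`, and `append_entry` are all assumptions.

```python
import re
from datetime import date

# Hypothetical patterns: the report does not say what its filter matches.
SENSITIVE_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # email addresses
    re.compile(r"\b\d[\d -]{7,}\d\b"),         # long digit runs (phone or card numbers)
]


def redact(text: str) -> str:
    """Replace anything matching a sensitive pattern with a placeholder."""
    for pattern in SENSITIVE_PATTERNS:
        text = pattern.sub("[redacted]", text)
    return text


def to_markdown_entry(photo_name: str, taken: date, description: str) -> str:
    """Build one Markdown list item; only the free-text description is filtered."""
    return f"- {taken.isoformat()} | {photo_name} | {redact(description)}\n"


def append_entry(memory_path: str, entry: str) -> None:
    """Append the entry to the Markdown memory file (assumed layout)."""
    with open(memory_path, "a", encoding="utf-8") as f:
        f.write(entry)


entry = to_markdown_entry(
    "IMG_0042.jpg",
    date(2025, 1, 5),
    "Parrot on a balcony, note says call 555 123 4567 or mail a.b@example.com",
)
```

In this sketch the filter runs before anything is written, so the Markdown file never contains the raw phone number or email address. A real filter would need far broader coverage than two regular expressions.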
Cloned tap paths replace step-by-step replays
Instead of planning every action from scratch, the agent clones user behavior into reusable skills. It extracts the full launch command for an app page and jumps there directly via deeplink next time, rather than replaying the original tap path.

If that fails, the system falls back through simpler launch methods one by one. To detect tappable elements, X-OmniClaw combines XML structure data with a grounding model and text recognition. That helps with ad-heavy interfaces where XML alone can't pin down a precise tap target.
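
The deeplink-first strategy with step-by-step fallbacks can be sketched in a few lines of Python. This is a hedged illustration only: the report summary does not name the individual fallback methods, so the launcher names below, including tap-path replay as a last resort, are assumptions, and the stub functions stand in for real Android calls.

```python
from typing import Callable, Optional, Sequence, Tuple

# A launcher is a label plus a function that returns True on success.
Launcher = Tuple[str, Callable[[], bool]]


def launch_with_fallback(launchers: Sequence[Launcher]) -> Optional[str]:
    """Try each launch method in order; return the name of the first that succeeds."""
    for name, attempt in launchers:
        try:
            if attempt():
                return name
        except Exception:
            # A raised error counts as a failed attempt; move on to the next method.
            continue
    return None


# Stubs standing in for real launch calls (all hypothetical).
def cloned_deeplink() -> bool:
    raise RuntimeError("target page rejected the cloned launch command")


def open_app_home() -> bool:
    return True


used = launch_with_fallback([
    ("cloned deeplink", cloned_deeplink),
    ("app home screen", open_app_home),
    ("replay tap path", lambda: True),
])
```

Here the cloned deeplink fails, so the agent would land on the app's home screen and continue from there. Ordering the list from most direct to most laborious keeps the common case fast while still guaranteeing a path to the target page.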
From price checks to homework help
In the first scenario, a user points the camera at a product and asks about the price. The agent jumps into the shopping app, scrolls, takes screenshots, and reads out prices and sales figures through a vision-language model. A follow-up like "open the second item" works without any extra grounding.

In another example, X-OmniClaw acts as a "ScreenAvatar," a "digital surrogate" that solves on-screen tasks on command, like working through a series of practice problems one after another.

A third demo shows the system responding to a request to turn all parrot photos into a highlight album. It gathers matching files, jumps via deeplink into a video editing app's one-click composition tool, and selects the images with multi-tap.

In the fourth example, the user clones the path to a deeply nested discount page once. Next time, a voice command is enough to reopen that exact subpage, even if the app doesn't offer public deeplinks.

The project builds on the open-source HermesApp codebase and sits between OpenClaw, which focuses more on PCs, and the emergent-capability-driven Hermes Agent from Nous Research. Code and assets are available on GitHub.
Google recently showed with Gemma 4 that a fully local model on a smartphone can already act as an agent. In the demo app "Google AI Edge Gallery," the model uses agent skills to query Wikipedia, generate QR codes, or open mood trackers with trend charts.
In terms of method, the system builds on ByteDance's UI-TARS, a purely visual GUI agent that relies only on screenshots and coordinates. X-OmniClaw combines that approach with structural XML data and on-device execution to cut down on the error rate that pure vision pipelines hit with dynamic interfaces.