
Oppo open-sources Android AI agent X-OmniClaw that uses your camera, screen, and voice without leaving the phone


Oppo's Multi-X team released X-OmniClaw, an open-source agent that taps into the camera, screen, and voice to get things done in real Android apps, all without routing through a cloud copy of your phone.

In the technical report, Oppo's AI Center draws a clear line between its approach and cloud phone platforms like RedFinger, Alibaba's Wuying, and Tencent Cloud Phone. Those services run agents inside virtualized Android instances in a data center. That means they can't touch local sensors, cameras, or private data.

X-OmniClaw takes the opposite route. It runs directly on the physical Android device. Core logic for perception, control, and app interaction all lives on the phone itself. A cloud language model only gets called in as "fuel" for higher-level reasoning when needed, the report says. The report doesn't name the specific local models involved, but it does list components like an on-device grounding model and OCR for detecting tappable UI elements.

X-OmniClaw's full architecture runs on-device. Cloud models only provide "fuel" for complex reasoning, according to Oppo. | Image: Oppo

Camera, screen, and voice feed into a single pipeline

The agent bundles three perception channels into one pipeline. A vision-language model first interprets the scene along with the user's request before triggering any action.

The perception stack combines text, voice, camera, and screen signals, aligns them in time, and passes a structured intent to the language model. | Image: Oppo

In the researchers' example, a user asks "How much does this cost on Taobao?" while pointing the camera at a product. The system rephrases that internally to "price of Evian spray on Taobao" and only then hands the structured intent off for execution.
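
The rewrite step in that example can be illustrated with a small sketch. It is not Oppo's implementation: in X-OmniClaw a vision-language model does this work, while the rule-based stand-in below only shows the shape of the idea. The `StructuredIntent` fields, the `resolve_intent` function, and the keyword rules are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuredIntent:
    """Hypothetical intent record; the report does not specify its fields."""
    app: str      # target app named in the utterance
    action: str   # coarse task type
    query: str    # self-contained query with no dangling "this"


# Demonstratives that point at whatever the camera currently sees.
DEICTIC = re.compile(r"\b(this|that|it)\b", re.IGNORECASE)
PRICE_WORDS = re.compile(r"\b(cost|price)\b", re.IGNORECASE)


def resolve_intent(utterance: str, camera_label: str, known_apps: list[str]) -> StructuredIntent:
    """Merge a spoken request with a camera label into one explicit intent."""
    app = next((a for a in known_apps if a.lower() in utterance.lower()), "")
    action = "price_lookup" if PRICE_WORDS.search(utterance) else "search"
    # Only substitute the camera label when the user actually points at something.
    subject = camera_label if DEICTIC.search(utterance) else utterance
    if action == "price_lookup" and app:
        query = f"price of {subject} on {app}"
    else:
        query = subject
    return StructuredIntent(app=app, action=action, query=query)


intent = resolve_intent(
    "How much does this cost on Taobao?", "Evian spray", ["Taobao", "Meituan"]
)
print(intent.query)  # price of Evian spray on Taobao
```

The point of the intermediate record is that everything downstream can work from a self-contained query instead of a sentence that only makes sense next to the camera image.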

Photo gallery becomes searchable memory

For long-term memory, X-OmniClaw condenses local data into semantic entries. During idle time, gallery photos get processed into compact descriptions of objects, scenes, and events, then stored in a Markdown file.

The memory module summarizes gallery photos during idle time into a Markdown file called "image-memory.md" and filters out sensitive content before saving. | Image: Oppo

Every entry runs through a filter designed to strip out sensitive info before it's saved. The report flags the risk of uploading images to cloud vision models and names a move to on-device models as the next step, so raw images never have to leave the phone.
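
A minimal sketch of such a memory file, assuming one Markdown bullet per photo and a regex-based scrubber. The report does not describe the entry format or how the filter works, so the patterns, the `[redacted]` marker, and the function names below are hypothetical; only the file name "image-memory.md" comes from the report.

```python
import re
from pathlib import Path

# Hypothetical patterns; a real filter would be far more thorough.
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{13,19}\b"),                       # long digit runs, e.g. card numbers
    re.compile(r"\b(password|passport)\b", re.IGNORECASE),
]


def scrub(description: str) -> str:
    """Replace anything matching a sensitive pattern before it is stored."""
    for pattern in SENSITIVE_PATTERNS:
        description = pattern.sub("[redacted]", description)
    return description


def append_memory(memory_file: Path, photo_name: str, description: str) -> None:
    """Append one scrubbed entry as a Markdown bullet."""
    with memory_file.open("a", encoding="utf-8") as fh:
        fh.write(f"- {photo_name}: {scrub(description)}\n")


def search_memory(memory_file: Path, keyword: str) -> list[str]:
    """Return photo names whose stored description mentions the keyword."""
    if not memory_file.exists():
        return []
    matches = []
    for line in memory_file.read_text(encoding="utf-8").splitlines():
        name, sep, description = line.removeprefix("- ").partition(": ")
        if sep and keyword.lower() in description.lower():
            matches.append(name)
    return matches
```

With entries like `- IMG_001.jpg: a green parrot on a branch`, a call such as `search_memory(Path("image-memory.md"), "parrot")` returns the matching photo names without touching the images themselves.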

Cloned tap paths replace step-by-step replays

Instead of planning every action from scratch, the agent clones user behavior into reusable skills. It extracts the full launch command for an app page and jumps there directly via deeplink next time, rather than replaying the original tap path.

Instead of replaying tap paths, X-OmniClaw clones an app page's launch activity and jumps straight there via deeplink next time. | Image: Oppo

If that fails, the system falls back through simpler launch methods one by one. To detect tappable elements, X-OmniClaw combines XML structure data with a grounding model and text recognition. That helps with ad-heavy interfaces where XML alone can't pin down a precise tap target.
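
The fallback behavior can be sketched as an ordered chain of launch attempts. The report only says the system falls back through simpler launch methods one by one; the method names in the usage example and the `launch_with_fallback` helper are assumptions, not names from the project.

```python
from typing import Callable, Optional, Sequence

# A launcher is a label plus a callable that returns True on success.
Launcher = tuple[str, Callable[[], bool]]


def launch_with_fallback(launchers: Sequence[Launcher]) -> Optional[str]:
    """Try each launch method in order and return the name of the first that works."""
    for name, attempt in launchers:
        try:
            if attempt():
                return name
        except Exception:
            # A crashing method counts as a failed attempt; move on to the next one.
            continue
    return None


def broken_deeplink() -> bool:
    raise RuntimeError("activity not found")


# Hypothetical ordering, from most direct to most generic.
used = launch_with_fallback([
    ("cloned_launch_command", broken_deeplink),
    ("app_home_screen", lambda: True),
    ("replay_tap_path", lambda: True),
])
print(used)  # app_home_screen
```

Ordering the list from most direct to most generic keeps the fast path first while still leaving a slower route available when a cloned command stops working, for example after an app update.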

From price checks to homework help

In the first scenario, a user points the camera at a product and asks about the price. The agent jumps into the shopping app, scrolls, takes screenshots, and reads out prices and sales figures through a vision-language model. A follow-up like "open the second item" works without any extra grounding.

The user points the camera at a bottle and asks "How much does this cost?" The agent opens Taobao, scrolls through results, and reads out prices and sales figures. | Image: Oppo

In another example, X-OmniClaw acts as a "ScreenAvatar," a "digital surrogate" that solves on-screen tasks on command, like working through a series of practice problems one after another.

As a "ScreenAvatar," X-OmniClaw works through a series of practice problems in sequence, tapping the correct answers on its own. | Image: Oppo

A third demo shows the system responding to a request to turn all parrot photos into a highlight album. It gathers matching files, jumps via deeplink into a video editing app's one-click composition tool, and selects the images with multi-tap.

From a voice request for a parrot album, the agent searches the condensed gallery memory for matching photos and hands them off to CapCut. | Image: Oppo

In the fourth example, the user clones the path to a deeply nested discount page once. Next time, a voice command is enough to reopen that exact subpage, even if the app doesn't offer public deeplinks.

One recorded path to a deeply nested Meituan discount page is enough. Next time, a voice command gets you there - no public deeplink needed. | Image: Oppo
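
One way to picture a cloned skill is as a stored launch target that a spoken phrase maps back to. The sketch below is an assumption-laden illustration: the `ClonedSkill` class, the trigger-phrase matching, and the example component name are hypothetical, and the article does not say whether the agent issues an `am start` command or with what privileges it would run.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ClonedSkill:
    """Hypothetical record of a page the user demonstrated once."""
    trigger: str                                 # phrase that should reopen the page
    component: str                               # "package/activity" seen during the recorded run
    extras: dict[str, str] = field(default_factory=dict)

    def to_am_command(self) -> list[str]:
        """Build an Android activity-manager command that starts the page directly."""
        cmd = ["am", "start", "-n", self.component]
        for key, value in self.extras.items():
            cmd += ["--es", key, value]          # --es passes a string extra
        return cmd


def find_skill(utterance: str, skills: list[ClonedSkill]) -> Optional[ClonedSkill]:
    """Return the first skill whose trigger phrase occurs in the utterance."""
    lowered = utterance.lower()
    return next((s for s in skills if s.trigger.lower() in lowered), None)


skills = [
    # Placeholder component; a real one would come from the recorded session.
    ClonedSkill("discount page", "com.example.food/.DiscountActivity", {"tab": "coupons"}),
]
skill = find_skill("Open the discount page again", skills)
if skill is not None:
    print(" ".join(skill.to_am_command()))
```

Storing the launch target instead of the tap sequence is what makes the skill robust to layout changes along the original path, since intermediate screens are skipped entirely.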

The project builds on the open-source HermesApp codebase and sits between OpenClaw, which focuses more on PCs, and the emergent-capability-driven Hermes Agent from Nous Research. Code and assets are available on GitHub.

Google recently showed with Gemma 4 that a fully local model on a smartphone can already act as an agent. In the demo app "Google AI Edge Gallery," the model uses agent skills to query Wikipedia, generate QR codes, or open mood trackers with trend charts.

Methodologically, X-OmniClaw builds on ByteDance's UI-TARS, a purely visual GUI agent that relies only on screenshots and coordinates. It combines that approach with structural XML data and on-device execution to cut down on the error rate that pure vision pipelines hit with dynamic interfaces.
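
How structural and visual signals might be reconciled can be sketched with a simple overlap test. This is an assumption, not the method from the report: the `Box` type, the intersection-over-union matching, and the 0.5 threshold are chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    right: int
    bottom: int

    def area(self) -> int:
        return max(0, self.right - self.left) * max(0, self.bottom - self.top)

    def center(self) -> tuple[int, int]:
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two screen rectangles."""
    inter = Box(max(a.left, b.left), max(a.top, b.top),
                min(a.right, b.right), min(a.bottom, b.bottom)).area()
    union = a.area() + b.area() - inter
    return inter / union if union else 0.0


def pick_tap_target(xml_boxes: list[Box], vision_box: Box, min_iou: float = 0.5) -> tuple[int, int]:
    """Prefer a clickable XML node that overlaps the vision detection.

    If no XML node overlaps well enough (for example inside an ad banner
    that exposes no useful structure), fall back to the vision box itself.
    """
    best = max(xml_boxes, key=lambda box: iou(box, vision_box), default=None)
    if best is not None and iou(best, vision_box) >= min_iou:
        return best.center()
    return vision_box.center()


xml_nodes = [Box(0, 0, 100, 100), Box(200, 200, 300, 260)]
print(pick_tap_target(xml_nodes, Box(205, 205, 295, 255)))  # (250, 230)
print(pick_tap_target(xml_nodes, Box(400, 400, 440, 420)))  # (420, 410)
```

In the first call the detection lines up with a real UI node, so the tap snaps to that node's center. In the second, nothing in the XML matches and the visual detection is used as is.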
