Researchers from China have introduced GeoVista, an open-source AI model that geolocates images by combining visual analysis with live web searches. The system aims to match the performance of commercial models such as Gemini 2.5 Flash.
Developed by Tencent and several Chinese universities, the model relies on two primary tools. A zoom function magnifies specific regions to inspect details, while a search tool pulls up to ten relevant sources from platforms like Tripadvisor, Instagram, Facebook, Pinterest, and Wikipedia. GeoVista automatically decides when to deploy each tool.
Real-time search improves results
The team cites web search integration as GeoVista's key advantage over existing methods. While models like Mini-o3 or ByteDance's DeepEyes focus on image manipulation, GeoVista actively mines external data. The paper does not specify which search provider the system uses.

The team built GeoVista on top of Qwen2.5-VL-7B-Instruct using a two-phase process. First, a supervised learning phase taught the model basic reasoning and tool usage via 2,000 curated examples. Commercial AI models generated the example tool calls and justifications, which the team assembled into multi-step reasoning chains.
In the second phase, reinforcement learning refined these skills using 12,000 examples. A custom reward system prioritizes geographic precision: correct answers at the city level earn higher rewards than those at the province or country level.
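The idea behind such a tiered reward can be sketched in a few lines of Python. This is a minimal illustration, not the team's implementation: the article only states that city-level hits earn more than province- or country-level hits, so the concrete reward values, the ordering of province above country, the `Location` type, and the case-insensitive string matching are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical reward values. The article only says that a correct city
# earns more than a correct province or country; the numbers are made up.
REWARD_CITY = 1.0
REWARD_PROVINCE = 0.5
REWARD_COUNTRY = 0.2


@dataclass(frozen=True)
class Location:
    country: str
    province: str
    city: str


def _same(a: str, b: str) -> bool:
    """Case-insensitive comparison that ignores surrounding whitespace."""
    return a.strip().casefold() == b.strip().casefold()


def tiered_reward(pred: Location, truth: Location) -> float:
    """Return the reward for the finest level that is fully correct.

    A finer level only counts if every coarser level also matches, so a
    right city name in the wrong country earns nothing.
    """
    if not _same(pred.country, truth.country):
        return 0.0
    if not _same(pred.province, truth.province):
        return REWARD_COUNTRY
    if not _same(pred.city, truth.city):
        return REWARD_PROVINCE
    return REWARD_CITY
```

Under these assumed values, a prediction that gets the country and province right but misses the city would earn 0.5, which still gives the policy a learning signal for partially correct answers.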
Open source catches up to proprietary tech
On the team's custom GeoBench dataset, GeoVista achieved 92.64 percent accuracy at the country level, 79.60 percent at the province level, and 72.68 percent at the city level. The model works best on panoramas (79.49 percent city accuracy) and standard photos (72.27 percent), while satellite images remain the hardest task at 44.92 percent.

By comparison, Gemini 2.5 Pro hit 78.98 percent at the city level, GPT-5 reached 67.11 percent, and Gemini 2.5 Flash scored 73.29 percent. Other open-source models lagged significantly, with Mini-o3-7B reaching only 11.3 percent. The recently announced Gemini 3 could change these rankings in future tests.
For distance measurements, 52.83 percent of GeoVista's predictions landed within 3 kilometers (1.86 miles) of the actual location, with a median deviation of 2.35 kilometers (1.46 miles). Gemini 2.5 Pro placed 64.45 percent of its predictions within that radius, with a median deviation of 800 meters (0.5 miles), while GPT-5 managed 55.12 percent with a median of 1.86 kilometers (1.16 miles).
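Both figures, the share of predictions within 3 kilometers and the median deviation, can be derived from pairs of predicted and actual coordinates. The following sketch shows one way to compute them. It assumes great-circle (haversine) distance on a spherical Earth and counts a prediction exactly on the 3-kilometer boundary as a hit; the article does not say which distance formula or boundary rule the paper uses, and the function names are illustrative.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def distance_summary(preds, truths, threshold_km: float = 3.0):
    """Return (share of predictions within threshold_km, median error in km).

    preds and truths are equal-length sequences of (lat, lon) tuples.
    """
    if len(preds) != len(truths) or not preds:
        raise ValueError("need two non-empty sequences of equal length")
    dists = [haversine_km(*p, *t) for p, t in zip(preds, truths)]
    # Assumption: the boundary is inclusive (<=).
    within = sum(d <= threshold_km for d in dists) / len(dists)
    return within, median(dists)
```

The median is less sensitive to a few wildly wrong guesses than the mean, which may be why it is reported alongside the threshold share.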
Ablation tests confirmed the need for both training phases. Without the initial supervised learning, performance dropped sharply as the model generated overly short answers and failed to use tools. Skipping reinforcement learning caused similar dips. The tiered reward system proved essential for leveraging multi-level geographic data.
Notably, incorrect tool usage dropped during reinforcement learning even though the team didn't directly optimize for it. Performance also scaled with data volume: tests with 1,500, 3,000, 6,000, and 12,000 examples showed continuous improvement.
New benchmark filters out easy targets
Along with the model, the researchers released GeoBench, a dataset containing 1,142 high-resolution images from 66 countries and 108 cities. The set includes 512 standard photos, 512 panoramas, and 108 satellite images, all with a resolution of at least one million pixels.

Stricter filtering sets GeoBench apart from existing datasets like OpenStreetView-5M or GeoComp. The team removed non-localizable images, such as food close-ups or generic landscapes, as well as easily recognizable landmarks, arguing that internet images vary widely in how easily they can be located.
The benchmark evaluates performance in two ways: a step-by-step accuracy check at the country, province, and city levels, and a precise distance measurement calculated by converting text addresses into coordinates.
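The level-by-level accuracy check could look roughly like the sketch below. It is an illustration under stated assumptions rather than GeoBench's actual scoring code: predictions and labels are taken to be dictionaries with `country`, `province`, and `city` keys, names are compared case-insensitively, and a finer level only counts when every coarser level is also correct. The second evaluation mode, converting text addresses into coordinates, depends on a geocoding service the article does not name and is left out here.

```python
LEVELS = ("country", "province", "city")  # ordered from coarse to fine


def _norm(name: str) -> str:
    """Normalize a place name for a case-insensitive comparison."""
    return name.strip().casefold()


def level_accuracy(predictions, ground_truth):
    """Return the percentage of correct predictions at each level.

    Both arguments are equal-length lists of dicts keyed by LEVELS.
    A missing key in a prediction is treated as a wrong answer.
    """
    if len(predictions) != len(ground_truth) or not predictions:
        raise ValueError("need two non-empty lists of equal length")
    hits = dict.fromkeys(LEVELS, 0)
    for pred, truth in zip(predictions, ground_truth):
        for level in LEVELS:
            if _norm(pred.get(level, "")) != _norm(truth[level]):
                break  # finer levels cannot count once a coarser one is wrong
            hits[level] += 1
    total = len(predictions)
    return {level: 100.0 * hits[level] / total for level in LEVELS}
```

By construction, the resulting percentages can only fall or stay level when moving from country to province to city, matching the pattern in the reported GeoVista scores.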
The model weights, code, and benchmark are available on the project page. While the paper does not address potential misuse, the implications are clear: anyone posting public photos should assume AI models can now determine their location with high precision.