
Researchers from China have introduced GeoVista, an open-source AI model that determines where an image was taken by combining visual analysis with live web searches. The system aims to match the performance of commercial leaders like Gemini 2.5 Flash.


Developed by Tencent and several Chinese universities, the model relies on two primary tools. A zoom function magnifies specific regions to inspect details, while a search tool pulls up to ten relevant sources from platforms like Tripadvisor, Instagram, Facebook, Pinterest, and Wikipedia. GeoVista automatically decides when to deploy each tool.
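The interplay of the two tools can be pictured as a simple loop in which the model either calls a tool or commits to an answer. The sketch below is illustrative only and is not GeoVista's code: `Step`, `run_agent`, `scripted_policy`, the stub tools, and the turn limit are all hypothetical names and values chosen for this example. Only the cap of ten search results comes from the article.

```python
from dataclasses import dataclass
from typing import Callable

MAX_SEARCH_RESULTS = 10  # the article says the search tool returns up to ten sources


@dataclass
class Step:
    """One model decision: a short rationale plus a tool call or a final answer."""
    thought: str
    action: str    # "zoom", "search", or "answer"
    argument: str


def fake_zoom(region: str) -> str:
    # Stand-in for cropping and magnifying an image region (hypothetical).
    return f"cropped view of region {region}"


def fake_search(query: str) -> list[str]:
    # Stand-in for a web search; a real tool would call a search provider.
    results = [f"result {i} for {query!r}" for i in range(1, 25)]
    return results[:MAX_SEARCH_RESULTS]


TOOLS: dict[str, Callable[[str], object]] = {"zoom": fake_zoom, "search": fake_search}


def run_agent(policy, tools, query, max_turns=6):
    """Alternate between policy decisions and tool observations until an answer."""
    history = []
    for _ in range(max_turns):
        step = policy(query, history)
        if step.action == "answer":
            return step.argument, history
        observation = tools[step.action](step.argument)
        history.append((step, observation))
    return None, history  # no answer within the turn budget


def scripted_policy(query, history):
    # A fixed script standing in for the trained model (hypothetical).
    script = [
        Step("A shop sign is too small to read.", "zoom", "0.4,0.5,0.6,0.7"),
        Step("The sign names a cafe; look it up.", "search", "Cafe Example storefront"),
        Step("The sources agree on one place.", "answer",
             "Example City, Example Province, Example Country"),
    ]
    return script[len(history)]


answer, history = run_agent(scripted_policy, TOOLS, "Where was this photo taken?")
print(answer)
for step, observation in history:
    print(step.action, "->", observation if isinstance(observation, str) else len(observation))
```

In the real system the policy is the trained model itself, so the decision about which tool to call next is learned rather than scripted.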

Real-time search improves results

The team cites web search integration as GeoVista's key advantage over existing methods. While models like Mini-o3 or ByteDance's DeepEyes focus on image manipulation, GeoVista actively mines external data. The paper does not specify which search provider the system uses.

Figure: Collage of global GeoBench images alongside a flowchart of the GeoVista Agentic Pipeline with user query, policy model, thought-action-observation loop, and zoom and web search tools.
GeoVista iteratively zooms in on image sections and queries online sources until it pinpoints the location. | Image: Wang et al.

The team built GeoVista on top of Qwen2.5-VL-7B-Instruct using a two-phase process. First, a supervised learning phase taught the model basic reasoning and tool usage via 2,000 curated examples. Commercial AI models generated the example tool calls and justifications, which the team assembled into multi-level thought processes.


In the second phase, reinforcement learning refined these skills using 12,000 examples. A custom reward system prioritizes geographic precision: correct answers at the city level earn higher rewards than those at the province or country level.
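A tiered reward of this kind could look roughly like the following sketch. The article only states the ordering (city above province above country); the numeric values, the `Location` type, and the exact string matching are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    country: str
    province: str
    city: str


# Hypothetical reward values; only their ordering reflects the article.
REWARDS = {"country": 0.2, "province": 0.5, "city": 1.0}


def _norm(name: str) -> str:
    # Case-insensitive comparison with surrounding whitespace ignored.
    return name.strip().casefold()


def tiered_reward(pred: Location, truth: Location) -> float:
    """Return the reward for the finest level at which the prediction is correct."""
    if _norm(pred.country) != _norm(truth.country):
        return 0.0
    if _norm(pred.province) != _norm(truth.province):
        return REWARDS["country"]
    if _norm(pred.city) != _norm(truth.city):
        return REWARDS["province"]
    return REWARDS["city"]


truth = Location("Italy", "Tuscany", "Florence")
print(tiered_reward(Location("Italy", "Tuscany", "Florence"), truth))  # city level
print(tiered_reward(Location("Italy", "Tuscany", "Siena"), truth))     # province level
print(tiered_reward(Location("Italy", "Lazio", "Rome"), truth))        # country level
print(tiered_reward(Location("Spain", "Madrid", "Madrid"), truth))     # wrong country
```

The levels are checked from coarse to fine, so a matching city name in the wrong country earns nothing.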

Open source catches up to proprietary tech

On the team's custom GeoBench dataset, GeoVista achieved 92.64 percent accuracy at the country level, 79.60 percent at the province level, and 72.68 percent at the city level. The model works best on panoramas (79.49 percent city accuracy) and standard photos (72.27 percent), while satellite images remain the hardest task at 44.92 percent.

Table: GeoBench accuracy of closed-source and open-source models at the country, province, city, and data type levels.
GeoVista-7B outperforms other open-source models on GeoBench and nears the accuracy of top proprietary models at the city level, especially with panoramic images. | Image: Wang et al.

By comparison, Gemini 2.5 Pro hit 78.98 percent at the city level, GPT-5 reached 67.11 percent, and Gemini 2.5 Flash scored 73.29 percent. Other open-source models lagged significantly, with Mini-o3-7B reaching only 11.3 percent. The recently announced Gemini 3 could change these rankings in future tests.

For distance measurements, 52.83 percent of GeoVista's predictions landed within 3 kilometers (1.86 miles) of the actual location, with a median deviation of 2.35 kilometers (1.46 miles). Gemini 2.5 Pro placed 64.45 percent of its predictions within that radius, with a median deviation of 800 meters (0.5 miles), while GPT-5 managed 55.12 percent with a median deviation of 1.86 kilometers (1.16 miles).
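Both figures are simple summaries of per-image distance errors. The sketch below shows how they could be computed; the function name `distance_summary` and the sample errors are made up for illustration, and treating exactly 3 kilometers as "within" the threshold is an assumption.

```python
from statistics import median


def distance_summary(errors_km, threshold_km=3.0):
    """Share of predictions within the threshold, plus the median error."""
    if not errors_km:
        raise ValueError("need at least one distance error")
    # Assumption: an error exactly equal to the threshold counts as a hit.
    within = sum(1 for e in errors_km if e <= threshold_km)
    return {
        "share_within": within / len(errors_km),
        "median_km": median(errors_km),
    }


# Invented per-image errors in kilometers, not benchmark data.
sample_errors = [0.4, 1.2, 2.35, 5.0, 40.0]
summary = distance_summary(sample_errors)
print(summary)
```

The median is less sensitive than the mean to a few predictions that land on the wrong continent, which is presumably why it is reported alongside the threshold share.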

Ablation tests confirmed the need for both training phases. Without the initial supervised learning, performance dropped sharply as the model generated overly short answers and failed to use tools. Skipping reinforcement learning caused similar dips. The tiered reward system proved essential for leveraging multi-level geographic data.


Notably, incorrect tool usage dropped during reinforcement learning even though the team didn't directly optimize for it. Performance also scaled with data volume: tests with 1,500, 3,000, 6,000, and 12,000 examples showed continuous improvement.

New benchmark filters out easy targets

Along with the model, the researchers released GeoBench, a dataset containing 1,142 high-resolution images from 66 countries and 108 cities. The set includes 512 standard photos, 512 panoramas, and 108 satellite images, all with a resolution of at least one million pixels.

Figure: Schematic diagram of the GeoBench evaluation: geocoding of predicted addresses, calculation of the Haversine distance, and multi-stage verification of visual clues.
The evaluation verifies country, province, and city names before automatically geocoding the data to compare it against reference coordinates. | Image: Wang et al.

Stricter filtering sets GeoBench apart from existing datasets like OpenStreetView-5M or GeoComp. The team removed non-localizable images, such as food close-ups or generic landscapes, as well as easily recognizable landmarks, arguing that internet images vary widely in how easily they can be located.

The benchmark evaluates performance in two ways: a step-by-step accuracy check at the country, province, and city levels, and a precise distance measurement calculated by converting text addresses into coordinates.
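Once a predicted address has been turned into coordinates, the distance to the reference point can be computed with the Haversine formula, which the benchmark's evaluation diagram names. The following is a minimal sketch of that formula, not the benchmark's own code: the geocoding step is left out, the coordinates are assumed to be in decimal degrees, and the mean Earth radius constant is a common choice rather than a value from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Example: approximate city-center coordinates for Paris and London.
paris = (48.8566, 2.3522)
london = (51.5074, -0.1278)
print(f"{haversine_km(*paris, *london):.1f} km")
```

The formula treats the Earth as a sphere, which is accurate enough for kilometer-scale comparisons like the 3-kilometer threshold reported above.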


The model weights, code, and benchmark are available on the project page. While the paper does not address potential misuse, the implications are clear: anyone posting public photos should assume AI models can now determine their location with high precision.

Summary
  • GeoVista is an open-source AI model from China that can analyze images and use web searches to figure out where a photo was taken.
  • The model combines image analysis, targeted zooming, and retrieving information from the internet, learning to use these techniques on its own.
  • A new test dataset with realistic scenarios shows that GeoVista is among the best open-source tools for pinpointing the locations of images.
Jonathan writes for THE DECODER about how AI tools can improve both work and creative projects.