
Researchers have created a new universal optical character recognition (OCR) model called GOT (General OCR Theory). Their paper introduces the concept of OCR 2.0, which aims to combine the strengths of traditional OCR systems and large language models.


According to the researchers, an OCR 2.0 model uses a unified end-to-end architecture and requires fewer resources than large language models, while being versatile enough to recognize more than just plain text.

GOT's architecture consists of an image encoder with approximately 80 million parameters and a language decoder with 500 million parameters. The encoder compresses 1,024 x 1,024 pixel images into a short sequence of visual tokens, which the decoder then converts into output text of up to 8,000 tokens.
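To make those proportions concrete, here is a minimal sketch of that encoder-connector-decoder shape in PyTorch. The layer sizes, the 64-pixel patch size, and the resulting 256 visual tokens are illustrative assumptions for this sketch, not the authors' implementation, which pairs a ViT-style encoder with a language decoder via a linear connector layer.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the encoder -> linear connector -> decoder
# pipeline described above. All dimensions are assumptions chosen for
# demonstration; the real GOT model uses a ViT-style encoder and the
# Qwen-0.5B language decoder.
class TinyOCR20Sketch(nn.Module):
    def __init__(self, vocab_size=32000, enc_dim=768, dec_dim=1024):
        super().__init__()
        # "Encoder": patchify a 1024x1024 RGB image down to a short
        # sequence of visual tokens (here 16 x 16 = 256 tokens).
        self.patchify = nn.Conv2d(3, enc_dim, kernel_size=64, stride=64)
        # Linear connector mapping visual tokens into the decoder's space.
        self.connector = nn.Linear(enc_dim, dec_dim)
        # Stand-in for the language decoder that autoregressively emits text.
        self.decoder = nn.TransformerDecoderLayer(
            d_model=dec_dim, nhead=8, batch_first=True
        )
        self.lm_head = nn.Linear(dec_dim, vocab_size)

    def forward(self, image, text_embeddings):
        tokens = self.patchify(image)               # (B, enc_dim, 16, 16)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, 256, enc_dim)
        memory = self.connector(tokens)             # (B, 256, dec_dim)
        hidden = self.decoder(text_embeddings, memory)
        return self.lm_head(hidden)                 # next-token logits

model = TinyOCR20Sketch()
image = torch.randn(1, 3, 1024, 1024)
prompt = torch.randn(1, 8, 1024)  # embedded prompt tokens (illustrative)
print(model(image, prompt).shape)  # torch.Size([1, 8, 32000])
```

The point of the compression step is efficiency: a few hundred visual tokens are far cheaper for the decoder to attend over than a raw pixel grid, which is part of why the model stays small compared to general-purpose multimodal language models.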

'OCR 2.0' unlocks automated processing of complex visual data in science, music, and analytics

The new model can recognize and convert various types of visual information into editable text. These include scene text and document text in English and Chinese, mathematical and chemical formulas, musical notes, simple geometric shapes, and diagrams with their components.

Flow diagram of the three-stage GOT architecture: vision encoder, linear connector layer, and language decoder. The researchers call this combination of traditional OCR strengths and large language model techniques "OCR 2.0". | Image: Wei et al.

To optimize training, the researchers first trained only the encoder on text recognition tasks. They then added Alibaba's Qwen-0.5B as a decoder and fine-tuned the entire model with diverse, synthetic data. The team used rendering tools such as LaTeX, Mathpix-markdown-it, TikZ, Verovio, Matplotlib, and Pyecharts to generate millions of image-text pairs for training.
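As a hedged illustration of that synthetic-data idea, a chart-text training pair could be generated with Matplotlib along these lines. The chart style, label format, and JSON ground truth are assumptions made for this sketch; the paper's actual data engines and label formats may differ.

```python
import json
import random
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical synthetic pair generation: render a random bar chart and
# keep the underlying data as the ground-truth text the model must emit.
def make_pair(index: int) -> dict:
    labels = [f"cat_{i}" for i in range(random.randint(3, 6))]
    values = [round(random.uniform(0, 100), 1) for _ in labels]

    fig, ax = plt.subplots(figsize=(4, 3), dpi=128)
    ax.bar(labels, values)
    ax.set_title(f"Sample chart {index}")
    image_path = f"chart_{index}.png"
    fig.savefig(image_path)
    plt.close(fig)

    # Ground truth: structured text the OCR model should reproduce.
    target = json.dumps(dict(zip(labels, values)))
    return {"image": image_path, "text": target}

pairs = [make_pair(i) for i in range(3)]
print(pairs[0])
```

Because both the image and its label come from the same source data, such pipelines can produce millions of perfectly aligned training examples without manual annotation.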

Example of multi-page OCR on Chinese book pages: GOT extracts formatted text and headings across multiple pages and converts them into a structured digital form. | Image: Wei et al.

The researchers report that GOT's modular design and synthetic training data make it easy to extend: new capabilities can be added without retraining the entire model, enabling efficient updates and improvements to the system over time, they say.

Overview of the synthetic data workflow, from text sources through rendering tools to visual results: input formats such as .tex files or SMILES codes are transformed into images of mathematical formulas, chemical structures, geometric figures, and diagrams by specialized rendering tools. | Image: Wei et al.
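As a small example of this text-source-to-image workflow, the snippet below renders a formula string into a training image using Matplotlib's built-in mathtext rather than a full LaTeX toolchain; the formula and file name are arbitrary placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Illustrative "text source -> rendering tool -> image" step for math:
# the source string doubles as the OCR ground truth for the rendered image.
formula = r"$e^{i\pi} + 1 = 0$"

fig = plt.figure(figsize=(3, 1), dpi=200)
fig.text(0.5, 0.5, formula, ha="center", va="center", fontsize=20)
fig.savefig("formula.png", bbox_inches="tight")
plt.close(fig)

# The pair (formula source string, "formula.png") would then serve as one
# training example: image in, source text out.
```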

In experiments, GOT performed well across various OCR tasks. It achieved top scores in document and scene text recognition, even outperforming specialized models and large language models in diagram recognition.

Comparison of OCR inputs and outputs: GOT converts chemical structural formulas, musical notation, and bar charts into machine-readable representations, opening up new possibilities for automated processing and analysis in science, music, and data analytics. | Image: Wei et al.

The researchers have made a free demo and the code available on Hugging Face for others to use and build upon.
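At the time of writing, the model's Hugging Face README documents usage roughly along these lines; the model ID and the repository-defined chat() method below are taken from that README as an assumption and may change over time.

```python
from transformers import AutoModel, AutoTokenizer

# Model ID and the custom chat() method are defined by the project's
# Hugging Face repository (loaded via trust_remote_code); both are
# assumptions based on the README at the time of writing.
model_id = "ucaslcl/GOT-OCR2_0"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="cuda",
    use_safetensors=True,
    pad_token_id=tokenizer.eos_token_id,
)
model = model.eval()

# Plain-text OCR on a local image file.
result = model.chat(tokenizer, "page.png", ocr_type="ocr")
print(result)
```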

Summary
  • Researchers have developed GOT (General OCR Theory), a new universal optical character recognition model that combines the strengths of traditional OCR systems with those of large language models. They call this approach "OCR 2.0".
  • GOT consists of an efficient image encoder with 80 million parameters and a versatile language decoder with 500 million parameters, enabling it to recognize and convert a wide variety of visual information, such as text, formulas, musical notes, and diagrams, into editable text.
  • Thanks to its modular structure and training on synthetic data, GOT can be flexibly expanded to include new capabilities, achieving top results in various OCR tasks and even outperforming specialized models in some cases.