
Researchers have created a new universal optical character recognition (OCR) model called GOT (General OCR Theory). Their paper introduces the concept of OCR 2.0, which aims to combine the strengths of traditional OCR systems and large language models.


According to the researchers, an OCR 2.0 model uses a unified end-to-end architecture and requires fewer resources than large language models, while being versatile enough to recognize more than just plain text.

GOT's architecture consists of an image encoder with approximately 80 million parameters and a language decoder with 500 million parameters. The encoder compresses 1,024 x 1,024 pixel images into a short sequence of visual tokens, which the decoder then converts into output text of up to 8,000 tokens.
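To make those proportions concrete, here is a minimal sketch of that encoder-connector-decoder shape in PyTorch. The layer sizes, the 64-pixel patch size, and the resulting 256 visual tokens are illustrative assumptions for this sketch, not the authors' implementation, which pairs a ViT-style encoder with a language decoder via a linear connector layer.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the encoder -> linear connector -> decoder
# pipeline described above. All dimensions are assumptions chosen for
# demonstration; the real GOT model uses a ViT-style encoder and the
# Qwen-0.5B language decoder.
class TinyOCR20Sketch(nn.Module):
    def __init__(self, vocab_size=32000, enc_dim=768, dec_dim=1024):
        super().__init__()
        # "Encoder": patchify a 1024x1024 RGB image down to a short
        # sequence of visual tokens (here 16 x 16 = 256 tokens).
        self.patchify = nn.Conv2d(3, enc_dim, kernel_size=64, stride=64)
        # Linear connector mapping visual tokens into the decoder's space.
        self.connector = nn.Linear(enc_dim, dec_dim)
        # Stand-in for the language decoder that autoregressively emits text.
        self.decoder = nn.TransformerDecoderLayer(
            d_model=dec_dim, nhead=8, batch_first=True
        )
        self.lm_head = nn.Linear(dec_dim, vocab_size)

    def forward(self, image, text_embeddings):
        tokens = self.patchify(image)               # (B, enc_dim, 16, 16)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, 256, enc_dim)
        memory = self.connector(tokens)             # (B, 256, dec_dim)
        hidden = self.decoder(text_embeddings, memory)
        return self.lm_head(hidden)                 # next-token logits

model = TinyOCR20Sketch()
image = torch.randn(1, 3, 1024, 1024)
prompt = torch.randn(1, 8, 1024)  # embedded prompt tokens (illustrative)
print(model(image, prompt).shape)  # torch.Size([1, 8, 32000])
```

The point of the compression step is efficiency: a few hundred visual tokens are far cheaper for the decoder to attend over than a raw pixel grid, which is part of why the model stays small compared to general-purpose multimodal language models.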

'OCR 2.0' unlocks automated processing of complex visual data in science, music, and analytics

The new model can recognize and convert various types of visual information into editable text. These include scene text and document text in English and Chinese, mathematical and chemical formulas, musical notes, simple geometric shapes, and diagrams with their components.

Flow diagram of the three-stage GOT architecture: vision encoder, linear connector layer, and language decoder. The researchers call this combination of traditional OCR strengths and large language model techniques "OCR 2.0". | Image: Wei et al.

To optimize training, the researchers first trained only the encoder on text recognition tasks. They then added Alibaba's Qwen-0.5B as a decoder and fine-tuned the entire model with diverse, synthetic data. The team used rendering tools such as LaTeX, Mathpix-markdown-it, TikZ, Verovio, Matplotlib, and Pyecharts to generate millions of image-text pairs for training.
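As a hedged illustration of that synthetic-data idea, a chart-text training pair could be generated with Matplotlib along these lines. The chart style, label format, and JSON ground truth are assumptions made for this sketch; the paper's actual data engines and label formats may differ.

```python
import json
import random
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical synthetic pair generation: render a random bar chart and
# keep the underlying data as the ground-truth text the model must emit.
def make_pair(index: int) -> dict:
    labels = [f"cat_{i}" for i in range(random.randint(3, 6))]
    values = [round(random.uniform(0, 100), 1) for _ in labels]

    fig, ax = plt.subplots(figsize=(4, 3), dpi=128)
    ax.bar(labels, values)
    ax.set_title(f"Sample chart {index}")
    image_path = f"chart_{index}.png"
    fig.savefig(image_path)
    plt.close(fig)

    # Ground truth: structured text the OCR model should reproduce.
    target = json.dumps(dict(zip(labels, values)))
    return {"image": image_path, "text": target}

pairs = [make_pair(i) for i in range(3)]
print(pairs[0])
```

Because both the image and its label come from the same source data, such pipelines can produce millions of perfectly aligned training examples without manual annotation.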

Example of multi-page OCR on Chinese book pages: GOT extracts formatted text and headings across multiple pages and converts them into a structured digital form. | Image: Wei et al.

The researchers report that GOT's modular design and synthetic training data make it easy to extend: new capabilities can be added without retraining the entire model, enabling efficient updates and improvements to the system over time, they say.

Overview of the synthetic data workflow, from text sources through rendering tools to visual results: input formats such as .tex files or SMILES codes are transformed into images of mathematical formulas, chemical structures, geometric figures, and diagrams by specialized rendering tools. | Image: Wei et al.
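As a small example of this text-source-to-image workflow, the snippet below renders a formula string into a training image using Matplotlib's built-in mathtext rather than a full LaTeX toolchain; the formula and file name are arbitrary placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Illustrative "text source -> rendering tool -> image" step for math:
# the source string doubles as the OCR ground truth for the rendered image.
formula = r"$e^{i\pi} + 1 = 0$"

fig = plt.figure(figsize=(3, 1), dpi=200)
fig.text(0.5, 0.5, formula, ha="center", va="center", fontsize=20)
fig.savefig("formula.png", bbox_inches="tight")
plt.close(fig)

# The pair (formula source string, "formula.png") would then serve as one
# training example: image in, source text out.
```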

In experiments, GOT performed well across various OCR tasks. It achieved top scores in document and scene text recognition, even outperforming specialized models and large language models in diagram recognition.

Comparison of OCR inputs and outputs: GOT converts chemical structural formulas, musical notation, and bar charts into machine-readable representations, opening up new possibilities for automated processing and analysis in science, music, and data analytics. | Image: Wei et al.

The researchers have made a free demo and the code available on Hugging Face for others to use and build upon.
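At the time of writing, the model's Hugging Face README documents usage roughly along these lines; the model ID and the repository-defined chat() method below are taken from that README as an assumption and may change over time.

```python
from transformers import AutoModel, AutoTokenizer

# Model ID and the custom chat() method are defined by the project's
# Hugging Face repository (loaded via trust_remote_code); both are
# assumptions based on the README at the time of writing.
model_id = "ucaslcl/GOT-OCR2_0"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="cuda",
    use_safetensors=True,
    pad_token_id=tokenizer.eos_token_id,
)
model = model.eval()

# Plain-text OCR on a local image file.
result = model.chat(tokenizer, "page.png", ocr_type="ocr")
print(result)
```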

Summary
  • Researchers have developed GOT (General OCR Theory), a new universal optical character recognition model that combines the strengths of traditional OCR systems with those of large language models. They call this approach "OCR 2.0".
  • GOT consists of an efficient image encoder with 80 million parameters and a versatile language decoder with 500 million parameters, enabling it to recognize and convert a wide variety of visual information, such as text, formulas, musical notes, and diagrams, into editable text.
  • Thanks to its modular structure and training on synthetic data, GOT can be flexibly expanded to include new capabilities, achieving top results in various OCR tasks and even outperforming specialized models in some cases.