
Nougat: Meta's latest AI model makes scientific PDFs machine-readable

Maximilian Schreiner


Meta's Nougat is an AI text recognition model that can reliably convert scientific PDFs to text.

Researchers at Meta have unveiled Nougat (Neural Optical Understanding for Academic Documents), an AI model that converts PDF images of scientific articles into structured, machine-readable text. Nougat aims to bridge the gap between human-readable PDF documents and machine-readable text, improving access to scientific knowledge.

Based on a variant of the Vision Transformer architecture for image analysis, Nougat performs optical character recognition (OCR) tailored to scientific documents. Unlike traditional OCR engines, which work line by line, Nougat processes the entire page at once. According to the team, this makes it easier to handle features such as superscripts and subscripts in mathematical formulas, which have often been transcribed incorrectly in the past.
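To illustrate what processing the entire page means in practice: Vision Transformer-style models typically begin by cutting the whole input image into a grid of fixed-size patches, which are then analyzed together rather than line by line. The following is a minimal sketch of that first step only, assuming a page represented as a nested list of pixel values. The function name split_into_patches and the toy image are hypothetical and are not taken from Nougat's code.

```python
def split_into_patches(image: list[list[int]], patch_size: int) -> list[list[list[int]]]:
    """Cut a 2D image (list of pixel rows) into non-overlapping square patches.

    Patches are returned in reading order: left to right, then top to bottom.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Take patch_size rows, and from each row patch_size columns.
            patch = [row[left:left + patch_size] for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


# Toy 4x4 "page" with pixel values 0..15, cut into four 2x2 patches.
toy_page = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_patches = split_into_patches(toy_page, 2)
```

A real model would turn each patch into a vector and let the network relate patches across the whole page, which is what allows a subscript to be read in the context of the symbol it belongs to.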

For training, the team used a dataset of scientific article PDFs from sources such as arXiv and PubMed Central, paired with the authors' corresponding LaTeX source code. The dataset consists of more than 8 million pages.

Meta's Nougat significantly outperforms existing alternatives

In tests, Nougat achieved high accuracy in extracting text, formulas and tables from pages of scientific articles. For continuous text, it achieved a BLEU score of over 91% and an accuracy of over 96%. Performance for formulas and tables was lower at just over 75%, but still significantly more reliable than alternatives such as GROBID, whose accuracy for mathematical formulas is just under 11%.
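The article does not spell out how these scores are computed. One common way to score a transcription against a reference is token overlap, sketched below under stated assumptions: whitespace tokenization, a bag-of-words comparison, and a hypothetical function name token_prf. This is an illustration of the idea, not Meta's evaluation code.

```python
from collections import Counter


def token_prf(predicted: str, reference: str) -> tuple[float, float, float]:
    """Return bag-of-words (precision, recall, F1) of predicted vs. reference text."""
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0, 0.0

    # Multiset intersection: each token counts at most as often as it
    # appears in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0

    precision = overlap / len(pred_tokens)  # share of predicted tokens that are correct
    recall = overlap / len(ref_tokens)      # share of reference tokens that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Three of four tokens match in each direction.
scores = token_prf("E = m c^2", "E = m c^3")
```

A measure like this ignores word order, which is why order-sensitive metrics such as BLEU are usually reported alongside it.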

According to Meta, Nougat is a promising way to improve access to scientific knowledge: converting PDF research papers into structured, machine-readable text could make millions of scientific articles more accessible.

However, challenges remain in managing cross-document consistency and avoiding repetitive text loops during generation, the team says.
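To make the repetition problem concrete, here is a minimal sketch of how a text generation loop could be flagged after the fact: it checks whether the output ends in the same short token sequence repeated several times. This is a simple heuristic chosen for illustration and not the approach used by the Nougat team. The function name ends_in_loop and its thresholds are assumptions.

```python
from typing import Optional, Sequence


def ends_in_loop(tokens: Sequence[str], max_period: int = 10, min_repeats: int = 3) -> Optional[int]:
    """Return the length of the token sequence repeated at the end of `tokens`.

    Returns None if the tail does not consist of one sequence of at most
    `max_period` tokens repeated at least `min_repeats` times.
    """
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if len(tokens) < needed:
            break  # longer periods would need even more tokens
        tail = tokens[-needed:]
        unit = tail[:period]
        # The tail is a loop if every token equals the unit token at the same offset.
        if all(tail[i] == unit[i % period] for i in range(needed)):
            return period
    return None


looping = "as shown in the table the table the table".split()
clean = "as shown in the table below".split()
loop_period = ends_in_loop(looping)
clean_period = ends_in_loop(clean)
```

A decoder could call a check like this every few steps and stop or re-sample once a loop is detected. The thresholds would need tuning, since legitimate text such as table rows can also repeat.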

The code and models are available on GitHub and are intended to accelerate future work in scientific document processing. More information and examples are available on the Nougat project page.
