A new study once again confirms that the way texts are broken down into tokens has a significant influence on the ability of AI language models to solve simple counting tasks.
According to a study by researchers at the University of British Columbia and Yale University, tokenization is a crucial factor in how well large language models (LLMs) can solve counting tasks. The researchers investigated how different ways of breaking text into smaller units (tokens) affect the counting abilities of AI systems, which already struggle with counting because of their architecture. This matters because counting is a fundamental building block of many more complex reasoning tasks.
"Chain-of-Thought" can mitigate some architectural limitations
"Chain-of-Thought" (CoT) can mitigate some architectural limitations of the transformers underlying today's large language models. In this method, intermediate steps of the thought process are output in natural language to enable deeper reasoning. OpenAI's new o1 model, for example, relies heavily on this method — although not just in prompting.
However, LLMs typically use byte-level tokenizers such as byte pair encoding (BPE), which merge multiple characters into a single token. This fundamentally changes how the model processes information.
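To see what that looks like in practice, the snippet below uses the open-source tiktoken library with the cl100k_base vocabulary (the GPT-4 tokenizer; GPT-4o and GPT-4o mini use a newer vocabulary, so the exact splits are only illustrative) to show how a whole word collapses into a handful of multi-character tokens.

```python
import tiktoken

# cl100k_base is GPT-4's vocabulary; GPT-4o uses a newer one (o200k_base),
# so the exact splits below are only illustrative.
enc = tiktoken.get_encoding("cl100k_base")

word = "strawberry"
token_ids = enc.encode(word)
pieces = [enc.decode_single_token_bytes(t) for t in token_ids]

print(token_ids)  # far fewer token ids than the 10 characters in the word
print(pieces)     # each piece typically spans several characters,
                  # so no token lines up with a single letter
```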
The researchers therefore investigated the influence of tokenization on the counting abilities of LLMs using a black-box approach. Specifically, they manipulated the BPE tokenizers with carefully designed string formats, forcing them to segment the same counting task in different ways.
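The paper's exact string formats are not reproduced here, but the idea can be sketched with the same tiktoken setup: presenting the same letter sequence concatenated, space-separated, or comma-separated changes how the tokenizer segments it, and therefore what the model actually "sees" when counting.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # vocabulary differs by model; splits are illustrative

# Three hypothetical formats for the same counting item
# (the formats used in the paper may differ).
formats = {
    "concatenated":    "abbababbaabb",
    "space-separated": "a b b a b a b b a a b b",
    "comma-separated": "a,b,b,a,b,a,b,b,a,a,b,b",
}

for name, text in formats.items():
    token_ids = enc.encode(text)
    pieces = [enc.decode_single_token_bytes(t) for t in token_ids]
    print(f"{name:15s} -> {len(token_ids):2d} tokens: {pieces}")
```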
GPT-4o and Claude fail at simple counting tasks
Experiments with GPT-4o mini and Claude 3.5 Sonnet showed that counting accuracy varied by up to 80 percent depending on the tokenization method. When consecutive letters were merged into a single token, performance was even worse than when no CoT was used at all.
However, when the elements to be counted were separated by delimiters such as spaces or commas, performance improved by 13 to 40 percent. The models achieved the best results when each letter formed its own token.
The researchers also found that the effects of tokenization differ from model to model. The models were more sensitive to certain tokens than to others, even when the type of task did not change.
Counting rare letters such as "z" worked better in the experiments than counting common ones such as "e". The researchers suspect that rare tokens carry less embedded information, which causes less distraction in the attention computation during counting.
Eliminating tokenizers could be a solution
Changing how text is tokenized could be one way to make LLMs better at counting, but getting rid of tokenizers altogether could also be a solution. Meta, for example, has introduced MegaByte, a method that does not use tokenizers but instead processes text, images, and audio at the byte level. The company plans to extend the technology to larger models. So far, tests on smaller models have shown promising results. OpenAI co-founder Andrej Karpathy endorsed the approach on Twitter: "Everyone should hope we can throw away tokenization in LLMs."
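What "processing at the byte level" means can be illustrated without any model at all; the sketch below only shows the byte-level view of a string and is not Meta's MegaByte implementation. The obvious cost is much longer input sequences, which is the main problem byte-level approaches such as MegaByte have to deal with.

```python
text = "strawberry"

# Byte-level view: one unit per UTF-8 byte, with no learned merges.
# For plain ASCII text this means exactly one unit per character.
byte_units = list(text.encode("utf-8"))

print(len(byte_units))  # 10 units for the 10 characters
print(byte_units)       # raw byte values a byte-level model would consume directly
```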