How useful are million-token context windows, really? In a recent interview, Nikolay Savinov from DeepMind explained that when a model is fed many tokens, it has to spread its attention across all of them: paying more attention to one part of the context automatically leaves less for the rest. For best results, Savinov recommends including only the content that is genuinely relevant to the task.
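A toy sketch makes the trade-off concrete. Attention weights come out of a softmax, so they always sum to 1; the numbers below are made up, but they show how the weight on one relevant token shrinks as more and more filler tokens compete for the same fixed budget:

```python
import numpy as np

def softmax(scores):
    """Standard softmax: weights are positive and sum to 1."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# One "relevant" token competes with a growing amount of filler.
relevant_score = 5.0
for n_filler in (10, 1_000, 100_000):
    scores = np.concatenate(([relevant_score], np.zeros(n_filler)))
    weight_on_relevant = softmax(scores)[0]
    print(f"{n_filler:>7} filler tokens -> weight on relevant token: {weight_on_relevant:.4f}")
```

With 10 filler tokens the relevant token still receives most of the weight; with 100,000 it gets a tiny fraction, even though its score never changed.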
I'm just talking about the current reality: if you want to make good use of it right now, then let's be realistic.
Nikolay Savinov
Research on long-context performance backs this up: models often struggle to pick out information buried deep in very long inputs. In practice, that can mean cutting unnecessary pages out of a PDF before sending it to a model, even when the system could technically process the entire document in one go.
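As a rough sketch of what that trimming could look like (the file name, page range, and question are invented for illustration), you might extract only the relevant pages with the pypdf library and build the prompt from those alone:

```python
from pypdf import PdfReader  # pip install pypdf

# Hypothetical example: only pages 12-18 of a long report matter for the question.
RELEVANT_PAGES = range(12, 19)  # assumption: you already know which pages are relevant

reader = PdfReader("quarterly_report.pdf")  # hypothetical file name
relevant_text = "\n\n".join(
    reader.pages[i].extract_text() or "" for i in RELEVANT_PAGES
)

prompt = (
    "Answer using only the excerpt below.\n\n"
    f"{relevant_text}\n\n"
    "Question: What were the main cost drivers this quarter?"
)
# `prompt` is what gets sent to the model, instead of the full document.
```

How you identify the relevant pages is a separate problem (keyword search, a table of contents, or a cheap retrieval pass); the point is simply that the model only ever sees the part of the document the task actually needs.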