
A research team in China has built a new way for large language models to talk to each other using their internal memory, instead of relying on text. This cache-to-cache (C2C) method lets models share information faster and more accurately.


Right now, when different language models work together, they have to send messages back and forth as text. But this approach has three big problems, according to researchers from several Chinese universities: text is a bottleneck, natural language can be ambiguous, and generating each token takes time.

One example shows where things break down. If a programmer LLM tells a writer LLM to "write content to the section wrapper," the writer may not grasp how a tag like "<p>" fits into that structure and can put the text in the wrong spot.

When LLMs pass instructions as plain text, the writer can misplace content due to structural confusion. With C2C, models share the meaning directly through the KV cache, so the writer knows exactly where things belong. | Image: Fu et al.

Sharing memory instead of text

The team's solution is C2C, which lets models exchange meaning by sending their internal memory—the KV cache—instead of final text.


The KV cache is like a model's internal scratchpad. As the model processes text, it stores mathematical snapshots of each word and phrase. These snapshots hold much richer information than the final text output. Text only gives the end result, but the KV cache captures all the steps and context along the way.
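
To make that concrete, here is a minimal sketch, with made-up dimensions rather than any real model's, of what a KV cache holds in a typical transformer: one pair of key and value tensors per layer, covering every token processed so far.

```python
# Minimal sketch of a transformer KV cache (dimensions are illustrative only).
import torch

batch, num_heads, seq_len, head_dim, num_layers = 1, 8, 12, 64, 24

kv_cache = [
    (
        torch.randn(batch, num_heads, seq_len, head_dim),  # keys for this layer
        torch.randn(batch, num_heads, seq_len, head_dim),  # values for this layer
    )
    for _ in range(num_layers)
]

# The text output is just one token per step, but the cache keeps a per-layer,
# per-head snapshot of every token seen so far.
cached_numbers = sum(k.numel() + v.numel() for k, v in kv_cache)
print(f"{num_layers} layers x {seq_len} tokens -> {cached_numbers:,} cached values")
```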

With C2C, a programming model can pass its internal understanding of something like an HTML structure straight to a writer model. The writer then knows exactly where to put everything, without guessing.

The C2C system works by projecting the source model's KV cache into the target model and merging their memory through a neural network called Cache Fuser. The Cache Fuser has three parts: a projection module to line up different cache formats, a dynamic weighting system to decide how much information to use, and an adaptive gate to pick which model layers get the benefit.
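
The paper's exact architecture isn't reproduced here, but a rough PyTorch sketch of those three parts could look like this (the layer names, dimensions, and residual-style mixing are assumptions for illustration, not the authors' code):

```python
import torch
import torch.nn as nn

class CacheFuserSketch(nn.Module):
    """Illustrative C2C-style fuser: project, weight, gate. Not the paper's implementation."""

    def __init__(self, src_dim: int, tgt_dim: int, num_layers: int):
        super().__init__()
        self.project = nn.Linear(src_dim, tgt_dim)               # line up different cache formats
        self.weighting = nn.Linear(2 * tgt_dim, 1)                # decide how much source info to mix in
        self.layer_gate = nn.Parameter(torch.zeros(num_layers))   # pick which layers get the benefit

    def forward(self, src_kv: torch.Tensor, tgt_kv: torch.Tensor, layer: int) -> torch.Tensor:
        # src_kv: sharer cache entries (..., src_dim); tgt_kv: receiver entries (..., tgt_dim)
        projected = self.project(src_kv)
        weight = torch.sigmoid(self.weighting(torch.cat([projected, tgt_kv], dim=-1)))
        gate = torch.sigmoid(self.layer_gate[layer])
        return tgt_kv + gate * weight * projected  # fused entry handed to the receiver
```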

Cache Fuser lines up the internal memory from both models, weighs how important each piece is, and filters what gets transferred, so only useful knowledge is shared. | Image: Fu et al.

Different models store their internal data in unique ways, so the researchers had to sync these representations step by step. They first align how words are broken down, then connect the different model layers.
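
The article doesn't give a formula for this, but the layer-matching step can be pictured as a simple index mapping. The helper below is purely hypothetical and only illustrates how receiver layers might be paired with sharer layers when the two models have different depths.

```python
# Hypothetical illustration of layer alignment between models of different depth:
# each receiver layer is paired with the proportionally nearest sharer layer.
def align_layers(num_sharer_layers: int, num_receiver_layers: int) -> list[int]:
    return [
        round(r * (num_sharer_layers - 1) / max(num_receiver_layers - 1, 1))
        for r in range(num_receiver_layers)
    ]

# A 24-layer receiver drawing on a 36-layer sharer, for example:
print(align_layers(36, 24))
```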

Tests show that enriching a model's KV cache with another model's memory boosts response quality, without making the cache bigger. The team also found KV caches could be converted across models, with each model putting its own spin on the same input.


Faster, better results

In benchmarks, C2C increased accuracy by 8.5 to 10.5 percent over single models and beat regular text-based communication by a further 3 to 5 percent. Response speed also roughly doubled compared to text exchange.

The team tested different model combos, including Qwen2.5, Qwen3, Llama 3.2, and Gemma 3, with sizes from 0.6 billion to 14 billion parameters. Bigger source models with more knowledge delivered even better results.

Technical checks confirmed that C2C increases the semantic richness of shared memory. After fusion, the information density went up, showing that extra knowledge really got transferred.

One big plus is efficiency. Only the C2C connection module needs training—the source and target models stay the same. This avoids the massive costs of retraining full models.
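
In code terms, the setup described above amounts to freezing both models and giving an optimizer only the fuser's parameters. The sketch below uses tiny placeholder modules instead of real LLMs:

```python
import torch
import torch.nn as nn

# Placeholder modules stand in for the sharer and receiver LLMs.
sharer = nn.Linear(896, 896)
receiver = nn.Linear(1024, 1024)

# Freeze both base models: no gradients, no retraining costs.
for p in list(sharer.parameters()) + list(receiver.parameters()):
    p.requires_grad = False

# Only the connection module (here a bare linear layer as a stand-in) is trained.
fuser = nn.Linear(896, 1024)
optimizer = torch.optim.AdamW(fuser.parameters(), lr=1e-4)

trainable = sum(p.numel() for p in fuser.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```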


The researchers say C2C could be used for privacy-sensitive teamwork between cloud and edge devices, paired with existing acceleration tricks, or as part of multimodal systems that mix language, images, and actions.

The team has open-sourced their code on GitHub and sees cache-to-cache as a practical alternative to text for building faster, more scalable AI systems.

Summary
  • A Chinese research team has introduced Cache-to-Cache (C2C), a technique that lets large language models communicate directly using their internal memory (KV cache) instead of only sending text to each other.
  • In experiments on four benchmarks, C2C achieved 8.5 to 10.5 percent higher accuracy and about twice the speed compared to using individual models or text-to-text exchange, with the biggest improvements coming from larger models.
  • Only the connection module between the models needs to be trained for C2C, not the language models themselves, which makes training much easier; the code is publicly available on GitHub.