CoCoMix (Tack et al., 2025)¹ by Meta has made conceptual learning, i.e., learning the concepts behind words instead of just predicting the next token, a reality, making such models remarkably steerable and interpretable.
But a core question remains: even a conceptually brilliant model can struggle with nuanced or factual recall challenges after training, during actual deployment. You could ask a seemingly simple question like, “Earlier in our 2-million-token conversation, where did we discuss Pinocchio’s famously growing nose?” No matter how conceptually capable the LLM is, it cannot answer this simple question if the answer lies outside its context window.
So the question becomes, can we equip these intelligent LLMs with an adaptable “memory” or performance boost precisely when it counts — during inference?
Transformers (Vaswani et al., 2017)² have become nothing short of ubiquitous in the modern AI landscape. Ever since their breakout success, they’ve been the go-to architecture across domains.
Back in 2020, the default response to any machine learning problem was often, “just throw attention at it” — and surprisingly, it worked, often outperforming state-of-the-art models. Vision tasks? Use transformers (Dosovitskiy et al., 2020)³. Time series forecasting? Transformers again (Zerveas et al., 2021)⁴. Natural language processing? Well, transformers practically defined it (Rogers et al., 2021)⁵.
But as our reliance on large models deepened and compute budgets expanded, even this “do it all” architecture began to show its limits — and so began the push to stretch its capabilities even further.
The bottleneck? Attention’s ‘everyone-talks-to-everyone’ approach. Brilliant but quadratically expensive: imagine a room of a million people, where each person must remember every conversation with everyone else. This restricts Transformers to a narrow “working memory”; they struggle with the “long-term recall” needed to understand vast documents, as early information simply fades away.
Beyond the context limits, vanilla transformers face another fundamental hurdle: a lack of adaptability after training. While they excel at applying their vast pre-trained knowledge to predict the next token, a process of sophisticated reasoning and prediction, this is not the same as true learning. Think of Google Maps: it finds the “shortest path” for you, but it has no idea construction started this morning and will happily route you straight into the barricades. A human guide, on the other hand, would have shown you an alternate alley route.
This inability to “learn on the fly” from the data they are currently processing represents a critical limitation for tasks requiring continuous adaptation or memory of novel experiences beyond the training set.
Instead of targeting just one limitation, the researchers took a broader perspective: how do intelligent systems, like the human brain, manage memory and adapt to new situations? It’s not about having one massive, ever-accessible memory. It’s a more flexible setup, where different components coordinate to handle different kinds of information and experiences.
The Titans architecture (Behrouz et al., 2024)⁶ embraces this: it is built not around a single, monolithic attention block but around a cooperative team of specialized memory systems, each playing a crucial role in understanding and responding to the task at hand.
So, how do these three memory systems, the short-term memory (STM), the persistent memory (PM), and the long-term memory module (LMM), truly work together? To get started, STM is essentially the standard Self-Attention calculation, a staple of vanilla transformers. Its “memory” is short-lived: the KV cache built from the current context, read through attention weights learned during training.
PM, on the other hand, is a set of learnable parameters that are prepended to the input sequence. They are learned during training, fixed afterwards, and act as the “Holy Grail” the model adheres to, no matter what, during inference.
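As a quick illustration, here is a minimal sketch of that “prepend learnable tokens” idea in PyTorch; the token count and model width are arbitrary placeholder values, not those used in the paper.

```python
import torch
import torch.nn as nn

class PersistentMemory(nn.Module):
    """Learnable, input-independent memory tokens prepended to every input chunk."""

    def __init__(self, num_mem_tokens: int = 16, d_model: int = 512):
        super().__init__()
        # Trained together with the rest of the model, then frozen at inference.
        self.tokens = nn.Parameter(torch.randn(num_mem_tokens, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> (batch, num_mem_tokens + seq_len, d_model)
        mem = self.tokens.unsqueeze(0).expand(x.shape[0], -1, -1)
        return torch.cat([mem, x], dim=1)

chunk = torch.randn(2, 128, 512)         # a batch of input chunks
print(PersistentMemory()(chunk).shape)   # torch.Size([2, 144, 512])
```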
Fairly easy to follow so far? Then let us dive into the truly exciting innovation, the part that, although implemented as a simple MLP network, can adapt during test time: the LMM module.
Wait a minute… parameter updates at test time? Isn’t that something we only do during training? Isn’t this basically cheating?
Are these the questions you thought of when you heard the term Test-time training? These are valid questions, but no, it is not cheating. Titans leverage principles from online learning and meta-learning to enable rapid, localized updates tailored specifically for memorization, not general task improvement. It doesn’t look at external labels during test-time to compute gradients and optimize parameters; instead, everything stays self-contained: the model adjusts internally, using only what it already knows and what it sees in the moment.
In human memory, routine and predictable events often fade, while unexpected or surprising moments tend to persist (Mandler, 2014)⁷. This is the core idea behind the implementation of dynamic test-time updates.
The LMM acts as an associative memory: it learns to connect “keys” (cues) to “values” (information). For every new piece of data xt (the input chunk in MAG and MAL; the STM/self-attention output in MAC), it checks how well its current parameters map the key derived from xt to the corresponding value, and the mismatch becomes a memorization loss:
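Concretely, the paper defines the keys and values as linear projections of xt and measures how well the current memory maps one onto the other:

\[
\ell(\mathcal{M}_{t-1};\, x_t) \;=\; \big\lVert\, \mathcal{M}_{t-1}(\mathbf{k}_t) - \mathbf{v}_t \,\big\rVert_2^2,
\qquad \mathbf{k}_t = x_t W_K, \quad \mathbf{v}_t = x_t W_V
\]

Here \( \mathcal{M}_{t-1} \) is the LMM with its current parameters \( \Theta_{\mathcal{M}_{t-1}} \), and \( W_K, W_V \) are projection matrices learned during training.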
To make the LMM learn from this loss, we incorporate the Surprise Signal, which measures how much the model was “surprised” at seeing the ground truth (vt). This “Surprise” is mathematically defined as the gradient of the loss function with respect to the LMM’s parameters.
A large gradient means xt is highly “surprising” or unexpected given the LMM’s current knowledge.
Basic Learning Step: The simplest way the LMM then learns is by adjusting its parameters slightly in the direction that would reduce this surprise (i.e., reduce the loss), much like a step in gradient descent:
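In the paper’s notation, with θt acting as a data-dependent learning rate, this step is:

\[
\Theta_{\mathcal{M}_t} \;=\; \Theta_{\mathcal{M}_{t-1}} \;-\; \theta_t\, \nabla \ell(\mathcal{M}_{t-1};\, x_t)
\]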
Reacting only to immediate “surprise” is not enough. A good memory needs to see trends and also know when to let go of old, irrelevant information.
Smart Learning Direction (ΔΘMt): First, the LMM calculates the best direction to adjust its parameters. This is not just based on the current surprise, but also on a “memory” of recent surprises.
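Concretely, ηt is a data-dependent gate deciding how much of the previous learning direction (the running surprise) to carry forward:

\[
\Delta\Theta_{\mathcal{M}_t} \;=\; \eta_t\, \Delta\Theta_{\mathcal{M}_{t-1}} \;-\; \theta_t\, \nabla \ell(\mathcal{M}_{t-1};\, x_t)
\]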
Final Parameter Update (ΘMt): The LMM then updates its actual parameters, mixing its old knowledge with this new learning direction, and crucially, allowing for “forgetting.”
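With at as the data-dependent forgetting gate (at close to 1 wipes old memory, close to 0 preserves it):

\[
\Theta_{\mathcal{M}_t} \;=\; (1 - a_t)\,\Theta_{\mathcal{M}_{t-1}} \;+\; \Delta\Theta_{\mathcal{M}_t}
\]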
In a Nutshell: The LMM looks at the current data’s “surprise” (∇Loss_current_surprise), blends it with recent learning trends (momentum ΔΘMt-1), and then updates its internal knowledge (ΘMt), deciding how much old information to keep or forget (at) in the process. The data-dependent gates (ηt, θt, at) make it adaptive on the fly.
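To make this concrete, here is a minimal, self-contained sketch of the test-time loop in PyTorch. It is a toy illustration under simplifying assumptions: the memory is a tiny two-layer MLP, and the gates ηt, θt, at appear as fixed scalars, whereas Titans learns to produce them from the input itself.

```python
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    """Toy long-term memory: an MLP that memorizes via 'surprise' at test time.

    A sketch, not the paper's implementation: the gates eta, theta, forget are
    fixed scalars here, whereas Titans computes them from the current input.
    """

    def __init__(self, d: int = 64):
        super().__init__()
        self.mem = nn.Sequential(nn.Linear(d, d), nn.SiLU(), nn.Linear(d, d))
        self.W_K = nn.Linear(d, d, bias=False)  # projections trained offline,
        self.W_V = nn.Linear(d, d, bias=False)  # frozen at test time
        # Momentum buffers: the "memory of recent surprises", one per parameter.
        self.delta = [torch.zeros_like(p) for p in self.mem.parameters()]

    def update(self, x_t: torch.Tensor, eta: float = 0.9,
               theta: float = 0.05, forget: float = 0.01) -> float:
        """One memorization step on the current chunk x_t of shape (n_tokens, d)."""
        k, v = self.W_K(x_t), self.W_V(x_t)
        loss = (self.mem(k) - v).pow(2).mean()                          # associative-memory loss
        grads = torch.autograd.grad(loss, list(self.mem.parameters()))  # the "surprise"
        with torch.no_grad():
            for p, m, g in zip(self.mem.parameters(), self.delta, grads):
                m.mul_(eta).add_(g, alpha=-theta)   # blend past and present surprise
                p.mul_(1.0 - forget).add_(m)        # forget a little, then learn
        return loss.item()

    def retrieve(self, q: torch.Tensor) -> torch.Tensor:
        """Reading memory is just a forward pass; no parameters are touched.
        (The paper projects queries with a separate matrix, omitted here.)"""
        with torch.no_grad():
            return self.mem(q)

memory = NeuralMemory()
chunk = torch.randn(32, 64)       # 32 token representations arriving at test time
for _ in range(3):
    print(memory.update(chunk))   # the "surprise" should shrink as the chunk is memorized
```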
The Google researchers explored three main ways in which these three memory modules could be arranged:
In the first arrangement, Memory as Context (MAC), Titans creates an augmented, richer context for the STM (the standard self-attention block).
Analogy: The text (sequence) arrives in pages (chunks). For each page, an ever-learning note-taker (LMM) quickly finds relevant summaries from past notes and mixes them with essential “rulebook” notes (PM). The student (STM/Attention) reads the whole thing — rulebook, relevant past summaries, AND the current page — and based on what it learns from this enriched context, tells the note-taker exactly which points on the current page were crucial to remember for future summaries.
The final answer is formed considering both the student’s detailed reading and the note-taker’s updated memory perspective.
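Schematically, one MAC step could be sketched as below. All arguments are placeholders: `lmm` exposes an `update`/`retrieve` interface like the sketch above, while `attention` and `persistent_tokens` stand in for the other two memory systems. This traces the data flow, not the paper’s exact code.

```python
import torch

def mac_step(chunk, attention, lmm, persistent_tokens):
    """Memory as Context: the LMM's recollections become extra context for attention."""
    recalled = lmm.retrieve(chunk)                                    # query long-term memory with the current page
    context = torch.cat([persistent_tokens, recalled, chunk], dim=1)  # rulebook + past notes + current page
    y = attention(context)                                            # the student reads the enriched context
    lmm.update(y)                                                     # attention output tells the note-taker what to keep
    return y, lmm.retrieve(y)                                         # the final answer considers both perspectives
```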
The second design, Memory as Gate (MAG), integrates the memories in parallel, blending the STM and LMM outputs through a dynamic gate.
Analogy: As each page arrives, two tasks happen side by side: the student (STM) focuses intently only on the page in front of them, while the separate note-taker (LMM) reads the current page, relates it to everything seen so far, and quietly updates its overall understanding. The student provides an answer based on the page, and the note-taker offers a thought based on its evolving big picture.
The coordinator (Gating mechanism) simply blends or picks from their two outputs for the final response.
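A corresponding sketch for MAG, again with placeholder components; the convex blend shown here is one common form of gating, with `gate` a learned function producing element-wise weights in [0, 1]:

```python
def mag_step(chunk, attention, lmm, gate):
    """Memory as Gate: attention and long-term memory work in parallel, then blend."""
    y_attn = attention(chunk)   # sliding-window attention over the current page
                                # (in Titans, persistent tokens are prepended to its input)
    lmm.update(chunk)           # the note-taker quietly updates its big picture
    y_mem = lmm.retrieve(chunk) # and offers its own reading of the page
    g = gate(y_attn, y_mem)     # data-dependent, element-wise weights
    return g * y_attn + (1 - g) * y_mem
```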
In the third variant, Memory as Layer (MAL), the LMM network is used as an initial processing layer that modifies the sequence before it reaches the attention mechanism (STM).
Analogy: First, every new page goes straight to a main note-taker (LMM) who processes it all, summarizing as it goes and updating its summarizing style along the way. This (potentially less detailed) summary is then handed off to the student (STM). The student only sees and focuses on local parts of this summarized text, basing their answer entirely on what the main note-taker has provided.
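And a matching sketch for MAL, where the memory sits in front of attention as a preprocessing layer (same placeholder components as before):

```python
import torch

def mal_step(chunk, attention, lmm, persistent_tokens):
    """Memory as Layer: the LMM rewrites the sequence before attention ever sees it."""
    x = torch.cat([persistent_tokens, chunk], dim=1)   # rulebook + current page
    lmm.update(x)                                      # the note-taker memorizes as it reads
    summary = lmm.retrieve(x)                          # and hands over its (possibly lossy) rewrite
    return attention(summary)                          # the student works only from this summary
```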
So, now we know everything about a possible successor to the Transformer, but will it be that big? Did Google’s researchers truly crack the code for models that can remember, adapt, and conquer challenges previously thought impossible? Let’s go through the long list of novel findings one by one:
Titans go far beyond simply predicting the next word a bit more accurately. Thanks to their dynamic Long-Term Memory Module (LMM), they show a deeper, more intuitive grasp of language and context. When evaluated against strong baselines like Transformer++ and several of the latest recurrent models, Titans consistently outperformed them, not just in language modeling but also on commonsense reasoning tasks.
Titans’ designs held up outstandingly on the S-NIAH task from the RULER benchmark (Hsieh et al., 2024)⁸, which was created to assess effective context length. Titans models, including the standalone Neural Memory (the LMM used as a model on its own), maintained strong retrieval rates even at 16K tokens, in contrast to several state-of-the-art recurrent models whose accuracy declined sharply with growing sequence length.
Retrieving a fact is one thing. But reasoning with multiple facts spread across massive contexts? That’s the real test, and it is exactly what the BABILong benchmark (Kuratov et al., 2024)⁹ demands. Titans (specifically the MAC architecture) didn’t just do well: it outperformed everyone, including much larger models like GPT-4 and Llama 3.1–70B, even those with access to external tools or retrieval systems, while Titans’ largest model has just 760M parameters!
Apart from that, Titans (MAC hybrid architecture) also managed to score 70% accuracy even at 10 million tokens. To put that into perspective, that’s like navigating and finding puzzle pieces in the entire Harry Potter series… times ten.
The researchers explored what happens when the Long-Term Memory Module (LMM) is made deeper by stacking more layers. The results? A deeper LMM dramatically improves its ability to store and organize important information, making it less likely to forget crucial details, especially in long-form sequences where most models struggle to maintain context.
While the LMM alone achieves linear time complexity, enabling efficient processing of massive inputs, deeper LMMs do come with a slight trade-off: reduced throughput, i.e., fewer tokens processed per second.
Another really exciting fact is that the same memory mechanism worked outside of traditional language tasks. In time series forecasting, a domain known for chaotic, shifting patterns, the Long-Term Memory Module (LMM) held its own against highly specialized models, including those based on Mamba (previous SOTA).
In DNA modeling (Grešová et al., 2023)¹⁰, a completely different domain, the architecture again showed strong results. That kind of generality is not easy to come by, and it suggests that memory, when handled well, is not just useful; it is foundational across domains.
And that wraps up this deep dive into Titans. Exploring this architecture has been genuinely fun — it is refreshing to see research that goes beyond scaling and instead digs into how memory and learning might actually work in more adaptive, human-like ways. Google’s legacy of foundational work continues here, from inventing the Transformer to now rethinking how AI can learn during inference. Titans feel like a natural evolution of that spirit.
That said, the AI landscape today is a lot more crowded than it was back in 2017. New ideas, no matter how brilliant, face a steeper path to becoming the default. Performance is just one piece — efficiency, simplicity, and community traction matter more than ever.
Still, Titans make a strong case for a future where models don’t just think with what they already know, but genuinely adapt as they go. Whether or not this becomes the next “just throw attention at it” moment, it is a promising step toward a smarter, more adaptive AI.
[1] Tack, Jihoon, et al. “LLM Pretraining with Continuous Concepts.” arXiv preprint arXiv:2502.08524 (2025).
[2] Vaswani, Ashish, et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (2017).
[3] Dosovitskiy, Alexey, et al. “An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale.” arXiv preprint arXiv:2010.11929 (2020).
[4] Zerveas, George, et al. “A Transformer-Based Framework for Multivariate Time Series Representation Learning.” Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (2021).
[5] Rogers, Anna, et al. “A Primer in BERTology: What We Know About How BERT Works.” Transactions of the Association for Computational Linguistics 8: 842–866 (2021).
[6] Behrouz, Ali, Peilin Zhong, and Vahab Mirrokni. “Titans: Learning to Memorize at Test Time.” arXiv preprint arXiv:2501.00663 (2024).
[7] Mandler, George. “Affect and Cognition.” Psychology Press, 3–36 (2014).
[8] Hsieh, Cheng-Ping, et al. “RULER: What’s the Real Context Size of Your Long-Context Language Models?” First Conference on Language Modeling (2024).
[9] Kuratov, Yury, et al. “BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack.” Advances in Neural Information Processing Systems 37: 106519–106554 (2024).
[10] Grešová, Katarína, et al. “Genomic Benchmarks: A Collection of Datasets for Genomic Sequence Classification.” BMC Genomic Data 24.1: 25 (2023).