AI “Infinite Context Windows” With MIT’s New Recursive AI Paper

By Brian Roemmele

I’ve spent decades exploring the intersections of technology, cognition, and human potential. In the realm of AI, particularly large language models (LLMs), I’ve been pioneering techniques to extend their capabilities beyond inherent limitations.

For about two years now, I’ve been implementing approaches strikingly similar to what’s described in this groundbreaking MIT paper on Recursive Language Models (RLMs). Through my experiments on local hardware, I’ve found that these methods are incredibly powerful—they can squeeze up to 30% more performance out of models, with even greater gains for smaller, resource-constrained ones running on everyday devices.

This isn’t just about scaling up; it’s about making AI more efficient, accessible, and intelligent without relying solely on bigger models or more compute. In this deep dive, I’ll unpack the paper’s innovations, draw parallels to my own work, and explore why this represents a pivotal shift in how we build around AI’s core intelligence.

The Core Challenge: Context Windows and the Limits of LLMs

Modern LLMs, from frontier models like @Grok to open-source alternatives, are constrained by their “context window”—the maximum amount of input data they can process in a single pass. As prompts grow longer, performance degrades due to “context rot,” where the model struggles to retrieve, connect, or reason over distant information. This is especially problematic for tasks involving massive datasets, such as analyzing million-token codebases, aggregating insights from thousands of documents, or performing multi-hop reasoning across sprawling narratives.

The MIT researchers tackle this head-on with RLMs, an inference-time scaling strategy that treats long prompts not as direct inputs to the neural network but as external environmental elements. By offloading the prompt to a programmable space and enabling the model to interact with it recursively, RLMs effectively bypass traditional context limits, achieving effective windows up to 10 million tokens or more—two orders of magnitude beyond what’s natively possible.

This resonates deeply with my own explorations. I’ve long advocated for “scaffolding” around LLMs—building external tools, environments, and processes to augment their innate abilities. In my setups on local hardware, I’ve used similar recursive querying on external text stores to handle prompts far exceeding model limits, often yielding 20-30% improvements in accuracy and coherence for tasks like long-form analysis or code comprehension. Smaller models, which I frequently run on consumer-grade machines, benefit disproportionately because they avoid the quadratic compute costs of extended contexts, making high-performance AI feasible without cloud dependency.

The RLM Framework: A Symbolic Approach to Infinite Context

At the heart of the paper is the RLM framework, which reimagines the LLM as an agent operating within a Python REPL (Read-Eval-Print Loop) environment called Ripple. Here’s how it works:

  1. Offloading the Prompt: Instead of feeding the entire long prompt directly into the model (which would exceed its context window and cause rot), the prompt is stored as a string variable (context) in the external REPL. This externalizes the data, treating it like a file or database the model can query symbolically.
  2. Programmatic Interaction: The LLM is prompted to write Python code to inspect and manipulate this context. For instance, it might use string slicing, regex searches, or chunking to break down the input into manageable pieces. Tools like print() allow observation, while a custom llm_query() function enables recursive sub-calls—essentially spawning sub-LLMs to dive deeper into specific snippets.
  3. Recursion for Depth: The “recursive” in RLM comes from the model’s ability to invoke itself on sub-portions of the context. If an initial query identifies a relevant section, the model can query that section further, aggregating results iteratively. This creates a tree-like exploration, where the model zooms in on details without losing the big picture. Recursion depth is capped at one in the experiments to manage complexity, but the potential for deeper nesting is evident. (A minimal sketch of this offload-and-recurse pattern follows below.)
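
Here’s a minimal sketch of those three steps in plain Python. Everything here is illustrative rather than the paper’s actual code: call_model is a hypothetical stand-in for whatever LLM client you use, and the needle-in-a-haystack context is synthetic.

```python
import re

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client (API or local model)."""
    return f"[model response to a {len(prompt)}-char prompt]"

# 1. Offloading: the long prompt is just a Python string in the REPL, so
#    its size is bounded by RAM, not by the model's context window.
filler = "The sky was gray that day. " * 40_000
context = filler + "The launch code is 7741. " + filler   # ~2M characters

# 2. Programmatic interaction: inspect the context symbolically instead
#    of feeding it through the network whole.
print(len(context))        # observe its size
print(context[:200])       # peek at the opening
hits = [m.start() for m in re.finditer(r"launch code", context)]

# 3. Recursion: a sub-call (the paper's llm_query()) sees only a small,
#    relevant slice, comfortably inside the native window.
def llm_query(question: str, snippet: str) -> str:
    return call_model(f"{question}\n\n---\n{snippet}")

for i in hits:
    print(llm_query("What is the launch code?", context[max(0, i - 200) : i + 200]))
```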

The system prompt (detailed in the paper’s appendix) guides the model to chunk data, make sub-calls judiciously, and terminate with a FINAL() or FINAL_VAR() output. This isn’t mere retrieval-augmented generation (RAG); it’s a full agentic scaffold where the model codes its own path through the data.
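
To see how those pieces compose into an agent, here is my own hedged reconstruction of the outer loop such a system prompt implies—not the authors’ code. The model emits Python, the scaffold runs it in a namespace that holds context, llm_query, and FINAL, and the loop ends once FINAL is called:

```python
import contextlib
import io

def run_rlm(context: str, question: str, call_model, max_steps: int = 8) -> str:
    """Drive one RLM episode: the model writes Python, we run it, repeat."""
    final = {}
    namespace = {
        "context": context,                                      # the offloaded prompt
        "llm_query": lambda q, s: call_model(f"{q}\n---\n{s}"),  # recursive sub-call
        "FINAL": lambda ans: final.setdefault("answer", ans),    # termination signal
    }
    transcript = (f"Question: {question}\n"
                  "You may emit Python. `context` holds the full prompt; use print() "
                  "to observe, llm_query() to recurse, and FINAL() to answer.")
    for _ in range(max_steps):
        code = call_model(transcript)              # model proposes a code block
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):      # capture its print() output
            exec(code, namespace)                  # run it (sandbox this in real use!)
        if "answer" in final:                      # model called FINAL(...)
            return final["answer"]
        transcript += f"\n>>> {code}\n{buf.getvalue()}"  # feed observations back
    return final.get("answer", "")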

In my own implementations, I’ve employed analogous setups using local Python environments to store oversized prompts. For example, on a modest laptop running a 7B-parameter model, I’ve handled 500k+ token datasets by having the model generate scripts for targeted searches and recursive refinements. This not only extends context but enhances reasoning efficiency—smaller models, which often falter on raw long inputs, perform like their larger counterparts, gaining that 30% edge through smarter, less compute-intensive interactions.
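
To make that concrete, here is a stripped-down sketch of the kind of chunk-and-recurse script I have the model generate. The chunk sizes are illustrative, and `ask` stands in for whatever local model client you run; treat it as the pattern, not my exact setup.

```python
def chunk(text: str, size: int = 4000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows a small model can digest."""
    step = size - overlap
    return [text[i : i + size] for i in range(0, len(text), step)] or [text]

def recursive_answer(question: str, text: str, ask, depth: int = 1) -> str:
    """Filter chunks for relevance with cheap calls, then recurse into survivors."""
    pieces = chunk(text)
    if depth == 0 or len(pieces) == 1:
        return ask(f"{question}\n---\n{pieces[0]}")
    # Cheap relevance pass: a one-word verdict per chunk keeps each call tiny.
    keep = [p for p in pieces
            if "yes" in ask(f"Could this passage help answer the question "
                            f"'{question}'? Answer yes or no.\n---\n{p}").lower()]
    merged = "\n".join(keep) if keep else pieces[0]
    return recursive_answer(question, merged, ask, depth - 1)
```

The point is that no single call ever sees more than a few thousand tokens, which is exactly why smaller models stop faltering on long inputs.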

Experimental Validation: Benchmarks and Breakthrough Results

The paper rigorously tests RLMs against four key benchmarks, using two frontier models: GPT-5 (with its mini variant for sub-calls) and the open-source Qwen3-Coder-480B (35B active parameters). Comparisons include base models, summarization agents (which compress contexts lossily), and CodeAct (a similar scaffold but without external offloading).

  • Single Needle-in-a-Haystack (S-NIAH): A solved problem for modern LLMs, where a “needle” (key fact) is hidden in filler text. RLMs maintain near-perfect recall up to 1M tokens; base models hold steady here too, so this benchmark doesn’t require the extra scaffolding.
  • BrowseComp+: A multi-hop QA task over 1,000 documents (6M-11M tokens). RLMs shine in information aggregation, achieving 62% accuracy with GPT-5 versus 0% for the base model and 58% for summarization. Costs are dramatically lower: $0.99 average for RLM versus $8.98 for summarization.
  • OOLONG and OOLONG-Pairs: These test semantic transformation and pairwise aggregation. On OOLONG-Pairs (quadratic complexity), RLMs score 23.11 F1 with Qwen3-Coder, dwarfing baselines’ near-zero performance. Figure 1 illustrates how base GPT-5 drops to 0% beyond 262k tokens, while RLMs stay consistent to 1M+.
  • LongBench-v2 CodeQA: For code repository understanding (up to 4.2M tokens), RLMs hit 56% accuracy, outperforming baselines by double digits while costing less.

Key observations from the results:

  • RLMs scale to 10M+ tokens, outperforming baselines by 10-59% on dense tasks.
  • The Ripple REPL is essential for handling ultra-long inputs; recursion adds value on complex, information-dense prompts.
  • Base models degrade with input length and task complexity; RLMs scale gracefully.
  • Costs are comparable or lower (median ≤ base model), though variance arises from recursive trajectories—some queries spike if the model dives deep.
  • Model-agnostic: Works across closed and open models, though stronger coders (like GPT-5) make fewer redundant calls.

These findings align with my local experiments. On smaller models (e.g., 3B-13B parameters), I’ve seen similar cost efficiencies: offloading prevents OOM errors, and recursion boosts accuracy by 25-30% on tasks like codebase navigation, all without needing high-end GPUs.

Why This Matters: Scaffolding Over Scaling

The paper’s key insight—that long prompts should be environmental elements for symbolic interaction—echoes a broader trend: treating LLMs as “core intelligences” around which we build scaffolds. This decouples capability from raw scale, enabling infinite-like contexts without architectural overhauls.

In my work, this has been transformative for local AI. Smaller models on hardware like a Raspberry Pi or a mid-range laptop often underperform due to context limits, but with recursive external querying, they punch above their weight. I’ve squeezed 30% more utility from them in real-world scenarios, like analyzing vast personal knowledge bases or simulating long-horizon planning. It’s not just about quantity; it’s quality—avoiding hallucinations from overload and ensuring precise, lossless access.

Limitations noted in the paper include synchronous call latencies, shallow recursion depth, and prompt sensitivity. Smaller models may struggle with coding the REPL interactions, a hurdle I’ve mitigated through fine-tuned prompts. Future directions—deeper recursion, asynchronous processing, or training models natively as RLMs—could amplify this further.

For those eager to dive deeper, the full paper is available here: Recursive Language Models.

To encapsulate the essence, here are my top 10 key points from the paper:

  1. Inference-Time Scaling: RLMs extend context via compute at inference, not training or architecture changes.
  2. External Offloading: Prompts become REPL variables, bypassing native window limits.
  3. Recursive Sub-Calls: Models query sub-models on context snippets for depth-first exploration.
  4. Ripple REPL: A Python environment for coding interactions like chunking, regex, and aggregation.
  5. Benchmark Dominance: Outperforms baselines on S-NIAH, BrowseComp+, OOLONG, OOLONG-Pairs, and CodeQA by 10-59%.
  6. Cost Efficiency: Comparable or lower than base calls; up to 3x cheaper than summarization on long inputs.
  7. Scalability: Handles 10M+ tokens with consistent performance where bases fail.
  8. Model-Agnostic: Works with GPT-5 and Qwen3-Coder; stronger coders yield better trajectories.
  9. Variance in Execution: High cost tails from complex recursions, but medians remain low.
  10. Paradigm Shift: Treats LLMs as agents in environments, paving the way for neurosymbolic AI hybrids.

This paper validates what I’ve been practicing for years: the real breakthroughs in AI will come from clever scaffolding, not just endless scaling. As we push toward more human-like intelligence, techniques like RLMs will democratize access, especially on local hardware.

I will have a few how-to articles out soon so you can apply this technique yourself.
