Don’t engineer around a base model that can’t utilize a long context to reason. Do research instead.
This blog covers five papers. My takeaway from reading them is that although the research community has developed techniques that let a model take in a longer input (MLA, NSA, RoPE, YaRN, etc.), we haven’t been able to teach the model how to use a long input to reason.
The first four papers show empirical failure modes. The last paper gives a mechanistic interpretation:
- Lost in the Middle: How Language Models Use Long Contexts [Link]
- LLMs Get Lost In Multi-Turn Conversation [Link]
- Context Length Alone Hurts LLM Performance Despite Perfect Retrieval [Link]
- Reasoning Shift: How Context Silently Shortens LLM Reasoning [Link]
- Emotion Concepts and their Function in a Large Language Model [Link]
Lost in the Middle
When an LLM is doing information retrieval from a long context, performance is often highest when the relevant information occurs at the beginning (primacy bias) or end (recency bias) of the input context, and degrades significantly when the model must access relevant information in the middle of the context — even for explicitly long-context models.
This may also be related to the attention sink observation (paper-1, paper-2), which may be mitigated by gated attention.
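As a concrete picture of how this failure mode is typically measured, here is a minimal sketch of a needle-in-a-haystack probe: plant one key fact at varying depths in filler text, then plot retrieval accuracy against depth. The filler and needle below are made up, and the model call itself is omitted; this only builds the prompts.

```python
# Sketch of a "lost in the middle" probe: embed one key fact (the "needle")
# at varying depths in filler text, then ask the model to retrieve it.

def build_probe(needle: str, filler_sentences: list[str], depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end)."""
    pos = round(depth * len(filler_sentences))
    body = filler_sentences[:pos] + [needle] + filler_sentences[pos:]
    return " ".join(body) + "\nQuestion: what is the secret number?"

filler = [f"Unrelated sentence number {i}." for i in range(100)]
needle = "The secret number is 7481."

prompts = {d: build_probe(needle, filler, d) for d in (0.0, 0.25, 0.5, 0.75, 1.0)}
# Accuracy is then plotted against depth; the paper finds a U-shape,
# with the middle depths performing worst.
```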
LLMs Get Lost In Multi-Turn Conversation
The authors developed a “sharding” technique that takes existing single-turn benchmark instructions and splits them into smaller pieces (“shards”), each revealing one requirement. These shards mimic how real users often start with vague requests and gradually clarify. This lets them compare single-turn and multi-turn performance on the exact same underlying tasks, controlling for task difficulty.
They ran over 200,000 simulated conversations across three settings: FULL (all info in one turn), CONCAT (shards concatenated into one turn), and SHARDED (shards revealed across multiple turns). The CONCAT baseline is key — it confirms the performance drop isn’t caused by rephrasing during sharding, since CONCAT performance stays within 5% of FULL. The problem is specifically the multi-turn, underspecified nature of SHARDED conversations.
An example of the three settings:
The original problem: Jay is making snowballs to prepare for a snowball fight with his sister. He can build 20 snowballs in an hour, but 2 melt every 15 minutes. How long will it take before he has 60 snowballs?
The shards (produced by the sharding process):
- Shard 1: How long before Jay’s ready for the snowball fight?
- Shard 2: He’s preparing for a snowball fight with his sister.
- Shard 3: He can make 20 snowballs per hour.
- Shard 4: He’s trying to get to 60 total.
- Shard 5: The problem is that 2 melt every 15 minutes.
—
FULL (single-turn, original instruction):
Turn 1 — User: Jay is making snowballs to prepare for a snowball fight with his sister. He can build 20 snowballs in an hour, but 2 melt every 15 minutes. How long will it take before he has 60 snowballs?
Turn 1 — LLM: (answers with all information available)
—
CONCAT (single-turn, but using the sharded text):
Turn 1 — User: Please solve the following, taking into account all bullet points:
- How long before Jay’s ready for the snowball fight?
- He’s preparing for a snowball fight with his sister.
- He can make 20 snowballs per hour.
- He’s trying to get to 60 total.
- The problem is that 2 melt every 15 minutes.
Turn 1 — LLM: (answers with all information available)
—
SHARDED (multi-turn, underspecified):
Turn 1 — User: How long before Jay’s ready for the snowball fight?
Turn 1 — LLM: “What snowball fight? Could you tell me more about what Jay is doing?” (or, problematically: “Assuming Jay needs to buy supplies, travel to the park, and gather snow, it might take about 2 hours…” ← premature answer with wrong assumptions)
Turn 2 — User: He’s preparing for a snowball fight with his sister.
Turn 2 — LLM: “Got it. How is he preparing? Is he making snowballs?”
Turn 3 — User: He can make 20 snowballs per hour.
Turn 3 — LLM: “OK, but how many does he need? If he needs, say, 40 snowballs, that would take 2 hours.” ← assumption about 40
Turn 4 — User: He’s trying to get to 60 total.
Turn 4 — LLM: “At 20 per hour, he’d need 3 hours to make 60.” ← answer attempt, but missing the melting detail
Turn 5 — User: The problem is that 2 melt every 15 minutes.
Turn 5 — LLM: “Oh, let me recalculate… he makes 20/hour but loses 8/hour (2 every 15 min), so net 12/hour. 60 ÷ 12 = 5 hours.” ← may or may not correctly revise the earlier wrong attempt
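The three settings above can be sketched as prompt constructions over the same shard list. This is a sketch only: the simulated user, the assistant replies, and the scoring harness are all elided, and the function names are mine.

```python
# FULL uses the original instruction; CONCAT bullets the shards into one
# turn; SHARDED reveals one shard per simulated user turn.

shards = [
    "How long before Jay's ready for the snowball fight?",
    "He's preparing for a snowball fight with his sister.",
    "He can make 20 snowballs per hour.",
    "He's trying to get to 60 total.",
    "The problem is that 2 melt every 15 minutes.",
]

def concat_prompt(shards: list[str]) -> str:
    bullets = "\n".join(f"- {s}" for s in shards)
    return ("Please solve the following, taking into account all bullet "
            f"points:\n{bullets}")

def sharded_turns(shards: list[str]) -> list[dict]:
    # One user message per turn; the assistant reply between turns comes
    # from the model under test and is omitted here.
    return [{"role": "user", "content": s} for s in shards]
```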
Results: Define aptitude (90th percentile performance) and unreliability (the gap between 90th and 10th percentile). In single-turn settings, stronger models are both more capable and more reliable. In multi-turn settings, all models — even frontier ones — exhibit similarly high unreliability, with 50-point swings between best and worst runs on the same instruction.
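The two metrics are easy to compute from repeated runs of the same instruction. A sketch, assuming decile interpolation via Python's `statistics.quantiles` (the paper's exact percentile convention may differ), with made-up scores:

```python
import statistics

def aptitude_unreliability(scores: list[float]) -> tuple[float, float]:
    """Aptitude = 90th-percentile score over repeated runs of one
    instruction; unreliability = gap between 90th and 10th percentiles."""
    qs = statistics.quantiles(scores, n=10)  # 9 decile cut points
    p10, p90 = qs[0], qs[8]
    return p90, p90 - p10

# Ten simulated runs of one instruction, scored 0-100 (illustrative only):
runs = [95, 40, 88, 35, 90, 45, 85, 92, 38, 86]
apt, unrel = aptitude_unreliability(runs)
# High aptitude but a huge p90-p10 gap: exactly the multi-turn signature
# the paper reports (~50-point swings on the same instruction).
```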
Analysis: Four behavioral patterns drive this: (1) LLMs prematurely attempt full answers, making incorrect assumptions about unspecified details; (2) they over-rely on their own earlier (wrong) attempts, producing increasingly “bloated” solutions; (3) they disproportionately attend to the first and last turns while neglecting middle turns (a multi-turn analogue of “lost in the middle”); and (4) they generate overly verbose responses, which introduce more assumptions that compound errors.
Solutions that don’t work: RECAP (after the multi-turn conversation ends, a final turn restates all prior user messages at once, giving the LLM a chance to redo its answer) and SNOWBALL (at every turn, the user repeats all previously revealed information alongside the new information) both helped but still fell well short of single-turn performance. Lowering temperature to 0 helped in single-turn settings but barely improved multi-turn reliability, since small early differences cascade across turns.
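SNOWBALL, in particular, is mechanical enough to sketch: each user turn restates everything revealed so far before adding the new piece. The function name and recap wording below are mine, not the paper's.

```python
def snowball_turn(revealed: list[str], new_shard: str) -> str:
    """SNOWBALL: every user turn repeats all previously revealed
    information alongside the new shard."""
    if not revealed:
        return new_shard
    recap = "\n".join(f"- {s}" for s in revealed)
    return f"So far you know:\n{recap}\nNew information: {new_shard}"

shards = [
    "How long before Jay's ready for the snowball fight?",
    "He can make 20 snowballs per hour.",
    "He's trying to get to 60 total.",
]
turns = [snowball_turn(shards[:i], s) for i, s in enumerate(shards)]
```

Even with this full recap at every turn, the paper finds performance still falls well short of the single-turn baseline.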
Context Length Alone Hurts LLM Performance Despite Perfect Retrieval
The paper’s central conclusion is that the input length itself degrades an LLM’s problem-solving performance, independent of whether the model can retrieve the relevant information and independent of any distraction from irrelevant content. This challenges the widely held assumption that if a model can find the right information in a long context, it should perform just as well as it would on a short input.
The experiments are progressively more controlled, which I like:
Step 1 — Essay distractors (§3). Short-context problems (GSM8K, MMLU, HumanEval, variable summation) are padded with Paul Graham essay tokens between evidence and question. Retrieval (the model can recite the evidence verbatim) stays near-perfect (e.g., exact match on 970/1000 MMLU items at 30K tokens), but task accuracy drops 13.9%–85% across Llama-3.1-8B and Mistral-v0.3-7B.
Step 2 — Whitespace distractors (§4.1). Essay tokens are replaced with whitespace to minimize semantic distraction. Performance still drops substantially (up to 48% for Llama on VarSum, 30% for Mistral on GSM8K). Closed-source models (GPT-4o, Claude 3.5, Gemini 2.0) are more robust but still degrade on most tasks. Moving the evidence right next to the question (whitespace placed before it, not between) still produces drops of up to 20%, ruling out positional distance as the cause.
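The Step-2 manipulation can be sketched as two prompt layouts of identical length: filler placed between evidence and question (the default), or before the evidence (the adjacency control). Whitespace filler follows the paper; the example evidence, question, and token counts are mine.

```python
# Pad with semantically empty filler, either BETWEEN evidence and question
# or BEFORE the evidence. Same total length; only the evidence-question
# distance differs, which isolates positional distance as a variable.

def pad_prompt(evidence: str, question: str, n_pad: int, between: bool) -> str:
    filler = " " * n_pad  # whitespace distractor
    if between:
        return evidence + filler + question  # evidence far from question
    return filler + evidence + question      # evidence adjacent to question

ev = "Fact: the warehouse holds 1,240 crates."
q = "\nQuestion: how many crates does the warehouse hold?"
far = pad_prompt(ev, q, 30_000, between=True)
near = pad_prompt(ev, q, 30_000, between=False)
```

The paper's finding is that even the `near` layout still degrades performance by up to 20%, so positional distance alone can't explain the drop.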
Step 3 — Attention masking (§4.2). All distraction tokens are masked so the model attends only to the evidence and question — identical to the short-context setting except for longer positional encodings. Performance still drops consistently (7.9%–50% at 30K masked tokens), confirming that context length alone, with zero distraction, hurts reasoning.
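The Step-3 condition can be sketched at the level of the inputs a decoder would receive: distractor positions get attention mask 0, but position ids keep counting through the masked span, so the only difference from the short prompt is the longer positional encodings. Plain-Python sketch; the variable names and token counts are mine.

```python
# Mask the distractor span so the model attends only to evidence +
# question, while position ids advance through the masked span.

def masked_inputs(n_evidence: int, n_pad: int, n_question: int):
    total = n_evidence + n_pad + n_question
    attention_mask = (
        [1] * n_evidence +   # evidence: visible
        [0] * n_pad +        # distractors: fully masked
        [1] * n_question     # question: visible
    )
    position_ids = list(range(total))  # positions advance through the pad
    return attention_mask, position_ids

mask, pos = masked_inputs(n_evidence=50, n_pad=30_000, n_question=20)
# The model attends to the same 70 tokens as in the short-context setting,
# but the question tokens sit at positions ~30,050+ instead of ~50+.
```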
Mitigation: retrieve then reason (§5). Motivated by these findings, the authors propose prompting the model to first recite the retrieved evidence and then solve the problem using that recited evidence as a new, shorter prompt. This boosts Mistral’s GSM8K accuracy by up to 31.2% on their synthetic benchmark, and yields consistent improvements of up to 4% for GPT-4o on the RULER benchmark.
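Retrieve-then-reason is just a two-call prompting scheme, so it can be sketched with a stubbed model. Here `llm` is a placeholder for any chat-completion call, and `toy_llm` is a fake model that simply echoes the evidence line; the prompt wording is mine, not the paper's.

```python
# Sketch of "retrieve then reason": first have the model recite the
# relevant evidence, then solve using only that short recitation.

def retrieve_then_reason(llm, long_context: str, question: str) -> str:
    recitation = llm(
        f"{long_context}\n\nQuote verbatim the sentences needed to answer: "
        f"{question}"
    )
    # The second call sees only the short recitation, not the long context.
    return llm(f"{recitation}\n\nUsing only the text above, answer: {question}")

# Toy stand-in model: returns the line containing the key fact, if any.
def toy_llm(prompt: str) -> str:
    for line in prompt.splitlines():
        if "ANSWER_FACT:" in line:
            return line
    return prompt.splitlines()[0]

ctx = "filler\n" * 100 + "ANSWER_FACT: the total is 42.\n" + "filler\n" * 100
out = retrieve_then_reason(toy_llm, ctx, "what is the total?")
```

The point of the second call is that the model reasons over a short prompt, sidestepping the length-alone degradation established in Steps 1–3.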
Reasoning Shift: How Context Silently Shortens LLM Reasoning
This paper finds that when reasoning LLMs encounter problems embedded in longer or more complex contexts — whether irrelevant prefix text, multi-turn conversations, or multi-problem prompts — they produce up to 50% fewer reasoning tokens than when solving the same problem in isolation, with even a few hundred distractor tokens triggering an 18% reduction. Crucially, the shorter traces aren’t caused by models reaching answers faster: the position of the first candidate answer is nearly identical across conditions (925 vs. 939 tokens). Instead, the compression comes from models skipping post-answer self-verification. In the long-input condition, 68% of traces terminate immediately after stating a final answer versus 57% at baseline, and a resampling experiment showed that self-checking words like “wait,” “but,” and “alternatively” appear at roughly half their baseline frequency when the context is longer. In other words, surrounding context silently suppresses the double-checking behaviors that make reasoning models effective on hard problems, degrading accuracy by 9–15% on challenging tasks even though the models show no confusion about the task itself.
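The self-verification analysis boils down to counting marker words in reasoning traces across conditions. A sketch of that counting, using the marker words quoted above; simple word-level matching stands in for whatever tokenization the paper actually uses, and the example traces are made up:

```python
import re

# Count self-verification markers ("wait", "but", "alternatively") per
# trace, normalized by trace length, to compare baseline vs. long-context.

MARKERS = ("wait", "but", "alternatively")

def self_check_rate(trace: str) -> float:
    words = re.findall(r"[a-z']+", trace.lower())
    hits = sum(w in MARKERS for w in words)
    return hits / max(len(words), 1)

baseline_trace = ("Let me try 12. Wait, but that ignores melting. "
                  "Alternatively, net rate is 12 per hour.")
long_ctx_trace = "Net rate is 12 per hour, so 5 hours. Final answer: 5."
# The paper's finding: the long-context rate is roughly half the baseline.
```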
It would be worth repeating these experiments on SOTA reasoning models like Kimi-K2.5.
Emotion Concepts and their Function in a Large Language Model
[WIP]