Paper: Engram: Conditional Memory via Scalable Lookup
Authors: DeepSeek-AI & Peking University
Date: January 2026
arXiv: 2601.07372v1
TL;DR
- The Engram paper introduces conditional memory as a new sparsity axis for large language models, complementing the established paradigm of conditional computation (MoE). The central thesis is that Transformers lack a native knowledge lookup primitive, forcing them to simulate retrieval through expensive computation. Engram addresses this by storing N-gram pattern embeddings in hash-indexed tables, enabling $O(1)$ retrieval of static local patterns while preserving the Transformer backbone for dynamic reasoning.
- Under rigorous iso-parameter and iso-FLOPs constraints, Engram-27B outperforms the MoE-27B baseline across diverse benchmarks: knowledge-intensive tasks (MMLU +3.0), general reasoning (BBH +5.0), and code/math (HumanEval +3.0). The paper’s U-shaped allocation curve demonstrates that optimal sparse model design allocates approximately 20-25% of inactive parameters to Engram memory rather than MoE experts.
- The architecture is strategically significant given documented constraints on Chinese AI compute capacity. Engram’s ability to offload parameters to host DRAM (bypassing HBM bottlenecks) and extract more capability per FLOP directly addresses hardware limitations. The paper’s structure—detailed concept validation at modest scale, infrastructure described but not benchmarked at frontier scale—matches DeepSeek’s historical publication-to-deployment pattern, suggesting high probability of inclusion in their forthcoming V4 frontier model.
- Key technical innovations include: Tokenizer compression (surjective mapping collapsing semantically equivalent tokens with 23% vocabulary reduction), Multi-head hashing (collision-robust retrieval via $K$ independent hash functions), Context-aware gating (learned scalar gates filtering irrelevant or collision-contaminated retrievals), mHC integration (branch-specific gating enabling expressivity-efficiency tradeoffs), and Memory-compute decoupling (prefetch-and-overlap strategy enabling <3% overhead when offloading 100B parameters to host memory).
- The paper opens significant unexplored territory, including domain-specialized Engram modules for high-accuracy applications (medical, legal, scientific) and the broader design space of memory-expressivity tradeoffs within the conditional memory paradigm.
Sectional Summary
Section 1: Introduction
The paper opens by identifying a fundamental limitation in Transformer architectures: the absence of a native knowledge lookup primitive. While MoE provides conditional computation (sparse expert activation), Transformers lack conditional memory (sparse knowledge retrieval). This forces models to reconstruct static patterns—named entities, collocations, idioms—through iterative computation at every forward pass.
Engram proposes treating conditional memory as a first-class modeling primitive. The key insight is that local N-gram patterns are deterministic given the input tokens, enabling hash-based $O(1)$ retrieval rather than learned routing. This architectural choice decouples memory scaling from compute scaling: adding Engram parameters increases storage requirements but not per-token FLOPs.
The introduction frames the paper’s core contribution as answering the allocation question: given fixed total parameters and activated parameters, how should sparse capacity be distributed between MoE experts (conditional computation) and Engram embeddings (conditional memory)?
Section 2: Method
Section 2.1: Tokenizer Compression
Standard subword tokenizers create multiple token IDs for semantically equivalent surface forms (“Apple”, “apple”, “ apple”, “APPLE”). This is catastrophic for N-gram lookup, as each variant maps to different hash slots.
Engram implements a surjective mapping $P: V \rightarrow V’$ that collapses equivalent tokens via:
- NFKC Unicode normalization
- Lowercasing
- Whitespace stripping
- Diacritic removal
This achieves 23% vocabulary reduction, improving coverage per embedding slot and reducing hash collision probability.
Section 2.2: Multi-Head Hashing
The combinatorial space of N-grams ($\lvert V \rvert^3 \approx 10^{15}$ for trigrams with 128k vocabulary) cannot be stored explicitly. Engram uses $K$ independent hash functions mapping N-grams to table slots of size $M$:
\[\phi_{n,k}: \mathbb{Z}^n \rightarrow \{0, 1, \ldots, M-1\}\]
Multi-head hashing provides collision robustness: if two N-grams collide in one hash function, they almost certainly differ in others. With $K=8$ heads and $M=3 \times 10^6$ slots, the probability of total collision is approximately $10^{-52}$, making catastrophic collision effectively impossible.
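As a quick sanity check on these figures (not from the paper; the table size $3{,}000{,}017$ is the prime quoted later in the detailed analysis), the following sketch reproduces the collision probabilities:

# Rough sanity check of the collision numbers quoted above (illustrative only).
M = 3_000_017   # slots per head (prime table size, as quoted later in this analysis)
K = 8           # independent hash heads

p_single = 1 / M         # chance two fixed N-grams share a slot in one head
p_total = p_single ** K  # chance they collide in every head

print(f"P(collision, one head)  ≈ {p_single:.1e}")   # ~3.3e-07
print(f"P(collision, all heads) ≈ {p_total:.1e}")    # ~1.5e-52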
Section 2.3: Context-Aware Gating
Static lookup cannot resolve polysemy (“bank” as financial institution vs. riverbank) or filter hash collisions. Engram introduces a learned gating mechanism:
\[\alpha_t = \sigma\left(\frac{\text{RMSNorm}(h_t)^\top \cdot \text{RMSNorm}(k_t)}{\sqrt{d}}\right)\]
The gate $\alpha_t \in (0,1)$ modulates the retrieved embedding based on compatibility with the Transformer’s hidden state $h_t$. This reintroduces minimal dynamic computation ($O(d)$ for the dot product) while preserving the efficiency benefits of static retrieval.
Section 2.4: Multi-Branch Integration with mHC
Engram integrates with Manifold-Constrained Hyper-Connections (mHC; Xie et al., 2025), which expand the residual stream to $M$ parallel branches. The design shares expensive components (embedding table, value projection) while separating cheap components (key projections, gates):
- Shared: One embedding table $E$, one value projection $W_V$
- Separate: $M$ key projections $\{W_K^{(m)}\}$, $M$ scalar gates $\{\alpha^{(m)}\}$
This enables branch-specific decisions about memory utilization while amortizing storage costs.
Section 2.5: Decoupling Memory from Compute
The section describes two system designs:
Training: Standard model parallelism shards Engram tables across GPUs, using All-to-All communication to gather active embeddings. This distributes but does not eliminate HBM requirements.
Inference: Deterministic hash-based addressing enables prefetch-and-overlap strategies. Since indices are known before the forward pass, embeddings can be asynchronously retrieved from host DRAM via PCIe while preceding layers compute. Engram modules at layers 2 and 15 provide sufficient computation buffer to mask transfer latency.
The paper describes (but does not benchmark) a multi-level cache hierarchy exploiting Zipfian N-gram distribution: frequent patterns in GPU HBM, common patterns in host DRAM, rare patterns on NVMe SSD.
Section 3: Scaling vs. Sparsity
Section 3.1: Allocation Under Fixed Constraints
The paper introduces the allocation ratio $\rho \in [0,1]$, where $\rho$ determines the fraction of sparse parameters assigned to MoE experts versus Engram memory. Under fixed total and activated parameters, sweeping $\rho$ reveals a U-shaped validation loss curve:
- $\rho = 100\%$ (pure MoE): Loss = 1.7248
- $\rho \approx 75\text{-}80\%$ (optimal): Loss = 1.7109
- $\rho \rightarrow 0\%$ (Engram-dominated): Loss increases
The U-shape is replicated at two compute budgets ($2 \times 10^{20}$ and $6 \times 10^{20}$ FLOPs), suggesting stability across regimes. The optimal allocation dedicates approximately 20-25% of sparse capacity to Engram.
Section 3.2: Infinite Memory Regime
With a fixed MoE backbone, sweeping Engram capacity from $10^5$ to $10^7$ slots reveals log-linear scaling: validation loss decreases linearly with $\log(\text{slots})$. This suggests Engram can scale efficiently beyond tested ranges.
Comparison with OverEncoding (Huang et al., 2025a) shows Engram extracts more value from equivalent memory budgets, attributed to deeper injection points, context-aware gating, and tokenizer compression.
Section 4: Experiments
Section 4.1: Pre-training Setup
Models are trained for 50,000 steps on 262B tokens:
- Dense-4B: 4.1B total, 3.8B activated
- MoE-27B: 26.7B total, 3.8B activated (72 routed + 2 shared experts)
- Engram-27B: 26.7B total, 3.8B activated (55 routed + 2 shared experts, 5.7B Engram)
- Engram-40B: 39.5B total, 3.8B activated (55 routed + 2 shared experts, 18.5B Engram)
All models use DeepSeek-V3 tokenizer, MLA attention (DeepSeek-AI, 2024a), mHC ($M=4$), and Muon optimizer.
Section 4.2: Pre-training Results
Engram-27B outperforms MoE-27B across all benchmark categories:
| Category | Representative Benchmark | MoE-27B | Engram-27B | $\Delta$ |
|---|---|---|---|---|
| Knowledge | MMLU | 57.4 | 60.4 | +3.0 |
| Reasoning | BBH | 50.9 | 55.9 | +5.0 |
| Code | HumanEval | 37.8 | 40.8 | +3.0 |
| Math | GSM8K | 58.4 | 60.6 | +2.2 |
Engram-40B shows further gains on most benchmarks but underperforms Engram-27B on code tasks (HumanEval 38.4 vs 40.8). The paper attributes this to undertraining—the loss gap between Engram-40B and baselines continues widening at training end.
Section 5: Long-Context Capability
Section 5.1: Experimental Setup
Models undergo YaRN context extension (Peng et al., 2023) from 4k to 32k tokens over 5,000 additional steps.
Section 5.2: Results and Analysis
The paper makes three controlled comparisons:
- Iso-Loss (46k vs 50k steps): Matching pre-training loss isolates architectural effects. Engram dramatically outperforms on complex retrieval: Multi-Query NIAH 97.0 vs 84.2 (+12.8), Variable Tracking 87.2 vs 77.0 (+10.2).
- Iso-FLOPs (50k vs 50k steps): Standard comparison shows Engram advantages compound with its better base quality.
- Extreme (41k vs 50k steps): Engram at 82% training compute matches MoE on LongPPL while exceeding it on RULER tasks.
The mechanism: Engram handles local patterns via $O(1)$ lookup, freeing attention capacity for global context management. Tasks requiring broad attention (Frequent Words Extraction: +26.3) show largest gains.
Section 6: Analysis
Section 6.1: Effective Depth
LogitLens analysis (nostalgebraist, 2020) shows Engram representations converge to prediction-ready states earlier (lower KL divergence at early layers). CKA analysis (Kornblith et al., 2019) reveals Engram layer 5 representations match MoE layer ~12 representations for named entities.
Interpretation: By offloading static pattern reconstruction to lookup, Engram effectively increases model depth—early layers can immediately begin reasoning rather than spending capacity on feature composition.
Section 6.2: Structural Ablations
Layer placement sweep finds layer 2 optimal for single-module Engram (balances early intervention with contextual precision for gating). Multi-branch integration and context-aware gating are critical; depthwise convolution provides marginal benefit.
Section 6.3: Sensitivity Analysis
Suppressing Engram output during inference reveals functional specialization:
- Factual knowledge (TriviaQA): Catastrophic collapse to 29% retained
- Reading comprehension (C3): Resilient at 93% retained
This demonstrates Engram becomes the primary repository for parametric knowledge, while the backbone retains comprehension and reasoning capabilities.
Section 6.4: System Efficiency
Table 4 demonstrates 100B parameter Engram offloaded to host DRAM incurs <3% throughput overhead (8,858 vs 9,032 tokens/sec on 4B backbone). This validates the prefetch-and-overlap strategy.
Section 7: Related Work
The paper positions Engram against:
- N-gram language models (Shannon, 1948; Jurafsky & Martin, 2024): Engram modernizes the concept with learned embeddings and neural integration
- OverEncoding (Huang et al., 2025a): Prior N-gram embedding work limited to input layer averaging
- Product key memory (Lample et al., 2019): Attention-based retrieval vs. Engram’s hash-based deterministic lookup
- Retrieval-augmented generation (Lewis et al., 2020): External document retrieval vs. Engram’s internal parametric memory
Detailed Technical Analysis
1. Classical N-gram Models: Foundation and Connection to Engram
Lay Analogy: Your Phone’s Predictive Keyboard
Imagine texting a friend. You type “I’ll meet you at the” and your phone suggests “airport”, “office”, or “usual”. How does it know? It has memorized millions of text messages and learned that certain words frequently follow certain phrases.
Your phone doesn’t analyze the entire conversation—it just looks at the last few words and consults a giant lookup table: “When people type ‘at the’, what do they usually type next?”
This is exactly how an N-gram model works:
- It memorizes patterns from training text
- It only looks at the immediate local context (the last $N-1$ words)
- Prediction is a table lookup, not computation
The “N” in N-gram refers to the window size. A 3-gram (trigram) model looks at the previous 2 words to predict the next one.
Mathematical Foundation
The Goal: Assign Probability to Sequences
Given a sentence $W = (w_1, w_2, \ldots, w_T)$, we want to compute $P(W)$—the probability of this exact word sequence occurring.
Using the chain rule of probability, we decompose this as:
\[P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, w_2, \ldots, w_{t-1})\]
Problem: Computing $P(w_t \mid w_1, \ldots, w_{t-1})$ requires conditioning on the entire history. For a vocabulary of size $\lvert V \rvert$ and sequence length $T$, we’d need to estimate $\lvert V \rvert^T$ parameters—astronomically intractable.
The Markov Assumption (Key Simplification)
The N-gram model makes the $(N-1)$th-order Markov assumption: the probability of the next word depends only on the previous $(N-1)$ words:
\[P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-N+1}, \ldots, w_{t-1})\]
This truncates history to a fixed-size window, making the model tractable.
Notation Summary
| Symbol | Meaning |
|---|---|
| $w_t$ | Word at position $t$ |
| $V$ | Vocabulary (set of all words) |
| $\lvert V \rvert$ | Vocabulary size |
| $N$ | The “N” in N-gram (context window + target) |
| $w_{t-N+1}^{t-1}$ | Shorthand for $(w_{t-N+1}, \ldots, w_{t-1})$ — the context |
| $C(\cdot)$ | Count function (occurrences in training corpus) |
Maximum Likelihood Estimation
We estimate probabilities by counting co-occurrences in a training corpus:
\[P_{\text{MLE}}(w_t \mid w_{t-N+1}^{t-1}) = \frac{C(w_{t-N+1}, \ldots, w_{t-1}, w_t)}{C(w_{t-N+1}, \ldots, w_{t-1})}\]
In plain English:
\[P(\text{next word} \mid \text{context}) = \frac{\text{Times we saw (context + next word) together}}{\text{Times we saw (context) at all}}\]
Concrete Examples
Bigram ($N=2$): Conditions on 1 previous word
\[P(\text{mat} \mid \text{the}) = \frac{C(\text{"the mat"})}{C(\text{"the"})} = \frac{1,247}{89,432} \approx 0.014\]
Trigram ($N=3$): Conditions on 2 previous words
\[P(\text{mat} \mid \text{on}, \text{the}) = \frac{C(\text{"on the mat"})}{C(\text{"on the"})} = \frac{342}{5,891} \approx 0.058\]
Notice: The trigram probability is higher because “on the” provides more specific context than just “the”.
Full Sentence Probability
For the sentence “the cat sat on the mat” using a trigram model:
\[P(\text{sentence}) = P(\text{the} \mid \langle s \rangle, \langle s \rangle) \times P(\text{cat} \mid \langle s \rangle, \text{the}) \times P(\text{sat} \mid \text{the}, \text{cat}) \times \ldots\]
Where $\langle s \rangle$ is a special start-of-sentence token.
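To make the counting concrete, here is a minimal MLE trigram estimator on a toy corpus (the sentences and counts are illustrative, not the figures used in the examples above):

# Minimal trigram MLE estimator on a toy two-sentence corpus (illustrative).
from collections import Counter

def trigram_mle(sentences):
    tri, bi = Counter(), Counter()
    for sent in sentences:
        words = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for i in range(2, len(words)):
            tri[(words[i-2], words[i-1], words[i])] += 1
            bi[(words[i-2], words[i-1])] += 1
    # P(w | u, v) = C(u, v, w) / C(u, v)
    return lambda u, v, w: tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

P = trigram_mle(["the cat sat on the mat", "the cat sat on the sofa"])
print(P("on", "the", "mat"))   # 0.5: C("on the mat") = 1, C("on the") = 2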
Visual Representation: Trigram Decomposition
┌─────────────────────────────────────────────────────────────────┐
│ TRIGRAM MODEL (N=3) │
│ │
│ Sentence: "the cat sat on the mat" │
│ │
│ ┌─────────┬─────────┬─────────┐ │
│ │ Context │ Context │ Target │ Probability │
│ │ w_{t-2}│ w_{t-1}│ w_t │ │
│ ├─────────┼─────────┼─────────┤ │
│ │ <s> │ <s> │ the │ P(the | <s>, <s>) │
│ │ <s> │ the │ cat │ P(cat | <s>, the) │
│ │ the │ cat │ sat │ P(sat | the, cat) │
│ │ cat │ sat │ on │ P(on | cat, sat) │
│ │ sat │ on │ the │ P(the | sat, on) │
│ │ on │ the │ mat │ P(mat | on, the) │
│ │ the │ mat │ </s> │ P(</s> | the, mat) │
│ └─────────┴─────────┴─────────┘ │
│ │
│ The sliding window moves through the sentence: │
│ │
│ [<s> <s> the] cat sat on the mat </s> │
│ [<s> the cat] sat on the mat </s> │
│ [the cat sat] on the mat </s> │
│ [cat sat on] the mat </s> │
│ [sat on the] mat </s> │
│ [on the mat] </s> │
│ [the mat </s>] │
│ │
└─────────────────────────────────────────────────────────────────┘
The Lookup Table Structure
┌──────────────────────────────────────────────────────────────────┐
│ N-GRAM PROBABILITY TABLE │
│ (Trigram Example) │
├──────────────────────────────────┬───────────────────────────────┤
│ Context (Key) │ Next Word Probabilities │
├──────────────────────────────────┼───────────────────────────────┤
│ "the cat" │ sat: 0.12, is: 0.08, ... │
│ "cat sat" │ on: 0.25, down: 0.18, ... │
│ "sat on" │ the: 0.35, a: 0.22, ... │
│ "on the" │ mat: 0.18, floor: 0.15, ... │
│ "how are" │ you: 0.85, things: 0.08, ...|
│ ... │ ... │
└──────────────────────────────────┴───────────────────────────────┘
↓ At inference time ↓
Input: "sat on" → Lookup → Output: P(the|sat,on) = 0.35
O(1)
Handling Edge Cases: Smoothing
A critical problem: unseen N-grams get probability zero, which makes entire sentences have $P = 0$.
Example: If “quantum cat” never appeared in training:
\[P(\text{sat} \mid \text{quantum}, \text{cat}) = \frac{C(\text{"quantum cat sat"})}{C(\text{"quantum cat"})} = \frac{0}{0} = \text{undefined}\]
(If the context “quantum cat” had been seen but the full trigram had not, the estimate would be exactly zero, which is equally fatal for the product over the sentence.)
Solutions (smoothing techniques):
- Laplace (Add-1) Smoothing: Add 1 to all counts (a minimal sketch follows this list)
- Kneser-Ney Smoothing: Sophisticated interpolation using lower-order models
- Backoff: If trigram unseen, fall back to bigram, then unigram
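As referenced above, a minimal add-1 smoothing sketch (textbook formulation, reusing the `tri`/`bi` counters from the earlier MLE sketch; not necessarily the paper’s formulation):

# Add-1 (Laplace) smoothing on top of the counters from the MLE sketch above.
def laplace_trigram(tri, bi, vocab_size):
    # P(w | u, v) = (C(u, v, w) + 1) / (C(u, v) + |V|)
    def prob(u, v, w):
        return (tri[(u, v, w)] + 1) / (bi[(u, v)] + vocab_size)
    return prob
# Unseen trigrams now receive a small nonzero probability instead of 0 (or 0/0).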
Computational Complexity
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Training (counting) | $O(T)$ where $T$ = corpus size | $O(\lvert V \rvert^N)$ worst case |
| Inference (lookup) | $O(1)$ per prediction | — |
| Storage | — | $O(M)$ where $M$ = unique N-grams |
The $O(1)$ lookup is the key property that makes N-grams attractive—and what the Engram paper exploits.
Key Properties Summary
| Property | Assessment |
|---|---|
| Lookup complexity | $O(1)$—instant retrieval |
| Interpretability | High—direct frequency counts |
| Long-range dependencies | None—limited to $(N-1)$ context |
| Generalization | None—“cat sat” $\neq$ “dog sat” |
| Data sparsity | Severe—most N-grams unseen |
Connection to Engram
Engram modernizes N-gram models by:
| Classic N-gram | Engram |
|---|---|
| Stores probability distributions | Stores dense embedding vectors |
| Exact string matching | Hash-based approximate matching |
| Zero probability for unseen N-grams | Hash collisions handled by gating |
| No semantic generalization | Tokenizer compression groups equivalent forms |
| Standalone model | Module within Transformer backbone |
The core insight remains the same: local patterns don’t require deep computation—they can be retrieved in $O(1)$ time, freeing neural network depth for tasks that actually require reasoning.
2. Tokenizer Compression: Canonical Projection for N-gram Efficiency
The Problem: Tokenizer Fragmentation
Standard subword tokenizers (BPE, SentencePiece) are designed for lossless text reconstruction, not semantic coherence. This creates a proliferation of token IDs that represent the same underlying concept:
Raw text: "Apple" → Token ID: 12847
Raw text: "apple" → Token ID: 18234
Raw text: " apple" → Token ID: 31092 (note the leading space)
Raw text: " Apple" → Token ID: 45123
Raw text: "APPLE" → Token ID: 67891
Raw text: "äpple" → Token ID: 89012 (diacritic variant)
All six tokens refer to the same semantic concept, but each has a completely different ID. For N-gram lookup, this is catastrophic:
- The trigram “the red apple” and “The red Apple” would hash to entirely different embedding slots
- You’d need to observe both variants in training to learn both patterns
- The combinatorial explosion is severe: if each position has 6 variants, a trigram has $6^3 = 216$ possible ID combinations for semantically identical content
The Solution: Canonical Projection
The paper implements a surjective mapping $P: V \rightarrow V’$ that collapses semantically equivalent tokens into canonical representatives:
P: V → V' (surjective = many-to-one)
P(12847) = P(18234) = P(31092) = P(45123) = P(67891) = P(89012) = 7234
↑ ↑ ↑ ↑ ↑ ↑ ↑
"Apple" "apple" " apple" " Apple" "APPLE" "äpple" canonical
"apple"
The mapping applies several normalization steps:
- NFKC Unicode Normalization: Converts compatibility characters to canonical forms (e.g., the ligature ﬁ → fi, ² → 2); diacritic removal is a separate step, listed below
- Lowercasing: A → a
- Whitespace stripping: “ apple” → “apple”
- Diacritic removal (implied): café → cafe
What Appendix C Reveals
The paper’s Table 6 shows the most aggressively merged groups:
| Canonical Token | Merge Count | Original Variants |
|---|---|---|
| '␣' (whitespace) | 163 | \t, \n, \r, ␣, ␣␣, \n\n, ␣␣␣, ␣\n, … |
| 'a' | 54 | A, a, ␣a, ␣A, á, ä, ã, ą, à, å, â, … |
| 'o' | 40 | O, o, ␣o, ␣O, ó, ö, ô, õ, ő, ò, … |
| 'e' | 35 | E, e, ␣e, ␣E, é, è, ę, ě, ê, … |
| 'i' | 30 | I, i, ␣I, ␣i, í, ì, î, ï, … |
The 23% vocabulary reduction means roughly 30,000 token IDs are collapsed into existing canonical forms.
Concrete Step-by-Step Example
The complete Engram embedding process for a real input:
Input Sentence
"The Milky Way galaxy"
Step 1: Standard Tokenization (DeepSeek-V3 tokenizer)
Text: "The Milky Way galaxy"
Tokens: ["The", " Mil", "ky", " Way", " galaxy"]
Raw IDs: [1847, 29341, 8472, 15234, 31847]
Step 2: Tokenizer Compression (Canonical Projection)
Apply $P()$ to each raw ID:
Raw ID Text Normalization Steps Canonical ID
------ ---- --------------------- ------------
1847 "The" lowercase → "the" → 892
29341 " Mil" strip space, lower → "mil" → 4521
8472 "ky" lowercase → "ky" → 8470
15234 " Way" strip space, lower → "way" → 2847
31847 " galaxy" strip space, lower → "galaxy" → 12453
Canonical IDs: [892, 4521, 8470, 2847, 12453]
Critical point: If the input had been “THE MILKY WAY GALAXY” or “ the milky way galaxy”, the canonical IDs would be identical.
Step 3: Form N-gram Contexts
For a trigram model ($N=3$), extract suffix N-grams at each position:
Position t=0: g_{0,3} = (<s>, <s>, 892) → "the"
Position t=1: g_{1,3} = (<s>, 892, 4521) → "the mil"
Position t=2: g_{2,3} = (892, 4521, 8470) → "the milky"
Position t=3: g_{3,3} = (4521, 8470, 2847) → "milky way"
Position t=4: g_{4,3} = (8470, 2847, 12453) → "way galaxy"
Step 4: Multi-Head Hashing
For each N-gram, apply $K$ different hash functions to get embedding indices:
For g_{3,3} = (4521, 8470, 2847) representing "milky way":
Hash Head 1: φ_{3,1}(4521, 8470, 2847) = 3847391 mod 3000017 = 847374
Hash Head 2: φ_{3,2}(4521, 8470, 2847) = 3182717 mod 3000017 = 182700
...
Hash Head 8: φ_{3,8}(4521, 8470, 2847) = 3928374 mod 3000017 = 928357
The hash function is multiplicative-XOR:
\[\phi(x_1, x_2, x_3) = ((x_1 \cdot p_1) \oplus (x_2 \cdot p_2) \oplus (x_3 \cdot p_3)) \mod M\]
Where $p_1, p_2, p_3$ are different prime multipliers per head, and $M$ is a prime table size.
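A minimal sketch of this multiplicative-XOR scheme; the multiplier constants below are random stand-ins (the paper uses per-head prime multipliers whose values are not given), and only the table size matches the one quoted elsewhere in this analysis:

# Sketch of multi-head multiplicative-XOR hashing; multipliers are stand-ins.
import random

M = 3_000_017   # prime table size per head
K = 8           # number of hash heads

rng = random.Random(0)
MULTIPLIERS = [tuple(rng.randrange(1, 2**31) | 1 for _ in range(3)) for _ in range(K)]

def phi(ngram, head):
    # Hash a trigram of canonical token IDs into a slot index for one head.
    x1, x2, x3 = ngram
    p1, p2, p3 = MULTIPLIERS[head]
    return ((x1 * p1) ^ (x2 * p2) ^ (x3 * p3)) % M

slots = [phi((4521, 8470, 2847), h) for h in range(K)]
print(slots)   # K independent slot indices for the same N-gram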
Step 5: Embedding Retrieval
Look up embeddings from each table:
e_{3,3,1} = E_{3,1}[847374] ∈ ℝ^{d/K} # Head 1 embedding
e_{3,3,2} = E_{3,2}[182700] ∈ ℝ^{d/K} # Head 2 embedding
...
e_{3,3,8} = E_{3,8}[928357] ∈ ℝ^{d/K} # Head 8 embedding
Step 6: Concatenate Across Heads and N-gram Orders
For position t=3 ("way"):
From bigrams (N=2):
e_{3,2} = [e_{3,2,1} ∥ e_{3,2,2} ∥ ... ∥ e_{3,2,8}] # "milky" → "way"
From trigrams (N=3):
e_{3,3} = [e_{3,3,1} ∥ e_{3,3,2} ∥ ... ∥ e_{3,3,8}] # "the milky" → "way"
Final memory vector:
e₃ = [e_{3,2} ∥ e_{3,3}] ∈ ℝ^{d_{mem}}
Step 7: Context-Aware Gating
The retrieved embedding $e_3$ is static—it doesn’t know the actual context. The gating mechanism modulates it:
\[k_3 = W_K \cdot e_3\]
\[v_3 = W_V \cdot e_3\]
\[\alpha_3 = \sigma\left(\frac{\text{RMSNorm}(h_3)^\top \cdot \text{RMSNorm}(k_3)}{\sqrt{d}}\right)\]
\[\tilde{v}_3 = \alpha_3 \cdot v_3\]
If the context $h_3$ (from preceding Transformer layers) is incompatible with the retrieved memory (e.g., hash collision retrieved “Milky Way candy bar” context when we need “Milky Way galaxy”), $\alpha_3 \rightarrow 0$ and the memory is suppressed.
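A numeric sketch of this gate with random stand-ins for $h_3$, $e_3$, $W_K$, and $W_V$ (all dimensions and values are illustrative only):

# Numeric sketch of Step 7; all tensors are random stand-ins with illustrative shapes.
import numpy as np

d, d_mem = 64, 32
rng = np.random.default_rng(0)

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2) + eps)

e3 = rng.normal(size=d_mem)        # retrieved (static) memory vector
h3 = rng.normal(size=d)            # context-dependent hidden state from the backbone
W_K = rng.normal(size=(d, d_mem))  # key projection
W_V = rng.normal(size=(d, d_mem))  # value projection

k3, v3 = W_K @ e3, W_V @ e3
alpha3 = 1 / (1 + np.exp(-(rmsnorm(h3) @ rmsnorm(k3)) / np.sqrt(d)))
v_tilde = alpha3 * v3              # pushed toward zero when context and memory disagree
print(float(alpha3))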
Visual Diagram of the Full Process
┌─────────────────────────────────────────────────────────────────────────────┐
│ ENGRAM EMBEDDING PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ INPUT: "The Milky Way" │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ STANDARD TOKENIZER │ │
│ │ (BPE/SentencePiece)│ │
│ └──────────┬──────────┘ │
│ │ │
│ ▼ │
│ Raw IDs: [1847, 29341, 8472, 15234] │
│ "The" " Mil" "ky" " Way" │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ TOKENIZER COMPRESSION P: V → V' │ │
│ │ ┌───────────────────────────────────┐ │ │
│ │ │ • NFKC normalization │ │ │
│ │ │ • Lowercasing │ │ │
│ │ │ • Whitespace stripping │ │ │
│ │ │ • Diacritic removal │ │ │
│ │ └───────────────────────────────────┘ │ │
│ └──────────┬──────────────────────────────┘ │
│ │ │
│ ▼ │
│ Canonical IDs: [892, 4521, 8470, 2847] │
│ "the" "mil" "ky" "way" │
│ │ │
│ │ ┌──────────────────────────────────────────┐ │
│ │ │ KEY INSIGHT: These canonical IDs are │ │
│ │ │ IDENTICAL for "THE MILKY WAY", │ │
│ │ │ "the milky way", " The Milky Way", etc. │ │
│ │ └──────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ FORM N-GRAM CONTEXTS │ │
│ │ │ │
│ │ Bigrams: (the,mil) (mil,ky) (ky,way) │ │
│ │ Trigrams: (the,mil,ky) (mil,ky,way) │ │
│ └──────────┬──────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ MULTI-HEAD HASHING │ │
│ │ │ │
│ │ For each N-gram, K hash functions: │ │
│ │ │ │
│ │ (mil,ky,way) ──┬── φ₁ ──→ idx: 847374 │ │
│ │ ├── φ₂ ──→ idx: 182700 │ │
│ │ ├── φ₃ ──→ idx: 293847 │ │
│ │ └── ... │ │
│ └──────────┬──────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ EMBEDDING TABLE LOOKUP (O(1)) │ │
│ │ │ │
│ │ E_{3,1}[847374] ──→ [0.12, -0.34, 0.87, ...] (d/K dims) │ │
│ │ E_{3,2}[182700] ──→ [0.45, 0.23, -0.11, ...] (d/K dims) │ │
│ │ ... │ │
│ │ │ │
│ │ Concatenate all heads & N-gram orders: │ │
│ │ e_t = [e_{2-gram} ∥ e_{3-gram}] ∈ ℝ^{d_{mem}} │ │
│ └──────────┬──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ CONTEXT-AWARE GATING │ │
│ │ │ │
│ │ h_t ─────────────────────┐ │ │
│ │ (from Transformer) │ │ │
│ │ ▼ │ │
│ │ e_t ───→ W_K ───→ k_t ───┼───→ α_t = σ(h_t · k_t / √d) │ │
│ │ └──→ W_V ───→ v_t ───┼───→ ṽ_t = α_t · v_t │ │
│ │ │ │ │
│ │ If context mismatches retrieved memory: α_t → 0 (suppress) │ │
│ │ If context aligns with retrieved memory: α_t → 1 (use it) │ │
│ └──────────┬──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ RESIDUAL CONNECTION TO BACKBONE │ │
│ │ │ │
│ │ H^(ℓ) ← H^(ℓ) + Conv(ṽ_t) │ │
│ └─────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Why This Matters for N-gram Efficiency
Without Tokenizer Compression
Consider training on a corpus with these occurrences:
"the Milky Way" appears 10,000 times
"The Milky Way" appears 8,000 times
"THE MILKY WAY" appears 500 times
" the milky way" appears 3,000 times
Without compression, these are four separate N-gram entries, each with fewer training examples. The embedding for each variant is learned independently.
With Tokenizer Compression
All 21,500 occurrences contribute to a single canonical N-gram:
(the, milky, way) → single embedding learned from 21,500 examples
This provides:
- Better statistics: More training signal per pattern
- Smaller tables: 23% fewer slots needed
- Better generalization: Rare variants (like “THE MILKY WAY”) benefit from common variants
- Reduced hash collisions: Fewer unique N-grams means lower collision probability per slot
The Surjective Function in Practice
The mapping $P$ is implemented as a precomputed lookup table:
# Pseudocode for tokenizer compression
import unicodedata
from collections import defaultdict

class TokenizerCompressor:
    def __init__(self, tokenizer):
        self.projection = {}  # Raw ID → Canonical ID
        # Group tokens by normalized form
        normalized_groups = defaultdict(list)
        for token_id in range(tokenizer.vocab_size):
            token_text = tokenizer.decode([token_id])
            canonical = self.normalize(token_text)
            normalized_groups[canonical].append(token_id)
        # Assign canonical IDs
        canonical_id = 0
        for canonical_text, raw_ids in normalized_groups.items():
            for raw_id in raw_ids:
                self.projection[raw_id] = canonical_id
            canonical_id += 1
        self.compressed_vocab_size = canonical_id  # ~77% of original

    def normalize(self, text):
        text = unicodedata.normalize('NFKC', text)  # Unicode normalization
        text = text.lower()                          # Lowercase
        text = text.strip()                          # Strip whitespace
        # Additional normalizations (e.g., diacritic removal)...
        return text

    def compress(self, token_ids):
        return [self.projection[tid] for tid in token_ids]
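A hedged usage sketch with a tiny stub tokenizer (the stub exposes only the `vocab_size` and `decode` attributes the class relies on; all IDs and strings are made up):

# Hypothetical usage with a six-token stub tokenizer; IDs and strings are made up.
class StubTokenizer:
    _table = {0: "The", 1: "the", 2: " the", 3: "apple", 4: "Apple", 5: " Apple"}
    vocab_size = len(_table)
    def decode(self, ids):
        return self._table[ids[0]]

compressor = TokenizerCompressor(StubTokenizer())
print(compressor.compress([0, 1, 2]))    # all three "the" variants map to one canonical ID
print(compressor.compressed_vocab_size)  # 2 canonical forms: "the", "apple"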
The compression happens only for Engram indexing—the main Transformer backbone still uses the original token IDs and embeddings. This is crucial: you want the model to distinguish “Apple” (company) from “apple” (fruit) in its representations, but for N-gram pattern matching, the surface form variations shouldn’t matter.
3. Multi-Head Hashing: Principled Collision Mitigation
The Combinatorial Problem
For vocabulary size $\lvert V \rvert = 128,000$ and trigrams ($N=3$):
\[\lvert\text{possible trigrams}\rvert = \lvert V \rvert^3 = 128,000^3 \approx 2.1 \times 10^{15}\]
You cannot allocate 2 quadrillion embedding slots. So you must compress this space.
Single-Head Hashing: The Naive Approach
A single hash function maps the astronomical N-gram space to a manageable table:
\[\phi: \mathbb{Z}^N \rightarrow \{0, 1, \ldots, M-1\}\]
Where $M$ might be 3 million (a prime, for better distribution).
("the", "milky", "way") ──φ──→ slot 847,374
("quantum", "field", "theory") ──φ──→ slot 2,391,847
("the", "red", "apple") ──φ──→ slot 847,374 ← COLLISION!
The problem: When two unrelated N-grams collide, they share the same embedding. The model learns a muddled average of all colliding patterns. With $2 \times 10^{15}$ possible N-grams and $3 \times 10^6$ slots, each slot would be shared by ~700 million N-grams on average—catastrophic.
But most N-grams are vanishingly rare or never occur in training. The effective collision rate depends on the training distribution, which follows Zipf’s law. Still, collisions are inevitable and damaging with a single head.
Multi-Head Hashing: Collision Mitigation
The key insight is probabilistic: if two N-grams collide in one hash function, they almost certainly won’t collide in $K$ independent hash functions.
With $K=8$ heads, each with table size $M$:
\[P(\text{collision in all heads}) = \left(\frac{1}{M}\right)^K\]
For $M = 3,000,017$ (prime) and $K = 8$:
\[P(\text{total collision}) = \left(\frac{1}{3 \times 10^6}\right)^8 \approx 10^{-52}\]
This is astronomically unlikely. In practice, two N-grams will share some heads but not all heads:
N-gram A: "the milky way"
Head 1: slot 847,374 ←─┐
Head 2: slot 182,700 │ collision
Head 3: slot 2,918,374 │
Head 4: slot 501,283 ←──┼─┐
Head 5: slot 1,847,291 │ │
Head 6: slot 928,174 │ │
Head 7: slot 2,103,847 │ │
Head 8: slot 384,192 │ │
│ │
N-gram B: "the red apple" │ │
Head 1: slot 847,374 ←──┘ │ same slot (collision)
Head 2: slot 2,847,123 │ different
Head 3: slot 918,234 │ different
Head 4: slot 501,283 ←────┘ same slot (collision)
Head 5: slot 2,918,473 different
Head 6: slot 129,384 different
Head 7: slot 1,029,384 different
Head 8: slot 2,918,473 different
The concatenated embeddings are:
\(e_A = [E_1[847374] \| E_2[182700] \| E_3[2918374] \| \ldots \| E_8[384192]]\) \(e_B = [E_1[847374] \| E_2[2847123] \| E_3[918234] \| \ldots \| E_8[2918473]]\)
Even with 2 collisions out of 8 heads, 75% of the embedding dimensions are distinct. The representations remain distinguishable.
Visual: How Multi-Head Reduces Collision Damage
┌─────────────────────────────────────────────────────────────────────────────┐
│ SINGLE HEAD vs MULTI-HEAD HASHING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SINGLE HEAD (K=1): │
│ ═══════════════════ │
│ │
│ "the milky way" ───┐ │
│ ├───→ slot 847,374 ───→ [shared embedding] │
│ "the red apple" ───┘ (CONTAMINATED) │
│ │
│ Problem: 100% of embedding is shared on collision │
│ │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ MULTI-HEAD (K=8): │
│ ═══════════════════ │
│ │
│ "the milky way" Head 1 Head 2 Head 3 Head 4 ... Head 8 │
│ │ │ │ │ │ │ │
│ └──────────→ [847K] [183K] [2.9M] [501K] ... [384K] │
│ ↓ ↓ ↓ ↓ ↓ │
│ [ e₁ ∥ e₂ ∥ e₃ ∥ e₄ ∥ ... ∥ e₈ ] │
│ ↑ ↑ │
│ collision unique │
│ ↓ ↓ │
│ [ e₁ ∥ e₂' ∥ e₃' ∥ e₄ ∥ ... ∥ e₈' ] │
│ ↓ ↓ ↓ ↓ ↓ │
│ ┌──────────→ [847K] [2.8M] [918K] [501K] ... [2.9M] │
│ │ │ │ │ │ │ │
│ "the red apple" Head 1 Head 2 Head 3 Head 4 ... Head 8 │
│ │
│ Result: Only 2/8 = 25% of embedding dimensions collide │
│ 75% of representation is DISTINCT │
│ │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ COLLISION PROBABILITY: │
│ │
│ K=1: P(full collision) = 1/M ≈ 3×10⁻⁷ │
│ K=8: P(full collision) = (1/M)^8 ≈ 10⁻⁵² │
│ │
│ The multi-head design makes "catastrophic collision" essentially │
│ impossible while gracefully degrading on partial collisions. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Theoretical Foundation: Established Techniques
Multi-head hashing is a variant of several established methods:
1. Feature Hashing (The “Hashing Trick”)
Introduced by Weinberger et al. (2009) for high-dimensional sparse features:
\[\phi(x) = \sum_{j} \xi(j) \cdot x_j \cdot e_{h(j)}\]
Where $h()$ is a hash function and $\xi()$ is a sign function to reduce bias. Used extensively in large-scale ML (Vowpal Wabbit, scikit-learn’s HashingVectorizer).
2. Count-Min Sketch
A probabilistic data structure using multiple hash functions to estimate frequencies (Cormode & Muthukrishnan, 2005):
h₁ h₂ h₃ h₄
┌───┬───┬───┬───┐
│ 3 │ 0 │ 1 │ 2 │ ← row 1
├───┼───┼───┼───┤
│ 1 │ 4 │ 0 │ 1 │ ← row 2
├───┼───┼───┼───┤
│ 2 │ 1 │ 3 │ 0 │ ← row 3
└───┴───┴───┴───┘
Estimate = min(counts across all hash positions)
The minimum across heads gives a collision-robust estimate.
3. Bloom Filters
Test set membership with multiple hash functions—false positives possible, false negatives impossible. Same probabilistic principle: collision in all $K$ hashes is exponentially unlikely.
4. Random Projections (Johnson-Lindenstrauss)
The JL lemma guarantees that random projections preserve pairwise distances (Johnson & Lindenstrauss, 1984):
\[\|f(x) - f(y)\|_2 = (1 \pm \epsilon)\|x - y\|_2\]
With high probability, for appropriate target dimension.
Why Semantically-Uninformed Hashing Works
The hash functions are semantically uninformed—“milky way” (galaxy) and “milky way” (candy bar) are the same surface N-gram and therefore map to exactly the same slots, while “milky way” and “andromeda galaxy” won’t benefit from any shared structure despite being semantically related.
But this isn’t a bug, it’s a feature:
- Semantic similarity is handled by the Transformer backbone, not Engram
- Engram stores surface-level N-gram patterns, which are about co-occurrence, not meaning
- The gating mechanism provides semantic filtering after retrieval
The hash function’s job is simply to provide a consistent, deterministic, uniform mapping. It doesn’t need to be smart—it just needs to avoid systematic bias.
The Mathematical Guarantee
For $K$ independent hash functions with table size $M$, the expected number of “clean” dimensions (no collision with any other active N-gram) follows a balls-into-bins analysis.
If there are $n$ active N-grams in a batch:
\[E[\text{collisions per head}] = n - M\left(1 - \left(1 - \frac{1}{M}\right)^n\right) \approx \frac{n^2}{2M}\]
For typical batch sizes ($n \sim 4096$ tokens $\times$ 2 N-gram orders $= 8192$) and $M = 3 \times 10^6$:
\[E[\text{collisions per head}] \approx \frac{8192^2}{2 \times 3 \times 10^6} \approx 11\]
So ~11 collisions per head per batch, but the probability of the same pair colliding across all 8 heads is negligible.
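A quick check of this expectation (same $n$ and $M$ as above; purely arithmetic, not from the paper):

# Check the balls-into-bins expectation against the quoted approximation.
n, M = 8192, 3_000_000   # active N-grams per batch, slots per head

exact = n - M * (1 - (1 - 1 / M) ** n)   # expected colliding N-grams per head
approx = n ** 2 / (2 * M)                # the n^2 / 2M approximation used above

print(round(exact, 1), round(approx, 1))  # both ≈ 11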
The Gating Mechanism as Second Defense
Even when partial collisions occur, the context-aware gating provides semantic filtering:
# Retrieved embedding might contain collision noise
e_t = retrieve_ngram_embedding(context)  # Partially contaminated
# But the Transformer hidden state knows the true context
h_t = transformer_layers(input)  # Has global context
# Gating checks: "Does this retrieved memory match my context?"
alpha = sigmoid(dot(h_t, W_K @ e_t) / sqrt(d))
# If the context matches the retrieved memory:  alpha → 1.0 (use the memory)
# If it does not (likely collision noise):      alpha → 0.0 (suppress)
This is why Figure 7 in the paper (gating visualization) shows selective activation—the model learns to ignore retrieved embeddings when they don’t match the actual context.
Summary Assessment
| Aspect | Assessment |
|---|---|
| Is it dimensionality reduction? | Yes—from $\lvert V \rvert^N$ possible N-grams to $K \times M$ table slots |
| Is it semantically informed? | No—the hash function is semantically uninformed, but the design is principled |
| Is it lossy? | Yes, but multi-head makes catastrophic loss exponentially unlikely |
| Is it novel? | No—it’s a neural adaptation of feature hashing / count-min sketch |
| Why does it work? | Probabilistic guarantees + gating mechanism + Zipfian sparsity of N-grams |
The semantically-uninformed mapping is actually a strength: it requires no learning, is deterministic (enabling prefetching), and provides theoretical guarantees on collision rates. The semantic heavy lifting is delegated to the Transformer backbone and the learned gating mechanism.
4. Context-Aware Gating: Lightweight Dynamic Filtering
The Fundamental Tension
The paper presents a dichotomy:
- Static memory: $O(1)$ lookup, no computation, context-blind
- Dynamic computation: $O(d^2)$ per layer, full expressivity, context-aware
But pure static lookup has a fatal flaw: the same N-gram can mean different things in different contexts.
"The bank was steep" → riverbank (geography)
"The bank was closed" → financial institution
"The bank shot was perfect" → billiards term
All three share the bigram “The bank”, which would retrieve the same static embedding. Without some mechanism to disambiguate, Engram would inject irrelevant or contradictory information.
The Gating Computation
The Static Component (Context-Independent)
# These depend ONLY on the N-gram hash—identical every time "the bank" appears
e_t = lookup_ngram_embedding(token_ids) # Static: ℝ^{d_mem}
k_t = W_K @ e_t # Static: ℝ^d (linear projection)
v_t = W_V @ e_t # Static: ℝ^d (linear projection)
At this point, everything is deterministic. Given the same input tokens, you get identical $k_t$ and $v_t$ regardless of surrounding context.
The Dynamic Component (Context-Dependent)
# h_t comes from preceding Transformer layers—FULLY context-dependent
h_t = transformer_output[t] # Has seen full sequence via attention
# The gate computation is where static meets dynamic
q_t = RMSNorm(h_t) # Normalized query (dynamic)
k_t_norm = RMSNorm(k_t) # Normalized key (static)
# Scalar attention score
alpha_t = sigmoid(dot(q_t, k_t_norm) / sqrt(d)) # ∈ (0, 1)
# Gated output
v_tilde = alpha_t * v_t # Dynamic scaling of static vector
Geometric Interpretation
The output $\tilde{v}$ lives on a one-dimensional ray in embedding space:
v_t (static direction)
↑
│
│ α=1.0 (full activation)
│
●──────────────────→
│
│ α=0.5 (partial)
│
───────────●───────────────────→
│
│ α≈0 (suppressed)
●
origin
The direction is fixed; only the magnitude varies. This is fundamentally less expressive than full dynamic computation, where both direction and magnitude are context-dependent.
Comparison: Gating vs. Full Dynamic Computation
Full Attention (What Transformers Do)
# Every aspect is context-dependent
Q = W_Q @ H # Queries from all positions
K = W_K @ H # Keys from all positions
V = W_V @ H # Values from all positions
attention = softmax(Q @ K.T / sqrt(d)) # Full N×N interaction
output = attention @ V # Weighted combination of ALL values
Expressivity: Output can be any linear combination of value vectors, with weights determined dynamically by full sequence context.
Cost: $O(N^2 \cdot d)$ for sequence length $N$, dimension $d$
Full FFN (What MoE Experts Do)
# Arbitrary nonlinear transformation
hidden = activation(W_1 @ x + b_1) # Up-project
output = W_2 @ hidden + b_2 # Down-project
Expressivity: Can approximate any continuous function (universal approximation)
Cost: $O(d \cdot d_{ff})$ where $d_{ff}$ is typically $4d$
Engram Gating (What This Paper Does)
# Scalar modulation of static vector
alpha = sigmoid(dot(h_t, k_t) / sqrt(d)) # Single dot product
output = alpha * v_t # Scalar multiplication
Expressivity: Output constrained to ray defined by $v_t$; only magnitude varies
Cost: $O(d)$ for the dot product + $O(d)$ for the scaling = $O(d)$
Cost Comparison Table
| Mechanism | FLOPs per Token | Expressivity |
|---|---|---|
| Full Attention | $O(N \cdot d)$ | Any weighted combination of values |
| FFN Layer | $O(d \cdot d_{ff}) \approx O(4d^2)$ | Universal function approximation |
| MoE Expert | $O(d \cdot d_{ff} / \text{num\_experts})$ | Same, but sparse |
| Engram Gating | $O(d)$ | Scalar scaling of fixed direction |
Engram gating is orders of magnitude cheaper but dramatically less expressive.
The Computational Hierarchy
Engram creates a tiered computation strategy:
┌─────────────────────────────────────────────────────────────────────────────┐
│ COMPUTATIONAL HIERARCHY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ TIER 1: Static Lookup (O(1)) │
│ ════════════════════════════ │
│ • Hash N-gram → retrieve embedding │
│ • Zero computation, pure memory access │
│ • Handles: stereotyped patterns, named entities, collocations │
│ │
│ ↓ │
│ │
│ TIER 2: Lightweight Gating (O(d)) │
│ ══════════════════════════════════ │
│ • Single dot product + sigmoid │
│ • Filters out irrelevant/colliding retrievals │
│ • Handles: polysemy, hash collisions, context mismatch │
│ │
│ ↓ │
│ │
│ TIER 3: Full Transformer Computation (O(d²)) │
│ ════════════════════════════════════════════ │
│ • Attention + FFN/MoE │
│ • Full dynamic reasoning │
│ • Handles: composition, inference, long-range dependencies │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
The hypothesis is that many tokens don’t need Tier 3 processing—they’re predictable from local context and can be handled by Tiers 1+2. This frees Tier 3 capacity for tokens that actually require reasoning.
Expressivity Limitations of Gating
1. Cannot Change Direction
Context A: "The river bank was steep"
Context B: "The investment bank was profitable"
Both retrieve: v_t = [0.3, -0.2, 0.8, ...] (same static vector)
Context A gating: α_A * v_t = 0.9 * [0.3, -0.2, 0.8, ...]
Context B gating: α_B * v_t = 0.7 * [0.3, -0.2, 0.8, ...]
The outputs point in THE SAME DIRECTION—only magnitude differs.
The model cannot rotate "bank" toward financial vs. geographical meanings.
Who handles this? The Transformer backbone. The Engram contribution is added residually, and subsequent attention layers can still distinguish contexts.
2. Cannot Compose Information
"The large red ball"
Engram retrieves:
- bigram ("large", "red") → e_1
- bigram ("red", "ball") → e_2
- trigram ("large", "red", "ball") → e_3
But these are independent lookups—Engram cannot compute
the COMPOSITIONAL meaning "a ball that is both large and red"
beyond what's stored in e_3.
Who handles this? The Transformer’s attention mechanism, which can dynamically compose features across positions.
3. Cannot Reason
"If it's raining, then the ground is wet. It's raining. Therefore..."
No static N-gram lookup can complete this—it requires:
- Understanding conditional structure
- Applying modus ponens
- Generating "the ground is wet"
Who handles this? The full Transformer stack, which the paper argues now has more “effective depth” because it’s not wasting early layers on pattern matching.
The Spectrum of Staticness
FULLY STATIC FULLY DYNAMIC
│ │
▼ ▼
┌─────────┬──────────┬───────────┬──────────────┬───────────────┐
│ Raw │ Engram │ Engram │ Per-head │ Full │
│ N-gram │ (single │ (multi- │ Gating │ Attention │
│ Lookup │ gate) │ branch) │ (alternative)│ │
├─────────┼──────────┼───────────┼──────────────┼───────────────┤
│ e_t │ α·v_t │ Σ α_m·v_t │ Σ α_k·e_k │ softmax(QK)·V │
├─────────┼──────────┼───────────┼──────────────┼───────────────┤
│ O(1) │ O(d) │ O(M·d) │ O(K·d) │ O(N·d) │
├─────────┼──────────┼───────────┼──────────────┼───────────────┤
│ No │ 1D ray │ M rays │ K-dim │ Full span │
│ control │ │ (M=4) │ subspace │ of values │
└─────────┴──────────┴───────────┴──────────────┴───────────────┘
Engram with multi-branch gating sits at a sweet spot: enough dynamism to filter collisions and polysemy, cheap enough to not defeat the purpose of static lookup.
Alternative Design: Per-Head Gating
The current architecture uses a single scalar gate for the entire retrieved embedding:
e_t = concat([e_head_1, e_head_2, ..., e_head_K]) # All heads retrieved
alpha = compute_gate(h_t, e_t) # Single scalar
output = alpha * project(e_t) # All-or-nothing
An alternative would be per-head gating:
e_heads = [e_head_1, e_head_2, ..., e_head_K] # All heads retrieved
alphas = [compute_gate(h_t, e_k) for e_k in e_heads] # K separate gates
output = concat([α_k * e_k for α_k, e_k in zip(alphas, e_heads)])
Why this could work:
- Selective collision filtering: If head 3 collided but heads 1,2,4-8 didn’t, you could suppress only head 3
- Feature-specific modulation: Different heads might capture different aspects
- Smoother expressivity gradient: $K$-dimensional control surface instead of scalar
Why the paper doesn’t do this:
- Computational cost: $K$ separate dot products instead of 1
- Multi-branch integration already provides this: mHC with $M=4$ branches offers similar granularity
- Empirical sufficiency: The ablations suggest current design works well enough
Empirical Validation: Does Gating Actually Discriminate?
Figure 7 in the paper shows gating activation qualitatively:
"Only Alexander the Great could tame the horse Bucephalus."
Gating activation (α values):
"Alexander" → low (not end of pattern)
"the" → low
"Great" → HIGH (completes "Alexander the Great")
"could" → low
"tame" → low
"the" → low
"horse" → low
"Bucephalus" → medium (entity, but less stereotyped)
The gating successfully identifies where static patterns END—which is exactly where the retrieved embedding is most reliable (the full N-gram was seen in training). At positions mid-pattern or on novel combinations, gating suppresses the retrieval.
5. mHC-Engram Integration: Expressivity-Efficiency Tradeoffs
Manifold-Constrained Hyper-Connections (mHC) Background
The paper assumes familiarity with mHC (Xie et al., 2025).
Standard Residual Connection
Traditional Transformers use a single residual stream:
\[h^{(\ell+1)} = h^{(\ell)} + f(h^{(\ell)})\]All information flows through one pathway. The residual connection preserves the input, and $f()$ adds new information.
Hyper-Connections (HC)
Hyper-Connections expand this to $M$ parallel branches with learnable mixing:
┌─────────────────────────────────────────────────────────────────────────────┐
│ STANDARD RESIDUAL vs HYPER-CONNECTIONS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ STANDARD RESIDUAL (M=1): │
│ │
│ h ────────────────┬────────────────→ h + f(h) │
│ │ ↑ │
│ └───→ f(·) ───────────┘ │
│ │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ HYPER-CONNECTIONS (M=4): │
│ │
│ h₁ ──────┬───────────────────────────────┬──→ h₁' │
│ │ ╲ learnable │ │
│ h₂ ──────┼────────────╲ mixing ───────┼──→ h₂' │
│ │ ╲ weights │ │
│ h₃ ──────┼────────────────╲──────────────┼──→ h₃' │
│ │ ╲ │ │
│ h₄ ──────┼────────────────────╲──────────┼──→ h₄' │
│ │ ↓ │ │
│ └───────────→ f(·) ─────────────┘ │
│ │
│ Each output h_m' is a learned combination of: │
│ • All input branches h₁...h₄ │
│ • The transformation output f(combined_input) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
The “Manifold-Constrained” Part
The key innovation in mHC is constraining the connection weights to lie on a manifold that preserves certain geometric properties (like gradient flow stability). Without this constraint, having $M$ branches with arbitrary mixing weights can cause training instabilities.
Mathematically:
\[H_{\text{out}} = A \cdot H_{\text{in}} + B \cdot f(C \cdot H_{\text{in}})\]where $A, B, C$ satisfy manifold constraints ensuring stable gradients.
The Engram-mHC Integration Design
What’s Shared (Efficiency)
# ONE embedding table for all branches
E = shared_ngram_embedding_table # Massive: billions of parameters
# ONE value projection for all branches
W_V = shared_value_projection # Shape: d × d_mem
# The expensive retrieval happens ONCE
e_t = E[hash(ngram)] # O(1) lookup, same for all branches
v_t = W_V @ e_t # O(d × d_mem), computed once
What’s Separate (Expressivity)
# M DIFFERENT key projections, one per branch
W_K = [W_K_1, W_K_2, W_K_3, W_K_4] # M separate matrices
# Each branch computes its OWN gate
for m in range(M):
k_t_m = W_K[m] @ e_t # Branch-specific key
alpha_m = sigmoid(dot(h_t[m], k_t_m) / sqrt(d)) # Branch-specific gate
u_t[m] = alpha_m * v_t # Branch-specific output
Visual Representation of the Integration
┌─────────────────────────────────────────────────────────────────────────────┐
│ ENGRAM × mHC INTEGRATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ N-gram tokens │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Hash + Lookup │ ← SHARED (O(1), one retrieval) │
│ │ E[hash(·)] │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ e_t (static embedding) │
│ │ │
│ ┌──────────────┼──────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ W_V │ │ W_V │ │ W_V │ ← SHARED (one projection) │
│ │(shared)│ │(shared)│ │(shared)│ │
│ └───┬────┘ └───┬────┘ └───┬────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ v_t v_t v_t (identical value vectors) │
│ │ │ │ │
│ │ │ │ │
│ ┌───┴───┐ ┌───┴───┐ ┌───┴───┐ │
│ │ W_K_1 │ │ W_K_2 │ │ W_K_3 │ ← SEPARATE (M projections)│
│ └───┬───┘ └───┬───┘ └───┬───┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ k_t_1 k_t_2 k_t_3 (different key vectors) │
│ │ │ │ │
│ ┌──────┴──────┐ ┌────┴────┐ ┌─────┴─────┐ │
│ │ h_t[1] │ │ h_t[2] │ │ h_t[3] │ ← From mHC branches │
│ │ (query) │ │ (query) │ │ (query) │ │
│ └──────┬──────┘ └────┬────┘ └─────┬─────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ α_1 = σ(q·k_1) α_2 = σ(q·k_2) α_3 = σ(q·k_3) SEPARATE gates │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ α_1 · v_t α_2 · v_t α_3 · v_t (different magnitudes) │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ Branch 1 Branch 2 Branch 3 → to mHC residual │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
The Expressivity-Efficiency Spectrum
MAXIMUM EFFICIENCY MAXIMUM EXPRESSIVITY
(Minimum Parameters) (Maximum Parameters)
│ │
▼ ▼
┌───────────────┬───────────────┬───────────────┬───────────────┬───────────────┐
│ Config A │ Config B │ Config C │ Config D │ Config E │
│ │ │ (PAPER'S │ │ │
│ │ │ CHOICE) │ │ │
├───────────────┼───────────────┼───────────────┼───────────────┼───────────────┤
│ • 1 embed │ • 1 embed │ • 1 embed │ • 1 embed │ • M embed │
│ table │ table │ table │ table │ tables │
│ • 1 W_V │ • 1 W_V │ • 1 W_V │ • M W_V │ • M W_V │
│ • 1 W_K │ • M W_K │ • M W_K │ • M W_K │ • M W_K │
│ • 1 gate │ • 1 gate │ • M gates │ • M gates │ • M gates │
│ (shared) │ (shared) │ • 1 conv │ • M conv │ • M conv │
├───────────────┼───────────────┼───────────────┼───────────────┼───────────────┤
│ All branches │ Same key │ Different │ Different │ Completely │
│ get identical │ space, but │ gates, same │ value │ independent │
│ contribution │ branches │ value │ projections │ per branch │
│ │ vote together │ direction │ per branch │ │
├───────────────┼───────────────┼───────────────┼───────────────┼───────────────┤
│ Params: ~P │ Params: ~P+Md²│ Params: ~P+Md²│ Params: │ Params: │
│ │ │ │ ~P+2Md² │ ~M×P │
├───────────────┼───────────────┼───────────────┼───────────────┼───────────────┤
│ Expressivity: │ Expressivity: │ Expressivity: │ Expressivity: │ Expressivity: │
│ 1 scalar │ 1 scalar │ M scalars │ M vectors │ M independent │
│ for all │ (joint) │ (independent) │ (M rays) │ vectors │
└───────────────┴───────────────┴───────────────┴───────────────┴───────────────┘
The Tradeoff Relationship
The relationship is partially inverse, but with nuance:
The inverse relationship:
Engram Contribution
│
Pure Static │ Pure Dynamic
(no gating) │ (full attention)
│ │ │
▼ │ ▼
┌──────────────────┐ │ ┌──────────────────────────┐
│ Fixed embedding │ │ │ Query-dependent │
│ Same for all │◄─┼─►│ selection over all │
│ contexts │ │ │ memory positions │
└──────────────────┘ │ └──────────────────────────┘
│ │ │
O(1) cost │ O(N·d) cost
No expressivity │ Full expressivity
│
┌──────┴──────┐
│ Gating │
│ (O(d)) │
│ │
│ Intermediate│
│ cost & │
│ expressivity│
└─────────────┘
Orthogonal scaling axes:
The paper’s design achieves orthogonal scaling:
- mHC scales expressivity via branch parallelism (more information pathways)
- Engram scales capacity via memory size (more stored patterns)
These are somewhat independent axes:
Memory Size (Engram)
│
Low │ High
│
┌───────────────┼───────────────┐
│ │ │
Low │ Small, │ Large │
│ simple │ memory, │
Branches │ model │ simple │
(mHC) │ │ routing │
│───────────────┼───────────────│
│ │ │
High │ Small │ PAPER'S │
│ memory, │ CHOICE │
│ complex │ (27B total) │
│ routing │ │
└───────────────┴───────────────┘
Available Tuning Knobs
- Number of branches $M$: More branches = more expressive gating, more parameters
- What’s shared vs. separate: Could separate $W_V$ for direction control per branch
- Gating granularity: Vector gating instead of scalar for per-dimension control
- Hash heads per branch: Different heads for different branches
6. Memory-Compute Decoupling: Training vs. Inference Strategies
Training: Distributed HBM Sharding
The paper describes training system design:
“During training, to accommodate large-scale embedding tables, we employ standard model parallelism by sharding the tables across available GPUs. An All-to-All communication primitive is used to gather active rows in the forward pass and dispatch gradients in the backward pass.”
This is not offloading to host memory. During training:
- The embedding tables are split across multiple GPUs
- Each GPU holds $1/N$ of the table in its HBM
- All2All communication gathers the needed embeddings
- Gradients flow back via the same All2All primitive
┌─────────────────────────────────────────────────────────────────────────────┐
│ TRAINING: DISTRIBUTED SHARDING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ GPU 0 (HBM) GPU 1 (HBM) GPU 2 (HBM) GPU 3 (HBM) │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐│
│ │ E[0:M/4] │ │E[M/4:M/2] │ │E[M/2:3M/4]│ │E[3M/4:M] ││
│ │ │ │ │ │ │ │ ││
│ │ Shard 0 │ │ Shard 1 │ │ Shard 2 │ │ Shard 3 ││
│ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘│
│ │ │ │ │ │
│ └───────────────────┴───────────────────┴───────────────────┘ │
│ │ │
│ All-to-All │
│ Communication │
│ │ │
│ ▼ │
│ Each GPU receives embeddings it needs │
│ for its local batch of tokens │
│ │
│ MEMORY REQUIREMENT: Total Engram params / Number of GPUs per GPU │
│ STILL IN HBM: Yes, distributed but still on-device │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Training still requires substantial HBM—it’s just distributed.
Inference: Host Memory Offloading with Prefetching
The dramatic memory savings apply to inference:
“During inference, this deterministic nature enables a prefetch-and-overlap strategy. Since memory indices are known prior to the forward pass, the system can asynchronously retrieve embeddings from abundant host memory via PCIe.”
┌─────────────────────────────────────────────────────────────────────────────┐
│ INFERENCE: HOST MEMORY OFFLOADING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ HOST MEMORY (DRAM) │
│ ┌─────────────────────────────┐ │
│ │ │ │
│ │ ENGRAM EMBEDDING TABLE │ │
│ │ (100B params) │ │
│ │ │ │
│ │ ~200GB at FP16 │ │
│ │ (abundant, cheap) │ │
│ │ │ │
│ └──────────────┬──────────────┘ │
│ │ │
│ PCIe Transfer │
│ (async, prefetched) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────┐ │
│ │ GPU (HBM) │ │
│ │ │ │
│ │ ┌───────────────────────┐ │ │
│ │ │ Transformer Backbone │ │ │
│ │ │ (4B-8B params) │ │ │
│ │ │ ~8-16GB │ │ │
│ │ └───────────────────────┘ │ │
│ │ │ │
│ │ ┌───────────────────────┐ │ │
│ │ │ Prefetch Buffer │ │ ← Small buffer for │
│ │ │ (active embeddings) │ │ current batch │
│ │ │ ~few MB │ │ │
│ │ └───────────────────────┘ │ │
│ │ │ │
│ │ Total HBM: ~10-20GB │ │
│ │ (NOT 100GB+) │ │
│ │ │ │
│ └─────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Why Prefetching Works for Inference But Not Training
The critical enabler is deterministic addressing:
Inference: Indices Known in Advance
# Illustrative pseudocode: compute_hashes, async_prefetch, await_prefetch and the
# layer functions are stand-ins, not the paper's API.
# At inference time, the input sequence is fixed
input_tokens = [1847, 29341, 8472, 15234, 31847]

# All Engram indices can be computed BEFORE the forward pass
ngram_indices_layer_2 = compute_hashes(input_tokens, layer=2)
ngram_indices_layer_15 = compute_hashes(input_tokens, layer=15)

# Start prefetching while Layer 1 computes
async_prefetch(ngram_indices_layer_2)            # PCIe transfer begins

# By the time Layer 2 runs, the embeddings are already in the GPU buffer
layer_1_output = transformer_layer_1(input_tokens)
engram_embeddings = await_prefetch()             # already arrived
layer_2_output = engram_layer_2(layer_1_output, engram_embeddings)
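To show why the indices are knowable in advance, here is a minimal sketch of deterministic multi-head N-gram hashing, assuming suffix N-grams, K independent heads, and a single shared table size (illustrative choices, not the paper's exact scheme).

def suffix_ngrams(tokens, n):
    """The n-gram ending at each position (shorter at the sequence start)."""
    return [tuple(tokens[max(0, i - n + 1): i + 1]) for i in range(len(tokens))]

def compute_hashes(tokens, layer, n=3, num_heads=4, table_size=2**20):
    """K independent hash heads per n-gram -> K table indices per position.
    Purely a function of the tokens, so it can run before the forward pass."""
    indices = []
    for gram in suffix_ngrams(tokens, n):
        indices.append([hash((layer, head, gram)) % table_size
                        for head in range(num_heads)])
    return indices

# Indices depend only on the tokens -- this determinism is what enables prefetching
print(compute_hashes([1847, 29341, 8472, 15234, 31847], layer=2)[:2])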
Training: Gradient-Dependent Updates
# Schematic pseudocode: during training, EVERY accessed embedding needs a gradient.
# Forward pass
e_t = E[hash(ngram)]                 # retrieved from host memory?
loss = compute_loss(model(e_t))
# Backward pass
grad_e_t = d_loss / d_e_t            # gradient must travel BACK to the host
E[hash(ngram)] -= lr * grad_e_t      # update happens in host memory
# Problem 1: PCIe bandwidth for gradients flowing back to the host
# Problem 2: where do the optimizer states (Adam momentum, variance) live?
# Problem 3: gradient accumulation across distributed batches
Training requires bidirectional, latency-sensitive communication and does not tolerate PCIe bottlenecks as gracefully as the inference-time prefetch path does.
Memory Accounting Examples
Traditional 100B model:
\[100\text{B params} \times 2 \text{ bytes (FP16)} = 200\text{GB HBM}\]
Requires 3× H100 80GB (tensor parallel) or 2× H200 141GB.
Engram 100B (8B backbone + 92B Engram):
| Component | Location | Size |
|---|---|---|
| Transformer Backbone | GPU HBM | ~16GB |
| Engram Tables | Host DRAM | ~184GB |
| Total GPU HBM | | ~16GB |
Can run on 1× RTX 4090 24GB + 256GB host RAM.
Important caveat: these are not equivalent models. The 8B backbone limits reasoning depth; for knowledge-retrieval-heavy tasks, however, the configuration may be competitive.
Key Insight: Breaking the 1GB/1B Rule
The traditional rule of thumb (1GB HBM per 1B parameters) doesn’t apply to Engram inference because:
- Engram parameters can reside in host DRAM
- Prefetch-overlap masks PCIe latency
- Zipfian caching further reduces effective latency
This enables running larger parameter models on memory-constrained hardware, but the Engram parameters provide “memory capacity” not “reasoning depth.”
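A back-of-the-envelope helper ties these points together (FP16 weights assumed; the split mirrors the hypothetical 8B + 92B example above).

def hbm_required_gb(backbone_params_b, engram_params_b, bytes_per_param=2,
                    offload_engram=True, prefetch_buffer_gb=0.5):
    """Rough inference-time HBM estimate: the backbone stays on-GPU, while the
    Engram tables can live in host DRAM when offloading is enabled."""
    backbone_gb = backbone_params_b * bytes_per_param
    engram_gb = 0.0 if offload_engram else engram_params_b * bytes_per_param
    return backbone_gb + engram_gb + prefetch_buffer_gb

print(hbm_required_gb(8, 92, offload_engram=False))  # ~200 GB: dense-style placement
print(hbm_required_gb(8, 92, offload_engram=True))   # ~16.5 GB: Engram in host DRAM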
7. The U-Shaped Allocation Curve: Empirical Findings and Open Questions
What The Paper Says
“This observed U-shape confirms the structural complementarity between the two modules:
- MoE-dominated ($\rho \rightarrow 100\%$): The model lacks dedicated memory for static patterns, forcing it to inefficiently reconstruct them through depth and computation.
- Engram-dominated ($\rho \rightarrow 0\%$): The model loses conditional computation capacity, hurting tasks that require dynamic, context-dependent reasoning; memory cannot replace computation in this regime.”
What The Paper Doesn’t Explain
- A formal model predicting the optimal allocation ratio
- An explanation for why the optimum is at ~75-80% specifically (rather than 50% or 90%)
- Analysis of the curve shape—why U-shaped rather than V-shaped, linear, or asymmetric?
- Scale dependence—does the optimal $\rho$ shift as total parameters increase?
- Task dependence—is $\rho^*$ different for knowledge-intensive vs. reasoning tasks?
Plausible Hypotheses
Hypothesis 1: Diminishing Returns on Each Axis
Both MoE experts and Engram slots likely have diminishing returns following power laws:
\[L_{\text{MoE}}(n) = A \cdot n^{-\alpha}, \qquad L_{\text{Engram}}(m) = B \cdot m^{-\beta}\]
Under a fixed sparse-parameter budget $P_{\text{sparse}}$:
\[n \cdot p_{\text{expert}} + m \cdot p_{\text{slot}} = P_{\text{sparse}}\]
If the two loss contributions combine roughly additively, the constrained minimum falls at an interior point whenever both exponents are positive and the coefficients are comparable, producing the observed U-shape.
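A worked first-order condition under these hypothesized power laws ($A$, $B$, $\alpha$, $\beta$ are free parameters of the hypothesis, not values reported in the paper):
\[\min_{n,\,m}\; A n^{-\alpha} + B m^{-\beta} \quad \text{s.t.} \quad n\,p_{\text{expert}} + m\,p_{\text{slot}} = P_{\text{sparse}}\]
\[\Rightarrow\quad \frac{\alpha A\, n^{-(\alpha+1)}}{p_{\text{expert}}} \;=\; \frac{\beta B\, m^{-(\beta+1)}}{p_{\text{slot}}}\]
The optimum equalizes the marginal loss reduction per sparse parameter across the two axes; combining this condition with the budget constraint pins down an interior $\rho^*$ as a function of $(A, B, \alpha, \beta)$.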
Hypothesis 2: Functional Specialization with Coverage Requirements
If ~30-40% of tokens are predictable from local patterns and ~60-70% require reasoning, optimal allocation should roughly match this split. The paper’s finding ($\rho^* \approx 0.75\text{-}0.80$) is slightly higher, suggesting MoE is somewhat less efficient at its task than Engram at its task.
Hypothesis 3: Hash Collision Saturation
Beyond some point, additional Engram capacity hits diminishing returns due to training signal sparsity for rare N-grams, while MoE experts can generalize across inputs.
Assessment
The U-shaped curve is an empirical finding, not a derived scaling law. The paper provides no predictive equations for optimal allocation in new configurations. Practitioners can use $\rho \approx 0.75\text{-}0.80$ as a starting point but have no principled way to adjust for specific use cases without running their own sweeps.
8. Scaling Experiments: Zero-Sum Allocation vs. Additive Scaling
Experiment 1: Zero-Sum Allocation (Section 3.1)
Setup: Fixed total parameters $P_{\text{tot}}$, fixed activated parameters $P_{\text{act}}$. Sweep allocation ratio $\rho$.
Finding: U-shaped loss curve with optimal $\rho \approx 75\text{-}80\%$. Reallocating ~20-25% of sparse budget from MoE to Engram improves performance.
┌─────────────────────────────────────────────────────────────────────────────┐
│ EXPERIMENT 1: ZERO-SUM ALLOCATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Total Sparse Budget: 10B parameters (fixed) │
│ │
│ ρ = 100%: MoE gets 10B │ Engram gets 0B → Loss: 1.7248 │
│ ρ = 80%: MoE gets 8B │ Engram gets 2B → Loss: 1.7109 (optimal) │
│ ρ = 60%: MoE gets 6B │ Engram gets 4B → Loss: ~1.715 │
│ ρ = 40%: MoE gets 4B │ Engram gets 6B → Loss: ~1.725 │
│ │
│ KEY INSIGHT: Trading MoE capacity for Engram capacity improves loss │
│ up to a point, then hurts. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Interpretation: Under fixed constraints, some Engram is better than pure MoE. This is genuine reallocation—MoE capacity decreases as Engram increases.
Experiment 2: Additive Scaling (Section 3.2)
Setup: Fixed MoE backbone ($P_{\text{tot}} \approx 3\text{B}$, $P_{\text{act}} = 568\text{M}$). Sweep Engram capacity from 0.3B to 13B.
Finding: Log-linear scaling—loss decreases linearly with $\log(\text{slots})$. No saturation observed.
┌─────────────────────────────────────────────────────────────────────────────┐
│ EXPERIMENT 2: ADDITIVE SCALING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Fixed MoE Backbone: 3B parameters (UNCHANGED across all runs) │
│ │
│ Config A: MoE: 3B + Engram: 0.3B = Total: 3.3B → Loss: ~1.81 │
│ Config B: MoE: 3B + Engram: 1B = Total: 4B → Loss: ~1.78 │
│ Config C: MoE: 3B + Engram: 5B = Total: 8B → Loss: ~1.76 │
│ Config D: MoE: 3B + Engram: 13B = Total: 16B → Loss: ~1.74 │
│ │
│ The PROPORTION of Engram increases (9% → 81%), but this is because │
│ the denominator grows, NOT because MoE shrinks. │
│ │
│ NOTHING IS OFFLOADED—MoE capacity is preserved. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Interpretation: With unconstrained memory, more Engram is always better (within tested range). Nothing is “offloaded”—MoE is unchanged, Engram is added.
Critical Distinction
| Aspect | Experiment 1 | Experiment 2 |
|---|---|---|
| MoE params | Decreases as $\rho \downarrow$ | Constant |
| Engram params | Increases as $\rho \downarrow$ | Increases |
| Total params | Constant | Increases |
| Nature | Reallocation | Addition |
The OverEncoding Comparison
OverEncoding (Huang et al., 2025a) also uses hash-based N-gram embeddings but:
- Injects at layer 0 only (no prefetch overlap)
- Uses fixed averaging (no gating)
- No tokenizer compression
┌─────────────────────────────────────────────────────────────────────────────┐
│ OVERENCODING vs ENGRAM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ OVERENCODING: │
│ ───────────── │
│ • N-gram embeddings retrieved at INPUT LAYER (Layer 0) │
│ • Integration: AVERAGING with vocabulary embedding │
│ • No gating mechanism │
│ • No context-awareness │
│ │
│ input_embedding = 0.5 * vocab_embed[token] + 0.5 * ngram_embed[hash] │
│ │
│ ENGRAM: │
│ ─────── │
│ • N-gram embeddings retrieved at INTERMEDIATE LAYERS (e.g., 2, 15) │
│ • Integration: GATED RESIDUAL addition │
│ • Context-aware gating (can suppress irrelevant retrievals) │
│ • Tokenizer compression │
│ │
│ hidden = hidden + gate(context, ngram_embed) * project(ngram_embed) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Engram extracts more value from equivalent memory budget due to deeper injection, gating, and compression. The comparison demonstrates that how you integrate N-gram memory matters as much as whether you include it.
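A minimal side-by-side sketch of the two integration styles in PyTorch-flavoured Python; tensor shapes, the projection, and the gate form are illustrative rather than either paper's exact implementation.

import torch

d = 512
vocab_embed, ngram_embed = torch.randn(d), torch.randn(d)    # toy vectors
hidden, context_key = torch.randn(d), torch.randn(d)

# OverEncoding-style: fixed averaging at the input layer, no gating
overencoding_input = 0.5 * vocab_embed + 0.5 * ngram_embed

# Engram-style: gated residual at an intermediate layer; the gate can suppress
# collision-contaminated or contextually irrelevant retrievals
gate = torch.sigmoid((hidden * context_key).sum() / d ** 0.5)  # scalar in (0, 1)
project = torch.nn.Linear(d, d)                                # learned projection
engram_hidden = hidden + gate * project(ngram_embed)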
9. Strategic Publication Patterns and V4 Deployment Probability
The Undertraining Signal
The paper explicitly acknowledges:
“Finally, scaling to Engram-40B further reduces pre-training loss and improves performance across most benchmarks. Although it does not yet strictly dominate Engram-27B on every task, this is likely an artifact of under-training. We observe that the training loss gap between Engram-40B and the baselines continues to widen towards the end of training, suggesting that the expanded memory capacity has not yet fully saturated within the current token budget.”
Evidence of Undertraining
- Loss gap widening: Engram-40B advantage over baselines increases toward training end
- Inconsistent benchmark dominance: Engram-40B regresses on code tasks (HumanEval 38.4 vs Engram-27B 40.8)
Historical Pattern: DeepSeek Publication → Deployment
| Innovation | Paper Date | Deployed In | Lag |
|---|---|---|---|
| DeepSeekMoE | Jan 2024 | V2 (May 2024) | ~4 months |
| MLA | May 2024 | V2 (May 2024) | <1 month |
| Aux-loss-free | Nov 2024 | V3 (Dec 2024) | ~1 month |
| mHC | Dec 2024 | V3 (Dec 2024) | <1 month |
| Engram | Jan 2026 | V4 (???) | ??? |
Pattern: Every major DeepSeek paper has been deployed. No exception.
The Strategic Publication Pattern
┌─────────────────────────────────────────────────────────────────────────────┐
│ RESEARCH PAPER AS STRATEGIC SIGNAL │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PHASE 1: INTERNAL R&D (Not Published) │
│ ─────────────────────────────────────────────────────────────────────────│
│ • Full-scale experiments at frontier compute │
│ • Production infrastructure development │
│ • Integration with existing model architecture (V3) │
│ • Iterative refinement based on internal benchmarks │
│ │
│ ↓ │
│ │
│ PHASE 2: ACADEMIC PUBLICATION (This Paper) │
│ ─────────────────────────────────────────────────────────────────────────│
│ • Establish intellectual priority │
│ • Validate core concepts at reduced scale │
│ • Describe (but don't fully benchmark) production design │
│ • Signal direction to recruit talent and shape field │
│ • Deliberately omit frontier-scale results │
│ │
│ ↓ │
│ │
│ PHASE 3: PRODUCT LAUNCH (V4 Announcement) │
│ ─────────────────────────────────────────────────────────────────────────│
│ • Reveal frontier-scale performance │
│ • Cite own paper as foundation │
│ • Competitors now 6-12 months behind │
│ • Academic paper provides legitimacy │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Probability Assessment
| Scenario | Probability | Description |
|---|---|---|
| Full adoption | ~35% | Engram as described, scaled to 100B+ |
| Modified adoption | ~40% | Engram-like module with undisclosed modifications |
| Partial/optional | ~15% | Engram in specific variants only |
| Deferred/abandoned | ~10% | Internal issues prevent deployment |
Combined probability of some adoption: ~75%
Expected timeline: Based on historical patterns (papers 1-4 months before deployment), V4 announcement likely in Q1-Q2 2026.
10. Long-Context Enhancement: The Attention Capacity Mechanism
The Experimental Design
The paper compares models with matched base quality (iso-loss) to isolate architectural effects from general capability differences. This methodology is itself a contribution—many long-context papers conflate these factors.
Key Results: Iso-Loss Comparison
Both models at pre-training loss = 1.63:
| Metric | MoE-27B (50k) | Engram-27B (46k) | $\Delta$ |
|---|---|---|---|
| LongPPL - Book | 4.38 | 4.19 | -0.19 |
| LongPPL - Paper | 2.91 | 2.84 | -0.07 |
| LongPPL - Code | 2.49 | 2.45 | -0.04 |
| NIAH - Multi-Query | 84.2 | 97.0 | +12.8 |
| Variable Tracking | 77.0 | 87.2 | +10.2 |
| Frequent Words Extraction | 73.0 | 98.6 | +25.6 |
The Mechanism: Attention Capacity Hypothesis
Standard Transformers use attention for both local patterns and long-range dependencies. At long contexts, these compete:
┌─────────────────────────────────────────────────────────────────────────────┐
│ ATTENTION CAPACITY HYPOTHESIS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ STANDARD TRANSFORMER: │
│ ───────────────────── │
│ │
│ ┌─────────────────────────────────────────────┐ │
│ │ ATTENTION BUDGET │ │
│ │ ┌─────────────────┬───────────────────┐ │ │
│ │ │ Local Patterns │ Global Patterns │ │ │
│ │ │ "the cat sat" │ "earlier, John" │ │ │
│ │ │ │ "mentioned that" │ │ │
│ │ │ ~60% │ ~40% │ │ │
│ │ └─────────────────┴───────────────────┘ │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ At long contexts, local patterns "crowd out" global attention. │
│ │
│ │
│ ENGRAM TRANSFORMER: │
│ ─────────────────── │
│ │
│ ┌─────────────────────────────────────────────┐ │
│ │ ATTENTION BUDGET │ │
│ │ ┌─────────────────────────────────────┐ │ │
│ │ │ Global Patterns Only │ │ │
│ │ │ "earlier, John mentioned that..." │ │ │
│ │ │ │ │ │
│ │ │ ~90% │ │ │
│ │ └─────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────┐ │
│ │ ENGRAM (separate budget) │ │
│ │ ┌─────────────────────────────────────┐ │ │
│ │ │ Local Patterns Only │ │ │
│ │ │ "the cat sat", "New York" │ │ │
│ │ │ O(1) lookup │ │ │
│ │ └─────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ By handling local patterns separately, attention can focus │
│ entirely on long-range dependencies. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Task-Specific Pattern
| Task Type | Engram Advantage | Explanation |
|---|---|---|
| LongPPL (average perplexity) | Modest (+2-5%) | Most tokens are locally predictable |
| Single-needle retrieval | None (~0%) | Easy for both architectures |
| Multi-query retrieval | Large (+15%) | Attention-limited in MoE |
| Variable tracking | Large (+16%) | Requires multi-hop retrieval |
| Frequent word extraction | Huge (+36%) | Requires global scanning |
The pattern confirms the mechanism: Engram’s advantage scales with how much the task requires global (vs. local) attention.
11. The Effective Depth Hypothesis: Bypassing Pattern Reconstruction
The Claim
“By equipping the model with an explicit knowledge lookup capability, Engram effectively mimics an increase in model depth by relieving the model of the early stages of feature composition.”
What Early Layers Do
Standard Transformers spend layers 1-6 progressively reconstructing static patterns. The paper cites Ghandeharioun et al. (2024) to illustrate with “Diana, Princess of Wales”:
┌─────────────────────────────────────────────────────────────────────────────┐
│ ENTITY RESOLUTION: "Diana, Princess of Wales" │
│ (How a standard Transformer reconstructs a known entity) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Input: "... Diana, Princess of Wales ..." │
│ Task: Internally represent who this entity is │
│ │
│ Layer Latent State (via PatchScope) What's Happening │
│ ─────────────────────────────────────────────────────────────────────────│
│ 1-2 "Wales: Country in the United Kingdom" Just the last token │
│ 3 "Wales: Country in Europe" Still just geography │
│ 4 "Princess of Wales: Title held by Starting to see title │
│ female sovereigns..." │
│ 5 "Princess of Wales: Title given to Title semantics emerge │
│ the wife of the Prince of Wales..." │
│ 6 "Diana, Princess of Wales (1961-1997), FINALLY: Full entity │
│ the first wife of Prince Charles..." │
│ ─────────────────────────────────────────────────────────────────────────│
│ │
│ OBSERVATION: │
│ It takes 6 LAYERS just to reconstruct a well-known entity. │
│ This is static knowledge—it's the same every time this phrase appears. │
│ These layers are essentially rebuilding a lookup table at runtime. │
│ │
│ ENGRAM ALTERNATIVE: │
│ Layer 2: Engram looks up trigram "Princess of Wales" │
│ → Retrieves pre-computed embedding encoding the full entity │
│ Layer 3+: Can immediately proceed to REASONING about Diana │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
LogitLens Evidence
LogitLens (nostalgebraist, 2020) projects each layer’s hidden state through the final LM head to measure “prediction readiness”:
\[\text{KL}\!\left(P^{(\ell)} \,\big\|\, P^{\text{final}}\right)\]
Finding: Engram shows systematically lower KL divergence at early layers—representations converge to prediction-ready states faster. (A minimal sketch of this measurement follows the chart below.)
┌─────────────────────────────────────────────────────────────────────────────┐
│ KL DIVERGENCE BY LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ KL Divergence │
│ │ │
│ 10 │ ● │
│ │ ● │
│ 8 │ ●● │
│ │ ●● MoE-27B │
│ 6 │ ●●● │
│ │ ○ ●●● │
│ 4 │ ○○ ●●●● │
│ │ ○○○ ●●●●●● │
│ 2 │ ○○○○○ ●●●●●●●●●● │
│ │ ○○○○○○○○○○○○○○○○○○○○○ │
│ 0 │──────────────────────────────────────────→ Layer │
│ 0 5 10 15 20 25 30 │
│ │
│ ● MoE-27B ○ Engram-27B │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
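A minimal sketch of the LogitLens measurement behind this chart, assuming access to per-layer hidden states and the final LM head (toy shapes, illustrative names):

import torch
import torch.nn.functional as F

def logitlens_kl(hidden_states, lm_head, final_hidden):
    """KL(P^(l) || P^final) per layer: project each layer's hidden state through
    the final LM head and compare to the final-layer output distribution."""
    log_p_final = F.log_softmax(lm_head(final_hidden), dim=-1)
    kls = []
    for h in hidden_states:
        log_p_l = F.log_softmax(lm_head(h), dim=-1)
        kls.append(torch.sum(log_p_l.exp() * (log_p_l - log_p_final)).item())
    return kls

# Toy usage: 30 layers, hidden size 64, vocabulary 1000
lm_head = torch.nn.Linear(64, 1000, bias=False)
layers = [torch.randn(64) for _ in range(30)]
kls = logitlens_kl(layers, lm_head, layers[-1])
print(kls[0], kls[-1])   # the last entry is ~0 by construction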
CKA Evidence
CKA (Centered Kernel Alignment; Kornblith et al., 2019) measures representational similarity between layers.
Finding: Engram layer 5 representations match MoE layer ~12 representations for named entities. The “soft alignment index” quantifies this shift:
| Engram Layer $j$ | Soft Alignment $a_j$ | “Depth Bonus” ($a_j - j$) |
|---|---|---|
| 0 | ~2 | +2 layers |
| 5 | ~12 | +7 layers |
| 10 | ~17 | +7 layers |
| 15 | ~21 | +6 layers |
| 20 | ~24 | +4 layers |
| 25 | ~27 | +2 layers |
Early-to-mid Engram layers gain the most “effective depth” because that’s where static pattern reconstruction would normally occur.
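For reference, a minimal linear-CKA implementation in the style of Kornblith et al. (2019). The hard argmax below is a simple stand-in for the paper's soft alignment index, and the Engram/MoE activations themselves are not reproduced here.

import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (samples, features)."""
    X = X - X.mean(axis=0)           # center each feature over samples
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    return hsic / (np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro'))

def alignment_index(engram_acts, moe_acts):
    """For each Engram layer, the MoE layer whose representation it most resembles."""
    return [max(range(len(moe_acts)), key=lambda i: linear_cka(e_j, moe_acts[i]))
            for e_j in engram_acts]

# Toy usage on random activation matrices
print(linear_cka(np.random.randn(200, 32), np.random.randn(200, 32)))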
The Mechanism Summarized
┌─────────────────────────────────────────────────────────────────────────────┐
│ ENGRAM'S "EFFECTIVE DEPTH" MECHANISM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ STANDARD TRANSFORMER: │
│ │
│ Layer 0 ──→ Raw token embeddings │
│ Layer 1 ──→ Local bigram features │
│ Layer 2 ──→ Trigram patterns emerging │
│ Layer 3 ──→ Entity boundaries detected │
│ Layer 4 ──→ Entity types recognized │
│ Layer 5 ──→ Multi-token entities composed ← RECONSTRUCTION │
│ Layer 6 ──→ Entity semantics resolved ← COMPLETE HERE │
│ Layer 7 ──→ Begin relational reasoning ← REASONING STARTS │
│ ... │
│ Layer 30 ──→ Final prediction │
│ │
│ ENGRAM TRANSFORMER: │
│ │
│ Layer 0 ──→ Raw token embeddings │
│ Layer 1 ──→ Basic contextualization │
│ Layer 2 ──→ ENGRAM INJECTION ← STATIC PATTERNS │
│ │ RETRIEVED O(1) │
│ └→ Representations NOW equivalent to MoE Layer 6-7 │
│ Layer 3 ──→ Can immediately begin reasoning ← REASONING STARTS │
│ ... │
│ Layer 30 ──→ Final prediction │
│ │
│ NET EFFECT: │
│ Engram "skips" ~4-5 layers of reconstruction work. │
│ Those layers can now contribute to reasoning instead. │
│ A 30-layer Engram model has ~34-35 "effective layers" of reasoning. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
12. Functional Separation and Domain-Specialized Engram Potential
The Sensitivity Experiment
The experiment completely suppresses Engram output during inference ($\alpha \rightarrow 0$ for all positions).
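A minimal sketch of how such a suppression could be wired in, assuming a gated-residual Engram output; the function and argument names are hypothetical.

# Hypothetical ablation switch: force every Engram gate to zero at inference.
def engram_output(hidden, retrieved, gate_fn, projection, ablate=False):
    alpha = 0.0 if ablate else gate_fn(hidden, retrieved)   # alpha -> 0 everywhere when ablated
    return hidden + alpha * projection(retrieved)

# Running the benchmark suite twice (ablate=False vs. ablate=True) and comparing
# scores yields the "retained performance" figures summarized below.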
Results by Task Category
┌─────────────────────────────────────────────────────────────────────────────┐
│ RETAINED PERFORMANCE BY TASK TYPE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ TASK CATEGORY BENCHMARK RETAINED INTERPRETATION │
│ ─────────────────────────────────────────────────────────────────────────│
│ │
│ READING COMPREHENSION C3 93% Backbone holds │
│ (Context provides answer) RACE-Middle 89% comprehension │
│ RACE-High 84% capability │
│ DROP 81% │
│ ─────────────────────────────────────────────────────────────────────────│
│ │
│ COMMONSENSE REASONING HellaSwag 85% Backbone holds │
│ (World knowledge needed) ARC-Challenge 81% most common- │
│ PIQA 81% sense patterns │
│ ─────────────────────────────────────────────────────────────────────────│
│ │
│ KNOWLEDGE-INTENSIVE CMMLU 78% Moderate │
│ REASONING MMLU 75% degradation— │
│ (Facts + reasoning) MMLU-PRO 72% needs both │
│ ─────────────────────────────────────────────────────────────────────────│
│ │
│ CODE CruxEval 76% Mixed—patterns │
│ (Patterns + logic) MBPP 68% matter for │
│ HumanEval 58% generation │
│ ─────────────────────────────────────────────────────────────────────────│
│ │
│ ALGORITHMIC REASONING BBH 67% Surprising │
│ (Multi-step logic) GSM8K 62% dependency— │
│ MGSM 44% maybe via │
│ MATH 36% pattern recog │
│ ─────────────────────────────────────────────────────────────────────────│
│ │
│ FACTUAL KNOWLEDGE TriviaQA-ZH 44% CATASTROPHIC │
│ (Pure recall) PopQA 44% COLLAPSE │
│ TriviaQA 29% ← Engram IS │
│ the knowledge │
│ ─────────────────────────────────────────────────────────────────────────│
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Interpretation: Functional Separation
Engram becomes the primary repository for parametric knowledge. Factual knowledge is inherently N-gram structured (“The capital of France” → “Paris”), making it ideal for static lookup. The backbone retains comprehension and reasoning capabilities.
┌─────────────────────────────────────────────────────────────────────────────┐
│ ARCHITECTURAL CAPABILITY MAPPING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────┐ │
│ │ FULL MODEL │ │
│ │ (Backbone + Engram) │ │
│ └─────────────┬───────────────┘ │
│ │ │
│ ┌───────────────────────┼───────────────────────┐ │
│ │ │ │ │
│ ▼ │ ▼ │
│ ┌───────────────────────┐ │ ┌───────────────────────┐ │
│ │ BACKBONE │ │ │ ENGRAM │ │
│ │ (MoE Experts + │ │ │ (N-gram Memory) │ │
│ │ Attention) │ │ │ │ │
│ ├───────────────────────┤ │ ├───────────────────────┤ │
│ │ • Reading comprehens. │ │ │ • Factual knowledge │ │
│ │ • Commonsense reason. │ │ │ • Entity recognition │ │
│ │ • Context extraction │ │ │ • Pattern completion │ │
│ │ • Logical inference │ │ │ • Stereotyped phrases │ │
│ │ • Long-range deps │ │ │ • Local collocations │ │
│ └───────────────────────┘ │ └───────────────────────┘ │
│ │ │ │ │
│ │ ┌────────────┴────────────┐ │ │
│ │ │ REQUIRES BOTH │ │ │
│ └─────────►│ • Knowledge reasoning │◄─────────┘ │
│ │ • Code generation │ │
│ │ • Math problem solving │ │
│ └─────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Implication: Domain-Specialized Engram
If Engram disproportionately stores factual knowledge, domain-specialized Engram modules could dramatically improve accuracy in knowledge-intensive applications:
Medical Engram Potential
┌─────────────────────────────────────────────────────────────────────────────┐
│ MEDICAL ENGRAM: POTENTIAL DESIGN │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ DOMAIN CHARACTERISTICS: │
│ ─────────────────────────────────────────────────────────────────────────│
│ │
│ Medical text is HIGHLY N-GRAM STRUCTURED: │
│ │
│ • Drug names: "acetaminophen", "metformin hydrochloride" │
│ • Dosages: "500mg twice daily", "10mg/kg body weight" │
│ • Conditions: "type 2 diabetes mellitus", "acute myocardial infarction" │
│ • Anatomical terms: "left anterior descending artery" │
│ • Procedures: "laparoscopic cholecystectomy" │
│ • Interactions: "contraindicated with MAO inhibitors" │
│ │
│ These are LOCAL, STATIC patterns—perfect for Engram. │
│ │
│ ─────────────────────────────────────────────────────────────────────────│
│ │
│ POTENTIAL ARCHITECTURE: │
│ ─────────────────────────────────────────────────────────────────────────│
│ │
│ Base Model: │
│ ├─ General backbone (reasoning, comprehension) ~20B params │
│ └─ General Engram (common knowledge) ~10B params │
│ │
│ Medical Specialist: │
│ ├─ Same backbone ~20B params │
│ ├─ General Engram (retained) ~10B params │
│ └─ Medical Engram (domain-specific) ~50B params │
│ • Trained on PubMed, clinical notes, FDA data │
│ • Drug-drug interactions │
│ • Diagnostic criteria │
│ • Treatment protocols │
│ │
│ Total: 80B params, but only 20B activated per token │
│ Massive factual capacity with modest compute cost. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Advantages of Domain-Specialized Engram
- Efficiency: Scales knowledge without scaling inference cost
- Modularity: Swap domain modules without backbone retraining
- Updateability: Incremental updates without catastrophic forgetting
- Auditability: Deterministic retrievals enable knowledge provenance
Future Direction: Mixture of Memories (MoM)
┌─────────────────────────────────────────────────────────────────────────────┐
│ FUTURE DIRECTION: MIXTURE OF MEMORIES (MoM) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Just as MoE routes computation to specialized experts, │
│ MoM could route retrieval to specialized memory modules: │
│ │
│ Input Tokens │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Memory Router │ │
│ │ (Learned) │ │
│ └────────┬────────┘ │
│ │ │
│ ┌─────────────────┼─────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ General │ │ Medical │ │ Legal │ │
│ │ Engram │ │ Engram │ │ Engram │ │
│ │ (50B) │ │ (50B) │ │ (30B) │ │
│ └───────┬───────┘ └───────┬───────┘ └───────┬───────┘ │
│ │ │ │ │
│ └─────────────────┼─────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Gated Fusion │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Backbone │ │
│ │ (20B MoE) │ │
│ └─────────────────┘ │
│ │
│ Total: 150B+ parameters │
│ Activated: ~25B per token │
│ Specialized knowledge for each domain │
│ Single backbone for shared reasoning │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
13. Scale Claims vs. Empirical Validation: Strategic Omissions
Systematic Gap Analysis
┌─────────────────────────────────────────────────────────────────────────────┐
│ SCALE CLAIMS vs. EMPIRICAL VALIDATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ CLAIM VALIDATED? GAP │
│ ─────────────────────────────────────────────────────────────────────────│
│ │
│ U-shaped allocation is stable Partially Only 2 compute │
│ across compute regimes budgets tested │
│ (2×10²⁰, 6×10²⁰) │
│ V3 uses ~10²⁵ │
│ │
│ Log-linear Engram scaling continues Partially Tested to ~13B │
│ indefinitely Claims apply to │
│ 100B+ │
│ │
│ 100B Engram offloading with <3% Throughput Capability NOT │
│ overhead only tested at 100B │
│ │
│ Multi-level cache hierarchy Described No empirical │
│ exploits Zipfian distribution validation │
│ │
│ Engram-40B undertraining implies Claimed No extended │
│ further gains possible training run │
│ │
│ Prefetch-overlap strategy scales Architecture Not tested at │
│ to production serving described high QPS/batch │
│ │
│ Context-extension gains persist 32k context No 128k or 1M │
│ at longer contexts only context testing │
│ │
│ MoE + Engram composition optimal Asserted No comparison │
│ for frontier models at frontier │
│ scale │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Evidence of Intentional Omission
- Selective precision: Extremely detailed on some aspects (hyperparameters to 4 decimals), conspicuously vague on scale
- Infrastructure over-specification: Section 2.5 describes production-grade systems unnecessary for 27B validation
- Explicit undertraining acknowledgment: They tell you results are incomplete
- Architecture alignment with V3: Uses V3 tokenizer, MLA, mHC—not generic research
The Competitive Timing Dimension
┌─────────────────────────────────────────────────────────────────────────────┐
│ COMPETITIVE TIMING ANALYSIS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ IF DEEPSEEK PUBLISHES FULL FRONTIER RESULTS: │
│ ─────────────────────────────────────────────────────────────────────────│
│ • OpenAI/Anthropic/Google immediately start replication │
│ • US labs have more compute to iterate faster │
│ • DeepSeek's head start erodes quickly │
│ │
│ BY PUBLISHING CONCEPT WITHOUT SCALE VALIDATION: │
│ ─────────────────────────────────────────────────────────────────────────│
│ • Establishes intellectual priority (can cite own work) │
│ • Competitors must independently validate scale │
│ • Buys time to ship V4 before replication │
│ • When V4 launches, competitors are still experimenting │
│ │
│ THE STRATEGIC CALCULUS: │
│ ─────────────────────────────────────────────────────────────────────────│
│ • DeepSeek's advantage is NOT compute (they have less) │
│ • DeepSeek's advantage IS architectural innovation speed │
│ • Publishing establishes priority, withholding preserves lead time │
│ • Optimal strategy: publish concept, withhold scale results │
│ │
│ This is EXACTLY what the paper does. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Future Implications
The Beijing AGI Summit Context
At the January 2026 Beijing AGI summit, senior figures from Chinese AI labs made striking admissions (South China Morning Post, 2026; Wall Street Journal, 2026):
- Zhipu AI co-founder Tang Jie: “The truth may be that the gap is actually widening”
- Alibaba scientist Lin Junyang: “Less than 20%” chance of overtaking US in 3-5 years; US labs enjoy “one to two orders of magnitude more training compute”
External assessments support this pessimism (Asialink, 2026; Science Business, 2026):
- US controls ~75% of global AI compute; China ~15%
- US AI supercomputing capacity approximately 9× China’s
- HBM (not just GPUs) identified as binding constraint (Design-Reuse, 2026)
Engram as Strategic Response
The Engram architecture directly addresses acknowledged constraints:
| Constraint | Engram Response |
|---|---|
| HBM bottleneck | Host DRAM offloading via prefetch |
| Compute disadvantage | More capability per FLOP |
| Hardware access limits | Architectural efficiency as substitute for scale |
This alignment is not coincidental. Engram represents the operationalization of efficiency as competitive strategy.
The Paradox of Constraint-Driven Innovation
Export controls may paradoxically accelerate Chinese architectural innovation:
- US labs with abundant compute have less pressure to innovate on efficiency
- Chinese labs, facing structural constraints, are forced into architectural creativity
- Innovations that extract more from less benefit everyone but are discovered under constraint
If US labs have 10× compute but DeepSeek extracts 2× more per FLOP, the effective gap narrows to 5×. With compounding efficiency innovations, this represents the scenario where China narrows the capability gap despite hardware disadvantage.
DeepSeek V4 Prediction
Evidence for Engram Adoption
- Track record: Every major DeepSeek paper has been deployed
- Strategic fit: Directly addresses HBM constraints and compute efficiency
- Architecture alignment: Paper uses V3 components (tokenizer, MLA, mHC)
- Publication timing: Matches historical pattern (paper 1-4 months before deployment)
- Scale validation gap: Frontier results deliberately withheld
Signals to Watch
Strong adoption signals:
- Follow-up paper showing Engram at 100B+ scale
- Production-grade distributed training code release
- V4 technical report citing this paper in architecture section
- DeepSeek communications mentioning “conditional memory”
Weak/negative signals:
- Long delay between paper and any follow-up
- Other DeepSeek papers explore alternative efficiency approaches
- No mention of Engram in subsequent communications
Research Directions
Domain-Specialized Engram
Research questions:
- Does domain-specific Engram training improve domain accuracy?
- What’s the optimal $\rho^*$ for knowledge-intensive domains?
- Can domain and general Engram compose without interference?
- Can Engram retrievals provide citations for knowledge provenance?
Technical Extensions
- Integration with other sparse primitives (Mixture-of-Depths, early exit)
- Learned addressing (beyond fixed hash functions)
- Higher-order N-grams with larger memory budgets
- Continual learning for Engram modules
Broader Implications
If Engram’s thesis is correct—that Transformers “waste” depth on static pattern reconstruction—implications extend beyond efficiency:
- Interpretability: Cleaner separation between knowledge storage and reasoning
- Editability: Modify factual knowledge without affecting reasoning capabilities
- Verification: Audit knowledge sources via retrieval provenance
Future Potential: Activation-Conditional Memory Modulation
A promising direction for future investigation emerges from the intersection of Engram’s branch-specific gating architecture and recent advances in neural network steering via representation engineering (Turner et al., 2023; Zou et al., 2023). If semantic modes—such as factual retrieval versus abstract reasoning—are geometrically identifiable within activation space, then Engram’s existing architectural hooks may permit dynamic, activation-conditional modulation of memory contribution.
The Existing Architectural Hook
The Engram design already creates the necessary structure for semantic-conditional behaviour:
\[\alpha_m = \sigma\!\left(\frac{h_t^{(m)} \cdot k_t^{(m)}}{\sqrt{d}}\right)\]
The gate $\alpha_m$ is already a function of the branch-specific hidden state $h_t^{(m)}$, which encodes semantic content from preceding layers. This means the model can implicitly learn to modulate Engram contribution based on whatever semantic features emerge in that activation pattern. The sensitivity analysis results (TriviaQA retaining only 29% performance versus C3 retaining 93% when Engram is suppressed) demonstrate that the model does learn some version of functional separation between factual retrieval and comprehension modes.
The research direction proposed here would make this separation explicit and controllable rather than leaving it to emergent learning dynamics.
Proposed Implementation Approaches
Approach 1: Learned Factuality Probe
Following the methodology established in probing classifier literature (Belinkov, 2022), a linear probe could be trained on intermediate activations to detect “factual retrieval mode”:
\[f_t = \sigma\!\left(w_{\text{fact}}^\top h_t + b\right)\]
This factuality score $f_t \in (0,1)$ would then globally modulate the Engram gates:
\[\tilde{\alpha}_m = \alpha_m \cdot (1 + \lambda \cdot f_t)\]
where $\lambda$ controls sensitivity to the factuality signal. Higher $f_t$ (factual retrieval detected) increases Engram contribution; lower $f_t$ (reasoning mode detected) decreases it.
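A minimal PyTorch sketch of Approach 1; every module and variable name is hypothetical, reflecting the proposal above rather than anything implemented in the paper.

import torch

class FactualityModulatedGate(torch.nn.Module):
    """Scales each branch's Engram gate by a learned factuality probe f_t."""
    def __init__(self, d, lam=0.5):
        super().__init__()
        self.probe = torch.nn.Linear(d, 1)    # w_fact and b
        self.lam = lam                        # lambda: sensitivity to the factuality signal

    def forward(self, h_branch, k_branch, h_t):
        d = h_branch.shape[-1]
        alpha = torch.sigmoid((h_branch * k_branch).sum(-1) / d ** 0.5)  # base gate
        f_t = torch.sigmoid(self.probe(h_t)).squeeze(-1)                 # f_t in (0, 1)
        return alpha * (1 + self.lam * f_t)                              # modulated gate

# Toy usage: batch of 8 positions, hidden size 64
gate = FactualityModulatedGate(d=64)
h_b, k_b, h = torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 64)
print(gate(h_b, k_b, h).shape)   # torch.Size([8])

Whether the probe is trained jointly with the gates or fit post hoc and frozen is itself a design choice worth ablating.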
Approach 2: Steering Vector Modulation
Building on representation engineering methods (Zou et al., 2023), a “factuality direction” $v_{\text{fact}}$ could be identified in activation space via contrastive methods applied to factual versus reasoning task completions:
\[\text{factuality\_score}_t = \frac{h_t \cdot v_{\text{fact}}}{\|h_t\| \, \|v_{\text{fact}}\|}\]
This directional projection would bias gates toward higher or lower Engram contribution based on the model's inferred semantic mode at each position.
Approach 3: Branch Specialization via Auxiliary Training Signal
Rather than post-hoc steering, functional separation could be encouraged during training through auxiliary losses that reward branch specialization:
- Branches 1-2: High Engram affinity (factual retrieval specialists)
- Branches 3-4: Low Engram affinity (reasoning specialists)
This would create explicit functional roles within the mHC branch structure, analogous to how MoE experts specialize through load balancing losses.
Approach 4: Inference-Time Control Surface
For deployment flexibility, a user-controllable “factuality dial” could be exposed:
\[\alpha_m' = \sigma\!\left(\frac{h_t^{(m)} \cdot k_t^{(m)}}{\sqrt{d}} + \beta_{\text{user}}\right)\]
where $\beta_{\text{user}} \in [-2, +2]$ allows runtime adjustment of the prior toward factual precision (high Engram, positive $\beta$) or flexible reasoning (low Engram, negative $\beta$). This would give practitioners direct control over the knowledge-reasoning tradeoff without retraining the model.
Conceptual Architecture
┌─────────────────────────────────────────────────────────────────┐
│ ACTIVATION-CONDITIONAL ENGRAM MODULATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Hidden State h_t │
│ │ │
│ ├──→ [Factuality Probe] ──→ f_t ∈ (0,1) │
│ │ │ │
│ │ ┌────────────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌─────────┐ │
│ │ │ Scale │ │
│ │ └────┬────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ Branch 1: α₁' = α₁ · (1 + λ·f_t) │ High factuality │
│ │ Branch 2: α₂' = α₂ · (1 + λ·f_t) │ → More Engram │
│ │ Branch 3: α₃' = α₃ · (1 - λ·f_t) │ Low factuality │
│ │ Branch 4: α₄' = α₄ · (1 - λ·f_t) │ → Less Engram │
│ └─────────────────────────────────────────┘ │
│ │
│ Result: Model dynamically routes between factual retrieval │
│ and reasoning modes based on detected activation state. │
│ │
└─────────────────────────────────────────────────────────────────┘
Toward Unified Conditional Allocation
This direction points toward a more general architectural principle: conditional memory with semantic routing. Just as MoE routes computation to specialized experts based on learned gating functions, retrieval could be routed to specialized memory modules based on semantic state:
| Detected Mode | Engram Contribution | Expert Routing |
|---|---|---|
| Factual retrieval | High | Low diversity (retrieval experts) |
| Abstract reasoning | Low | High diversity (reasoning experts) |
| Creative generation | Moderate | Specific expert clusters |
This would create a unified framework where both computation and memory are conditionally allocated based on task demands inferred from activation patterns—extending the conditional computation paradigm to encompass conditional memory as a first-class design axis.
Research Considerations
Several factors warrant careful investigation:
- Implicit learning sufficiency: The current gating mechanism may already capture semantic-conditional behaviour through end-to-end learning. Explicit steering could prove redundant or could interfere with learned patterns that are more nuanced than a linear probe can capture.
- Probe fidelity: Factuality detection is not binary—many tasks require both retrieval and reasoning in interleaved fashion. A linear probe may be insufficiently expressive to capture the full spectrum of semantic modes.
- Training dynamics: If Engram contribution becomes highly state-dependent, the model might learn inconsistent or unstable dependencies that complicate optimization and hurt convergence.
- Interpretability prerequisites: Identifying robust semantic directions requires substantial representation engineering work that the Engram paper does not undertake. Applying methods from Zou et al. (2023) and Turner et al. (2023) to Engram-augmented models specifically would be a necessary precursor.
- Evaluation methodology: Measuring success requires benchmarks that cleanly separate factual retrieval from reasoning—existing benchmarks often confound these capabilities.
Intended Future Work
The author of this analysis intends to explore these directions through systematic investigation of:
- Semantic geometry in Engram-augmented models: Applying representation engineering methods to characterize how factual versus reasoning modes manifest geometrically in mHC branch activations.
- Branch specialization dynamics: Analysing whether branches naturally specialize for different functional roles during standard Engram training, or whether auxiliary losses are required to induce clean separation.
- Controllable factuality-reasoning tradeoffs: Developing and evaluating inference-time control mechanisms that allow practitioners to explicitly navigate the knowledge-computation tradeoff.
- Interaction effects with MoE routing: Investigating whether Engram contribution and expert selection exhibit correlated or complementary patterns, and whether joint optimization of both routing decisions yields benefits over independent optimization.
This research direction represents a natural extension of the Engram contribution—moving from static architectural choices about where to inject memory toward dynamic, semantically-informed decisions about when and how much memory to contribute based on the model’s internal state.
References
Belinkov, Y. (2022). Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1), 207-219.
Cormode, G., & Muthukrishnan, S. (2005). An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms, 55(1), 58-75.
DeepSeek-AI. (2024a). DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434.
DeepSeek-AI. (2024b). DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.
DeepSeek-AI. (2026). Engram: Conditional memory via scalable lookup—A new axis of sparsity for large language models. arXiv preprint arXiv:2601.07372.
Design-Reuse. (2026). China’s AI chip ambitions limited by HBM memory supply, notes report. Retrieved from https://www.design-reuse.com/news/
Ghandeharioun, A., Kim, B., et al. (2024). PatchScope: A framework for inspecting language model representations at scale. Proceedings of the 41st International Conference on Machine Learning.
Huang, X., et al. (2025a). OverEncoding: Hash-based N-gram embeddings for vocabulary expansion. arXiv preprint.
Johnson, W. B., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, 189-206.
Jurafsky, D., & Martin, J. H. (2024). Speech and language processing (3rd ed.). Pearson.
Kornblith, S., Norouzi, M., Lee, H., & Hinton, G. (2019). Similarity of neural network representations revisited. Proceedings of the 36th International Conference on Machine Learning, 3519-3529.
Lample, G., Sablayrolles, A., Ranzato, M., Denoyer, L., & Jégou, H. (2019). Large memory layers with product keys. Advances in Neural Information Processing Systems, 32.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., … & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459-9474.
nostalgebraist. (2020). Interpreting GPT: The logit lens. LessWrong. Retrieved from https://www.lesswrong.com/posts/
Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., … & Zheng, Y. (2023). YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071.
Reuters. (2026, January 10). China is closing US technology lead despite constraints, AI researchers say. Retrieved from https://www.reuters.com/world/china/
Science Business. (2026). State of AI 2025: Five key charts for Europeans. Retrieved from https://sciencebusiness.net/news/ai/
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.
South China Morning Post. (2026, January). China AI has less than 20% chance to exceed US over next 3-5 years: Alibaba scientist. Retrieved from https://www.scmp.com/tech/big-tech/
Turner, A., Thiergart, L., Udell, D., Leech, G., Mini, U., & MacDiarmid, M. (2023). Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248.
Wall Street Journal. (2026, January). China AI race: US chips. Retrieved from https://www.wsj.com/tech/ai/
Weinberger, K., Dasgupta, A., Langford, J., Smola, A., & Attenberg, J. (2009). Feature hashing for large scale multitask learning. Proceedings of the 26th International Conference on Machine Learning, 1113-1120.
Xie, S., et al. (2025). mHC: Manifold-constrained hyper-connections for improved gradient flow in deep networks. arXiv preprint.
Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., … & Hendrycks, D. (2023). Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405.