Paper: mHC: Manifold-Constrained Hyper-Connections
Authors: Zhenda Xie et al., DeepSeek-AI
Date: December 2025
arXiv: 2512.24880v1
TL;DR
- Manifold-Constrained Hyper-Connections (mHC) is a neural network architecture modification from DeepSeek-AI that fixes a fundamental instability in Hyper-Connections (HC)—an approach that expands the residual stream from one to multiple parallel streams with learnable mixing matrices.
- HC offers performance gains but causes training crashes at scale due to signal explosion. The mixing matrix \(H^{\text{res}}\) can have spectral norm > 1, causing exponential signal growth across layers (measured at ~3000× amplification in the paper).
- mHC solves this by constraining the mixing matrices to be doubly stochastic (all rows and columns sum to 1, all entries \(\geq 0\)) using the Sinkhorn-Knopp iterative projection algorithm. This guarantees spectral norm \(\leq 1\), preventing signal explosion.
- The doubly stochastic constraint projects matrices onto the Birkhoff polytope—the set of all doubly stochastic matrices—which is closed under matrix multiplication, ensuring stability at any depth.
- Validated at 27B parameters, mHC eliminates the loss spikes and gradient explosions observed in HC while achieving equal or better performance across reasoning and language understanding benchmarks (+2.1% on BBH, +2.3% on DROP vs HC).
- Engineering overhead is 6.7% additional training time after optimization via custom CUDA kernels (TileLang), memory-efficient recomputation, and modified pipeline parallelism (DualPipe).
- The broader principle: constraining learnable parameters to specific geometric manifolds can restore stability without sacrificing expressivity—a design pattern applicable beyond HC to other architectural components.
Quick Reference: Mathematical Notation
| Symbol | Name | Meaning |
|---|---|---|
| x | Vector | A list of numbers, e.g., [1.5, 2.0, 0.8, 1.2] |
| \(x_l\) | Subscript l | The vector x at layer l (layer index) |
| \(x_0, x_1, x_2\) | Subscript 0,1,2 | Individual streams within the expanded residual |
| \(\mathbb{R}\) | Real numbers | The set of all real numbers |
| \(\mathbb{R}^n\) | n-dimensional space | Vectors with n entries |
| \(\mathbb{R}^{n \times m}\) | Matrix space | Matrices with n rows and m columns |
| C | Channel dimension | The width of a single stream (e.g., 2048) |
| n | Expansion rate | Number of parallel streams (e.g., 4) |
| n×C | Expanded dimension | Total width of the multi-stream residual |
| H | Matrix | A transformation that mixes/combines inputs |
| \(H^{\text{res}}\) | Residual mapping | The n×n matrix that mixes streams together |
| \(H^{\text{pre}}\) | Pre mapping | Aggregates n streams → 1 input for layer |
| \(H^{\text{post}}\) | Post mapping | Distributes 1 output → n streams |
| \(\mathcal{F}\) | Layer function | The actual computation (attention, FFN) |
| \(\lVert x \rVert\) | Norm | The “length” or magnitude of vector x |
| \(\lVert H \rVert\) | Spectral norm | Maximum amplification factor of matrix H |
| Σ | Summation | Add up all terms |
| ∏ | Product | Multiply all terms together |
| \(\prod_i H_i\) | Composite | \(H_1 \times H_2 \times H_3 \times \ldots\) (matrix multiplication) |
| \(P_M(H)\) | Projection | Snap matrix H onto manifold M |
| ∈ | Element of | “belongs to” or “is in” |
| ≤ | Less than or equal | Used in bounds such as \(\lVert H \rVert \leq 1\) |
| ⊤ | Transpose | Flip rows and columns |
| I | Identity matrix | Diagonal of 1s, zeros elsewhere |
| \(B_n\) | Birkhoff polytope | Set of all \(n \times n\) doubly stochastic matrices |
Glossary of Mathematical Terms
| Term | Definition | Relevance to mHC |
|---|---|---|
| Residual Connection | Architecture where layer output = input + transformation. Enables training of deep networks by preserving direct signal paths. | The foundation HC/mHC extends. Standard form: \(x_{l+1} = x_l + F(x_l)\) |
| Identity Mapping | The direct pass-through of input to output without modification. In residual networks, this is the \(x_l\) term that bypasses the layer function. | HC breaks this; mHC restores it via constraints |
| Spectral Norm | The maximum factor by which a matrix can stretch any vector. Equals the largest singular value. Calculated as \(\sqrt{\max \text{ eigenvalue of } H^T \times H}\). | Doubly stochastic matrices have spectral norm \(\leq 1\), preventing signal explosion |
| Doubly Stochastic Matrix | A square matrix where: (1) all entries \(\geq 0\), (2) every row sums to 1, (3) every column sums to 1. | The constraint mHC imposes on \(H^{\text{res}}\) to ensure stability |
| Birkhoff Polytope | The geometric shape formed by all doubly stochastic matrices. Vertices are permutation matrices; interior points are “soft permutations.” | The manifold onto which mHC projects \(H^{\text{res}}\) |
| Permutation Matrix | A 0/1 matrix with exactly one 1 per row and column. Reorders elements without blending. | Vertices of the Birkhoff polytope |
| Sinkhorn-Knopp Algorithm | Iterative method to project any positive matrix onto the Birkhoff polytope by alternately normalizing rows and columns. | How mHC efficiently computes the doubly stochastic constraint |
| Manifold | A mathematical space that locally resembles flat space but may have global structure/constraints. | The Birkhoff polytope is the manifold mHC uses; future work may explore others |
| Convex Combination | A weighted average where weights are non-negative and sum to 1. Output is always “between” the inputs. | Doubly stochastic matrices compute convex combinations—outputs cannot exceed input range |
| Amax Gain Magnitude | A proxy metric for signal amplification. Forward gain = maximum absolute row sum; backward gain = maximum absolute column sum. Used in the paper to measure stability. | HC shows values ~3000; mHC keeps values ~1.0-1.6 |
| FFN (Feed-Forward Network) | A component of Transformer layers that applies a learned transformation to each token independently. Typically: expand → nonlinearity → compress. | One of the layer functions F that mHC wraps |
| Pipeline Parallelism | Distributing model layers across multiple GPUs, with activations communicated between stages. | mHC increases communication costs; paper optimizes via DualPipe modifications |
| Gradient Explosion | When gradients grow unboundedly during backpropagation, causing numerical overflow and training failure. | The primary failure mode of HC that mHC prevents |
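The "Amax Gain Magnitude" proxy defined above is easy to compute directly. A minimal NumPy sketch, using an illustrative unconstrained mixing matrix (the values are for demonstration, not taken from the paper's measurements):

```python
import numpy as np

# Illustrative unconstrained mixing matrix (not learned values from the paper)
H = np.array([[ 2.1, -0.5,  0.3, -1.2],
              [-0.8,  1.9,  0.7, -0.4],
              [ 1.5, -0.3,  2.5,  0.1],
              [-0.6,  0.9, -0.2,  1.8]])

# Forward gain: maximum absolute row sum; backward gain: maximum absolute column sum
forward_gain = np.abs(H.sum(axis=1)).max()
backward_gain = np.abs(H.sum(axis=0)).max()
print(round(float(forward_gain), 2), round(float(backward_gain), 2))  # 3.8 3.3
```

Both values sit well above the stable value of 1, signalling amplification in both the forward and backward passes.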
Section-by-Section
Section 1: Introduction
The problem: Neural networks have used the same residual connection design since ResNet (2016). The equation \(x_{l+1} = x_l + F(x_l)\) enables deep training because the identity mapping (the \(x_l\) term passed directly through) ensures stable signal flow across many layers.
The opportunity: Hyper-Connections (HC) extends this by expanding from one stream to n parallel streams with learnable matrices controlling how information mixes. This increases representational capacity without increasing computational cost (FLOPs) of individual layers.
The challenge: HC’s learnable mixing matrices break the identity mapping property. When multiplied across many layers, these unconstrained matrices cause signals to either explode (grow unboundedly) or vanish. This manifests as training crashes, loss spikes, and gradient explosions at scale.
The solution: mHC projects the mixing matrices onto the Birkhoff polytope—the set of doubly stochastic matrices—which mathematically guarantees bounded signal propagation while preserving the ability to learn useful mixing patterns.
Section 2: Related Works
Micro-design concerns what happens inside each layer: convolutions evolved to attention mechanisms and feed-forward networks. Efficiency variants emerged (Multi-Query Attention, Grouped-Query Attention, Multi-Head Latent Attention). Sparse computation via Mixture-of-Experts allows parameter scaling without proportional compute costs.
Macro-design concerns how layers connect to each other. After ResNet’s residual connections, architectures like DenseNet and FractalNet increased topological complexity. Recent work expands residual stream width: Hyper-Connections, Residual Matrix Transformer, MUDDFormer, DeepCrossAttention. All of these compromise identity mapping and introduce stability concerns. mHC is positioned as restoring stability to this expanded-stream paradigm.
Section 3: Preliminary (The Problem Analysis)
HC mechanics: The input x is expanded from C dimensions to \(n \times C\) dimensions (n parallel streams). Three learnable matrices govern the flow:
- \(H^{\text{pre}}\): Aggregates n streams into 1 input for the layer function
- \(H^{\text{post}}\): Distributes layer output back across n streams
- \(H^{\text{res}}\): Mixes information between streams in the residual path
Key finding: Ablations show \(H^{\text{res}}\) provides the largest performance gain but is also the source of instability.
Numerical instability: When \(H^{\text{res}}\) is applied across L layers, the composite mapping is the product of all \(H^{\text{res}}\) matrices. Unconstrained matrices can have spectral norm greater than 1, causing exponential signal growth. The paper measures “Amax Gain Magnitude” reaching approximately 3000 in HC—signals amplified 3000× from input to output.
Empirical evidence: Training curves show HC experiencing a loss spike around step 12,000 in a 27B model, correlated with gradient norm explosions. The model may recover but wastes compute and risks permanent divergence.
System overhead: Beyond stability, HC increases memory access costs proportionally to n (the expansion rate). This creates I/O bottlenecks and increased GPU memory requirements. Pipeline parallelism communication also scales with n.
Section 4: Method (The Solution)
Core idea: Constrain \(H^{\text{res}}\) to the Birkhoff polytope—the set of all doubly stochastic matrices (non-negative entries, rows sum to 1, columns sum to 1).
Why doubly stochastic matrices work:
- Norm preservation: Spectral norm is bounded by 1. The matrix cannot amplify signals.
- Compositional closure: The product of doubly stochastic matrices is also doubly stochastic. Stability is preserved at any depth.
- Geometric interpretation: Doubly stochastic matrices are convex combinations of permutation matrices. The learned mixing is a “soft permutation” of features.
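The first two properties can be spot-checked numerically. A quick sketch, using illustrative doubly stochastic matrices rather than learned values:

```python
import numpy as np

# Two doubly stochastic 4x4 matrices (illustrative values)
A = np.array([[0.40, 0.20, 0.25, 0.15],
              [0.15, 0.45, 0.25, 0.15],
              [0.25, 0.15, 0.40, 0.20],
              [0.20, 0.20, 0.10, 0.50]])
B = np.full((4, 4), 0.25)  # Uniform averaging is also doubly stochastic

for M in (A, B, A @ B, A @ A):
    # Closure: products of doubly stochastic matrices stay doubly stochastic
    assert np.allclose(M.sum(axis=0), 1) and np.allclose(M.sum(axis=1), 1)
    # Norm preservation: spectral norm (largest singular value) never exceeds 1
    assert np.linalg.norm(M, ord=2) <= 1 + 1e-9
```

Because the product `A @ B` satisfies the same checks as `A` and `B`, stacking any number of such layers cannot amplify the signal.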
Implementation via Sinkhorn-Knopp: The algorithm iteratively normalizes rows and columns until both sum to 1. Starting from any positive matrix (achieved by exponentiating the learned values), alternate between dividing each row by its sum and dividing each column by its sum. After approximately 20 iterations, the matrix is effectively doubly stochastic.
Additional constraints: \(H^{\text{pre}}\) and \(H^{\text{post}}\) are constrained to non-negative values via sigmoid activation, preventing signal cancellation from mixing positive and negative coefficients.
Infrastructure optimizations:
- Kernel fusion: Combine multiple operations into single GPU kernels to reduce memory bandwidth bottlenecks. Reduces I/O from (5n+1)C to (n+1)C reads.
- Recomputation: Discard intermediate activations and recompute them during the backward pass to reduce memory footprint. Optimal recomputation block size is derived mathematically.
- DualPipe modifications: Extend the pipeline parallelism schedule to overlap mHC computations with inter-GPU communication, hiding the increased communication latency.
Section 5: Experiments
Setup: Models inspired by DeepSeek-V3 architecture with Mixture-of-Experts. Three sizes for compute scaling (3B, 9B, 27B parameters) plus a 3B model trained on 1 trillion tokens for token scaling analysis. Expansion rate n=4 for both HC and mHC.
Main results (27B model):
- mHC eliminates the loss spike observed in HC
- Gradient norms remain stable throughout training
- Final loss reduction of 0.021 compared to baseline
Downstream benchmarks: mHC outperforms baseline on all 8 benchmarks and outperforms HC on most. Notable improvements on reasoning tasks: +2.1% on BBH, +2.3% on DROP compared to HC.
Scaling experiments: The performance advantage of mHC over baseline is maintained across compute budgets (3B to 27B) and training durations (token scaling curve shows consistent improvement throughout training).
Stability analysis:
- Single-layer gain magnitude stays near 1.0 for mHC (compared to 1-20+ for HC)
- Composite gain over 60 layers stays below 1.6 for mHC (compared to ~3000 for HC)
- Visualizations of learned matrices show mHC produces bounded, well-behaved mixing patterns while HC produces extreme values
System performance: With all optimizations, mHC adds only 6.7% training time overhead at n=4.
Section 6: Conclusion and Outlook
Summary: mHC successfully restores the identity mapping property to expanded-stream architectures by constraining mixing matrices to the Birkhoff polytope, enabling stable training at scale with minimal overhead.
Future directions:
- Explore alternative manifold constraints beyond doubly stochastic matrices
- Investigate different tradeoffs between plasticity (expressivity) and stability
- Apply geometric constraint principles to other architectural components
- Deeper understanding of how topological structure affects optimization and representation learning
Foundational Concepts Explained
Residual Connections
Residual connections are the standard way neural networks enable deep training. The equation:
\[x_{l+1} = x_l + F(x_l)\]
means each layer’s output is the input plus a learned transformation. The direct pass-through of \(x_l\) (the identity mapping) ensures that gradients can flow backward through the network without vanishing, and signals can flow forward without degradation.
Visual representation:
x_l (input)
│
┌────┴────┐
│ │
│ ┌────▼────┐
│ │ F │ ← Layer function (attention or FFN)
│ └────┬────┘
│ │
│ ┌────▼────┐
└───►│ + │ ← Add input directly (identity mapping)
└────┬────┘
│
x_{l+1} (output)
Why this works across many layers:
\[x_L = x_l + \sum_{i=l}^{L-1} F(x_i)\]
Signal from layer \(l\) maps directly to layer \(L\).
The identity path preserves information regardless of depth.
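This telescoping identity can be checked numerically. A tiny sketch with an arbitrary toy stand-in for the layer function \(F\) (the weights and sizes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = [rng.normal(scale=0.1, size=(4, 4)) for _ in range(8)]  # 8 toy layers

def F(x, l):
    # Arbitrary toy layer function standing in for attention/FFN
    return np.tanh(W[l] @ x)

x = rng.normal(size=4)
outputs = []
h = x
for l in range(8):
    outputs.append(F(h, l))
    h = h + outputs[-1]  # Residual update: x_{l+1} = x_l + F(x_l)

# The final state is exactly the input plus the sum of all layer outputs
assert np.allclose(h, x + np.sum(outputs, axis=0))
```

The assertion holds at any depth: the identity path carries `x` through untouched, and each layer only adds its own contribution.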
Hyper-Connections Architecture
HC extends residual connections by expanding from one stream to n parallel streams.
Visual representation:
Input: x ∈ ℝ^{n×C} (n parallel streams, each C-dimensional)
x (n streams)
│
▼
┌────────┐
│ H^pre │ ← Aggregate: n streams → 1 input (1×n matrix)
└────────┘
│
▼
┌────────┐
│ F │ ← Layer function (same as before)
└────────┘
│
▼
┌────────┐
│ H^post │ ← Distribute: 1 output → n streams (1×n matrix)
└────────┘
│
▼
┌────────┐
│ H^res │ ← Mix: n streams → n streams (n×n matrix) ⚠️ INSTABILITY SOURCE
└────────┘
│
▼
Output (n streams)
The equation:
\[x_{l+1} = H^{\text{res}} \cdot x_l + H^{\text{post}} \cdot \mathcal{F}(H^{\text{pre}} \cdot x_l)\]
The first term is the residual path; the second is the transform path.
The problem—unconstrained \(H^{\text{res}}\):
Example unconstrained matrix:
\[H^{\text{res}} = \begin{bmatrix} 2.1 & -0.5 & 0.3 & -1.2 \\ -0.8 & 1.9 & 0.7 & -0.4 \\ 1.5 & -0.3 & 2.5 & 0.1 \\ -0.6 & 0.9 & -0.2 & 1.8 \end{bmatrix}\]
Row sums: 0.7, 1.4, 3.8, 1.9 ← NOT 1
Column sums: 2.2, 2.0, 3.3, 0.3 ← NOT 1
Spectral norm: ~2.8 ← GREATER THAN 1 (amplifies signals)
Signal Amplification Across Layers
When matrices with spectral norm > 1 multiply across layers, amplification compounds exponentially.
Layer-by-layer example:
- Layer 1: \(\lVert H^{\text{res}}_1 \cdot x \rVert \approx 2.8 \times \lVert x \rVert\) — “After one layer, signal is 2.8× louder”
- Layer 2: \(\lVert H^{\text{res}}_2 \cdot H^{\text{res}}_1 \cdot x \rVert \approx 7.8 \times \lVert x \rVert\) — “After two layers, signal is 7.8× louder”
- Layer 10: \(2.8^{10} \approx 30{,}000 \times \lVert x \rVert\)
- Layer 30: \(2.8^{30} \approx 2.6 \times 10^{13} \times \lVert x \rVert\) → OVERFLOW
Example calculation:
Input vector: \(x = [1.0, 1.0, 1.0, 1.0]\)
After multiplication by unconstrained \(H^{\text{res}}\): \(Hx = [0.7, 1.4, 3.8, 1.9]\)
- Input norm: \(\lVert x \rVert = \sqrt{1^2 + 1^2 + 1^2 + 1^2} = 2.0\)
- Output norm: \(\lVert Hx \rVert = \sqrt{0.7^2 + 1.4^2 + 3.8^2 + 1.9^2} = 4.5\)
Gain: 4.5 / 2.0 = 2.25× amplification in ONE layer
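The same calculation in NumPy; with unrounded norms the gain comes out to about 2.26, consistent with the rounded 2.25× above:

```python
import numpy as np

H = np.array([[ 2.1, -0.5,  0.3, -1.2],
              [-0.8,  1.9,  0.7, -0.4],
              [ 1.5, -0.3,  2.5,  0.1],
              [-0.6,  0.9, -0.2,  1.8]])
x = np.ones(4)

# One application of the unconstrained mixing matrix amplifies the signal
gain = np.linalg.norm(H @ x) / np.linalg.norm(x)
print(round(float(gain), 2))  # 2.26
```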
Doubly Stochastic Matrices
A doubly stochastic matrix has:
- All entries \(\geq 0\)
- Every row sums to exactly 1
- Every column sums to exactly 1
Example:
col₁ col₂ col₃ col₄
row₁ [0.40 0.20 0.25 0.15] → 1.0 ✓
row₂ [0.15 0.45 0.25 0.15] → 1.0 ✓
row₃ [0.25 0.15 0.40 0.20] → 1.0 ✓
row₄ [0.20 0.20 0.10 0.50] → 1.0 ✓
↓ ↓ ↓ ↓
1.0 1.0 1.0 1.0 ✓
Why this prevents explosion:
With doubly stochastic \(H^{\text{res}}\) and uniform input \(x = [1.0, 1.0, 1.0, 1.0]\):
\[Hx = \begin{bmatrix} 0.40 \times 1 + 0.20 \times 1 + 0.25 \times 1 + 0.15 \times 1 \\ 0.15 \times 1 + 0.45 \times 1 + 0.25 \times 1 + 0.15 \times 1 \\ 0.25 \times 1 + 0.15 \times 1 + 0.40 \times 1 + 0.20 \times 1 \\ 0.20 \times 1 + 0.20 \times 1 + 0.10 \times 1 + 0.50 \times 1 \end{bmatrix} = \begin{bmatrix} 1.0 \\ 1.0 \\ 1.0 \\ 1.0 \end{bmatrix}\]
Input norm = Output norm = 2.0
Gain = 1.0× (no amplification)
With varied inputs:
\[x = [2.0, 0.5, 1.0, 0.5]\]
After \(P_M(H^{\text{res}})\):
\[\begin{bmatrix} 0.40 & 0.20 & 0.25 & 0.15 \\ 0.15 & 0.45 & 0.25 & 0.15 \\ 0.25 & 0.15 & 0.40 & 0.20 \\ 0.20 & 0.20 & 0.10 & 0.50 \end{bmatrix} \times \begin{bmatrix} 2.0 \\ 0.5 \\ 1.0 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 1.225 \\ 0.850 \\ 1.075 \\ 0.850 \end{bmatrix}\]
- Input norm: \(\lVert x \rVert = \sqrt{4 + 0.25 + 1 + 0.25} = 2.35\)
- Output norm: \(\lVert Hx \rVert = \sqrt{1.5 + 0.72 + 1.16 + 0.72} = 2.02\)
Gain: 2.02 / 2.35 = 0.86× (slight contraction, never an expansion)
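Both cases can be verified in a few lines; since the spectral norm is at most 1, the gain can never exceed 1 for any input:

```python
import numpy as np

H = np.array([[0.40, 0.20, 0.25, 0.15],
              [0.15, 0.45, 0.25, 0.15],
              [0.25, 0.15, 0.40, 0.20],
              [0.20, 0.20, 0.10, 0.50]])

gains = []
for x in (np.ones(4), np.array([2.0, 0.5, 1.0, 0.5])):
    gains.append(np.linalg.norm(H @ x) / np.linalg.norm(x))

print([round(float(g), 2) for g in gains])  # [1.0, 0.86]
assert all(g <= 1 + 1e-12 for g in gains)  # A doubly stochastic matrix never amplifies
```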
The Birkhoff Polytope
The Birkhoff polytope is the set of ALL doubly stochastic matrices of a given size.
Key theorem (Birkhoff-von Neumann):
Every doubly stochastic matrix is a convex combination of permutation matrices.
Geometric visualization (2×2 case):
A \(2 \times 2\) doubly stochastic matrix has the form:
\[\begin{bmatrix} p & 1-p \\ 1-p & p \end{bmatrix}\]where \(0 \leq p \leq 1\)
This is a line segment between two permutation matrices:
p=1: Identity p=0.5: Average p=0: Swap
[1 0] [0.5 0.5] [0 1]
[0 1] [0.5 0.5] [1 0]
•──────────────────────•──────────────────•
Vertex Interior Vertex
(permutation) (doubly stochastic) (permutation)
Properties that guarantee stability:
| Property | Mathematical Statement | Implication |
|---|---|---|
| Norm bounded | \(\lVert H \rVert \leq 1\) | Cannot amplify signals |
| Closure | \(H_1 \times H_2 \in\) Birkhoff | Product of doubly stochastic is doubly stochastic |
| Convex | \(\alpha H_1 + (1-\alpha)H_2 \in\) Birkhoff | Any blend of valid matrices is valid |
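A small sketch of the Birkhoff-von Neumann picture: blending a few permutation matrices with convex weights always lands inside the polytope. The permutations and weights below are arbitrary choices for illustration:

```python
import numpy as np

# Three 4x4 permutation matrices: identity, a pairwise swap, and a cyclic shift
P1 = np.eye(4)
P2 = np.eye(4)[[1, 0, 3, 2]]  # Swap rows (0,1) and (2,3)
P3 = np.eye(4)[[1, 2, 3, 0]]  # Cyclic shift

# Convex combination: non-negative weights that sum to 1
H = 0.5 * P1 + 0.3 * P2 + 0.2 * P3

assert np.all(H >= 0)                       # Non-negative entries
assert np.allclose(H.sum(axis=0), 1)        # Columns sum to 1
assert np.allclose(H.sum(axis=1), 1)        # Rows sum to 1
```

Moving the weights around traces out the interior of the Birkhoff polytope; putting all weight on one permutation lands on a vertex.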
Sinkhorn-Knopp Algorithm
The algorithm projects any matrix onto the Birkhoff polytope.
Step-by-step example:
Step 0: Start with learned matrix (exponentiated to make positive)
M = [2.7 0.8 1.2 0.3] Row sums: 5.0, 4.1, 5.8, 4.1
[0.5 2.3 1.0 0.3]
[1.8 0.6 3.2 0.2]
[0.4 1.1 0.4 2.2]
Step 1: Normalize rows (divide each row by its sum)
M = [0.54 0.16 0.24 0.06] Row sums: 1.0, 1.0, 1.0, 1.0 ✓
[0.12 0.56 0.24 0.07] Col sums: 1.07, 1.09, 1.13, 0.70 ✗
[0.31 0.10 0.55 0.03]
[0.10 0.27 0.10 0.54]
Step 2: Normalize columns (divide each column by its sum)
M = [0.50 0.15 0.21 0.09] Row sums: 0.95, 0.95, 0.92, 1.19 ✗
[0.11 0.51 0.22 0.10] Col sums: 1.0, 1.0, 1.0, 1.0 ✓
[0.29 0.09 0.49 0.05]
[0.09 0.25 0.09 0.76]
Steps 3-20: Continue alternating; the deviations shrink each round until both rows AND columns sum to 1. ✓ Done!
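The iteration can be reproduced for the starting matrix above. Around 20 rounds already gets very close to doubly stochastic; the sketch below runs 100 so the check is airtight:

```python
import numpy as np

# Starting matrix from the worked example (already positive)
M = np.array([[2.7, 0.8, 1.2, 0.3],
              [0.5, 2.3, 1.0, 0.3],
              [1.8, 0.6, 3.2, 0.2],
              [0.4, 1.1, 0.4, 2.2]])

# Sinkhorn-Knopp: alternately normalize rows and columns
for _ in range(100):
    M = M / M.sum(axis=1, keepdims=True)  # Rows sum to 1
    M = M / M.sum(axis=0, keepdims=True)  # Columns sum to 1

assert np.allclose(M.sum(axis=1), 1, atol=1e-6)  # Rows ≈ 1
assert np.allclose(M.sum(axis=0), 1, atol=1e-6)  # Columns ≈ 1
assert np.all(M >= 0)                            # Entries stay non-negative
```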
Architecture Comparison Summary
┌─────────────────────────────────────────────────────────────────────────┐
│ STANDARD RESIDUAL │
├─────────────────────────────────────────────────────────────────────────┤
│ x_{l+1} = x_l + F(x_l) │
│ │
│ • Width: C (single stream) │
│ • Skip connection: Identity (always stable) │
│ • Gain per layer: Exactly 1.0 │
│ • Status: ✓ Stable, limited expressivity │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│ HYPER-CONNECTIONS (HC) │
├─────────────────────────────────────────────────────────────────────────┤
│ x_{l+1} = H^res × x_l + H^post × F(H^pre × x_l) │
│ │
│ • Width: n×C (n parallel streams) │
│ • Skip connection: Unconstrained learnable H^res │
│ • Gain per layer: 1.0 - 3.0+ (unbounded) │
│ • Composite gain (60 layers): ~3000× │
│ • Status: ⚠️ Unstable at scale, training crashes │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│ MANIFOLD-CONSTRAINED HYPER-CONNECTIONS (mHC) │
├─────────────────────────────────────────────────────────────────────────┤
│ x_{l+1} = P_M(H^res) × x_l + H^post × F(H^pre × x_l) │
│ ───────── │
│ Projected onto Birkhoff polytope │
│ │
│ • Width: n×C (n parallel streams) │
│ • Skip connection: Constrained to doubly stochastic │
│ • Gain per layer: ≤ 1.0 (bounded by construction) │
│ • Composite gain (60 layers): ~1.0 - 1.6 │
│ • Status: ✓ Stable + Expressive │
└─────────────────────────────────────────────────────────────────────────┘
Signal Propagation Comparison
HC (Unconstrained):
Layer 1 Layer 2 Layer 3 Layer 60
│ │ │ │
▼ ▼ ▼ ▼
H^res ──▶ H^res ──▶ H^res ──▶ ... ──▶ H^res
│ │ │ │
▼ ▼ ▼ ▼
‖x‖=1 ‖x‖=2.8 ‖x‖=7.8 ... ‖x‖≈3000 💥 EXPLOSION
mHC (Constrained):
Layer 1 Layer 2 Layer 3 Layer 60
│ │ │ │
▼ ▼ ▼ ▼
P_M(H^res)─▶P_M(H^res)─▶P_M(H^res)──▶...──▶P_M(H^res)
│ │ │ │
▼ ▼ ▼ ▼
‖x‖=1 ‖x‖≤1 ‖x‖≤1 ... ‖x‖≤1.6 ✓ STABLE
Discussion
The Instability Problem in Practice
When signal explosion occurs in an LLM, the practical effects are:
Training crash (most common):
Step 5000: loss = 2.18, grad_norm = 4.1
Step 6000: loss = 2.02, grad_norm = 12.7 ← diverging
Step 7000: loss = 1.95, grad_norm = 89.3 ← unstable
Step 8000: loss = 4.72, grad_norm = 1247.5 ← exploding
Step 9000: loss = NaN, grad_norm = inf ← crashed
Inference garbage:
User: "Explain quantum computing"
Normal output:
"Quantum computing uses quantum mechanical phenomena..."
Exploded output:
"Quantum quantum quantum ████ NULL NULL 9999999..."
Numerical overflow:
float16 range: ±65,504
Layer 25 values: [45000, -52000, 61000] ← near limit
Layer 26: 45000 × 2.8 = 126,000 > 65,504 → inf → NaN
The Question of Excessive Contraction
mHC guarantees gain ≤ 1, preventing explosion. But what about excessive contraction (gain ≪ 1)?
Empirical observation: Trained networks maintain gain near 1.0 (Figure 7 in paper).
Theoretical gap: No proof this must occur. The constraint is a ceiling, not a floor.
Plausible explanation: Severe contraction hurts loss (early layer information becomes inaccessible), so gradient descent avoids it.
The compound effect if contraction occurred:
| Contraction/layer | After 60 layers | Early contribution |
|---|---|---|
| 0.99 (1% loss) | 0.55 | 55% preserved |
| 0.98 (2% loss) | 0.30 | 30% preserved |
| 0.95 (5% loss) | 0.05 | 5% preserved |
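The table's figures follow from simple geometric decay and are easy to verify:

```python
# Per-layer contraction compounds geometrically with depth
for rate, expected in [(0.99, 0.55), (0.98, 0.30), (0.95, 0.05)]:
    preserved = rate ** 60
    print(f"{rate}: {preserved:.2f}")
    assert abs(preserved - expected) < 0.01
```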
Conclusion: This remains an empirical observation, not a guaranteed property.
History and Adoption
| Date | Event |
|---|---|
| 2015 | ResNet introduces residual connections |
| 2017 | Transformer uses residual connections throughout |
| Sept 2024 | HC proposed (Zhu et al.)—15 months ago |
| Dec 2025 | mHC proposed (this paper)—fixes HC’s instability |
Production status: No major foundational models use HC or mHC. The techniques are too new and, in HC’s case, too unstable for expensive frontier training.
Limitations of mHC
| Aspect | Overhead |
|---|---|
| Activation memory | 4× (at n=4) |
| Communication | 4× (at n=4) |
| Training time | +6.7% after optimization; likely far higher without the custom kernels |
| Engineering | Custom kernels, modified parallelism |
The Depth Scaling Hypothesis
The question: If mHC introduces mild contraction, does this mean diminishing returns as depth increases?
The math:
If each layer contracts by 0.98×:
- 60 layers: \(0.98^{60} \approx 30\%\) of early information preserved
- 200 layers: \(0.98^{200} \approx 2\%\) preserved
- 500 layers: \(0.98^{500} \approx 0.004\%\), effectively zero
Implication: An optimal depth likely exists where cost/benefit is maximized. Adding layers beyond this yields diminishing returns.
Current state: The paper tests 30 layers. Whether deeper models encounter this ceiling is an open question.
Future Speculations and Implications
Broader Applications to Macro-Architecture
mHC demonstrates a powerful principle: constraining learnable parameters to specific geometric manifolds can restore desirable properties without sacrificing expressivity. This principle extends far beyond HC.
Potential applications:
| Component | Current Issue | Potential Manifold Constraint |
|---|---|---|
| Attention weights | Can become degenerate/uniform | Constrain to specific entropy range |
| MoE routing | Load imbalance | Doubly stochastic routing matrices |
| Layer outputs | Representation collapse | Orthogonal constraints |
| Cross-attention | Domain mismatch | Permutation-equivariant maps |
| Adapter modules | Catastrophic forgetting | Tangent space of pretrained loss |
The research program this suggests: Systematic characterization of manifolds by their learning dynamics, analogous to how activation functions are characterized by their gradient properties.
Architectures that might benefit:
- MUDDFormer, RMT, DeepCrossAttention: All suffer similar instability to HC; mHC’s constraint could stabilize them
- Mixture-of-Experts routing: Doubly stochastic constraints could enforce balanced load
- Multi-modal fusion: Constrained mixing between modalities could prevent one dominating
- State-space models (Mamba): Recurrent dynamics could benefit from norm-bounded transitions
Implications for Foundational Model Development
Training stability becomes architectural:
Previously, training stability was achieved through:
- Learning rate schedules
- Gradient clipping
- Careful initialization
- Loss scaling
mHC shows stability can be built into the architecture itself. This shifts the paradigm from “fix instability during training” to “design architectures that cannot be unstable.”
New scaling dimension:
Traditional scaling laws optimize:
- Parameters (model size)
- Training tokens (data)
- Compute (FLOPs)
mHC introduces a fourth dimension:
- Residual stream width (n)—capacity without FLOPs
This decouples representation capacity from computational cost, potentially enabling more efficient scaling.
Cost-Effectiveness Analysis
Direct costs:
| Factor | Impact |
|---|---|
| Training time | +6.7% overhead |
| Memory | +4× activation storage |
| Communication | +4× pipeline bandwidth |
| Engineering | Significant (custom kernels) |
Indirect benefits:
| Factor | Impact |
|---|---|
| No training crashes | Saves potentially millions in wasted compute |
| Stable gradients | Enables higher learning rates, faster convergence |
| Richer representations | Better performance per FLOP |
Net assessment:
For very large training runs (>$10M), the 6.7% overhead is likely justified by:
- Elimination of catastrophic failure risk
- Potential for faster convergence
- Better final performance
For smaller runs, the engineering complexity may not be worth it.
Depth vs Width Tradeoffs
mHC changes the calculus:
| Traditional View | mHC View |
|---|---|
| Deeper = more abstraction | Wider residual = more capacity without depth |
| Depth limited by gradient flow | Depth limited by contraction accumulation |
| Width limited by FLOPs | Width (n) independent of FLOPs |
Optimal architecture predictions:
For smaller models (< 10B parameters):
- mHC overhead may not be justified
- Standard residual connections sufficient
- Depth is cheap; go deeper rather than wider residual
For larger models (> 50B parameters):
- mHC overhead becomes negligible relative to total cost
- Wider residual (larger n) provides “free” capacity
- May enable shallower networks with equivalent performance
For very deep models (100+ layers):
- Contraction accumulation may become limiting factor
- Optimal n may vary by depth
- Trade-off between n and layer count needs exploration
Compute Efficiency Frontier
Does mHC break the compute barrier?
The compute efficiency frontier is the Pareto-optimal curve of performance vs. compute cost. To “break” it means achieving better performance for the same compute (or same performance for less compute).
mHC’s position:
Performance
│
│ ╭─── New frontier (with mHC)?
│ ╭────╯
│ ╭────╯──── Current frontier
│ ╭────╯
│╭────╯
└─────────────────────────── Compute
Assessment:
mHC likely shifts the frontier rather than fundamentally breaking it:
- Positive: More representational capacity per FLOP (via expanded residual without increased layer FLOPs)
- Negative: Overhead costs (6.7% time, 4× memory) partially offset gains
- Uncertain: Whether the capacity translates to proportional performance gains
The honest answer: mHC probably provides a modest efficiency improvement (maybe 5-15% better performance at equivalent compute), not a paradigm shift. The real value is stability at scale, not raw efficiency.
Implications for AGI/Superintelligence Development
Fundamental bottlenecks in current architectures:
| Bottleneck | Description | Does mHC help? |
|---|---|---|
| Context length | Limited working memory | No direct impact |
| Reasoning depth | Shallow inference chains | Potentially (richer representations) |
| Knowledge integration | Difficulty combining learned facts | Potentially (better cross-layer communication) |
| Generalization | Brittleness to distribution shift | Unknown |
| Sample efficiency | Requires massive data | Unknown |
| Alignment | Difficulty specifying values | No direct impact |
What mHC addresses:
mHC solves a scaling enabler problem, not a capability problem. It allows larger, more complex architectures to train stably. This is necessary but not sufficient for AGI.
The capability implications:
If AGI requires:
- Very deep reasoning (100+ step chains): mHC’s stability helps, but contraction may limit depth
- Rich multi-scale representations: mHC’s wider residual stream is beneficial
- Efficient information routing: Doubly stochastic mixing is a form of soft routing
What mHC does NOT solve:
- The data bottleneck: Still requires massive training data
- The alignment problem: Stable training ≠ aligned behavior
- The reasoning ceiling: Transformer-style pattern matching may have fundamental limits—it’s still transformers after all
- The embodiment gap: Digital intelligence limitations are a given
Additional Research
- Need to validate mHC at larger 100B+ scale
- Explore alternative manifold constraints
- Integrate with other architectural innovations (SSMs, mixture of experts)
- Develop theoretical understanding of why networks avoid contraction
Conclusion: What mHC Means for AI Development
| Aspect | Assessment |
|---|---|
| Immediate impact | Enables stable training of topologically complex architectures |
| Scaling | Opens new dimension (residual width) orthogonal to traditional scaling |
| Cost-effectiveness | Modest improvement; main value is stability not efficiency |
| Model size | Most valuable for large models where crash risk is costly |
| Depth implications | May enable shallower networks; extreme depth may face contraction limits |
| AGI relevance | Enabler technology, not capability breakthrough |
| Broader principle | Geometric constraints as architectural design tool |
The bottom line:
mHC is an important engineering contribution that makes a class of architectures practical at scale. It represents a maturing understanding that stability can be designed into architectures, not just trained around. For the path to AGI, it removes one obstacle (training instability at scale) while leaving the fundamental capability questions unanswered. Its greatest legacy may be the principle it demonstrates: that the right geometric constraints can resolve seemingly fundamental tradeoffs in deep learning.
Implementation Practicality and DeepSeek’s Engineering Context
Can Non-Specialist Researchers Implement mHC?
Short answer: The core algorithm is accessible; the infrastructure is not.
The mHC paper presents a technique with two distinct layers of complexity:
Layer 1: The Mathematical Algorithm (Accessible)
The core idea—projecting matrices onto the Birkhoff polytope via Sinkhorn-Knopp—is straightforward to implement in standard deep learning frameworks:
```python
# Simplified Sinkhorn-Knopp projection (conceptual)
import torch

def sinkhorn_projection(H: torch.Tensor, iterations: int = 20) -> torch.Tensor:
    M = torch.exp(H)  # Exponentiate so every entry is positive
    for _ in range(iterations):
        M = M / M.sum(dim=1, keepdim=True)  # Normalize rows to sum to 1
        M = M / M.sum(dim=0, keepdim=True)  # Normalize columns to sum to 1
    return M
```
A graduate student with PyTorch experience could implement a naive version of mHC in a few hundred lines of code. The mathematical concepts (doubly stochastic matrices, iterative projection) are well-documented in linear algebra literature and do not require specialized knowledge beyond standard ML training.
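As a sanity check (a NumPy sketch of the same iteration, not the paper's code), we can verify the two properties the projection is meant to guarantee: unit row and column sums, and a spectral norm that never exceeds 1:

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 4))  # stand-in for a learnable mixing matrix

# Sinkhorn-Knopp: exponentiate, then alternate row/column normalization
M = np.exp(H)
for _ in range(100):
    M = M / M.sum(axis=1, keepdims=True)  # rows sum to 1
    M = M / M.sum(axis=0, keepdims=True)  # columns sum to 1

print(M.sum(axis=1))         # ~[1, 1, 1, 1]
print(M.sum(axis=0))         # ~[1, 1, 1, 1]
print(np.linalg.norm(M, 2))  # spectral norm: exactly 1 for a doubly
                             # stochastic matrix, so no amplification
```

Note that for a doubly stochastic matrix the spectral norm is exactly 1 (the all-ones vector is an eigenvector with eigenvalue 1), so the bound rules out amplification while still permitting contraction in the orthogonal subspace.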
Layer 2: The Infrastructure (Specialist-Only)
However, making mHC practical at scale requires engineering that is far beyond typical research capabilities:
| Component | Requirement | Accessibility |
|---|---|---|
| Sinkhorn-Knopp forward pass | Standard PyTorch | ✓ Accessible |
| Sinkhorn-Knopp backward pass | Custom autograd | Moderate difficulty |
| Kernel fusion (TileLang) | Custom CUDA kernels | Specialist only |
| Memory-efficient recomputation | Custom training loop | Moderate difficulty |
| DualPipe integration | Distributed systems expertise | Specialist only |
| FP8 mixed precision compatibility | Hardware-specific optimization | Specialist only |
Practical assessment by researcher type:
| Researcher Profile | Can Implement? | Limitations |
|---|---|---|
| PhD student (ML theory) | Proof-of-concept only | Will hit memory/speed walls at scale |
| PhD student (systems) | Partial implementation | May lack distributed training expertise |
| Industry ML engineer | Likely yes, with effort | Needs significant engineering time |
| Frontier lab team | Full implementation | Has necessary infrastructure |
| Independent researcher | Unlikely at scale | Lacks compute and systems expertise |
The reality:
A researcher could implement mHC for small-scale experiments (models up to ~1B parameters on a single GPU) using standard frameworks. However, reproducing the paper’s 27B-scale results requires:
- Access to large GPU clusters (hundreds to thousands of GPUs)
- Custom CUDA/PTX kernel development capabilities
- Expertise in distributed training systems (pipeline parallelism, expert parallelism)
- Months of engineering effort to optimize throughput
The 6.7% overhead figure cited in the paper is achievable only after extensive optimization. A naive implementation might see 50-100% overhead, making it impractical for resource-constrained researchers.
DeepSeek’s Low-Level Hardware Innovations
The mHC paper emerges from DeepSeek’s broader program of hardware-software co-design, which has pushed the boundaries of what’s possible with constrained hardware. Understanding this context is essential for appreciating both the paper’s contributions and its reproducibility challenges.
The Hardware Constraint Context
DeepSeek trains on NVIDIA H800 GPUs—a variant of the H100 designed for the Chinese market with reduced interconnect bandwidth due to U.S. export restrictions. Where the H100 offers 900 GB/s NVLink bandwidth, the H800 provides only 400 GB/s. This constraint forced DeepSeek to innovate at levels most labs never touch.
The Geopolitical Implication
One implication: the geopolitical constraints imposed on DeepSeek (now regarded as one of China's premier frontier labs) have driven architectural innovation and kernel-level optimization expertise that labs in other countries had no need to develop. This is a true frontier, born of necessity rather than choice.
Meta’s adoption of Qwen from Alibaba’s lab (which, funnily enough, is itself based on Llama) is a testament that U.S. export restrictions are functioning as a forcing function to accelerate innovation—the opposite of their intended effect.
Couple this algorithmic and kernel-level innovation with the rate at which Chinese chip designers and manufacturers are brute-forcing improvements where they can, and the gap begins to close despite the hardware deficit. Consider Huawei’s CloudMatrix 384 systems: they don’t compete on individual transistor performance, using 7nm chips rather than 3nm. Instead, they compete on system architecture—combining 384 accelerator chips into a single coherent compute cluster with entirely optical interconnects between every chip. The 4× power consumption would be punishing in the U.S., but energy is cheap in China, making this tradeoff financially viable.
Given these trajectories, it’s not hard to see the capability gap narrowing relatively quickly. I would argue that by the time the U.S. administration changes in late 2028, we’ll see performance parity between Chinese and U.S. frontier labs.
PTX-Level Optimization
DeepSeek’s engineers work directly with PTX (Parallel Thread Execution), NVIDIA’s intermediate representation that sits between high-level CUDA C++ and the actual GPU machine code (SASS). This is analogous to writing assembly language instead of C—it offers fine-grained control over:
- Register allocation
- Thread and warp-level scheduling
- Memory access patterns
- Instruction-level parallelism
Abstraction hierarchy:
```
Python/PyTorch   ← Most researchers work here
      ↓
CUDA C++         ← Some optimization work
      ↓
PTX              ← DeepSeek works here (assembly-like)
      ↓
SASS             ← Actual GPU machine code
```
As one analysis noted, DeepSeek’s engineers reconfigured H800 GPUs to dedicate 20 of 132 streaming multiprocessors (SMs) specifically for server-to-server communication, optimizing data compression and decompression to overcome bandwidth limitations. This level of hardware reconfiguration is far beyond standard CUDA development.
Custom Kernel Ecosystem
DeepSeek has open-sourced several components of their infrastructure:
| Tool | Purpose | Relevance to mHC |
|---|---|---|
| TileLang | Domain-specific language for readable GPU kernels | Used for mHC kernel fusion |
| DeepGEMM | High-performance FP8 matrix multiplication | Underlying compute primitives |
| FlashMLA | Sparse attention kernels for Multi-head Latent Attention | Attention computation |
| DualPipe | Pipeline parallelism with computation-communication overlap | Training distribution |
The mHC paper specifically mentions using TileLang for kernel fusion, reducing I/O overhead from (5n+1)C to (n+1)C reads. This optimization alone requires expertise that most ML researchers lack.
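For concreteness (my arithmetic from the paper's (5n+1)C and (n+1)C figures, with a hypothetical stream count), the savings grow with the number of residual streams:

```python
def reads_in_units_of_C(n: int, fused: bool) -> int:
    """Residual-stream reads per layer, in units of the hidden size C,
    per the paper's (5n+1)C unfused vs (n+1)C fused figures."""
    return (n + 1) if fused else (5 * n + 1)

n = 4  # hypothetical number of residual streams
print(reads_in_units_of_C(n, fused=False))  # 21
print(reads_in_units_of_C(n, fused=True))   # 5
```

With n = 4 streams, fusion cuts residual-stream reads from 21C to 5C, roughly a 4× reduction in I/O.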
FP8 Mixed Precision Training
DeepSeek pioneered FP8 (8-bit floating point) training at extreme scale with their V3 model. Key innovations include:
- Fine-grained quantization: Tile-wise 1×128 quantization for activations, block-wise 128×128 for weights
- Strategic precision retention: Embedding, output head, MoE gating, normalization, and attention operators remain in BF16/FP32
- High-precision accumulation: Master weights, gradients, and optimizer states stored in FP32
- CUDA Core promotion: Periodically promoting FP8 computations to CUDA Cores (every 128 elements) for accumulated precision
This framework achieved less than 0.25% relative loss error compared to BF16 baselines—validating FP8 training at 671B parameter scale for the first time.
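The tile-wise idea can be sketched in a few lines (a toy NumPy emulation of per-tile scaling, not DeepSeek's FP8 kernels; the 448 constant is FP8 E4M3's maximum representable value):

```python
import numpy as np

def quantize_tilewise(x: np.ndarray, tile: int = 128, max_val: float = 448.0):
    """Emulate 1 x `tile` per-row-tile scaling. Real FP8 kernels store one
    scale per tile; here rounding stands in for the precision loss."""
    rows, cols = x.shape
    assert cols % tile == 0
    out = np.empty_like(x)
    scales = np.empty((rows, cols // tile))
    for j in range(cols // tile):
        t = x[:, j * tile:(j + 1) * tile]
        s = np.abs(t).max(axis=1, keepdims=True) / max_val  # per-tile scale
        s = np.where(s == 0, 1.0, s)                        # avoid divide-by-zero
        out[:, j * tile:(j + 1) * tile] = np.round(t / s) * s
        scales[:, j] = s[:, 0]
    return out, scales

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 256))
xq, scales = quantize_tilewise(x)
print(np.abs(xq - x).max())  # small per-tile reconstruction error
```

Because each 1×128 tile carries its own scale, an outlier in one tile no longer forces a coarse scale on the rest of the tensor.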
Stability as a Precursor: From DeepSeek-V3 to mHC
The V3 Stability Achievement
DeepSeek-V3’s technical report makes a remarkable claim:
“Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks.”
For a 671B parameter model trained on 14.8 trillion tokens, this is extraordinary. Most frontier training runs experience multiple instability events requiring checkpoint rollbacks, costing days of compute time.
How V3 Achieved Stability
DeepSeek-V3 employed multiple strategies to ensure stable training:
| Strategy | Mechanism | Limitation |
|---|---|---|
| Selective precision retention | Keep critical components (embedding, attention, gating) in BF16/FP32 | Increases memory overhead |
| High-precision master weights | Store weights and optimizer states in FP32 | Memory cost |
| Auxiliary-loss-free load balancing | Bias terms for MoE routing instead of auxiliary losses | Requires careful tuning |
| Gradient clipping | Constrain gradient magnitudes | Can slow convergence |
| Fine-grained quantization | Adapt scaling factors to smaller element groups | Engineering complexity |
The key insight: V3’s stability was achieved through external constraints (precision management, clipping, careful initialization) rather than architectural guarantees.
mHC as Architectural Stability
The mHC paper represents a conceptual advance: instead of constraining outputs through training tricks, build stability into the architecture itself.
| Approach | V3 (External Constraints) | mHC (Architectural) |
|---|---|---|
| Mechanism | Clipping, precision management | Manifold projection |
| Guarantee | Empirical (works in practice) | Mathematical (provable bound) |
| Overhead | Runtime checks, mixed precision | Sinkhorn iterations |
| Failure mode | May need tuning per model | Guaranteed by construction |
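The "guaranteed by construction" row rests on a closure property: the product of doubly stochastic matrices is itself doubly stochastic, so stacked mixing layers can never amplify the signal no matter the depth. A small NumPy check (my sketch, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)

def sinkhorn(H: np.ndarray, iters: int = 100) -> np.ndarray:
    """Project onto (approximately) the doubly stochastic matrices."""
    M = np.exp(H)
    for _ in range(iters):
        M = M / M.sum(axis=1, keepdims=True)
        M = M / M.sum(axis=0, keepdims=True)
    return M

# Compose 64 random "layer" mixings, as a deep network would
P = np.eye(4)
for _ in range(64):
    P = sinkhorn(rng.standard_normal((4, 4))) @ P

print(np.linalg.norm(P, 2))          # stays at 1: no exponential blow-up
print(P.sum(axis=1), P.sum(axis=0))  # the product is still doubly stochastic
```

Contrast this with unconstrained HC mixings, whose spectral norms can exceed 1 and compound exponentially across layers.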
The Research Trajectory
The progression from V3 to mHC reflects DeepSeek’s systematic approach to stability:
```
DeepSeek-V2 (2024)
│
├── Identified stability challenges with MoE at scale
├── Developed auxiliary-loss-free load balancing
│
▼
DeepSeek-V3 (Dec 2024)
│
├── Achieved stable 671B training with no rollbacks
├── Used precision management and careful engineering
├── Identified that HC-style architectures introduce new instability
├── Began experimenting with HC architectures, a precursor to this paper
│
▼
mHC (Dec 2025)
│
├── Addresses HC instability through geometric constraints
├── Provides a mathematical stability guarantee
└── Enables topologically complex architectures at frontier scale
```
Implications for the Field
DeepSeek’s work suggests a shift in how stability should be approached:
- Old paradigm: Train the model, add constraints when instability appears, tune hyperparameters
- New paradigm: Design architectures with provable stability properties from the start
This mirrors the historical evolution from ad-hoc regularization to principled techniques like batch normalization and residual connections. mHC represents the next potential step: manifold-constrained macro-architecture.
References
DeepSeek-AI. (2024). DeepSeek-V3 technical report. arXiv. https://arxiv.org/abs/2412.19437
DeepSeek-AI. (2025). Insights into DeepSeek-V3: Scaling challenges and reflections on hardware for AI architectures. arXiv. https://arxiv.org/abs/2505.09343
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Identity mappings in deep residual networks. In European Conference on Computer Vision (pp. 630-645). Springer.
Sinkhorn, R., & Knopp, P. (1967). Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2), 343-348.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
Zhao, Y., et al. (2025). DeepGEMM: High-performance FP8 GEMM kernels. DeepSeek-AI.
Zhu, Y., et al. (2024). Hyper-Connections. arXiv. https://arxiv.org/abs/2409.19606