The Mathematics Behind LLM Finetuning Failures: A Deep Technical Analysis
New ICLR 2025 research reveals the hidden dynamics that make your models hallucinate – with the math to prove it
Welcome back to the Full-Stack Data Science Newsletter. A lot more exclusive content is coming your way, so subscribe if you haven't yet.
The Technical Problem Statement
You've implemented DPO, IPO, or other preference optimization methods. Your loss curves look perfect. Your metrics are improving. But your model exhibits concerning behaviors:
Repetitive generation patterns (the "repeater" phenomenon)
Decreased confidence on ALL responses during off-policy DPO
Hallucination amplification after instruction tuning
Performance degradation with extended training
The root cause? A fundamental misunderstanding of learning dynamics in autoregressive models.
The Mathematical Framework
Learning Dynamics Decomposition
The researchers formalize how a parameter update Δθ influences the change in the model's prediction Δf(x_o):
Δ log π^t(y | x_o) = -η A^t(x_o) K^t(x_o, x_u) G^t(x_u, y_u) + O(η²)
Where:
A^t(x_o) = I - 1π^⊤(x_o): Adaptation matrix (depends only on current predictions)
K^t(x_o, x_u): Empirical Neural Tangent Kernel (similarity between examples)
G^t(x_u, y_u): Gradient residual (update direction and magnitude)
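To make the decomposition concrete, here is a minimal numerical sketch (my own illustration, not the paper's code) that assumes a linear softmax classifier, where the eNTK collapses to K^t(x_o, x_u) = (x_o·x_u)·I. It checks that one SGD step on (x_u, y_u) changes log π(y|x_o) by approximately -η A K G:

import numpy as np

rng = np.random.default_rng(0)
d, V = 8, 5                                   # input dimension, number of classes
W = rng.normal(size=(V, d))                   # linear model: logits z(x) = W @ x
x_u, x_o = rng.normal(size=d), rng.normal(size=d)
y_u = 2                                       # label used for the update on x_u
eta = 1e-3                                    # small step so the O(eta^2) term is negligible

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

pi_u, pi_o = softmax(W @ x_u), softmax(W @ x_o)

# Decomposition terms for the linear model
A = np.eye(V) - np.outer(np.ones(V), pi_o)    # A^t(x_o) = I - 1 pi^T(x_o)
K = (x_o @ x_u) * np.eye(V)                   # eNTK of a linear model
G = pi_u.copy(); G[y_u] -= 1.0                # cross-entropy residual: pi - e_{y_u}
predicted = -eta * A @ K @ G

# Actual change after one SGD step on the cross-entropy loss at (x_u, y_u)
W_new = W - eta * np.outer(G, x_u)            # grad_W CE = (pi - e_y) x^T
actual = np.log(softmax(W_new @ x_o)) - np.log(softmax(W @ x_o))

print(np.abs(predicted - actual).max())       # tiny: agreement up to O(eta^2)

The same bookkeeping carries over to a full transformer, where K is no longer diagonal and the similarity between examples drives cross-example influence.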
Extension to Sequential Generation
For autoregressive models, this becomes significantly more complex:
[Δ log π^t(y | χ_o)]_m = -∑_{l=1}^L η[A^t(χ_o)]_m[K^t(χ_o, χ_u)]_{m,l}[G^t(χ_u)]_l + O(η²)
Where χ represents the concatenated prompt-response sequence, and the eNTK now depends on both inputs AND responses.
The Squeezing Effect: A Mathematical Proof
For Standard Classification
When applying negative gradients to unlikely labels in softmax-based models, the researchers prove:
Guarantees:
π_{θ^{t+1}}(y^-) will decrease (intended effect)
Probability mass concentrates on y* = argmax_{i≠y^-} π_θ^t(i)
Mathematical relationship:
π_{θ^{t+1}}(y*) / π_θ^t(y*) > π_{θ^{t+1}}(y_i) / π_θ^t(y_i) for all i ≠ y*, y^-
The Pathological Case
When π_θ^t(y^-) is very small (common in pretrained models), the effect amplifies:
Rich get richer: High-probability tokens increase further
Poor get poorer: Low-probability tokens decrease more severely
Probability mass redistribution: Nearly all mass flows to the current argmax
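To see the squeezing effect numerically, here is a toy sketch (the same illustrative softmax setup as above, with directly parameterized logits): one gradient step that only pushes down log π(y^-) still redistributes nearly all the mass onto the current argmax. The large step size is just to make the effect easy to read off.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([4.0, 2.5, 1.0, 0.0, -6.0])      # y^- (index 4) is already very unlikely
y_neg = 4
pi = softmax(z)

# Pure negative gradient on y^-: descend on log pi(y^-), whose logit gradient is e_{y^-} - pi
eta = 5.0
grad = -pi.copy(); grad[y_neg] += 1.0
pi_new = softmax(z - eta * grad)

print("before:", pi.round(4))
print("after :", pi_new.round(4))
print("ratio :", (pi_new / pi).round(3))
# The argmax (index 0) gets the largest ratio pi^{t+1}/pi^t among all i != y^-,
# while the other low-probability classes shrink fastest: rich get richer.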
Algorithm-Specific Analysis
DPO Learning Dynamics
For Direct Preference Optimization:
G^t_DPO+ = β(1-a)[π_θ^t(y|χ^+) - y^+]
G^t_DPO- = -β(1-a)[π_θ^t(y|χ^-) - y^-]
Where a is the sigmoid of the current reward margin: a = σ(β log π_θ^t(y^+)/π_ref(y^+) - β log π_θ^t(y^-)/π_ref(y^-))
Key insight: The strength of updates automatically adjusts based on current separation quality.
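This adaptive scaling is easy to verify with autograd. The sketch below (a toy illustration, not the paper's or any library's implementation) feeds made-up summed sequence log-probabilities into the DPO loss; the gradients that come back are exactly -β(1-a) on the chosen and +β(1-a) on the rejected log-probability:

import torch
import torch.nn.functional as F

beta = 0.1

# Summed log-probabilities of y+ and y- under the policy (toy numbers) and the frozen reference
logp_chosen = torch.tensor(-42.0, requires_grad=True)
logp_rejected = torch.tensor(-55.0, requires_grad=True)
ref_logp_chosen, ref_logp_rejected = -40.0, -50.0

# DPO loss: -log sigmoid(beta * difference of policy-vs-reference log-ratios)
u = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
loss = -F.logsigmoid(u)
loss.backward()

a = torch.sigmoid(u)                 # current separation quality
print(logp_chosen.grad)              # -beta * (1 - a): push y+ up, weakly once a -> 1
print(logp_rejected.grad)            # +beta * (1 - a): push y- down by the same amount

Chaining these scalars through ∂ log π / ∂z gives the per-token residuals above, including the sign flip on the rejected response.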
Why Off-Policy DPO Fails
In off-policy settings:
Both y^+ and y^- are typically low-probability under π_θ^t
Large negative gradients on y^- trigger severe squeezing
Probability mass flows to responses dissimilar to y^+
Model becomes overconfident in unrelated high-probability sequences
On-Policy vs Off-Policy Dynamics
Off-policy:
y^+ and y^- are sampled from a distribution other than π_θ^t
Often in "valley" regions of π_θ^t
Squeezing effect dominates
On-policy:
y^+ and y^- sampled from π_θ^t
Higher initial probability → weaker squeezing
More stable training dynamics
Experimental Validation
Datasets & Models
Datasets: Anthropic-HH, UltraFeedback
Models: Pythia (410M-2.8B), Qwen1.5 (0.5B-1.8B)
Methodology: Track log π_θ^t(y|χ) for various response types
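That probing quantity is straightforward to compute with Hugging Face Transformers. A minimal sketch (using EleutherAI/pythia-410m as an example checkpoint; response_logprob is my own helper name, and it assumes the prompt's tokenization is a prefix of the full sequence's):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/pythia-410m"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

@torch.no_grad()
def response_logprob(prompt: str, response: str) -> float:
    """Summed log pi_theta(y_l | prompt, y_<l) over the response tokens."""
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full = tok(prompt + response, return_tensors="pt").input_ids
    logprobs = torch.log_softmax(model(full).logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, full[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[0, n_prompt - 1:].sum().item()   # keep only response positions

# Track this for y+, y-, rephrases, and unrelated responses at every checkpoint.
print(response_logprob("Question: What is DPO?\nAnswer:", " Direct Preference Optimization."))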
Key Findings
SFT Dynamics:
Direct responses (y^+) increase as expected
Similar responses initially increase, then decrease
Dissimilar responses monotonically decrease
Cross-contamination: responses from other training examples increase
DPO Dynamics:
Both y^+ and y^- decrease over time
Rephrases follow similar patterns with scaled magnitude
Greedy decoding confidence increases dramatically
Margin π_θ^t(y^+) - π_θ^t(y^-) improves despite absolute decreases
Hallucination Mechanism
The framework explains specific hallucination patterns:
Cross-question bleeding: Learning [x_u; y^+_u] increases π_θ^t(y^+_j|x_u) for j≠u
Phrase repetition: Squeezing effect concentrates mass on high-frequency patterns
Confidence miscalibration: Models become overconfident in squeezed responses
The Proposed Solution
Data Augmentation Strategy
Counterintuitive approach: Train on both preferred AND rejected responses during SFT:
L_extended = L_SFT(x_u, y^+_u) + L_SFT(x_u, y^-_u)
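In code, the extended objective is just two masked cross-entropy terms on the same prompt. A minimal sketch (hypothetical helper names; relies on the standard Hugging Face convention that label -100 is ignored in the loss):

import torch

def sft_loss(model, tok, prompt: str, response: str) -> torch.Tensor:
    """Cross-entropy on the response tokens only; prompt positions are masked out."""
    full = tok(prompt + response, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    labels = full.clone()
    labels[:, :n_prompt] = -100                 # ignore the prompt in the loss
    return model(full, labels=labels).loss      # the model shifts labels internally

def extended_sft_loss(model, tok, prompt, chosen, rejected):
    # L_extended = L_SFT(x, y+) + L_SFT(x, y-)
    return sft_loss(model, tok, prompt, chosen) + sft_loss(model, tok, prompt, rejected)

During the extended SFT phase you would backprop through extended_sft_loss in place of the usual single-response SFT loss; the DPO phase afterwards is unchanged.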
Why this works:
Increases π_θ^t(y^-) before DPO phase
Reduces squeezing effect magnitude
Preserves probability mass in relevant regions
Enables more effective preference learning
Implementation Details
Phase 1 (Extended SFT):
Train on both y^+ and y^- responses
Increases overall probability mass in relevant regions
Reduces initial likelihood gap
Phase 2 (Standard DPO):
Apply preference optimization as normal
Reduced squeezing due to higher initial π_θ^t(y^-)
More stable training dynamics
Results
Win rates against baseline:
After 2 DPO epochs: 65.18% (ChatGPT eval), 51.51% (Claude eval)
After 4 DPO epochs: 69.28% (ChatGPT eval), 60.45% (Claude eval)
Implications for Practice
For Algorithm Design
Gradient magnitude matters: Large negative gradients on unlikely sequences are dangerous
Initialization is crucial: Pretrained model biases heavily influence dynamics
On-policy methods: Natural mitigation of squeezing effects
For Training Protocols
Monitor all response types: Not just y^+ and y^-
Early stopping: Extended training can amplify pathological behaviors
Data curation: Balance between positive and negative examples
For Model Evaluation
Beyond accuracy: Track confidence calibration and response diversity
Probing datasets: Essential for understanding training dynamics
Greedy decoding analysis: Reveals squeezing effect magnitude
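As a concrete probe for the last point, the sketch below (an illustration; greedy_confidence is my own helper, and the checkpoint name is a placeholder for your finetuned model) records the probability of each greedily chosen token, which tends to climb as the squeezing effect concentrates mass:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/pythia-410m"        # placeholder: point this at your finetuned checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

@torch.no_grad()
def greedy_confidence(prompt: str, max_new_tokens: int = 32):
    """Probability assigned to each greedily decoded token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, do_sample=False, max_new_tokens=max_new_tokens,
                         output_scores=True, return_dict_in_generate=True)
    return [torch.softmax(s, dim=-1).max().item() for s in out.scores]

print(greedy_confidence("The capital of France is"))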
Advanced Technical Insights
eNTK Stability Assumption
The analysis relies on the relative stability of K^t(x_o, x_u) during training. While full lazy training isn't required, the researchers verify this assumption empirically across model scales.
Connection to Neural Tangent Theory
The decomposition leverages NTK theory but extends it to:
Sequential generation tasks
Multiple optimization objectives
Practical finite-width networks
Relationship to Other Phenomena
Grokking: Similar sudden transitions in learning dynamics
Double descent: Related to probability mass redistribution
Lottery ticket hypothesis: Connections to gradient flow patterns
Future Research Directions
Theoretical extensions: Formal analysis of multi-step dynamics
Algorithm improvements: Better negative gradient handling
Architectural considerations: How model design affects learning dynamics
Scaling behavior: Dynamics in larger models and datasets
For more information and a deeper dive, see the actual paper.
We also have a reference implementation of this paper here.
And that's all for today ✅
Share this with someone who may need it using the button below.
And subscribe to the Full-Stack Data Science Newsletter, if you haven't yet, to learn real AI skills and keep yourself updated.