The Mathematics Behind LLM Finetuning Failures: A Deep Technical Analysis
New ICLR 2025 research reveals the hidden dynamics that make your models hallucinate – with the math to prove it
Welcome back to the Full-Stack Data Science Newsletter. A lot more exclusive content is coming your way, so subscribe if you haven't yet.
The Technical Problem Statement
You've implemented DPO, IPO, or other preference optimization methods. Your loss curves look perfect. Your metrics are improving. But your model exhibits concerning behaviors:
Repetitive generation patterns (the "repeater" phenomenon)
Decreased confidence on ALL responses during off-policy DPO
Hallucination amplification after instruction tuning
Performance degradation with extended training
The root cause? A fundamental misunderstanding of learning dynamics in autoregressive models.
The Mathematical Framework
Learning Dynamics Decomposition
The researchers formalize how a parameter update Δθ influences the change in the model's prediction Δf(x_o):
Δ log π^t(y | x_o) = -η A^t(x_o) K^t(x_o, x_u) G^t(x_u, y_u) + O(η²)
Where:
A^t(x_o) = I - 1π^⊤(x_o): Adaptation matrix (depends only on current predictions)
K^t(x_o, x_u): Empirical Neural Tangent Kernel (similarity between examples)
G^t(x_u, y_u): Gradient residual (update direction and magnitude)
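To make the decomposition concrete, here is a minimal numerical sketch (my own illustration, not the paper's code) that assumes a linear softmax classifier, where the eNTK collapses to K^t(x_o, x_u) = (x_o·x_u)·I. It checks that one SGD step on (x_u, y_u) changes log π(y|x_o) by approximately -η A K G:

import numpy as np

rng = np.random.default_rng(0)
d, V = 8, 5                                   # input dimension, number of classes
W = rng.normal(size=(V, d))                   # linear model: logits z(x) = W @ x
x_u, x_o = rng.normal(size=d), rng.normal(size=d)
y_u = 2                                       # label used for the update on x_u
eta = 1e-3                                    # small step so the O(eta^2) term is negligible

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

pi_u, pi_o = softmax(W @ x_u), softmax(W @ x_o)

# Decomposition terms for the linear model
A = np.eye(V) - np.outer(np.ones(V), pi_o)    # A^t(x_o) = I - 1 pi^T(x_o)
K = (x_o @ x_u) * np.eye(V)                   # eNTK of a linear model
G = pi_u.copy(); G[y_u] -= 1.0                # cross-entropy residual: pi - e_{y_u}
predicted = -eta * A @ K @ G

# Actual change after one SGD step on the cross-entropy loss at (x_u, y_u)
W_new = W - eta * np.outer(G, x_u)            # grad_W CE = (pi - e_y) x^T
actual = np.log(softmax(W_new @ x_o)) - np.log(softmax(W @ x_o))

print(np.abs(predicted - actual).max())       # tiny: agreement up to O(eta^2)

The same bookkeeping carries over to a full transformer, where K is no longer diagonal and the similarity between examples drives cross-example influence.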
Extension to Sequential Generation
For autoregressive models, this becomes significantly more complex:
[Δ log π^t(y | χ_o)]_m = -∑_{l=1}^L η[A^t(χ_o)]_m[K^t(χ_o, χ_u)]_{m,l}[G^t(χ_u)]_l + O(η²)
Where χ represents the concatenated prompt-response sequence, and the eNTK now depends on both inputs AND responses.
The Squeezing Effect: A Mathematical Proof
For Standard Classification
When applying negative gradients to unlikely labels in softmax-based models, the researchers prove:
Guarantees:
π_{θ^{t+1}}(y^-) will decrease (intended effect)
Probability mass concentrates on y* = argmax_{i≠y^-} π_θ^t(i)
Mathematical relationship:
π_{θ^{t+1}}(y*) / π_θ^t(y*) > π_{θ^{t+1}}(y_i) / π_θ^t(y_i) for all i ≠ y*, y^-
The Pathological Case
When π_θ^t(y^-) is very small (common in pretrained models), the effect amplifies:
Rich get richer: High-probability tokens increase further
Poor get poorer: Low-probability tokens decrease more severely
Probability mass redistribution: Nearly all mass flows to the current argmax
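To see the squeezing effect numerically, here is a toy sketch (the same illustrative softmax setup as above, with directly parameterized logits): one gradient step that only pushes down log π(y^-) still redistributes nearly all the mass onto the current argmax. The large step size is just to make the effect easy to read off.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([4.0, 2.5, 1.0, 0.0, -6.0])      # y^- (index 4) is already very unlikely
y_neg = 4
pi = softmax(z)

# Pure negative gradient on y^-: descend on log pi(y^-), whose logit gradient is e_{y^-} - pi
eta = 5.0
grad = -pi.copy(); grad[y_neg] += 1.0
pi_new = softmax(z - eta * grad)

print("before:", pi.round(4))
print("after :", pi_new.round(4))
print("ratio :", (pi_new / pi).round(3))
# The argmax (index 0) gets the largest ratio pi^{t+1}/pi^t among all i != y^-,
# while the other low-probability classes shrink fastest: rich get richer.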
Algorithm-Specific Analysis
DPO Learning Dynamics
For Direct Preference Optimization:
G^t_DPO+ = β(1-a)[π_θ^t(y|χ^+) - y^+]
G^t_DPO- = -β(1-a)[π_θ^t(y|χ^-) - y^-]
Where a is the sigmoid of the current reward margin: a = σ(β log π_θ^t(y^+)/π_ref(y^+) - β log π_θ^t(y^-)/π_ref(y^-))
Key insight: The strength of updates automatically adjusts based on current separation quality.
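This adaptive scaling is easy to verify with autograd. The sketch below (a toy illustration, not the paper's or any library's implementation) feeds made-up summed sequence log-probabilities into the DPO loss; the gradients that come back are exactly -β(1-a) on the chosen and +β(1-a) on the rejected log-probability:

import torch
import torch.nn.functional as F

beta = 0.1

# Summed log-probabilities of y+ and y- under the policy (toy numbers) and the frozen reference
logp_chosen = torch.tensor(-42.0, requires_grad=True)
logp_rejected = torch.tensor(-55.0, requires_grad=True)
ref_logp_chosen, ref_logp_rejected = -40.0, -50.0

# DPO loss: -log sigmoid(beta * difference of policy-vs-reference log-ratios)
u = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
loss = -F.logsigmoid(u)
loss.backward()

a = torch.sigmoid(u)                 # current separation quality
print(logp_chosen.grad)              # -beta * (1 - a): push y+ up, weakly once a -> 1
print(logp_rejected.grad)            # +beta * (1 - a): push y- down by the same amount

Chaining these scalars through ∂ log π / ∂z gives the per-token residuals above, including the sign flip on the rejected response.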
Why Off-Policy DPO Fails
In off-policy settings:
Both y^+ and y^- are typically low-probability under π_θ^t
Large negative gradients on y^- trigger severe squeezing
Probability mass flows to responses dissimilar to y^+
Model becomes overconfident in unrelated high-probability sequences
On-Policy vs Off-Policy Dynamics
Off-policy:
y^+ and y^- are sampled from a distribution other than π_θ^t
Often in "valley" regions of π_θ^t
Squeezing effect dominates
On-policy:
y^+ and y^- sampled from π_θ^t
Higher initial probability → weaker squeezing
More stable training dynamics
Experimental Validation
Datasets & Models
Datasets: Anthropic-HH, UltraFeedback
Models: Pythia (410M-2.8B), Qwen1.5 (0.5B-1.8B)
Methodology: Track log π_θ^t(y|χ) for various response types
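That probing quantity is straightforward to compute with Hugging Face Transformers. A minimal sketch (using EleutherAI/pythia-410m as an example checkpoint; response_logprob is my own helper name, and it assumes the prompt's tokenization is a prefix of the full sequence's):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/pythia-410m"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

@torch.no_grad()
def response_logprob(prompt: str, response: str) -> float:
    """Summed log pi_theta(y_l | prompt, y_<l) over the response tokens."""
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full = tok(prompt + response, return_tensors="pt").input_ids
    logprobs = torch.log_softmax(model(full).logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, full[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[0, n_prompt - 1:].sum().item()   # keep only response positions

# Track this for y+, y-, rephrases, and unrelated responses at every checkpoint.
print(response_logprob("Question: What is DPO?\nAnswer:", " Direct Preference Optimization."))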
Key Findings
SFT Dynamics:
Direct responses (y^+) increase as expected
Similar responses initially increase, then decrease
Dissimilar responses monotonically decrease
Cross-contamination: responses from other training examples increase
DPO Dynamics:
Both y^+ and y^- decrease over time
Rephrases follow similar patterns with scaled magnitude
Greedy decoding confidence increases dramatically
Margin π_θ^t(y^+) - π_θ^t(y^-) improves despite absolute decreases
Hallucination Mechanism
The framework explains specific hallucination patterns:
Cross-question bleeding: Learning [x_u; y^+_u] increases π_θ^t(y^+_j|x_u) for j≠u
Phrase repetition: Squeezing effect concentrates mass on high-frequency patterns
Confidence miscalibration: Models become overconfident in squeezed responses
The Proposed Solution
Data Augmentation Strategy
Counterintuitive approach: Train on both preferred AND rejected responses during SFT:
L_extended = L_SFT(x_u, y^+_u) + L_SFT(x_u, y^-_u)
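In code, the extended objective is just two masked cross-entropy terms on the same prompt. A minimal sketch (hypothetical helper names; relies on the standard Hugging Face convention that label -100 is ignored in the loss):

import torch

def sft_loss(model, tok, prompt: str, response: str) -> torch.Tensor:
    """Cross-entropy on the response tokens only; prompt positions are masked out."""
    full = tok(prompt + response, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    labels = full.clone()
    labels[:, :n_prompt] = -100                 # ignore the prompt in the loss
    return model(full, labels=labels).loss      # the model shifts labels internally

def extended_sft_loss(model, tok, prompt, chosen, rejected):
    # L_extended = L_SFT(x, y+) + L_SFT(x, y-)
    return sft_loss(model, tok, prompt, chosen) + sft_loss(model, tok, prompt, rejected)

During the extended SFT phase you would backprop through extended_sft_loss in place of the usual single-response SFT loss; the DPO phase afterwards is unchanged.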
Why this works:
Increases π_θ^t(y^-) before DPO phase
Reduces squeezing effect magnitude
Preserves probability mass in relevant regions
Enables more effective preference learning
Implementation Details
Phase 1 (Extended SFT):
Train on both y^+ and y^- responses
Increases overall probability mass in relevant regions
Reduces initial likelihood gap
Phase 2 (Standard DPO):
Apply preference optimization as normal
Reduced squeezing due to higher initial π_θ^t(y^-)
More stable training dynamics
Results
Win rates against baseline:
After 2 DPO epochs: 65.18% (ChatGPT eval), 51.51% (Claude eval)
After 4 DPO epochs: 69.28% (ChatGPT eval), 60.45% (Claude eval)
Implications for Practice
For Algorithm Design
Gradient magnitude matters: Large negative gradients on unlikely sequences are dangerous
Initialization is crucial: Pretrained model biases heavily influence dynamics
On-policy methods: Natural mitigation of squeezing effects
For Training Protocols
Monitor all response types: Not just y^+ and y^-
Early stopping: Extended training can amplify pathological behaviors
Data curation: Balance between positive and negative examples
For Model Evaluation
Beyond accuracy: Track confidence calibration and response diversity
Probing datasets: Essential for understanding training dynamics
Greedy decoding analysis: Reveals squeezing effect magnitude
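As a concrete probe for the last point, the sketch below (an illustration; greedy_confidence is my own helper, and the checkpoint name is a placeholder for your finetuned model) records the probability of each greedily chosen token, which tends to climb as the squeezing effect concentrates mass:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/pythia-410m"        # placeholder: point this at your finetuned checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

@torch.no_grad()
def greedy_confidence(prompt: str, max_new_tokens: int = 32):
    """Probability assigned to each greedily decoded token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, do_sample=False, max_new_tokens=max_new_tokens,
                         output_scores=True, return_dict_in_generate=True)
    return [torch.softmax(s, dim=-1).max().item() for s in out.scores]

print(greedy_confidence("The capital of France is"))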
Advanced Technical Insights
eNTK Stability Assumption
The analysis relies on the relative stability of K^t(x_o, x_u) during training. While full lazy training isn't required, the researchers verify this assumption empirically across model scales.
Connection to Neural Tangent Theory
The decomposition leverages NTK theory but extends it to:
Sequential generation tasks
Multiple optimization objectives
Practical finite-width networks
Relationship to Other Phenomena
Grokking: Similar sudden transitions in learning dynamics
Double descent: Related to probability mass redistribution
Lottery ticket hypothesis: Connections to gradient flow patterns
Future Research Directions
Theoretical extensions: Formal analysis of multi-step dynamics
Algorithm improvements: Better negative gradient handling
Architectural considerations: How model design affects learning dynamics
Scaling behavior: Dynamics in larger models and datasets
For more information and a deeper dive, see the actual paper.
We also have a reference implementation of this paper here.
And that's all for today ✅
Share this with someone who may need it using the button below.
And subscribe to the Full-Stack Data Science Newsletter, if you haven't yet, to learn real AI skills and keep yourself updated.