By Gopi Krishna Tummala
Post-Training and Parameter-Efficient Fine-Tuning
Post-training transforms pre-trained models into production-ready systems. Full fine-tuning is computationally expensive and prone to catastrophic forgetting. Parameter-Efficient Fine-Tuning (PEFT) methods address these limitations by adapting models with minimal parameter updates.
The PEFT Ecosystem: The Adapter Zoo
LoRA (Low-Rank Adaptation) Explained in Detail
LoRA is one of the most important practical inventions in the era of Large Language Models (2021–2026). It lets you customize and fine-tune gigantic models—like LLaMA-3 (70B+), Mistral, or Stable Diffusion—without modifying any of the original weights, using a fraction of the GPU memory and storage traditionally required.
The Core Problem It Solves
When you fine-tune a standard 70-billion-parameter model the classic way (Full Fine-Tuning):
- Massive Hardware Requirements: You must store gradients, optimizer states (like Adam, which requires 2–3× more memory), and updated weights. This often demands >500 GB of VRAM just to train.
- Storage Bloat: After training, you generate one massive new checkpoint per task (70 GB+ each).
- Catastrophic Forgetting: The model risks losing its foundational general knowledge while over-indexing on your new, narrow task.
LoRA elegantly bypasses almost all of these bottlenecks.
The Big Intuition (The Key Insight)
The authors (Edward Hu et al., Microsoft, 2021) observed something profound during full fine-tuning experiments:
The change in weights (ΔW) when you fine-tune on a new task has a very “low intrinsic rank.”
In plain English: The adjustment the model needs to learn a new task is not scattered randomly across every single neural connection. The necessary changes live in a much lower-dimensional subspace.
Instead of learning a massive ΔW matrix (the exact same size as the original weight matrix), you can approximate that change using the product of two much smaller matrices.
The Mathematics of LoRA
Take any linear layer in a Transformer (such as an attention projection or feed-forward layer).
In the original forward pass, the output h is computed as:
h = W₀ x
(where W₀ is the original, frozen weight matrix of dimension d × k—often massive, e.g., 4096 × 4096).
In full fine-tuning, the model updates the weights by adding a massive delta matrix:
h = (W₀ + ΔW) x
(where ΔW is also d × k, resulting in millions to billions of trainable parameters per layer).
LoRA instead computes:
h = W₀ x + (α/r) · (B A) x
Breaking down the variables:
- A ∈ ℝ^{r × k}: Usually initialized randomly (Gaussian).
- B ∈ ℝ^{d × r}: Usually initialized to zeros (so at the very start of training, BA = 0, meaning LoRA initially acts as a no-op and doesn't disrupt the model). The product BA therefore has the same d × k shape as ΔW.
- r (rank): A tiny hyperparameter. Common values are 4, 8, 16, 32, or 64. (Surprisingly, r=8 is often enough!)
- α (alpha): A scaling factor commonly set to 2×r or fixed at 16–32 to control the magnitude of the adapter’s influence.
The Parameter Savings: Trainable parameters drop from d·k down to just d·r + r·k.
For a typical LLaMA-7B attention layer (d=k≈4096):
- Full fine-tune: ~16.8 million parameters.
- LoRA (r=8): ~65,000 parameters.
This results in ~260× fewer parameters to train. Across the entire model, usually only 0.1% to 1% of the original parameters become trainable.
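To make the shapes and the savings concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer (an illustration under the dimensions above, not a library implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen W0 of shape (d, k) plus a trainable low-rank update (alpha/r) * B A."""
    def __init__(self, d: int, k: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.W0 = nn.Linear(k, d, bias=False)            # original projection
        self.W0.weight.requires_grad_(False)             # frozen
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(d, r))         # zero init -> BA = 0 at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha/r) * B A x
        return self.W0(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(d=4096, k=4096, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 = 4096*8 + 8*4096, vs. ~16.8M for the full matrix
```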
Training vs. Inference
During Training:
- Freeze all original weights (W₀).
- Train only the A and B matrices.
- The forward pass computes both paths and adds them together.
- GPU memory is dominated by holding the frozen model in place, plus the small gradients and optimizer states required for A and B.
During Inference (The Superpower):
- You can mathematically merge the trained adapters back into the base model before serving:
W_final = W₀ + (α/r) · (B A)
- Zero Latency Cost: After merging, the model operates with the exact same speed and memory footprint as the base model.
- Alternatively, modern inference engines (vLLM, TGI, llama.cpp) allow you to keep the adapters separate, rapidly hot-swapping different LoRAs for different users on the fly.
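As a quick sanity check on the merge formula, the following PyTorch sketch (toy dimensions, a pretend-trained adapter) verifies that the merged weight reproduces the two-path forward pass exactly:

```python
import torch

d, k, r, alpha = 64, 64, 8, 16.0
W0 = torch.randn(d, k)                 # frozen base weight
A = torch.randn(r, k) * 0.01           # pretend-trained adapter pair
B = torch.randn(d, r) * 0.01           # (non-zero after training)
x = torch.randn(5, k)

h_two_path = x @ W0.T + (alpha / r) * (x @ A.T @ B.T)   # training-time forward
W_merged = W0 + (alpha / r) * (B @ A)                   # W_final = W0 + (alpha/r) * B A
h_merged = x @ W_merged.T                               # inference-time forward

print(torch.allclose(h_two_path, h_merged, atol=1e-5))  # True
```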
Where do we inject LoRA matrices?
While you can apply LoRA anywhere, the modern consensus (2023–2026) is to target the attention mechanism:
- Standard approach: Query (`q_proj`) and Value (`v_proj`).
- Comprehensive approach: Query, Key, Value, and Output (`q_proj`, `k_proj`, `v_proj`, `o_proj`).
- Occasionally, practitioners also target the up/down projections in the MLP layers, though this increases the parameter count.
Many libraries (PEFT, Unsloth, Axolotl) default to `q_proj` + `v_proj` or q+k+v+o—already excellent.
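A typical setup with the Hugging Face PEFT library is sketched below; the model id is a placeholder, and the module names are the LLaMA-style projection names, so adjust them for your architecture:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder model id; any causal LM with LLaMA-style module names works similarly.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # or ["q_proj", "k_proj", "v_proj", "o_proj"]
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()        # typically ~0.1-1% of all parameters
```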
Why does it work so well? (Deeper Intuition)
- Low intrinsic dimension of task adaptation: fine-tuning mostly needs to "reorient" attention patterns in a low-dimensional way—not reinvent every weight.
- Strong gradient flow: because A and B are small, gradients don't vanish or explode as easily.
- No extra inference cost (after merge).
- Modularity: you can train many LoRAs for different tasks, swap them like plugins, and merge only when needed.
DoRA (Weight-Decomposed LoRA)
DoRA decouples the pre-trained weight matrix into two components: magnitude and direction.
The Insight: Full fine-tuning usually exhibits a negative correlation between magnitude and directional updates, whereas standard LoRA exhibits a positive correlation. DoRA forces the model to learn more like full fine-tuning.
Mechanism:
W' = m · (W₀ + ΔV) / ||W₀ + ΔV||_c
Where:
- m is the trainable magnitude vector (initialized to the column-wise norm of W₀)
- (W₀ + ΔV) / ||W₀ + ΔV||_c is the unit-norm direction component, built from the frozen W₀
- ΔV is the LoRA update (BA)
Why it matters: It achieves superior performance at lower ranks (e.g., r=4–8) across LLMs and VLMs without adding inference latency, as the weights can still be merged.
Performance: Closes 90–95% of the full fine-tuning gap at r=4–8.
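The reparameterization can be sketched directly in PyTorch (a conceptual illustration, not the reference implementation); recent releases of the PEFT library also expose DoRA via a `use_dora=True` flag on `LoraConfig`:

```python
import torch
import torch.nn as nn

d, k, r = 64, 64, 8
W0 = torch.randn(d, k)                               # frozen base weight
A = nn.Parameter(torch.randn(r, k) * 0.01)           # LoRA pair, as before
B = nn.Parameter(torch.zeros(d, r))
m = nn.Parameter(W0.norm(p=2, dim=0, keepdim=True))  # magnitude, init to ||W0||_c

def dora_weight() -> torch.Tensor:
    V = W0 + B @ A                                     # direction plus low-rank update
    return m * (V / V.norm(p=2, dim=0, keepdim=True))  # magnitude x unit direction

x = torch.randn(5, k)
h = x @ dora_weight().T                                # drop-in replacement for W x
```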
QLoRA & QDoRA
QLoRA (Quantized LoRA):
- Quantizes the frozen base model (W₀) to 4-bit precision (NormalFloat4)
- Computes gradients in 16-bit to update the LoRA adapters (A and B)
- Essential for fitting massive models on single GPUs
- Single-GPU 70B fine-tuning becomes feasible
QDoRA: DoRA + 4-bit quantization. Often beats full fine-tuning while using a fraction of the memory.
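A typical QLoRA setup with transformers, bitsandbytes, and PEFT looks like this sketch (the model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 for the frozen base
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,   # adapters/activations in 16-bit
)

# Placeholder model id; device_map="auto" shards across available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)
```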
PiSSA (Principal Singular Values and Singular Vectors Adaptation)
Instead of initializing the LoRA matrices randomly and with zeros, PiSSA initializes them using the principal singular values of the original weight matrix. This allows the adapter to capture the most critical structural information immediately.
Result: 2–3× faster convergence than standard LoRA.
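Conceptually, the initialization is an SVD of the frozen weight: the top-r singular directions seed B and A, and the residual replaces W₀. A hedged sketch (recent versions of the PEFT library also ship a PiSSA initialization option for `LoraConfig`):

```python
import torch

d, k, r = 256, 256, 8
W0 = torch.randn(d, k)                        # stand-in for a pre-trained weight

U, S, Vh = torch.linalg.svd(W0, full_matrices=False)
B = U[:, :r] * S[:r].sqrt()                   # (d, r): principal left vectors, scaled
A = S[:r].sqrt().unsqueeze(1) * Vh[:r, :]     # (r, k): principal right vectors, scaled
W_res = W0 - B @ A                            # residual, kept frozen in place of W0

# Training then proceeds exactly like LoRA, but B and A start aligned with the
# most important directions of W0 instead of at zero.
```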
Other Power Tools
| Method | Key Innovation | Best For |
|---|---|---|
| VeRA | Vector-based, ultra-light | Ultra-low parameter budgets |
| AdaLoRA | Dynamic rank allocation per layer | Multi-task scenarios |
| LongLoRA | Shifted sparse attention + LoRA | 100k+ context extension |
| RsLoRA | Rank-stabilized training | Better stability at low ranks |
| LoftQ | Quantization-aware initialization | Better QLoRA performance |
Comparison Table
| Method | Trainable Params | VRAM Needed (70B Model) | Inference Speed | Forgetting Risk | Mergeable? |
|---|---|---|---|---|---|
| Full Fine-Tune | ~70B+ | 8–16× A100s | Base Speed | High | N/A |
| LoRA (r=16) | ~20–80M | 1–2× A100s | Base Speed (Merged) | Low | Yes |
| QLoRA (4-bit) | ~20–80M | 24–40 GB | Base Speed (Merged) | Low | Yes |
| DoRA (2024) | Similar to LoRA | Similar to LoRA | Base Speed (Merged) | Very Low | Yes |
The Alignment Stack: From Messy RL to Clean Math
The industry has decisively moved away from brittle PPO-based RLHF. The new standard is direct optimization on preference data.
DPO (Direct Preference Optimization)
DPO skips the reward model entirely.
How it works:
- Directly optimizes policy on preference data (a “chosen” response y_w and a “rejected” response y_l)
- Mathematically maps the reward function directly to the optimal policy
- Uses a binary cross-entropy loss that increases the likelihood of the chosen response while decreasing the likelihood of the rejected one
- Implicitly constrained by a reference model (π_ref) to prevent diverging too far
The DPO Objective:
L_DPO = -log σ(β log(π_θ(y_w|x) / π_ref(y_w|x)) - β log(π_θ(y_l|x) / π_ref(y_l|x)))
Where:
- σ is the sigmoid function
- β is a temperature parameter
- π_ref is the reference model (usually the SFT model)
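A minimal PyTorch sketch of this loss, assuming you have already summed per-token log-probabilities into per-sequence log-probs under both the policy and the reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    # log(pi_theta / pi_ref) for the chosen and rejected responses
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()      # -log sigma(...), averaged over the batch

# Toy batch of 4 preference pairs (sequence-level log-probabilities).
loss = dpo_loss(torch.tensor([-12.0, -8.0, -9.5, -11.0]),
                torch.tensor([-14.0, -9.0, -13.0, -10.5]),
                torch.tensor([-12.5, -8.2, -10.0, -11.3]),
                torch.tensor([-13.0, -9.1, -12.0, -10.8]))
```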
Advantages:
- No separate reward model needed
- More stable than RLHF
- Faster to train
- Better sample efficiency
ORPO (Odds Ratio Preference Optimization)
ORPO merges SFT + alignment into one stage.
Mechanism:
- Adds an odds ratio penalty to the standard negative log-likelihood loss
- Penalizes the generation of rejected responses during the initial instruction tuning phase
- Saves 30–50% compute compared to SFT → DPO pipeline
- Strong safety performance
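The odds-ratio penalty can be sketched as follows (a conceptual illustration using length-averaged log-probabilities; the weight `lam` is a hyperparameter, and the full ORPO loss adds this term to the ordinary SFT loss on the chosen response):

```python
import torch
import torch.nn.functional as F

def odds_ratio_penalty(chosen_logps, rejected_logps, lam: float = 0.1):
    # odds(y|x) = p / (1 - p); computed stably in log space from log p in (-inf, 0)
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    log_odds_ratio = log_odds_chosen - log_odds_rejected
    return -lam * F.logsigmoid(log_odds_ratio).mean()

# chosen/rejected_logps: average per-token log-probabilities for each response.
penalty = odds_ratio_penalty(torch.tensor([-0.8, -1.2]), torch.tensor([-2.0, -1.9]))
```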
KTO (Kahneman-Tversky Optimization)
KTO only needs “desirable” vs “undesirable” labels (no pairwise data).
Mechanism: Inspired by prospect theory—penalizes bad outcomes more than it rewards good ones.
Advantages:
- Works with weak supervision (e.g., user thumbs-down logs)
- No need for carefully curated preference pairs
- Effective for noisy or safety data
Newer Contenders (2025–2026)
- SimPO: Even simpler DPO variant with length normalization
- IPO (Identity Preference Optimization): Better handling of noisy preferences
- CPO (Contrastive Preference Optimization): Improved stability
Post-Training for Multimodal & Diffusion Models
PEFT is no longer restricted to LLMs; the same adapters have spread to vision-language and diffusion models.
Vision-Language Models (VLMs)
For models like LLaVA, Qwen-VL, autonomous driving stacks:
- Apply LoRA/DoRA to:
- Vision encoder layers
- Cross-attention layers (vision → language)
- MLP projector layers
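A hedged configuration sketch: aside from the LLaMA-style `q_proj`/`v_proj`, the module names below are hypothetical placeholders, since they differ across VLMs—inspect `model.named_modules()` to find the real targets:

```python
from peft import LoraConfig

vlm_lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "v_proj",   # language-model attention (LLaMA-style names)
        "vision_q_proj",      # hypothetical: vision-encoder / cross-attention projection
        "mm_projector",       # hypothetical: vision-to-language MLP projector
    ],
)
```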
Mask-Aware LoRA (2025 breakthrough):
- During fine-tuning, route frozen base weights to background and adapters to masked regions
- Result: localized edits with strong temporal consistency
- Critical for autonomous driving edge cases where you need to modify specific objects without affecting the scene
Diffusion Models
For Stable Diffusion 3, Flux, Video models:
Standard approach:
- LoRA on the denoiser's attention blocks (U-Net self-/cross-attention for SD 1.5/SDXL; MMDiT transformer blocks for SD3 and Flux)
- Typically r=4–16 depending on task complexity
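For inference, a typical diffusers-style usage sketch looks like this (checkpoint and adapter ids are placeholders, and the right pipeline class depends on the model family):

```python
import torch
from diffusers import AutoPipelineForText2Image

# Placeholder ids: substitute your base checkpoint and trained LoRA adapter.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("my-org/my-style-lora")   # attach the adapter
pipe.fuse_lora(lora_scale=0.8)                   # optionally merge for zero overhead

image = pipe("a watercolor lighthouse at dusk").images[0]
image.save("out.png")
```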
AudioLDM / MusicGen style:
- LoRA on CLAP-conditioned U-Net for genre-specific audio
- Allows fine-tuning for specific musical styles or audio domains
Temporal Consistency LoRA:
- Injects adapters into temporal layers for video
- Maintains consistency across frames
- Critical for video editing and generation tasks
Emerging techniques:
- Diffusion-DRF (Differentiable Reward Flow): Direct reward-guided video fine-tuning
- MMaDA-style unified diffusion LLMs: Combining diffusion and language models
Mask-Aware LoRA for Video Editing
Standard Image-to-Video (I2V) models struggle with localized edits (e.g., changing the color of a moving car while keeping the background static).
The Technique: Instead of naive per-video fine-tuning, researchers now use Mask-guided LoRA. A spatial mask is applied during the LoRA fine-tuning process. The model learns to route its attention:
- Using the frozen base weights for the background (spatial structure and motion cues)
- Utilizing the LoRA adapters strictly for the masked region (appearance control)
Result: High temporal consistency without bleeding the style transfer into the environment.
Supervised Fine-Tuning (SFT)
Instruction Tuning
- Format data as instruction-response pairs
- Use consistent chat templates (ChatML, Llama-3 format, etc.)
- Include diverse task types in your dataset
Catastrophic Forgetting Mitigation
Fine-tuning on new data can cause the model to forget pre-training knowledge.
Mitigation strategies:
- Replay buffers: Mix a small percentage of pre-training data into fine-tuning batches
- LoRA on critical layers only: Freeze more layers, only adapt attention or specific modules
- Multi-task training: Train on both new and old tasks simultaneously
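A sketch of the replay-buffer idea with the Hugging Face `datasets` library (file paths are placeholders):

```python
from datasets import load_dataset, interleave_datasets

task_ds = load_dataset("json", data_files="my_task_data.jsonl", split="train")
replay_ds = load_dataset("json", data_files="pretrain_sample.jsonl", split="train")

# Roughly 5% of examples come from the pre-training distribution.
mixed = interleave_datasets(
    [task_ds, replay_ds],
    probabilities=[0.95, 0.05],
    seed=42,
    stopping_strategy="first_exhausted",
)
```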
Chat Formatting
Modern LLMs expect specific chat formats. Common formats:
- ChatML: `<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n`
- Llama-3: `<|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n`
- Vicuna: `USER: {prompt}\nASSISTANT:`
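Rather than hand-writing these strings, it is safer to let the tokenizer apply the model's own template (the model id below is a placeholder):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [{"role": "user", "content": "Explain LoRA in one sentence."}]

# Renders the conversation with the model's own special tokens.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```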
Adapter Merging and Inference
Merging Adapters
- vLLM, TGI, and Outlines support merged adapter serving natively
- Merging eliminates inference overhead
- Use the PEFT `merge_and_unload()` method (or your framework's equivalent)
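A minimal merging sketch with PEFT (model and adapter paths are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

merged = model.merge_and_unload()             # folds (alpha/r) * B A into W0
merged.save_pretrained("llama3-8b-merged")    # serve like any ordinary checkpoint
```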
Inference Options
- Merged: Zero latency cost, same speed as base model
- Separate: Keep adapters separate for hot-swapping (supported by vLLM, TGI, llama.cpp)
Hyperparameters
For LoRA/DoRA:
- Learning rate: 1e-4 to 5e-4 (10× base model LR)
- Rank: Common values are 4, 8, 16, 32
- Alpha: Usually 2× rank (alpha=16 for r=8)
- Target modules: `q_proj`, `v_proj` for attention; `gate_proj`, `up_proj`, `down_proj` for MLP
For DPO/ORPO:
- Learning rate: 5e-6 to 1e-5 (lower than SFT)
- Beta: 0.1 to 0.5 (higher = stronger regularization)
- Reference model: Use your SFT checkpoint
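A hedged sketch of DPO training with the TRL library using these hyperparameters; the model and dataset ids are placeholders, and some keyword names vary across TRL versions (e.g., `processing_class` vs. the older `tokenizer`):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("my-org/sft-checkpoint")    # your SFT model
tokenizer = AutoTokenizer.from_pretrained("my-org/sft-checkpoint")
dataset = load_dataset("json", data_files="prefs.jsonl", split="train")  # prompt/chosen/rejected

args = DPOConfig(
    output_dir="dpo-out",
    beta=0.1,                        # regularization toward the reference model
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    num_train_epochs=1,
)
trainer = DPOTrainer(model=model, ref_model=None, args=args,
                     train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```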
For KTO:
- Learning rate: Similar to DPO
- Desirable weight: 1.0
- Undesirable weight: 1.0 to 2.0 (penalize bad more)
Related Resources
- ML Cheatsheets: Tree-Based Machine Learning - Comprehensive overview of ensemble methods
- Life of a Tensor: Production Inference - Deep dive into inference optimization
References
- DoRA: Weight-Decomposed Low-Rank Adaptation
- DPO: Direct Preference Optimization
- ORPO: Odds Ratio Preference Optimization
- KTO: Kahneman-Tversky Optimization
- PiSSA: Principal Singular Values and Singular Vectors Adaptation
This guide reflects the state of the art as of Q1 2026. The field moves fast—check papers and implementations for the latest updates.