
The 2026 Post-Training Playbook: Mastering PEFT, Alignment, and Multimodal Adaptation

By Gopi Krishna Tummala


Post-Training and Parameter-Efficient Fine-Tuning

Post-training transforms pre-trained models into production-ready systems. Full fine-tuning is computationally expensive and prone to catastrophic forgetting. Parameter-Efficient Fine-Tuning (PEFT) methods address these limitations by adapting models with minimal parameter updates.

The PEFT Ecosystem: The Adapter Zoo

LoRA (Low-Rank Adaptation) Explained in Detail

LoRA is one of the most important practical inventions in the era of Large Language Models (2021–2026). It allows you to customize and fine-tune gigantic models—like LLaMA-3 (70B+), Mistral, or Stable Diffusion—without touching almost any of the original weights. It achieves this using a fraction of the GPU memory and storage traditionally required.

The Core Problem It Solves

When you fine-tune a standard 70-billion-parameter model the classic way (Full Fine-Tuning):

  • Massive Hardware Requirements: You must store gradients, optimizer states (like Adam, which requires 2–3× more memory), and updated weights. This often demands >500 GB of VRAM just to train.
  • Storage Bloat: After training, you generate one massive new checkpoint per task (70 GB+ each).
  • Catastrophic Forgetting: The model risks losing its foundational general knowledge while over-indexing on your new, narrow task.

LoRA elegantly bypasses almost all of these bottlenecks.

The Big Intuition (The Key Insight)

The authors (Edward Hu et al., Microsoft, 2021) observed something profound during full fine-tuning experiments:

The change in weights (ΔW) when you fine-tune on a new task has a very “low intrinsic rank.”

In plain English: The adjustment the model needs to learn a new task is not scattered randomly across every single neural connection. The necessary changes live in a much lower-dimensional subspace.

Instead of learning a massive ΔW matrix (the exact same size as the original weight matrix), you can approximate that change using the product of two much smaller matrices.

The Mathematics of LoRA

Take any linear layer in a Transformer (such as an attention projection or feed-forward layer).

In the original forward pass, the output h is computed as:

h = W₀ x

(where W₀ is the original, frozen weight matrix of dimension d × k—often massive, e.g., 4096 × 4096).

In full fine-tuning, the model updates the weights by adding a massive delta matrix:

h = (W₀ + ΔW) x

(where ΔW is also d × k, resulting in millions to billions of trainable parameters per layer).

LoRA instead computes:

h = W₀ x + (α/r) · (B A) x

Breaking down the variables:

  • A ∈ ℝ^{r × k}: Usually initialized randomly (Gaussian).
  • B ∈ ℝ^{d × r}: Usually initialized to zeros (so at the very start of training, BA = 0, meaning LoRA initially acts as a no-op and doesn’t disrupt the model).
  • r (rank): A tiny hyperparameter. Common values are 4, 8, 16, 32, or 64. (Surprisingly, r=8 is often enough!)
  • α (alpha): A scaling factor commonly set to 2×r or fixed at 16–32 to control the magnitude of the adapter’s influence.

The Parameter Savings: Trainable parameters drop from d·k down to just d·r + r·k.

For a typical LLaMA-7B attention layer (d=k≈4096):

  • Full fine-tune: ~16.8 million parameters.
  • LoRA (r=8): ~65,000 parameters.

This results in ~260× fewer parameters to train. Across the entire model, usually only 0.1% to 1% of the original parameters become trainable.
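To make the shapes concrete, here is a minimal, self-contained PyTorch sketch of a LoRA-wrapped linear layer. It is illustrative only; real implementations such as Hugging Face PEFT add dropout, dtype handling, and merging utilities.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen W0 plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze W0 (and bias, if any)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # small Gaussian init
        self.B = nn.Parameter(torch.zeros(d, r))         # zero init -> BA = 0 at the start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha/r) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Example: wrap a 4096x4096 projection; only ~65K of ~16.8M parameters are trainable.
layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8, alpha=16)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 65536
```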

Training vs. Inference

During Training:

  • Freeze all original weights (W₀).
  • Train only the A and B matrices.
  • The forward pass computes both paths and adds them together.
  • GPU memory is dominated by the frozen base weights; the gradients and optimizer states for A and B are tiny by comparison.

During Inference (The Superpower):

  • You can mathematically merge the trained adapters back into the base model before serving:

W_final = W₀ + (α/r) · (B A)

  • Zero Latency Cost: After merging, the model operates with the exact same speed and memory footprint as the base model.
  • Alternatively, modern inference engines (vLLM, TGI, llama.cpp) allow you to keep the adapters separate, rapidly hot-swapping different LoRAs for different users on the fly.

Where do we inject LoRA matrices?

While you can apply LoRA anywhere, modern consensus (2023–2026) dictates targeting the attention mechanism:

  • Standard approach: Query (q_proj) and Value (v_proj).
  • Comprehensive approach: Query, Key, Value, and Output (q_proj, k_proj, v_proj, o_proj).
  • Occasionally, practitioners will also target the up/down projections in the MLP layers, though this increases parameter count.

Many libraries (PEFT, Unsloth, Axolotl) default to q_proj + v_proj or all four attention projections, which is already an excellent starting point.
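With Hugging Face PEFT, the injection targets are expressed via target_modules. A hedged sketch (the model checkpoint is illustrative, and module names vary by architecture):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model; substitute your own checkpoint.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # standard approach; add "k_proj", "o_proj" for the comprehensive one
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```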

Why does it work so well? (Deeper Intuition)

  1. Low intrinsic dimension of task adaptation
    Fine-tuning mostly needs to “reorient” attention patterns in a low-dimensional way—not reinvent every weight.

  2. Gradient flow stays strong
    Because A and B are small, gradients don’t vanish/explode as easily.

  3. No extra inference cost (after merge)

  4. Modular—you can train many LoRAs for different tasks, swap them like plugins, merge only when needed.

DoRA (Weight-Decomposed LoRA)

DoRA decouples the pre-trained weight matrix into two components: magnitude and direction.

The Insight: Full fine-tuning usually exhibits a negative correlation between magnitude and directional updates, whereas standard LoRA exhibits a positive correlation. DoRA forces the model to learn more like full fine-tuning.

Mechanism:

W′ = m · (W₀ + ΔV) / ||W₀ + ΔV||_c

Where:

  • m is the trainable magnitude vector
  • ΔV = BA is the low-rank (LoRA-style) update applied to the frozen weight W₀
  • ||·||_c is the column-wise vector norm, so the normalized term is a pure direction that m then rescales

Why it matters: It achieves superior performance at lower ranks (e.g., r=4–8) across LLMs and VLMs without adding inference latency, as the weights can still be merged.

Performance: Closes 90–95% of the full fine-tuning gap at r=4–8.
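In Hugging Face PEFT, DoRA is exposed as a flag on the same LoraConfig (assuming a PEFT release that supports use_dora; the rest of the training loop is unchanged):

```python
from peft import LoraConfig

dora_config = LoraConfig(
    r=8,                  # DoRA tends to hold up well at low ranks
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_dora=True,        # decompose the update into magnitude + direction
    task_type="CAUSAL_LM",
)
```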

QLoRA & QDoRA

QLoRA (Quantized LoRA):

  • Quantizes the frozen base model (W₀) to 4-bit precision (NormalFloat4)
  • Computes gradients in 16-bit to update the LoRA adapters (A and B)
  • Essential for fitting massive models on single GPUs
  • Single-GPU 70B fine-tuning becomes feasible

QDoRA: DoRA + 4-bit quantization. Often beats full fine-tuning while using a fraction of the memory.
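A hedged QLoRA setup with transformers + bitsandbytes (NF4 base weights, bf16 compute; the model name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",         # illustrative; a 70B base fits on a single large GPU in 4-bit
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Standard LoRA adapters on top of the 4-bit base (set use_dora=True for QDoRA).
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
))
```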

PiSSA (Principal Singular Values and Singular Vectors Adaptation)

Instead of initializing the LoRA matrices randomly and with zeros, PiSSA initializes them from the principal singular values and singular vectors of the original weight matrix (keeping the residual frozen). This allows the adapter to capture the most critical structural information immediately.

Result: 2–3× faster convergence than standard LoRA.

Other Power Tools

| Method | Key Innovation | Best For |
| --- | --- | --- |
| VeRA | Vector-based, ultra-light | Ultra-low parameter budgets |
| AdaLoRA | Dynamic rank allocation per layer | Multi-task scenarios |
| LongLoRA | Shifted sparse attention + LoRA | 100k+ context extension |
| RsLoRA | Rank-stabilized training | Better stability at low ranks |
| LoftQ | Quantization-aware initialization | Better QLoRA performance |

Comparison Table

| Method | Trainable Params | VRAM Needed (70B Model) | Inference Speed | Forgetting Risk | Mergeable? |
| --- | --- | --- | --- | --- | --- |
| Full Fine-Tune | ~70B | 8–16× A100s | Base Speed | High | N/A |
| LoRA (r=16) | ~20–80M | 1–2× A100s | Base Speed (Merged) | Low | Yes |
| QLoRA (4-bit) | ~20–80M | 24–40 GB | Base Speed (Merged) | Low | Yes |
| DoRA (2024) | Similar to LoRA | Similar to LoRA | Base Speed (Merged) | Very Low | Yes |

The Alignment Stack: From Messy RL to Clean Math

The industry has decisively moved away from brittle PPO-based RLHF. The new standard is direct optimization on preference data.

DPO (Direct Preference Optimization)

DPO skips the reward model entirely.

How it works:

  • Directly optimizes policy on preference data (a “chosen” response y_w and a “rejected” response y_l)
  • Mathematically maps the reward function directly to the optimal policy
  • Uses a binary cross-entropy loss that increases the likelihood of the chosen response while decreasing the likelihood of the rejected one
  • Implicitly constrained by a reference model (π_ref) to prevent diverging too far

The DPO Objective:

L_DPO = -log σ(β log(π_θ(y_w|x) / π_ref(y_w|x)) - β log(π_θ(y_l|x) / π_ref(y_l|x)))

Where:

  • σ is the sigmoid function
  • β is a temperature parameter
  • π_ref is the reference model (usually the SFT model)
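A minimal PyTorch sketch of this objective, taking per-sequence log-probabilities (sums of token log-probs) as inputs. Production code (e.g., TRL's DPOTrainer) handles batching, masking, and the reference model for you; this only illustrates the math.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """L_DPO = -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log pi_theta/pi_ref for y_w
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log pi_theta/pi_ref for y_l
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Example with dummy log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -15.0]), torch.tensor([-20.0, -18.0]),
                torch.tensor([-13.0, -15.5]), torch.tensor([-19.0, -17.5]))
```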

Advantages:

  • No separate reward model needed
  • More stable than RLHF
  • Faster to train
  • Better sample efficiency

ORPO (Odds Ratio Preference Optimization)

ORPO merges SFT + alignment into one stage.

Mechanism:

  • Adds an odds ratio penalty to the standard negative log-likelihood loss
  • Penalizes the generation of rejected responses during the initial instruction tuning phase
  • Saves 30–50% compute compared to SFT → DPO pipeline
  • Strong safety performance
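A hedged sketch of the odds-ratio penalty that ORPO adds to the standard SFT loss. The exact weighting λ and the length normalization follow the ORPO paper; recent TRL releases also ship a trainer for this, so treat the snippet as illustration rather than a reference implementation.

```python
import torch
import torch.nn.functional as F

def orpo_penalty(chosen_logps, rejected_logps, lam=0.1):
    """Odds-ratio term added to the usual NLL loss on the chosen response.

    chosen_logps / rejected_logps: length-normalized sequence log-probabilities
    (mean token log-prob), so exp(.) lies in (0, 1).
    """
    def log_odds(logp):
        # log(p / (1 - p)) computed stably from log p
        return logp - torch.log1p(-torch.exp(logp))

    ratio = log_odds(chosen_logps) - log_odds(rejected_logps)
    return -lam * F.logsigmoid(ratio).mean()

# Total ORPO loss = NLL(chosen) + orpo_penalty(...), trained in a single SFT-style stage.
```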

KTO (Kahneman-Tversky Optimization)

KTO only needs “desirable” vs “undesirable” labels (no pairwise data).

Mechanism: Inspired by prospect theory—penalizes bad outcomes more than it rewards good ones.

Advantages:

  • Works with weak supervision (e.g., user thumbs-down logs)
  • No need for carefully curated preference pairs
  • Effective for noisy or safety data

Newer Contenders (2025–2026)

  • SimPO: Even simpler DPO variant with length normalization
  • IPO (Identity Preference Optimization): Better handling of noisy preferences
  • CPO (Contrastive Preference Optimization): Improved stability

Post-Training for Multimodal & Diffusion Models

PEFT is no longer restricted to LLMs; the same adapter techniques now extend to vision-language, diffusion, and audio models.

Vision-Language Models (VLMs)

For models like LLaVA, Qwen-VL, autonomous driving stacks:

  • Apply LoRA/DoRA to:
    • Vision encoder layers
    • Cross-attention layers (vision → language)
    • MLP projector layers

Mask-Aware LoRA (2025 breakthrough):

  • During fine-tuning, route frozen base weights to background and adapters to masked regions
  • Result: precisely localized edits with strong temporal consistency
  • Critical for autonomous driving edge cases where you need to modify specific objects without affecting the scene

Diffusion Models

For Stable Diffusion 3, Flux, Video models:

Standard approach:

  • LoRA on the attention blocks of the denoising backbone (U-Net in older models, DiT/MMDiT in SD3 and Flux), covering self-attention and cross-attention
  • Typically r=4–16 depending on task complexity
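For inference, diffusers pipelines can load such adapters directly. A hedged sketch (the checkpoint name and LoRA path are placeholders):

```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",  # illustrative base checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# Load a style LoRA trained on the attention blocks of the denoising backbone.
pipe.load_lora_weights("path/to/your_style_lora")        # placeholder path
image = pipe("a watercolor painting of a lighthouse at dusk").images[0]
```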

AudioLDM / MusicGen style:

  • LoRA on CLAP-conditioned U-Net for genre-specific audio
  • Allows fine-tuning for specific musical styles or audio domains

Temporal Consistency LoRA:

  • Injects adapters into temporal layers for video
  • Maintains consistency across frames
  • Critical for video editing and generation tasks

Emerging techniques:

  • Diffusion-DRF (Differentiable Reward Flow): Direct reward-guided video fine-tuning
  • MMaDA-style unified diffusion LLMs: Combining diffusion and language models

Mask-Aware LoRA for Video Editing

Standard Image-to-Video (I2V) models struggle with localized edits (e.g., changing the color of a moving car while keeping the background static).

The Technique: Instead of naive per-video fine-tuning, researchers now use Mask-guided LoRA. A spatial mask is applied during the LoRA fine-tuning process. The model learns to route its attention:

  • Using the frozen base weights for the background (spatial structure and motion cues)
  • Utilizing the LoRA adapters strictly for the masked region (appearance control)

Result: High temporal consistency without bleeding the style transfer into the environment.

Supervised Fine-Tuning (SFT)

Instruction Tuning

  • Format data as instruction-response pairs
  • Use consistent chat templates (ChatML, Llama-3 format, etc.)
  • Include diverse task types in your dataset

Catastrophic Forgetting Mitigation

Fine-tuning on new data can cause the model to forget pre-training knowledge.

Mitigation strategies:

  1. Replay buffers: Mix a small percentage of pre-training data into fine-tuning batches (see the sketch after this list)
  2. LoRA on critical layers only: Freeze more layers, only adapt attention or specific modules
  3. Multi-task training: Train on both new and old tasks simultaneously
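A minimal sketch of strategy 1 using Hugging Face datasets. The file names are placeholders, and the 5% mixing ratio is a knob to tune:

```python
from datasets import load_dataset, interleave_datasets

# Placeholders: your task-specific SFT data and a slice of general, pre-training-style text.
finetune_ds = load_dataset("json", data_files="sft_data.jsonl", split="train")
replay_ds = load_dataset("json", data_files="pretrain_sample.jsonl", split="train")

# ~5% replay data mixed into every batch to anchor general knowledge.
mixed_ds = interleave_datasets(
    [finetune_ds, replay_ds],
    probabilities=[0.95, 0.05],
    seed=42,
    stopping_strategy="first_exhausted",
)
```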

Chat Formatting

Modern LLMs expect specific chat formats. Common formats:

  • ChatML: <|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n
  • Llama-3: <|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
  • Llama-2 / Mistral: [INST] {prompt} [/INST]
  • Vicuna: USER: {prompt}\nASSISTANT:
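Rather than hand-writing these strings, it is safer to let the tokenizer apply its own template. A hedged sketch (the model name is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain LoRA in one sentence."},
]

# Produces the model's native chat format, including the assistant header that prompts generation.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```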

Adapter Merging and Inference

Merging Adapters

  • A merged adapter is just an ordinary checkpoint, so inference engines such as vLLM and TGI serve it natively
  • Merging eliminates inference overhead
  • Use peft.merge_and_unload() or equivalent

Inference Options

  • Merged: Zero latency cost, same speed as base model
  • Separate: Keep adapters separate for hot-swapping (supported by vLLM, TGI, llama.cpp)
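A hedged sketch of the merge path with PEFT (the base checkpoint and adapter path are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # illustrative base
model = PeftModel.from_pretrained(base, "path/to/lora_adapter")             # placeholder adapter path

# Fold W_final = W0 + (alpha/r) * B A into the base weights, then drop the adapter wrappers.
merged = model.merge_and_unload()
merged.save_pretrained("merged-model")  # serve with vLLM/TGI like any ordinary checkpoint
```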

Hyperparameters

For LoRA/DoRA:

  • Learning rate: 1e-4 to 5e-4 (10× base model LR)
  • Rank: Common values are 4, 8, 16, 32
  • Alpha: Usually 2× rank (alpha=16 for r=8)
  • Target modules: q_proj, v_proj for attention, gate_proj, up_proj, down_proj for MLP

For DPO/ORPO:

  • Learning rate: 5e-6 to 1e-5 (lower than SFT)
  • Beta: 0.1 to 0.5 (higher = stronger regularization)
  • Reference model: Use your SFT checkpoint

For KTO:

  • Learning rate: Similar to DPO
  • Desirable weight: 1.0
  • Undesirable weight: 1.0 to 2.0 (penalize bad more)

This guide reflects the state of the art as of Q1 2026. The field moves fast—check papers and implementations for the latest updates.