Gopi Krishna Tummala

ML Cheatsheets

Visual mindmaps for machine learning concepts - Complete ML course coverage

Intro to ML

Motivation, applications, types of learning, history, and ML pipeline

Data & Evaluation

Data preprocessing, train/val/test splits, cross-validation, and evaluation metrics

Classical Supervised Learning

Linear models, regularization, decision trees, and k-NN

Stats & Learning Theory

Generalization bounds, VC dimension, bias-variance, and PAC learning

Advanced Classical Models

Support Vector Machines and Bayesian methods

Ensemble Methods

Bagging, Random Forests, Boosting, and modern implementations

Tree-Based Machine Learning

A comprehensive overview of decision trees, ensemble methods, and boosting algorithms

Optimization for ML

Gradient descent variants, advanced optimizers, learning rate strategies

Modern Deep Learning

Neural networks, activations, training, architectures, and representation learning

Unsupervised Learning

Clustering, dimensionality reduction, and generative models

Probabilistic & Graphical Models

Mixture models, EM algorithm, Markov models, and Bayesian networks

Modern Topics / Extensions

Self-supervised learning, meta-learning, federated learning, RL, and continual learning

Interpretability & Fairness

Interpretability methods, SHAP, fairness definitions, and responsible AI

Scaling & Production ML

Large-scale training, hyperparameter tuning, MLOps, and AutoML

Project & Research Skills

Problem formulation, experiment design, model selection, and research mindset

Computer Vision

Classical image processing, deep learning architectures, detection/segmentation, generative models, 3D vision, and multimodal learning

Intro to ML

Motivation, applications, types of learning, history, and ML pipeline

🔹 Intro to ML
Motivation & Applications
Why ML?
Rules-based systems fail on complex data
Data is abundant & cheap
Human expertise is scarce/expensive
Killer Apps
Computer Vision
NLP / LLMs
Recommenders
Healthcare, Finance, Robotics
Types of Learning
Supervised
Labeled data
Regression (continuous)
Classification (discrete)
Unsupervised
No labels → discover patterns
Reinforcement
Agent + environment + rewards
Semi-/Self-supervised
Leverage unlabeled data heavily
History & Milestones
1950s–60s: Perceptron, early neural nets
1980s: Backpropagation
1990s: SVMs, Boosting
2010s: Deep Learning revolution (AlexNet → Transformers)
2020s: Foundation models, multimodal, agents
ML Pipeline (End-to-End)
Problem → Data → Features → Model → Eval → Deploy → Monitor
Bias–Variance & Generalization
Bias: Underfitting (high training error)
Variance: Overfitting (low training, high test error)
Decomposition: Error = Bias² + Variance + Noise
Goal: Minimize test error (generalization)

Data & Evaluation

Data preprocessing, train/val/test splits, cross-validation, and evaluation metrics

🔹 Data & Evaluation
Data Preprocessing
Cleaning
Missing values (impute / drop)
Outliers
Scaling
Min-Max, Standard, Robust, Quantile
Encoding
One-hot, Label, Target, Embeddings
Train/Val/Test Split
Simple 70/15/15 or 80/10/10
Stratified (for imbalance)
Time-series / Group splits
Cross-Validation
k-Fold, Stratified k-Fold
Leave-One-Out, Repeated CV
Nested CV (hyperparam tuning)
Evaluation Metrics
Classification
Accuracy, Precision, Recall, F1
ROC-AUC, PR-AUC
Confusion matrix, Calibration
Regression
MSE, RMSE, MAE, R², Adjusted R²
Ranking / Retrieval
NDCG, MAP, MRR
Imbalanced / Real-world
Fβ, Cohen's Kappa, Matthews Corr.
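The core classification metrics above follow directly from confusion-matrix counts; a minimal pure-Python sketch for the binary case (the helper name `prf1` is illustrative):

```python
def prf1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for a binary problem, from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

The guards against zero denominators matter in practice: a model that never predicts the positive class has undefined precision, which libraries typically report as 0.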

Classical Supervised Learning

Linear models, regularization, decision trees, and k-NN

🔹 Classical Supervised Learning
Linear Models
Linear Regression
Closed form: (XᵀX)⁻¹Xᵀy
Assumptions (linearity, homoscedasticity, independence)
Logistic Regression
Sigmoid + Cross-entropy
Multiclass: Softmax
Regularization
L2 (Ridge) → shrinks coefficients
L1 (Lasso) → sparsity / feature selection
Elastic Net (L1 + L2)
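The closed form above extends directly to Ridge: adding λI to XᵀX both regularizes and guarantees invertibility. A minimal NumPy sketch (with λ=0 it reduces to ordinary least squares):

```python
import numpy as np

def ridge_fit(X, y, lam=0.0):
    """Closed-form ridge: w = (X^T X + lam*I)^(-1) X^T y.  lam=0 gives OLS."""
    d = X.shape[1]
    # Solve the regularized normal equations rather than forming an inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

Note Lasso has no closed form (the L1 term is non-differentiable at 0); it needs coordinate descent or proximal methods.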
Decision Trees
Splitting: Gini / Entropy / MSE
Pruning: Cost-complexity, Reduced-error
Pros: Interpretable, non-linear
Cons: Unstable, greedy
k-Nearest Neighbors
Distance metrics: Euclidean, Manhattan, Cosine
Curse of dimensionality
Weighted KNN, Approximate NN (FAISS, HNSW)
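k-NN as listed above needs no training step at all: prediction is a distance sort plus a majority vote. A minimal sketch with Euclidean distance (function name is illustrative):

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    dists = sorted((math.dist(xi, x), yi) for xi, yi in zip(X_train, y_train))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

The full sort makes this O(n log n) per query, which is exactly why approximate indexes like FAISS and HNSW exist at scale.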

Stats & Learning Theory

Generalization bounds, VC dimension, bias-variance, and PAC learning

🔹 Stats & Learning Theory
Generalization Bounds
Hoeffding / Chernoff
Uniform convergence
VC Dimension
Shattering
Linear classifiers: VC = d+1
Sample complexity ≈ VC / ε²
Bias-Variance Decomposition
E[(y − ŷ)²] = Bias² + Var + σ²
PAC Learning
Probably Approximately Correct
Agnostic PAC, Realizable case
Other Key Ideas
No Free Lunch Theorem
Occam's Razor
Double Descent (modern view)

Advanced Classical Models

Support Vector Machines and Bayesian methods

🔹 Advanced Classical Models
Support Vector Machines
Max-margin hyperplane
Soft-margin (slack + C)
Kernel Trick
RBF: exp(−γ‖x−x′‖²)
Polynomial, Sigmoid
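The RBF kernel above is a one-liner; a sketch showing it measures similarity that decays with squared distance:

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel: k(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)
```

By construction k(x, x) = 1, and larger γ makes the kernel more local (faster decay), which in an SVM trades smoother boundaries for tighter fits.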
Bayesian Methods
Bayes Rule
P(θ|D) ∝ P(D|θ)P(θ)
Naive Bayes
Gaussian, Multinomial, Bernoulli
Bayesian Networks
DAG + CPDs
Exact inference (variable elimination)
Approximate (MCMC, variational)

Ensemble Methods

Bagging, Random Forests, Boosting, and modern implementations

🔹 Ensemble Methods
Bagging
Bootstrap + aggregate
Reduces variance
Random Forests
Bagging + random feature subsets
OOB error, feature importance
Boosting
AdaBoost (exponential loss, weights)
Gradient Boosting (fit residuals)
Modern Boosting (Industry Standard)
XGBoost (regularized, approx splits, DART)
LightGBM (histogram, leaf-wise, GOSS)
CatBoost (ordered boosting, native categoricals)
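The "fit residuals" idea of gradient boosting can be shown end-to-end with 1D regression stumps; a toy sketch under squared loss, where each round's stump fits the current residuals (helper names are illustrative, not any library's API):

```python
import numpy as np

def fit_stump(x, r):
    """Best single-threshold regression stump on 1D data, minimizing SSE."""
    best = None
    for t in np.unique(x)[:-1]:          # largest value would leave right side empty
        left, right = r[x <= t], r[x > t]
        pl, pr = left.mean(), right.mean()
        sse = ((left - pl) ** 2).sum() + ((right - pr) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, pl, pr)
    _, t, pl, pr = best
    return lambda z: np.where(z <= t, pl, pr)

def gradient_boost(x, y, n_rounds=100, lr=0.1):
    """Additive model: start at the mean, each stump fits residuals y - pred."""
    pred = np.full(len(y), y.mean())
    for _ in range(n_rounds):
        pred = pred + lr * fit_stump(x, y - pred)(x)
    return pred
```

For squared loss the residual equals the negative gradient of the loss, which is why "fit residuals" and "gradient descent in function space" are the same statement here.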

Tree-Based Machine Learning

A comprehensive overview of decision trees, ensemble methods, and boosting algorithms

🔹 Tree-Based Machine Learning
Decision Trees
Structure
Root Node
Internal Nodes
Leaf Nodes
Depth / Height
Types
Classification Tree
Regression Tree
Splitting Criteria
Classification
Gini Impurity
Entropy
Information Gain
Regression
MSE
MAE
Variance Reduction
Stopping Criteria
Max Depth
Min Samples Split
Min Samples Leaf
Pure Node
Pruning
Pre-pruning
Post-pruning
Issues
Overfitting
High Variance
Sensitive to Noise
Bias-Variance Tradeoff
Deep Tree → Low Bias, High Variance
Shallow Tree → High Bias, Low Variance
Ensembles Reduce Variance
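The Gini and entropy criteria above are short formulas over class proportions; a minimal sketch (a pure node scores 0 under both):

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
```

A splitter scores each candidate threshold by the weighted impurity of the two children and keeps the split with the largest decrease (information gain, in the entropy case).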
Ensemble Methods
Bagging
Bootstrap Sampling
Parallel Training
Majority Vote / Averaging
Reduces Variance
Random Forest
Bagging + Feature Randomness
Random Feature Subset per Split
OOB Error
Feature Importance
Extra Trees
Random Thresholds
More Randomness
Lower Variance
Boosting
Core Idea
Sequential Learning
Focus on Errors
Weak Learners
AdaBoost
Reweight Samples
Weighted Voting
Gradient Boosting
Fit Residuals
Gradient Descent in Function Space
Learning Rate
Additive Model
Regularization
Learning Rate
Number of Trees
Max Depth
Subsampling
XGBoost
Regularized Objective
Tree Pruning
Second-order Gradients
Missing Value Handling
LightGBM
Leaf-wise Growth
Histogram Splitting
GOSS Sampling
CatBoost
Native Categorical Handling
Ordered Boosting
Target Leakage Reduction
Interpretability
Feature Importance
Impurity-based
Permutation Importance
SHAP Values
Partial Dependence Plots
Decision Path Visualization
Practical Considerations
No Feature Scaling Needed
Handles Mixed Data Types
Strong for Tabular Data
Poor Extrapolation
Memory Heavy for Large Forests
Complexity
Single tree: ~O(n log n) training
Boosting: sequential, slower to train
Random Forest: trees train in parallel

Optimization for ML

Gradient descent variants, advanced optimizers, learning rate strategies

🔹 Optimization for ML
Gradient Descent Variants
Batch GD
SGD (noisy but fast)
Mini-batch (sweet spot)
Advanced Optimizers
Momentum
RMSProp / AdaGrad
Adam (β1=0.9, β2=0.999)
AdamW, Lion, Sophia (recent variants)
Learning Rate Strategies
Step decay, Exponential, Cosine
Warmup + decay (common in transformers)
One-cycle policy
Convex vs Non-Convex
Convex → global optimum
Non-convex → local minima, saddles, plateaus
Second-Order Methods
Newton, Quasi-Newton (BFGS, L-BFGS)
Limited by scale
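The Adam update with the default β1/β2 above fits in a few lines; a NumPy sketch that minimizes an arbitrary objective given its gradient (names are illustrative):

```python
import numpy as np

def adam_minimize(grad, w, steps=2000, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: exponential moving averages of gradient (m) and squared gradient (v),
    with bias correction for their zero initialization."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)        # bias-corrected first moment
        v_hat = v / (1 - b2 ** t)        # bias-corrected second moment
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w
```

AdamW differs only in applying weight decay directly to w instead of folding it into the gradient.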

Modern Deep Learning

Neural networks, activations, training, architectures, and representation learning

🔹 Modern Deep Learning
Neural Networks Basics
Perceptron → MLP
Universal approximation
Activation Functions
ReLU family (ReLU, Leaky, GELU, Swish)
Avoid vanishing gradients
Training
Backpropagation + Chain rule
Initialization (He, Xavier)
Batch Norm / Layer Norm / Group Norm
Architectures
CNNs (ResNet, EfficientNet, ConvNeXt)
RNNs → LSTMs/GRUs
Transformers (Self-attention, Multi-head, Positional encoding)
Representation Learning
Embeddings (Word2Vec → BERT → modern LLMs)
Contrastive learning (SimCLR, CLIP)

Unsupervised Learning

Clustering, dimensionality reduction, and generative models

🔹 Unsupervised Learning
Clustering
k-Means (Lloyd's, elbow, silhouette)
Hierarchical (agglomerative + dendrogram)
DBSCAN (density-based)
GMM (soft clustering)
Dimensionality Reduction
PCA (linear, variance-max)
t-SNE (perplexity, KL divergence)
UMAP (faster, better topology preservation)
Generative Models
Autoencoders (undercomplete, denoising, VAE)
GANs (minimax, modern variants like StyleGAN, Diffusion)
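Lloyd's algorithm for k-means, as listed above, alternates assignment and mean updates; a compact NumPy sketch (seeded initialization is illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: assign each point to its nearest centroid, recompute means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distances from every point to every center: shape (n, k).
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels
```

Each step can only decrease the within-cluster sum of squares, so the loop converges, though possibly to a local optimum, which is why practical implementations restart from several initializations.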

Probabilistic & Graphical Models

Mixture models, EM algorithm, Markov models, and Bayesian networks

🔹 Probabilistic & Graphical Models
Mixture Models & EM
Gaussian Mixture Models
EM Algorithm (E-step: responsibilities, M-step: MLE)
Markov Models
HMMs (Forward-Backward, Viterbi)
Markov Random Fields
Bayesian Networks
Structure learning
Inference (exact vs approximate)
Modern Connections
Probabilistic programming (Pyro, NumPyro)
Diffusion models as hierarchical latents
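The E-step/M-step loop above can be made concrete with a two-component 1D GMM; a sketch with a deterministic min/max initialization (chosen here for reproducibility, not standard practice):

```python
import numpy as np

def em_gmm_1d(x, n_iter=200):
    """EM for a two-component 1D Gaussian mixture."""
    mu = np.array([x.min(), x.max()])        # illustrative initialization
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
               / (sigma * np.sqrt(2 * np.pi))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted MLE of pi, mu, sigma.
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma
```

Each iteration is guaranteed not to decrease the data log-likelihood, which is the core property that makes EM a reliable workhorse for latent-variable models.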

Modern Topics / Extensions

Self-supervised learning, meta-learning, federated learning, RL, and continual learning

🔹 Modern Topics / Extensions
Self-Supervised Learning
Contrastive (SimCLR, MoCo)
Masked modeling (BERT, MAE)
BYOL, SimSiam, DINO
Meta-Learning
Few-shot: MAML, Reptile, ProtoNets
Optimization-based vs metric-based
Federated Learning
FedAvg, FedProx
Privacy (differential privacy, secure aggregation)
Reinforcement Learning
MDPs, Q-Learning, Policy Gradients
Modern: PPO, SAC, Dreamer, AlphaZero-style
Continual / Lifelong Learning
Catastrophic forgetting
Replay buffers, EWC, GEM

Interpretability & Fairness

Interpretability methods, SHAP, fairness definitions, and responsible AI

🔹 Interpretability & Fairness
Interpretability Toolbox
Intrinsic: Decision trees, linear models
Post-hoc: Feature importance, Partial Dependence Plots
Model-Agnostic Methods
LIME (local surrogate)
SHAP (Shapley values, KernelSHAP, TreeSHAP)
Fairness
Definitions
Demographic Parity
Equalized Odds
Equal Opportunity
Mitigation
Pre-processing, In-processing, Post-processing
Responsible AI
Bias detection, Adversarial debiasing, Explainable AI regulations
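Demographic parity, from the definitions above, compares positive-prediction rates across groups; a minimal sketch for two groups (function name is illustrative):

```python
def demographic_parity_gap(y_pred, group):
    """Absolute gap in positive-prediction rate between two groups.
    Demographic parity holds when the gap is (near) zero."""
    rates = {}
    for g in set(group):
        preds = [p for p, gi in zip(y_pred, group) if gi == g]
        rates[g] = sum(preds) / len(preds)
    a, b = rates.values()
    return abs(a - b)
```

Equalized odds applies the same comparison but conditioned on the true label, i.e. it compares TPR and FPR across groups rather than raw prediction rates.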

Scaling & Production ML

Large-scale training, hyperparameter tuning, MLOps, and AutoML

🔹 Scaling & Production ML
Large-Scale Training
Data parallelism, Model parallelism, Pipeline
ZeRO, FSDP, DeepSpeed, Megatron
Hyperparameter Tuning
Grid / Random search
Bayesian optimization (Optuna, Hyperopt)
Neural Architecture Search (DARTS, NAS)
MLOps / Production
Experiment tracking (MLflow, Weights & Biases)
Model serving (TorchServe, TF Serving, vLLM)
Monitoring (data drift, concept drift, performance)
Feature stores (Feast, Tecton)
AutoML
Full pipelines: Auto-sklearn, H2O, Google AutoML
Modern: LLM-powered (e.g., AutoGPT-style agents)

Project & Research Skills

Problem formulation, experiment design, model selection, and research mindset

🔹 Project & Research Skills
Problem Formulation
Define task, success metric, baseline
Literature review (arXiv, PapersWithCode)
Experiment Design
Ablation studies
Statistical significance (t-tests, bootstrap)
Reproducibility (seeds, Docker, Hydra)
Model Selection & Deployment
Tradeoffs: accuracy vs latency vs cost
A/B testing, Canary releases
Research Mindset
Reproducibility crisis awareness
Ethics & societal impact
Open-source contribution
Writing papers, blogging, presenting

Computer Vision

Classical image processing, deep learning architectures, detection/segmentation, generative models, 3D vision, and multimodal learning

🔹 Computer Vision
Classical Image Processing & Frequency
Spatial Filters vs. Frequency Domain
Spatial Domain
Convolution using kernels directly on pixels
Gaussian: smoothing/blurring
Sobel: edge detection
Frequency Domain (2D Fourier Transform)
Converts image: spatial → spatial frequency
High frequencies: edges/noise
Low frequencies: smooth areas
Convolution Theorem
Convolution in spatial domain
= Element-wise multiplication in frequency domain
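The convolution theorem can be checked numerically; a 1D sketch (the direct analogue of the 2D image case), comparing circular convolution against an element-wise product of FFTs:

```python
import numpy as np

def circular_conv(a, b):
    """Direct circular convolution: result[i] = sum_j a[j] * b[(i - j) mod n]."""
    n = len(a)
    return np.array([sum(a[j] * b[(i - j) % n] for j in range(n))
                     for i in range(n)])

def fft_conv(a, b):
    """Same result via the convolution theorem: multiply in the frequency domain."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))
```

The FFT route is O(n log n) versus O(n²) direct, which is why frequency-domain filtering wins for large kernels.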
Image Restoration
Inverse Filter
Divides out known blur in the frequency domain
Catastrophically amplifies noise
Wiener Filter
Optimal tradeoff
Minimizes MSE between estimated & true image
Accounts for degradation function & noise power spectra
Feature Matching (Classic)
Detectors (e.g., Harris Corner)
Find interest points
Invariant to rotation/translation
Descriptors (e.g., SIFT)
Describe patch around keypoint
Invariant to scale & illumination
Bag of Visual Words (BoW)
Cluster descriptors via k-means
Create 'visual vocabulary'
Represent image as histogram of 'words'
Core Deep Learning Architectures
Convolutional Neural Networks (CNNs)
Key Properties
Weight sharing
Local connectivity
Translation invariance/equivariance
Receptive Field
Region of the input image influencing a given unit
Grows with depth through stacked layers
Vision Transformers (ViTs)
Divide images into non-overlapping patches
Flatten patches
Apply linear projections + positional embeddings
Self-Attention Equation
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Tradeoff
Lacks CNN inductive biases (translation invariance)
Requires larger datasets to train from scratch
Scales better to massive data
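The self-attention equation above is short enough to implement directly; a NumPy sketch for a single head, with the usual row-wise max-subtraction for numerical stability:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Each output row is a convex combination of the rows of V; the √d_k scaling keeps the logits from growing with dimension, which would saturate the softmax.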
Explainability
Grad-CAM
Uses gradients of target concept
Flows into final convolutional layer
Produces coarse localization map
Highlights important regions for prediction
Key Vision Tasks: Detection, Segmentation, Video
Object Detection Architectures
Two-Stage (Faster R-CNN)
Stage 1: Generate region proposals (RPN)
Stage 2: Classify & refine bounding boxes
High accuracy, slower inference
Single-Stage (YOLO, SSD)
Frames detection as single regression problem
Over dense spatial grid
Faster, better for real-time
Historically struggled with tiny objects
Image Segmentation Paradigms
Semantic Segmentation
Classifies every pixel into category
e.g., 'car', 'road'
Does not distinguish between different cars
Instance Segmentation
Identifies & delineates each distinct object
e.g., 'Car 1', 'Car 2'
Panoptic Segmentation
Unifies semantic & instance
Segments distinct objects ('things')
Segments amorphous background ('stuff')
Tracking & Video
Optical Flow (Lucas-Kanade)
Estimates pixel motion between frames
Based on brightness constancy assumption
Pixel intensities don't change between frames
Spatial smoothness constraint
Generative Models
Variational Autoencoders (VAEs)
Learn continuous latent space
Optimized by maximizing ELBO
Evidence Lower Bound
Balances reconstruction loss
KL-divergence term forces latent to match prior (Gaussian)
Generative Adversarial Networks (GANs)
Minimax game
Generator tries to fool Discriminator
Known for sharp images
Challenges
Training instability
Mode collapse (limited variety of outputs)
Diffusion Models
Forward Process
Gradually adds Gaussian noise
Over T steps
Reverse Process
Neural network (often U-Net) learns to denoise
Step-by-step recovery of data
Advantages
Beats GANs in diversity
Better stability
3D Vision & Geometry
Camera Models & Coordinates
Homogeneous Coordinates
2D points: [x, y, 1]
3D points: [X, Y, Z, 1]
Allows translation & perspective projection
Represented as matrix multiplications
Epipolar Geometry
Relates two views of same 3D scene
Fundamental Matrix (F)
Algebraic representation of epipolar geometry
If x and x' are corresponding points
They satisfy: x'^T F x = 0
Triangulation
Using camera projection matrices
Matching 2D points across multiple views
Calculate 3D depth of point
Bird's Eye View (BEV) Transformation
Project 2D camera features → 3D/BEV grid
Depth estimation or transformer cross-attention
BEVFormer: transformer-based approach
Foundational for multi-camera systems
Critical for trajectory planning
Sensor Fusion
Sensor Types
Cameras: dense, rich semantics, no depth
LiDAR: sparse, accurate depth, no color
Radar: velocity tracking, weather resistant
Fusion Strategies
Early Fusion: raw data (e.g., LiDAR → images)
Mid Fusion: feature maps from neural networks
Late Fusion: final bounding boxes/predictions
Visual Odometry (VO) & SLAM
Estimate ego-motion from sequential images
Map the environment simultaneously
Key Concepts
Feature extraction/matching (SIFT, ORB)
Epipolar Geometry
Essential vs. Fundamental matrices
Bundle Adjustment
Optical Flow & Scene Flow
Pixel-level or point-level motion estimation
Between consecutive frames
Evaluation: End-Point Error (EPE)
Modern Representation & Multimodal Learning
Self-Supervised Representation Learning
Contrastive Learning (SimCLR)
Pulls augmented views of same image together
Pushes different images apart
In latent space
Masked Image Modeling (MAE)
Masks high percentage of image
Trains autoencoder to reconstruct missing patches
Vision & Language
Image Captioning
Encoder-Decoder architecture
CNN/ViT encodes image
RNN/Transformer generates text autoregressively
Large Vision Models (LVMs) & VQA
Fusing visual tokens with LLMs
Via cross-attention or linear projection layers
Answer questions about visual content
Vision-Language Models (VLMs)
Contrastive Learning (e.g., CLIP)
Maximize cosine similarity for matching pairs
Minimize similarity for incorrect pairs
Within-batch negative sampling
Generative VLMs (e.g., LLaVA, BLIP)
Frozen image encoder (e.g., CLIP ViT)
Connected to LLM decoder via projection layer
Visual reasoning & question answering
Implementation & Optimization
Distributed Training
DDP (Distributed Data Parallel)
Replicate model on every GPU
Split batch across GPUs
Best for models fitting on single GPU
FSDP (Fully Sharded Data Parallel)
Shard parameters, gradients, optimizer states
Across multiple GPUs
Crucial for massive models (VLMs)
Deployment & Edge Optimization
Quantization
FP32 → INT8 or FP16
Reduce memory footprint & latency
Post-Training Quantization (PTQ)
Quantization-Aware Training (QAT)
TensorRT / ONNX
Export PyTorch → optimized execution graph
Low-latency inference on target hardware
FlashAttention
Hardware-aware optimization
Speeds up transformer attention layers
Reduces GPU HBM memory reads/writes
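The FP32 → INT8 mapping above is, in the simplest symmetric post-training scheme, one scale factor per tensor; a toy sketch (assumes a nonzero input tensor):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric PTQ: map the float range [-max|x|, max|x|] onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale
```

The round-trip error is bounded by half a quantization step (scale/2); QAT improves on PTQ by simulating exactly this rounding during training so the network learns to compensate.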
Loss Functions & Evaluation Metrics
Detection Metrics
Intersection over Union (IoU)
IoU = Area of Overlap / Area of Union
mean Average Precision (mAP)
Area under Precision-Recall curve
Averaged across all classes
Various IoU thresholds
NuScenes Detection Score (NDS)
Composite metric combining mAP
Errors in: translation, scale, orientation
Velocity & attributes
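IoU for axis-aligned boxes, per the formula above, is a few lines; a sketch using (x1, y1, x2, y2) corner format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # overlap width (0 if disjoint)
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # overlap height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0
```

mAP then sweeps a confidence threshold, matches predictions to ground truth at a chosen IoU cutoff (e.g. 0.5), and averages the resulting precision-recall areas over classes.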
Loss Functions
Focal Loss
Addresses severe class imbalance
Down-weights well-classified examples
FL(p_t) = -α_t(1 - p_t)^γ log(p_t)
InfoNCE / Contrastive Loss
Self-supervised learning & VLMs
Pull positive pairs together
Push negative pairs apart
In latent space
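The focal loss formula above can be sketched for the binary case; note how the (1 − p_t)^γ factor shrinks the contribution of well-classified examples:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), binary case.
    p is the predicted probability of class 1; y is the true label in {0, 1}."""
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)
```

With γ = 0 and α = 1 it reduces to plain cross-entropy, which is a handy sanity check when implementing it.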
Data Engines & MLOps
Active Learning
Intelligently sample informative unlabelled data
Send for human annotation
Sampling Strategies
Highest model uncertainty
Highest entropy
Greatest ensemble disagreement
Handling the Long Tail
Oversampling minority classes
Synthetic Data Generation
Via simulation
Via diffusion models
Decoupled Training
Freeze backbone representation
Retrain classifier head only