Comparisons

AI & ML approaches side by side — what each does, the real numbers, and when to use which.

Transformers vs Mamba — All-pairs attention versus a selective state-space recurrence: exact recall at quadratic cost against linear-time throughput.
FlashAttention vs PagedAttention — Two attention optimizations that solve different problems — and are used together, not instead of each other.
Dense vs Mixture-of-Experts — Activate every parameter for every token, or route each token to a few of many experts.
ReAct vs Toolformer vs ToolRL — Three eras of teaching a model to use a tool: prompt the loop, filter self-generated tool calls by the model's own loss, or reward the policy.
PPO vs DPO vs GRPO — Three ways to turn preferences into a better policy — a full RL loop, a single classification loss, or group-relative RL without a critic.
MHA vs GQA vs MLA — Three points on the attention-memory curve — how much of the KV cache you keep decides how long a context you can afford to serve.
GAN vs VAE vs Diffusion — Three ways to learn a distribution and sample from it — an adversarial game, a probabilistic autoencoder, and an iterative denoiser.
FlashAttention vs FlashAttention-3 — The same exact-attention algorithm, rebuilt for a new GPU generation: IO-aware tiling, then Hopper-era asynchrony and FP8.
Speculative Decoding vs Medusa vs EAGLE — Three ways to draft tokens for a target model to verify in parallel — a separate draft model, self-drafting heads, or feature-level autoregression.
Scaling Laws vs Chinchilla — Two readings of the same power laws — one prescribed bigger models, one showed compute-optimal training needs far more data per parameter.
BERT vs GPT vs T5 — Three ways to pretrain the same transformer — read both directions, predict the next token, or cast every task as text-to-text.
AWQ vs GPTQ vs BitNet — Three ways to shrink an LLM: scale the salient weights, compensate for rounding error with second-order information, or train with ternary weights so the matmul becomes addition.
S4 vs Mamba vs RWKV — The post-Transformer sequence lineage — a structured state space, a selective one, and a linear-attention RNN, all chasing linear cost without losing quality.
CoT vs Self-Consistency vs Tree-of-Thoughts — One chain, many chains, or a searched tree of chains — three rungs of a reasoning ladder, none of which touch the weights.
DDPM vs Flow Matching vs Consistency Models — One family, three answers to the same question — how should a model walk from noise to data?