Serving LLMs efficiently

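Understand the inference stack: why the KV cache dominates GPU memory during generation, how paging reduces the memory it wastes and quantization shrinks its footprint, how speculative decoding cuts per-token latency, and how disaggregated serving (running prefill and decode on separate workers) improves throughput.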
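
To make the KV cache claim concrete, here is a rough back-of-the-envelope sketch of cache size. The model shape (32 layers, 32 KV heads, head dimension 128), the 4096-token context, and the batch of 16 are illustrative assumptions, not figures from this material, and the function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for storing one key tensor and one
    value tensor per layer. All arguments are assumed, illustrative values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical model shape: 32 layers, 32 KV heads, head_dim 128.
# Workload: 16 concurrent sequences, 4096 tokens each.
fp16 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=16, bytes_per_elem=2)
int8 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=16, bytes_per_elem=1)

print(f"fp16 KV cache: {fp16 / GIB:.1f} GiB")  # 32.0 GiB
print(f"int8 KV cache: {int8 / GIB:.1f} GiB")  # 16.0 GiB
```

Under these assumptions the cache grows linearly with both context length and batch size, which is why it can exceed the memory taken by the weights of a small model, and why halving the bytes per element through quantization halves the footprint.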