GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Quantize a 175B model to 3–4 bits in a few GPU-hours with a one-shot, Hessian-aware solver.
Frantar et al. · ICLR 2023 · Serving
A free, interactive, animated visual explainer of GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. Every exhibit is computed from the real formulas and paired with verbatim quotes from the source.
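To put the 3–4 bit figure in perspective, here is a back-of-the-envelope sketch of weight storage for a 175B-parameter model. The 16-bit baseline and the helper name `weight_footprint_gb` are assumptions made for illustration, not figures from the paper. The estimate also ignores quantization metadata such as per-group scales and zero points, so real checkpoints would be somewhat larger.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes).

    Hypothetical helper: counts only the raw weight bits and ignores
    scales, zero points, activations, and the KV cache.
    """
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 175e9  # the 175B model size quoted in the tagline

# 16-bit is an assumed unquantized baseline; 4 and 3 are the quoted targets.
for bits in (16, 4, 3):
    print(f"{bits:>2}-bit weights: {weight_footprint_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts storage by 4x, and 3-bit weights cut it by a little more than 5x.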
Questions
- What is GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers?
  - A one-shot, Hessian-aware post-training quantization method that can quantize a 175B model to 3–4 bits in a few GPU-hours.
- Who published GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, and where?
  - Frantar et al., ICLR 2023 (arXiv:2210.17323).
- Where can I find a visual explainer of GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers?
  - Right here: a free, interactive, animated walkthrough of the whole paper, with exhibits computed from the real formulas and verbatim quotes from the source.
Related explainers
- PagedAttention (vLLM)
- Efficiently Scaling Transformer Inference
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
- Fast Inference from Transformers via Speculative Decoding
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion