Post-training LLMs

Start with a pretrained model that only predicts text. Finish with an understanding of every lever that turns it into an aligned, reasoning assistant: RLHF, DPO, the GRPO family, and the test-time-compute frontier.

Training language models to follow instructions with human feedback
Constitutional AI: Harmlessness from AI Feedback
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Group Sequence Policy Optimization
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
s1: Simple test-time scaling
Tülu 3: Pushing Frontiers in Open Language Model Post-Training