Post-training LLMs
Start with a pretrained model that only predicts text. Finish with an understanding of every lever that turns it into an aligned, reasoning assistant: RLHF, DPO, the GRPO family, and the test-time-compute frontier.
- Training language models to follow instructions with human feedback
- Constitutional AI: Harmlessness from AI Feedback
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- Group Sequence Policy Optimization
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
- s1: Simple test-time scaling
- Tülu 3: Pushing Frontiers in Open Language Model Post-Training