Reasoning & agents
Trace reasoning from chain-of-thought prompting through self-consistency, search, verification, retrieval, and agentic benchmarks: the path from "answer once" to a model that reasons, checks itself, and acts.
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Self-Consistency Improves Chain of Thought Reasoning in Language Models
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models
- Let's Verify Step by Step
- GenPRM: Generative Process Reward Models
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
- ReAct: Synergizing Reasoning and Acting in Language Models
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- Toolformer: Language Models Can Teach Themselves to Use Tools
- ToolRL: Reward is All Tool Learning Needs
- Group-in-Group Policy Optimization for LLM Agent Training
- τ-bench: Tool-Agent-User Interaction in Real-World Domains
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- Agent Learning via Early Experience
- Why Language Models Hallucinate
- Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
- Rewarding Doubt: Calibrated Confidence Expression of LLMs
- Proximal Policy Optimization Algorithms
- Spurious Rewards: Rethinking Training Signals in RLVR
- The Entropy Mechanism of RL for Reasoning Language Models
- MiniMax-M1: Scaling Test-Time Compute with Lightning Attention
- ProRL: Prolonged RL Expands Reasoning Boundaries
- From Hard Refusals to Safe-Completions