Reasoning & agents

Trace reasoning from chain-of-thought prompting through self-consistency, search, verification, retrieval, tool use, and the agentic benchmarks: the path from "answer once" to a model that reasons, checks itself, and acts.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Let's Verify Step by Step
GenPRM: Generative Process Reward Models
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
ReAct: Synergizing Reasoning and Acting in Language Models
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Toolformer: Language Models Can Teach Themselves to Use Tools
Gorilla: Large Language Model Connected with Massive APIs
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
ToolRL: Reward is All Tool Learning Needs
Group-in-Group Policy Optimization for LLM Agent Training
τ-bench: Tool-Agent-User Interaction in Real-World Domains
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Agent Learning via Early Experience
Voyager: An Open-Ended Embodied Agent with Large Language Models
Agent Workflow Memory
Why Language Models Hallucinate
Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
Rewarding Doubt: Calibrated Confidence Expression of LLMs
Proximal Policy Optimization Algorithms
Spurious Rewards: Rethinking Training Signals in RLVR
The Entropy Mechanism of RL for Reasoning Language Models
MiniMax-M1: Scaling Test-Time Compute with Lightning Attention
ProRL: Prolonged RL Expands Reasoning Boundaries
From Hard Refusals to Safe-Completions