Contents

1 Introduction

2 Post-Training for Enhanced Reasoning and Alignment of LLMs

3 Data Preparation and Generation

4 Reinforcement Learning (RL) and Post-Training for Agent LLMs

5 RL-Based Post-Training (I): The Case of DeepSeek R1 Series Models

6 RL-Based Post-Training (II): The Case of OpenAI o Series Models

7 RL-Scaling Law and Emergence of Reasoning Capabilities of LLMs

8 Discussions and Conclusions

9 References


1. Introduction

The emergence of DeepSeek, a Chinese artificial intelligence (AI) startup, signifies a transformative shift in the AI industry towards open-source development. This shift carries profound implications for the open-source community and the broader AI landscape.

DeepSeek’s flagship model, DeepSeek-R1, is an open-source reasoning model that rivals OpenAI’s o1, despite being trained on a fraction of the computational resources. This challenges the prevailing belief that state-of-the-art AI development requires immense financial and computing power. By open-sourcing its models, DeepSeek democratizes access to cutting-edge AI, empowering developers, researchers, and organizations worldwide to build upon its advancements. This approach fosters innovation, accelerates AI application development, and promotes a collaborative knowledge-sharing ecosystem.

LLMs, usually pre-trained on vast internet text corpora, provide foundational language understanding and reasoning capabilities. While these models are adept at handling general tasks, they often fall short in real-world applications. Many practical applications demand stronger reasoning and alignment capabilities, necessitating LLMs tailored for interactivity, adaptability, goal-oriented performance, and advanced reasoning.

In this article, we explore RL-based post-training, an emerging practice in LLM development that enhances multi-step reasoning and safety alignment for reasoning LLMs such as R1. Post-training, particularly with reinforcement learning (RL)-based techniques, refines reasoning capabilities, tailors responses to user preferences, and improves alignment with human values. Compared to pre-training, these methods achieve greater gains in reasoning efficacy with significantly fewer computational resources. They enable LLMs such as DeepSeek V3/R1 and OpenAI o1/o3 to produce outputs aligned with real-world requirements for reasoning, planning, and action, ultimately making LLM-based agents more intelligent, helpful, and trustworthy.
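To make the idea of an RL-based post-training loop concrete, the following is a minimal, self-contained sketch of its generic structure: sample an output from the current policy, score it with a reward signal, and push the policy toward higher-reward outputs via a REINFORCE-style policy-gradient update. The toy vocabulary, the single-step policy, and the `reward` function here are illustrative stand-ins (assumptions, not any production recipe); real systems use an LLM policy over token sequences and a learned reward model, human preferences, or a rule-based verifier.

```python
# Toy sketch of the generic RL post-training loop (hypothetical names throughout):
# sample an output from the current policy, score it with a reward signal,
# and nudge the policy toward higher-reward outputs (REINFORCE-style update).
import math
import random

VOCAB = ["yes", "no", "maybe"]          # stand-in "token" space
logits = {tok: 0.0 for tok in VOCAB}    # toy single-step policy parameters

def softmax(scores):
    mx = max(scores.values())
    exp = {k: math.exp(v - mx) for k, v in scores.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}

def reward(response):
    # Hypothetical reward: prefer the "aligned" answer. In practice this comes
    # from a learned reward model, human feedback, or a rule-based verifier.
    return 1.0 if response == "yes" else 0.0

LR = 0.5
for step in range(200):
    probs = softmax(logits)
    response = random.choices(VOCAB, weights=[probs[t] for t in VOCAB])[0]
    advantage = reward(response) - sum(probs[t] * reward(t) for t in VOCAB)
    # Policy-gradient step: d/d logit_k log pi(response) = 1[k == response] - pi(k)
    for tok in VOCAB:
        grad_logp = (1.0 if tok == response else 0.0) - probs[tok]
        logits[tok] += LR * advantage * grad_logp

print(softmax(logits))  # probability mass should concentrate on "yes"
```

The expected-reward baseline subtracted from the sampled reward is a standard variance-reduction choice; methods discussed later in this article (e.g., GRPO in the DeepSeek R1 series) replace it with group-relative estimates and add further constraints, but the sample-score-reinforce pattern is the same.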

In particular, we examine the key RL-based post-training technologies behind the DeepSeek R1 reasoning model. Portions of this content were initially published in Sections 7 and 9 of Agents in the Era of Large Language Models: A Systematic Overview (II), with revisions and updates for clarity and completeness.

2. Post-Training for Enhanced Reasoning and Alignment of LLMs

2.1 Pre-Trained LLMs and Reasoning LLMs