
Andrej Karpathy: We need to let large models "go to school," and reinforcement learning is just beginning

AI expert Andrej Karpathy compared the training process of large language models (LLMs) to educating students in a recent tweet, laying out the current state and future of LLM training. He points out that LLM training can be divided into three stages: pre-training is akin to the background exposition in a textbook, supervised fine-tuning corresponds to worked example problems and their solutions, and reinforcement learning is like the practice problems, which emphasize learning through trial and error.
AI expert Andrej Karpathy just posted a tweet in which he cleverly compares the process of training large language models (LLMs) to educating students, using the structure of a textbook to explain the current state and future directions of LLM training.
This might be the best and most straightforward explanation I've seen regarding pre-training, supervised fine-tuning, and reinforcement learning, so I'm sharing it with everyone.
Karpathy points out that when we open any textbook, we see three main types of information:
- Background information / exposition: This is the core content of the textbook, used to explain various concepts and knowledge. Students build their knowledge system by reading and learning this content, which is akin to the pre-training phase of LLMs. In the pre-training phase, the model learns the rules of language and knowledge of the world by reading vast amounts of internet text, accumulating extensive background knowledge to lay the foundation for subsequent learning.
- Worked problems with solutions: Textbooks provide specific example problems and show in detail how experts solve them. These examples serve as demonstrations that students learn from by imitation. This corresponds to the supervised fine-tuning phase of LLMs. In fine-tuning, the model imitates the "ideal answers" written by human experts, learning to generate high-quality responses that meet human expectations, such as the ideal answers expected of an assistant.
- Practice problems: At the end of each chapter, textbooks usually set a large number of practice problems, which often provide only the final answers without detailed solution steps. Practice problems aim to guide students to learn through trial and error: students need to try various methods to find the correct answer. Karpathy believes this is highly similar to the concept of reinforcement learning.
Karpathy emphasizes that while we have already put LLMs through a lot of "reading" and "example learning," which corresponds to pretraining and supervised fine-tuning, we are still in an emerging and underdeveloped stage regarding the "practice problems" aspect, which is reinforcement learning.
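To make the analogy concrete, here is a minimal, toy-scale sketch of how the three stages differ as training signals. It is an illustration under assumptions of my own (a tiny PyTorch model named ToyLM, random stand-in data, and a made-up check_answer reward function), not Karpathy's code or a production recipe: pre-training scores every next token of raw text, supervised fine-tuning scores only the expert-written response, and reinforcement learning samples an attempt and reinforces it according to an outcome-based reward.

```python
# A minimal, toy-scale sketch (my own illustration, not Karpathy's code):
# the same tiny model is nudged by three different kinds of signal,
# mirroring exposition -> pre-training, worked examples -> supervised
# fine-tuning, and practice problems -> reinforcement learning.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB = 32  # tiny stand-in vocabulary

class ToyLM(nn.Module):
    """A deliberately tiny next-token predictor standing in for an LLM."""
    def __init__(self, vocab=VOCAB, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                  # tokens: (batch, seq)
        return self.head(self.embed(tokens))    # logits: (batch, seq, vocab)

model = ToyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(loss):
    opt.zero_grad()
    loss.backward()
    opt.step()

# --- Stage 1: pre-training ("background exposition") ----------------------
# Next-token prediction on raw text: every position provides a signal.
corpus = torch.randint(0, VOCAB, (8, 20))        # stand-in for internet text
logits = model(corpus[:, :-1])
train_step(F.cross_entropy(logits.reshape(-1, VOCAB), corpus[:, 1:].reshape(-1)))

# --- Stage 2: supervised fine-tuning ("worked examples") ------------------
# Same loss, but scored only on the expert-written response tokens.
prompt = torch.randint(0, VOCAB, (1, 6))
response = torch.randint(0, VOCAB, (1, 5))       # the human "ideal answer"
seq = torch.cat([prompt, response], dim=1)
logits = model(seq[:, :-1])
targets = seq[:, 1:].clone()
targets[:, : prompt.size(1) - 1] = -100          # mask out prompt positions
train_step(F.cross_entropy(logits.reshape(-1, VOCAB),
                           targets.reshape(-1), ignore_index=-100))

# --- Stage 3: reinforcement learning ("practice problems") ----------------
# No worked solution is given: the model proposes an answer and is rewarded
# only when a checker judges the outcome correct (REINFORCE-style update).
def check_answer(answer):
    # Hypothetical outcome checker; a real one would compare against the
    # known final answer or run a verifier.
    return 1.0 if int(answer) % 2 == 0 else 0.0

logits = model(prompt)[:, -1, :]                 # propose a single answer token
dist = torch.distributions.Categorical(logits=logits)
sample = dist.sample()
reward = check_answer(sample.item())
train_step(-dist.log_prob(sample).mean() * reward)
```

The point of the sketch is only the contrast between the three loss signals; real systems use far larger models, curated data, and more sophisticated RL algorithms than this REINFORCE-style update.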
He believes that when we create datasets for LLMs, it is essentially no different from writing textbooks for them. To truly enable LLMs to "learn," we need to provide the same three types of data, just as a textbook does (made-up examples of each are sketched after this list):
- A wealth of background knowledge: corresponds to pre-training, allowing the model to accumulate extensive knowledge.
- Demonstrative example problems: corresponds to supervised fine-tuning, enabling the model to learn to produce high-quality outputs.
- A large number of practice problems: corresponds to reinforcement learning, allowing the model to learn through practice, continuously improving through trial and error and feedback.
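In dataset terms, the three ingredients look quite different from one another. The records below are made-up illustrations of the shape each kind of example might take; the field names and contents are assumptions chosen for the analogy, not a real dataset schema.

```python
# Illustrative, made-up records for the three kinds of "textbook" data.

# 1. Background exposition -> pre-training: raw text, no labels at all.
pretraining_doc = {
    "text": "Photosynthesis converts light energy into chemical energy "
            "stored in glucose, releasing oxygen as a by-product."
}

# 2. Worked example -> supervised fine-tuning: a prompt plus an
#    expert-written ideal response to imitate.
sft_example = {
    "prompt": "Explain photosynthesis in one sentence.",
    "ideal_response": "Photosynthesis is the process by which plants use "
                      "sunlight, water, and carbon dioxide to make glucose "
                      "and oxygen.",
}

# 3. Practice problem -> reinforcement learning: a problem and only its
#    final answer; the model must find its own solution path and is graded
#    on the outcome.
rl_problem = {
    "problem": "Each glucose molecule made by photosynthesis is accompanied "
               "by 6 molecules of O2. How many O2 molecules accompany 3 "
               "glucose molecules?",
    "final_answer": "18",   # no worked solution is provided
}
```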
Final Thoughts
Karpathy concluded that we have already put LLMs through extensive "reading" and "studying worked examples," but what matters more is guiding them through a large amount of "practice." LLMs need to read, but even more they need to practice. Only through extensive practice can their capabilities truly improve, enabling them to better understand the world and solve problems.