Two years ago, a pair of 22-year-old friends who met in high school in Michigan found themselves sitting inside Tsinghua University’s brain lab in Beijing, staring down a multimillion-dollar offer from Elon Musk.

The two had just done something unusual for the moment: they had built a small large-language model (LLM) trained not on massive internet data dumps but on a tiny, carefully chosen set of high-quality conversations. And they taught it to improve itself using reinforcement learning (RL), a technique in which a model learns the way a person or animal does: by making decisions, receiving feedback, and then refining its behavior through rewards and penalties.
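The decide-feedback-refine loop the article describes is the core of reinforcement learning. As a purely illustrative sketch (a toy multi-armed bandit, not the model the pair built), an agent repeatedly picks an action, observes a reward, and nudges its value estimates toward whichever choice pays off more:

```python
import random

def run_bandit(steps=2000, epsilon=0.1, seed=0):
    """Toy RL loop: learn which of two actions yields more reward."""
    rng = random.Random(seed)
    true_reward = {"A": 0.2, "B": 0.8}   # hidden payoff probability of each action
    estimates = {"A": 0.0, "B": 0.0}     # agent's learned value estimates
    counts = {"A": 0, "B": 0}

    for _ in range(steps):
        # Decide: mostly exploit the best-known action, occasionally explore.
        if rng.random() < epsilon:
            action = rng.choice(["A", "B"])
        else:
            action = max(estimates, key=estimates.get)

        # Feedback: reward of 1 with the action's hidden probability, else 0.
        reward = 1.0 if rng.random() < true_reward[action] else 0.0

        # Refine: move the running estimate toward the observed reward.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

    return estimates

est = run_bandit()
print(est)
```

After enough trials, the agent's estimate for "B" ends up well above its estimate for "A", mirroring how reward feedback steers behavior without any labeled answers.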

At the time, almost no one was doing this with language models. The only other group exploring RL for LLMs was DeepSeek, the Chinese Op
