Transcript for State of AI in 2026: LLMs, Coding, Scaling Laws, China, Agents, GPUs, AGI

This is a transcript of Lex Fridman Podcast #490 with Nathan Lambert & Sebastian Raschka. The timestamps in the transcript are clickable links that take you directly to that point in the main video. Please note that the transcript is human generated, and may have errors. Here are some useful links:

Go back to this episode’s main page
Watch the full YouTube version of the podcast

Here are the loose “chapters” in the conversation. Click link to jump approximately to that part in the transcript:

0:00 – Introduction
1:57 – China vs US: Who wins the AI race?
10:38 – ChatGPT vs Claude vs Gemini vs Grok: Who is winning?
21:38 – Best AI for coding
28:29 – Open Source vs Closed Source LLMs
40:08 – Transformers: Evolution of LLMs since 2019
48:05 – AI Scaling Laws: Are they dead or still holding?
1:04:12 – How AI is trained: Pre-training, Mid-training, and Post-training
1:37:18 – Post-training explained: Exciting new research directions in LLMs
1:58:11 – Advice for beginners on how to get into AI development & research
2:21:03 – Work culture in AI (72+ hour weeks)
2:24:49 – Silicon Valley bubble
2:28:46 – Text diffusion models and other new research directions
2:34:28 – Tool use
2:38:44 – Continual learning
2:44:06 – Long context
2:50:21 – Robotics
2:59:31 – Timeline to AGI
3:06:47 – Will AI replace programmers?
3:25:18 – Is the dream of AGI dying?
3:32:07 – How AI will make money?
3:36:29 – Big acquisitions in 2026
3:41:01 – Future of OpenAI, Anthropic, Google DeepMind, xAI, Meta
3:53:35 – Manhattan Project for AI
4:00:10 – Future of NVIDIA, GPUs, and AI compute clusters
4:08:15 – Future of human civilization

Introduction

Lex Fridman (00:00:00) The following is a conversation all about the state of the art in artificial intelligence, including some of the exciting technical breakthroughs and developments in AI that happened over the past year, and some of the interesting things we think might happen this upcoming year. At times, it does get super technical, but we do try to make sure that it remains accessible to folks outside the field without ever dumbing it down. It is a great honor and pleasure to be able to do this kind of episode with two of my favorite people in the AI community, Sebastian Raschka and Nathan Lambert. They are both widely respected machine learning researchers and engineers who also happen to be great communicators, educators, writers, and X posters.

Lex Fridman (00:00:51) Sebastian is the author of two books I highly recommend for beginners and experts alike. First is Build a Large Language Model from Scratch, and Build a Reasoning Model from Scratch. I truly believe in the machine learning and computer science world, the best way to learn and understand something is to build it yourself from scratch. Nathan is the post-training lead at the Allen Institute for AI, and author of the definitive book on reinforcement learning from human feedback. Both of them have great X accounts, great Substacks. Sebastian has courses on YouTube, Nathan has a podcast. And everyone should absolutely follow all of those. This is the Lex Fridman podcast.

Lex Fridman (00:01:40) To support it, please check out our sponsors in the description, where you can also find links to contact me, ask questions, get feedback, and so on. And now, dear friends, here’s Sebastian Raschka and Nathan Lambert.

China vs US: Who wins the AI race?

Lex Fridman (00:01:57) So I think one useful lens to look at all this through is the so-called DeepSeek moment. This happened about a year ago in January 2025, when the open weight Chinese company DeepSeek released DeepSeek R1 that I think it’s fair to say surprised everyone with near or at state-of-the-art performance, with allegedly much less compute for much cheaper. And from then to today, the AI competition has gotten insane, both on the research level and the product level. It’s just been accelerating.

Lex Fridman (00:02:32) Let’s discuss all of this today, and maybe let’s start with some spicy questions if we can. Who’s winning at the international level? Would you say it’s the set of companies in China or the set of companies in the United States? Sebastian, Nathan, it’s good to see you guys. So Sebastian, who do you think is winning?

Sebastian Raschka (00:02:53) So winning is a very broad term. I would say you mentioned the DeepSeek moment, and I do think DeepSeek is definitely winning the hearts of the people who work on open weight models because they share these as open models. Winning, I think, has multiple timescales to it. We have today, we have next year, we have in ten years. One thing I know for sure is that I don’t think nowadays, in 2026, that there will be any company having access to a technology that no other company has access to. And that is mainly because researchers are frequently changing jobs, changing labs. They rotate. So I don’t think there will be a clear winner in terms of technology access.

Sebastian Raschka (00:03:37) However, I do think the differentiating factor will be budget and hardware constraints. I don’t think the ideas will be proprietary, but rather the resources that are needed to implement them. And so I don’t currently see a winner-takes-all scenario. I can’t see that at the moment.

Lex Fridman (00:03:59) Nathan, what do you think?

Nathan Lambert (00:04:00) You see the labs put different energy into what they’re trying to do. To demarcate the point in time when we’re recording this, the hype over Anthropic’s Claude Opus 4.5 model has been absolutely insane. I’ve used it and built stuff in the last few weeks, and it’s almost gotten to the point where it feels like a bit of a meme in terms of the hype. It’s kind of funny because this is very organic, and then if we go back a few months ago, Gemini 3 from Google got released, and it seemed like the marketing and wow factor of that release was super high. But then at the end of November, Claude Opus 4.5 was released and the hype has been growing, while Gemini 3 was before this.

Nathan Lambert (00:04:44) And it kind of feels like people don’t really talk about it as much, even though when it came out, everybody was like, this is Gemini’s moment to retake Google’s structural advantages in AI. Gemini 3 is a fantastic model, and I still use it. It’s just that differentiation is lower. I agree with what you’re saying, Sebastian, that the idea space is very fluid, but culturally Anthropic is known for betting very hard on code, and this Claude Code thing is working out for them right now. So I think that even if the ideas flow pretty freely, so much of this is bottlenecked by human effort and the culture of organizations, where Anthropic seems to at least be presenting as the least chaotic.

Nathan Lambert (00:05:23) It’s a bit of an advantage if they can keep doing that for a while. But on the other side of things, there’s a lot of ominous technology from China where there are way more labs than DeepSeek. DeepSeek kicked off a movement within China similar to how ChatGPT kicked off a movement in the US where everything had a chatbot. There are now tons of tech companies in China that are releasing very strong frontier open weight models, to the point where I would say that DeepSeek is kind of losing its crown as the preeminent open model maker in China, and the likes of Z.ai with their GLM models, MiniMax’s models, and Kimi K2 Thinking from Moonshot, especially in the last few months, have shone more brightly.

Nathan Lambert (00:06:04) The new DeepSeek models are still very strong, but that could be looked back on as a big narrative point where in 2025 DeepSeek came and provided this platform for way more Chinese companies that are releasing these fantastic models to have this new type of operation. These models from these Chinese companies are open weight, and depending on this trajectory, the business models that these American companies are doing could be at risk. But currently, a lot of people are paying for AI software in the US, and historically in China and other parts of the world, people don’t pay a lot for software.

Lex Fridman (00:06:37) So some of these models like DeepSeek have the love of the people because they are open weight. How long do you think the Chinese companies keep releasing open weight models?

Nathan Lambert (00:06:47) I would say for a few years. I think that, like in the US, there’s not a clear business model for it. I have been writing about open models for a while, and these Chinese companies have realized it. I get inbound from some of them. They’re smart and realize the same constraints, which is that a lot of top US tech companies and other IT companies won’t pay for an API subscription to Chinese companies for security concerns. This has been a long-standing habit in tech, and the people at these companies then see open weight models as an ability to influence and take part in a huge growing AI expenditure market in the US. They’re very realistic about this, and it’s working for them.

Nathan Lambert (00:07:24) And I think the government will see that that is building a lot of influence internationally in terms of uptake of the technology, so there’s going to be a lot of incentives to keep it going. But building these models and doing the research is very expensive, so at some point, I expect consolidation. But I don’t expect that to be a story of 2026; there will be more open model builders throughout 2026 than there were in 2025. And a lot of the notable ones will be in China.

Lex Fridman (00:07:50) You were going to say something?

Sebastian Raschka (00:07:51) Yes. You mentioned DeepSeek losing its crown. I do think to some extent, yes, but we also have to consider that they are still slightly ahead. It’s not that DeepSeek got worse, it’s just like the other ones are using the ideas from DeepSeek. For example, you mentioned Kimi, same architecture, they’re training it. And then again, we have this leapfrogging where they might be at some point in time a bit better because they have the more recent model. I think this comes back to the fact that there won’t be a clear winner. One person releases something, the other one comes in, and the most recent model is probably always the best model.

Nathan Lambert (00:08:30) Yeah. We’ll also see the Chinese companies have different incentives. DeepSeek is very secretive, whereas some of these startups are like the MiniMaxes and Z.ais of the world. Those two literally have filed IPO paperwork, and they’re trying to get Western mindshare and do a lot of outreach there. So I don’t know if these incentives will change the model development, because DeepSeek famously is built by a hedge fund, Highflyer Capital, and we don’t know exactly what they use the models for or if they care about this.

Lex Fridman (00:08:59) They’re secretive in terms of communication, but they’re not secretive in terms of the technical reports that describe how their models work. They’re still open on that front. And we should also say on the Claude Opus 4.5 hype, there’s the layer of something being the darling of the X echo chamber, the Twitter echo chamber, and the actual amount of people that are using the model. I think it’s probably fair to say that ChatGPT and Gemini are focused on the broad user base that just wants to solve problems in their daily lives, and that user base is gigantic. So the hype about the coding may not be representative of the actual use.

Sebastian Raschka (00:09:38) I would say also a lot of the usage patterns are name recognition and brand, but also almost muscle memory, where ChatGPT has been around for a long time. People just got used to using it, and it’s almost like a flywheel where they recommend it to other users. One interesting point is also the customization of LLMs. For example, ChatGPT has a memory feature. So you may have a subscription and you use it for personal stuff, but I don’t know if you want to use that same thing at work because there is a boundary between private and work. If you’re working at a company, they might not allow that or you may not want that.

Sebastian Raschka (00:10:16) And I think that’s also an interesting point where you might have multiple subscriptions. One is just clean code; it has nothing of your personal images or hobby projects in there. It’s just for work. And then the other one is your personal thing. I think the future involves multiple models for different use cases. It doesn’t mean you only have to have one.

ChatGPT vs Claude vs Gemini vs Grok: Who is winning?

Lex Fridman (00:10:38) What model do you think won 2025, and what model do you think is going to win ’26?

Nathan Lambert (00:10:43) I think in the context of consumer chatbots, the question is: are you willing to bet on Gemini over ChatGPT? Which I would say in my gut feels like a bit of a risky bet because OpenAI has been the incumbent and there are so many benefits to that in tech. I think the momentum in 2025 was on Gemini’s side, but they were starting from such a low point. RIP Bard and those earlier attempts. I think huge credit to them for powering through the organizational chaos to make that happen. But also it’s hard to bet against OpenAI because they always come off as so chaotic, but they’re very good at landing things.

Nathan Lambert (00:11:26) Personally, I have very mixed reviews of GPT-5, but it must have saved them so much money with the high-line feature being a router where most users are no longer charging their GPU costs as much. So I think it’s very hard to dissociate the things that I like out of models versus the things that are actually going to be a general public differentiator.

Lex Fridman (00:11:50) What do you think about 2026? Who’s going to win?

Nathan Lambert (00:11:52) I’ll say something, even though it’s risky. I think Gemini will continue to make progress on ChatGPT. Google has the scale when both of these are operating at such extreme scales, and Google has the ability to separate research and product a bit better, whereas you hear so much about OpenAI being chaotic operationally and chasing the high-impact thing, which is a very startup culture. Then on the software and enterprise side, I think Anthropic will have continued success as they’ve again and again been set up for that. Obviously Google Cloud has a lot of offerings, but I think this Gemini name brand is important for them to build.

Nathan Lambert (00:12:28) Google Cloud will continue to do well, but that’s a more complex thing to explain in the ecosystem because that’s competing with the likes of Azure and AWS rather than on the model provider side.

Lex Fridman (00:12:40) So in infrastructure, you think TPUs give them an advantage?

Nathan Lambert (00:12:45) Largely because the margin on NVIDIA chips is insane and Google can develop everything from top to bottom to fit their stack and not have to pay this margin, and they’ve had a head start in building data centers. So all of these things that have both high lead times and very hard margins on high costs, Google has a kind of historical advantage there. And if there’s going to be a new paradigm, it’s most likely to come from OpenAI. Their research division again and again has shown this ability to land a new research idea or a product. Like Deep Research, Sora, o1 thinking models—all these definitional things have come from OpenAI, and that’s got to be one of their top traits as an organization.

Nathan Lambert (00:13:28) So it’s kind of hard to bet against that, but I think a lot of this year will be about scale and optimizing what could be described as low-hanging fruit in models.

Lex Fridman (00:13:37) And clearly there’s a trade-off between intelligence and speed. This is what GPT-5 was trying to solve behind the scenes. It’s like, do people actually want intelligence, the broad public, or do they want speed?

Sebastian Raschka (00:13:52) I think it’s a nice variety actually, or the option to have a toggle there. For my personal usage, most of the time when I look something up, I use ChatGPT to ask a quick question and get the information I wanted fast. For most daily tasks, I use the quick model. Nowadays, I think the auto mode is pretty good where you don’t have to specifically say “thinking” or “non-thinking.” Then again, I also sometimes want the pro mode. Very often, when I have something written, I put it into ChatGPT and say, “Hey, do a very thorough check. Are all my references correct? Are all my thoughts correct? Did I make any formatting mistakes? Are the figure numbers wrong?” or something like that. And I don’t need that right away.

Sebastian Raschka (00:14:33) I can finish my stuff, maybe have dinner, let it run, come back and go through it. This is where I think it’s important to have this option. I would go crazy if for each query I had to wait 30 minutes, or even 10 minutes.

Nathan Lambert (00:14:46) That’s me. I’m sitting over here losing my mind that you use the router and the non-thinking model. I’m like, “How do you live with that?”

Nathan Lambert (00:14:55) That’s like my reaction. I’ve been heavily on ChatGPT for a while. I never touched GPT-5 non-thinking. I find it just… its tone and then its propensity for errors. It just has a higher likelihood of errors. Some of this is from back when OpenAI released o3, which was the first model to do this Deep Research and find many sources and integrate them for you. So I became habituated with that. I will only use GPT-5.2 thinking or pro when I’m finding any sort of information query for work, whether that’s a paper or some code reference. I will regularly have five pro queries going simultaneously, each looking for one specific paper or feedback on an equation.

Sebastian Raschka (00:15:38) I have a fun example where I just needed the answer as fast as possible for this podcast before I was going on the trip. I have a local GPU running at home and I wanted to run a long RL experiment. Usually I unplug things because if you’re not at home, you don’t want to have things plugged in, and I accidentally unplugged the GPU. My wife was already in the car and it was like, “Oh dang.” Basically, I wanted a Bash script as fast as possible that runs my different experiments and the evaluation. I know how to use the Bash terminal, but in that moment I just needed the command in 10 seconds.

Lex Fridman (00:16:18) This is a hilarious situation but yeah, so what did you use?

Sebastian Raschka (00:16:21) So I did the non-thinking fastest model. It gave me the Bash command. I wanted to chain different scripts to each other and route this to a log file with the `tee` command. Off the top of my head, I was just in a hurry; I could have thought about it myself.

Lex Fridman (00:16:37) By the way, I don’t know if there’s a representative case: wife waiting in the car, you have to run, unplug the GPU, you have to generate a Bash script. This sounds like a movie… …Mission Impossible.

Nathan Lambert (00:16:46) I use Gemini for that. I use thinking for all the information stuff and then Gemini for fast things or stuff that I could sometimes Google. It’s good at explaining things and I trust that it has this background of knowledge and it’s simple. And the Gemini app has gotten a lot better.

Nathan Lambert (00:17:01) It’s good for those sorts of things. And then for code and any sort of philosophical discussion, I use Claude Opus 4.5, also always with extended thinking. Extended thinking and inference-time scaling is just a way to make the models marginally smarter. I will always edge on that side when the progress is very high because you don’t know when that’ll unlock a new use case. And then I sometimes use Grok for real-time information or finding something on AI Twitter that I knew I saw and I need to dig up. Although when Grok 4 came out, the Grok 4 Heavy—which was their pro variant—was actually very good and I was pretty impressed with it, and then I just kind of lost track of it with muscle memory from having the ChatGPT app open. So I use many different things.

Lex Fridman (00:17:45) Yeah. I actually do use Grok 4 Heavy for debugging. For hardcore debugging that the other ones can’t solve, I find that it’s the best at. And it’s interesting because you say ChatGPT is the best interface. For me, for that same reason—but this could be just momentum— Gemini is the better interface for me. I think because I fell in love with their needle-in-the-haystack capabilities. If I ever put in something that has a lot of context but I’m looking for very specific information to make sure it tracks all of it, I find Gemini has been the best. So it’s funny with some of these models, if they win your heart over—

Lex Fridman (00:18:28) …for one particular feature on a particular day, for that particular query or prompt, you’re like, “This model’s better.” And so you’ll just stick with it for a bit until it does something really dumb. There’s like a threshold effect. Some smart thing happens and then you fall in love with it, and then it does some dumb thing and you’re like, “You know what? I’m gonna switch and try Claude or ChatGPT.” And all that kind of stuff.

Sebastian Raschka (00:18:51) This is exactly it. You use it until it breaks, until you have a problem, and then you change the LLM. I think it’s the same way we use anything, like our favorite text editor, operating system, or browser. I mean, there are so many browser options: Safari, Firefox, Chrome. They’re relatively similar, but then there are edge cases, maybe extensions you want to use, and then you switch. But I don’t think anyone types the same thing into different browsers and compares them. You only do that when the website doesn’t render or if something breaks. So that’s a good point. You use it until it breaks, and then you explore other options.

Nathan Lambert (00:19:28) On the long context thing, I was also a Gemini user for this, but the GPT-5.2 release blog had crazy long context scores, where a lot of people were like, “Did they just figure out some algorithmic change?” It went from like 30% to like 70% or something in this minor model update. So it’s also very hard to keep track of all of these things, but now I look more favorably at GPT-5.2’s long context. So it’s just kind of like a never-ending battle to actually get to testing this.

Lex Fridman (00:19:57) Well, it’s interesting that none of us talked about the Chinese models from a user perspective. What does that say? Does that mean the Chinese models are not as good, or does that mean we’re just very biased and US-focused?

Sebastian Raschka (00:20:11) I do think that’s currently the discrepancy between the model and the platform. I think the open models are more known for the open weights, not their platform yet.

Nathan Lambert (00:20:21) There are also a lot of companies that are willing to sell you open-model inference at a very low cost. I think, like OpenRouter, it’s easy to look at multi-model things. You can run DeepSeek on Perplexity. I think all of us sitting here are like, “We use OpenAI GPT-5 Pro consistently.” We’re all willing to pay for the marginal—

Nathan Lambert (00:20:39) …intelligence gain. And these models from the US are better in terms of the outputs. I think the question is, will they stay better for this year and for years going forward? But so long as they’re better, I’m going to pay for them. I think there’s also analysis that shows that the way the Chinese models are served—which you could argue is due to export controls or not—is that they use fewer GPUs per replica, which makes them slower and leads to different errors. It’s about speed and intelligence.

Nathan Lambert (00:21:09) If these things are in your favor as a user, I think in the US a lot of users will go for this. I think that is one thing that will spur these Chinese companies to want to compete in other ways, whether it’s free or substantially lower costs, or it’ll breed creativity in terms of offerings, which is good for the ecosystem. But I just think the simple thing is the US models are currently better, and we use them. I tried these other open models, and I’m like, “Fun, but I’m not gonna… I don’t go back to it.”

Best AI for coding

Lex Fridman (00:21:38) We didn’t really mention programming. That’s another use case that a lot of people deeply care about. I use basically half-and-half Cursor and Claude Code, because I find them to be fundamentally different experiences and both useful. You program quite a bit— …so what do you use? What’s the current vibe?

Sebastian Raschka (00:21:59) So, I use the Codeium plugin for VS Code. You know, it’s very convenient. It’s just a plugin, and then it’s a chat interface that has access to your repository. I know that Claude Code is a bit different. It is a bit more agentic. It touches more things; it does the whole project for you. I’m not quite there yet where I’m comfortable with that because maybe I’m a control freak, but I still like to see what’s going on. Codeium is the sweet spot for me right now where it is helping me, but it is not taking over completely.

Lex Fridman (00:22:29) I should mention, one of the reasons I do use Claude Code is to build the skill of programming with English. I mean, the experience is fundamentally different. As opposed to micromanaging the details of the generation and looking at the diff—which you can in Cursor if that’s the IDE you use—you are understanding the code deeply as you progress, versus just thinking in this design space and guiding it at a macro level. I think that is another way of thinking about the programming process. Also, Claude Code just seems to be a better utilization of Claude Opus 4.5.

Nathan Lambert (00:23:18) It’s a good side-by-side for people to do. You can have Claude Code open, you can have Cursor open, you can have VS Code open, and you can select the same models on all of them— …and ask questions, and it’s very interesting. Claude Code is way better in that domain. It’s remarkable.

Lex Fridman (00:23:32) All right, we should say that both of you are legit on multiple fronts: researchers, programmers, educators, and on the book front, too. Nathan, at some point soon, hopefully has an RLHF book coming out.

Nathan Lambert (00:23:50) It’s available for preorder, and there’s a full digital preprint. I’m just making it pretty and better organized for the physical thing, which is a lot of why I do it—it’s fun to create things that you think are excellent in physical form when so much of our life is digital.

Lex Fridman (00:24:05) I should say, going to Perplexity here, Sebastian Raschka is a machine learning researcher and author known for several influential books. A couple that I wanted to mention—and a book I highly recommend—is Build a Large Language Model From Scratch, and the new one, Build a Reasoning Model From Scratch. I’m really excited about that. Building stuff from scratch is one of the most powerful ways of learning.

Sebastian Raschka (00:24:27) Honestly, building an LLM from scratch is a lot of fun and a lot to learn. Like you said, it’s probably the best way to learn how something really works, because you can look at figures, but figures can have mistakes. You can look at conceptual explanations, but you might misunderstand them. But if there is code and the code works, you know it’s correct. There’s no misunderstanding; it’s precise. Otherwise, it wouldn’t work. I think that’s the beauty behind coding. It doesn’t lie. It’s math, basically. Even with math, you can have mistakes in a book you would never notice because you aren’t running the math while reading, so you can’t verify it. And with code, what’s nice is you can verify it.

Lex Fridman (00:25:09) Yeah, I agree with you about the Build a Large Language Model From Scratch book. It’s nice to tune out everything else, the internet and so on, and just focus on the book. But, you know, compared to history books, it’s just less lonely somehow. It’s really more fun. For example, on the programming front, I think it’s genuinely more fun to program with an LLM. And I think it’s genuinely more fun to read with an LLM. But you’re right. This distraction should be minimized. So you use the LLM to basically enrich the experience, maybe add more context. Maybe I just… the rate of ‘aha’ moments for me on a small scale is really high with LLMs.

Sebastian Raschka (00:25:54) 100%. I also want to correct myself: I’m not suggesting not to use LLMs. I suggest doing it in multiple passes. Like, one pass just offline, focus mode, and then after that… I mean, I also take notes, but I try to resist the urge to immediately look things up. I do a second pass. For me, it’s just more structured this way and I get less… I mean, sometimes things are answered in the chapter, but also it just helps to let it sink in and think about it. Other people have different preferences. I would highly recommend using LLMs when reading books. For me, it’s just not the first thing to do; it’s the second pass.

Lex Fridman (00:26:29) By way of recommendation, I do the opposite. I like to use the LLM at the beginning— …to lay out the full context of what is this world that I’m now stepping into. But I try to avoid clicking out of the LLM into the world of Twitter and blogs because then you’re down this rabbit hole. You’re reading somebody’s opinion, there’s a flame war about a particular topic, and all of a sudden you’re now in the realm of the internet and Reddit and so on. But if you’re purely letting the LLM give you the context of why this matters, what are the big picture ideas… sometimes books themselves are good at doing that, but not always.

Nathan Lambert (00:27:12) This is why I like the ChatGPT app, because it gives the AI a home in your computer where you can focus on it, rather than just being another tab in my mess of internet options. And I think Claude Code in particular does a good job of making that a joy, where it seems very engaging as a product design to be an interface that your AI will then go out into the world. There’s something very intangible between it and Codex; it just feels warm and engaging, whereas Codex from OpenAI can often be as good but it just feels a little bit rough around the edges.

Nathan Lambert (00:27:45) Whereas Claude Code makes it fun to build things, particularly from scratch where you trust that it’ll make something. Obviously this is good for websites and refreshing tooling, which I use it for, or data analysis. On my blog, we scrape Hugging Face so we keep the download numbers for every dataset and model over time now. Claude was just like, “Yeah, I’ve made use of that data, no problem.” And I was like, “That would’ve taken me days.” And then I have enough situational awareness to be like, “Okay, these trends obviously make sense,” and you can check things. But that’s just a wonderful interface where you can have an intermediary and not have to do the awful low-level work that you would have to do to maintain different web projects.

Open Source vs Closed Source LLMs

Lex Fridman (00:28:29) All right. So we just talked about a bunch of the closed-weight models. Let’s talk about the open ones. Tell me about the landscape of open LLM models. Which are interesting ones? Which stand out to you and why? We already mentioned DeepSeek.

Nathan Lambert (00:28:44) Do you wanna see how many we can name off the top of our head?

Lex Fridman (00:28:47) Yeah, yeah. Without looking at notes.

Nathan Lambert (00:28:48) DeepSeek, Kimi, MiniMax, Z.ai, Antlang. We’re just going Chinese.

Sebastian Raschka (00:28:57) Let’s throw in Mistral AI, Gemma— …gpt-oss, the open source model by OpenAI. Actually, NVIDIA had a really cool one, Nemotron 3. There’s a lot of stuff, especially at the end of the year. Qwen might be the one—

Nathan Lambert (00:29:12) Oh, yeah. Qwen was the obvious name I was gonna say. I was trying to get through… you can get at least 10 Chinese and at least 10 Western. I mean, OpenAI released their first open model—

Sebastian Raschka (00:29:21) A long time ago.

Nathan Lambert (00:29:22) …since GPT-2. When I was writing about OpenAI’s open model release, people were like, “Don’t forget about GPT-2,” which I thought was really funny because it’s just such a different time. But gpt-oss is actually a very strong model and does some things that the other models don’t do very well. Selfishly, I’ll promote a bunch of Western companies; both in the US and Europe have these fully open models. I work at the Allen Institute for AI where we’ve been building OLMo, which releases data and code and all of this. And now we have actual competition for people that are trying to release everything so that other people can train these models.

Nathan Lambert (00:29:57) So there’s the Institute for Foundation Models/LM360, which has had their K2 models of various types. Apertus is a Swiss research consortium. Hugging Face has SmolLM, which is very popular. And NVIDIA’s Nemotron has started releasing data as well. And then Stanford’s Marine Community Project, which is kind of making it so there’s a pipeline for people to open a GitHub issue and implement a new idea and then have it run in a stable language modeling stack. So this space, that list was way smaller in 2024-

Nathan Lambert (00:30:31) … so I think it was just AI2. So that’s a great thing for more people to get involved and to understand language models, which doesn’t really have a Chinese company that is an analog. While I’m talking, I’ll say that the Chinese open language models tend to be much bigger and that gives them this higher peak performance as MoEs, whereas a lot of these things that we like a lot, whether it was Gemma or Nemotron, have tended to be smaller models from the US, which is starting to change. Mistral Large 3 came out, which was a giant MoE model, very similar to DeepSeek architecture in December. And then a startup, Reka AI, and both Nemotron have… Nemotron and NVIDIA have teased MoE models way bigger than 100 billion parameters-

Nathan Lambert (00:31:16) … in the 400 billion parameter range coming in this Q1 2026 timeline. So I think this kind of balance is set to change this year in terms of what people are using the Chinese versus US open models for, which I’m personally going to be very excited to watch.

Lex Fridman (00:31:32) First of all, huge props for being able to name so many of these. Did you actually name LLaMA?

Nathan Lambert (00:31:38) No.

Lex Fridman (00:31:39) I feel like …

Nathan Lambert (00:31:40) RIP.

Sebastian Raschka (00:31:41) This was not on purpose.

Lex Fridman (00:31:43) RIP LLaMA. All right. Can you mention what are some interesting models that stand out? You mentioned Qwen 3 is obviously a standout.

Sebastian Raschka (00:31:51) So I would say the year’s almost book-ended by DeepSeek-V3 and DeepSeek R1. And then on the other hand, in December, DeepSeek-V3.2. Because what I like about those is they always have an interesting architecture tweak- … that others don’t have. But otherwise, if you want to go with the familiar but really good performance, Qwen 3 and, like Nathan said, also gpt-oss. And I think with gpt-oss, what’s interesting about it is it’s kind of the first open-weight model that was really trained with tool use in mind, which I do think is a bit of a paradigm shift where the ecosystem was not quite ready for it. So with tool use, I mean that the LLM is able to do a web search or call a Python interpreter.

Sebastian Raschka (00:32:33) And I do think it’s a standout because it’s a huge unlock. One of the most common complaints about LLMs is, for example, hallucinations, right? And so, in my opinion, one of the best ways to solve hallucinations is to not try to always remember information or make things up. For math, why not use a calculator app or Python?

Sebastian Raschka (00:32:54) If I ask the LLM, “Who won the soccer World Cup in 1998?” instead of just trying to memorize, it could go do a search. I think mostly it’s usually still a Google search. So ChatGPT and gpt-oss, they would do a tool call to Google, maybe find the FIFA website, and find that it was France. It would get you that information reliably instead of just trying to memorize it. So I think it’s a huge unlock which right now is not fully utilized yet by the open-weight ecosystem. A lot of people don’t use tool call modes because I think it’s a trust thing. You don’t want to run this on your computer where it has access to tools and could wipe your hard drive, so you want to containerize that. But I do think that is a really important step for the upcoming years to have this ability.

Lex Fridman (00:33:44) So a few quick things. First of all, thank you for defining what you mean by tool use. I think that’s a great thing to do in general for the concepts we’re talking about, even things as sort of well-established as MOEs. You have to say that means mixture of experts, and you kind of have to build up an intuition for people about what that means, how it’s actually utilized, what are the different flavors. So what does it mean that there’s just such an explosion of open models? What’s your intuition?

Nathan Lambert (00:34:13) If you’re releasing an open model, you want people to use it, is the first and foremost thing. And then after that comes things like transparency and trust. I think when you look at China, the biggest reason is that they want people around the world to use these models, and I think a lot of people will not. If you look outside of the US, a lot of people will not pay for software, but they might have computing resources where you can put a model on it and run it. I think there can also be data that you don’t want to send to the cloud. So the number one thing is getting people to use models, use AI, or use your AI that might not be able to do it without having access to the model.

Lex Fridman (00:34:46) I guess we should state explicitly, so we’ve been talking about these Chinese models and open weight models. Oftentimes, the way they’re run is locally. So it’s not like you’re sending your data to China or to whoever developed the model in Silicon Valley.

Nathan Lambert (00:35:04) A lot of American startups make money by hosting these models from China and selling them. It’s called selling tokens, which means somebody will call the model to do some piece of work. I think the other reason is for US companies like OpenAI. OpenAI is so GPU deprived; they’re at the limits of the GPUs. Whenever they make a release, they’re always talking about how their GPUs are hurting. And I think in one of these gpt-oss-120b release sessions, Sam Altman said, “Oh, we’re releasing this because we can use your GPUs. We don’t have to use our GPUs and OpenAI can still get distribution out of this,” which is another very real thing, because it doesn’t cost them anything.

Sebastian Raschka (00:35:43) And for the user, I think also, I mean, there are users who just use the model locally how they would use ChatGPT. But also for companies, I think it’s a huge unlock to have these models because you can customize them, you can train them, you can add more data post-training, like specialize them into, let’s say, law, medical models, whatever you have. And you mentioned Llama; the appeal of the open weight models from China is that the licenses are even friendlier. I think they are just unrestricted open source licenses, whereas if we use something like Llama or Gemma, there are some strings attached. I think it’s like an upper limit in terms of how many users you have.

Sebastian Raschka (00:36:21) And then if you exceed so many million users, you have to report your financial situation to, let’s say, Meta or something like that. And I think while it is a free model, there are strings attached, and people like things where strings are not attached. So I think that’s also one of the reasons besides performance why the open weight models from China are so popular, because you can just use them. There’s no catch in that sense.

Nathan Lambert (00:36:46) The ecosystem has gotten better on that front, but mostly downstream of these new providers providing such open licenses. That was funny when you pulled up Perplexity and said, “Kimi K2 Thinking hosted in the US.” Which is an exact example of what we’re talking about where people are sensitive to this. Kimi K2 Thinking is a model that is very popular. People say that has very good creative writing and also in doing some software things. So it’s just these little quirks that people pick up on with different models that they like.

Lex Fridman (00:37:14) What are some interesting ideas that some of these models have explored that you can speak to, like that are particularly interesting to you?

Sebastian Raschka (00:37:21) Maybe we can go chronologically. I mean, there was, of course, DeepSeek R1 that came out in January of 2025. However, this was based on DeepSeek-V3, which came out the year before in December 2024. There are multiple things on the architecture side. What is fascinating is you can still—I mean, that’s what I do with my from-scratch coding projects—you can still start with GPT-2, and you can add things to that model to make it into this other model. So it’s all still kind of like the same lineage. There is a very close relationship between those. But top of my head, DeepSeek, what was unique there is the Mixture of Experts. I mean, they were not inventing Mixture of Experts.

Sebastian Raschka (00:38:00) We can maybe talk a bit more about what Mixture of Experts means. But just to list these things first before we dive into detail: Mixture of Experts, but then they also had multi-head latent attention, which is a tweak to the attention mechanism. This was, I would say in 2025, the main distinguishing factor between these open weight models: different tweaks to make inference or KV cache size more economical. We can also define KV cache in a few moments. But it makes it more economical to have long context, to shrink the KV cache size. So what are tweaks that we can do? Most of them focused on the attention mechanism. There is multi-head latent attention in DeepSeek; there is group query attention, which is still very popular.

Sebastian Raschka (00:38:44) It’s not invented by any of those models; it goes back a few years. But that would be the other option. Sliding window attention, I think OLMo 3 uses it if I remember correctly. So there are these different tweaks that make the models different. Otherwise, I put them all together in an article once where I just compared them; they are surprisingly similar. It’s just different numbers in terms of how many repetitions of the transformer block you have in the center and just little knobs that people tune. But what’s so nice about it is it works no matter what. You can tweak things, you can move the normalization layers around to get some performance gains.

Sebastian Raschka (00:39:23) And OLMo is always very good in ablation studies, showing what it actually does to the model if you move something around. Ablation studies: does it make it better or worse? But there are so many ways you can implement a transformer and make it still work. The big ideas that are still prevalent are Mixture of Experts, multi-head latent attention, sliding window attention, and group query attention. And then at the end of the year, we saw a focus on making the attention mechanism scale linearly with inference token prediction. So there was Qwen3-neXt, for example, which added a gated delta net. It’s inspired by state space models, where you have a fixed state that you keep updating. But it makes essentially this attention cheaper, or it replaces attention with a cheaper operation.

Transformers: Evolution of LLMs since 2019

Lex Fridman (00:40:08) And it may be useful to step back and talk about transformer architecture in general.

Sebastian Raschka (00:40:13) Yeah, so maybe we should start with GPT-2 architecture, the transformer that was derived from the “Attention Is All You Need” paper.

Sebastian Raschka (00:40:21) So the “Attention Is All You Need” paper had a transformer architecture that had two parts: an encoder and a decoder. And GPT went with just focusing in on the decoder part. It is essentially still a neural network and it has this attention mechanism inside. And you predict one token at a time. You pass it through an embedding layer. There’s the transformer block. The transformer block has attention modules and a fully connected layer. And there are some normalization layers in between. But it’s essentially neural network layers with this attention mechanism. So coming from GPT-2 when we move on to gpt-oss-120b, there is, for example, the Mixture of Experts layer. It’s not invented by GPT-OSS; it’s a few years old.

Sebastian Raschka (00:41:04) But it is essentially a tweak to make the model larger without consuming more compute in each forward pass. So there is this fully connected layer, and if listeners are familiar with multi-layer perceptrons, you can think of a mini multi-layer perceptron, a fully connected neural network layer inside the transformer. And it’s very expensive because it’s fully connected. If you have a thousand inputs and a thousand outputs, that’s like a million connections. And it’s a very expensive part in this transformer. And the idea is to kind of expand that into multiple feedforward networks. So instead of having one, let’s say you have 256, but you don’t use all of them at the same time.

Sebastian Raschka (00:41:49) So you now have a router that says, “Okay, based on this input token, it would be useful to use this fully connected network.” And in that context, it’s called an expert. So a Mixture of Experts means you have multiple experts. And depending on what your input is—let’s say it’s more math-heavy—it would use different experts compared to, let’s say, translating input text from English to Spanish. It would maybe consult different experts. It’s not as clear-cut to say, “Okay, this is only an expert for math and this for Spanish.” It’s a bit more fuzzy. But the idea is essentially that you pack more knowledge into the network, but not all the knowledge is used all the time.

Sebastian Raschka (00:42:27) That would be very wasteful. So yeah, kind of like during the token generation, you are more selective. There’s a router that selects which tokens should go to which expert. It adds more complexity. It’s harder to train. There’s a lot that can go wrong, like collapse and everything. So I think that’s why OLMo 3 still uses dense… I mean, you have, I think, OLMo models with Mixture of Experts, but dense models, where dense means… So also, it’s jargon. There’s a distinction between dense and sparse. So Mixture of Experts is considered sparse because we have a lot of experts, but only a few of them are active. And then dense would be the opposite, where you only have, like, one fully connected module, and it’s always utilized.

Lex Fridman (00:43:08) So maybe this is a good place to also talk about KV cache. But actually, before that, even zooming out, fundamentally, how many new ideas have been implemented from GPT-2 to today? Like, how different really are these architectures?

Sebastian Raschka (00:43:25) Picture like the Mixture of Experts. The attention mechanism in gpt-oss-120b, that would be the Group Query Attention mechanism. So it’s a slight tweak from multi-head attention to Group Query Attention, so that we have two. I think they replaced LayerNorm by RMSNorm, but it’s just like a different normalization there and not a big change. It’s just like a tweak. The nonlinear activation function—for people familiar with deep neural networks, I mean, it’s the same as changing sigmoid with ReLU. It’s not changing the network fundamentally. It’s just like a tweak. And that’s about it, I would say. It’s not really fundamentally that different. It’s still the same architecture. So you can convert one from one… You can go from one into the other by just adding these changes, basically.

Lex Fridman (00:44:09) It fundamentally is still the same architecture.

Sebastian Raschka (00:44:12) Mm-hmm. Yep. So for example, you mentioned my book earlier. That’s a GPT-2 model in the book because it’s simple and it’s very small, so 124 million parameters approximately. But in the bonus materials, I do have OLMo from scratch, Gemini 3 from scratch, and other types of from-scratch models. And I always start with my GPT-2 model and just, you know, add different components and you get from one to the other. It’s kind of like a lineage in a sense. Yeah.

Lex Fridman (00:44:37) Can you build up an intuition for people? Because sort of when you zoom out and look at it, there’s so much rapid advancement in the AI world, and at the same time, fundamentally the architectures have not changed. So where is all the turbulence, the turmoil of the advancement happening? Where are the gains to be had?

Sebastian Raschka (00:45:01) So there are the different stages where you develop or train the network. You have pre-training. Now back in the day, it was just pre-training with GPT-2. Now you have pre-training, mid-training, and post-training. So I think right now we are in the post-training focus stage. I mean, pre-training still gives you advantages if you scale it up to better, higher quality data. But then we have capability unlocks that were not there with GPT-2, for example. ChatGPT is basically a GPT-3 model. And GPT-3 is the same as GPT-2 in terms of architecture. What was new was adding the supervised fine-tuning and the Reinforcement Learning with Human Feedback. So, it’s more on the algorithmic side rather than the architecture.

Nathan Lambert (00:45:44) I would say that the systems also change a lot. I think if you listen to NVIDIA’s announcements, they talk about things like, “You now do FP8, you can now do FP4.” And what is happening is these labs are figuring out how to utilize more compute to put into one model, which lets them train faster and lets them put more data in. And then you can find better configurations faster by doing this. So you can look at the tokens per second per GPU as a metric that you look at when you’re doing large-scale training. And you can go from, like, 10K to 13K by turning on FP8 training, which means you’re using less memory per parameter in the model. And by saving less information, you do less communication and you can train faster.

Nathan Lambert (00:46:24) So all of these system things underpin way faster experimentation on data and algorithms. It’s this kind of loop that keeps going where it’s kinda hard to describe when you look at the architecture and they’re exactly the same. But the code base used to train these models is gonna be vastly different— …and you could probably… the GPUs are different, but you probably train gpt-oss-20b way faster in wall clock time than GPT-2— …was trained at the time.

Sebastian Raschka (00:46:54) Yeah. Like you said, they had, for example, in the Mixture of Experts, this NVIDIA FP4 optimization where you get more throughput. But I do think for the speed, this is true, but it doesn’t give the model new capabilities in a sense. It’s just: how much can we make the computation coarser without suffering in terms of model performance degradation? But I do think there are alternatives popping up to the transformer. There are text diffusion models, a completely different paradigm. And although text diffusion models might use transformer architectures, it’s not an autoregressive transformer. And also Mamba models; it’s a State Space Model.

Sebastian Raschka (00:47:34) But they do have trade-offs, and what’s true is there’s nothing that has replaced the autoregressive transformer as the state-of-the-art model. So, for state-of-the-art, you would still go with that thing, but there are now alternatives for the cheaper end—alternatives that are kind of making compromises, but it’s not just one architecture anymore. There are little ones coming up. But if we talk about the state-of-the-art, it’s pretty much still the transformer architecture, autoregressive, derived from GPT-2 essentially.

AI Scaling Laws: Are they dead or still holding?

Lex Fridman (00:48:06) I guess the big question here is—we talked quite a bit here on the architecture behind the pre-training—are the scaling laws holding strong across pre-training, post-training, inference, context size, data, and synthetic data?

Nathan Lambert (00:48:20) I’d like to start with the technical definition of a scaling law-

Nathan Lambert (00:48:23) …which kind of informs all of this. The scaling law is the power law relationship between… You can think of the x-axis—what you are scaling—as a combination of compute and data, which are kind of similar, and then the y-axis is like the held-out prediction accuracy over our next tokens. We talked about models being autoregressive. It’s like if you keep a set of text that the model has not seen, how accurate will it get when you train? And the idea of scaling laws came when people figured out that that was a very predictable relationship. I think that technical term is continuing, and then the question is, what do users get out of it? And then there are more types of scaling, where OpenAI’s o1 was famous for introducing inference-time scaling.

Nathan Lambert (00:49:07) And I think less famously for also showing that you can scale reinforcement learning training and get kind of this log x-axis and then a linear increase in performance on the y-axis. So there are kind of these three axes now where the traditional scaling laws are talked about for pre-training—which is how big your model is and how big your dataset is—and then scaling reinforcement learning, which is like how long can you do this trial and error learning that we’ll talk about. We’ll define more of this, and then this inference-time compute, which is just letting the model generate more tokens on a specific problem.

Nathan Lambert (00:49:37) So I’m kind of bullish; they’re all really still working, but the low-hanging fruit has mostly been taken, especially in the last year on Reinforcement Learning with Verifiable Rewards, which is this RLVR, and then inference-time scaling. That’s why these models feel so different to use, where previously you would get that first token immediately. And now they’ll go off for seconds, minutes, or even hours generating these hidden thoughts before giving you the first word of your answer. And that’s all about this inference-time scaling, which is such a wonderful kind of step function in terms of how the models change abilities. They enabled this tool use stuff and enabled this much better software engineering that we were talking about.

Nathan Lambert (00:50:17) And this is, when we say enabled, almost entirely downstream of the fact that this Reinforcement Learning with Verifiable Rewards training just let the models pick up these skills very easily. So if you look at the reasoning process when the models are generating a lot of tokens, what it’ll often be doing is: it tries a tool, it looks at what it gets back, it tries another API, it sees what it gets back and if it solves the problem. The models, when you’re training them, very quickly learn to do this.

Nathan Lambert (00:50:46) And then at the end of the day, that gives this kind of general foundation where the model can use CLI commands very nicely in your repo, handle Git for you, move things around, organize things, or search to find more information—which, if we were sitting in these chairs a year ago, is something that we didn’t really think of the models doing. So this is just something that has happened this year and has totally transformed how we think of using AI, which I think is very magical. It’s such an interesting evolution and unlocks so much value. But it’s not clear what the next avenue will be in terms of unlocking stuff like this.

Nathan Lambert (00:51:23) I think that there’s—we’ll get to continual learning later, but there’s a lot of buzz around certain areas of AI, but no one knows when the next step function will really come.

Lex Fridman (00:51:31) So you’ve actually said quite a lot of things there, and said profound things quickly. It would be nice to unpack them a little bit. You say you’re bullish basically on every version of scaling. So can we just start at the beginning? Pre-training: are we implying that the low-hanging fruit on pre-training scaling has been picked? Has pre-training hit a plateau, or are you still bullish on even pre-training?

Nathan Lambert (00:52:01) Pre-training has gotten extremely expensive. I think to scale up pre-training, it’s also implying that you’re going to serve a very large model to the users. So I think that it’s been loosely established the likes of GPT-4 and similar models were around one trillion parameters at the biggest size. There’s a lot of rumors that they’ve actually gotten smaller as training has gotten more efficient. You want to make the model smaller because then your costs of serving go down proportionately. The cost of training these models is really low relative to the cost of serving them to hundreds of millions of users. I think DeepSeek had this famous number of about five million dollars for pre-training at cloud market rates.

Nathan Lambert (00:52:40) In the OLMo 3 paper, section 2.4, we just detailed how long we had the GPU clusters sitting around for training—which includes engineering issues, multiple seeds—and it was about two million dollars to rent the cluster to deal with all the problems and headaches of training a model. So these models are… a lot of people could get one to 10 million dollars to train a model, but the recurring costs of serving millions of users is really billions of dollars of compute. A thousand GPU rental you can pay 100 grand a day for. And these companies could have millions of GPUs. You can look at how much these things cost to sit around.

Nathan Lambert (00:53:19) So that’s kind of a big thing, and then it’s like, if scaling is actually giving you a better model, is it going to be financially worth it? And I think we’ll slowly push it out as AI solves more compelling tasks—like the likes of Claude Opus 4.5 making Claude Code just work for things. I launched this project called the ATOM project, which is American Truly Open Models, in July, and that was like a true vibe-coded website. I have a job to make plots and stuff. Then I came back to refresh it in the last few weeks and Claude Opus 4.5, versus whatever model was available at the time, just crushed all the issues that it had from building in June and July. It might be a bigger model. There’s a lot of things that go into this, but there’s still progress coming.

Lex Fridman (00:54:04) So what you’re speaking to is the nuance of the y-axis of the scaling laws—that the way it’s experienced versus on a benchmark, the actual intelligence might be different. But still, your intuition about pre-training: if you scale the size of compute, will the models get better? Not whether it’s financially viable, but just from the law aspect of it, do you think the models will get smarter?

Nathan Lambert (00:54:28) Yeah. And I think that there’s… And this sometimes comes off as almost disillusioned from leadership at AI companies saying this, but they’re like, “It’s held for 13 orders of magnitude of compute; why would it ever end?” So I think fundamentally it is pretty unlikely to stop. It’s just like eventually we’re not even going to be able to test the bigger scales because of all the problems that come with more compute. I think that there’s a lot of talk on how 2026 is a year when very large NVIDIA Blackwell compute clusters—like gigawatt-scale facilities—are coming online. And these were all contracts for power and data centers that were signed and sought out in ’22 and 2023, before or right after ChatGPT.

Nathan Lambert (00:55:13) So it took this two-to-three-year lead time to build these bigger clusters to train the models, while there’s obviously immense interest in building even more data centers than that. So that is kind of the crux that people are saying: these new clusters are coming. The labs are going to have more compute for training. They’re going to utilize this, but it’s not a given. I’ve seen so much progress that I expect it, and I expect a little bit bigger models. I would say it’s more like we’ll see a $2,000 subscription this year; we’ve already seen $200 subscriptions. It’s like that could 10x again, and these are the kind of things that could come—and they’re all downstream of a bigger model that offers just a little bit more of a cutting edge.

Lex Fridman (00:55:53) So, it’s reported that xAI is going to hit that one-gigawatt scale early ’26, and a full two gigawatts by year end. How do you think they’ll utilize that in the context of scaling laws? Is a lot of that inference? Is a lot of that training?

Nathan Lambert (00:56:12) It ends up being all of the above. I think that all of your decisions when you’re training a model come back to pre-training. So if you’re going to scale RL on a model, you still need to decide on your architecture that enables this. We were talking about other architectures and using different types of attention. We’re also talking about Mixture of Experts models. The sparse nature of MoE models makes it much more efficient to do generation, which becomes a big part of post-training, and you need to have your architecture ready so that you can actually scale up this compute. I still think most of the compute is going in at pre-training. Because you can still make a model better, you still want to go and revisit this.

Nathan Lambert (00:56:53) You still want the best base model that you can. And in a few years that’ll saturate and the RL compute will just go longer.

Lex Fridman (00:57:00) Are there people who disagree with you that say basically pre-training is dead? That it’s all about scaling inference, scaling post-training, scaling context, continual learning, and scaling synthetic data?

Nathan Lambert (00:57:15) People vibe that way and describe it in that way, but I think it’s not the practice that is happening.

Lex Fridman (00:57:19) It’s just the general vibe of people saying this thing is dead—

Nathan Lambert (00:57:21) The excitement is elsewhere. So the low-hanging fruit— …in RL is elsewhere. For example, we released our model in November. Every company has deadlines. Our deadline was like November 20th, and for that, our run was five days, which compared to 2024 is a very long time to just be doing post-training on a model of about 30 billion parameters. It’s not a big model. And then in December, we had another release, which was just letting the RL run for another three and a half weeks, and the model got notably better, so we released it. And that’s a big amount of time to just allocate to something that is going to be your peak— …for the year. So it’s like—

Lex Fridman (00:57:57) The reasoning is—

Nathan Lambert (00:57:58) There’s these types of decisions that happen when they’re training a model where they just can’t leave it forever. You have to keep pulling in the improvements you have from your researchers. So you redo pre-training, you’ll do this post-training for a month, but then you need to give it to your users. You need to do safety testing. I think there’s a lot in place that reinforces this cycle of just keep updating the models. There’s things to improve. You get a new compute cluster that lets you do something maybe more stably or faster. You hear a lot about Blackwell having rollout issues, where at AI2 most of the models we’re pre-training are on like 1,000 to 2,000 GPUs.

Nathan Lambert (00:58:36) But when you’re pre-training on 10,000 or 100,000 GPUs, you hit very different failures. GPUs are known to break in weird ways, and doing a 100,000 GPU run is like… you’re pretty much guaranteed to always have at least one GPU that is down. And you need to have your training code handle that redundancy, which is just a very different problem. Whereas what we’re doing like, “Oh, I’m playing with post-training on DJI Spark,” or people learning ML, what they’re battling to train these biggest models is just like— …mass distributed scale, and it’s very different. But that’s somewhat different than… that’s a systems problem—

Nathan Lambert (00:59:11) …in order to enable the scaling laws, especially at pre-training. You need all of these GPUs at once. When we shift to reinforcement learning, it actually lends itself to heterogeneous compute because you have many copies of the model. To do a primer for language model reinforcement learning, what you’re doing is you have two sets of GPUs. One you can call the actor and one you call the learner. The learner is where your actual reinforcement learning updates happen. These are traditionally policy gradient algorithms. Proximal Policy Optimization, PPO, and Group Relative Policy Optimization, GRPO, are the two popular classes.

Nathan Lambert (00:59:50) On the other side, you’re going to have actors which are generating completions, and these completions are the things that you’re going to grade. Reinforcement learning is all about optimizing reward. In practice, you can have a lot of different actors in different parts of the world doing different types of problems, and then you send it back to this highly networked compute cluster to do this actual learning, where you take the gradients and you need to have a tightly meshed network where you can do different types of parallelism and spread out your model for efficient training. Every different type of training and serving has these considerations you need to scale.

Nathan Lambert (01:00:27) We talked about pre-training, we talked about RL, and then inference time scaling is: how do you serve a model that’s thinking for an hour to 100 million users? I don’t really know about that, but I know that’s a hard problem. In order to give people this intelligence, there’s all these systems problems, and we need more compute and you need more stable compute to do it.

Lex Fridman (01:00:46) But you’re bullish on all of these kinds of scaling is what I’m hearing. On the inference, on the reasoning, even on the pre-training?

Sebastian Raschka (01:00:54) Yeah, so that’s a big can of worms, but there are basically two knobs: training and inference scaling, where you can get gains. In a world where we had infinite compute resources, you’d want to do all of them. You have training, you have inference scaling, and training is like a hierarchy: pre-training, mid-training, and post-training. Changing the model size, more training data, training a bigger model—it gives you more knowledge. Then the model is a better base model, or what we still call a foundation model, and it unlocks capabilities. But you don’t necessarily have the model be able to solve your most complex tasks—

Sebastian Raschka (01:01:34) …tasks during pre-training or after pre-training. You still have these other unlock phases, mid-training or post-training with RL, that unlocks capabilities that the model has in terms of knowledge from the pre-training. And I think, sure, if you do more pre-training, you get a better base model that you can unlock later. But like Nathan said, it just becomes too expensive. We don’t have infinite compute, so you have to decide: do I want to spend that compute more on making the model larger? It’s a trade-off. In an ideal world, you want to do all of them. And I think in that sense, scaling is still pretty much alive.

Sebastian Raschka (01:02:08) You would still get a better model, but like we saw with Claude 4.5, it’s just not worth it. I mean, because you can unlock more performance with other techniques at that moment, especially if you look at inference scaling. That’s one of the biggest gains this year with o1, where it took a smaller model further than pre-training a larger model like Claude 4.5. So, I wouldn’t say pre-training scaling is dead; it’s just that there are other more attractive ways to scale right now. But at some point, you will still want to make some progress on the pre-training. The thing to consider is where you want to spend your money.

Sebastian Raschka (01:02:47) If you spend it more on pre-training, it’s a fixed cost. You train the model, and then it has this capability forever. You can always use it. With inference scaling, you don’t spend money during training; you spend money later per query, and then it’s about the math. How long is my model going to be on the market if I replace it in half a year? Maybe it’s not worth spending 5 million, 10 million, or 100 million dollars on training it longer. Maybe I will just do more inference scaling and get the performance from there. It maybe costs me 2 million in terms of user queries. It becomes a question of how many users you have and doing the math. I think that’s also where it’s interesting, where ChatGPT is in a position.

Sebastian Raschka (01:03:27) I think they have a lot of users where they need to go a bit cheaper, where they have that GPT-5 model that is a bit smaller. For other companies, their customers have other trade-offs. For example, there were the math problems or the Math Olympiad where they had a proprietary model, and I’m pretty sure it’s just a model that has been fine-tuned a little bit more, but most of it was inference scaling to achieve peak performance in certain tasks where you don’t need that all the time. But yeah, long story short, I do think pre-training, mid-training, post-training, and inference scaling are all still things you want to do. At the moment, this year, it’s finding the right ratio that gives you the best bang for the buck, basically.

How AI is trained: Pre-training, Mid-training, and Post-training

Lex Fridman (01:04:13) I think this might be a good place to define pre-training, mid-training, and post-training.

Sebastian Raschka (01:04:18) So, pre-training is the classic training one next token prediction at a time. You have a big corpus of data. Nathan probably also has very interesting insights there because of OLMo 3. A big portion of the paper focuses on the right data mix. So, pre-training is essentially just training across entropy loss, training on next token prediction on a vast corpus of internet data, books, papers and so forth. It has changed a little bit over the years in the sense people used to throw in everything they can. Now, it’s not just raw data. It’s also synthetic data where people rephrase certain things. So synthetic data doesn’t necessarily mean purely AI-made-up data.

Sebastian Raschka (01:04:58) It’s also taking something from a Wikipedia article and then rephrasing it as a Q&A question or summarizing it, rewarding it, and making better data that way. I think of it like with humans. If someone reads a book compared to a messy—no offense, but like—Reddit post or something like that. I do think you learn—no offense, but I think—

Lex Fridman (01:05:25) There’s going to be a post about this, Sebastian.

Nathan Lambert (01:05:28) Some Reddit data is very coveted and excellent for training. You just have to filter it.

Sebastian Raschka (01:05:33) And I think that’s the idea. I think it’s like if someone took that and rephrases it in a, let’s say, more concise and structured way— I think it’s higher quality data that gets the LLM maybe the same—you get the same LLM out of it at the end, but it gets there faster. It trains faster because if the grammar and the punctuation are correct, it already learns the correct way versus getting information from a messy way and then learning later how to correct that. So, I think that is how pre-training evolved and why scaling still works; it’s not just about the amount of data, it’s also the tricks to make that data better for you. And then mid-training is… I mean, it used to be called pre-training.

Sebastian Raschka (01:06:21) I think it’s called mid-training because it was awkward to have pre-training and post-training but nothing in the middle, right? It sounds a bit weird. You have pre-training and post-training, but what’s the actual training? So, the mid-training is usually similar to pre-training, but it’s a bit more specialized. It’s the same algorithm, but what you do is you focus, for example, on long context documents. The reason you don’t do that during pre-training is because you don’t have that many long context documents. We have a specific phase. And one problem of LLMs is still that it’s a neural network; it has the problem of catastrophic forgetting.

Sebastian Raschka (01:06:56) So, you teach it something, it forgets other things. It’s not 100% forgetting, but there’s no free lunch. It’s also the same with humans. If you ask me some math I learned 10 years ago, I wouldn’t know; I would have to look at it again.

Lex Fridman (01:07:09) Nathan was actually saying that he’s consuming so much content that there’s a catastrophic forgetting issue.

Nathan Lambert (01:07:14) Yeah, I’m trying to learn so much about AI, and it’s like when I was learning about pre-training parallelism, I’m like, “I lost something and I don’t know what it was.”

Sebastian Raschka (01:07:22) I don’t want to anthropomorphize LLMs, but I think it’s the same in terms of how humans learn. Quantity is not always better because it’s about being selective. Mid-training is being selective in terms of quality content at the end, so the last thing the LLM has seen is the quality stuff. And then post-training is all the fine-tuning: supervised fine-tuning, DPO, RLVR with human feedback and so forth. So, the refinement stages. And it’s also interesting, the cost thing, right? Pre-training, you spend a lot of money on that right now. RL a bit less. RL, you don’t really teach it knowledge; it’s more like unlocking the knowledge.

Sebastian Raschka (01:08:03) It’s more like skill learning, like how to solve problems with the knowledge that it has from pre-training. There are actually three papers this year, or last year, 2025, on RL for pre-training. But I don’t think anyone does that in production.

Nathan Lambert (01:08:17) Toy, toy examples for now.

Sebastian Raschka (01:08:18) Toy examples, right. Но to generalize, RL post-training is more like the skill unlock, where pre-training is like soaking up the knowledge essentially.

Nathan Lambert (01:08:26) A few things that could be helpful for people. A lot of people think of synthetic data as being bad for training models. You mentioned that DeepSeek got an OCR—Optical Character Recognition—paper. A lot of labs did; AI2 had one, others had multiple. And the reason each of these labs has these is because there’s vast amounts of PDFs and other digital documents on the web in formats that aren’t encoded with text easily. So you use these, like DeepSeek OCR or what we called OLMo OCR, to extract what can be trillions of tokens of candidate data. Pre-training dataset size is on the order of trillions; it’s measured in trillions of tokens.

Nathan Lambert (01:09:10) Smaller models from researchers can be something like 5 to 10 trillion. Qwen is documented going up to like 50 trillion, and there’s rumors that these closed labs can go to 100 trillion tokens. Getting this potential data is a very big funnel, and the data you actually train the model on is a small percentage of this. This character recognition data would be described as synthetic data for pre-training in a lab. And then there’s also the fact that ChatGPT now gives wonderful answers, and you can train on those best answers; that’s synthetic data. It’s very different than the early ChatGPT hallucinations data.

Sebastian Raschka (01:09:48) One interesting question is, if I recall correctly, OLMo 3 was trained with less data than specifically some other open-weight models, maybe even OLMo 2. But you still got better performance, and that might be one of the examples of how the data helped.

Nathan Lambert (01:10:01) It’s mostly down to data quality. I think if we had more compute, we would train for longer. I think we’d ultimately see that as something we would want to do. Especially with big models, you need more compute because big models can absorb more from data, and you get more benefit out of this. It’s like one of those logarithmic graphs—a small model will level off sooner if you’re measuring tons of tokens, and bigger models need more. But mostly, we aren’t training that big of models right now at AI2, and getting the highest quality data we can is the natural starting point.

Lex Fridman (01:10:38) Is there something to be said about the topic of data quality? Is there some low-hanging fruit there still where the quality could be improved?

Nathan Lambert (01:10:46) It’s like turning the crank. So I think historically, in the open, there’s been a canonical best pre-training dataset that has moved around between who has the most recent one or the best recent effort. Like AI2’s Dolma was very early with the first OLMo and Hugging Face had FineWeb. And there’s the DCLM project, which stands for Data Comp Language Model. There’s been Data Comp for other machine learning projects, and they had a very strong dataset. A lot of it is the internet becoming fairly closed off, so we have Common Crawl, which I think is hundreds of trillions of tokens, and you filter it.

Nathan Lambert (01:11:21) And it looks like a lot of scientific work where you’re training classifiers and making decisions based on how you prune down this dataset into the highest quality stuff and the stuff that suits your tasks. Previously, language models were tested a lot more on knowledge and conversational things, but now they’re expected to do math and code. To train a reasoning model, you need to remix your whole dataset. And there are actually some wonderful scientific methods here where you can take your gigantic dataset and sample a lot of really tiny things from different sources, like GitHub, Stack Exchange, Reddit, or Wikipedia.

Nathan Lambert (01:11:56) You can sample small things from them, train small models on each of these mixes, and measure their performance on your evaluations. And you can just do basic linear regression, and it’s like, “Here’s your optimal dataset.” But if your evaluations change, your dataset changes a lot. So a lot of OLMo 3 was adding new sources for reasoning to be better at math and code, and then you do this mixing procedure and it gives you the answer. I think a lot of that’s happened at labs this year; there are new hot things, whether it’s coding environments or web navigation, and you just need to bring in new data and change your whole pre-training so that your post-training can work better. And that’s like the constant re-evolution and the re-determining of what they care about for their models.

Lex Fridman (01:12:35) Are there fun anecdotes of what sources of data are particularly high quality that we wouldn’t expect? You mentioned Reddit sometimes can be a source.

Nathan Lambert (01:12:45) Reddit was very useful. I think that PDFs are definitely one.

Sebastian Raschka (01:12:51) Oh, especially arXiv.

Nathan Lambert (01:12:52) Yeah, AI2 has run Semantic Scholar for a long time, which is a competitor to Google Scholar with a lot more features. To do this, AI2 has found and scraped a lot of PDFs for openly accessible papers that might not be behind the closed paid garden of a certain publisher—truly open scientific PDFs. If you sit on all of these and process them, you can get value out of it. I think a lot of that style of work has been done by the frontier labs much earlier. You need to have a pretty skilled researcher that understands how things change models, and they bring it in and clean it; it’s a lot of labor.

Nathan Lambert (01:13:34) I think at a lot of frontier labs, when they scale researchers, a lot more goes into data. If you join a frontier lab and you want to have impact, the best way to do it is just find new data that’s better. The fancy, glamorous algorithmic things, like figuring out how to make o1, is like the sexiest thought for a scientist. It’s like, “Oh, I figured out how to scale RL.” There’s a group that did that, but I think most of the contributions are-

Lex Fridman (01:13:58) On the dataset.

Nathan Lambert (01:13:58) … “I’m gonna make the data better,” or, “I’m gonna make the infrastructure better so that everybody on my team can run experiments 5% faster.”

Sebastian Raschka (01:14:04) At the same time, I think it’s also one of the closest guarded secrets—what your training data is—for legal reasons. And so there’s also a lot of work that goes into hiding what your training data was, essentially, trying to get the model to not give away the sources because of those legal reasons.

Nathan Lambert (01:14:19) The other thing, to be complete, is that some people are trying to train on only licensed data, whereas Common Crawl is a scrape of the whole internet. If I host multiple websites, I’m happy to have them train language models, but I’m not explicitly licensing what governs it. Therefore, Common Crawl is largely unlicensed, which means your consent really hasn’t been provided for how to use the data. There’s another idea where you can train language models only on data that has been licensed explicitly so that the kind of governing contract is provided. I’m not sure if Apertus is the copyright thing or the license thing. I know that the reason they did it was for an EU compliance thing, where they wanted to make sure that their model fit one of those checks.

Sebastian Raschka (01:15:01) Mm-hmm. And on that note, there’s also the distinction between the licensing. Some people, like you said, just purchase the license. Let’s say they buy an Amazon Kindle book or a Manning book, and then use that in the training data; that is a gray zone because you paid for the content and you might want to train on it. But then there are also restrictions where even that shouldn’t be allowed. That is where it gets a bit fuzzy.

Sebastian Raschka (01:15:28) And I think that is still a hot topic right now. Big companies like OpenAI approached private companies for their proprietary data, and private companies are becoming more and more protective of their data because they know, “Okay, this is going to be my moat in a few years.” And I do think that’s the interesting question. If LLMs become more commoditized, and a lot of people learn about LLMs, there will be a lot more people able to train them. Of course, there are infrastructure challenges.

Sebastian Raschka (01:16:00) But if you think of big industries like pharmaceuticals, law, or finance, I do think they at some point will hire people from other frontier labs to build their in-house models on their proprietary data, which will be another unlock with pre-training that is currently not there. Because even if you wanted to, you can’t get that data—you can’t get access to clinical trials most of the time and these types of things. So I do think scaling in that sense might still be pretty much alive if you look at domain-specific applications, because right now we are just looking at general-purpose LLMs like ChatGPT, Anthropic, and so forth. They are just general purpose. They’re not even scratching the surface of what an LLM can do if it is really specifically trained and designed for a specific task.

Nathan Lambert (01:16:47) I think on the data thing, this is one of the things where, like, this happened in 2025 and we totally forget it: Anthropic lost in court and owed $1.5 billion to authors. Anthropic, I think, bought thousands of books and scanned them and was cleared legally for that because they bought the books, and that is going through the system. And then on the other side, they also torrented some books, and I think this torrenting was the path where the court said that they were then culpable to pay these billions of dollars to authors, which is just such a mind-boggling lawsuit that kind of just came and went. Like, that is so much money- … from the VC ecosystem.

Lex Fridman (01:17:22) These are court cases that will define the future of human civilization because it’s clear that data drives a lot of this, and there’s this very complicated human tension. I mean, you can empathize. You’re both authors. And there’s some degree to which, I mean, you put your heart and soul and your sweat and tears into the writing that you do. It feels a little bit like theft for somebody to train on your data without giving you credit.

Sebastian Raschka (01:17:49) And there are, like Nathan said, also two layers to it. Someone might buy the book and then train on it, which could be argued fair or not fair, but then there are the straight-up companies who use pirated books where they’re not even compensating the author. That is, I think, where people got a bit angry about it specifically, I would say.

Lex Fridman (01:18:06) Yeah, but there has to be some kind of compensation scheme. This is like moving towards something like Spotify streaming did originally for music. You know, what does that compensation look like? You have to define those kinds of models. You have to think through all of that. One other thing I think people are generally curious about, I’d love to get your thoughts: as LLMs are used more and more, if you look at even arXiv or GitHub, more and more of the data is generated by LLMs. What do you do in that kind of world? How big of a problem is that?

Nathan Lambert (01:18:38) The largest problem is the infrastructure and systems, but from an AI point of view, it’s kind of inevitable.

Lex Fridman (01:18:45) So it’s basically LLM-generated data that’s curated by humans essentially, right?

Nathan Lambert (01:18:49) Yes, and I think that a lot of open source contributors are legitimately burning out. If you have a popular open source repo, somebody’s like, “Oh, I want to do open source AI. It’s good for my career,” and they just vibe code something and throw it in. You might get more of this than I do.

Sebastian Raschka (01:19:05) Yeah, so I actually have a case study here. I have a repository called mlxtend that I developed as a student, around 10 or 15 years ago, and it is a reasonably popular library still for certain algorithms, especially frequent data mining stuff. There were recently two or three people who submitted a lot of PRs in a very short amount of time. I do think LLMs have been involved in submitting these PRs. Me, as the maintainer, there are two things. First, I’m a bit overwhelmed; I don’t have time to read through it because, especially since it’s an older library, that is not a priority for me. At the same time, I kind of also appreciate it because I think something people forget is it’s not just using the LLM.

Sebastian Raschka (01:19:46) There’s still a human layer that verifies something, and that is in a sense also how data is labeled, right? One of the most expensive things is getting labeled data for RLHF (Reinforcement Learning from Human Feedback) phases. This is kind of like that, where it goes through phases and then you actually get higher quality data out of it. So I don’t mind it, in a sense. It can feel overwhelming, but I do think there is also value in it.

Lex Fridman (01:20:11) It feels like there’s a fundamental difference between raw LLM-generated data and LLM-generated data with a human in the loop that does some kind of verification, even if that verification is a small percentage- … of the lines of code.

Sebastian Raschka (01:20:25) I think this goes with anything where people think, “Oh, yeah. I can just use an LLM to learn about XYZ,” which is true. You can, but there might be a person who is an expert who might have used an LLM to write specific code. There is this human work that went into it to make it nice and throwing out the not-so-nice parts to pre-digest it for you, and that saves you time. And I think that’s the value-add where you have someone filtering things or even using the LLMs correctly. I think this is still labor that you get for free. For example, when you read a Substack article.

Sebastian Raschka (01:21:05) I could maybe ask an LLM to give me opinions on that, but I wouldn’t even know what to ask. And I think there is still value in reading that article compared to me going to the LLM because you are the expert. You select what knowledge is actually spot on and should be included, and you give me this executive summary. This is a huge value-add because now I don’t have to waste three to five hours to go through this myself and maybe get some incorrect information. And so I think that’s also where the future still is for writers, even though there are LLMs that can save you time.

Lex Fridman (01:21:43) It’s kind of fascinating to actually watch—and I’m sure you guys do this, but for me to look at the difference between a summary and the original content. Even if it’s a page-long summary of page-long content, it’s interesting to see how the LLM-based summary takes the edge off. What is the signal it removes from the thing?

Nathan Lambert (01:22:07) The voice is what I talk about a lot.

Lex Fridman (01:22:09) Voice? Well, voice… I would love to hear what you mean by voice, that’s really powerful, but sometimes there’s like literally insights. Like in removing an insight, you’re actually fundamentally changing the meaning of the thing. So I’m continuously disappointed by how bad LLMs are at really getting to the core insights, which is what a great summary does. Yet even if you use extensive, extremely elaborate prompts where I’m really trying to dig for the insights, it’s still not quite there which… I mean, that’s a whole deep philosophical question about what is human knowledge and wisdom and what does it mean to be insightful. But when you talk about the voice, what do you mean?

Nathan Lambert (01:22:52) So when I write, I think a lot of what I’m trying to do is take what you think as a researcher, which is very raw. A researcher is trying to encapsulate an idea at the frontier of their understanding, and they’re trying to put what is a feeling into words. And I think that in my writing, I try to do this, which makes it come across as raw but also high-information in a way that some people will get and some won’t. And that’s kind of the nature of research. And I think this is something that language models don’t do well. Particularly, they’re all trained with this reinforcement learning from human feedback which is designed to take feedback from a lot of people and, in a way, average how the model behaves from this.

Nathan Lambert (01:23:30) And I think that it’s going to be hard for a model to be very incisive when there’s that sort of filter in it. This is a wonderful fundamental problem for researchers in RLHF: this provides so much utility in making the models better, but also the problem formulation has this knot in it that you can’t get past. These language models don’t have this prior in their deep expression that they’re trying to get at. I don’t think it’s impossible to do. I think there are stories of models that really shock people. Like, I would love to have tried Bing Sydney—did that have more voice? Because it would so often go off the rails on people and affect…

Nathan Lambert (01:24:13) And what is historically, obviously, a scary way—like telling a reporter to leave his wife—is a crazy model to potentially put in general adoption. But that’s kind of the trade-off: is this RLHF process, in some ways, adding limitations?

Lex Fridman (01:24:28) That’s a terrifying place to be as one of these frontier labs and companies because millions of people are using them.

Nathan Lambert (01:24:35) There was a lot of backlash last year with GPT-4o getting removed. I’ve personally never used the model, but I’ve talked to people at OpenAI who get emails from users that might be detecting subtle differences in the deployments in the middle of the night. And they email them and say, “My friend is different.” They find these employees’ emails and send them things because they are so attached to what is a set of model weights and a configuration that is deployed to the users. We see this with TikTok. I don’t use TikTok, but supposedly, in five minutes, the algorithm gets you. It’s locked in. And those are language models doing recommendations.

Nathan Lambert (01:25:15) Like, I think there are ways that you can do this with a language model where, within five minutes of chatting with it, the model just gets you. And that is something that people aren’t really ready for. I think that—don’t give that to kids. Don’t give that to kids- at least until we know what’s happening.

Lex Fridman (01:25:30) But there’s also going to be this mechanism… What’s going to happen with these LLMs as they’re used more and more… Unfortunately, the nature of the human condition is such that people commit suicide. And what journalists will do is report extensively on the people who commit suicide, and they will very likely link it to the LLMs because they have that data about the conversations. If you’re really struggling, if you’re depressed, if you’re thinking about suicide, you’re going probably to talk to LLMs about it. And so what journalists will do is say, “The suicide was committed because of the LLM.” And that’s going to lead to the companies, because of legal issues and so on, more and more taking the edge off of the LLM.

Lex Fridman (01:26:13) So it’s going to be as generic as possible. It’s so difficult to operate in this space because, of course, you don’t want an LLM to cause harm to humans at that level, but also, this is the nature of the human experience—to have a rich conversation, a fulfilling conversation, one that challenges you and from which you grow. You need that edge. And that’s something extremely difficult for AI researchers on the RLHF front to actually have to solve because you’re actually dealing with the human condition.

Nathan Lambert (01:26:47) A lot of researchers at these companies are so well-motivated. Anthropic and OpenAI are culturally so wanting to do good for the world through this. And it’s such a… I’m like, “Ooh, I don’t want to work on this,” because, on the one hand, a lot of people see AI as a health ally, as somebody they can talk to about their health confidentially, but then it bleeds all the way into talking about mental health. It’s heartbreaking that this might be the thing where somebody goes over the edge, but other people might be saved. And there’s things that as a researcher training models, it’s like, I don’t want to train image generation models and release them openly because I don’t want to enable somebody to have a tool on their laptop that can harm other people.

Nathan Lambert (01:27:34) I don’t have the infrastructure in my company to do that safely. There are a lot of areas like this where it needs people who will approach it with complexity and the conviction that it’s just such a hard problem.

Lex Fridman (01:27:47) But also, we as a society and as users of these technologies need to make sure that we’re having the complicated conversation about it versus just fearmongering that big tech is causing harm to humans or stealing your data. It’s more complicated than that. And you’re right, there’s a very large number of people inside these companies, many of whom you know and many of whom I know, that deeply care about helping people. They are considering the full human experience of people from across the world, not just Silicon Valley—what their needs are and what that means. It’s really difficult to design this one system that is able to help all these different kinds of people across different age groups, cultures, and mental conditions.

Nathan Lambert (01:28:31) I wish that the timing of AI was different regarding the relationship of big tech to the average person. Big tech’s reputation is so low, and because AI is so expensive, it’s inevitably going to be a big tech thing. It takes so many resources, and people say the US is, quote-unquote, “betting the economy on AI” with this build-out. To have these be intertwined at the same time makes for such a hard communication environment. It would be good for me to go talk to more people in the world who hate big tech and see AI as a continuation of that.

Lex Fridman (01:29:02) One of the things you actually recommend, one of the antidotes that you talk about, is to find agency in this whole system, as opposed to sitting back in a powerless way and consuming the AI slop as it rapidly takes over the internet. Find agency by using AI to build things—build apps, build… One, that actually helps you build intuition, but two, it’s empowering because you can understand how it works and what the weaknesses are. It gives your voice power to say, “This is bad use of the technology, and this is good use of technology.” You’re more plugged into the system then, so you can understand it better and steer it better as a consumer.

Sebastian Raschka (01:29:48) I think that’s a good point you brought up about agency. Instead of ignoring it and saying, “Okay, I’m not going to use it,” I think it’s probably long-term healthier to say, “Okay, it’s out there. I can’t put it back.” It’s like the internet and computers when they first came out. How do I make the best use of it, and how does it help me up-level myself? The one thing I worry about here, though, is if you just fully use it for something you love to do, the thing you love to do is no longer there. That could potentially lead to burnout. For example, if I use an LLM to do all my coding for me, now there’s no coding; I’m just managing something that is coding for me.

Sebastian Raschka (01:30:24) Two years later, let’s say, if I just do that eight hours a day—having something code for me—do I still feel fulfilled? Is this hurting me in terms of being excited about my job and what I’m doing? Am I still proud to build something?

Lex Fridman (01:30:43) On that topic of enjoyment, it’s quite interesting. We should just throw this in there, that there’s this recent survey of about 791 professional developers—professional meaning 10-plus years of experience.

Nathan Lambert (01:30:55) That’s a long time. As a junior developer?

Lex Fridman (01:31:01) Yeah, in this day and age. The results are surprising on many fronts. They break it down by junior and senior developers, and it shows that both groups use AI-generated code in the code they ship. This is not just for fun or learning; this is code they ship. Most of them use it for around 50% or more. What’s interesting is that for the category where over 50% of the shipped code is AI-generated, senior developers are much more likely to do so. But you don’t want AI to take away the thing you love. I think this speaks to my experience. These particular results show that about 80% of people find it either somewhat more enjoyable or significantly more enjoyable to use AI as part of their work.

Sebastian Raschka (01:31:59) I think it depends on the task. From my personal usage, for example, I have a website where I sometimes tweak things. I personally don’t enjoy this, so if the AI can help me implement something on my website, I’m all for it. It’s great. But at the same time, when I solve a complex problem—if there’s a bug, and I hunt this bug and find it—it’s the best feeling in the world. You get so much joy. But now, if you don’t even think about the bug and just go directly to the LLM, you never have that kind of feeling, right?

Sebastian Raschka (01:32:38) But then there could be a middle ground where you try it yourself, you can’t find it, you use the LLM, and then you don’t get frustrated because it helps you move on to something that you enjoy. Looking at these statistics, what is not factored in is that it’s averaging over all different scenarios. We don’t know if it’s for the core task or for something mundane that people would not have enjoyed otherwise. In a sense, AI is really great for doing mundane things that take a lot of work.

Sebastian Raschka (01:33:09) For example, my wife has a podcast for book club discussions, and she was transferring the show notes from Spotify to YouTube, and the links somehow broke. She had some episodes with 100 links or something, and it would have been really painful to go in there and fix each link manually. So I suggested, “Hey, let’s try ChatGPT.” We copied the text into ChatGPT, and it fixed them. Instead of two hours going from link to link, it made that work seamless. I think everyone has a use case where AI is useful for something like that—something that would be really boring and mundane.

Lex Fridman (01:33:51) For me personally, since we’re talking about coding, a lot of the enjoyment comes from the cursor side—Claude Code side—where I have a pair programmer. It’s less lonely. You made debugging sound like this great joy. No, I would say debugging is like a drink of water after you’ve been going through a desert for— —for days. You skip the whole desert part where you’re suffering. Sometimes it’s nice to have a friend who can’t really find the bug, but can give you some intuition about the code, and together you go through the desert and find that drink of water. For me, maybe it speaks to the loneliness of the programming experience. That is a source of joy.

Sebastian Raschka (01:34:48) It’s maybe also related to delayed gratification. I’m a person who even as a kid liked the idea of Christmas presents better than actually getting them. I would look forward to the day, but then it’s over and I’m disappointed. Maybe it’s like food—it tastes better when you’re really hungry. With debugging, it’s not always great; it’s often frustrating, but if you can solve it, then it’s great. But there’s also a Goldilocks zone where if it’s too hard, then you’re wasting your time. I think another challenge, though, is: how will people learn?

Sebastian Raschka (01:35:33) The chart we looked at showed that more senior developers are shipping AI-generated code than the junior ones. I think it’s interesting because intuitively you would think it’s the junior developers because they don’t know how to do the thing yet. It could mean the AI is not good enough yet to solve those tasks, but it could also mean experts are more effective at using it—they know how to review the code and they trust it more. One issue in society in the future will be: how do you become an expert if you never try to do the thing yourself?

Sebastian Raschka (01:36:12) I learned by trying things myself. With math textbooks, if you look at the solutions, you learn something, but you learn better if you try first and then appreciate the solution because you know how to put it into your mental framework. If LLMs are here all the time, would you actually go through the length of struggling? Would you be willing to struggle? Struggle is not nice, but if you use the LLM to do everything, at some point you will never really take the next step and you won’t get that unlock that you get as an expert using an LLM.

Sebastian Raschka (01:36:53) So, I think there’s a Goldilocks sweet spot where maybe the trick is you make dedicated offline time where you study two hours a day, and the rest of the day you use LLMs. I think it’s important for people to still invest in themselves, in my opinion, and not just LLM everything.

Post-training explained: Exciting new research directions in LLMs

Lex Fridman (01:37:10) Yeah, there is a sense that we, together as a civilization, each individually have to find that Goldilocks zone. And in the programming context as developers. Now, we’ve had this fascinating conversation that started with pre-training and mid-training. Let’s get to post-training. There’s a lot of fun stuff in post-training. So, what are some of the interesting ideas in post-training?

Nathan Lambert (01:37:31) The biggest one from 2025 is learning this reinforcement learning with verifiable rewards, RLVR. You can scale up the training there, which means doing a lot of this kind of iterative generate-grade loop, and that lets the models learn both interesting behaviors on the tool use and software side. This could be searching, running commands on their own and seeing the outputs, and then also that training enables this inference-time scaling very nicely. It just turned out that this paradigm was very nicely linked, where this kind of RL training enables inference-time scaling. But inference-time scaling could have been found in different ways. So, it was kind of this perfect storm where the models change a lot, and the way that they’re trained is a major factor in doing so.

Nathan Lambert (01:38:15) And this has changed how people approach post-training dramatically.

Lex Fridman (01:38:20) Can you describe RLVR, popularized by DeepSeek R1? Can you describe how it works?

Nathan Lambert (01:38:25) Yeah. Fun fact, I was on the team that came up with the term RLVR, which is from our Tulu 3 work before DeepSeek. We don’t take a lot of credit for being the people to popularize the scaling RL, but as much fun as academics get, as an aside, is the ability to name and influence—

Nathan Lambert (01:38:43) —the discourse, because the closed labs can only say so much. One of the things you can do as an academic is, while you might not have the compute to train the model, you can frame things in a way that ends up being… I describe it as like a community can come together around this RLVR term, which is very fun. And then DeepSeek are the people that did the training breakthrough, which is, they scaled the reinforcement learning. They have the model generate answers and then grade the completion if it was right, and then that accuracy is your reward for reinforcement learning. So reinforcement learning is classically an agent that acts in an environment, and the environment gives it a state and a reward back, and you try to maximize this reward.

Nathan Lambert (01:39:26) In the case of language models, the reward is normally accuracy on a set of verifiable tasks, whether it’s math problems or coding tasks. And it starts to get blurry with things like factual domains. That is also, in some ways, verifiable or constraints on your instruction, like ‘respond only with words that start with A.’ All of these things are verifiable in some way. The core idea is you find a lot more of these problems that are verifiable and you let the model try it many times while taking these RL gradient updates. The infrastructure evolved from reinforcement learning from human feedback, RLHF, where in that era, the score they were trying to optimize was a learned reward model of aggregate human preferences.

Nathan Lambert (01:40:13) So you kind of changed the problem domains and that let the optimization go on to much bigger scales, which kind of kickstarted a major change in what the models can do and how people use them.

Lex Fridman (01:40:24) What kind of domains is RLVR amenable to?

Nathan Lambert (01:40:28) Math and code are the famous ones, and then there’s a lot of work kind of on what is called the rubrics, which is related to a word people might have heard, LLM-as-a-judge. For each problem, I’ll have a set of problems in my training dataset. I will then have another language model and ask it, “What would a good answer to this problem look like?” And then you could try the problem a bunch of times over and over again and assign a score based on this rubric. So that’s not necessarily verifiable like a math and code domain, but this rubrics idea and other scientific problems where it might be a little bit more vague is where a lot of the attention is. They’re trying to push this set of methods into these more open-ended domains so the models can learn a lot more.

Sebastian Raschka (01:41:11) I think that’s called reinforcement learning with AI feedback, right?

Nathan Lambert (01:41:14) That’s the older term from it that was coined in Anthropic’s Constitutional AI paper. So a lot of these things come in cycles.

Sebastian Raschka (01:41:21) Also, just one step back for the RLVR. I think the interesting, beautiful thing here is that you ask the LLM a math question, you know the correct answer, and you let the LLM figure it out, but how it does it is… I mean, you don’t really constrain it much. There are some constraints you can add, like ‘use the same language’ or ‘don’t switch between Spanish and English.’ But let’s say you’re pretty much hands-off.

Sebastian Raschka (01:41:44) You only give the question and the answer, and then the LLM has the task to arrive at the right answer. But the beautiful thing here is what happens in practice: the LLM will do a step-by-step description, like how a student or a mathematician would derive the solution. It will use those steps and that helps the model to improve its own accuracy. And then, like you said, the inference scaling. Inference scaling loosely means spending more compute while using the LLM during inference, and here the inference scaling is that the model would use more tokens. In the DeepSeek R1 paper, they showed the longer they train the model, the longer the responses are.

Sebastian Raschka (01:42:28) They grow over time. They use more tokens, so it becomes more expensive for simple tasks, but these explanations help the model with accuracy. There are also a lot of papers showing what the model explains does not necessarily have to be correct, or maybe it’s even unrelated to the answer, but for some reason, it still helps the model—the fact that it is explaining. And I think it’s also—again, I don’t want to anthropomorphize these LLMs—but it’s kind of like how we humans operate, right? If there’s a complex math problem in a math class, you usually have a note paper and you do it step by step. You cross things out.

Sebastian Raschka (01:43:03) And the model also self-corrects, and that was, I think, the aha moment in the DeepSeek R1 paper. They called it the ‘aha moment’ because the model itself recognized it made a mistake and then said, “Ah, I did something wrong, let me try again.” I think that’s just so cool that this falls out of just giving it the correct answer and having it figure out how to do it—that it kind of does, in a sense, what a human would do. Although LLMs don’t think like humans, it’s a kind of interesting coincidence. And the nice side effect is it’s great for us humans to see these steps. It builds trust, and we can learn or double-check things.

Nathan Lambert (01:43:40) There’s a lot in here. I think- There’s been a lot of debate this year on if the language models—I think these aha moments are kind of fake because in pre-training, you essentially have seen the whole internet. So you have definitely seen people explaining their work, even verbally, like a transcript of a math lecture: “You try this, oh, I messed this up.” And what reinforcement learning—this RLVR—is very good at doing, is amplifying— —these behaviors, because they’re very useful in enabling the model to think longer and to check its work. I agree that it is very beautiful that this training kind of… the model learns to amplify this in a way that is just so useful at the final answers being better.

Sebastian Raschka (01:44:16) I can give you also a hands-on example. I was training the Qwen 3 base model with RLVR on MATH-500. The base model had an accuracy of about 15%. Just 50 steps, like in a few minutes with RLVR, the model went from 15% to 50% accuracy. And you can’t tell me it’s learning anything fundamentally about math in—

Nathan Lambert (01:44:38) The Qwen example is weird because there’s been two papers this year, one of which I was on, that talks about data contamination in Qwen— —and specifically that they train on a lot of this special mid-training phase that we— —can chime in on for a minute because it’s weird— —because they train on problems that are almost identical to MATH.

Sebastian Raschka (01:44:53) Exactly. And so you can see that basically the RL is not teaching the model any new knowledge about math. You can’t do that in 50 steps. So the knowledge is already there in the pre-training; you’re just unlocking it.

Nathan Lambert (01:45:03) I still disagree with the premise because there’s a lot of weird complexities that you can’t prove. One of the things that points to weirdness is that if you take the Qwen 3 so-called base model—you could Google “math dataset Hugging Face” and take a problem—if you put it into Qwen 3 base… all these math problems have words, so it would be like, “Alice has five apples and gives three to whoever,” and there are these word problems. With these Qwen-based models, why people are suspicious of them is if you change the numbers but keep the words— —Qwen will produce, without tools, a very high accuracy decimal representation—

Nathan Lambert (01:45:43) —of the answer, which means at some point it was shown problems that were almost identical to the test set, and it was using tools to get a very high precision answer. But a language model without tools will never actually have this. So it’s been this big debate in the research community: how much of these reinforcement learning papers that are training on Qwen and measuring specifically on this math benchmark—where there’s been multiple papers talking about contamination—how much can you believe them? I think this is what caused the reputation of RLVR being about formatting, because you can get these gains so quickly and therefore it must already be in the model. But there’s a lot of complexity here. It’s not really like controlled experimentation— —so we don’t really know.

Sebastian Raschka (01:46:26) But if it weren’t true, I would say distillation wouldn’t work, right? Distillation can work to some extent, but the biggest problem—and I’m researching this contamination—is we don’t know what’s in the data. Unless you have a new dataset, it is really impossible. Even something simpler like MMLU, which is a multiple-choice benchmark—if you just change the format slightly, like using a dot instead of a parenthesis, the model accuracy will vastly differ.

Nathan Lambert (01:47:04) I think that that could be like a model issue rather than a general issue.

Sebastian Raschka (01:47:09) It’s not even malicious by the developers of the LLM, like, “Hey, we want to cheat at that benchmark.” It’s just it has seen something at some point. I think the only fair way to evaluate an LLM is to have a new benchmark that is after the cutoff date when the model was deployed.

Lex Fridman (01:47:22) Can we lay out what would be the recipe of all the things that go into post-training? And you mentioned RLVR was a really exciting, effective thing. Maybe we should elaborate. RLHF still has a really important component to play. What kind of other ideas are there on post-training?

Nathan Lambert (01:47:40) I think you can take this in order. You could view it as what made o1, which is this first reasoning model, possible. You’re going to have similar interventions where you start with mid-training. The thing that is rumored to enable o1 and similar models is really careful data curation where you’re providing a broad set of what is called reasoning traces. This is just the model generating words in a forward process that reflects breaking down a problem into intermediate steps and trying to solve them. So at mid-training, you need to have data similar to this so that when you move into post-training, primarily with these verifiable rewards, it can learn.

Nathan Lambert (01:48:27) And then what is happening today is you’re figuring out which problems to give the model, how long you can train it for, and how much inference you can enable the model to use when solving these verifiable problems. As models get better, certain problems are no longer useful; the model will solve them 100% of the time, and therefore there’s very little signal. If we look at the GRPO equation, this one is famous for this because essentially the reward given to the agent is based on how good a given action—a completion—is relative to the other answers to that same problem. So if all the problems get the same answer, there’s no signal in these types of algorithms.

Nathan Lambert (01:49:09) So what they’re doing is finding harder problems, which is why you hear about things like scientific domains, which are so hard to get anything right in. If you have a lab or something, it just generates so many tokens, or much harder software problems. The frontier models are all pushing into these harder domains where they can train on more problems and the model will learn more skills at once. The RLHF link to this is that RLHF has been, and still is, the finishing touch on the models, where it makes them more useful by improving the organization, style, or tone.

Nathan Lambert (01:49:42) There are different things that resonate with different audiences. Some people like a really quirky model, and RLHF could be good at enabling that personality, and some people hate the markdown bulleted list thing that the models do, but it’s actually really good for quickly parsing information. This human feedback stage is really great for putting this into the model at the end of the day. It’s what made ChatGPT so magical for people. And that use has actually remained fairly stable. This formatting can also help the models get better at math problems, for example.

Nathan Lambert (01:50:17) The border between style and formatting and the method that you use to answer a problem are actually very closely linked when you’re training these models. RLHF can still make a model better at math, but these verifiable domains are a much more direct process for doing this because it makes more sense with the problem formulation. To summarize: mid-training gives the model the skills it needs to learn; RL with verifiable rewards lets the model try many times, putting a lot of compute into trial-and-error learning across hard problems; and then RLHF finishes the model, making it easy to use and rounding it out.

Lex Fridman (01:51:02) Can you comment on the amount of compute required for RL VR?

Nathan Lambert (01:51:06) It’s only gone up and up. I think Grok 4 was famous for saying they use a similar amount of compute for pre-training and post-training. Back to the scaling discussion, they involve very different hardware for scaling. Pre-training is very compute-bound, which is like the FLOPS discussion: how many matrix multiplications can you get through in one time. Because in RL you’re generating these answers and trying the model in real-world environments, it ends up being much more memory-bound. You’re generating long sequences, and the attention mechanisms have a behavior where you get a quadratic increase in memory as you get to longer sequences. So the compute becomes very different.

Nathan Lambert (01:51:44) In pre-training, we would talk about a model—if we go back to the Biden administration executive order—it’s like 10 to the 25th FLOPS to train a model. If you’re using FLOPS in post-training, it’s a lot weirder because the reality is just how many hours you are allocating how many GPUs for. In terms of time, the RL compute is getting much closer because you just can’t put it all into one system. Pre-training is so computationally dense where all the GPUs are talking to each other and it’s extremely efficient, whereas RL has all these moving parts and it can take a long time to generate a sequence of a hundred thousand tokens.

Nathan Lambert (01:52:17) If you think about Gemini 3 Pro taking an hour, what if your training run has to sample for an hour? You have to make sure that’s handled efficiently. So in GPU hours or wall-clock hours, the RL runs are probably approaching the same number of days as pre-training, but they probably aren’t using as many GPUs at the same time. There are rules of thumb in labs where you don’t want your pre-training runs to last more than a month because they fail catastrophically. If you are planning a huge cluster to be held for two months and then it fails on day 50, the opportunity costs are just so big.

Nathan Lambert (01:52:54) People don’t want to put all their eggs in one basket. GPT-4 was like the ultimate YOLO run, and nobody ever wanted to do it before where it took three months to train and everybody was shocked that it worked. I think people are a little bit more cautious and incremental now.

Sebastian Raschka (01:53:07) So RLVR is more unlimited in how much you can train or still get benefit, whereas RLHF, because it’s preference tuning, reaches a certain point where it doesn’t really make sense to spend more budget on it. To take a step back with preference tuning: there are multiple people that can give multiple explanations for the same thing and they can both be correct, but at some point, you learn a certain style and it doesn’t make sense to iterate on it. My favorite example is if relatives ask me what laptop they should buy. I give them an explanation or ask about their use case, and they might prioritize battery life and storage.

Sebastian Raschka (01:53:46) Other people, like us, would prioritize RAM and compute. Both answers are correct, but different people require different answers. With preference tuning, you are trying to average somehow; you are asking the data labelers to give you the preferred answer and then you train on that. But at some point, you learn that average preferred answer, and there’s no reason to keep training longer on it because it’s just a style. With RLVR, you let the model solve more and more complex, difficult problems. So I think it makes more sense to allocate more budget long-term to RLVR.

Sebastian Raschka (01:54:27) Right now, we are in an RLVR 1.0 phase where it’s still that simple thing where we have a question and answer, but we don’t do anything with the stuff in between. There were multiple research papers, by Google for example, on process reward models that also give scores for the explanation—how correct is the explanation? I think that will be the next thing, let’s say RLVR 2.0 for this year, focusing on the steps between question and answer and how to leverage that information to improve the explanation and accuracy. That’s one angle. And there was a DeepSeek-V3.2 paper where they also had interesting inference scaling.

Sebastian Raschka (01:55:11) Well, first they had developed models that grade themselves as a separate model. I think that will be one aspect. And the other, like Nathan mentioned, will be RLVR branching into other domains.

Nathan Lambert (01:55:23) The place where people are excited is value functions— —which is pretty similar. Process reward models assign how good something is to each intermediate step in a reasoning process, whereas value functions apply value to every token the language model generates. Both of these have been largely unproven in the language modeling and reasoning model era. People are more optimistic about value functions for whatever reason now. I think process reward models were tried a lot more in the pre-o1 era, and a lot of people had headaches with them. Value models have a very deep history in reinforcement learning.

Nathan Lambert (01:56:06) They’re one of the first things that were core to deep reinforcement learning existing—training value models. So right now the literature shows people are excited about trying value models, but there’s very little proof in it. And there are negative examples in trying to scale up process reward models.

Nathan Lambert (01:56:22) These things don’t always hold in the future. To summarize the scaling: you don’t want to do too much RLHF because of how the signal scales. People have worked on RLHF for years, especially after ChatGPT, but the first release of a reasoning model trained with RLVR, OpenAI’s o1, had a scaling plot where if you increase the training compute logarithmically, you get a linear increase in evaluations. This has been reproduced multiple times; I think DeepSeek had a plot like this. But there’s no scaling law for RLHF where if you log-increase the compute, you get linear performance.

Nathan Lambert (01:57:02) In fact, the seminal scaling paper for RLHF is about scaling loss for reward model over-optimization. That’s a big line to draw with RLVR and the methods we have now; they will follow this scaling paradigm where you can let the best runs go for an extra 10x and you get performance, but you can’t do this with RLHF. That is going to be field-defining. To do the best RLHF you might not need the extra 10 or 100x compute, but to do the best RLVR you do. There’s a seminal paper from a Meta internship called “The Art of Scaling Reinforcement Learning with Language Models.”

Nathan Lambert (01:57:47) Their framework is called ScaleRL. Their incremental experiment was like 10,000 V100 hours, which is thousands or tens of thousands of dollars per experiment, and they do a lot of them. This cost is not accessible to the average academic, which creates a hard equilibrium when trying to figure out how to learn from each community.

Advice for beginners on how to get into AI development & research

Lex Fridman (01:58:11) I was wondering if we could take a bit of a tangent and talk about education and learning. If you’re somebody listening to this who’s a smart person interested in programming and interested in AI, I presume building something from scratch is a good beginning. Can you just take me through what you would recommend people do?

Sebastian Raschka (01:58:32) I would personally start, like you said, by implementing a simple model from scratch that you can run on your computer. The goal of building a model from scratch is not to have something you use every day for your personal projects. It’s not going to be your personal assistant replacing an existing open-weight model or ChatGPT. It’s to see exactly what goes into the LLM, what exactly comes out of the LLM, and how pre-training works on your own computer. And then you learn about pre-training, supervised fine-tuning, and the attention mechanism.

Sebastian Raschka (01:59:03) You get a solid understanding of how things work, but at some point you will reach a limit because smaller models can only do so much. The problem with learning about LLMs at scale is that it’s exponentially more complex to make a larger model because it’s not just that the model becomes larger. You have to think about sharding your parameters across multiple GPUs. Even for the KV cache, there are multiple ways you can implement it. One is just to understand how it works, like a cache you grow step-by-step by concatenating lists, but then that wouldn’t be optimal on GPUs. You would pre-allocate a tensor and then fill it in. But that adds another 20 or 30 lines of code.

Sebastian Raschka (01:59:45) And for each thing, you add so much code. I think the trick with the book is basically to understand how the LLM works. It’s not going to be your production-level LLM, but once you have that, you can understand the production-level LLM.

Lex Fridman (01:59:56) So you’re trying to always build an LLM that’s going to fit on one GPU?

Sebastian Raschka (02:00:00) Yes. Most of the examples I have fit on one GPU. I have some bonus materials on some MoE models; one or two of them may require multiple GPUs, but the goal is to have it on one GPU. And the beautiful thing is also you can self-verify. It’s almost like RLVR. When you code these from scratch, you can take an existing model from the Hugging Face Transformers library. The Hugging Face Transformers library is great, but if you want to learn about LLMs, I think that’s not the best place to start because the code is so complex. It has to fit so many use cases and some people use it in production. It has to be really sophisticated, so it’s intertwined and hard; it’s not linear to read.

Nathan Lambert (02:00:39) It started as a fine-tuning library, and then it grew to be the standard representation of every model architecture and the way it is loaded. Hugging Face is the default place to get a model, and Transformers is the software that enables it— —so people can easily load a model— —and do something basic with it.

Sebastian Raschka (02:00:56) And all frontier labs that have open-weight models have a Hugging Face Transformers version of it, from DeepSeek to gpt-oss. That’s the canonical way that you can load them. But again, even the Transformers library is not used in production for inference. People use SGLang or vLLM, and it adds another layer of complexity.

Lex Fridman (02:01:15) We should say that the Transformers library has something like 400 models.

Sebastian Raschka (02:01:19) So it’s the one library that tries to implement a lot of LLMs, and so you have a huge codebase. It’s massive. It’s—I don’t know, maybe millions— —hundreds of thousands of lines of code. Understanding the part that you want to understand is like finding the needle in the haystack. But what’s beautiful about it is you have a working implementation, so you can work backwards from it. What I would recommend doing is if I want to understand, for example, how OLMo 3 is implemented, I would look at the weights in the model hub and the config file. You can see, “Oh, they used so many layers. They use group query attention.” Then you see all the components in a human-readable 100-line config file. And then you start with your GPT-2 model and add these things.

Sebastian Raschka (02:02:06) The cool thing here is you can then load the pre-trained weights and see if they work in your model. You want to match the same output that you get with a Transformers model, and then you can use that basically as a verifiable reward to make your architecture correct. Sometimes it takes me a day. With OLMo 3, the challenge was RoPE for the position embeddings; they had a YaRN extension and there was some custom scaling there. I couldn’t quite match it at first, but in this struggle you kind of understand things. At the end, you know you have it correct because you can unit test it against the reference implementation. I think that’s one of the best ways to learn. Basically, you reverse-engineer something.

Nathan Lambert (02:02:51) I think that is something everyone interested in getting into AI today should do, and that’s why I liked your book. I came to language models from the RL and robotics field, so I never had taken the time to just learn all the fundamentals. The Transformer architecture is as fundamental today as deep learning was in the past, and people need to learn it. I think where a lot of people get overwhelmed is how to apply this to have an impact or find a career path.

Nathan Lambert (02:03:23) AI language models make this fundamental stuff so accessible, and people with motivation will learn it. Then it’s like, “How do I get the cycles on goal to contribute to research?” I’m actually fairly optimistic because the field moves so fast that a lot of times the best people don’t fully solve a problem because there’s a bigger, lower-hanging fruit to solve, so they move on. In my RLHF book, I try to take post-training techniques and describe how they influence the model. It’s remarkable how many things people just stop studying.

Nathan Lambert (02:04:06) I think people trying to go narrow after doing the fundamentals is good. Reading relevant papers and being engaged in the ecosystem—you actually… The proximity that random people have online to leading researchers is incredible. The anonymous accounts on X in ML are very popular, and no one knows who all these people are. It could just be random people who study this stuff deeply. Especially with AI tools to help you keep digging into things you don’t understand, it’s very useful. There are research areas that might only have three papers you need to read, and then one of the authors will probably email you back.

Nathan Lambert (02:04:45) But you have to put in a lot of effort into these emails to show you understand the field. It would take a newcomer weeks of work to truly grasp a very narrow area, but going narrow after the fundamentals is very useful. I became very interested in character training—how you make a model funny, sarcastic, or serious, and what you do to the data to achieve this. A student at Oxford reached out to me and said, “Hey, I’m interested in this,” and I advised him. Now that paper exists. There were maybe only two or three people in the world very interested in that specific topic.

Nathan Lambert (02:05:25) He’s a PhD student, which gives you an advantage, but for me, that was a topic where I was waiting for someone to say, “Hey, I have time to spend cycles on this.” I’m sure there are a lot more narrow things where you’re just like, “It doesn’t make sense that there was no answer to this.” There’s so much information coming in that people feel they can’t grab onto anything, but if you actually stick to one area, I think there are a lot of interesting things to learn.

Sebastian Raschka (02:05:48) Yeah, I think you can’t try to do it all because it would be very overwhelming and you would burn out. For example, I haven’t kept up with computer vision in a long time; I’ve just focused on LLMs. But coming back to your book, I think it’s a really great resource and a good bang for the buck if you want to learn about RLHF. I wouldn’t just go out there and read raw RLHF papers because you would be spending two years—

Nathan Lambert (02:06:10) —and some of them contradict each other. I’ve just edited the book, and there’s no chapter where I had to say, “X papers say one thing and Y papers say another, and we’ll see what comes out to be true.”

Lex Fridman (02:06:21) What are some of the ideas we might have missed in the bigger picture of post-training? To go through the table of contents: first, you did the problem setup, training overview, what are preferences, preference data and the optimization tools, reward modeling, regularization, instruction tuning, rejection sampling, reinforcement learning. Then constitutional AI and AI feedback, reasoning and inference-time scaling, tool use and function calling, synthetic data and distillation, evaluation, and then the open questions section: over-optimization, style and information, product UX, character and post-training. What are some ideas worth mentioning that connect both the educational component and the research component? You mentioned the character training, which is pretty interesting.

Nathan Lambert (02:07:08) Character training is interesting because there’s so little out there, but we talked about how people engage with these models. We feel good using them because they’re positive, but that can go too far; it can be too positive. It’s essentially how you change your data or decision-making to make it exactly what you want. OpenAI has this thing called a “model spec,” which is essentially their internal guideline for what they want the model to do, and they publish this to developers. So you can know what is a failure of OpenAI’s training—where they have the intention but haven’t met it yet—versus what is something they actually wanted to do that you just don’t like.

Nathan Lambert (02:07:46) That transparency is very nice, but all the methods for curating these documents and how easy it is to follow them is not very well known. I think the way the book is designed is that the reinforcement learning chapter is obviously what people want because everybody hears about it with RLVR, and it’s the same algorithms and the same math, but you can use it in very different documents. I think the core of RLHF is how messy preferences are. It’s essentially a rehash of a paper I wrote years ago, but this is the chapter that tells you why RLHF is never fully solvable, because the way that RL is set up assumes that preferences can be quantified and reduced to single values.

Nathan Lambert (02:08:33) I think it relates in the economics literature to the Von Neumann-Morgenstern utility theorem. That is the chapter where all of that philosophical, economic, and psychological context tells you what gets compressed when doing RLHF. Later in the book, you use this RL map to make the number go up. I think that’s why it’ll be very rewarding for people to do research on, because quantifying preferences is something humans have designed the problem around to make them studyable. But there are fundamental debates; for example, in a language model response, you have different things you care about, whether it’s accuracy or style.

Nathan Lambert (02:09:13) When you’re collecting the data, they all get compressed into, “I like this more than another.” There’s a lot of research in other areas of the world that goes into how you should actually do this. I think social choice theory is the subfield of economics around how you should aggregate preferences. I went to a workshop that published a white paper on how you can think about using social choice theory for RLHF. I want people who get excited about the math to stumble into this broader context. I also keep a list of all the tech reports of reasoning models that I like. In Chapter 14, where there’s a short summary of RLVR, there’s a gigantic table where I list every single reasoning model that I like. I think in education, a lot of it needs to be, at this point, what I like—

Nathan Lambert (02:10:08) —because language models are so good at the math. For example, the famous paper on Direct Preference Optimization, which is a much simpler way of solving the problem than RL—the derivations in the appendix skip steps of math. I tried for this book to redo the derivations and I was like, “What the heck is this log trick that they use?” But when doing it with language models, they just say, “This is the log trick.” I don’t know if I like that the math is so commoditized. I think some of the struggle in reading this appendix— —and following the math is good for learning.

Lex Fridman (02:10:43) Yeah, we’re returning to this often on the topic of education. You both have brought up the word “struggle” quite a bit. There is value in that. If you’re not struggling as part of this process, you’re not fully following the proper process for learning, I suppose.

Nathan Lambert (02:11:02) Some of the providers are starting to work on models for education designed to not give… actually, I haven’t used them, but I would guess they’re designed to not give all the information at once— —and make people work for it. I think you could train models to do this and it would be a wonderful contribution. In the book, you had to reevaluate every decision— Which is such a great example. I think there’s a chance we work on it at AI2, which I think would be so fun.

Sebastian Raschka (02:11:26) It makes sense. I did something like that the other day for video games. In my spare time, I like video games with puzzles, like Zelda and Metroid. There’s this new game where I got really stuck. I didn’t want to struggle for two days, so I used an LLM. But I told it, “Please don’t add any spoilers. I’m at this point; what do I have to do next?” You can do the same thing for math where you say, “I’m at this point and I’m getting stuck. Don’t give me the full solution, but what is something I could try?” You kind of carefully probe it.

Sebastian Raschka (02:12:02) But the problem is that it requires discipline. A lot of people enjoy math, but there are also a lot of people who need to do it for their homework, and then it’s just a shortcut. We can develop an educational LLM, but the other LLMs are still there, and there’s still a temptation to use them.

Lex Fridman (02:12:20) I think a lot of people, especially in college, understand the stuff they’re passionate about— —they’re self-aware about it, and they understand it shouldn’t be easy. Like, I think we just have to develop a good taste— —talk about research taste, like school taste about stuff that you should be struggling on— —and stuff you shouldn’t be struggling on. Which is tricky to know, because sometimes you don’t have good long-term vision about what would be actually useful to you in your career. But you have to develop that taste, yeah.

Nathan Lambert (02:12:51) I was talking to maybe my fiance or friends about this, and it’s like there’s this brief 10-year window where all of the homework and all the exams could be digital. But before that, everybody had to do all the exams in bluebooks because there was no other way. And now after AI, everybody’s going to need to be in bluebooks and oral exams because everybody could cheat so easily. It’s like this brief generation that had a different education system where everything could be digital, but you still couldn’t cheat. And now it’s just going back. It’s just very funny.

Lex Fridman (02:13:20) You mention character training. Just zooming out on a more general topic: for that topic, how much compute was required? And in general, to contribute as a researcher, are there places where not too much compute is required where you can actually contribute as an individual researcher?

Nathan Lambert (02:13:39) For the character training thing, I think this research is built on fine-tuning about seven billion parameter models with LoRA, which is like a… Essentially, you’re only fine-tuning a small subset of the weights of the model. I don’t know exactly how many GPU hours that would take.

Lex Fridman (02:13:55) But it’s doable.

Nathan Lambert (02:13:55) Not doable for every academic. So the situation for some academics is so dire that the only work you can do is doing inference where you have closed models or open models, and you get completions from them and you can look at them and understand the models. And that’s very well-suited to evaluation, where you want to be the best at creating representative problems that the models fail on or show certain abilities, which I think that you can break through with. I think that the top-end goal for a researcher working on evaluation, if you want to have career momentum, is that the frontier labs pick up your evaluation. So you don’t need to have every project do this.

Nathan Lambert (02:14:33) But if you go from a small university with no compute and you figure out something that Claude struggles with and then the next Claude model has it in the blog post, there’s your career rocket ship. I think that that’s hard but if you want to scope the maximum possible impact with minimum compute, it’s something like that—which is just get very narrow, and it takes learning of where the models are going. So you need to build a tool that tests where Claude 4.5 will fail. If I’m going to start a research project, I need to think where the models in eight months are going to be struggling.

Lex Fridman (02:15:05) But what about developing totally novel ideas?

Nathan Lambert (02:15:08) This is a trade-off. I think that if you’re doing a PhD, you could also be like, “It’s too risky to work in language models. I’m going way longer term,” which is like— —what is the thing that’s going to define language model development in 10 years? Which I think that I end up being a person that’s pretty practical. I mean, I went to my PhD where it was like, “I got into Berkeley. Worst case, I get a master’s, and then I go work in tech.” And so I’m very practical about it. The life afforded to people to work at these AI companies, the amount of… OpenAI’s average compensation is over a million dollars in stock a year per employee. For any normal person in the US, getting into this AI lab is transformative for your life. So I’m pretty practical about it—

Nathan Lambert (02:15:50) —there’s still a lot of upward mobility working in language models if you’re focused. And looking at the outcomes, look at these jobs. But from a research perspective, for transformative impact and these academic awards, being the next Yann LeCun comes from not caring about language model development very much.

Lex Fridman (02:16:07) It’s a big financial sacrifice in that case.

Nathan Lambert (02:16:09) So I get to work with some awesome students, and they’re like, “Should I go work at an AI lab?” And I’m like, “You’re getting a PhD at a top school. Are you going to leave to go to a lab?” If you go work at a top lab, I don’t blame you. Don’t go work at some random startup that might go to zero. But if you’re going to OpenAI, I think it could be worth leaving a PhD for.

Lex Fridman (02:16:30) Let’s more rigorously think through this. Where would you give a recommendation for people to do a research contribution? The options are academia—get a PhD, spend five years publishing, though compute resources are constrained. There are research labs that are more focused on open-weight models, so working there. Or closed frontier labs. OpenAI, Anthropic, xAI, and so on.

Nathan Lambert (02:17:04) The two gradients are: the more closed, the more money you tend to get, but also you get less credit. In terms of building a portfolio of things that you’ve done, it’s very clear what you have done as an academic. Versus if you are going to trade this fairly reasonable progression for being a cog in the machine, which could also be very fun. I think they’re very different career paths. But the opportunity cost for being a researcher is very high because PhD students are paid essentially nothing. I think it ends up rewarding people that have a fairly stable safety net and they realize they can operate in the long term, doing very interesting work and getting a very interesting job.

Nathan Lambert (02:17:50) So it is a privileged position to be like, “I’m going to see out my PhD and figure it out after because I want to do this.” And at the same time, the academic ecosystem is getting bombarded by funding getting cut and stuff. There are just so many different trade-offs where I understand plenty of people that are like, “I don’t enjoy it. I can’t deal with this funding search. My grant got cut for no reason by the government,” or, “I don’t know what’s going to happen.” So I think there’s a lot of uncertainty and trade-offs that, in my opinion, favor just taking the well-paying job with meaningful impact. It’s not like you’re getting paid to sit around at OpenAI. You’re building the cutting edge of things that are— changing millions of people’s relationship to tech.

Lex Fridman (02:18:34) But publication-wise, they’re being more secretive, increasingly so. So you’re publishing less and less. You are having a positive impact at scale, but you’re a cog in the machine.

Sebastian Raschka (02:18:47) I think, honestly, it hasn’t changed that much. I have been in academia; I’m not in academia anymore. At the same time, I wouldn’t want to miss my time in academia. But what I wanted to say before I get to that part, I think it hasn’t changed that much. I was using AI or machine learning methods for applications in computational biology with collaborators, and a lot of people went from academia directly to Google. I think it’s the same thing. Back then, professors were sad that their students went into industry because they couldn’t carry on their legacy in that sense. I think it’s the same thing. It hasn’t changed that much. The only thing that has changed is the scale.

Sebastian Raschka (02:19:32) But, you know, cool stuff was always developed in industry that was closed. You couldn’t talk about it. I think the difference now is your preference. Do you like to talk about your work and publish, or are you more in a closed lab? That’s one difference—the compensation, of course. But it’s always been like that. So it really depends on where you feel comfortable. And also, nothing is forever. The only thing right now is there’s a third option, which is starting a startup. There are a lot of people doing startups. Very risky move, but it’s a high-risk, high-reward type of situation, whereas joining an industry lab is pretty safe and offers upward mobility.

Sebastian Raschka (02:20:16) Honestly, I think once you have been at an industry lab, it will be easier to find future jobs. But then again, it’s like, how much do you enjoy the team and working on proprietary things versus how do you like the publishing work? I mean, publishing is stressful. Acceptance rates at conferences can be arbitrary and very frustrating, but it’s also high reward. If you have a paper published, you feel good because your name is on there. You have a high accomplishment.

Nathan Lambert (02:20:48) I feel like my friends who are professors seem on average happier than my friends who work at— a frontier lab, to be totally honest. Because there’s just a grounding and— the frontier labs definitely do this 9/9/6— which essentially is shorthand for work all the time.

Work culture in AI (72+ hour weeks)

Lex Fridman (02:21:03) Can you describe 9/9/6 as a culture? I believe you could say it was invented in China and adopted in Silicon Valley. What’s 9/9/6? It’s 9:00 AM to 9:00 PM—

Sebastian Raschka (02:21:14) six days a week.

Lex Fridman (02:21:15) Six days a week. What is that, 72 hours? Is this basically the standard in AI companies in Silicon Valley? More and more this kind of grind mindset.

Sebastian Raschka (02:21:26) Yeah, I mean, maybe not exactly like that, but I think there is a trend towards it. And it’s interesting—I think it almost flipped because when I was in academia, I felt like that because as a professor, you had to write grants, you had to teach, and you had to do your research. It’s like three jobs in one, and it is more than a full-time job if you want to be successful. And I feel like now, like Nathan just said, the professors in comparison to a lab have even less pressure or workload than at a frontier lab because—

Nathan Lambert (02:21:57) I think they work a lot. They’re just so fulfilled. By working with students— …and having a constant runway of mentorship and a mission that is very people-oriented. I think in an era when things are moving very fast and are very chaotic, it’s very rewarding to people.

Sebastian Raschka (02:22:11) Yeah, and I think at a startup, there’s this pressure. You have to make it. It is really important that people put in the time, but it is really hard because you have to deliver constantly. I’ve been at a startup. I had a good time, but I don’t know if I could do it forever. It’s an interesting pace and it’s exactly like we talked about in the beginning. These models are leapfrogging each other, and they are just constantly trying to take the next step compared to their competitors. It’s just ruthless right now.

Nathan Lambert (02:22:42) I think this leapfrogging nature and having multiple players is actually an underrated driver of language modeling progress where competition is so deeply ingrained. These companies have intentionally created very strong cultures. For example, Anthropic is known to be culturally deeply committed and organized. We hear so little from them, and everybody at Anthropic seems very aligned. Being in a culture that is super tight and having this competitive dynamic is a thing that’s going to make you work hard and create things that are better.

Nathan Lambert (02:23:20) But that comes at the cost of human capital. You can only do this for so long, and people are definitely burning out. I wrote a post on burnout as I’ve tread in and out of this myself, especially trying to be a manager while doing full-mode training. It’s a crazy job. In the book Apple in China, Patrick McGee talked about how hard the Apple engineers worked to set up the supply chains in China. He mentioned they had “saving marriage” programs, and he said in a podcast that people died from this level of working hard. It’s a perfect environment for creating progress based on human expense. The human expense is the 996 that we started this with, where people do really grind.

Sebastian Raschka (02:24:08) I also read this book. I think they had a code word for if someone had to go home to spend time with their family to save the marriage. Then the colleagues said, “Okay, this is red alert for this situation. We have to let that person go home this weekend.” But at the same time, I don’t think they were forced to work. They were so passionate about the product that you get into that mindset. I had that sometimes as an academic, and as an independent person. I overwork, and it’s unhealthy. I had back issues and neck issues because I did not take the breaks that I should have. But it’s not because anyone forced me; it’s because I wanted to work because it’s exciting stuff.

Nathan Lambert (02:24:46) That’s what OpenAI and Anthropic are like. They want to do this work.

Silicon Valley bubble

Lex Fridman (02:24:49) Yeah, but there’s also a feeling of fervor that’s building, especially in Silicon Valley, aligned with the scaling laws idea. There’s this hype where the world will be transformed in a scale of weeks and you want to be at the center of it. I have the great fortune of having conversations with a wide variety of human beings, and I get to see all these bubbles and echo chambers across the world. It’s fascinating to see how we humans form them. I think it’s fair to say that Silicon Valley is a kind of echo chamber, a kind of silo and bubble. I think bubbles are actually really useful and effective. It’s not necessarily a negative thing because you can be ultra-productive.

Lex Fridman (02:25:34) It could be the Steve Jobs reality distortion field, because you just convince each other the breakthroughs are imminent, and by convincing each other of that, you make the breakthroughs imminent.

Nathan Lambert (02:25:48) Bryne Hobart wrote a book classifying bubbles. One of them is financial bubbles, which involve speculation and are bad, and the other is effectively for build-outs, because it pushes people to build. I do think AI is in this, but I worry about it transitioning to a financial bubble.

Lex Fridman (02:26:05) Yeah, but also in the space of ideas, that bubble creates a reality distortion field. That means you are deviating from reality, and if you go too far while also working 996, you might miss some fundamental aspects of the human experience. This is a common problem in Silicon Valley. It’s a very specific geographic area. You might not understand the Midwest perspective or the experience of all the other different humans in the United States and across the world. You speak a certain way to each other and convince each other of a certain thing, and that can get you into real trouble.

Lex Fridman (02:26:47) Whether AI is a big success and becomes a powerful technology or it’s not, in either trajectory you can get yourself into trouble. So you have to consider all of that. Here you are, a young person trying to decide what you want to do with your life.

Nathan Lambert (02:27:02) The thing that is… I don’t even really understand this, but the SF AI memes have gotten to the point where the “permanent underclass” was one of them. This was the idea that the last six months of 2025 was the only time to build durable value in an AI startup or model. Otherwise, all the value will be captured by existing companies and you will therefore be poor. That’s an example of the SF thing that goes so far. I still think for young people who are really passionate about having an impact in AI, being physically in SF is the most likely place where you’re going to do this. But it has trade-offs.

Lex Fridman (02:27:41) I think SF is an incredible place, but there is a bit of a bubble. And if you go into that bubble, which is extremely valuable, just get out also. Read history books, read literature, and visit other places in the world. Twitter and Substack are not the entire world.

Nathan Lambert (02:28:01) I think I would say, one of the people I worked with is moving to SF, and I need to get him a copy of Season of the Witch. It’s a history of SF from 1960 to 1985 that goes through the hippie revolution, the culture emerging in the city, the HIV/AIDS crisis, and other things. That is so recent, with so much turmoil and hurt, but also love in SF. No one knows about this. It’s a great book, Season of the Witch; I recommend it. A bunch of my SF friends who do get out recommended it to me. I lived there and I didn’t appreciate this context, and it’s just so recent.

Text diffusion models and other new research directions

Lex Fridman (02:28:46) Yeah. Okay, let’s… we talked a lot about many things, certainly about what was exciting last year. But this year, one of the things you guys mentioned that’s exciting is the scaling of text diffusion models and just a different exploration of text diffusion. Can you talk about what that is and what possibilities it holds? So, different kinds of approaches than the current LMs?

Sebastian Raschka (02:29:13) Yeah, so we talked a lot about the transformer architecture and the autoregressive transformer architecture specifically, like GPT. And it doesn’t mean no one else is working on anything else. People are always on the lookout for the next big thing, because I think it would be almost stupid not to. Sure, right now the transformer architecture is the thing and it works best, but it’s always a good idea to not put all your eggs into one basket. People are developing alternatives to the autoregressive transformer. One of them would be, for example, text diffusion models.

Sebastian Raschka (02:29:49) And listeners may know diffusion models from image generation, like Stable Diffusion popularized it. Back then, people used GANs, Generative Adversarial Networks. And then there was this diffusion process where you iteratively de-noise an image, and that resulted in really good quality images over time. Other companies build their own diffusion models. And now people are like, “Okay, can we try this also for text?” It doesn’t make intuitive sense yet because it feels like it’s not something continuous like a pixel that we can differentiate. It’s discrete text, so how do we implement that de-noising process?

Sebastian Raschka (02:30:25) But it’s kind of similar to the BERT models by Google. When you go back to the original transformer, there were the encoder and the decoder. The decoder is what we are using right now in GPT and so forth. The encoder is more like a parallel technique where you have multiple tokens that you fill in in parallel. GPT models do autoregressive completion one token at a time. In BERT models, you have a sentence that has gaps—you mask them out—and then one iteration is filling in those gaps.

Sebastian Raschka (02:31:02) And text diffusion is kind of like that, where you are starting with some random text, and then you are filling in the missing parts or refining them iteratively over multiple iterations. The cool thing here is that this can do multiple tokens at the same time, so it has the promise of being more efficient. Now, the trade-off is, of course, how good is the quality? It might be faster, but the more de-noising steps you do, the better the text becomes. People are trying to see if that is a valid alternative to the autoregressive model in terms of giving you the same quality for less compute.

Sebastian Raschka (02:31:46) Right now, there are papers that suggest if you want to get the same quality, you have to crank up the de-noising steps and then you end up spending the same compute you would spend on an autoregressive model. The other downside is that while it’s parallel, some tasks are not. For reasoning tasks or tool use where you have to ask a code interpreter to give you an intermediate result, it is kind of tricky with diffusion models. So there are some hybrids. But the main idea is how can we parallelize it. It’s an interesting avenue. I think right now there are mostly research models out there, like LaMDA and some other ones.

Sebastian Raschka (02:32:24) I saw some by startups, some deployed models, but there is no big diffusion model at scale yet on the level of Gemini or ChatGPT. But there was an announcement by Google where they said they are launching Gemini Diffusion, and they put it into context of their Nano 2 model. They said for the same quality on most benchmarks, we can generate things much faster. I don’t think the text diffusion model is going to replace autoregressive LLMs, but it will be something for quick, cheap, at-scale tasks. Maybe the free tier in the future will be something like that.

Nathan Lambert (02:33:04) I think there are a couple of examples where it’s actually started to be used. To paint an example of why this is so much better: when a model like GPT-5 takes time to respond, it’s generating one token at a time. This diffusion idea is essentially generating all of those tokens in the completion in one batch, which is why it could be way faster.

Nathan Lambert (02:33:27) The startups I’m hearing are code startups where you have a codebase and somebody is effectively vibe coding. They say, “Make this change,” and a code diff is essentially a huge reply from the model. It doesn’t have to have that much external context, and you can get it really fast by using these diffusion models. They use text diffusion to generate really long diffs because doing it with an autoregressive model would take minutes, and that time causes a lot of churn for a user-facing product. Every second, you lose users. So I think that it’s going to be this thing where it’s going to-

Nathan Lambert (02:34:02) -grow and have some applications, but I actually thought that different types of models were going to be used for different things sooner than they have been. I think the tool use point is the one that’s stopping them from being most general purpose because, with something like Claude Code or ChatGPT with search, the autoregressive chain is interrupted with an external tool, and I don’t know how to do that with the diffusion setup.

Tool use

Lex Fridman (02:34:28) So what’s the future of tool use this year and in the coming years? Do you think there’s going to be a lot of developments there, and how that’s integrated into the entire stack?

Sebastian Raschka (02:34:37) I do think right now it’s mostly on the proprietary LLM side, but we will see more of that in open-source tooling. It is a huge unlock because then you can really outsource certain tasks from just memorization to actual computation—you know, instead of having the LLM memorize what is 23 plus 5, just use a calculator.

Lex Fridman (02:34:58) So do you think that can help solve hallucinations?

Sebastian Raschka (02:35:01) Not solve it, but reduce it. Still, the LLM needs to know when to ask for a tool call. And second, it doesn’t mean the internet is always correct. You can do a web search for who won the World Cup in 1998, but it still needs to find the right website and get the right information. You can still go to the incorrect website and get incorrect information. I don’t think it will fully solve it, but it is improving. There was another cool paper earlier this year—I think it was December 31st, so not technically 2026, but close—on the recursive language model.

Sebastian Raschka (02:35:43) That’s a cool idea to take this even a bit further. Nathan, you mentioned earlier it’s harder to do cool research in academia because of the compute budget. If I recall correctly, they did everything with GPT-5, so they didn’t even use local models. But the idea is, for a long-context task, instead of having the LLM solve all of it in one shot or in a chain, you break it down into sub-tasks. You have the LLM decide what is a good sub-task and then recursively call an LLM to solve that.

Sebastian Raschka (02:36:16) And then adding tools—you know, each sub-task maybe goes to the web and gathers information, and then you pull it all together at the end. I think there’s going to be a lot of unlock using things like that where you don’t necessarily improve the LLM itself, you improve how the LLM is used and what it can use. One downside right now with tool use is you have to give the LLM permission to use tools. That will take some trust, especially if you want to unlock things like having an LLM answer emails for you, or just sort them. I don’t know if I would today give an LLM access to my emails, right? I mean, this is a huge risk.

Nathan Lambert (02:37:03) I think there’s one last point on the tool use thing. You hinted at this, and we’ve both come at this in our own ways: open versus closed models use tools in very different ways. With open models, people go to Hugging Face and download the model, and then the person’s going to be like, “What tool do I want?” Maybe X.ai is my preferred search provider, but someone else might care for a different search startup. When you release a model, it needs to be useful for multiple tools, which is really hard because you’re making a general reasoning engine, which is actually what gpt-oss-120b is good for.

Nathan Lambert (02:37:36) But on the closed models, you’re deeply integrating the specific tool into your experience. I think that open models will struggle to replicate some of the things that I like to do with closed models, where you can reference a mix of public and private information. Something that I keep trying every three to six months is Codex on the web, which is just prompting a model to make an update to some GitHub repository that I have.

Nathan Lambert (02:38:01) That set of secure cloud environment is just so nice for just sending it off to do this thing and then come back to me. This will probably help define some of the local open and closed niches. Because there was such a rush to get tool use working, the open models were on the back foot, which is kind of inevitable. There are so many resources in these frontier labs, but it will be fun when the open models solve this because it’s going to necessitate a more flexible model that might work with this recursive idea to be an orchestrator. Hopefully, necessity drives innovation there.

Continual learning

Lex Fridman (02:38:45) So, continual learning—this is a longstanding topic and an important problem. I think that increases in importance as the cost of training models goes up. So can you explain what continual learning is and how important it might be this year and in the coming years to make progress?

Nathan Lambert (02:39:03) This relates a lot to this kind of SF zeitgeist of: what is AGI, Artificial General Intelligence, and what is ASI, Artificial Superintelligence? What are the language models that we have today capable of doing? I think language models can solve a lot of tasks, but a key milestone for the AI community is when AI can replace any remote worker, taking in information and solving digital tasks. The limitation is that a language model will not learn from feedback the same way an employee does. If you hire an editor, they might mess up, but you will tell them, and they don’t do it again.

Nathan Lambert (02:39:43) But language models don’t have this ability to modify themselves and learn very quickly. The idea is, if we are going to get to something that is a true, general adaptable intelligence that can go into any remote work scenario, it needs to be able to learn quickly from feedback and on-the-job learning. I’m personally more bullish on language models being able to just provide very good context. You can write extensive documents where you say, “I have all this information. Here are all the blog posts I’ve ever written. I like this type of writing; my voice is based on this.” But a lot of people don’t provide this to models.

Nathan Lambert (02:40:24) The agentic models are just starting. So it’s this kind of trade-off: do we need to update the weights of this model with this continual learning thing to make them learn fast? Or, the counterargument is we just need to provide them with more context and information, and they will have the appearance of learning fast by just having a lot of context and being very smart.

Lex Fridman (02:40:43) So we should mention the terminology here. Continual learning refers to changing the weights continuously so that the model adapts and adjusts based on the new incoming information, and does so continually, rapidly, and frequently. And then the thing you mentioned on the other side of it is generally referred to as in-context learning. As you learn stuff, there’s a huge context window. You can just keep loading it with extra information every time you prompt the system, which I think both can legitimately be seen as learning. It’s just a different place where you’re doing the learning.

Sebastian Raschka (02:41:24) I think, to be honest with you, continual learning—the updating of weights—we already have that in different flavors. I think the distinction here is: do you do that on a personalized custom model for each person, or do you do it on a global model scale? And I think we have that already with going from GPT-5 to 5.1 and 5.2. It’s maybe not immediate, but it is like a quick curated update where there was feedback by the community on things they couldn’t do. They updated the weights, released the next model, and so forth. So it is kind of a flavor of that. Another even finer-grained example is RLVR; you run it, it updates.

Sebastian Raschka (02:42:08) The problem is you can’t just do that for each person because it would be too expensive to update the weights for each person. Even at OpenAI scale, building the data centers, it would be too expensive. I think that is only feasible once you have something on the device where the cost is on the consumer. Like what Apple tried to do with the Apple Intelligence models, putting them on the phone so they learn from the experience.

Lex Fridman (02:42:33) A bit of a related topic, but this kind of—maybe anthropomorphized term—memory. What are different ideas of the mechanism of how to add memory to these systems as you’re increasingly seeing? Especially personalized memory?

Sebastian Raschka (02:42:49) Right now, it’s mostly like context—stuffing things into the context and then just recalling that. But again, it’s expensive because even if you cache it, you spend tokens on that. And the second one is you can only do so much. I think it’s more like a preference or style. A lot of people do that when they solve math problems. You can add previous knowledge, but you also give it certain preference prompts, like “do what I preferred last time.” But it doesn’t unlock new capabilities. For that, one thing people still use is LoRA adapters.

Sebastian Raschka (02:43:32) These are basically, instead of updating the whole weight matrix, two smaller weight matrices that you have in parallel or overlays, like the delta. But you can do that to some extent, and then again, it is economics. There were also papers showing, for example, LoRA learns less but forgets less. There’s no free lunch. If you want to learn more, you need to use more weights, but it gets more expensive. And then if you learn more, you forget more; you have to find that Goldilocks zone.

Long context

Lex Fridman (02:44:04) We haven’t really mentioned it much, but implied in this discussion is context length as well. Is there a lot of innovations that’s possible there?

Nathan Lambert (02:44:13) I think the colloquially accepted thing is that it’s a compute and data problem. Sometimes there are small architecture things, like attention variants. We talked about hybrid attention models, which is essentially if you have what looks like a state space model within your transformer. Those are better suited because you have to spend less compute to model the furthest along token. But those aren’t free because they have to be accompanied by a lot of compute or the right data. How many sequences of 100,000 tokens do you have in the world, and where do you get these? It just ends up being pretty expensive to scale them.

Nathan Lambert (02:44:56) So we’ve gotten pretty quickly to a million tokens of input context length. And I would expect it to keep increasing and get to 2 million or 5 million this year, but I don’t expect it to go to, like, 100 million. That would be a true breakthrough, and I think those breakthroughs are possible. I think of the continual learning thing as a research problem where there could be a breakthrough that makes transformers work way better at this and it’s cheap. These things could happen with so much scientific attention. Но turning the crank, it’ll be consistent increases over time.

Sebastian Raschka (02:45:27) I think also looking at the extremes, there’s no free lunch. One extreme to make it cheap is to have, let’s say, an RNN that has a single state where you save everything from the previous stuff. It’s a specific fixed-size thing, so you never really grow the memory. You are stuffing everything into one state, but then the longer the context gets, the more information you forget because you can’t compress everything into one state. Then on the other hand, you have the transformers, which try to remember every token. That is great if you want to look up specific information, but very expensive because you have the KV cache and the dot product that grow.

Sebastian Raschka (02:46:06) But then, like you said, the Mamba layers kind of have the same problem. Like an RNN, you try to compress everything into one state, and you’re a bit more selective there. I think it’s like this Goldilocks zone again with NVIDIA Nemotron 3; they found a good ratio of how many attention layers you need for the global information where everything is accessible compared to having these compressed states. I think we will scale more by finding better ratios in that Goldilocks zone between making it cheap enough to run and making it powerful enough to be useful.

Sebastian Raschka (02:46:43) And one more plug here: the recursive language model paper is one of the papers that tries to address the long context thing. What they found is, essentially, instead of stuffing everything into this long context, if you break it up into multiple smaller tasks, you save memory and can actually get better accuracy than having the LLM try everything all at once. It’s a new paradigm; we will see if there are other flavors of that. I think we will still make improvement on long context, but like Nathan said, the problem is for pre-training itself, we don’t have as many long-context documents as other documents. So it’s harder to study basically how LMs behave on that level.

Nathan Lambert (02:47:31) There are some rules of thumb where, essentially, you pre-train a language model—like OLMo, we pre-trained at an 8K context length and then extended to 32K with training. There’s a rule of thumb where doubling the training context length takes about 2X compute, and then you can normally 2 to 4X the context length again. I think a lot of it ends up being compute-bound at pre-training. Everyone talks about this big increase in compute for the top labs this year, and that should reflect in some longer context windows.

Nathan Lambert (02:48:02) But I think on the post-training side, there’s some more interesting things. As we have agents, the agents are going to manage this context on their own. Now people who use Claude Code a lot dread the compaction, which is when Claude takes its entire 100,000 tokens of work and compacts it into a bulleted list. But what the next models will do—I’m sure people are already working on this—is the model can control when it compacts and how. So you can essentially train your RL algorithm where compaction is an action,

Nathan Lambert (02:48:30) where it shortens the history. Then the problem formulation will be, “I want to keep the maximum evaluation scores while the model compacts its history to the minimum length.” Because then you have the minimum amount of tokens that you need to do this kind of compounding auto-regressive prediction. There are actually pretty nice problem setups in this where these agentic models learn to use their context in a different way than just plowing forward.

Sebastian Raschka (02:48:56) One interesting recent example would be DeepSeek-V3.2, where they had a sparse attention mechanism with a very efficient, small, lightweight indexer. Instead of attending to all the tokens, it selects which tokens I actually need. It almost comes back to the original idea of attention where you are selective, but attention is always on; you have maybe zero weight on some of them, but you use them all. But they are even more like, “Okay, let’s just mask that out or not even do that.” And even with sliding window attention in OLMo, that is also kind of like that idea. You have that rolling window where you keep it fixed, because you don’t need everything all the time.

Sebastian Raschka (02:49:34) Occasionally, in some layers you might, but it’s wasteful. But right now, I think if you use everything, you’re on the safe side; it gives you the best bang for the buck because you never miss information. And right now, I think this year will also be the year of figuring out, like you said, how to be smarter about that. Right now people want to have the next state-of-the-art, and the state-of-the-art happens to be the brute force, expensive thing. Once you have that, like you said, you want to keep that accuracy but see how we can do that cheaper now using tricks.

Nathan Lambert (02:50:07) Yeah, all this scaling thing. Like the reason we get the Claude 4.5 Sonnet model first is because you can train it faster and you’re not hitting these compute walls as soon. They can just try a lot more things and get the model out faster, even though the bigger model is actually better.

Robotics

Sebastian Raschka (02:50:22) I think we should say that there’s a lot of exciting stuff going on in the AI space. My mind has recently been really focused on robotics, so today we almost entirely didn’t talk about robotics. There’s a lot of stuff on image generation and video generation. I think it’s fair to say that the most exciting research work in terms of intensity and fervor is in the LLM space, which is why I think it’s justified for us to focus on the LLMs we’re discussing. But it’d be nice to bring in certain things that might be useful. For example, world models—there’s growing excitement about that. Do you think there will be any use in this coming year for world models in the LLM space?

Sebastian Raschka (02:51:08) Also with LLMs, what’s an interesting thing here is I think if we unlock more LLM capabilities, it also automatically unlocks all the other fields because it makes progress faster. Because, you know, a lot of researchers and engineers use LLMs for coding. So even if they work on robotics, if you optimize these LLMs that help with coding, it pays off. But then yes, world models are interesting. It’s basically where you have the model run a simulation of the world—like a little toy version of the real thing—which can unlock capabilities like data the LLM is not aware of. It can simulate things. I think LLMs happen to work well by pre-training and doing next-token prediction, but we could do this in a more sophisticated way.

Sebastian Raschka (02:52:05) There was a paper, I think by Meta, called “Coder World Models.” They basically apply the concept of world models to LLMs where, instead of just having next-token prediction and verifiable rewards checking the answer correctness, they also make sure the intermediate variables are correct. The model is basically learning a code environment. I think this makes a lot of sense; it’s just expensive to do. But it is making things more sophisticated by modeling the whole process, not just the result, and that can add more value.

Sebastian Raschka (02:52:51) I remember when I was a grad student, there’s a competition called CASP where they do protein structure prediction. They predict the structure of a protein that is not solved yet. In a sense, this is actually great, and I think we need something like that for LLMs also, where you do the benchmark but no one knows the solution until someone reveals it after the fact. When AlphaFold came out, it crushed this benchmark. I mean there were multiple iterations, but I remember the first one explicitly modeled the physical interactions and the physics of the molecule.

Sebastian Raschka (02:53:34) Also, things like impossible angles. Then in the next version, I think they got rid of this and just used brute force, scaling it up. I think with LLMs, we are currently in this brute-force scaling because it just happens to work, but I do think at some point it might make sense to bring back this approach. I think with world models, that might be actually quite cool. And of course, for robotics, that is completely related to LLMs.

Lex Fridman (02:54:03) Yeah, and robotics is very explicit. There’s the problem of locomotion or manipulation. Locomotion is much more solved, especially in the learning domain. But there’s a lot of value, just like with the initial protein folding systems… …Bringing in the traditional model-based methods. So it’s unlikely that you can just learn the manipulation or the whole-body local manipulation problem end-to-end. That’s the dream. But then you realize when you look at the magic of the human hand… …And the complexity of the real world, you realize it’s really hard to learn this all the way through- …the way I guess AlphaFold 2 didn’t.

Nathan Lambert (02:54:40) I’m excited about the robotic learning space. I think it’s collectively getting supercharged by all the excitement and investment in language models generally. The infrastructure for training transformers, which is a general modeling thing, is becoming world-class industrial tooling. Wherever there was a limitation for robotics, it’s just way better now. There’s way more compute. They take these language models and use them as central units where you can do interesting explorative work around something that already works. And then I see it emerging as, kind of like we talked about, Hugging Face transformers and Hugging Face.

Nathan Lambert (02:55:19) I think when I was at Hugging Face, I was trying to get this to happen, but it was too early. These open robotic models on Hugging Face enable people to contribute data and fine-tune them. I think we’re much closer now that the investment in robotics and self-driving cars is related and enables this. Once you get to the point where you have this sort of ecosystem, someone can download a robotics model and fine-tune it to their robot or share datasets across the world. There’s some work in this area like RTX from a few years ago where people are starting to do that. But once they have this ecosystem, it’ll look very different. And then this whole post-ChatGPT boom is putting more resources into that, which I think is a very good area for doing research.

Lex Fridman (02:56:02) This is also resulting in much better, more accurate, more realistic simulators being built, closing this sim-to-real gap in the robotic space. But you know, you mentioned a lot of excitement and investment. The downside of that, which happens in hype cycles—I personally believe, and most robotics people believe—is that robotics is not going to be solved on the timescale being implicitly or explicitly promised. So what happens when all these robotics companies spring up and then they don’t have a product that works? Then there’s going to be this crash of excitement, which is nerve-wracking. Hopefully something else will swoop in so that the continued development of some of these ideas keeps going.

Sebastian Raschka (02:56:53) I think it’s also related to the continual learning issue. The real world is so complex, whereas with LLMs, you don’t really need to have something learn for the user because there are a lot of things everyone has to do—everyone maybe wants to fix their grammar in their email or code. It’s more constrained, so you can prepare the model for that. But preparing a robot for the real world is harder. You have robotic foundation models, and you can learn things like grasping, but every house is different. It’s so different that the robot would have to learn on the job, essentially. And I think that is the bottleneck right now: customizing it on the fly.

Lex Fridman (02:57:42) I don’t think I can possibly understate the importance of the thing that doesn’t get talked about almost at all by robotics folks or anyone, and that is safety. All the interesting complexities we talk about regarding learning, all the failure modes and failure cases—everything we’ve been talking about with LLMs where sometimes it fails in interesting ways—all of that is fun and games in the LLM space. In the robotic space, in people’s homes, across millions of minutes and billions of interactions, you really are almost allowed to fail never. When you have embodied systems put out there in the real world, you just have to solve so many problems you never thought you’d have to solve when you’re just thinking about the general robot learning problem.

Nathan Lambert (02:58:32) I’m so bearish on in-home learned robots for consumer purchase. I’m very bullish on self-driving cars, and I’m very bullish for robotic automation, like Amazon distribution— …where Amazon has built whole new distribution centers designed for robots first rather than humans. There’s a lot of excitement in AI circles about AI enabling automation—

Nathan Lambert (02:58:54) …and mass-scale manufacturing, and I do think that the path to robots doing that is more reasonable. It’s a thing that is designed and optimized to do a repetitive task that a human could conceivably do but doesn’t want to. But it’s also going to take a lot longer than people probably predict. I think the leap from the AI singularity to scaling up mass manufacturing in the US because we have a massive AI advantage is one that is troubled by a lot of political and other challenging problems.

Timeline to AGI

Lex Fridman (02:59:31) Let’s talk about timelines specifically: timelines to AGI or ASI. Is it fair, as a starting point, to say that nobody really agrees on the definitions of AGI and ASI?

Nathan Lambert (02:59:46) I think there’s a lot of disagreement, but I’ve been getting pushback where people say it is something that could reproduce most digital economic work. The remote worker is a fairly reasonable example. I think OpenAI’s definition is somewhat related to that—an AI that can do a certain number of economically valuable tasks—which I don’t really love as a definition, but it could be a grounding point. Language models today, while immensely powerful, are not this remote worker drop-in. There are things an AI could do that are way harder than remote work, like solving a…

Nathan Lambert (03:00:29) …finding an unexpected scientific discovery that you couldn’t even posit, which would be an example of something people call an artificial superintelligence problem. Or taking in all medical records and finding linkages across certain illnesses that people didn’t know or figuring out that some common drug can treat a niche cancer. They would say that is a superintelligence thing. So these are natural tiers. My problem is that it becomes deeply entwined with the quest for meaning in AI and these religious aspects. There are different paths you can take.

Lex Fridman (03:01:06) And I don’t even know if remote work is a good definition. I liked the originally titled AI2027 report. They focus more on code and research taste, so the target there is the superhuman coder. They have several milestone systems: superhuman coders, superhuman AI researcher, then superintelligent AI researcher, and then the full ASI. After you develop the superhuman coder, everything else follows quickly. The task is to have fully autonomous, automated coding, so any kind of coding you need to do in order to perform research is fully automated.

Lex Fridman (03:01:58) From there, humans would be doing AI research together with that system, and they will quickly be able to develop a system that actually can do the research for you. That’s the idea. Initially, their prediction was 2027 or ’28, and now they’ve pushed it back by three to four years to 2031, mean prediction. My prediction is probably even beyond 2031, but at least you can think concretely about how difficult it is to fully automate programming.

Nathan Lambert (03:02:31) Yeah, I disagree with some of their presumptions and dynamics on how it would play out, but I think they did good work in defining concrete milestones to tell a useful story. That’s why the reach of this AI 2027 document well transcended Silicon Valley—because they told a good story and did a lot of rigorous work.

Nathan Lambert (03:02:53) I think the camp that I fall into is that AI is so-called jagged, which will be excellent at some things and really bad at some things. I think that when they’re close to this automated software engineer, what it will be good at is traditional ML systems and front end—the model is excellent at those—but the distributed ML, the models are actually really quite bad at because there’s so little training data on doing large-scale distributed learning and things. And this is something that we already see, and I think this will just get amplified. And then it’s kind of messier in these trade-offs, and then there’s how you think AI research works and so on.

Lex Fridman (03:03:28) So you think basically a superhuman coder is almost unachievable meaning, because of the jagged nature of the thing, you’re just always going to have gaps in capabilities?

Nathan Lambert (03:03:38) I think it’s assigning completeness to something where the models are kind of superhuman at some types of code, and I think that will continue. And people are creative, so they’ll utilize these incredible abilities to fill in the weaknesses of the models and move really fast. There will always be, for a long time, this dance between the humans enabling this thing that the model can’t do, and the best AI researchers are the ones that can enable this superpower.

Nathan Lambert (03:04:04) And I think those lines, compared to what we already see… I think like Claude Code for building a website, you can stand up a beautiful website in a few hours or do data analysis. But the whole thing is going to keep getting better at these things, and we’ll pick up some new code skills and stuff along the way. Linking to what’s happening in big tech, this AI 2027 report leans into the singularity idea where I think research is messy and social and largely in the data in ways that AI models can’t process. But what we do have today is really powerful, and these tech companies are all collectively buying into this with tens of billions of dollars of investment. So we are going to get some much better version of ChatGPT, a much better version of Claude Code than we already have.

Nathan Lambert (03:04:50) I think that it’s just hard to predict where that is going, but the bright clarity of that future is why some of the most powerful people in the world are putting so much money into this. And I think it’s just kind of small differences—we don’t actually know what a better version of ChatGPT is, but also can it automate AI research? I would say probably not, at least in this timeframe. Big tech is going to spend $100 billion much faster than we get an automated AI researcher that enables an AI research singularity.

Lex Fridman (03:05:22) So you think your prediction would be, if this is even a useful milestone, more than 10 years out?

Nathan Lambert (03:05:30) I would say less than that on the software side, but I think longer than that on things like research.

Lex Fridman (03:05:36) Well, let’s just for fun try to imagine a world where all software writing is fully automated. Can you imagine that world?

Nathan Lambert (03:05:46) By the end of this year, the amount of software that’ll be automated will be so high. But it’ll be things like you’re trying to train a model with RL and you need to have multiple bunches of GPUs communicating with each other. That’ll still be hard, but I think it’ll be much easier.

Lex Fridman (03:06:02) One of the ways to think about this, the full automation of programming, is just think of lines of useful code written—the fraction of that to the number of humans in the loop. So presumably there’ll be, for a long time, humans in the loop of software writing. It’ll just be fewer and fewer relative to the amount of code written. Right? And with the superhuman coder, I think the presumption there is the number of humans in the loop goes to zero. What does that world look like when the number of humans in the loop is in the hundreds, not in the hundreds of thousands?

Will AI replace programmers?

Nathan Lambert (03:06:39) I think software engineering will be driven more to system design and goals of outcomes, where I do think software is largely going to be… I think this has been happening over the last few weeks, where people have gone from a month ago saying, “Oh yeah, agents are kind of slop,” which is a famous Karpathy quote, to the industrialization of software when anyone can just create software with their fingerprints. I do think we are closer to that side of things, and it takes direction and understanding how the systems work to extract the best from the language models. And I think it’s hard to accept the gravity of how much is going to change with software development and how many more people can do things without ever looking at the code.

Sebastian Raschka (03:07:22) I think what’s interesting is to think about whether these systems will be independent, in the sense that while I have no doubt that LLMs will at some point solve coding in the way calculators solve calculating, right? At some point, humans developed a tool that you never need a human to calculate that number for; you just type it in, and it’s an algorithm. I think that’s the same probably for coding. But the question isn’t… I think what will happen is you will just say, “Build that website,” and it will make a really good website, and then you maybe refine it. But will it do things independently where…

Sebastian Raschka (03:07:59) Will you still have humans asking the AI to do something? Like will there be a person to say, “Build that website?” Or will there be AI that just builds websites or something, or whatever?

Lex Fridman (03:08:12) I think talking about building websites is the—

Nathan Lambert (03:08:15) Too simple.

Sebastian Raschka (03:08:16) Yeah. Sure.

Lex Fridman (03:08:16) It’s just that the problem with websites and the problem with the web, you know, HTML and all that kind of stuff, it’s very resilient to just— slop. It will show you slop. It’s good at showing slop. I would rather think of safety-critical systems, like asking AI to end-to-end generate something that manages logistics— or manages cars— a fleet of cars, all that kind of stuff. So it end-to-end generates that for you.

Nathan Lambert (03:08:45) I think a more intermediate example is take something like Slack or Microsoft Word. I think if the organizations allow it, AI could very easily implement features end-to-end and do a fairly good job for things that you want to try. You want to add a new tab in Slack that you want to use, and I think AI will be able to do that pretty well.

Lex Fridman (03:09:06) Actually, that’s a really great example. How far away are we from that?

Nathan Lambert (03:09:09) Like this year.

Lex Fridman (03:09:11) See, I don’t know. I don’t know.

Nathan Lambert (03:09:14) I guess I don’t know— how bad production codebases are, but I think that within… on the order of a few years, a lot of people are going to be pushed to be more like a designer and product manager, where you have multiple of these agents that can try things for you, and they might take one to two days to implement a feature or attempt to fix a bug. And you have these dashboards—which I think Slack is actually a good dashboard—where your agents will talk to you and you’ll then give feedback. But things like, I make a website and it’s like, “Do you want to make a logo that’s passable?” I think these cohesive design things and the style is going to be very hard for models and deciding on what to add next.

Lex Fridman (03:09:54) I just… Okay. So I hang out with a lot of programmers and some of them are a little bit on the skeptical side in general—that’s just the vibe. I just think there’s a lot of complexity involved in adding features to complex systems. Like, if you look at the browser, Chrome. If I wanted to add a feature, if I wanted to have tabs as opposed to up top, I want them on the left side. Interface, right? I think we’re not… This is not a next year thing.

Nathan Lambert (03:10:26) One of the Claude releases this year, one of their tests was we give it a piece of software and leave Claude to run to recreate it entirely, and it could already almost rebuild Slack from scratch, just given the parameters of the software and left in a sandbox environment to do that.

Lex Fridman (03:10:41) So the from-scratch part, I like almost better.

Nathan Lambert (03:10:44) So it might be that the smaller and newer companies are advantaged and they’re like, “We don’t have to have the bloat and complexity, and therefore this feature exists.”

Sebastian Raschka (03:10:53) And I think this gets to the point that you mentioned that some people you talk to are skeptical, and I think that’s not because the LLM can’t do X, Y, Z. It’s because people don’t want it to do it this way.

Lex Fridman (03:11:05) Some of that could be a skill issue on the human side. Unfortunately, we have to be honest with ourselves. And some of that could be an underspecification issue. So, programming… this is like a communication type of issue in relationships and friendships. You’re assuming the LLM somehow is supposed to read your mind. I think this is where spec-driven design is really important. Like you just, using natural language, specify what you want.

Nathan Lambert (03:11:32) I think if you talk to people at the labs, they use these in their training and production code. Claude Code is built with Claude Code, and they all use these things extensively. And Dario talks about how much of Claude’s code… It’s like these people are slightly ahead in terms of the capabilities—

Nathan Lambert (03:11:49) —they have, and they probably spend on inference. They could spend 10 to 100 times as much as we’re spending, like we’re on a lowly 100 or $200 a month plan. They truly let it rip. And I think that with the pace of progress that we have, a year ago we didn’t have Claude Code and we didn’t really have reasoning models. The difference between sitting here today and what we can do with these models—it seems like there’s a lot of low-hanging fruit to improve them. The failure modes are pretty dumb. It’s like, “Claude, you tried to use the CLI command I don’t have installed 14 times, and then I sent you the command to run.” From a modeling perspective, that thing is pretty fixable. So, I don’t know.

Lex Fridman (03:12:34) I agree with you. I’ve been becoming more and more bullish in general. Speaking to what you’re articulating, I think it is a human skill issue. So Anthropic and other companies are leading the way in understanding how to best use the models for programming; therefore, they’re effectively using them. I think there’s a lot of programmers on the outskirts who don’t… I mean, there’s not a really good guide on how to use them. People are trying to figure it out exactly, but—

Nathan Lambert (03:13:04) It might be very expensive. It might be that the entry point for that is $2,000 a month, which is only for tech companies and rich people. That could be it.

Lex Fridman (03:13:13) But it might be worth it. If the final result is a working software system, it might be worth it. By the way, it’s funny how we converged from the discussion of the timeline to AGI to something more pragmatic and useful. Is there anything concrete and profound to be said about the timeline to AGI and ASI? Or are these discussions a bit too detached from the day-to-day?

Nathan Lambert (03:13:39) There’s interesting bets. There’s a lot of people trying to do Reinforcement Learning with Verifiable Rewards—RLVR—but in real scientific domains. There are startups spending hundreds of millions of dollars in funding, and they have wet labs where they’re having language models propose hypotheses that are tested in the real world. I would say that they’re early, but with the pace of progress—

Nathan Lambert (03:14:00) —maybe they’re early by six months and they make it because they were there first, or maybe they’re early by eight years. You don’t really know. So I think that type of moonshot to branch this momentum into other sciences would be very transformative if AlphaFold moments happen in all sorts of other scientific domains by a startup solving this. I think there are startups—maybe Harmonic is one—where they’re going all in on language models plus Lean for math. I think you had another podcast guest where you talked about this recently, and it’s like we don’t know exactly what’s going to fall out of spending $100 million on that model.

Nathan Lambert (03:14:41) Most of them will fail, but a couple of them might be big breakthroughs that are very different than ChatGPT or Claude Code type software experiences. Like a tool that’s only good for a PhD mathematician but makes them 100 times more effective.

Sebastian Raschka (03:14:58) I agree. I think this will happen in a lot of domains, especially those with a lot of resources like finance, legal, and pharmaceutical companies. But then again, is it really AGI? Because we are specializing it again. Is it really that much different from how we had specialized algorithms back in the day? I think it’s just the same thing but way more sophisticated. Is there a threshold when we call it AGI? I think the real cool thing here is that we have foundation models that we can specialize. That’s the breakthrough.

Sebastian Raschka (03:15:34) Right now, I think we are not there yet because first, it’s too expensive, but also ChatGPT doesn’t just give away their model to customize it. I can imagine a business model where OpenAI says at some point, “Hey, Bank of America, for $100 million we will do your custom model.” I think that will be the huge economic value add. The other thing though is, what is the differentiating factor? If everyone uses ChatGPT, they will all do the same thing. Everyone is moving in lockstep, but usually companies want to have a competitive advantage. I think there is no way around using some of their private data and experimenting with specialization. It’s going to be interesting.

Nathan Lambert (03:16:26) Given the pace of progress, it does feel like things are coming. I don’t think the AGI and ASI thresholds are particularly useful.

Lex Fridman (03:16:35) I think the real question, and this relates to the remote worker thing, is when are we going to see a big, obvious leap in economic impact? Because currently there’s not been an obvious leap in economic impact from LLM models, for example. Aside from AGI or ASI, there’s a real question of when we are going to see a GDP jump. Jump.

Nathan Lambert (03:17:06) Yeah, it’s like, what is the GDP made up of? A lot of it is financial services, so I don’t know what this is. It’s just hard for me to think about the GDP bump, but I would say that software development becomes valuable in a different way when you no longer have to look at the code anymore. So when it is like, Claude will make you a small business—which is essentially Claude can set up your website, your bank account, your email, and your whatever else—and you just have to express what you’re trying to put into the world. That’s not just an enterprise market, but it is hard. I don’t know how you get people to try doing that. I guess if ChatGPT can do it—people are trying ChatGPT.

Lex Fridman (03:17:49) I think it boils down to the scientific question of, “How hard is tool use to solve?” Because a lot of the stuff you’re implying, the remote work stuff, is tool use. It’s like computer use; how you have an LLM that goes out there, this agentic system, and does something in the world, and only screws up 1% of the time.

Nathan Lambert (03:18:11) Computer use is a good example of what labs care about and we haven’t seen a lot of progress on.

Lex Fridman (03:18:12) Or less.

Nathan Lambert (03:18:12) We saw multiple demos in 2025 of, like, Claude can use your computer, or OpenAI had operator, and they all suck. So they’re investing money in this, and I think that’ll be a good example. Whereas actually, taking over the whole screen seems a lot harder than having an API that they can call in the back end. Some of that is you have to then set up a different environment for them all to work in. They’re not working on your MacBook; they are individually interfacing with Google and Amazon and Slack, and they handle all these things in a very different way than humans do. So some of this might be structural blockers.

Sebastian Raschka (03:18:55) Also, specification-wise, I think the problem for arbitrary tasks is that you still have to specify what you want your LLM to do. What is the environment? How do you specify? You can say what the end goal is, but if it can’t solve the end goal—with LLMs, if you ask it for text, it can always clarify or do sub-steps. How do you put that information into a system that, let’s say, books a travel trip for you? You can say, “Well, you screwed up my credit card information,” but even to get it to that point, as a user, how do you guide the model before it can even attempt that? I think the interface is really hard.

Lex Fridman (03:19:36) Yeah, it has to learn a lot about you specifically. And this goes to continual learning—about the general mistakes that are made throughout, and then mistakes that are made through you.

Nathan Lambert (03:19:48) All the AI interfaces are getting set up to ask humans for input. I think Claude Code we talked about a lot. It asks feedback and questions. If it doesn’t have enough specification on your plan or your desire, it starts to ask questions, “Would you rather?” We talked about Memory, which saves across chats. Its first implementation is kind of odd, where it’ll mention my dog’s name or something in a chat. I’m like, “You don’t need to be subtle about this. I don’t care.” But things are emerging, like ChatGPT has the Pulse feature.

Nathan Lambert (03:20:19) Which is like a curated couple paragraphs with links to something to look at or to talk about, and people talk about how the language models are going to ask you questions. It’s probably going to work. The language model knows you had a doctor appointment or something, and it’s like, “Hey, how are you feeling after that?” Which again, goes into the territory where humans are very susceptible to this and there’s a lot of social change to come. But also, they’re experimenting with having the models engage. Some people really like this Pulse feature, which processes your chats and automatically searches for information and puts it in the ChatGPT app. So there’s a lot of things coming.

Sebastian Raschka (03:20:58) I used that feature before, and I always feel bad because it does that every day, and I rarely check it out. How much compute is burned on something I don’t even look at, you know?

Nathan Lambert (03:21:11) There’s also a lot of idle compute in the world, so don’t feel too bad.

Lex Fridman (03:21:16) Okay. Do you think new ideas might be needed? Is it possible that the path to AGI—whatever that is, however we define that—to solve computer use more generally, to solve biology and chemistry and physics, sort of the Dario definition of AGI or powerful AI? Do you think it’s possible that totally new ideas are needed? Non-LLM, non-RL ideas. What might they look like? We’re now going into philosophy land a little bit.

Nathan Lambert (03:21:50) For something like a singularity to happen, I would say yes. And the new ideas could be architectures or training algorithms, which are fundamental deep learning things. But they’re, in that nature, pretty hard to predict. But I think we won’t get very far even without those advances. Like, we might get this software solution, but it might stop at software and not do computer use without more innovation. So I think that a lot of progress will be coming, but if you’re going to zoom out, there’s still ideas in the next 30 years that are going to look like a major scientific innovation that enabled the next chapter of this. And I don’t know if it comes in one year or in 15 years.

Lex Fridman (03:22:32) Yeah. I wonder if the Bitter Lesson holds true for the next 100 years, and what that looks like.

Nathan Lambert (03:22:37) If scaling laws are fundamental in deep learning, I think the Bitter Lesson will always apply, which is compute will become more abundant. But even within abundant compute, the ones that have a steeper scaling law slope or a better offset—like, this is a 2D plot of performance and compute—even if there’s more compute available, the ones that get 100x out of it will win.

Lex Fridman (03:23:01) It might be something like literally computer clusters orbiting Earth with solar panels.

Nathan Lambert (03:23:09) The problem with that is heat dissipation. You get all the radiation from the sun and you don’t have any air to dissipate heat. But there is a lot of space to put clusters. There’s a lot of solar energy there and you could figure out the heat dissipation, but there is a lot of energy and there probably could be engineering will to solve the heat problem— …so there could be.

Lex Fridman (03:23:27) Is it possible—and we should say that it definitely is possible—that we’re basically going to be plateauing this year? Not in terms of— …the system capabilities, but what the system capabilities actually mean for human civilization. So on the coding front, really nice websites will be built. Very nice autocomplete.

Lex Fridman (03:23:53) Very nice way to understand code bases and maybe help debug, but really just a very nice helper on the coding front. It can help research mathematicians do some math. It can help you with shopping. It’s a nice helper. It’s Clippy on steroids. What else? It may be a good education tool and all that kind of stuff, but computer use turns out extremely difficult to solve. So I’m trying to frame the cynical case in all these domains where there’s not a really huge economic impact, but realize how costly it is to train these systems at every level—both the pre-training and the inference, how costly the inference is, the reasoning, all of that. Is that possible? And how likely is that, do you think?

Nathan Lambert (03:24:47) When you look at the models, there are so many obvious things to improve, and it takes a long time to train these models and to do this art, that it’ll take us with the ideas that we have multiple years to actually saturate in terms of whatever benchmark or performance we are searching for. It might serve very narrow niches. The average ChatGPT user might not get a lot of benefit out of this, but it is going to serve different populations by getting better at different things.

Is the dream of AGI dying?

Lex Fridman (03:25:18) But I think what everybody’s chasing now is a general system that’s useful to everybody. So, okay, if that’s not… that can plateau, right?

Nathan Lambert (03:25:28) I think that dream is actually kind of dying. As you talked about with the specialized models where it’s like… and multimodal is often… like, video generation is a totally different thing.

Lex Fridman (03:25:39) “That dream is kind of dying” is a big statement, because I don’t know if it’s dying. If you ask the actual Frontier Lab people, they’re still chasing it, right?

Sebastian Raschka (03:25:48) I do think they are still rushing to get the next model out, which will be much better than the previous one. “Much” is a relative term, but it will be better than the previous one. I can’t see them slowing down. I just think the gains will be made or felt more through not only scaling the model, but now… I feel like there’s a lot of tech debt. It’s like, “Well, let’s just put the better model in there, and better model, better model.” And now people are like, “Okay, let’s also at the same time improve everything around it too.”

Sebastian Raschka (03:26:20) Like the engineering of the context and inference scaling. And the big labs will still keep doing that. And now also the smaller labs will catch up to that because now they are hiring more. There will be more people. LLMs, it’s kind of like a circle. They also make them more productive and it’s just like an amplifier. I think what we can expect is amplification, but not a paradigm change. I don’t think that is true, but everything will be just amplified and amplified and amplified, and I can see that continuing for a long time.

Nathan Lambert (03:26:52) Yeah. I guess my statement with the dream is dying depends on exactly what you think it’s going to be doing. Like Claude Code is a general model that can do a lot of things, but it depends a lot on integrations and other things. I bet Claude Code could do a fairly good job of doing your email, and the hardest part is figuring out how to give it information and how to get it to be able to send your emails and stuff like this. But I think it goes back to what is the “one model to rule everything” ethos, which is just like a thing in the cloud that handles your entire digital life and is way smarter than everybody.

Nathan Lambert (03:27:34) So it’s an interesting leap of faith to go from Claude Code becomes that—which, in some ways, there are some avenues for that—but I do think that the rhetoric of the industry is a little bit different.

Sebastian Raschka (03:27:49) I think the immediate thing we will feel next as a normal person using LLMs will probably be related to something trivial, like making figures. Right now LLMs are terrible at making figures. Is it because we are getting served the cheap models with less inference compute than behind the scenes? Maybe with some cranks we can already get better figures, but if you ask today to draw a flowchart of X, Y, Z, it’s most of the time terrible. And it is kind of a very simple task for a human. I think it’s almost easier sometimes to draw something than to write something.

Nathan Lambert (03:28:25) Yeah, the multimodal understanding does feel like something that is odd, that it’s not better solved.

Lex Fridman (03:28:31) I think we’re not saying one actually obvious thing that we’re not realizing, that’s a gigantic thing that’s hard to measure, which is making all of human knowledge accessible… …To the entire world. One of the things that I think is hard to articulate, but there’s just a huge difference between Google Search and an LLM. I feel like I can basically ask an LLM anything and get an answer, and it’s doing less and less hallucination.

Lex Fridman (03:29:04) And that means understanding my own life, figuring out a career trajectory, figuring out how to solve the problems all around me, learning about anything through human history. I feel like nobody’s really talking about that because they just immediately take it for granted that it’s awesome. That’s why everybody’s using it—it’s because you get answers for stuff, and think about the impact of that across time. This is not just in the United States; this is all across the world. Kids throughout the world being able to learn these ideas—the impact that has across time is probably where the real GDP growth will be. It won’t be like a leap.

Lex Fridman (03:29:51) It’ll be that that’s how we get to Mars, that’s how we build these things, that’s how we have a million new OpenAIs, all the kind of innovation that happens from there. And that’s just this quiet force that permeates everything, right? Human knowledge.

Sebastian Raschka (03:30:06) I do agree with you, and in a sense it makes knowledge more accessible, but it also depends on what the topic is. For something like math, you can ask it questions and it answers, but if you want to learn a topic from scratch—we talked about this earlier—I think the sweet spot is still math textbooks where someone laid it out linearly. That is a proven strategy to learn a topic, and it makes sense if you start from zero to get information-dense text to soak it up, but then you use the LLM to make infinite exercises.

Sebastian Raschka (03:30:47) If you have problems in a certain area or have questions about things you are uncertain about, you ask it to generate example problems, you solve them, and then maybe you need more background knowledge and you ask it to generate that. But it won’t give you anything that is not in the textbook. It’s just packaging it differently, if that makes sense.

Sebastian Raschka (03:31:13) But then there are things where it also adds value in a more timely sense, where there is no good alternative besides a human doing it on the fly. For example, if you’re planning to go to Disneyland and you try to figure out which tickets to buy for which park when, well, there is no textbook on that. There is no information-dense resource on that. There’s only the sparse internet, and then there is a lot of value in the LLM. You just ask it. You have the constraints on traveling on these specific days, you want to go to certain places, and you ask it to figure out what you need, when and from where… …What it costs and stuff like that. It is a very customized, on-the-fly package. Personalization is essentially like—

How AI will make money?

Sebastian Raschka (03:32:02) …pulling information from the sparse internet, the non-information-dense thing where there’s no better version that exists. You make it from scratch almost.

Lex Fridman (03:32:12) And if it does exist, it’s full of—speaking of Disney World—ad slop. Like any city in the world, if you ask “what are the top 10 things to do?” An LLM is just way better to ask… …Than anything on the internet.

Nathan Lambert (03:32:29) Well, for now, that’s because they’re massively subsidized, and eventually they’re going to be paid for by ads.

Lex Fridman (03:32:35) Oh my goodness.

Nathan Lambert (03:32:37) It’s coming.

Lex Fridman (03:32:38) No. I’m hoping there’s a very clear indication of what’s an ad and what’s not an ad in that context, but—

Sebastian Raschka (03:32:46) That’s something I mentioned a few years ago. It’s like, I don’t know, if you are looking for a new running shoe, is it a coincidence that Nike maybe comes up first? Maybe, maybe not. I think there are clear laws around this. You have to be clear about that, but I think that’s what everyone fears. It’s like the subtle message in there or something like that. But also, this brings us to the topic of ads where, I think this was a thing, hopefully they try to launch in 2025 because I think they’re still not making money in that other way right now, so… …Like having actual ad spots in there. And then the thing, though, is they couldn’t because there are alternatives without ads and people would just flock-

Sebastian Raschka (03:33:31) …to the other products. And it also is just crazy how they’re one-upping each other, spending so much money just to get the users.

Nathan Lambert (03:33:41) I think so. Like some Instagram ads—I don’t use Instagram- …but I understand the appeal of paying a platform to find users who will genuinely like your product. That is the best case of things like Instagram ads.

Nathan Lambert (03:33:56) But there are also plenty of cases where advertising is very awful for incentives. I think that a world where the power of AI can integrate with that positive view—like, I am a person and I have a small business and I want to make the best damn steak knives in the world and I want to sell them to somebody who needs them. And if AI can make that sort of advertising work even better, that’s very good for the world, especially with digital infrastructure because that’s how the modern web has been built. But that’s not to say that addicting feeds so that you can show people more content is a good thing. So, I think that’s even what OpenAI would say is they want to find a way that can make the monetization upside of ads while still giving their users agency.

Nathan Lambert (03:34:45) And I personally would think that Google is probably going to be better at figuring out how to do this because they already have ad supply. If they figure out how to turn this demand in their Gemini app into useful ads, then they can turn it on. I don’t know if I think it’s this year, but there will be experiments with it.

Sebastian Raschka (03:35:06) I do think what holds companies back right now is really just that the competition is not doing it. It’s more like a reputation thing. I think people are just afraid right now of ruining their reputation or losing users- …because it would make headlines if someone launched these ads. But-

Nathan Lambert (03:35:23) Unless they were great, but the first ads won’t be great because it’s a hard problem that we don’t know how to solve.

Sebastian Raschka (03:35:28) Yeah, I think also the first version of that will likely be something like on X, like the timeline where you have a promoted post sometimes in between. It’ll be something like that where it will say “promoted” or something small, and then there will be an image or something. I think right now the problem is who makes the first move.

Nathan Lambert (03:35:43) If we go 10 years out, the proposition for ads is that you will make so much money on ads by having so many users- …that you can use this to fund better R&D and- …make better models, which is why- …like YouTube is dominating the market. Netflix is scared of YouTube. They make, I don’t know—I pay $28 a month for premium. They make at least $28 a month off of me and many other people, and they’re just creating such a dominant position in video. So I think that’s the proposition, which is that ads can give you a sustained advantage- …in what you’re spending per user. But there’s so much money in it right now that it’s like somebody starting that flywheel- is scary because it’s a long-term bet.

Big acquisitions in 2026

Lex Fridman (03:36:29) Do you think there’ll be some crazy big moves this year business-wise? Like Google or Apple acquiring Anthropic or something like this?

Nathan Lambert (03:36:40) Dario will never sell, but we are starting to see some types of consolidation, with Groq being valued at $20 billion and Scale AI for almost 30 billion. There are countless other deals structured in a way that is actually detrimental to the Silicon Valley ecosystem—these licensing deals where not everybody gets brought along, rather than a full acquisition that benefits the rank-and-file employees by getting their stock vested. That’s a big issue for Silicon Valley culture to address because the startup ecosystem is the lifeblood. If you join a startup, even if it’s not that successful, your startup very well might get acquired at a cheap premium and you’ll get paid out for your equity.

Nathan Lambert (03:37:24) And these licensing deals are essentially taking the top talent a lot of the time. I think the deal for Groq to NVIDIA is rumored to be better for the employees, but it is still this antitrust-avoiding thing. I think that this trend of consolidation will continue. Me and many smart people I respect have been expecting consolidation to have happened sooner, but it seems like things are starting to turn. But at the same time, you have companies raising ridiculous amounts of money for reasons that I don’t understand. I’m like, “I don’t know why you’re taking that money.” So it’s maybe mixed this year, but some consolidation pressure is starting.

Lex Fridman (03:38:04) What kind of surprising consolidation do you think we’ll see? You say Anthropic is a “never.” I mean, Groq is a big one—Groq with a Q, by the way.

Nathan Lambert (03:38:12) Yeah. There’s just a lot of startups and there’s a very high premium on AI startups. So there could be a lot of $10 billion range acquisitions, which is a really big acquisition for a startup that was maybe founded a year ago. I think Manus.ai—this company based in Singapore that was founded eight months ago and then had a $2 billion exit. I think there will be some other big multi-billion dollar acquisitions, like Perplexity.

Lex Fridman (03:38:39) Like Perplexity, right?

Nathan Lambert (03:38:40) Yeah, people rumor them to Apple. I think there’s a lot of pressure and liquidity in AI. There’s pressure on big companies to have outcomes, and I would guess that a big acquisition gives people leeway to then tell the next chapter of that story.

Lex Fridman (03:38:56) I mean, yeah, we’ve been talking about code. Maybe somebody acquires Cursor.

Nathan Lambert (03:39:02) They’re in such a good position because they have so much user data. And we talked about continual learning and stuff; they had one of the most interesting blog posts. They mentioned that their new Composer model was a fine-tune of one of these large Mixture of Experts models from China. You can know that from gossip or because the model sometimes responds in Chinese, which none of the American models do. They had a blog post where they said, “We’re updating the model weights every 90 minutes based on real-world feedback from people using it.” Which is the closest thing to real-world RL happening on a model, and it was just right there in one of their blog posts.

Lex Fridman (03:39:36) That’s incredible.

Nathan Lambert (03:39:36) —which is super cool.

Lex Fridman (03:39:38) And by the way, I should say I use Composer a lot because one of the benefits it has is it’s fast.

Nathan Lambert (03:39:43) I need to try it because everybody says this.

Lex Fridman (03:39:45) And there’ll be some IPOs potentially. You think Anthropic, OpenAI, xAI?

Nathan Lambert (03:39:51) They can all raise so much money so easily that they don’t feel a need to… So long as fundraising is easy, they’re not going to IPO because public markets apply pressure.

Nathan Lambert (03:40:00) I think we’re seeing in China that the ecosystem’s a little different, with both MiniMax and Z.ai applying for filing IPO paperwork, which will be interesting to see how the Chinese market reacts. I actually would guess that it’s going to be similarly hypey to the US so long as all this is going, and not based on the realities that they’re both losing a ton of money. I wish more of the American gigantic AI startups were public because it would be very interesting to see how they’re spending their money and have more insight. And also just to give people access to investing in these, because I think that they’re the companies of the era. And the tradition is now for so many of the big startups in the US to not go public.

Nathan Lambert (03:40:43) It’s like we’re still waiting for Stripe and their IPO, but Databricks definitely didn’t; they raised like a Series G or something. And I just feel like it’s a kind of a weird equilibrium for the market where I would like to see these companies go public and evolve in that way that a company can.

Future of OpenAI, Anthropic, Google DeepMind, xAI, Meta

Lex Fridman (03:41:01) You think 10 years from now some of the frontier model companies are still around? Anthropic, OpenAI?

Nathan Lambert (03:41:08) I definitely don’t see it to be a winner-takes-all unless there truly is some algorithmic secret that one of them finds that lets this flywheel. Because the development path is so similar for all of them. Google and OpenAI have all the same products, and Anthropic’s more focused, but when you talk to people, it sounds like they’re solving a lot of the same problems. So I think… there’s offerings that’ll spread out. It’s a very big cake that’s being made that people are going to take money out of.

Lex Fridman (03:41:36) I don’t want to trivialize it, but OpenAI and Anthropic are primarily LLM service— —providers. And some of the other companies like Google and xAI, linked to X, do other stuff— —too. And so it’s very possible if AI becomes more commodified that the companies that are just providing LLMs will die.

Sebastian Raschka (03:42:00) I think the advantage they have is they have a lot of users, and I think they will just pivot. Like Anthropic, I think, pivoted. I don’t think they originally planned to work on code, but it happened that they found, “Okay, this is a nice niche and now we are comfortable in this niche and we push on this niche.” And I can see the same thing once… Let’s say hypothetically speaking, I’m not sure if it will be true, but let’s say Google takes all the market share of the general chatbot. Maybe OpenAI will then be focused on some other sub-topic— —like… They have too many users to go away in the foreseeable future, I think.

Lex Fridman (03:42:37) I think Google is always ready to say, “Hold my beer,” with AI mode.

Nathan Lambert (03:42:40) I think the question is if the companies can support the valuations. I’d see the AI companies being looked at in some ways like AWS, Azure, and GCP, which are all competing in the same space and all very successful businesses. There’s a chance that the API market is so unprofitable that they go up and down the stack to products and hardware. They have so much cash that they can build power plants and build data centers, which is a durable advantage now. But there’s also just a reasonable outcome that these APIs are so valuable and so flexible for developers that they become the likes of something like AWS. But AWS and Azure are also going to have these APIs, so five or six people competing in the API market is hard. So maybe that’s why they get squeezed out.

Lex Fridman (03:43:27) You mentioned “RIP LLaMA.” Is there a path to winning for Meta?

Nathan Lambert (03:43:32) I think nobody knows. They’re moving a lot, so they’re signing licensing deals with Black Forest Labs, which is image generation, or Midjourney. So I think in some ways, on the product and consumer-facing AI front, it’s too early to tell. I think they have some people that are excellent and very motivated being close to Zuckerberg. So I think that there’s still a story to unfold there. Llama is a bit different, where Llama was the most focused expression of the organization. And I don’t see Llama being supported to that extent anymore. I think it was a very successful brand for them, so they still might do some part of participation in the open ecosystem or continue the Llama brand into a different service, because people know what Llama is.

Lex Fridman (03:44:21) You think there’s a Llama 5?

Nathan Lambert (03:44:24) Not an open weight one.

Sebastian Raschka (03:44:26) It’s interesting. Just to recap a bit, I mean, Llama was the pioneering open-weight model—Llama 1, 2, 3, a lot of love. But I think then what happened, just hypothesizing or speculating, is that the leaders at Meta, like the upper executives, got very excited about Llama because they saw how popular it was in the community. And then I think the problem was trying to use the open source to make a bigger splash. It felt forced, like developing these very big Llama 4 models just to be on the top of the benchmarks.

Sebastian Raschka (03:45:09) But I don’t think the goal of Llama models is to be on top of the benchmarks beating, let’s say, ChatGPT or other models. I think the goal was to have a model that people can use, trust, modify, and understand. That includes having smaller models; they don’t have to be the best models. And what happened was just these models were—of course, the benchmarks suggest that they were better than they were because I think they had specific models trained on preferences so that they performed well on the benchmarks. That’s kind of this overfitting thing to force it to be the best. But then at the same time, they didn’t do the small models that people could use, and no one could run these big models.

Sebastian Raschka (03:45:45) And then there was kind of a weird thing. I think it’s just because people got too excited about headlines pushing the frontier. I think that’s it.

Lex Fridman (03:45:54) And too much on the benchmark-sync side.

Sebastian Raschka (03:45:56) It’s too much work.

Nathan Lambert (03:45:57) I think it imploded under internal political fighting and misaligned incentives. The researchers want to build the best models, but there’s a layer of organization— …and management that is trying to demonstrate that they do these things. There are a lot of pieces and rumors where some horrible technical decision was made, and it just seems like it got too bad where it all just crashed out.

Lex Fridman (03:46:24) Yeah, but we should also give huge props to Mark Zuckerberg. I think it comes from Mark actually, from the top of the leadership, saying open source is important. The fact that that exists means there could be a Llama 5, where they learn the lessons from the benchmark-syncing and say, “We’re going to be GPT-OSS—” “…and provide a really awesome library of open source.”

Nathan Lambert (03:46:51) What people say is that there’s a debate between Mark and Alexander Wang, who is very bright but much more against open source. To the extent that he has a lot of influence over the AI org, it seems much less likely, because Mark brought him in for fresh leadership in directing AI. And if being open or closed is no longer the defining nature of the model, I don’t expect that to be a defining argument between Mark and Alex. They’re both very bright, but I have a hard time understanding all of it because Mark wrote this piece in July of 2024, which was probably the best blog post at the time, making the case for open source AI. And then July 2025 came around and it was like, “We’re reevaluating our relationship with open source.” So it’s just kind of…

Sebastian Raschka (03:47:42) But I think also the problem—well, we may have been a bit too harsh, and that caused some of that. I mean, we as open source developers or the open source community. Because even though the model was maybe not what everyone hoped for, it got a lot of backlash. I think that was a bit unfortunate because as a company, they were hoping for positive headlines. Instead of just getting no headlines or positive headlines, they got negative headlines. And then it kind of reflected badly on the company. It’s maybe a spite reaction, almost like, “Okay, we tried to do something nice, we tried to give you something cool like an open source model, and now you are being negative about us.” So in that sense, it looks like, “Well, maybe then we’ll change our mind.” I don’t know.

Lex Fridman (03:48:38) Yeah, that’s where the dynamics of discourse on— …X can lead us as a community astray. Because sometimes it feels random; people pick the things they like and don’t like. I mean, you can see the same thing with Grok 4.1 and Grok Code Fast 1.0. I don’t think, vibe-wise, people love it publicly. But a lot of people use it. So if you look at Reddit and X, they don’t really give it praise from the programming community— … but, like, they use it. And the same thing with probably Llama. I don’t understand the dynamics of either positive hype or negative hype. I don’t understand it.

Nathan Lambert (03:49:25) I mean, one of the stories of 2025 is the US filling the gap of Llama, which is all the rise of these Chinese open-weight models- … to the point where I was like, “That was the single issue I’ve spent a lot of energy on in the last five months,” which is trying to do policy work- … to get the US to invest in this.

Lex Fridman (03:49:41) So just tell me the story of Adam.

Nathan Lambert (03:49:43) Adam Project is… It started as me calling it the American DeepSeek Project, which doesn’t really work for DC audiences, but it’s the story of what is the most impactful thing I can do with my career. These Chinese open-weight models are cultivating a lot of power and there is a lot of demand for building on open models, especially in enterprises in the US that are very cagey about Chinese models.

Lex Fridman (03:50:06) Looking at Perplexity, The Adam Project—American Truly Open Models—is a US-based initiative to build and host high-quality, genuinely open-weight AI models and supporting infrastructure explicitly aimed at competing with and catching up to China’s rapidly advancing open-source AI ecosystem.

Nathan Lambert (03:50:25) I think the one-sentence summary would be that—or two sentences. One is a proposition that open models are going to be an engine for AI research because that is what people start with; therefore, it’s important to own them. And the second one is, therefore, the US should be building the best models so that the best research happens in the US and those US companies take the value from being the home of where AI research is happening. Without more investment in open models, we have all the plots on the website where it’s like, “Qwen, Qwen, Qwen, Qwen,” and it’s all these models that are excellent from these Chinese companies that are cultivating influence in the US and internationally.

Nathan Lambert (03:51:07) And the US is spending way more on AI. The ability to create open models that are half a generation or a generation beyond what the cutting edge of closed labs is costs roughly $100 million, which is a lot of money, but not compared to what these companies have. Therefore, we need a centralizing force of people who want to do this. I think we got signed engagement from people pretty much across the full stack, including policy.

Lex Fridman (03:51:33) So there has been support from the administration?

Nathan Lambert (03:51:36) I don’t think anyone technically in government has signed it publicly, but I know that people that have worked in AI policy, both in the Biden and Trump administrations, are very supportive of trying to promote open-source models in the US. I think, for example, AI2 got a grant from the NSF for $100 million over four years, which is the biggest CS grant the NSF has ever awarded, for AI2 to attempt this, and I think it’s a starting point. But the best results happen when there are multiple organizations building models because they can cross-pollinate ideas and build this ecosystem. It doesn’t work if it’s just Llama releasing models to the world, because Llama could go away. The same thing applies for AI2; I can’t be the only one building models.

Nathan Lambert (03:52:24) It becomes a lot of time spent on talking to people, whether they’re in policy… I know NVIDIA is very excited about this. I think Jensen Huang has been specifically talking about the urgency for this, and they’ve done a lot more in 2025, where the Nemotron 3 models are more of a focus. They’ve started releasing some data along with NVIDIA’s open models and very few companies do this, especially of NVIDIA’s size. So there are signs of progress. We hear about Reflection AI where they say their two-billion-dollar fundraise is dedicated to building US open models, and their announcement tweet reads like a cultural tide starting to turn.

Nathan Lambert (03:53:09) I think in July was when we had four or five DeepSeek-caliber Chinese open-weight models and zero from the US. That’s the moment where I released this and was like, “I guess I have to spend energy on this because nobody else is gonna do it.” So it takes a lot of people contributing together. I’m not saying the Adam Project is the only thing moving the ecosystem, but it’s people like me doing this sort of thing to get the word out.

Manhattan Project for AI

Sebastian Raschka (03:53:35) Do you like the 2025 America’s AI Action Plan? That includes open source stuff. The White House AI Action Plan includes a dedicated section titled “Encourage Open-Source and Open-Weight AI,” defining such models and arguing they have unique value for innovation and startups.

Nathan Lambert (03:53:52) Yeah. I mean, the AI Action Plan is just a plan, but I think it’s maybe the most coherent policy document that has come out of the administration, and I hope that it largely succeeds. I know people that have worked on it. The challenge is taking policy and making it real, and I have no idea how to do this as an AI researcher, but largely a lot of things in that were very real. There’s a huge build-out of AI in the country, and while there are issues people hear about, from water use to whatever, we should be able to build things in this country without ruining places in the process. It’s worthwhile to spend energy on.

Nathan Lambert (03:54:35) I think that’s a role for the federal government. They set the agenda. And setting the agenda so that open-weight should be a first consideration is a large part of what they can do to get people thinking about it.

Sebastian Raschka (03:54:49) Also, for education and talent, it’s very important. Otherwise, if there are only closed models, how do you get the next generation of people contributing? You would only be able to learn after you joined a company, but at that point, how do you identify and hire talented people? I think open source is essential for educating the population and training the next generation of researchers. It’s the only way.

Nathan Lambert (03:55:24) The way that I could’ve gotten this to go more viral was to tell a story of Chinese AI integrating with an authoritarian state, becoming ASI and taking over the world, and therefore we need our own American models. But it’s very intentional why I talk about innovation and science in the US, because I think it’s both more realistic as an outcome and it’s a world that I would like to manifest.

Sebastian Raschka (03:55:47) I would say, though, that any open-weight model is a valuable model.

Nathan Lambert (03:55:55) Yeah. And my argument is that we should be in a leading position. But I think it’s worth saying it so simply because there are still voices in the AI ecosystem that say we should consider banning the release of open models due to the safety risks. And I think it’s worth adding that, effectively, that’s impossible without the US having its own Great Firewall, which is known to not work that well. The cost for training these models, whether it’s one to a hundred million dollars, is attainable to a huge amount of people in the world that want to have influence, so these models will be getting trained all over the world. We want this information and these tools to flow freely across the world and into the US so that people can use them and learn from them.

Nathan Lambert (03:56:47) Stopping that would be such a restructuring of our internet that it seems impossible.

Sebastian Raschka (03:56:51) Do you think maybe the big open-weight models from China are actually a good thing for US companies? You mentioned earlier they are usually one generation behind in terms of what they release open source. For example, gpt-oss-120b might not be the cutting-edge model, or Gemini 3 might not be, because they want to ensure it is safe. But when these companies see that DeepSeek-V3.2 is really awesome and is being used with no backlash or security risk, that could encourage them to release better models. Maybe that is a very positive thing.

Nathan Lambert (03:57:30) A hundred percent. These Chinese companies have set things into motion that I think would potentially not have happened if they were not all releasing models. I’m almost sure that those discussions have been had by leadership.

Sebastian Raschka (03:57:45) Is there a possible future where the dominant models, AI models in the world are all open source?

Nathan Lambert (03:57:50) Depends on the trajectory of progress that you predict. If you think saturation in progress is coming within a few years, essentially within the time where financial support is still very good, then open models will be so optimized and so much cheaper to run that they’ll win out. Essentially, this goes back to open source ideas where so many more people will be putting money into optimizing the serving of these open-weight common architectures that they will become standards. Then you could have chips dedicated to them and it’ll be way cheaper than the offerings from these closed companies that are custom.

Sebastian Raschka (03:58:25) We should say that the AI2027 report predicts—one of the things it does from a narrative perspective is that there will be a lot of centralization. As the AI system gets smarter and smarter, the national security concerns will come to be, and you’ll centralize the labs, and you’ll become super secretive, and there’ll be this whole race.

Lex Fridman (03:58:45) …from a military perspective of how you… between China and the United States. And so all of these fun conversations we’re having about LLMs—all the generals and soldiers will come into the room and be like, “All right, we’re now in the Manhattan Project stage of this whole thing.”

Sebastian Raschka (03:59:02) I think 2025, ’26, ’27—I don’t think something like that is even remotely possible. I mean, you can make the same argument for computers, right? You can say, “Okay, computers are capable and we don’t want the general public to get them.” Or chips—even AI chips—but you see how Huawei makes chips now. It took a few years, but… and I don’t think there is a way you can contain knowledge like that. I think in this day and age, it is impossible, like the internet. I don’t think this is a possibility.

Nathan Lambert (03:59:37) On the Manhattan Project thing, one of my funny things looking at them is I think that a Manhattan Project-like thing for open models would actually be pretty reasonable, because it wouldn’t cost that much. But I think that that will come. It seems like culturally, the companies are changing. But I agree with Sebastian on all of the stuff that he just said. It’s just like, I don’t see it happening nor being helpful.

Lex Fridman (03:59:58) Yeah. I mean, the motivating force behind the Manhattan Project was that there was civilizational risk. It’s harder to motivate that for open-source models.

Nathan Lambert (04:00:08) There’s not civilizational risk.

Future of NVIDIA, GPUs, and AI compute clusters

Lex Fridman (04:00:10) On the hardware side, we mentioned NVIDIA a bunch of times. Do you think Jensen and NVIDIA are going to keep winning?

Sebastian Raschka (04:00:18) I think they have the downside that they have to iterate a lot and manufacture a lot. And what they’re doing—they do innovate, but I think there’s always the chance that there is someone who does something fundamentally different, who gets very lucky and then does something. But the problem is, I think, adoption. You know, the moat of NVIDIA is probably not just the GPU; it’s more like the CUDA ecosystem, and that has evolved over two decades. I mean, even back when I was a grad student, I was in a lab doing biophysical simulations, molecular dynamics, and we had a Tesla GPU back then just for the computations. It was fifteen years ago now.

Sebastian Raschka (04:01:01) They built this up for a long time and that’s the moat, I think. It’s not the chip itself. Although they have the money now to iterate, build, and scale, it’s really on the compatibility. If you’re at that scale as a company, why would you go with something risky where it’s only— … a few chips that they can make per year? You go with the big one. But then I do think with LLMs now, it will be easier to design something like CUDA. It took 15 years because it was hard, but now that we have LLMs, we can maybe replicate CUDA.

Lex Fridman (04:01:35) And I wonder if there will be a separation of the training and the inference- … compute, as we stabilize a bit more and more compute is needed for inference.

Nathan Lambert (04:01:47) That’s supposed to be the point of the Groq acquisition. And that’s why part of what Vera Rubin is—

Nathan Lambert (04:01:52) … where they have a new chip with no high-bandwidth memory, or very little, which is one of the most expensive pieces. It’s designed for pre-fill, which is the part of inference where you essentially do a lot of matrix multiplications, and then you only need the memory when you’re doing this autoregressive generation and you have the KV cache swaps. So they have this new GPU that’s designed for that specific use case, and then the cost of ownership per flop is actually way lower. But I think that NVIDIA’s fate lies in the diffusion of AI still. Their biggest clients are still these hyperscale companies, whether it’s Google—which obviously can make TPUs—Amazon making Trainium, or Microsoft trying to do its own things.

Nathan Lambert (04:02:36) As long as the pace of AI progress is high, NVIDIA’s platform is the most flexible and people will want that. But if there’s stagnation, then with creating bespoke chips, there’s more time to do it.

Lex Fridman (04:02:50) It’s interesting that NVIDIA is, is quite active in trying to develop all kinds of different products.

Nathan Lambert (04:02:55) They try to create areas of commercial value that will use a lot of GPUs.

Lex Fridman (04:03:01) But they keep innovating and they’re doing a lot of incredible research, so…

Nathan Lambert (04:03:06) Everyone says the company’s super oriented around Jensen and how operationally plugged in he is. It sounds so unlike many other big companies that I’ve heard about. And so long as that’s the culture, I think that you can expect that to keep progress happening. It’s like he’s still in the Steve Jobs era of Apple. So long as that is how it operates, I’m pretty optimistic for their situation because it is their top-order problem, and I don’t know if making these chips for the whole ecosystem is the top goal of all these other companies. They’ll do a good job, but it might not be as good of a job.

Lex Fridman (04:03:43) Since you mentioned Jensen, I’ve been reading a lot about history and about singular figures in history. What do you guys think about the great man view of history? How important are individuals for steering the direction of history in the tech sector? So, you know, what’s NVIDIA without Jensen? You mentioned Steve Jobs. What’s Apple without Steve Jobs? What’s xAI without Elon or DeepMind without Demis?

Nathan Lambert (04:04:11) People make things earlier and faster, whereas scientifically, many great scientists credit being in the right place at the right time. Eventually someone else will still have the idea. So I think that in that way, Jensen is helping manifest this GPU revolution much faster and much more focused than it would be without having a person like him there. This is making the whole AI build-out faster. But I do still think that eventually something like ChatGPT would have happened and a build-out like this would have happened, but it probably would not have been as fast. I think that’s the sort of flavor that is applied.

Sebastian Raschka (04:04:55) These individual people are placing bets on something. Some get lucky, some don’t. But if you don’t have these people at the helm, it would be more diffused. It’s almost like investing in an ETF versus individual stocks. Individual stocks might go up or down more heavily than an ETF, which is more balanced. We’ll eventually get there, but I just think the focus is the thing. Passion and focus.

Lex Fridman (04:05:19) Isn’t there a real case to be made that without Jensen, there’s not a reinvigoration of the deep learning revolution?

Nathan Lambert (04:05:26) It could’ve been 20 years later, is the thing I would say.

Lex Fridman (04:05:29) Yeah, 20 is…

Nathan Lambert (04:05:30) Or another deep learning winter could have come… …If GPUs weren’t around.

Lex Fridman (04:05:35) That could change history completely because you could think of all the other technologies that could’ve come in the meantime, and the focus of human civilization would get… Silicon Valley would be captured by different hype.

Sebastian Raschka (04:05:48) But I do think there’s certainly an aspect where the GPU trajectory was all planned. But on the other end, it’s also a lot of lucky coincidences or good intuition. Like the investment into, let’s say, biophysical simulations. I mean, I think it started with video games and then it just happened to be good at linear algebra because video games require a lot of linear algebra. And then you have the biophysical simulations. But still, I don’t think the master plan was AI. I think it just happened to be Alex Krizhevsky. So someone took these GPUs and said, “Hey, let’s try to train a neural network on that.” It happened to work really well and… …I think it only happened because you could purchase those GPUs.

Nathan Lambert (04:06:30) Gaming would’ve created a demand for faster processors if… …NVIDIA had gone out of business in the early days. That’s what I would think. I think GPUs would still exist… …At the time of AlexNet and at the time of the Transformer. It was just hard to know if it would be one company as successful or multiple smaller companies with worse chips. But I don’t think that’s a 100-year delay. It might be a decade delay.

Lex Fridman (04:07:01) Well, it could be a one, two, three, four, five-decade delay. I mean, I just can’t see Intel or AMD doing what NVIDIA did.

Nathan Lambert (04:07:08) I don’t think it would be a company that exists.

Sebastian Raschka (04:07:11) A new company.

Nathan Lambert (04:07:11) I think it would be a different company that would rise.

Sebastian Raschka (04:07:13) Like Silicon Graphics or something.

Nathan Lambert (04:07:15) So yeah, some company that has died would have done it.

Lex Fridman (04:07:19) But looking at it, it seems like these singular figures, these leaders, have a huge impact on the trajectory of the world. Obviously, incredible teams are behind them. But, you know, having that kind of very singular, almost dogmatic focus- …is necessary to make progress.

Sebastian Raschka (04:07:40) Yeah, I mean, even with GPT, it wouldn’t exist if there wasn’t a person, Ilya, who pushed for this scaling, right?

Nathan Lambert (04:07:47) Yeah, Dario was also deeply involved in that. If you read some of the histories from OpenAI, it almost seems wild thinking about how early these people were like, “We need to hook up 10,000 GPUs and take all of OpenAI’s compute and train one model.” There were a lot of people there that didn’t want to do that.

Future of human civilization

Lex Fridman (04:08:02) Which is an insane thing to believe—to believe scaling before scaling has any indication that it’s going to materialize. Again, singular figures. Speaking of which, 100 years from now, this is presumably post-singularity, whatever the singularity is. When historians look back at our time now, what technological breakthroughs would they really emphasize as the breakthroughs that led to the singularity? So far we have Turing to today, which is 80 years.

Sebastian Raschka (04:08:36) I think it would still be computing, like the umbrella term “computing.” I don’t necessarily think that even 100 or 200 years from now it would be AI. It could still well be computers, you know? We are now taking better advantage of computers, but it’s the fact of computing.

Lex Fridman (04:08:53) It’s basically a Moore’s Law kind of discussion. Even the details of CUDA and GPUs won’t even be remembered, and there won’t be all this software turmoil. It’ll be just, obviously, compute.

Nathan Lambert (04:09:07) I generally agree, but is it the connectivity of the internet and compute able to be merged? Or is it both of them?

Sebastian Raschka (04:09:17) I think the internet will probably be related to communication—it could be a phone, internet, or a satellite. And compute is more like the scaling aspect of it.

Lex Fridman (04:09:29) It’s possible that the internet is completely forgotten. That the internet is wrapped into the phone networks, like communication networks. This is just another manifestation of that, and the real breakthrough comes from just the increased compute—Moore’s Law, broadly defined.

Nathan Lambert (04:09:46) Well, I think the connection of people is very fundamental to it. You want to find the best person in the world for something, they are somewhere in the world. Being able to have that flow of information—AIs will also rely on this. I’ve been fixating on when I said the dream was dead about the one central model; the thing that is evolving is that people have many agents for different tasks. People already started doing this with different Clouds for different tasks. It’s described as many AGIs in the data center where each one manages and they talk to each other. That is so reliant on networking and the free flow of information on top of compute. But networking, especially with GPUs, is such a part of the scaling of compute. The GPUs and the data centers need to talk to each other.

Lex Fridman (04:10:36) Do you think there’s something very specific and singular to the fact that it’s neural networks that’s seen as a breakthrough? Like a genius move where you’re basically replicating, in a very crude way, the structure of the human brain, the human mind?

Sebastian Raschka (04:10:54) I think without the human mind, we probably wouldn’t have neural networks because it was an inspiration for them. But on the other end, I think it’s just so different. I mean, it’s digital versus biological, so I think it will probably be more grouped as an algorithm.

Lex Fridman (04:11:11) That’s massively parallelizable— —on this particular kind of compute?

Sebastian Raschka (04:11:15) It could have well been genetic computing, like genetic algorithms, just parallelized. It just happens that this is more efficient and works better.

Lex Fridman (04:11:23) And it very well could be that the neural networks, the way we architect them now, are just a small component of the system that leads to the singularity.

Nathan Lambert (04:11:33) I think if you think of it over 100 years, society can be changed more with more compute and intelligence because of autonomy. But looking at this, what are the things from the Industrial Revolution that we remember? We remember the engine—it is probably the equivalent of the computer in this. But there’s a lot of other physical transformations that people are aware of, like the cotton gin and all these machines that are still known—air conditioning, refrigerators— Some of these things from AI will still be known; the word “transformer” could still very well be known. I would guess that deep learning is definitely still known, but the transformer might be evolved away from in 100 years with AI researchers everywhere. But I think deep learning is likely to be a term that is remembered.

Lex Fridman (04:12:28) And I wonder what the air conditioning and the refrigeration of the future is that AI brings. If we travel forward 100 years from now, what do you think is different? How does the world look? First of all, do you think there’s humans? Do you think there’s robots everywhere walking around?

Sebastian Raschka (04:12:46) I do think there will be specialized robots for certain tasks.

Lex Fridman (04:12:49) Humanoid form?

Sebastian Raschka (04:12:50) Maybe half-humanoid. We’ll see. I think for certain things, yes, there will be humanoid robots because it’s just amenable to the environment. But for certain tasks, it might not make sense. What’s harder to imagine is how we interact with devices and what humans do with them. I’m pretty sure it will not be the cellphone or the laptop. Will it be implants?

Lex Fridman (04:13:16) I mean, it has to be brain-computer interfaces, right? I mean, 100 years from now, it has to—given the progress we’re seeing now— —there has to be, unless there’s legitimately a complete alteration of how we interact with reality.

Sebastian Raschka (04:13:33) On the other hand, if you think of cars, cars are older than 100 years, right? And it’s still the same interface. We haven’t replaced cars with something else; we just made them better. But it’s still a steering wheel, it’s still wheels.

Nathan Lambert (04:13:45) I think we’ll still carry around a physical brick of compute— —because people want some ability to have a private interface. You might not engage with it as much as a phone, but having something where you could have private information that is yours as an interface between you and the rest of the internet is something I think will still exist. It might not look like an iPhone, and it might be used a lot less, but I still expect people to carry things around.

Lex Fridman (04:14:08) Why do you think the smartphone is the embodiment of privacy? There’s a camera on it. There’s a-

Nathan Lambert (04:14:15) Private for you, like encrypted messages, encrypted photos; you know what your life is. I guess this is a question of how optimistic you are on brain-machine interfaces. Is all that just going to be stored in the cloud, like your whole calendar? It’s hard to think about processing all the information that we can process visually through brain-machine interfaces presenting something like a calendar to you. It’s hard to just think about knowing your email inbox without looking. Like you signal to a computer and then you just know your email inbox. Is that something that the human brain can handle being piped into it non-visually? I don’t know exactly how those transformations happen. ‘Cause humans aren’t changing in 100 years.

Nathan Lambert (04:15:05) I think agency and community are things that people actually want.

Lex Fridman (04:15:09) A local community, yeah.

Nathan Lambert (04:15:10) So, like, people you are close to, being able to do things with them and being able to ascribe meaning to your life. I don’t think that human biology is changing away from those on a timescale that we can discuss. UBI does not solve agency. I do expect mass wealth, and I hope that it has spread so that the average life does look very different in 100 years. But that’s still a lot to happen in 100 years. If you think about countries that are early in their development process, to build all the infrastructure and have policy that shares one nation’s wealth with another is… I think it’s an optimistic view to see all that happening in 100 years- …while they are still independent entities and not just absorbed into some international order by force.

Lex Fridman (04:16:13) But there could be just better, more elaborate, more effective- …social support systems that help alleviate some levels of basic suffering from the world. With the transformation of society where a lot of jobs are lost in the short term, I think we have to really remember that each individual job that’s lost is a human being who’s suffering. When jobs are lost at scale, it is a real tragedy. You can make all kinds of arguments about economics or say it’s all going to be okay and good for the GDP because new jobs will be created, but fundamentally at the individual level for that human being, that’s real suffering. That’s a real personal tragedy.

Lex Fridman (04:16:58) And we have to not forget that as the technologies are being developed. Also, my hope for all the AI slop we’re seeing is that there will be a greater and greater premium for the fundamental aspects of the human experience that are in-person. The things that we all enjoy, like seeing each other and talking together in-person.

Nathan Lambert (04:17:22) The next few years are definitely going to see an increased value on physical goods and events- …and even more pressure from slop. The slop is only starting. The next few years will be more and more diverse-

Lex Fridman (04:17:37) Do you think we’ll all be drow-

Nathan Lambert (04:17:37) …versions of slop.

Lex Fridman (04:17:38) They would be drowning in slop. Is that what-

Nathan Lambert (04:17:40) So I’m hoping that society drowns in slop enough to snap out of it and be like, “We can’t. It just doesn’t matter. We all can’t deal with it.” And then, the physical has such a higher premium on it.

Sebastian Raschka (04:17:53) Even like classic examples, I honestly think this is true, and I think we will get tired of it. We are already kind of tired of it. Same with art. I don’t think art will go away. I mean, you have physical paintings. There’s more value, not just monetary value, but just more appreciation for the actual painting than a photocopy of that painting. It could be a perfect digital reprint, but there is something when you go to a museum and you look at that art and you see that real thing and you just think about, “Okay, a human.” It’s like a craft. You have an appreciation for that.

Sebastian Raschka (04:18:25) And I think the same is true for writing, for talking, for any type of experience, where it will be… I do unfortunately think it will be like a dichotomy, like a fork where some things will be automated. There are not as many paintings as there used to be 200 years ago. There are more photographs, more photocopies. But at the same time, it won’t go away. There will be value in that. I think that the difference will just be what’s the proportion of that. But personally, I have a hard time reading things where I see it’s obviously AI-generated. I’m sorry, there might be really good information there, but I have a certain feeling, like, it’s not for me.

Nathan Lambert (04:19:08) I think eventually they’ll fool you, and it’ll be on platforms that give ways of verifying or building trust. So you will trust that Lex is not AI-generated, having been here. So then you have trust in this- -channel. But it’s harder for new people- -that don’t have that trust.

Sebastian Raschka (04:19:25) Well, that will get interesting because I think fundamentally it’s a solvable problem by having trust in certain outlets that they won’t do it, but it’s all going to be kind of trust-based. There will be some systems to authorize, “Okay, this is real. This is not real.” There will be some telltale signs where you can obviously tell this is AI-generated and this is not. But some will be so good that it’s hard to tell, and then you have to trust. And that will get interesting and a bit problematic.

Nathan Lambert (04:19:54) The extreme case of this is to watermark all human content. So all photos that we take on our own- -have some watermark until they- -are edited- -or something like this. And software can manage communications with the device manufacturer- -to maintain human editing, which is the opposite of the discussion to try to watermark AI images. And then you can make a Google image that has a watermark and use a different Google tool to remove the watermark.

Sebastian Raschka (04:20:20) Yeah. It’s going to be an arms race, basically.

Lex Fridman (04:20:23) And we’ve been mostly focusing on the positive aspects of AI. I mean, all the capabilities that we’ve been talking about can be used to destabilize human civilization with even just relatively dumb AI applied at scale, and then further and further, superintelligent AI systems. Of course, there’s the sort of doomer take that’s important to consider a little bit as we develop these technologies. What gives you hope about the future of human civilization? Everything we’ve been talking about—are we going to be okay?

Nathan Lambert (04:20:59) I think we will. I’m definitely a worrier both about AI and non-AI things, but humans do tend to find a way. I think that’s what humans are built for—to have community and find a way to figure out problems. And that’s what has gotten us to this point. I think the AI opportunity and related technologies is really big. I think that there are big social and political problems to help everybody understand that. I think that’s what we’re staring at a lot of right now; the world is a scary place, and AI is a very uncertain thing. And it takes a lot of work that is not necessarily building things. It’s like telling people and understanding people, things that the people building AI are historically not motivated or wanting to do.

Nathan Lambert (04:21:50) But it is something that is probably doable. It just will take longer than people want. And we have to go through that long period of hard, distraught AI discussions if we want to have the lasting benefits.

Lex Fridman (04:22:04) Yeah. Through that process, I’m especially excited that we get a chance to better understand ourselves at the individual level as humans and at the civilization level, and answer some of the big mysteries, like what is this whole consciousness thing going on here? It seems to be truly special. Like, there’s a real miracle in our mind. And AI puts a mirror to ourselves and we get to answer some of the big questions about what is this whole thing going on here.

Sebastian Raschka (04:22:35) Well, one thing about that is also what I do think makes us very different from AI and why I don’t worry about AI taking over is, like you said, consciousness. We humans, we decide what we want to do. AI in its current implementation, I can’t see it changing. You have to tell it what to do. And so you still have the agency. It doesn’t take the agency from you because it becomes a tool. You tell it what to do. It will be more automatic than other previous tools. It’s certainly more powerful than a hammer, it can figure things out, but it’s still you in charge, right? So the AI is not in charge, you’re in charge. You tell the AI what to do and it’s doing it for you.

Lex Fridman (04:23:17) So in the post-singularity, post-apocalyptic war between humans and machines, you’re saying humans are worth fighting for?

Sebastian Raschka (04:23:27) 100%. I mean, the movie Terminator, they made in- -the ’80s, essentially, and I do think the only thing I can see going wrong is, of course, if things are explicitly programmed to do things that are harmful.

Lex Fridman (04:23:43) I think actually in a Terminator type of setup, I think humans win. I think we’re too clever. It’s hard to explain how we figure it out, but we do. And we’ll probably be using local LLMs, open source LLMs, to help fight the machines. I apologize for the ridiculousness. Like I said, Nathan, I’ve already been a big fan of yours for a long time. And I’ve been a big fan of yours, Sebastian, for a long time, so it’s an honor to finally meet you. Thank you for everything you put out into the world. Thank you for the excellent books you’re writing. Thank you for teaching us. And thank you for talking today. This was fun.

Sebastian Raschka (04:24:26) Thank you for inviting us here and having this human connection, which is actually-

Lex Fridman (04:24:30) -extremely valuable- -human connection. Thanks for listening to this conversation with Sebastian Raschka and Nathan Lambert. To support this podcast, please check out our sponsors in the description, where you can also find links to contact me, ask questions, give feedback, and so on. And now let me leave you with some words from Albert Einstein: “It is not that I’m so smart, but I stay with the questions much longer.” Thank you for listening, and hope to see you next time.

Lex Fridman

Research Scientist at MIT. Host of Lex Fridman Podcast.

Transcript for State of AI in 2026: LLMs, Coding, Scaling Laws, China, Agents, GPUs, AGI | Lex Fridman Podcast #490

Table of Contents