This week, OpenAI launched what its chief executive, Sam Altman, called “the smartest model in the world”: a generative-AI program whose capabilities are supposedly far greater, and more closely approximate how humans think, than those of any such software preceding it. The start-up has been building toward this moment since September 12, a day that, in OpenAI’s telling, set the world on a new path toward superintelligence.
That was when the company previewed early versions of a series of AI models, known as o1, constructed with novel methods that the start-up believes will propel its programs to unseen heights. Mark Chen, then OpenAI’s vice president of research, told me a few days later that o1 is fundamentally different from the standard ChatGPT because it can “reason,” a hallmark of human intelligence. Shortly thereafter, Altman pronounced “the dawn of the Intelligence Age,” in which AI helps humankind fix the climate and colonize space. As of yesterday afternoon, the start-up has released the first complete version of o1, with fully fledged reasoning powers, to the public. (The Atlantic recently entered into a corporate partnership with OpenAI.)
On the surface, the start-up’s latest rhetoric sounds just like the hype the company has built its $157 billion valuation on. Nobody on the outside knows exactly how OpenAI makes its chatbot technology, and o1 is its most secretive release yet. The mystique draws interest and investment. “It’s a magic trick,” Emily M. Bender, a computational linguist at the University of Washington and prominent critic of the AI industry, recently told me. An average user of o1 might not notice much of a difference between it and the default models powering ChatGPT, such as GPT-4o, another supposedly major update released in May. Although OpenAI marketed that product by invoking its lofty mission (“advancing AI technology and ensuring it is accessible and beneficial to everyone,” as though chatbots were medicine or food), GPT-4o hardly transformed the world.
Read: The AI boom has an expiration date
But with o1, something has shifted. Several independent researchers, while less ecstatic, told me that the program is a notable departure from older models, representing “a completely different ballgame” and “genuine improvement.” Even if these models’ capacities prove not much greater than their predecessors’, the stakes for OpenAI are. The company has recently dealt with a wave of controversies and high-profile departures, and model improvement in the AI industry overall has slowed. Products from different companies have become indistinguishable (ChatGPT has much in common with Anthropic’s Claude, Google’s Gemini, and xAI’s Grok), and firms are under mounting pressure to justify the technology’s tremendous costs. Every competitor is scrambling to figure out new ways to advance its products.
Over the past several months, I’ve been trying to discern how OpenAI perceives the future of generative AI. Stretching back to this spring, when OpenAI was eager to promote its efforts around so-called multimodal AI, which works across text, images, and other types of media, I’ve had multiple conversations with OpenAI employees, conducted interviews with external computer and cognitive scientists, and pored over the start-up’s research and announcements. The release of o1, in particular, has provided the clearest glimpse yet at what sort of synthetic “intelligence” the start-up and companies following its lead believe they are building.
The company has been unusually direct that the o1 series is the future: Chen, who has since been promoted to senior vice president of research, told me that OpenAI is now focused on this “new paradigm,” and Altman later wrote that the company is “prioritizing” o1 and its successors. The company believes, or wants its users and investors to believe, that it has found some fresh magic. The GPT era is giving way to the reasoning era.
Last spring, I met Mark Chen in the renovated mayonnaise factory that now houses OpenAI’s San Francisco headquarters. We had first spoken a few weeks earlier, over Zoom. At the time, he led a team tasked with tearing down “the big roadblocks” standing between OpenAI and artificial general intelligence: a technology smart enough to match or exceed humanity’s brainpower. I wanted to ask him about an idea that had been a driving force behind the entire generative-AI revolution up to that point: the power of prediction.
The large language models powering ChatGPT and other such chatbots “learn” by ingesting unfathomable volumes of text, determining statistical relationships between words and phrases, and using those patterns to predict what word is most likely to come next in a sentence. These programs have improved as they’ve grown (taking on more training data, more computer processors, more electricity), and the most advanced, such as GPT-4o, are now able to draft work memos and write short stories, solve puzzles and summarize spreadsheets. Researchers have extended the premise beyond text: Today’s AI models also predict the grid of adjacent colors that cohere into an image, or the series of frames that blur into a film.
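The statistical core of that idea can be shown with a toy sketch. The Python below is my own illustration, not anything resembling OpenAI’s systems: it builds a crude bigram model, counting which word follows which in a tiny made-up corpus and then predicting the most likely next word. Real models do this over trillions of words with billions of learned parameters rather than raw counts.

```python
from collections import Counter, defaultdict

# A toy "training corpus" -- real models ingest unfathomably more text.
corpus = "the cat sat on the mat and the cat slept on the mat".split()

# For each word, count which words follow it and how often.
next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def predict_next(word):
    """Return the statistically most likely next word seen during 'training'."""
    counts = next_word_counts[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # -> "cat" (ties between "cat" and "mat" go to the first seen)
```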
The claim is not just that prediction yields useful products. Chen claims that “prediction leads to understanding”: that to complete a story or paint a portrait, an AI model actually has to discern something fundamental about plot and personality, facial expressions and color theory. Chen noted that a program he designed a few years ago to predict the next pixel in a grid was able to distinguish dogs, cats, planes, and other sorts of objects. Even earlier, a program that OpenAI trained to predict text in Amazon reviews was able to determine whether a review was positive or negative.
Today’s state-of-the-art models seem to have networks of code that consistently correspond to certain topics, ideas, or entities. In one now-famous example, Anthropic shared research showing that an advanced version of its large language model, Claude, had formed such a network related to the Golden Gate Bridge: a combination of “neurons” that would light up similarly in response to descriptions, mentions, and images of the San Francisco landmark. That research further suggested that AI models can develop an internal representation of such concepts and organize their internal “neurons” accordingly, a step that seems to go beyond mere pattern recognition. “This is why everyone’s so bullish on prediction,” Chen told me: In mapping the relationships between words and images, and then forecasting what should logically follow in a sequence of text or pixels, generative AI seems to have demonstrated the ability to understand content.
The pinnacle of the prediction hypothesis might be Sora, a video-generating model that OpenAI announced in February and which conjures clips, more or less, by predicting and outputting a sequence of frames. Bill Peebles and Tim Brooks, Sora’s lead researchers, told me that they hope Sora will create realistic videos by simulating environments and the people moving through them. (Brooks has since left to work on video-generating models at Google DeepMind.) For instance, producing a video of a soccer match might require not just rendering a ball bouncing off cleats, but developing models of physics, tactics, and players’ thought processes. “As long as you can get every piece of information in the world into these models, that should be sufficient for them to build models of physics, for them to learn how to reason like humans,” Peebles told me. Prediction would thus give rise to intelligence. More pragmatically, multimodality may also be simply about the pursuit of data: expanding from all the text on the web to all the photos and videos as well.
Just because OpenAI’s researchers say their programs understand the world doesn’t mean they do. Generating a cat video doesn’t mean an AI knows anything about cats; it just means it can make a cat video. (And even that can be a struggle: In a demo earlier this year, Sora rendered a cat that had sprouted a third front leg.) Likewise, “predicting a text doesn’t necessarily mean that [a model] is understanding the text,” Melanie Mitchell, a computer scientist who studies AI and cognition at the Santa Fe Institute, told me. Another example: GPT-4 is far better at generating acronyms using the first letter of each word in a phrase than the second, suggesting that rather than understanding the rule for generating acronyms, the model has simply seen far more examples of standard, first-letter acronyms and is shallowly mimicking them. When GPT-4 miscounts the number of r’s in strawberry, or Sora generates a video of a glass of juice melting into a table, it’s hard to believe that either program grasps the phenomena and ideas underlying its outputs.
These shortcomings have led to sharp, even caustic criticism that AI cannot rival the human mind: the models are merely “stochastic parrots,” in Bender’s famous words, or supercharged versions of “autocomplete,” to quote the AI critic Gary Marcus. Altman responded by posting on social media, “I am a stochastic parrot, and so r u,” implying that the human brain is ultimately a sophisticated word predictor, too.
Altman’s is a plainly asinine claim; a bunch of code running in a data center is not the same as a brain. Yet it’s also ridiculous to write off generative AI (a technology that is redefining education and art, at least, for better or worse) as “mere” statistics. Regardless, the disagreement obscures the more important point. It doesn’t matter to OpenAI or its investors whether AI advances to resemble the human mind, or perhaps even whether and how its models “understand” their outputs, only that the products continue to advance.
OpenAI’s new reasoning models show a dramatic improvement over other programs at all sorts of coding, math, and science problems, earning praise from geneticists, physicists, economists, and other experts. But notably, o1 does not appear to have been designed to be better at word prediction.
According to investigations from The Information, Bloomberg, TechCrunch, and Reuters, major AI companies including OpenAI, Google, and Anthropic are finding that the technical approach that has driven the entire AI revolution is hitting a limit. Word-predicting models such as GPT-4o are reportedly no longer becoming reliably more capable, or more “intelligent,” with size. These firms may be running out of high-quality data to train their models on, and even with enough data, the programs are so massive that making them bigger is no longer making them much smarter. o1 is the industry’s first major attempt to clear this hurdle.
When I spoke with Mark Chen after o1’s September debut, he told me that GPT-based programs had a “core gap that we were trying to address.” Whereas previous models were trained “to be very good at predicting what humans have written down in the past,” o1 is different. “The way we train the ‘thinking’ is not through imitation learning,” he said. A reasoning model is “not trained to predict human thoughts” but to produce, or at least simulate, “thoughts on its own.” It follows that, because humans are not word-predicting machines, AI programs cannot remain so either, if they hope to improve.
More details about these models’ inner workings, Chen said, are “a competitive research secret.” But my interviews with independent researchers, a growing body of third-party tests, and hints in public statements from OpenAI and its employees have allowed me to get a sense of what’s under the hood. The o1 series appears “categorically different” from the older GPT series, Delip Rao, an AI researcher at the University of Pennsylvania, told me. Discussions of o1 point to a growing body of research on AI reasoning, including a widely cited paper co-authored last year by OpenAI’s former chief scientist, Ilya Sutskever. To train o1, OpenAI likely put a language model in the style of GPT-4 through a huge amount of trial and error: asking it to solve many, many problems, for instance, and then providing feedback on its approaches. The process might be akin to a chess-playing AI playing a million games to learn optimal strategies, Subbarao Kambhampati, a computer scientist at Arizona State University, told me. Or perhaps a rat that, having run 10,000 mazes, develops a good strategy for choosing among forking paths and doubling back at dead ends.
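Because the actual training recipe is secret, any illustration is necessarily speculative. The sketch below is a toy stand-in for the general trial-and-error idea the researchers describe: the “problems,” the two candidate strategies, and the reward scheme are all invented for demonstration, and bear no resemblance to o1’s real machinery. The point is only the shape of the loop: attempt a verifiable problem, check the answer, and shift preference toward approaches that check out.

```python
import random

# Toy verifiable problems: each has inputs and a checkable answer.
problems = [(a, b, a + b) for a in range(10) for b in range(10)]

# The "model" here is just a preference over two candidate strategies.
strategies = {
    "add": lambda a, b: a + b,               # the correct approach
    "concat": lambda a, b: int(f"{a}{b}"),   # plausible-looking but usually wrong
}
weights = {"add": 1.0, "concat": 1.0}

def sample_strategy():
    """Pick a strategy in proportion to the current preference weights."""
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names])[0]

# Trial and error: attempt problems, verify the answers, adjust preferences.
for a, b, answer in random.sample(problems, 50):
    name = sample_strategy()
    attempt = strategies[name](a, b)
    if attempt == answer:
        weights[name] *= 1.1   # reinforce approaches that verify correctly
    else:
        weights[name] *= 0.9   # discourage approaches that fail

print(weights)  # "add" ends up strongly preferred over "concat"
```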
Read: Silicon Valley’s trillion-dollar leap of faith
Prediction-based bots, such as Claude and earlier versions of ChatGPT, generate words at a roughly constant rate, without pause; they don’t, in other words, evince much thinking. Although you can prompt such large language models to construct a different answer, those programs do not (and cannot) on their own look backward and evaluate what they’ve written for errors. But o1 works differently, exploring different routes until it finds the best one, Chen told me. Reasoning models can answer harder questions when given more “thinking” time, akin to taking more time to consider possible moves at a crucial moment in a chess game. o1 appears to be “searching through lots of potential, emulated ‘reasoning’ chains on the fly,” Mike Knoop, a software engineer who co-founded a prominent contest designed to test AI models’ reasoning abilities, told me. This is another way to scale: more time and resources, not just during training, but also when in use.
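One simple, publicly known version of spending extra compute at answer time is best-of-N sampling: generate several candidate solutions and keep the one that scores best under some check. The sketch below illustrates only that generic idea (the candidate generator and scorer are invented placeholders, not OpenAI’s implementation): with a bigger budget of candidates, the chosen answer tends to be better.

```python
import random

def generate_candidate(question):
    """Stand-in for sampling one 'reasoning chain': a noisy guess at 7 * 8."""
    return 7 * 8 + random.randint(-5, 5)

def score(question, candidate):
    """Stand-in verifier: here, closeness to a checkable ground truth."""
    return -abs(candidate - 56)

def answer(question, n_candidates):
    """More 'thinking' time means more candidates sampled before choosing one."""
    candidates = [generate_candidate(question) for _ in range(n_candidates)]
    return max(candidates, key=lambda c: score(question, c))

# A larger compute budget at answer time tends to yield a better result.
print(answer("What is 7 * 8?", n_candidates=1))    # often off by a few
print(answer("What is 7 * 8?", n_candidates=50))   # almost always exactly 56
```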
Here is another way to think about the distinction between language models and reasoning models: OpenAI’s attempted path to superintelligence is defined by parrots and rats. ChatGPT and other such products, the stochastic parrots, are designed to find patterns among massive amounts of data, to relate words, objects, and ideas. o1 is the maze-running rodent, designed to navigate those statistical models of the world to solve problems. Or, to use a chess analogy: You could play a game based on a bunch of moves that you’ve memorized, but that’s different from genuinely understanding strategy and reacting to your opponent. Language models learn a grammar, perhaps even something about the world, while reasoning models aim to use that grammar. When I posed this dual framework, Chen called it “a good first approximation” and “at a high level, the best way to think about it.”
Reasoning may really be a way to break through the wall that the prediction models seem to have hit; much of the tech industry is certainly rushing to follow OpenAI’s lead. Yet taking a big bet on this approach might be premature.
For all the grandeur, o1 has some familiar limitations. As with primarily prediction-based models, it has an easier time with tasks for which more training examples exist, Tom McCoy, a computational linguist at Yale who has extensively tested the preview version of o1 released in September, told me. For instance, the program is better at decrypting codes when the answer is a grammatically complete sentence instead of a random jumble of words; the former is likely better reflected in its training data. A statistical substrate remains.
François Chollet, a former computer scientist at Google who studies general intelligence and is also a co-founder of the AI reasoning contest, put it a different way: “A model like o1 … is able to self-query in order to refine how it uses what it knows. But it is still limited to reapplying what it knows.” A wealth of independent analyses bear this out: In the AI reasoning contest, the o1 preview improved over GPT-4o but still struggled overall to effectively solve a set of pattern-based problems designed to test abstract reasoning. Researchers at Apple recently found that adding irrelevant clauses to math problems makes o1 more likely to answer incorrectly. For example, when asking the o1 preview to calculate the price of bread and muffins, telling the bot that you plan to donate some of the baked goods (even though that wouldn’t affect their cost) led the model astray. o1 might not deeply understand chess strategy so much as it memorizes and applies broad principles and tactics.
Even if you accept the claim that o1 understands, instead of mimicking, the logic that underlies its responses, the program might actually be further from general intelligence than ChatGPT. o1’s improvements are constrained to specific subjects where you can confirm whether a solution is true, like checking a proof against mathematical laws or testing computer code for bugs. There’s no objective rubric for beautiful poetry, persuasive rhetoric, or emotional empathy with which to train the model. That likely makes o1 more narrowly applicable than GPT-4o, the University of Pennsylvania’s Rao said, which even OpenAI’s blog post announcing the model hinted at, stating: “For many common cases GPT-4o will be more capable in the near term.”
Read: The lifeblood of the AI boom
But OpenAI is taking a long view. The reasoning models “explore different hypotheses like a human would,” Chen told me. By reasoning, o1 is proving better at understanding and answering questions about images, too, he said, and the full version of o1 now accepts multimodal inputs. The new reasoning models solve problems “much like a person would,” OpenAI wrote in September. And if scaling up large language models really is hitting a wall, this kind of reasoning seems to be where many of OpenAI’s rivals are turning next, too. Dario Amodei, the CEO of Anthropic, recently cited o1 as a possible way forward for AI. Google has recently released several experimental versions of Gemini, its flagship model, all of which exhibit some signs of being maze rats: taking longer to answer questions, providing detailed reasoning chains, and improving on math and coding. Both it and Microsoft are reportedly exploring this “reasoning” approach. And multiple Chinese tech companies, including Alibaba, have released models built in the style of o1.
If this is the way to superintelligence, it remains a bizarre one. “This is back to a million monkeys typing for a million years generating the works of Shakespeare,” Emily Bender told me. But OpenAI’s technology effectively crunches those years down to seconds. A company blog boasts that an o1 model scored better than most humans on a recent coding test that allowed participants to submit 50 possible solutions to each problem, but only when o1 was allowed 10,000 submissions instead. No human could come up with that many possibilities in a reasonable length of time, which is exactly the point. To OpenAI, unlimited time and resources are an advantage that its hardware-grounded models have over biology. Not even two weeks after the launch of the o1 preview, the start-up presented plans to build data centers that would each require the power generated by approximately five large nuclear reactors, enough for almost 3 million homes. Yesterday, alongside the release of the full o1, OpenAI announced a new premium tier of subscription to ChatGPT that enables users, for $200 a month (10 times the price of the current paid tier), to access a version of o1 that consumes even more computing power. Money buys intelligence. “There are now two axes on which we can scale,” Chen said: training time and run time, monkeys and years, parrots and rats. So long as the funding continues, perhaps efficiency is beside the point.
The maze rats may hit a wall, eventually, too. In OpenAI’s early tests, scaling o1 showed diminishing returns: Linear improvements on a challenging math exam required exponentially growing computing power. That superintelligence could use so much electricity as to require remaking grids worldwide, and that such extravagant energy demands are, at the moment, causing staggering financial losses, are clearly no deterrent to the start-up or a good chunk of its investors. It’s not just that OpenAI’s ambition and technology fuel each other; ambition, and in turn accumulation, supersedes the technology itself. Growth and debt are prerequisites for and proof of more powerful machines. Maybe there’s substance, even intelligence, underneath. But there doesn’t need to be for this speculative flywheel to spin.