OpenAI’s new Sora just put all other AI video tools on notice — here’s why
OpenAI has unveiled its new artificial intelligence video model, Sora, which makes previous AI video tools look like toys. It has incredible realism and can make consistent, minute-long clips with multiple shots — all from a single text prompt.
You won't be able to use it for a while; according to a company spokesperson I spoke to, "there are safety issues to solve first." But we have been given a glimpse of its impressive abilities.
The realism is so advanced that I’ve seen several posts on X along the lines of “I can’t tell what is real on my feed anymore”. It can sometimes stray into the uncanny valley and look more like a hyper-realistic render in Unreal Engine than footage from a real camera, but it’s still impressive.
But how did OpenAI achieve this “ChatGPT moment for generative video” and what will the other models have to do to catch up? The answer seems to be “raise more money”.
It’s all about the computing power
Since its inception, OpenAI has raised more than $11 billion in funding, most of which has come from Microsoft.
CEO Sam Altman is now on the hunt for up to $7 trillion to create a network of global AI chip factories to service the ever-growing demand for processing power. That is almost as high as the combined GDP of Germany and France.
While the major advances seen in Sora aren't entirely down to money or computing resources, they play a big part.
The first line of the research paper talks about using large-scale training to improve the quality and duration of video from diffusion models.
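OpenAI hasn't shared its code, but the basic recipe being scaled up, a diffusion model learning to remove the noise added to its training data, is well documented. Below is a minimal, illustrative PyTorch sketch of that kind of training step; the tiny model, the simple noising schedule and every name in it are my own stand-ins rather than anything from the Sora paper.

```python
# Minimal sketch of diffusion-style training: the model learns to predict the
# noise mixed into data, and "scaling" mostly means a bigger model, more data
# and vastly more of these steps. All names and sizes here are illustrative.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Toy stand-in for the large transformer a model like Sora would use."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, noisy_x, t):
        # Condition on the diffusion timestep by concatenating it to the input.
        return self.net(torch.cat([noisy_x, t[:, None]], dim=-1))

model = TinyDenoiser()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(100):                      # real training runs are enormously longer
    x0 = torch.randn(32, 64)                 # placeholder for encoded video patches
    t = torch.rand(32)                       # random diffusion time in [0, 1]
    noise = torch.randn_like(x0)
    noisy = (1 - t[:, None]) * x0 + t[:, None] * noise   # simple linear noising schedule
    loss = ((model(noisy, t) - noise) ** 2).mean()       # learn to predict the added noise
    opt.zero_grad(); loss.backward(); opt.step()
```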
Emad Mostaque is the founder and CEO of Stability AI, one of the companies behind Stable Diffusion and a leader in the development of diffusion models. He told me that the work on Sora “proves that you can scale just about any modality.”
Stability AI also works across multiple modalities, including audio, image, video and text, and Mostaque told me during a conversation on X that the company now needs to “get more compute” to compete and reach these same levels.
OpenAI revealed a trio of videos showing the value of increased compute, going from a horrific near-dog creature to a fully realistic dog and human bouncing in the snow.
Longer and more varied clips
Currently it seems to be universal in AI-generated video that clips run at roughly 24 frames per second, last about three seconds and are low HD quality.
Sora came out of the gate with a series of example clips, including some generated in response to requests from users on X, that are up to a minute long and of higher resolution. This is a step change in generated video and promises capabilities similar to those of Google Lumiere.
The other significant difference, which likely comes from the ability to create a longer clip in one hit, is multiple shots within a single generated clip. One fascinating example comes in the form of an astronaut getting ready for launch, with shots cutting between the man and the machine.
Creating a simulation of the whole world
“Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world,” the OpenAI research declares.
This is one of the main goals of all the AI video tools: creating a mechanism for understanding the entire world as humans see it, then using that understanding to create realistic video.
Runway, one of the leading AI video labs, is working on General World Models, writing on X: "We believe the next major advancement in AI will come from systems that understand the visual world and its dynamics, which is why we’re starting a new long-term research effort around general world models."
Even Meta is working on training AI models by having them watch and learn from video. V-JEPA is a new method for teaching machines to understand and model the physical world through videos.
The models are trained with a feature prediction objective; in one example, a video of CEO Mark Zuckerberg playing guitar has the strumming blocked out, and V-JEPA is able to predict it.
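For readers curious what a "feature prediction objective" means in practice, here is a heavily simplified sketch of the general idea: predict the features of masked video patches rather than their pixels. The sizes, names and the separate target encoder are my own assumptions for illustration, not Meta's implementation.

```python
# Illustrative sketch of feature prediction in the spirit of V-JEPA (not Meta's code):
# hide some video patches, then predict their *features*, not their pixels.
import torch
import torch.nn as nn

dim, n_patches, batch = 128, 16, 8
encoder = nn.Linear(dim, dim)           # context encoder (stand-in for a video transformer)
target_encoder = nn.Linear(dim, dim)    # target encoder, typically updated separately, no gradients
predictor = nn.Linear(dim, dim)         # predicts target features for the hidden patches
opt = torch.optim.AdamW(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

patches = torch.randn(batch, n_patches, dim)     # placeholder video patch embeddings
mask = torch.rand(batch, n_patches) < 0.5        # which patches are hidden (e.g. the strumming hand)

context = encoder(patches * (~mask).float()[..., None])   # encode only the visible patches
with torch.no_grad():
    targets = target_encoder(patches)                      # features of the full, unmasked video

pred = predictor(context)
loss = ((pred - targets) ** 2)[mask].mean()    # loss only on masked positions, in feature space
opt.zero_grad(); loss.backward(); opt.step()
```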
What does this mean for the future of AI video?
The dream or nightmare scenario — depending on your perspective — is that you’ll go to Netflix and instead of searching for a movie you’ll type a prompt like “make me a documentary on fictional creatures using the voice of David Attenborough” and it’ll generate one for you.
That is a long way off, although with a few extra steps I was able to make an AI trailer for a similarly fictional show.
The reality is more likely that, much like Adobe has done with generative fill in Photoshop, video editing tools will use AI video to “fill in the gaps” or replace lost shots.
The real benefit is in creating a deeper understanding of the world for AI. Jim Fan, a research scientist and AI agent expert at Nvidia, explained that at its heart Sora is a physics engine, a "simulation of many worlds, real or fantastical", and that the simulator learns intuitive physics, reasoning and grounding.
He predicts that Sora was likely trained on synthetic data, such as the hyper-realistic renders possible with Unreal Engine 5, rather than just on real videos. This would also help it understand the physics, as it would have labeling data for every aspect of the environment.
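To make that point concrete, here is a hypothetical example of the kind of ground truth a synthetic render could ship with for every frame, none of which comes attached to ordinary video footage. The record format below is entirely my own illustration; OpenAI has not described its training data in this detail.

```python
# Hypothetical per-frame record for synthetic training video: a render engine can
# export exact depth, segmentation, camera pose and object motion alongside the image.
from dataclasses import dataclass
import numpy as np

@dataclass
class SyntheticFrame:
    rgb: np.ndarray            # (H, W, 3) rendered image
    depth: np.ndarray          # (H, W) per-pixel distance from the camera
    segmentation: np.ndarray   # (H, W) object ID for every pixel
    camera_pose: np.ndarray    # (4, 4) exact camera position and orientation
    object_velocities: dict    # object ID -> 3D velocity, i.e. the physics itself

frame = SyntheticFrame(
    rgb=np.zeros((720, 1280, 3), dtype=np.uint8),
    depth=np.ones((720, 1280), dtype=np.float32),
    segmentation=np.zeros((720, 1280), dtype=np.int32),
    camera_pose=np.eye(4, dtype=np.float32),
    object_velocities={1: np.array([0.0, -9.8, 0.0])},
)
```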
It also means we could see these video environments turned back into 3D worlds and real-time generation of virtual or game environments for the Vision Pro or Quest headsets.