
A Practical Exploration of Sora — Intuitively and Exhaustively Explained



The functionality of OpenAI’s video generation model, and the theory behind it

“The Eye”, generated by Daniel Warfield using Sora. All images and videos by the author unless otherwise specified. Article originally made available on Intuitively and Exhaustively Explained.

OpenAI recently released Sora, their cutting edge video generation model. With Sora, they also released a web UI that allows users to generate and edit videos in a variety of sophisticated and interesting ways. In this article we’ll be exploring the features of that UI, and how Sora might be enabling those features.

Who is this useful for? Anyone who wants to use AI for video generation, or wants to forge a deeper understanding of artificial intelligence.

How advanced is this post? This post is intended for a general audience, and is designed to be a gentle introduction to artificial intelligence. It also contains some speculation about how Sora works, which might be compelling for more advanced readers.

Pre-requisites: None.

A Tour of Sora

At the time of writing, here’s the main page of the Sora UI:

On the left are some of the typical suspects. The Explore section allows you to browse videos generated by other creators, and the Library section allows you to browse and organize videos you’ve generated.

To actually generate video, you use the text bar… thing at the bottom of the screen. Guides by OpenAI just refer to it as “the bottom of the screen”, and the HTML class used to define it doesn’t exactly roll off the tongue, so I’ll refer to it as the “creation bar” for our purposes.

The creation bar, the main place where you create content in Sora.

The fundamental way of generating video with Sora is to type in some prompt and then press submit.

This will trigger a generation job, which will then appear as a progress bar in two places: in your list of videos and in your notification section at the top of the screen.

Once the generation has been completed, the results can be viewed intuitively throughout the Sora UI.

The result of our prompt “A Rubik’s Cube erupting into a ball of flames”. We can scrub through the video in our library, click to play individual examples, or open and play all generated variants.

You can also upload images which can be used to inform video generation. Here, I’ve uploaded the cover of my article on LLM Routing.

When you upload an image you can optionally add a prompt to describe the video you want, or you can upload the image by itself and let Sora figure out what type of video makes sense.

Four examples of video generated from Sora when no prompt was given. As you can see, very little is actually happening in this example.
An example generation with the image and a textual prompt describing the image. As you can see, these results are a bit more dynamic.

When generating video you can set the aspect ratio (the shape of the video),

the quality (different options available depending on your subscription tier),

the length of the video you want to create (different options for this are also available depending on your subscription tier),

and how many variations of the video you want to generate per submission.

For those wanting more granular control of video generation, you can use the storyboard.

An example of a storyboard, which consists of images, text, and video over time.

With storyboards you can inject images, videos, and text throughout the storyboard to granularly control how frames are generated.

A few examples of the results of the storyboard above

The strategies for prompting video generation models can be very complex. Getting a result of reasonable quality can take a substantial amount of fiddling, both in terms of the content blocks and their temporal position within the storyboard. I’ll talk about the storyboard and some ideas one can employ to get higher quality results in later sections.

Circling back to the creation bar, there’s also an option for presets.

This probably uses a technology called LoRA, which I cover in another article. You can think of a LoRA as a specially trained filter that can be applied to an AI model like Sora.

Here’s a few examples of various generations with the same prompt and different presets:

“A mysterious detective in a moodily lit room” with no preset
“A mysterious detective in a moodily lit room” with the “Balloon World” preset
“A mysterious detective in a moodily lit room” with the “Stop Motion” preset

Once a video’s been generated, there are a few options for editing it.

Options for editing an existing video, at the bottom of the screen when a particular video is selected.

You can “Re-Cut”, which is like normal video cutting, except you can use AI to change the video’s resolution, its length, etc.

“Re-Cut” allows you to do cool stuff like AI upscaling, extending a video by generating more frames, etc.

You can “Remix”, which allows you to generate a new video based on a prompt and an existing video.

Remixing the video with the prompt “A detective made of spaghetti, inspecting photos in moody lighting.”

You can blend between videos in a variety of ways with the “Blend” function.

The result of creating a transition between the detective and eye videos via the Blend tool.

And you can turn your videos into a loop with the “Loop” function. This makes the video loop seamlessly, such that the end of the video perfectly feeds into the beginning.

Creating a loop out of the detective scene

And that’s the essence of Sora; it’s a big box of AI-powered video editing tools. In the following sections we’ll work through each of these functionalities, explore them more in depth, and discuss how they (probably) work.

The Theory of Text to Video Generation

The most fundamental idea of Sora is the “diffusion model”. Diffusion is a modeling approach where the task of generating things like images and videos is thought of as a denoising problem.

Basically, to train a model like Sora, you start with a bunch of videos and a textual description of those videos. Then, you make those videos super noisy.

Adding noise to a video, to varying degrees. From my original article on Sora. The original “Steamboat Willie” rendition of Mickey Mouse is in the public domain because it was published (or registered with the U.S. Copyright Office) before January 1, 1929. Source.

and then you train the model to generate the original videos from noise based on the textual descriptions.

The modeling objective of a diffusion model: to remove noise from images. From my original article on Sora.

By training a model like Sora on a lot of videos (like, millions or billions of hours of video), OpenAI trained a model that’s able to turn text into video through denoising.

A conceptual diagram of Sora. It receives a sequence of noisy images, and outputs a sequence of images as a video.
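
To make that a bit more concrete, here’s a rough sketch in PyTorch of what a training step for a text-conditioned video diffusion model might look like. To be clear, this is my own illustration: the model, the text encoder, and the simple linear noising scheme are stand-ins, and OpenAI hasn’t published how Sora is actually trained.

```python
# A minimal sketch of the denoising objective, with hypothetical stand-ins for
# the model and text encoder. This is an illustration, not Sora's actual setup.
import torch
import torch.nn.functional as F

def training_step(model, text_encoder, video, caption, optimizer):
    # video: (batch, frames, channels, height, width), pixel values scaled to [-1, 1]
    text_emb = text_encoder(caption)                 # condition on the caption
    t = torch.rand(video.shape[0], 1, 1, 1, 1)       # a random noise level per sample
    noise = torch.randn_like(video)
    noisy_video = (1 - t) * video + t * noise        # corrupt the clean video

    pred_noise = model(noisy_video, t, text_emb)     # the model guesses the added noise
    loss = F.mse_loss(pred_noise, noise)             # learn to undo the corruption

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At generation time the process runs in reverse: you start from pure noise and repeatedly ask the model to remove a little of it, conditioned on the prompt.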

And that’s about it. This one strategy of denoising video powers text-to-video generation, image-to-video generation, timelines, blending, and more. OpenAI is achieving all this complex functionality with a single model as a result of some clever tricks, which we’ll be exploring in the following sections.

Sora’s Tricks

Recall that one way to make videos with Sora is by using a timeline editor.

Recall that the timeline allows us to place blocks of content at specific points in time, and generate a video based on that information.

Let’s explore a few tricks that could be used to make this work.

Trick 1) Frame Injection

Recall that diffusion models take in noisy video and output less noisy video.

Using Sora to generate video based on a textual description, from my original article on Sora.

If you have an image, and you want to create a video from that image, you can simply set one of the input frames to be that image.

Using Sora to generate video based on a textual description and an image, from my original article on Sora.

If you have a video you want to generate based on, you can just include that video in the input.

Using Sora to generate video based on a textual description and a video, from my original article on Sora.

The whole idea of Sora is that it understands video and denoises the entire input to make sense based on whatever textual description is provided. So, if you include video in the input to the model, it will naturally denoise the rest of the frames to make sense based on whatever frames you provided.

If you have multiple videos or images in a timeline, you can just put them in the corresponding locations in Sora’s input.

If you add multiple images or videos throughout Sora’s input, it will force the model to reconcile its denoising process with the surrounding frames, from my original article on Sora.

The choice of what frames to add, and where they’re added in the input will drastically impact the output of Sora. I suspect this type of approach is central to some of the more advanced generation approaches, like the storyboard, blending, and looping features within Sora.

Imagine adding the frames at the end of a video to the beginning of Sora’s input, and the beginning of the video to the end of Sora’s input. If you did that, the model would be forced to generate frames which transition from the end to the beginning of your video. This is probably the fundamental idea behind how Sora generates video loops.

So that would explain how Sora deals with storyboards of images and video. When you place various images and videos within a storyboard, those are placed throughout the otherwise noisy input to Sora, and Sora takes those images and videos into account when denoising the rest of the frames.
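
Here’s a small sketch of what frame injection might look like in code. The function names and shapes are hypothetical; the point is just to illustrate swapping known frames into an otherwise noisy input.

```python
# A sketch of "frame injection": start from pure noise, then overwrite the
# positions where the storyboard supplies real frames. Illustrative only.
import torch

def build_sora_input(num_frames, frame_shape, known_frames):
    """known_frames: dict mapping a frame index -> a clean frame tensor of frame_shape."""
    frames = [torch.randn(frame_shape) for _ in range(num_frames)]  # start from pure noise
    keep = torch.zeros(num_frames, dtype=torch.bool)
    for idx, frame in known_frames.items():
        frames[idx] = frame          # inject the storyboard's image/video frame
        keep[idx] = True             # remember which frames must be preserved
    return torch.stack(frames), keep

# At each denoising step the injected frames can be re-imposed, so the model is
# forced to fill in everything around them:
#   x = torch.where(keep[:, None, None, None], injected, denoised_x)
# A loop could be made the same way: inject the last frames of a clip at the
# start of the input and its first frames at the end.
```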

Recall that, on the storyboard, we can place both text and video over time.

The next trick describes how a storyboard of textual descriptions might work.

Trick 2) Gated Cross Attention

I wrote an article on a model called “Flamingo” a while ago. The idea of Flamingo was to create a language model which could generate text based on arbitrary sequences of text and images as an input.

Flamingo generating responses based on an input of both text and images. Blocks in pink are generated by the Flamingo model. Many models which deal with both textual and visual inputs take direct inspiration from Flamingo. Image from the Flamingo paper.

In a lot of ways, Flamingo is similar to Sora. Flamingo accepts arbitrary sequences of images and text and uses that data to generate text. Sora accepts arbitrary sequences of images and text and uses that data to generate videos. Flamingo goes about the general problem of understanding images and text simultaneously by using a special masking strategy in a mechanism called “cross attention”.

This is a fairly advanced concept, but I’ll try to explain it from a high level.

Basically, there’s a mechanism in many modern AI models called attention, which allows vectors that represent words or sections of images to interact with one another to form abstract, meaning-rich representations. These representations allow modern AI models to make their complex inferences.

There are many different flavors of attention, but they all do the same thing: Take a bunch of vectors that represent individual things and allow the model to make an abstract and meaning rich representation based on making those vectors interact with one another. Image from my article on transformers.

AI models use two general flavors of attention: self-attention and cross-attention. Self-attention allows AI models to make complex and abstract representations of an input.

Imagine that we input a sequence of text “I am an input sequence” into an AI model. We might create a vector that represents each word in the sequence, then input those vectors into the model. “Self-attention”, in this context, would make all the atomic and isolated vectors interact with one another. So, the vectors for “I”, “am”, “an”, “input”, and “sequence”, would be allowed to interact with one another, allowing the model to create an understanding of the entire sequence, not just a list of isolated words.

Cross-attention allows different inputs to interact with each other. You can think of cross attention as a sort of filter, where one input is used to filter another input.

You can think of Cross attention as a filter where one input (in this case an image of a cat) is used to filter the information in another input (in this case, the text “What’s in the image”). From my upcoming article on cross attention by hand.
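
To make the “filter” intuition concrete, here’s a bare-bones sketch of cross attention. The weight matrices here are random purely for illustration; in a real model they’re learned. Self-attention is simply the special case where the context is the input itself.

```python
# A minimal cross-attention sketch: one input (the queries) is "filtered"
# through another input (the keys and values). Shapes are illustrative.
import torch

def cross_attention(x, context, d_k=64):
    # x:       (len_x, d_model)  the vectors being updated (e.g. video frames)
    # context: (len_c, d_model)  the vectors doing the filtering (e.g. text)
    W_q = torch.randn(x.shape[-1], d_k)       # learned in practice, random here
    W_k = torch.randn(context.shape[-1], d_k)
    W_v = torch.randn(context.shape[-1], d_k)

    Q = x @ W_q              # queries come from one input...
    K = context @ W_k        # ...keys and values come from the other
    V = context @ W_v

    scores = (Q @ K.T) / d_k ** 0.5   # how relevant each context vector is to each query
    weights = scores.softmax(dim=-1)
    return weights @ V                # each query becomes a weighted mix of the context
```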

Models that deal with multiple types of data, like Flamingo and Sora, typically use both self-attention and cross attention. Flamingo, for instance, uses cross attention to introduce image data into the representation for text, and uses self attention to reason about text.

In the Flamingo paper, as I discuss in my article on Flamingo, cross attention is used to allow a model to reason about both language and image data. Image from the Flamingo paper.

The whole idea of cross attention is to allow textual information to interact with image information, so you might assume that all the image information should interact with all the textual information, letting the model reason about every image based on all of the text. However, the authors of the Flamingo paper only allow text to interact with the immediately preceding image, not all the images, within the cross-attention mechanism. This is called “gated” or “masked” cross attention.

This is kind of an advanced topic. Feel free to refer to my article on Flamingo for more information, but the basic idea is that, instead of all image data being able to interact with all text data, the text data can only interact with the immediately preceding image data. So the text “My puppy sitting in the grass” can interact with the dog image, and the text “My cat looking very dignified” can interact with the cat photo. Image from the Flamingo paper.

This works because, within the model, there are also instances of self-attention. So, even though the image data can’t directly interact with all the text data, it can do so indirectly.

Flamingo also uses multi-headed self-attention. So, even though certain pieces of text can only interact with certain image data directly through cross attention, all text vectors can interact within self-attention. As a result, the model can learn to propagate information from the images as needed between its various cross-attention and self-attention layers. Imagine the word vectors in this image have already had an opportunity to interact with various images via gated cross attention.

If I fed a list of captioned images into a model using both gated cross-attention and self-attention, then asked the question “What was in the first image?”, cross-attention would allow the image data to be injected into some of the text data, and self-attention would then allow all of the text data to interact such that the model could answer the question.

When a model like Flamingo uses gated cross-attention and self-attention in the same model, even though image information is only exposed to certain text vectors in cross attention, those vectors can interact with each other in self-attention, resulting in vectors that contain a mix of information from both images.

It’s possible that Sora employs a similar strategy, except instead of applying image data throughout text, it may apply text data throughout a sequence of images. For each span of generated frames, we can allow a certain piece of text to attend to those frames.

Just like with Flamingo, even if gated cross attention is used, self-attention allows the model to move textual information throughout the frames. These mixed vectors would be used to denoise the frames and generate the final video.

Even though certain pieces of text only directly interact with a small set of frames, they can interact indirectly with all frames via self attention, allowing the model to make a cohesive video based on all text inputs.
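
Here’s a sketch of how that gating might be wired up for a storyboard, assuming each frame is assigned to whichever prompt’s block covers it in time. The assignment scheme and names are my own guesses, purely for illustration.

```python
# A sketch of "gated" cross attention in a Sora-like setting: each frame's query
# may only attend to the text tokens of the storyboard prompt assigned to it.
import torch

def storyboard_attention_mask(frame_prompt_ids, token_prompt_ids):
    # frame_prompt_ids: (num_frames,)  which storyboard prompt each frame falls under
    # token_prompt_ids: (num_tokens,)  which storyboard prompt each text token came from
    # mask[i, j] is True where frame i is allowed to attend to text token j
    return frame_prompt_ids[:, None] == token_prompt_ids[None, :]

def gated_cross_attention(Q, K, V, mask):
    scores = (Q @ K.T) / Q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))  # block frame/prompt pairs that don't line up
    return scores.softmax(dim=-1) @ V

# Example: 8 frames split across two storyboard prompts, with 3 text tokens each.
frame_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
token_ids = torch.tensor([0, 0, 0, 1, 1, 1])
mask = storyboard_attention_mask(frame_ids, token_ids)  # boolean mask of shape (8, 6)
```

The surrounding self-attention layers are what let information leak between prompts, which is exactly the “soup of all prompts” behavior described below.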

Here you can see a storyboard used to generate a video. Notice that the rose is visible when the text “the door opens” is present, which is before a rose is mentioned in the storyboard.

I previously mentioned that prompt engineering with Sora can be a bit complicated. This might explain why. The result is kind of a soup of all prompts. Even if you have a prompt at the very beginning of the storyboard, it might impact the generation of later frames.

The result of the storyboard defined above. It respects many of the core ideas defined in the storyboard, but also ignores some critical ones. While Sora is incredibly powerful, it also requires a bit of futzing around to get exactly what you might want.

Trick 3) Partial Diffusion

Instead of trying to prompt engineer our way into getting the video we want, we can use the “Blending” function in Sora to make sure we’re transitioning between the scenes we’re interested in.

Recall that a diffusion model iteratively denoises the entire video.

Recall that a diffusion video model essentially takes in a sequence of noisy images and outputs a sequence of denoised images.

and we can inject full video frames on either side to entice the model to create some transition between certain frames. We did that in a previous section using the timeline.

Recall that we created a timeline which started with an image of angels and ended with a video of an eye.

However, when we used the timeline to create this effect our results weren’t very compelling. The “blend” function in Sora allows for more fine-tuned control over transitions. I’m not 100% sure how this works, but I have a strong hunch.

I think, instead of using frames we 100% want, or 100% noisy frames, the blend function allows us to specify the degree of noise applied to the two videos we want to blend between.

If we create a sequence of frames with varying degrees of noise, we can control Sora more or less rigidly from frame to frame.

This allows Sora to either weakly or strongly edit certain frames over time.
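
Here’s a sketch of what that might look like: a blend curve decides both how the two videos are mixed and how much noise each frame receives, with the noisiest frames being the ones Sora is freest to re-invent. This is my speculation about the mechanism, not a documented implementation.

```python
# A sketch of "partial diffusion": per-frame noise levels derived from a blend curve.
# The curve shape and tensor layout are illustrative assumptions.
import torch

def partial_diffusion_input(video_a, video_b, curve):
    # video_a, video_b: (frames, channels, height, width), trimmed to the same length
    # curve: (frames,) values in [0, 1]; 0 means "keep video A", 1 means "keep video B"
    w = curve[:, None, None, None]
    blended = (1 - w) * video_a + w * video_b

    # Frames in the middle of the transition are the least "trusted", so they get
    # the most noise and the model has the most freedom to re-invent them.
    noise_level = 1 - (2 * (curve - 0.5)).abs()   # 0 at either end, 1 in the middle
    n = noise_level[:, None, None, None]
    return (1 - n) * blended + n * torch.randn_like(blended)

curve = torch.linspace(0, 1, steps=60)            # a simple linear 60-frame blend
```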

I generated two videos, one of a snowflake and one of falling rose petals, and created a transition between them. The curve specifies how much of each video ends up in the output.

By editing this curve, you can change the duration of the blend, how gradual it is, and how much adjustment to the video can be done on either end of the transition. So, for instance, if we wanted a gradual transition that perfectly respected the angel video in the beginning, and perfectly respected the eye video in the end, we might do something like this:

If we wanted a video that only slightly respected the video of the angels, and then did a long transition to a video that fully respected the video of the eye, we might make a curve like this:

I’m not sure if there’s a bug in the Sora UI, but I can’t seem to find the submit button on custom blends. However, we can confirm our suspicions by using one of the pre-defined blend modes.

A video that perfectly respects the rose petal frames in the beginning, then incorporates the snowflake video more loosely

There’s no silver bullet that will make Sora work in every use case, but I think the Blending tool is especially compelling for a lot of users. I recommend putting in some upfront effort to make the videos line up to help Sora out.

Trick 4) Variable Aspect Ratios, Durations, and Levels of Detail

Making an AI model that can output a few different aspect ratios and resolutions might seem like a trivial detail, but it’s not. Virtually all AI models accept a fixed-size input and generate a fixed-size output. Data scientists have to use a lot of fancy tricks to create the illusion of dynamically sized outputs.

In natural language processing, that’s done with padding. When you submit a query consisting of just a few tokens, a special <pad> token is used to fill up the empty space.

In image and video generation you could probably do something similar. For instance, you could make a video generation model that always outputs a long video at maximum resolution; then, if a user wants a shorter, lower-resolution video, you could surround the generated video with padding.

Imagine an AI model that could generate square, horizontal, and vertical videos of various sizes and durations. We could just pick the size and duration we want (static) and pad out the rest (yellow).
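
For language models, the padding idea looks something like this (the token ids and pad id are made up for illustration):

```python
PAD = 0  # a made-up id for the <pad> token

def pad_to_length(token_ids, length):
    """Fill the unused positions of a fixed-size input with <pad> tokens."""
    return token_ids + [PAD] * (length - len(token_ids))

print(pad_to_length([17, 42, 9], 8))   # [17, 42, 9, 0, 0, 0, 0, 0]
```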

While this might be possible, I don’t think it’s practical; training a diffusion model like this would be incredibly wasteful. One of the major ideas behind modern diffusion models is the “latent diffusion model”. Basically, there’s a ludicrous amount of data in even a small image, so most modern diffusion models compress this data into a smaller representation, the model thinks about that compressed data, and then it’s decompressed to generate the final image.

It’s easy to forget just how much information is in a single image. On modern images, you have to zoom in pretty far just to see individual pixels. This image alone contains 5,468,400 pixels in total. Considering large models, which cost millions of dollars to train, are still measured in the trillions of parameters, having a 5.4-million-pixel input for every single frame seems like a big ask. Also, notice how when you zoom into the pixel level there is a vast amount of redundant information in the image. From my original article on Sora.
A latent diffusion model compresses and decompresses the data before and after the model. From my original article on Sora.

With this approach, we can create a few different compression and decompression portions of the model, and train each of those portions for its respective output format.

The same AI model can be used for various different aspect ratios as long as they’re all compressed into a similar format.

Sora has three aspect ratios, three resolutions, and four durations, for a total of 36 possible output formats. OpenAI could train an individual decompression portion for each of these formats, or they could use any number of other fancy approaches; who knows.
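
To illustrate, here’s a sketch of a latent video pipeline with a single shared denoiser and a per-format decoder. Everything here, from the class to the idea of one decoder per output format, is my speculation about how this could be wired up, not OpenAI’s published design.

```python
# A sketch of the latent-diffusion idea: compress into a shared latent format,
# denoise with one model, decode with a decoder chosen for the requested output.
import torch
import torch.nn as nn

class LatentVideoPipeline(nn.Module):
    def __init__(self, denoiser, decoders):
        super().__init__()
        self.denoiser = denoiser                 # one shared diffusion model in latent space
        self.decoders = nn.ModuleDict(decoders)  # e.g. {"16x9_1080p": ..., "1x1_480p": ...}

    def generate(self, text_emb, latent_shape, output_format, steps=50):
        z = torch.randn(latent_shape)            # start from noise in the compressed space
        for _ in range(steps):
            z = self.denoiser(z, text_emb)       # iteratively denoise the latent video
        return self.decoders[output_format](z)   # decompress into the requested format
```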

I’m sure there are about a million other pieces of secret sauce that go into making Sora work, but now I’d like to discuss a problem with Sora: people.

A Big Problem

The whole world has been very concerned with deep fakes lately, and rightfully so. Deep fakes are AI generated content which is specifically designed to imitate a person for embarrassing, incriminating, or exploitative purposes. As a result, OpenAI has put some serious guard rails around generating content with people in it.

I wanted the intro video of this article to be a slideshow of realistic people’s faces from a variety of demographics. The plan was to generate a few images in MidJourney (my image generation model of choice), then create a timeline of transitions to construct a video.

What I was going for

However, I was met with the following response.

Apparently, you can’t generate videos from uploaded content that contains people.

As you may have been able to glean from reading this article, Sora is incredibly powerful, but it’s also somewhat cumbersome and limited. I really think Sora’s power is in its ability to apply new and interesting techniques to real video or VFX. I can imagine Sora adjusting lighting, creating interesting transitions, and all sorts of other interesting applications. However, if you can’t use Sora to edit video containing people (which is most of the video humans care about), then who cares?

Sora can generate video of people just fine, and as far as I can tell, once they’re generated you can use them just like any other video, but the inability to use videos of real people seriously inhibits certain use cases.

Two videos I generated in an attempt to create the cover video
A generated video which blends the two videos above. By the way, the implications of a white man creating a video of a black woman smiling under the title “faces of diversity” are not lost on me. The human face is a subtle and complex communication tool which simultaneously says “this is who I am” and “this is how I feel”. Using Sora, I feel, comes with a responsibility to be conscious and respectful of that fact.

OpenAI is in a bit of a sticky situation. On one hand the most dangerous use of Sora is in editing video of people, and on the other hand the most compelling use of Sora is in editing video of people. Right now, it seems like they’re dealing with this problem by only allowing certain parties to upload content containing humans.

Conclusion

In this article we discussed Sora, both in terms of its core functionality and the theory behind it. First, we took a tour of Sora and explored some of its core functionality, then we explored how that functionality might be implemented. We also discussed some of the practical difficulties one might face when using Sora, some of the ethical concerns it raises, and some of its current limitations.

If you’re interested in exploring Sora from a more conceptual perspective, check out my article on the subject.

Sora — Intuitively and Exhaustively Explained

Join Intuitively and Exhaustively Explained

At IAEE you can find:

  • Long form content, like the article you just read
  • Thought pieces, based on my experience as a data scientist, engineering director, and entrepreneur
  • A discord community focused on learning AI
  • Regular Lectures and office hours
Join IAEE



