Sora AI Video Generator: Behind the Scenes of the Artistic Process

Some of Sora's first users have detailed their process of creating a complete video with it.

By Lian Ran, edited by Zheng Xuan

In early February, OpenAI's release of Sora stunned the world; its breakthrough in AI video generation was widely seen as a storm bearing down on Hollywood.

Sora is a diffusion model. Compared with previous AI video generators, it can generate video up to one minute long from a text prompt while maintaining visual quality and consistency, and it can handle shot changes and composition adjustments. It also keeps the subject's details aligned with the background and overall theme, making the generated video more realistic, as if it were an extension of the real world.

At the time, OpenAI also released technical information indicating that Sora could also extend generated videos or seamlessly blend them together.

Starting in March, OpenAI granted some artists access to Sora. At the end of the month, it published on its official website several surreal videos that artists had created with Sora. Recently, ShyKids, the team behind one of these works, AirHead, revealed the complete process of producing it with Sora.

Overall, Sora as actually used in film and television production is not as stunning as it first appeared, but it is still striking: it let a team of only three people produce a fantastical short film in roughly one and a half to two weeks.

In this team's view, Sora in its current form has made incredible progress in certain aspects of image generation; for relatively complex projects, however, it may still need time to evolve before it can meet a director's specific requirements. Beyond the use of Sora itself, AirHead still required a great deal of editing and human guidance to complete. As the team put it, integrating Sora into the creative process is a very realistic way of working, but not using it would not make much of a difference either.

01. What follows is a summary of fxguide's conversation with ShyKids about how Sora currently works:

ShyKids, one of the production teams granted limited access to Sora, produced the Sora short film AirHead. ShyKids is a Canadian production company known for its diverse and innovative approach to media production.

Sora is still in development and is actively being improved based on feedback from teams like ShyKids; it is important to recognize that it remains at a very early, almost pre-alpha, stage.

Patrick, who handles post-production at ShyKids, said that working with Sora was an interesting process. "Sora is a very powerful tool, and we are already dreaming about how it can fit into our existing workflow. But I think for any generative AI tool, control is still the most desirable and, for now, the most elusive thing."

User Interface and Interaction: To improve consistency, only text input is supported

Sora's user interface design is simple, allowing artists to initiate the video clip generation process by entering text prompts.

After the artist enters a description of the desired scene, OpenAI's ChatGPT technology converts it into a longer, more detailed prompt string, a crucial step that triggers Sora to generate video clips.
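The exact prompt-expansion step inside Sora is not public, but the general idea of having a chat model rewrite a short scene description into a much more detailed prompt can be sketched with OpenAI's public chat API. The model name, system instruction, and helper function below are illustrative assumptions, not Sora's actual internals.

```python
# Illustrative sketch only: Sora's internal prompt-expansion pipeline is not public.
# This uses the public OpenAI chat API to show the general idea of turning a short
# scene description into a longer, more detailed prompt string.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM = (
    "Rewrite the user's scene description as a single detailed video prompt. "
    "Spell out the setting, lighting, camera framing, and the appearance of "
    "every character and prop."
)

def expand_prompt(scene: str, model: str = "gpt-4o-mini") -> str:
    """Expand a short scene description into a detailed prompt (hypothetical helper)."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": scene},
        ],
    )
    return response.choices[0].message.content

print(expand_prompt("A man with a yellow balloon for a head walks through a city."))
```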

Currently, Sora only supports text input and has not integrated multi-modal input methods. In other words, besides text descriptions, users cannot provide input in other forms such as images or sound.

The significance of this design is that although Sora does a good job of keeping objects consistent within the video frame, the system still cannot guarantee that the content of a first frame will match subsequent frames exactly.

In other words, even with identical text prompts, the clips Sora generates at different times may differ. To maintain as much consistency as possible, users need to describe scenes in the prompt in as much detail as possible, down to the characters' clothing and props. Even so, Sora still has limited control over consistency between shots, because it does not yet have a full set of features for achieving that control.

"The closest thing we can do is add more detailed descriptions in our prompts," Patrick explained. "Describing the characters' clothing and the type of balloon is our way of achieving consistency, because there is still no full set of features for controlling consistency from shot to shot."

Each individual clip generated by Sora is amazing in terms of the technology it represents. However, using these clips effectively depends on the user's understanding of how Sora implicitly or explicitly generates shots.

For example, if you ask Sora to generate a long-distance tracking shot in a kitchen with a banana on the table, Sora will rely on its implicit understanding of the concept of banana to generate a video showing a banana.

From its training data, Sora has learned the implicit properties of bananas: yellow, curved, dark at the tip, and so on. But it does not hold actual recorded images of bananas, nor a database of banana assets; it has a smaller, compressed hidden or latent space that represents the concept of a banana. Each generation run therefore shows a different interpretation of that latent space, which means the prompts users enter have to work with an understanding of these implicit features.
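As a loose illustration of why the same prompt gives a different result each run, the toy sketch below fixes a "concept" vector and resamples the latent noise on every pass. The tiny random decoder is a stand-in only; nothing here reflects Sora's actual architecture, which has not been published.

```python
# Toy illustration of sampling a latent space: one fixed concept plus fresh
# noise yields a different output on every run. Purely conceptual.
import numpy as np

rng = np.random.default_rng(0)

D_CONCEPT, D_LATENT, OUT = 8, 4, 6
decoder = rng.normal(size=(OUT, D_CONCEPT + D_LATENT))  # stand-in for a learned decoder

banana = rng.normal(size=D_CONCEPT)  # the compressed "idea of a banana"

for run in range(3):
    z = rng.normal(size=D_LATENT)                      # new latent sample each generation
    frame = decoder @ np.concatenate([banana, z])      # same concept, different noise
    print(f"run {run}:", np.round(frame, 2))           # three different "interpretations"
```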

Consistency of character Sonny:

The team tried to maintain the consistency of Sonny, the yellow balloon head, in different shots, but Sora could not ensure that the color and style of the balloons were exactly the same in each shot. Sometimes the color or style of the balloon did not match the prompts, and even unexpected facial patterns appeared.

Resolution and Image Processing:

AirHead is built from shots generated by Sora, but many of them were graded, processed, and stabilized, and all of the shots were upscaled or enhanced in resolution. The clips the team worked with were generated at a lower resolution and then enlarged using AI tools outside of Sora and OpenAI: "All of AirHead was made at 480p and then upscaled with Topaz."
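Topaz is a desktop application, so the team's exact upscaling step cannot be reproduced here. As a rough stand-in for the idea of AI-upscaling 480p frames outside the generator, the sketch below uses OpenCV's super-resolution module; the model file and frame paths are placeholders, not anything from the actual production.

```python
# Stand-in sketch for AI upscaling (the team used Topaz, a desktop tool).
# Requires opencv-contrib-python and a separately downloaded EDSR_x4.pb model.
import cv2

sr = cv2.dnn_superres.DnnSuperResImpl_create()
sr.readModel("EDSR_x4.pb")       # path to the pre-trained model (assumption)
sr.setModel("edsr", 4)           # 4x upscale, e.g. a 480p frame becomes ~4x wider

frame = cv2.imread("sora_frame_480p.png")   # a single extracted frame (placeholder name)
upscaled = sr.upsample(frame)
cv2.imwrite("sora_frame_upscaled.png", upscaled)
```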

Imprecision of Time Control:

ShyKids used the earliest prototype (Sora is still being improved continuously). Although keyframes can be adjusted on the timeline, control over the exact timing of actions is imprecise, which introduces some uncertainty.

Aspect Ratio Selection:

Sora allows users to select different aspect ratios, such as portrait or landscape mode, which is crucial for certain shot designs. Although this provides flexibility, Sora has limitations when rendering some complex camera moves. For example, when a shot needed to travel from Sonny's jeans up to his balloon head, Sora could not generate that move directly. To work around this, the team first rendered the shot in portrait mode and then created the upward camera move manually in post-production.
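One common way to fake that kind of move in post is to render tall and then slide a 16:9 crop window up the frame over time. The sketch below just computes the per-frame crop boxes for such a move; the resolution, frame rate, and move length are illustrative assumptions, not ShyKids' actual settings.

```python
# Sketch of a post-production tilt-up: slide a 16:9 crop window from the bottom
# to the top of a portrait-oriented render. All dimensions here are assumptions.
SRC_W, SRC_H = 1080, 1920        # portrait render out of the generator
CROP_W, CROP_H = 1080, 608       # roughly 16:9 window cut from it
FPS, SECONDS = 24, 3             # length of the simulated camera move
N_FRAMES = FPS * SECONDS

def crop_box(frame_idx: int) -> tuple[int, int, int, int]:
    """(left, top, right, bottom) of the crop for a linear jeans-to-head move."""
    t = frame_idx / (N_FRAMES - 1)               # 0.0 at the start, 1.0 at the end
    top = round((SRC_H - CROP_H) * (1.0 - t))    # start low on the frame, end at the top
    return (0, top, CROP_W, top + CROP_H)

for i in (0, N_FRAMES // 2, N_FRAMES - 1):
    print(f"frame {i:2d}: crop {crop_box(i)}")
```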

Camera Direction Cues:

Sora is not yet mature at understanding and executing camera-motion instructions. Users can enter cues such as "camera pan", but Sora does not always execute them accurately.

Rendering Duration:

Depending on cloud load and when the request is made, rendering a clip can take 10 to 20 minutes. The team tended to render longer clips so there would be more room for editing and adjustment in post-production.

Roto and Cleanup:

Although all of the imagery was generated in Sora, the balloon still required a lot of post-production work. In addition to isolating the balloon so it could be recolored, unwanted facial features and other stray artifacts had to be removed.
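The article does not say which compositing tools were used, but isolating a saturated yellow object so it can be recolored is a standard keying job. The sketch below builds a rough HSV mask for a yellow balloon with OpenCV; the color thresholds, hue shift, and file names are illustrative guesses, not the team's settings.

```python
# Rough sketch of isolating a yellow balloon for recoloring. The actual film
# relied on manual compositing work; this only shows the basic idea of a key.
import cv2
import numpy as np

frame = cv2.imread("sora_frame.png")                 # one extracted frame (placeholder)
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

# Hue/saturation/value bounds for a saturated yellow; these would be tuned per shot.
lower = np.array([20, 80, 80], dtype=np.uint8)
upper = np.array([35, 255, 255], dtype=np.uint8)
mask = cv2.inRange(hsv, lower, upper)

# Shift the hue of the masked region (here toward orange/red), leave the rest alone.
hsv[..., 0] = np.where(mask > 0, (hsv[..., 0].astype(int) - 20) % 180, hsv[..., 0])
recolored = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
cv2.imwrite("sora_frame_recolored.png", recolored)
```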

Material-to-Finished Product Ratio:

Patrick estimates that the roughly one-and-a-half-minute final film was built from hundreds of generations of 10 to 20 seconds each, which he puts at about a 300:1 ratio of source material to finished film.

Shot Compositing and Retiming:

In AirHead, most shots are generated in one go, without multiple shots composited together.

Many of the clips Sora generates seem to come out as slow motion by default, playing at roughly 50% to 75% of normal speed. The team had to retime them to make them read as real-time footage.
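A clip that plays back at, say, 60% of real speed can be brought back to real time by scaling its presentation timestamps by 0.6. The sketch below wraps the standard ffmpeg setpts filter in Python; the speed factor and file names are assumptions, and since Sora clips are silent there is no audio track to retime.

```python
# Sketch of retiming a slow-motion clip back to real time with ffmpeg's setpts filter.
# The 0.6 factor assumes the clip plays at ~60% of normal speed; file names are placeholders.
import subprocess

PERCEIVED_SPEED = 0.6   # clip looks like it runs at 60% of real time

subprocess.run(
    [
        "ffmpeg", "-y",
        "-i", "sora_clip_slow.mp4",
        "-filter:v", f"setpts=PTS*{PERCEIVED_SPEED}",  # compress timestamps -> faster playback
        "-an",                                          # Sora clips carry no audio anyway
        "sora_clip_realtime.mp4",
    ],
    check=True,
)
```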

Copyright:

Sora does not allow the generation of content that would constitute copyright infringement or that appears to imitate specific works.

For example, with a prompt like "on a futuristic spaceship, shot on 35mm film, a man holding a lightsaber walks forward," Sora will refuse to generate the clip because the content is too close to Star Wars. ShyKids ran into the same problem in early tests. "I entered 'Tarkovsky-style shot' and got back a message that it could not be executed," Patrick recalled. He added that "Hitchcockian zoom" is another prompt Sora will reject.

02. Summary

Last year, the rapid development of large models contributed to a major Hollywood screenwriters' strike, and the film industry's concerns about the technology began to grow. In February this year, OpenAI's release of Sora was seen as a signal that Silicon Valley was once again challenging Hollywood. In early March, a film studio abruptly shelved an expansion plan, one that called for four years of design and construction of 12 new studios, a budget of around $800 million, and a 330-acre site, because of Sora's emergence. For a moment, it seemed as if everyone in Hollywood was in jeopardy.

However, now that ShyKids, who made a short film with Sora, have revealed what the technology actually involves, that a great deal of manual post-production was required, and that Sora still cannot meet some advanced and complex requirements, Hollywood seems to have gained some breathing room. After all, Sora is still at a very early stage and is far from replacing human work across the many parts of the film industry.

But it is worth noting that artificial intelligence is advancing at a pace faster than Moore's Law ever did, something large language models like GPT have already demonstrated. The emergence of Sora suggests that video generation models have reached a turning point, and perhaps before long we will see large video models that can genuinely be used in the video industry, and even in film.

This article is from the WeChat public account Geek Park (ID: geekpark), by Lian Ran.
