As more and more companies double down on their generative AI capabilities, organizations are racing to build ever more capable products. Case in point: Lumiere, a space-time diffusion model proposed by researchers from Google, the Weizmann Institute of Science, and Tel Aviv University that generates realistic videos.
A paper detailing the technology has just been published, but the model cannot be tested yet. If that changes, Google could introduce a very strong contender into the AI video space, which is currently dominated by players like Runway, Pika, and Stability AI.
The researchers claim that the model takes a different approach than existing players and synthesizes videos that depict realistic, diverse, and consistent motion, a critical challenge in video synthesis.
What can Lumiere do?
At the heart of Lumiere, whose name is the French word for light, is a video diffusion model that gives users the ability to produce realistic and stylized videos, along with an option to edit them on command.
Users enter natural-language text describing what they want to see, and the model generates a video to match. They can also upload an existing still image and, with a prompt, turn it into a dynamic video. The model supports additional capabilities as well, such as inpainting, which inserts specific objects or edits regions of a video via text prompts; cinemagraphs, which add motion to specific parts of an otherwise still scene; and stylized generation, which takes the reference style of one image and uses it to generate videos in that style.
“We demonstrate state-of-the-art text-to-video generation results and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation,” the researchers wrote in their paper.
While these features are not new to the industry and are offered by players such as Runway and Pika, the authors note that most existing models handle the added temporal dimension of video generation (representing states over time) with a cascaded approach: a base model first generates distant keyframes, and subsequent temporal super-resolution (TSR) models then fill in the missing frames between keyframes in non-overlapping segments. This works, but it makes global temporal coherence difficult to achieve and often limits video length, overall visual quality, and the degree of realistic motion that can be produced.
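To make that distinction concrete, here is a minimal sketch of the cascaded pipeline the authors describe — it is not Lumiere's code nor any vendor's actual API. A stand-in base model places sparse keyframes, and a stand-in TSR step fills each gap independently; the function names and the interpolation "models" are hypothetical placeholders, and the point is only the two-stage structure and the non-overlapping segments it relies on.

```python
import numpy as np

def base_model_keyframes(prompt: str, num_keyframes: int = 6,
                         h: int = 64, w: int = 64) -> np.ndarray:
    """Stand-in for a base diffusion model that generates distant keyframes."""
    rng = np.random.default_rng(0)  # placeholder: a real model would condition on the prompt
    return rng.random((num_keyframes, h, w, 3), dtype=np.float32)

def tsr_fill_segment(start: np.ndarray, end: np.ndarray,
                     frames_between: int) -> np.ndarray:
    """Stand-in for a temporal super-resolution (TSR) model that fills the frames
    between two keyframes. Each segment is handled independently, with no overlap,
    which is why global temporal coherence is hard to achieve in this design."""
    alphas = np.linspace(0.0, 1.0, frames_between + 2)[1:-1]
    return np.stack([(1 - a) * start + a * end for a in alphas])

def cascaded_generation(prompt: str, frames_between: int = 15) -> np.ndarray:
    keyframes = base_model_keyframes(prompt)
    clips = []
    for i in range(len(keyframes) - 1):
        clips.append(keyframes[i][None])  # keep the keyframe itself
        clips.append(tsr_fill_segment(keyframes[i], keyframes[i + 1], frames_between))
    clips.append(keyframes[-1][None])
    return np.concatenate(clips)  # shape: (frames, height, width, 3)

video = cascaded_generation("a cat surfing a wave")
print(video.shape)  # (81, 64, 64, 3): ~80 frames built from just 6 keyframes
```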
Lumiere addresses this gap with a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, in a single pass through the model, resulting in more realistic and coherent motion.
“By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales,” the researchers wrote in their paper.
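For readers who want the intuition behind that quote, the following is a small PyTorch sketch of the general idea of a space-time U-Net: downsample the clip in both space and time, process the compact representation, and upsample back, so the full clip is handled in a single pass. It is an illustration of the concept only, not the paper's actual architecture; the pre-trained text-to-image backbone, attention layers, and diffusion training loop are all omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySpaceTimeUNet(nn.Module):
    def __init__(self, channels: int = 16):
        super().__init__()
        # stride (2, 2, 2) halves the temporal axis as well as height and width
        self.down = nn.Conv3d(3, channels, kernel_size=3, stride=2, padding=1)
        self.mid = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.up = nn.ConvTranspose3d(channels, 3, kernel_size=4, stride=2, padding=1)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, 3, time, height, width) -- the full clip at once
        h = F.silu(self.down(video))   # coarser in space AND time
        h = F.silu(self.mid(h))        # processing at the compact spatiotemporal scale
        return self.up(h)              # back to full frame rate and resolution

clip = torch.randn(1, 3, 16, 64, 64)        # a 16-frame clip
print(TinySpaceTimeUNet()(clip).shape)      # torch.Size([1, 3, 16, 64, 64])
```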
The video model was trained on a dataset of 30 million videos and their text captions, and it can generate 80 frames at 16 fps, or five seconds of video. The origin of this training data remains undisclosed at this stage.
Performance against known AI video models
When comparing the model to offerings from Pika, Runway, and Stability AI, the researchers noted that while those models deliver high per-frame visual quality, their four-second outputs contain very limited motion, at times resulting in nearly static clips. Another player in the category, ImagenVideo, produces reasonable motion but lags behind in visual quality.
“In contrast, our method produces 5-second videos with greater motion while maintaining temporal consistency and overall quality,” the researchers wrote. Users surveyed on the quality of these models also said they preferred Lumiere over the competing products for both text-to-video and image-to-video generation.
This could be the start of something new in the rapidly evolving AI video market, but it is important to note that Lumiere cannot be tested yet. The company also notes that the model has certain limitations: it cannot generate videos consisting of multiple shots or involving transitions between scenes, which remains an open problem for future research.