Innovative Dynamic Typography Technology Open-Sourced by HKUST and Tel Aviv University

The team from the Hong Kong University of Science and Technology and Tel Aviv University has open-sourced 'Dynamic Typography', a technology built on large-scale video models. Given a selected letter and a simple text description, it generates an SVG animation that makes the letter 'come alive'.

For example, 'M' in ROMANTIC turns into a couple holding hands and walking together.

'h' in Father is interpreted as a father patiently walking with his child.

'N' in PASSION can be transformed into a couple kissing each other.

'S' in SWAN turns into an elegant swan stretching its neck.

'P' in TELESCOPE becomes a real telescope, slowly turning its lens.

This is the latest work brought to us by the research teams from HKUST and Tel Aviv University: Dynamic Typography.

Paper Link

Project Page

Make Text Come Alive

Text animation is an expressive medium that transforms static communication into dynamic experiences, evoking emotions, emphasizing text meanings, and constructing engaging narratives. However, creating semantically meaningful animations requires expertise in graphic design and animation production.

Therefore, researchers have proposed a novel automated text animation solution, 'Dynamic Typography,' achieving a perfect fusion of text and animation.

The solution can be decomposed into two steps:

1. Based on the user's description, letters are deformed to convey text semantics.

2. The deformed letters are given vivid dynamic effects according to the user's description, creating text animations.

Keeping the text legible while it moves smoothly is extremely challenging. Current text-to-video models struggle to generate readable text and cannot deform letters according to their semantics to better convey motion. Retraining such models would require a large, hard-to-obtain dataset of stylized text videos.

The researchers instead used the Score Distillation Sampling (SDS) technique to distill the prior knowledge in large-parameter text-to-video base models, predicting the displacement of the control points of the letter's vector outline at each frame, and preserved legibility and appearance during motion through additional readability constraints and structure-preservation techniques.

The researchers demonstrated the universality of their proposed framework on various text-to-video models and emphasized the superiority of their method over baseline methods. Experimental results show that their technology can successfully generate text animations that are consistent with user descriptions and maintain the readability of the original text.

Methodology

1. Data Representation

In this work, the outline of a letter is represented as a series of connected cubic Bezier curves, whose shape is determined by the curves' control points. The authors' method predicts a displacement for each control point in each frame: the displacements deform the letter to convey semantic information, and varying them across frames adds motion.
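To make the representation concrete, the following minimal Python sketch stores a letter outline as cubic Bezier control points and adds per-frame displacements on top. The array shapes, segment count, and names here are illustrative assumptions, not taken from the released code.

```python
import numpy as np

# Hypothetical glyph: a closed chain of connected cubic Bezier segments.
# Each segment has 4 control points (x, y); adjacent segments share endpoints.
num_segments = 12
control_points = np.random.rand(num_segments, 4, 2)  # stand-in for real outline data

def cubic_bezier(p, t):
    """Evaluate one cubic Bezier segment p with shape (4, 2) at parameter t in [0, 1]."""
    return ((1 - t) ** 3) * p[0] + 3 * ((1 - t) ** 2) * t * p[1] \
        + 3 * (1 - t) * (t ** 2) * p[2] + (t ** 3) * p[3]

# The model predicts a displacement for every control point in every frame;
# adding it to the base control points gives the deformed outline per frame.
num_frames = 24
displacements = np.zeros((num_frames, num_segments, 4, 2))  # would come from the model
animated_outline = control_points[None] + displacements     # (num_frames, num_segments, 4, 2)
```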

2. Model Architecture

Given a letter represented by Bezier curves, the researchers first use a coordinate-based MLP (referred to as the Base Field) to deform the letter into a base shape that conveys its semantic information. This base shape is then copied to every frame, and a second coordinate-based MLP (referred to as the Displacement Field) predicts a displacement for each control point in each frame, adding motion to the base shape.

Each frame is then rendered into a pixel image by a differentiable renderer, and the frames are concatenated into the output video. The Base Field and Displacement Field are optimized end to end using the prior knowledge of the text-to-video model together with additional constraint terms.
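The sketch below illustrates this two-MLP design in PyTorch under stated assumptions: the CoordMLP class, its width, and the way time is fed to the Displacement Field are placeholders, and the differentiable rasterization step (for example, a vector renderer such as diffvg) is only indicated by a comment.

```python
import torch
import torch.nn as nn

class CoordMLP(nn.Module):
    """Small coordinate-based MLP mapping an input coordinate to a 2D offset."""
    def __init__(self, in_dim=2, hidden=128, out_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

base_field = CoordMLP()                   # deforms the letter into the base shape
displacement_field = CoordMLP(in_dim=3)   # takes (x, y, t) and outputs a per-frame offset

def animate(control_points, num_frames=24):
    """control_points: (N, 2) tensor of the original letter's Bezier control points."""
    base_shape = control_points + base_field(control_points)   # semantic deformation
    frames = []
    for f in range(num_frames):
        t = torch.full((control_points.shape[0], 1), f / max(num_frames - 1, 1))
        offsets = displacement_field(torch.cat([base_shape, t], dim=-1))
        frames.append(base_shape + offsets)  # each frame would then be rasterized by a
                                             # differentiable renderer (e.g. diffvg)
    return torch.stack(frames)               # (num_frames, N, 2) control points per frame
```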

3. Optimization

Diffusion models such as Stable Diffusion are trained on large-scale collections of two-dimensional pixel images and therefore contain rich prior knowledge. Score Distillation Sampling (SDS) aims to distill the prior knowledge in a diffusion model in order to train other models that generate content in other modalities, such as training the MLP in a NeRF to generate 3D models.

In this work, the researchers distilled a diffusion-based text-to-video model through SDS and used the acquired prior knowledge to train the parameters of the Base Field and Displacement Field.
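As a rough illustration of how an SDS update could be wired up for a video diffusion model, the schematic below noises the rendered video, queries a frozen denoiser, and turns the residual into a surrogate loss whose gradient flows back into the two fields. The denoiser signature, timestep range, and weighting are assumptions rather than the paper's exact recipe.

```python
import torch

def sds_surrogate_loss(video, text_embedding, denoiser, alphas_cumprod):
    """video: (B, C, T, H, W) rendered frames that require grad through the fields.
    denoiser: frozen text-to-video diffusion model (hypothetical call signature)."""
    t = torch.randint(20, 980, (1,), device=video.device)     # random diffusion timestep
    noise = torch.randn_like(video)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1, 1)
    noisy = a_t.sqrt() * video + (1 - a_t).sqrt() * noise      # forward diffusion
    with torch.no_grad():                                      # the video model stays frozen
        noise_pred = denoiser(noisy, t, text_embedding)
    w = 1 - a_t                                                # a common weighting choice
    grad = w * (noise_pred - noise)
    # Surrogate whose gradient w.r.t. `video` equals `grad`, as in standard SDS.
    return (grad.detach() * video).sum()
```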

Furthermore, to ensure that every frame of the generated video keeps the letter itself readable (e.g., the letter 'M' in the word 'CAMEL' should visually resemble a camel while still retaining the shape of the letter 'M' so that viewers can recognize it), this work introduced a constraint term based on Learned Perceptual Image Patch Similarity (LPIPS) that restricts the perceptual distance between the base shape and the original letter.
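A minimal sketch of such a legibility term using the open-source lpips package might look as follows; the VGG backbone and the loss weight are assumptions.

```python
import lpips  # pip install lpips

# Perceptual metric used as a constraint: the base shape should stay
# perceptually close to a rendering of the original, undeformed letter.
perceptual = lpips.LPIPS(net='vgg')

def legibility_loss(base_shape_img, original_letter_img, weight=1.0):
    """Both images are (1, 3, H, W) tensors scaled to [-1, 1], as lpips expects."""
    return weight * perceptual(base_shape_img, original_letter_img).mean()
```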

To alleviate the severe flickering observed when Bezier curves frequently intersect, this work incorporated a triangulation-based structure-preservation constraint that keeps the letter's skeletal structure stable during deformation and motion.
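One plausible way to express such a constraint, sketched below, is to triangulate the base shape's control points once and penalize per-frame changes in the triangle edge lengths; the paper's exact formulation may differ (for instance, it may be defined on angles or on a skeleton rather than plain Delaunay edges).

```python
import torch
from scipy.spatial import Delaunay

def build_triangulation(points):
    """Triangulate the base-shape control points once (Delaunay as a stand-in)."""
    tri = Delaunay(points.detach().cpu().numpy())
    return torch.as_tensor(tri.simplices, dtype=torch.long)

def edge_lengths(points, triangles):
    a = points[triangles[:, 0]]
    b = points[triangles[:, 1]]
    c = points[triangles[:, 2]]
    return torch.stack([(a - b).norm(dim=-1),
                        (b - c).norm(dim=-1),
                        (c - a).norm(dim=-1)], dim=-1)

def structure_loss(base_points, frame_points, triangles):
    """Penalize distortion of the triangulated structure in an animated frame,
    discouraging curve collapse and intersections that cause flicker."""
    return (edge_lengths(frame_points, triangles)
            - edge_lengths(base_points, triangles)).abs().mean()
```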

Experiments

In the experiments, the researchers evaluated the work on two aspects: the legibility of the text and the consistency between the user-provided text description and the generated video.

This work was compared with two types of methods: pixel-based text-to-video models and vector-based general animation solutions.

Among pixel-based text-to-video models, this work was compared with the leading text-to-video model Gen-2 and the image-to-video model DynamiCrafter.

Qualitative and quantitative comparisons show that the other methods mostly fail to keep the letters readable during video generation or fail to produce motion that matches the semantics. The method proposed in this paper keeps the letters readable during motion and generates motion that matches the text descriptions provided by users.

To further demonstrate the effectiveness of each module, the researchers conducted extensive ablation experiments. The results show that the base-shape design and the triangulation-based structure-preservation technique effectively improve video quality, while the perceptual-similarity-based readability constraint keeps the letters readable during motion.

The researchers further emphasize the universality of their framework across various text-to-video models, indicating that it remains compatible with future video generation models and will produce even more attractive text animations as those models improve.


This article is from the WeChat official account: NewSpectrum (ID: AI_era)
