✨ Highlights
- A novel two-stage pipeline for multi-object sketch animation with user-defined grouping and motion trajectory priors.
- Group-based Displacement Network with Context-conditioned Feature Enhancement for improved temporal consistency.
- Significantly outperforms existing methods in visual quality and temporal consistency for complex multi-object scenarios.
Abstract
We introduce GroupSketch, a novel pipeline for vector sketch animation that effectively handles multi-object interactions and complex motions. Existing approaches struggle with these scenarios, either being limited to single objects or suffering from temporal inconsistency and generalization issues. Our method addresses these limitations through a two-stage approach: Motion Initialization and Motion Refinement. The first stage allows users to divide sketches into semantic groups and define key frames, generating a coarse animation through interpolation. The second stage employs our Group-based Displacement Network (GDN) to refine this animation by predicting group-specific displacement fields, leveraging priors from a text-to-video model. GDN incorporates specialized components including Context-conditioned Feature Enhancement (CCFE) to improve temporal consistency. Extensive experimental results demonstrate that our approach significantly outperforms existing methods in generating high-quality, temporally consistent animations for complex multi-object sketches, expanding the practical applications of sketch animation.
Methodology
GroupSketch comprises two main stages: Motion Initialization and Motion Refinement.

(a) In the Motion Initialization Stage, the user divides the input sketch into semantic groups and specifies their motion trajectories through a canvas-based interactive process; interpolating along these trajectories yields a coarse sketch animation. In the Motion Refinement Stage, the groups are fed into the GDN, which predicts group-specific displacement fields to refine their motion. The refined group motions are merged and rendered by a differentiable rasterizer, and the resulting loss is backpropagated to update the GDN's parameters.
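To make the refinement loop concrete, the following is a minimal PyTorch sketch of the optimization just described. Every name here (refine_motion, the gdn callable, rasterizer, t2v_loss) is a hypothetical stand-in rather than the paper's actual interface; only the loop structure (predict displacements per group, merge, rasterize, backpropagate) follows the text.

```python
import torch

def refine_motion(group_points, gdn, rasterizer, t2v_loss, steps=500):
    """group_points: list of (F, N_g, 2) coarse per-group point
    trajectories produced by the Motion Initialization stage.
    All interfaces here are hypothetical stand-ins."""
    opt = torch.optim.Adam(gdn.parameters(), lr=1e-3)
    for _ in range(steps):
        # Predict a group-specific displacement field for each group.
        disp = [gdn(g) for g in group_points]
        # Merge: apply each group's displacements, then concatenate the
        # refined point sets back into one animation.
        refined = torch.cat(
            [g + d for g, d in zip(group_points, disp)], dim=1)
        # Differentiably rasterize every frame so the loss gradient can
        # flow from pixels back to the predicted displacements.
        frames = rasterizer(refined)  # e.g. (F, H, W, 3)
        # Score the rendered video with the text-to-video prior (e.g. an
        # SDS-style loss) and update only the GDN's parameters.
        loss = t2v_loss(frames)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return refined.detach()
```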
(b) The GDN architecture comprises two components. (1) Context-conditioned Feature Enhancement (CCFE) consists of Frame-aware Positional Encoding (FPE), which encodes the temporal position of each input point sequence, and Motion Context Learning (MCL), which enhances the features by conditioning on context information extracted from all frames. (2) Group Displacement Field Prediction combines local and global grouping paths to produce the final displacements for each group.
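As a concrete reference for this wiring, below is a hypothetical PyTorch sketch of the GDN. Layer types, sizes, and the exact attention scheme are assumptions; only the high-level flow (FPE, then MCL over context from all frames, then combined local and global displacement paths) follows the description above.

```python
import torch
import torch.nn as nn

class GDN(nn.Module):
    """Hypothetical sketch: Frame-aware Positional Encoding (FPE),
    Motion Context Learning (MCL), and local/global displacement heads."""

    def __init__(self, d=128, num_frames=24):
        super().__init__()
        self.point_proj = nn.Linear(2, d)
        # FPE: a learned embedding of each frame's temporal position,
        # added to that frame's point features.
        self.fpe = nn.Embedding(num_frames, d)
        # MCL: cross-attention that conditions each frame's features on
        # context gathered from all frames of the sequence.
        self.mcl = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        # Local path: a per-point displacement; global path: a shared
        # per-frame displacement for the whole group.
        self.local_head = nn.Linear(d, 2)
        self.global_head = nn.Linear(d, 2)

    def forward(self, points, frame_idx):
        # points: (F, N, 2) group points; frame_idx: (F,) long tensor.
        feat = self.point_proj(points) + self.fpe(frame_idx)[:, None, :]
        # Context sequence: features from all frames, flattened and
        # shared as keys/values for every frame's query.
        F_, N, d = feat.shape
        ctx = feat.reshape(1, F_ * N, d).expand(F_, -1, -1)
        feat, _ = self.mcl(feat, ctx, ctx)
        local = self.local_head(feat)                       # (F, N, 2)
        global_ = self.global_head(feat.mean(dim=1, keepdim=True))
        return local + global_                              # (F, N, 2)
```

A forward pass on a group with F frames and N points returns an (F, N, 2) displacement field, matching the per-group outputs consumed by the refinement loop sketched earlier.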
Multi-Object Results
We compare our GroupSketch with FlipSketch and LiveSketch on multi-object cases.
Single-Object Results
We compare our GroupSketch with FlipSketch and LiveSketch on single-object cases.
Comparison with Video Generation Models
We compare our GroupSketch with the image-to-video generation models Dynamicrafter and I2VGen-XL.
[Video grid: Input SVG or PNG · Dynamicrafter · I2VGen-XL · Ours]
Different Actions from the Same SVG with Varying Prompts
We show results of our GroupSketch on the same input SVG driven by different text prompts.
[Video grid: Input SVG or PNG · Result 1 (Ours) · Result 2 (Ours) · Result 3 (Ours)]
Ablation Study
We perform an ablation study to evaluate the contribution of each component of our GroupSketch.
[Video grid: Input SVG or PNG · Full Model (Ours) · LiveSketch (Baseline) · w/o Motion Trajectory Priors · w/o Grouping · w/o Motion Context Learning · w/o Frame-aware Positional Encoding · with LLM Motion Priors]