RoPECraft: Training-Free Motion Transfer with Trajectory-Guided RoPE Optimization on Diffusion Transformers

Ahmet Berke Gökmen1, * Yiğit Ekin1, * Bahri Batuhan Bilecen1, * Aysegul Dundar1
1 Bilkent University
* Equal Contribution

Abstract

We propose RoPECraft, a training-free video motion transfer method for diffusion transformers that operates solely by modifying their rotary positional embeddings (RoPE). We first extract dense optical flow from a reference video, and utilize the resulting motion offsets to warp the complex-exponential tensors of RoPE, effectively encoding motion into the generation process. These embeddings are then further optimized during denoising time steps via trajectory alignment between the predicted and target velocities using a flow-matching objective. To keep the output faithful to the text prompt and prevent duplicate generations, we incorporate a regularization term based on the phase components of the reference video's Fourier transform, projecting the phase angles onto a smooth manifold to suppress high-frequency artifacts. Experiments on benchmarks reveal that RoPECraft outperforms all recently published methods, both qualitatively and quantitatively.

Architecture

A concise overview of the RoPECraft architecture and its optimization pipeline.

RoPECraft Architecture Diagram

Motion-augmented RoPE

Instead of tiling standard 1-D rotary embeddings independently across axes, RoPECraft warps the spatial indices with per-row and per-column optical-flow cues before the complex exponential is applied. This yields motion-aware embeddings that tell the transformer which patches should attend to one another, while the temporal term remains untouched to avoid artifacts.
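
As a concrete illustration, here is a minimal PyTorch sketch of the warping idea: per-patch optical-flow offsets shift the row and column indices before the complex exponentials are formed. The function name, the `flow` tensor layout, and the per-axis frequency split are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def motion_warped_rope(flow, H, W, dim, base=10000.0):
    """Sketch: warp spatial patch indices with optical-flow offsets
    before forming the RoPE complex exponentials. `flow` is a
    hypothetical (H, W, 2) tensor of per-patch (dx, dy) offsets."""
    # Base grid of integer patch coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    # Warp row/column indices with the flow cues (temporal axis untouched).
    xs_warped = xs + flow[..., 0]
    ys_warped = ys + flow[..., 1]

    # Standard RoPE frequencies for a `dim`-channel axis (dim // 2 rotation pairs).
    half = dim // 2
    freqs = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))

    # Complex exponentials e^{i * position * frequency} per spatial axis.
    angles_x = xs_warped[..., None] * freqs   # (H, W, half)
    angles_y = ys_warped[..., None] * freqs   # (H, W, half)
    rope_x = torch.polar(torch.ones_like(angles_x), angles_x)  # complex64
    rope_y = torch.polar(torch.ones_like(angles_y), angles_y)
    return rope_x, rope_y
```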

Flow-matching optimization

During the earliest diffusion steps we align the predicted velocity field $v_\theta(t, x_t)$ with a target velocity derived from a reference latent video. Starting from the warped RoPE initialization improves subject placement and motion direction, giving cleaner, more accurate trajectories than either default RoPE or optimization from an unwarped state.
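
A minimal sketch of this stage, assuming a rectified-flow schedule in which $x_t = (1-t)\,x_0 + t\,x_1$ has the constant target velocity $x_1 - x_0$; the names (`transformer`, `z_ref`, `early_timesteps`) and the optimizer choice are assumptions, not the paper's exact setup:

```python
import torch

def optimize_rope(rope, transformer, z_ref, early_timesteps, lr=1e-2):
    """Optimize the warped RoPE tensor against a flow-matching objective.
    `transformer` is the frozen video DiT and `z_ref` the reference-video
    latents; all names here are illustrative assumptions."""
    rope = rope.detach().requires_grad_(True)     # RoPE is the only free variable
    opt = torch.optim.Adam([rope], lr=lr)
    for t in early_timesteps:                     # earliest denoising steps, t in (0, 1]
        noise = torch.randn_like(z_ref)
        x_t = (1 - t) * noise + t * z_ref         # rectified-flow interpolation
        v_target = z_ref - noise                  # its constant target velocity
        v_pred = transformer(x_t, t, rope=rope)   # frozen weights; grads reach rope
        loss = torch.mean((v_pred - v_target) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return rope.detach()
```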

Phase-consistency constraint

To eliminate residual duplicates or mis-oriented subjects, we add an $\ell_1$ penalty that matches the cosine and sine of the spatio-temporal Fourier phase of the target and predicted velocities. Representing phase on the unit circle makes the loss smooth and differentiable, fixing remaining artifacts and further stabilizing motion across generated frames.
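
A sketch of this regularizer, assuming the penalty is taken over the full spatio-temporal FFT of the velocity fields (the transform axes and any weighting are assumptions):

```python
import torch

def phase_consistency_loss(v_pred, v_target):
    # Spatio-temporal Fourier phase of each velocity field,
    # assuming trailing (T, H, W) axes.
    phase_pred = torch.angle(torch.fft.fftn(v_pred, dim=(-3, -2, -1)))
    phase_tgt = torch.angle(torch.fft.fftn(v_target, dim=(-3, -2, -1)))
    # Matching cos/sin instead of raw angles places the phases on the unit
    # circle, keeping the l1 loss smooth and free of 2*pi wrap-around jumps.
    return (torch.abs(torch.cos(phase_pred) - torch.cos(phase_tgt)).mean()
            + torch.abs(torch.sin(phase_pred) - torch.sin(phase_tgt)).mean())
```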

Fréchet Trajectory Distance (FTD)

Step 1: Sampling

Sample n foreground (red) and n background (green) seeds in the first frame.

Step 2: Occlusion-aware tracking

Track each seed with an occlusion-aware tracker: while a seed is occluded, copy the nearest visible neighbor, and discard tracks that never re-appear.

Step 3: Distance computation

Measure the RMS Fréchet distance between generated (fake) and reference (real) tracks.

Fréchet Trajectory Distance (FTD) Illustration

How it works

  1. Uniformly sample n foreground and n background points in the first frame (red / green).
  2. Track those 2n seeds with an occlusion-aware tracker; if a seed disappears, it is reassigned to its nearest visible neighbor, and tracks that never re-appear are discarded.
  3. Normalize all track coordinates by frame width W and height H to make the metric resolution-invariant.
  4. Compute the root-mean-square of the discrete Fréchet distance between each real/fake track pair (a code sketch follows this list):
    $$ \mathrm{FTD} = \left( \frac{1}{N} \sum_{i=1}^{N} D_F^2\left(\mathcal{T}_i^{\mathrm{real}},\; \mathcal{T}_i^{\mathrm{fake}}\right) \right)^{1/2} $$
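
A minimal NumPy sketch of the metric, using the standard dynamic-programming recurrence for the discrete Fréchet distance; track pairing and the occlusion-aware tracker are assumed to have been handled upstream:

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two tracks P, Q of shape (T, 2)."""
    n, m = len(P), len(Q)
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # pairwise distances
    ca = np.full((n, m), np.inf)
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]),
                           d[i, j])
    return ca[-1, -1]

def ftd(real_tracks, fake_tracks, W, H):
    """RMS of squared Fréchet distances over N matched track pairs, with
    coordinates normalized by frame size for resolution invariance."""
    scale = np.array([W, H], dtype=np.float64)
    sq = [discrete_frechet(r / scale, f / scale) ** 2
          for r, f in zip(real_tracks, fake_tracks)]
    return float(np.sqrt(np.mean(sq)))
```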

Comparison with Motion Fidelity (MF)

MF computes cosine similarity between successive displacements on a fixed grid, ignoring overall path shape, magnitude, and occlusions; it can still score highly when trajectories diverge and exhibits high variance.
FTD focuses on reliable, task-relevant tracks, handles occlusions explicitly, and evaluates full-curve similarity through the discrete Fréchet distance, delivering a more stable and physically meaningful motion-alignment metric.
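
For contrast, here is an MF-style score in the spirit of the description above (an assumed simplification, not the metric's exact implementation): cosine similarity between successive displacement vectors, averaged over time. Because it compares only local step directions, two paths that drift arbitrarily far apart can still score near 1.

```python
import numpy as np

def mf_style_score(real_track, fake_track, eps=1e-8):
    # Per-step displacement vectors, shape (T - 1, 2).
    dr = np.diff(real_track, axis=0)
    df = np.diff(fake_track, axis=0)
    # Cosine similarity of matched displacements, averaged over time;
    # overall path shape, magnitude, and occlusions are all ignored.
    cos = np.sum(dr * df, axis=-1) / (
        np.linalg.norm(dr, axis=-1) * np.linalg.norm(df, axis=-1) + eps)
    return float(cos.mean())
```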

Results

RoPECraft achieves state-of-the-art results.

Comparisons

The following results compare RoPECraft against competing methods across a wide range of reference motion videos. In each comparison, the reference video is shown on the top row, labeled "DAVIS", and the generated videos occupy the remaining two rows, each labeled with the corresponding method's name. The prompt used for each video is provided at the bottom.

BibTeX

@misc{gokmen2025ropecrafttrainingfreemotiontransfer,
      title={RoPECraft: Training-Free Motion Transfer with Trajectory-Guided RoPE Optimization on Diffusion Transformers}, 
      author={Ahmet Berke Gokmen and Yigit Ekin and Bahri Batuhan Bilecen and Aysegul Dundar},
      year={2025},
      eprint={2505.13344},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.13344}, 
}