RoPECraft: Training-Free Motion Transfer with Trajectory-Guided RoPE Optimization on Diffusion Transformers

Ahmet Berke Gökmen1, * Yiğit Ekin1, * Bahri Batuhan Bilecen1, * Aysegul Dundar1
1 Bilkent University
* Equal Contribution

Abstract

We propose RoPECraft, a training-free video motion transfer method for diffusion transformers that operates solely by modifying their rotary positional embeddings (RoPE). We first extract dense optical flow from a reference video, and utilize the resulting motion offsets to warp the complex-exponential tensors of RoPE, effectively encoding motion into the generation process. These embeddings are then further optimized during denoising time steps via trajectory alignment between the predicted and target velocities using a flow-matching objective. To keep the output faithful to the text prompt and prevent duplicate generations, we incorporate a regularization term based on the phase components of the reference video's Fourier transform, projecting the phase angles onto a smooth manifold to suppress high-frequency artifacts. Experiments on benchmarks reveal that RoPECraft outperforms all recently published methods, both qualitatively and quantitatively.

Architecture

A concise overview of the RoPECraft architecture and its optimization pipeline.

RoPECraft Architecture Diagram

Motion-augmented RoPE

Instead of tiling standard 1-D rotary embeddings independently across axes, RoPECraft warps the spatial indices with per-row and per-column optical-flow cues before the complex exponential is applied. This yields motion-aware embeddings that tell the transformer which patches should attend to one another, while the temporal term remains untouched to avoid artifacts.
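
As a concrete illustration, here is a minimal PyTorch sketch of the warping idea: per-patch optical-flow offsets shift the row and column indices before the complex exponentials are formed. The function name, the `flow` tensor layout, and the per-axis frequency split are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def motion_warped_rope(flow, H, W, dim, base=10000.0):
    """Sketch: warp spatial patch indices with optical-flow offsets
    before forming the RoPE complex exponentials. `flow` is a
    hypothetical (H, W, 2) tensor of per-patch (dx, dy) offsets."""
    # Base grid of integer patch coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    # Warp row/column indices with the flow cues (temporal axis untouched).
    xs_warped = xs + flow[..., 0]
    ys_warped = ys + flow[..., 1]

    # Standard RoPE frequencies for a `dim`-channel axis (dim // 2 rotation pairs).
    half = dim // 2
    freqs = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))

    # Complex exponentials e^{i * position * frequency} per spatial axis.
    angles_x = xs_warped[..., None] * freqs   # (H, W, half)
    angles_y = ys_warped[..., None] * freqs   # (H, W, half)
    rope_x = torch.polar(torch.ones_like(angles_x), angles_x)  # complex64
    rope_y = torch.polar(torch.ones_like(angles_y), angles_y)
    return rope_x, rope_y
```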

Flow-matching optimization

During the earliest diffusion steps we align the predicted velocity field $v_\theta(t, x_t)$ with a target velocity derived from a reference latent video. Starting from the warped RoPE initialization improves subject placement and motion direction, giving cleaner, more accurate trajectories than either default RoPE or optimization from an unwarped state.
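
A minimal sketch of this stage, assuming a rectified-flow schedule in which $x_t = (1-t)\,x_0 + t\,x_1$ has the constant target velocity $x_1 - x_0$; the names (`transformer`, `z_ref`, `early_timesteps`) and the optimizer choice are assumptions, not the paper's exact setup:

```python
import torch

def optimize_rope(rope, transformer, z_ref, early_timesteps, lr=1e-2):
    """Optimize the warped RoPE tensor against a flow-matching objective.
    `transformer` is the frozen video DiT and `z_ref` the reference-video
    latents; all names here are illustrative assumptions."""
    rope = rope.detach().requires_grad_(True)     # RoPE is the only free variable
    opt = torch.optim.Adam([rope], lr=lr)
    for t in early_timesteps:                     # earliest denoising steps, t in (0, 1]
        noise = torch.randn_like(z_ref)
        x_t = (1 - t) * noise + t * z_ref         # rectified-flow interpolation
        v_target = z_ref - noise                  # its constant target velocity
        v_pred = transformer(x_t, t, rope=rope)   # frozen weights; grads reach rope
        loss = torch.mean((v_pred - v_target) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return rope.detach()
```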

Phase-consistency constraint

To eliminate residual duplicates or mis-oriented subjects, we add an $\ell_1$ penalty that matches the cosine and sine of the spatio-temporal Fourier phase of the target and predicted velocities. Representing phase on the unit circle makes the loss smooth and differentiable, fixing remaining artifacts and further stabilizing motion across generated frames.
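
A sketch of this regularizer, assuming the penalty is taken over the full spatio-temporal FFT of the velocity fields (the transform axes and any weighting are assumptions):

```python
import torch

def phase_consistency_loss(v_pred, v_target):
    # Spatio-temporal Fourier phase of each velocity field,
    # assuming trailing (T, H, W) axes.
    phase_pred = torch.angle(torch.fft.fftn(v_pred, dim=(-3, -2, -1)))
    phase_tgt = torch.angle(torch.fft.fftn(v_target, dim=(-3, -2, -1)))
    # Matching cos/sin instead of raw angles places the phases on the unit
    # circle, keeping the l1 loss smooth and free of 2*pi wrap-around jumps.
    return (torch.abs(torch.cos(phase_pred) - torch.cos(phase_tgt)).mean()
            + torch.abs(torch.sin(phase_pred) - torch.sin(phase_tgt)).mean())
```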

Fréchet Trajectory Distance (FTD)

Step 1: Sampling

Sample n foreground (red) and n background (green) seeds in the first frame.

Step 2: Occlusion-aware tracking

Track each seed with an occlusion-aware tracker: while a seed is occluded, copy the nearest visible neighbor, and discard tracks that never re-appear.

Step 3: Distance computation

Measure the RMS Fréchet distance between generated (fake) and reference (real) tracks.

Fréchet Trajectory Distance (FTD) Illustration

How it works

  1. Uniformly sample n foreground and n background points in the first frame (red / green).
  2. Track those 2n seeds with an occlusion-aware tracker; if a seed disappears, it is reassigned to its nearest visible neighbor, and tracks that never re-appear are discarded.
  3. Normalize all track coordinates by frame width W and height H to make the metric resolution-invariant.
  4. Compute the root-mean-square of the discrete Fréchet distance between each real/fake track pair (a code sketch follows this list):
    $$ \mathrm{FTD} = \left( \frac{1}{N} \sum_{i=1}^{N} D_F^2\left(\mathcal{T}_i^{\mathrm{real}},\; \mathcal{T}_i^{\mathrm{fake}}\right) \right)^{1/2} $$
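
A minimal NumPy sketch of the metric, using the standard dynamic-programming recurrence for the discrete Fréchet distance; track pairing and the occlusion-aware tracker are assumed to have been handled upstream:

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two tracks P, Q of shape (T, 2)."""
    n, m = len(P), len(Q)
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # pairwise distances
    ca = np.full((n, m), np.inf)
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]),
                           d[i, j])
    return ca[-1, -1]

def ftd(real_tracks, fake_tracks, W, H):
    """RMS of squared Fréchet distances over N matched track pairs, with
    coordinates normalized by frame size for resolution invariance."""
    scale = np.array([W, H], dtype=np.float64)
    sq = [discrete_frechet(r / scale, f / scale) ** 2
          for r, f in zip(real_tracks, fake_tracks)]
    return float(np.sqrt(np.mean(sq)))
```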

Comparison with Motion Fidelity (MF)

MF computes cosine similarity between successive displacements on a fixed grid, ignoring overall path shape, magnitude, and occlusions; it can still score highly when trajectories diverge and exhibits high variance.
FTD focuses on reliable, task-relevant tracks, handles occlusions explicitly, and evaluates full-curve similarity through the discrete Fréchet distance, delivering a more stable and physically meaningful motion-alignment metric.
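
For contrast, here is an MF-style score in the spirit of the description above (an assumed simplification, not the metric's exact implementation): cosine similarity between successive displacement vectors, averaged over time. Because it compares only local step directions, two paths that drift arbitrarily far apart can still score near 1.

```python
import numpy as np

def mf_style_score(real_track, fake_track, eps=1e-8):
    # Per-step displacement vectors, shape (T - 1, 2).
    dr = np.diff(real_track, axis=0)
    df = np.diff(fake_track, axis=0)
    # Cosine similarity of matched displacements, averaged over time;
    # overall path shape, magnitude, and occlusions are all ignored.
    cos = np.sum(dr * df, axis=-1) / (
        np.linalg.norm(dr, axis=-1) * np.linalg.norm(df, axis=-1) + eps)
    return float(cos.mean())
```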

Results

RoPECraft achieves state-of-the-art results.

Comparisons

The following results compare RoPECraft against competing methods across a wide range of reference motion videos. In each comparison, the reference video is shown on the top row, labeled "DAVIS", and the generated videos occupy the remaining two rows, each labeled with the corresponding method's name. The prompt used for each video is provided at the bottom.

BibTeX

@misc{gokmen2025ropecrafttrainingfreemotiontransfer,
      title={RoPECraft: Training-Free Motion Transfer with Trajectory-Guided RoPE Optimization on Diffusion Transformers}, 
      author={Ahmet Berke Gokmen and Yigit Ekin and Bahri Batuhan Bilecen and Aysegul Dundar},
      year={2025},
      eprint={2505.13344},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.13344}, 
}