arxiv:2606.26058

DomainShuttle: Freeform Open Domain Subject-driven Text-to-video Generation

Published on Jun 24

Submitted by Nan Chen on Jun 25

#3 Paper of the day

Authors:

Nan Chen ,

Abstract

DomainShuttle enables open domain subject-driven text-to-video generation with high fidelity and flexibility across in-domain and cross-domain scenarios through domain-aware modeling and dual RoPE schemes.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Open domain subject-driven text-to-video (S2V) generation has drawn significant interest in academia and industry. Open domain S2V mainly involves two scenarios: in-domain, which requires retaining the reference subject's features as faithfully as possible, and cross-domain, which preserves the intrinsic features of the subject while allowing subject-irrelevant properties to vary flexibly according to the text prompt. Existing methods primarily focus on maximizing subject fidelity in in-domain scenarios, which limits their editability and adaptability in cross-domain scenarios involving novel styles, semantic combinations, or domain attributes. In this study, we argue that an ideal S2V method should flexibly shuttle between different domains, achieving strong performance in both in-domain and cross-domain scenarios. To this end, we propose DomainShuttle, which achieves high fidelity and generative flexibility for open domain video personalization. Specifically, we introduce Domain-MoT, which decouples video and reference features and introduces a domain-aware AdaLN for domain-specific modeling of reference images. We then introduce the Video-Reference DualRoPE scheme, which places reference image tokens and video tokens in separate RoPE spaces to enable precise subject-level spatial modeling, and the Cross-Pair Consistent Loss, which aims to extract intrinsic subject features that are unaffected by irrelevant features. Extensive experiments demonstrate that DomainShuttle achieves significant performance improvements over existing methods, exhibiting high subject fidelity and generative flexibility across diverse open domain application scenarios.
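
The abstract does not specify how the Video-Reference DualRoPE scheme assigns positions, so the following is only a minimal sketch of one way "separate RoPE spaces" could be realized, not the authors' implementation. It assumes a simplified 1D rotary embedding and a hypothetical `ref_offset` that moves reference tokens into a position range disjoint from the video tokens. The names `rope_rotate`, `dual_positions`, and `ref_offset` are illustrative and do not come from the paper.

```python
import math


def rope_rotate(vec, pos, base=10000.0):
    """Apply a standard 1D rotary position embedding to an even-length vector."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        # Each consecutive pair of channels is rotated by a position-dependent angle.
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dual_positions(num_video_tokens, num_ref_tokens, ref_offset=1000):
    """Assign positions so video and reference tokens never share an index.

    Assumption (not from the paper): video tokens use 0..N-1, and reference
    tokens use a disjoint range that starts at ref_offset.
    """
    if ref_offset < num_video_tokens:
        raise ValueError("ref_offset must not overlap the video position range")
    video_pos = list(range(num_video_tokens))
    ref_pos = [ref_offset + i for i in range(num_ref_tokens)]
    return video_pos, ref_pos


if __name__ == "__main__":
    video_pos, ref_pos = dual_positions(num_video_tokens=4, num_ref_tokens=2)
    query = [1.0, 0.0, 0.5, -0.5]
    # The same feature vector is rotated differently in each position space.
    print(rope_rotate(query, video_pos[0]))
    print(rope_rotate(query, ref_pos[0]))
```

Because rotary embeddings make attention scores depend only on the difference between positions, keeping the two token groups in disjoint ranges lets the model distinguish reference tokens from video tokens by position alone. The real scheme presumably operates over spatio-temporal axes rather than the single axis used here.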

Community

CNcreator0331

Paper author Paper submitter 5 days ago

We propose DomainShuttle, an open-domain subject-driven text-to-video method that flexibly handles both in-domain fidelity and cross-domain editability by decoupling reference and video features, modeling domain attributes, and learning intrinsic subject representations.