CineOrchestra is a unified video diffusion model that controls multiple subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each acts as an entity over a specific temporal interval, which can be expressed through a set of shared entity-centric conditioning primitives, augmented with reference images for visual entities.
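To make the "entity over a temporal interval" view concrete, here is a minimal sketch of how such conditions might be represented. The class name `EntityCondition`, its fields, the helper `active_at`, and the sample scene are all hypothetical illustrations; the summary above does not specify a data format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EntityCondition:
    """One cinematic element, described as an entity over a time interval.

    Hypothetical structure for illustration only; not the authors' format.
    """
    kind: str                              # "subject", "event", "camera", or "transition"
    start: float                           # interval start, in seconds
    end: float                             # interval end, in seconds
    prompt: str                            # text description of the entity
    reference_image: Optional[str] = None  # image path, only for visual entities

    def __post_init__(self) -> None:
        if self.end < self.start:
            raise ValueError("interval end must not precede its start")


def active_at(conditions: List[EntityCondition], t: float) -> List[EntityCondition]:
    """Return the conditions whose interval contains time t (inclusive)."""
    return [c for c in conditions if c.start <= t <= c.end]


# A made-up scene: every element type uses the same primitive.
scene = [
    EntityCondition("subject", 0.0, 8.0, "a violinist in a red coat", "violinist.png"),
    EntityCondition("event", 2.0, 5.0, "she begins to play"),
    EntityCondition("camera", 0.0, 4.0, "slow dolly-in"),
    EntityCondition("transition", 4.0, 4.0, "hard cut to a close-up"),
]

print([c.kind for c in active_at(scene, 3.0)])
```

In this sketch a shot transition is simply an entity whose interval has zero length, which is one way the four element types could share a single representation.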
This formulation reduces the architectural challenge to a single positional encoding problem, solved with two coordinated, parameter-free rotary position embeddings (RoPE):
- Interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration.
- 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region.
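The two embeddings can be illustrated with a rough NumPy sketch. Everything below is an assumption made for illustration: the summary does not give the sampling rule or the channel layout, so this version spreads an entity's condition tokens evenly over its interval (`interval_positions`) and splits the channels into an entity half and a temporal half (`rope_2d`). The function names are hypothetical. Only `rope_angles` and `apply_rope` follow the standard RoPE formulation, and none of the functions has learned weights, which matches the "parameter-free" description.

```python
import numpy as np


def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE angles: positions (n,) -> angles (n, dim // 2)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)


def apply_rope(x, angles):
    """Rotate consecutive channel pairs of x (n, dim) by angles (n, dim // 2)."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out


def interval_positions(start, end, n_tokens):
    """Assumed sampling rule: n_tokens positions spread evenly over [start, end]."""
    return np.linspace(start, end, n_tokens)


def rope_2d(x, entity_ids, times):
    """Assumed 2D layout: first half of channels encodes entity index, second half time."""
    half = x.shape[1] // 2
    by_entity = apply_rope(x[:, :half], rope_angles(entity_ids, half))
    by_time = apply_rope(x[:, half:], rope_angles(times, half))
    return np.concatenate([by_entity, by_time], axis=1)


rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 8))           # 4 condition tokens, 8 channels

short_event = interval_positions(0.0, 2.0, 4)  # a 2-second event
long_event = interval_positions(0.0, 40.0, 4)  # a 40-second event, same token count

encoded = rope_2d(tokens, entity_ids=np.full(4, 1.0), times=long_event)
print(short_event, long_event, encoded.shape)
```

Under these assumptions, a short and a long event receive the same number of positional samples, each covering its own interval end to end, and tokens belonging to different entities are rotated differently in the entity half of the channels. Whether the actual model samples or splits channels this way is not stated in the summary.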
On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, and pairwise user studies and component ablations show consistent gains.
Blogger's Review: The innovation of CineOrchestra lies in handling subjects, events, cameras, and shot transitions within one model instead of relying on separate per-axis specialists, which improves both the flexibility and the precision of video generation. This approach opens new possibilities for film creation and video generation, making it a noteworthy development in the field.