[CS.AI] CineOrchestra: Unified Entity-Centric Conditioning for Video Generation

CineOrchestra is a unified video diffusion model that controls multiple subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each acts as an entity over a specific temporal interval, which can be expressed through a set of shared entity-centric conditioning primitives, augmented with reference images for visual entities.
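To make the "entity over a temporal interval" idea concrete, here is a minimal sketch of how such a condition might be represented. The class name `Entity`, its field names, the use of seconds as the time unit, and the example file name are all hypothetical choices for illustration; the summary above does not describe the paper's actual data format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    """Hypothetical record: one cinematic element active over a time interval."""
    kind: str                               # e.g. "subject", "event", "camera", "transition"
    start: float                            # interval start (seconds, an assumed unit)
    end: float                              # interval end (seconds)
    caption: str                            # text condition attached to this entity
    reference_image: Optional[str] = None   # only meaningful for visual entities

    def __post_init__(self):
        if self.end < self.start:
            raise ValueError("interval end must not precede start")


def active_at(entities, t):
    """Return the entities whose interval contains time t (inclusive bounds)."""
    return [e for e in entities if e.start <= t <= e.end]


# Illustrative shot description; all values are made up.
shot = [
    Entity("subject", 0.0, 8.0, "a woman in a red coat", reference_image="woman.png"),
    Entity("camera", 0.0, 4.0, "slow dolly-in"),
    Entity("event", 2.0, 5.0, "she opens an umbrella"),
    Entity("transition", 4.0, 4.0, "hard cut"),
]

print([e.kind for e in active_at(shot, 3.0)])
```

The point of the sketch is that subjects, events, camera moves, and transitions all fit the same record shape, which is what lets a single conditioning mechanism serve all of them.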

This formulation reduces the architectural challenge to a single positional encoding problem, solved with two coordinated, parameter-free rotary position embeddings (RoPE):

Interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration.
2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region.
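The summary does not spell out how the interval sampling works, so the following is only a rough sketch of one plausible reading of the first embedding: a fixed number of condition tokens is spread evenly over each event's interval, and a standard rotary embedding is applied at those real-valued positions. The function names `rope_rotate` and `interval_positions`, the even `linspace` spacing, and the frame-unit intervals are assumptions, not the authors' implementation. The 2D entity-temporal variant is not sketched here.

```python
import numpy as np


def rope_rotate(x, positions, base=10000.0):
    """Apply a standard rotary embedding to x of shape (n, d) at positions of shape (n,)."""
    n, d = x.shape
    assert d % 2 == 0, "feature dimension must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)       # (d/2,) per-pair frequencies
    angles = positions[:, None] * freqs[None, :]    # (n, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # rotate each (x1, x2) pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out


def interval_positions(start, end, n_tokens):
    """Assumed scheme: place n_tokens evenly over [start, end] (frame units)."""
    return np.linspace(start, end, n_tokens)


rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 8))                 # 4 condition tokens, dim 8

short = interval_positions(0.0, 8.0, 4)              # an 8-frame event
long = interval_positions(8.0, 72.0, 4)              # a 64-frame event

q_short = rope_rotate(tokens, short)
q_long = rope_rotate(tokens, long)
```

Under this assumed scheme, both events use the same number of tokens, and each token's position always falls inside its own event's interval regardless of how long the event lasts. Rotary embeddings are parameter-free by construction, which matches the description above, but whether the paper samples positions this way is not confirmed by the summary.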

On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing. Pairwise user studies and component ablations show consistent gains.

Blogger's Review: CineOrchestra's main contribution is handling subjects, events, cameras, and shot transitions in a single model rather than with separate per-axis specialists, which should make video generation more flexible and precise. It is a noteworthy development that opens new possibilities for film creation and video generation.
