[CS.AI] CineOrchestra: Unified Entity-Centric Conditioning for Video Generation

Published: 2026-06-16 22:00 | Last updated: 2026-06-17 01:38
#AI #Machine Learning #Open Source

CineOrchestra is a unified video diffusion model that controls multiple subjects, events, cameras, and shot transitions simultaneously. The authors' key insight is that these heterogeneous cinematic elements share a fundamental structure: each acts as an entity over a specific temporal interval, so each can be expressed through a set of shared entity-centric conditioning primitives, augmented with reference images for visual entities.

This formulation reduces the architectural challenge to a single positional encoding problem, solved with two parameter-free coordinated rotary embeddings:

  1. Interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration.
  2. 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region.
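
The abstract does not spell out either formulation, so the following is only a minimal sketch of how the two ideas might look, not the authors' implementation. It assumes that "interval-sampled" means an event's condition tokens receive positions spread evenly over the event's time interval, and that the 2D variant splits channels into an entity half and a temporal half. The names interval_positions, rope_rotate, and rope_2d are hypothetical; only the per-pair rotation itself is standard RoPE, which has no learned parameters.

```python
import math


def interval_positions(start, end, n_tokens):
    """Sample n_tokens positions evenly over the closed interval [start, end].

    Assumption: this is one plausible reading of "interval-sampled".
    """
    if n_tokens == 1:
        return [(start + end) / 2.0]
    step = (end - start) / (n_tokens - 1)
    return [start + i * step for i in range(n_tokens)]


def rope_rotate(vec, pos, base=10000.0):
    """Standard 1D RoPE: rotate each channel pair by pos * base**(-i/d)."""
    d = len(vec)
    if d % 2:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out


def rope_2d(vec, entity_idx, time_pos):
    """Hypothetical 2D variant: the first half of the channels is rotated
    by the entity index, the second half by the temporal position."""
    if len(vec) % 4:
        raise ValueError("vector length must be divisible by 4")
    half = len(vec) // 2
    return rope_rotate(vec[:half], entity_idx) + rope_rotate(vec[half:], time_pos)


def dot(a, b):
    """Plain dot product, used to inspect attention logits."""
    return sum(x * y for x, y in zip(a, b))


# Two events of very different duration, each described by 5 condition tokens.
short_event = interval_positions(0.0, 8.0, 5)   # [0.0, 2.0, 4.0, 6.0, 8.0]
long_event = interval_positions(0.0, 64.0, 5)   # [0.0, 16.0, 32.0, 48.0, 64.0]
```

Under these assumptions, both events cover their full interval with the same number of tokens regardless of duration, and the attention logit between a rotated query and key depends only on the difference in entity index and the difference in time, which is the property that lets a condition be matched to its own entity and time span.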

On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, showing consistent gains in pairwise user studies and component ablations.

Blogger's Review: CineOrchestra's contribution is handling subjects, events, cameras, and shot transitions in a single model instead of relying on separate per-axis specialists. According to the reported results, this improves dense caption following and shot-transition timing, which makes it a development worth watching for film creation and video generation.

Original Source: https://arxiv.org/abs/2606.13768
