[CS.AI] REINS: Training-Free Safety Alignment of Video Diffusion Models

Abstract

Open-weight video diffusion models can generate photorealistic unsafe content, from violence to misinformation. Existing defenses either require expensive safety fine-tuning that degrades general capability, or apply external filters that are trivially bypassed by adversarial prompts. We present REINS (REpresentation-space INference-time Safety steering), a training-free method that aligns video diffusion models at inference time by steering their internal representations toward safe generation.

Our key finding is that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers. A single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories. At inference, adding this direction to hidden states at an intermediate transformer layer redirects generation from harmful content to semantically related safe alternatives, with no weight updates, no concept enumeration, and negligible computational overhead.
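The two steps described above (finding one direction from binary safety labels, then adding it to a hidden state) can be illustrated with a minimal sketch. It assumes one common formulation of supervised PCA, the HSIC-based variant with a linear kernel on the centered labels, in which the target matrix is rank one and its top eigenvector is proportional to X_c^T y_c, that is, to the difference between the safe and unsafe class means. The abstract does not specify the exact procedure, so this may differ from what REINS does. The function names, the label convention (1 = safe, 0 = unsafe), and the steering strength `alpha` are hypothetical, and the toy vectors stand in for real transformer activations.

```python
import math


def safety_direction(activations, labels):
    """Return a unit direction separating safe from unsafe activations.

    activations: list of equal-length float lists, one hidden state per sample.
    labels: list of 0/1 ints; 1 = safe, 0 = unsafe (hypothetical convention).

    With a linear kernel on centered binary labels, the supervised PCA
    objective is rank one, so its top eigenvector is X_c^T y_c normalized.
    """
    n, dim = len(activations), len(activations[0])
    mean_x = [sum(row[j] for row in activations) / n for j in range(dim)]
    mean_y = sum(labels) / n
    # Covariance (up to a constant) between each feature and the label.
    direction = [
        sum((row[j] - mean_x[j]) * (y - mean_y)
            for row, y in zip(activations, labels))
        for j in range(dim)
    ]
    norm = math.sqrt(sum(v * v for v in direction))
    if norm == 0.0:
        raise ValueError("labels carry no linear signal in these activations")
    return [v / norm for v in direction]


def project(hidden, direction):
    """Scalar coordinate of a hidden state along the direction."""
    return sum(h * d for h, d in zip(hidden, direction))


def steer(hidden, direction, alpha):
    """Add alpha times the direction to a hidden state (no weight updates)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy data: two "safe" and two "unsafe" hidden states in 2 dimensions.
toy_activations = [[1.0, 0.0], [1.2, 0.1], [-1.0, 0.0], [-1.2, -0.1]]
toy_labels = [1, 1, 0, 0]
toy_direction = safety_direction(toy_activations, toy_labels)
steered_unsafe = steer(toy_activations[2], toy_direction, alpha=3.0)
```

In a real model, a step like `steer` would presumably run inside the forward pass at the chosen intermediate layer, for example through a framework-level hook; that wiring is omitted here because it depends on the specific architecture.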

Through mechanistic analysis, we reveal that while safety information accumulates monotonically with transformer depth, steering effectiveness peaks at intermediate layers (~50% depth), exposing a fundamental tradeoff between information availability and downstream propagation capacity. We evaluate REINS across 9 video diffusion models, multiple parameter scales (1.3B to 5B), and both text-to-video and image-to-video generation; to our knowledge, this is the broadest safety evaluation suite in the video generation literature.

Blogger's Review: REINS offers an innovative safety alignment strategy for video generation that avoids the trade-off between safety and generative capability seen in fine-tuning-based methods. Steering representations at inference time improves safety while preserving the flexibility of the generative model, which suggests broad application potential.
