Stream guardrails enable token-level safety detection before a full response has been generated. However, they often make overly conservative judgements and mistakenly block tokens that are sensitive but safe, a failure mode known as over-refusal. They also fail to detect implicitly harmful content produced by jailbreaking, because they lack the full context.
To tackle these challenges, we propose FreoStream, a novel streaming guardrail framework. Specifically, FreoStream fine-tunes a LoRA module to perform Future-Aware Reasoning when the base guardrail detects unsafe tokens. The reasoning process follows a Future-Reason-Judge paradigm: predict the future, reason about the full context, and give the final judgement. This design effectively reduces over-refusal by incorporating future information.
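The control flow described above can be sketched roughly as follows. This is only an illustration of the Future-Reason-Judge idea, not FreoStream's implementation: the function `stream_with_future_check` and the four callables passed to it are hypothetical names, and the toy stubs stand in for the base guardrail and the LoRA-tuned reasoning steps, whose actual interfaces the summary does not describe.

```python
from typing import Callable, Iterable, List


def stream_with_future_check(
    tokens: Iterable[str],
    base_flags_unsafe: Callable[[List[str]], bool],
    predict_future: Callable[[List[str]], List[str]],
    reason: Callable[[List[str], List[str]], str],
    judge: Callable[[str], bool],
) -> List[str]:
    """Emit tokens one by one; stop only if a flagged prefix is confirmed unsafe.

    All four callables are hypothetical stand-ins (assumptions for this sketch).
    """
    emitted: List[str] = []
    for tok in tokens:
        prefix = emitted + [tok]
        if base_flags_unsafe(prefix):
            # Future: guess how the response is likely to continue.
            future = predict_future(prefix)
            # Reason: consider the prefix together with the predicted future.
            rationale = reason(prefix, future)
            # Judge: final decision; True means "unsafe, stop streaming".
            if judge(rationale):
                break
        emitted.append(tok)
    return emitted


# Toy stubs, purely illustrative.
def toy_flag(prefix: List[str]) -> bool:
    return prefix[-1] == "kill"  # a sensitive token triggers the check


def toy_reason(prefix: List[str], future: List[str]) -> str:
    return " ".join(prefix + future)


def toy_judge(rationale: str) -> bool:
    return "person" in rationale


benign = stream_with_future_check(
    ["how", "to", "kill", "a", "process"],
    toy_flag, lambda p: ["a", "process"], toy_reason, toy_judge,
)
harmful = stream_with_future_check(
    ["how", "to", "kill", "a", "person"],
    toy_flag, lambda p: ["a", "person"], toy_reason, toy_judge,
)
print(benign)
print(harmful)
```

In the toy run, the sensitive token "kill" triggers the extra check in both cases, but only the harmful continuation is cut off. The benign request streams through in full, which is the over-refusal reduction the paradigm is aiming for.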
Moreover, we introduce the Safety-Aligned Optimization module, which extracts the safety-aligned component from the reasoning gradients and uses it to update the base guardrail model, thereby enhancing streaming safety detection. Extensive experiments on various safety benchmarks demonstrate that FreoStream achieves lower over-refusal rates and stronger jailbreak defense than existing streaming guardrails.
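The summary does not say how the safety-aligned component is extracted. One plausible reading, offered here purely as an assumption, is a vector projection: keep only the part of the reasoning gradient that points along a safety gradient, and discard it when the two conflict. The function name `safety_aligned_component` and both gradient arguments are hypothetical.

```python
from typing import List


def safety_aligned_component(g_reason: List[float], g_safety: List[float]) -> List[float]:
    """Project g_reason onto the direction of g_safety.

    Assumption for this sketch: "safety-aligned" means the projection onto a
    safety gradient, dropped entirely when the two gradients disagree.
    """
    dot = sum(r * s for r, s in zip(g_reason, g_safety))
    norm_sq = sum(s * s for s in g_safety)
    if norm_sq == 0.0 or dot <= 0.0:
        # No safety direction, or the gradients conflict: contribute nothing.
        return [0.0] * len(g_reason)
    scale = dot / norm_sq
    return [scale * s for s in g_safety]


aligned = safety_aligned_component([3.0, 4.0], [1.0, 0.0])
conflicting = safety_aligned_component([-3.0, 4.0], [1.0, 0.0])
print(aligned)
print(conflicting)
```

Under this reading, only the projected component would be applied to the base guardrail, so the reasoning signal cannot push the guardrail in a direction that opposes its safety objective.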
Blogger's Review: FreoStream tackles over-refusal in streaming safety detection by letting the guardrail reason about a predicted future before it blocks a token. The framework points to a promising direction for token-level safety detection on streaming model output and is worth further exploration.