In this study, we introduce Reversal Q-Learning (RQL), a novel offline reinforcement learning (RL) algorithm that trains a flow policy from previously collected data. Our approach builds on the "expanded" Markov Decision Process (MDP) framework, which treats each flow refinement step as a separate action within an MDP.
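To make the expanded-MDP idea concrete, here is a minimal sketch of how a single environment action can be unrolled into several refinement-step transitions. Everything in it is an assumption made for illustration: the toy `velocity` field, the step count `K`, the Euler integrator, and the dictionary layout are hypothetical and are not taken from the RQL paper.

```python
import numpy as np

K = 4  # number of flow refinement steps (illustrative choice)


def velocity(state, a, t):
    """Hypothetical velocity field standing in for a learned flow policy.

    It simply pulls the partial action toward tanh(state).
    """
    return np.tanh(state) - a


def rollout_expanded(state, noise, num_steps=K):
    """Refine noise into an action, logging each Euler step as a transition."""
    a = noise.copy()
    transitions = []
    for k in range(num_steps):
        t = k / num_steps
        v = velocity(state, a, t)      # per-step "action" in the expanded MDP
        a_next = a + v / num_steps     # one Euler refinement step
        transitions.append({
            # Expanded state: environment state, partial action, step index.
            "obs": (state.copy(), a.copy(), k),
            "step_action": v.copy(),
            "next_partial": a_next.copy(),
            # Only the last refinement step sends an action to the environment.
            "executes_in_env": k == num_steps - 1,
        })
        a = a_next
    return a, transitions


state = np.array([0.5, -1.0])
final_action, transitions = rollout_expanded(state, np.zeros(2))
```

Under these assumptions, one environment step becomes `K` expanded-MDP steps, so a value function can be queried at every refinement step rather than only at the final action.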
To enable offline RL in this framework, we apply two techniques. First, we generate virtual on-policy trajectories by "reversing" flows, which makes the framework compatible with prior data. Second, we employ a bias-and-variance reduction technique to mitigate the curse of horizon in offline RL.
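The abstract does not spell out how the reversal works, so the following is only one plausible reading: start from an action stored in the dataset and integrate the flow backward, which yields a chain of intermediate points that the current flow would (approximately) traverse to reach that action. The `velocity` field, the Euler integrator, and the step count are hypothetical stand-ins, not the paper's actual procedure.

```python
import numpy as np


def velocity(state, a, t):
    """Hypothetical velocity field standing in for a learned flow policy."""
    return np.tanh(state) - a


def forward_flow(state, start, num_steps):
    """Integrate the flow forward with Euler steps, from t=0 to t=1."""
    a = start.copy()
    for k in range(num_steps):
        a = a + velocity(state, a, k / num_steps) / num_steps
    return a


def reverse_flow(state, data_action, num_steps):
    """Integrate backward from a dataset action, from t=1 to t=0.

    Returns the visited points in forward order, so the last entry is the
    dataset action itself.
    """
    a = data_action.copy()
    path = [a.copy()]
    for k in range(num_steps, 0, -1):
        a = a - velocity(state, a, k / num_steps) / num_steps
        path.append(a.copy())
    return path[::-1]


state = np.array([0.5, -1.0])
data_action = np.array([0.9, -0.2])

virtual_path = reverse_flow(state, data_action, num_steps=100)
# Sanity check: replaying the flow from the recovered start point should
# land close to the dataset action (Euler reversal is only approximate).
replayed = forward_flow(state, virtual_path[0], num_steps=100)
```

In this sketch the virtual trajectory ends exactly at the dataset action by construction, while its intermediate points follow the current flow only up to discretization error; a finer step size or an exactly invertible integrator would tighten that gap.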
RQL offers several advantages over previous flow-based methods: it avoids backpropagation through time, better utilizes the learned value function, and directly trains the complete, expressive flow policy.
Through experiments on 50 challenging simulated robotic tasks, we demonstrate that RQL achieves higher average performance than state-of-the-art flow-based offline RL algorithms.
Blogger's Review: Reversal Q-Learning's central idea is flow reversal, which lets the expanded-MDP formulation learn from offline data. The algorithm addresses several pain points of earlier flow-based methods, notably backpropagation through time and limited use of the learned value function, and it backs these claims with results on 50 simulated robotic tasks. It looks like a promising direction for future RL research.