Policy-regularized Offline Safe Reinforcement Learning with Preference Aligned Sampling

Project done at CMU, 2024

Offline safe reinforcement learning (RL) aims to learn a safe and relatively rewarding policy from a pre-collected dataset. One prevalent approach is the offline policy-regularized method, which typically incorporates a behavior-cloning term into policy learning to keep the learned policy close to the behavior policy, thereby mitigating the distribution-shift challenge. However, this framework may suffer from the suboptimality of the behavior policy when the dataset is imbalanced. In this work, we propose DIAM (distribution aligned sampling), a preference-aligned sampling method tailored to policy-regularized offline safe algorithms. Comprehensive evaluation on various tasks illustrates DIAM's ability to improve the behavior policy, and hence to benefit policy-regularized offline safe algorithms. DIAM outperforms other model-centric and data-centric methods while remaining broadly applicable and simple in structure.
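To make the policy-regularization idea concrete, below is a minimal sketch of a TD3+BC-style actor update in which minibatch transitions are weighted by a preference score favoring high-reward, low-cost behavior. This is not the paper's DIAM algorithm: the weighting rule `preference_weights`, the coefficient `alpha`, and all network shapes are illustrative assumptions.

```python
# Hypothetical sketch: policy-regularized actor update with preference-weighted sampling.
import torch
import torch.nn as nn

def preference_weights(rewards, costs, temperature=1.0):
    """Assumed weighting rule: upweight transitions with high reward and low cost."""
    score = rewards - costs
    return torch.softmax(score / temperature, dim=0)

def policy_regularized_update(actor, critic, optimizer, batch, alpha=2.5):
    """One actor step: maximize the critic's value while staying close to behavior actions."""
    states, actions, rewards, costs = batch
    weights = preference_weights(rewards, costs)           # preference-aligned sample weights
    pi = actor(states)                                      # learned policy's actions
    q = critic(torch.cat([states, pi], dim=-1)).squeeze(-1) # critic value of those actions
    lam = alpha / q.abs().mean().detach()                   # TD3+BC-style scaling of the Q term
    bc = ((pi - actions) ** 2).mean(dim=-1)                 # behavior-cloning regularizer
    loss = (weights * (-lam * q + bc)).sum()                # weighted regularized objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random data (dimensions are arbitrary).
state_dim, action_dim, n = 8, 2, 64
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)
batch = (torch.randn(n, state_dim), torch.rand(n, action_dim) * 2 - 1,
         torch.randn(n), torch.rand(n))
print(policy_regularized_update(actor, critic, optimizer, batch))
```

A data-centric sampling scheme like this only changes which transitions the regularizer imitates, so it can be layered on top of existing policy-regularized offline safe RL algorithms without modifying their losses.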

Download Slides