Cross-Entropy (CE) RL Methods

  • learns a policy by filtering out the trajectory data of low-reward episodes and favoring the trajectory data of high-reward episodes
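
The filtering step above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the episode layout (a total reward plus a list of (state, action) pairs), the function names select_elite and elite_training_pairs, and the keep_fraction parameter are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical episode layout: (total_reward, [(state, action), ...]).
Step = Tuple[int, int]
Episode = Tuple[float, List[Step]]


def select_elite(episodes: List[Episode], keep_fraction: float = 0.3) -> List[Episode]:
    """Keep the highest-reward episodes and drop the rest.

    keep_fraction is an assumed hyperparameter: the share of the batch
    that counts as "high reward". At least one episode is always kept.
    """
    if not episodes:
        return []
    n_keep = max(1, int(len(episodes) * keep_fraction))
    # Rank by total reward, best first, then cut off the low-reward tail.
    ranked = sorted(episodes, key=lambda ep: ep[0], reverse=True)
    return ranked[:n_keep]


def elite_training_pairs(episodes: List[Episode], keep_fraction: float = 0.3) -> List[Step]:
    """Flatten the surviving episodes into (state, action) training pairs."""
    elite = select_elite(episodes, keep_fraction)
    return [step for _, steps in elite for step in steps]
```

In a full training loop, the returned (state, action) pairs would typically serve as supervised targets for the policy (commonly with a cross-entropy loss over actions), after which a fresh batch of episodes is collected with the updated policy and the filtering is repeated.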