This is a deep reinforcement learning package that includes mainstream RL algorithms such as DDPG, TRPO and PPO. It will be updated continuously to cover more up-to-date mainstream RL algorithms. It aims to help RL beginners and researchers understand and use RL algorithms more easily in their own research. All the included algorithms are implemented so that they can achieve the performance claimed in the corresponding papers. So far, the following algorithms are nearly-SOTA:
DDPG, NAF, TD3, TRPO, PPO
The definition of "nearly-SOTA": I don't have enough time to test all the envs included in the corresponding papers and provide comparisons with the baselines. I only test my implementation in one of the envs (most likely Hopper-v2), where it achieves the same or higher performance than the baseline.
It also includes our newly proposed algorithm, Hindsight Trust Region Policy Optimization (HTRPO). A demo video in demo/ shows how HTRPO works. HTRPO has been submitted to ICLR 2020.
python 3.4
torch 1.1.0
numpy 1.16.2
gym 0.12.1
tensorboardX 1.7
mujoco-py 2.0.2.2
robosuite
Please make sure that the versions of all requirements match the ones above; this is necessary for running the code.
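If you use pip, the following one-liner should install the Python packages listed above; note that mujoco-py additionally requires a local MuJoCo installation and license key, and robosuite has its own setup steps (see their docs):

pip install torch==1.1.0 numpy==1.16.2 gym==0.12.1 tensorboardX==1.7 mujoco-py==2.0.2.2 robosuite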
Algorithm | Nearly-SOTA? |
---|---|
Deep Q-Learning (DQN) | × |
Double DQN | × |
Dueling DQN | × |
Normalized Advantage Function with DQN | ✔️ |
Deep Deterministic Policy Gradient (DDPG) | ✔️ |
Twin Delayed DDPG | ✔️ |
Vanilla Policy Gradient | - |
Natural Policy Gradient | - |
Trust Region Policy Optimization | ✔️ |
Proximal Policy Optimization | ✔️ |
Hindsight Policy Gradients | ✔️ |
Hindsight Trust Region Policy Optimization | ✔️ |
For running continuous envs (e.g. FetchPush-v1) with the HTRPO algorithm:
python main.py --alg HTRPO --env FetchPush-v1 --num_steps 2000000 --num_evals 200 --eval_interval 19200 (--cpu)
For running discrete envs (e.g. FlipBit8):
python main.py --alg HTRPO --env FlipBit8 --unnormobs --num_steps 50000 --num_evals 200 --eval_interval 1024 (--cpu)
--cpu is used only when you want to train the policy on the CPU, which is much slower than using a GPU.
--unnormobs is used when you do not want input normalization. In our paper, none of the discrete envs use this trick.
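For reference, the input normalization mentioned above is the standard running mean/std trick applied to observations. The sketch below is a minimal illustration of that technique; the class and variable names are illustrative and are not this package's internal API:

```python
import numpy as np

class RunningObsNormalizer:
    """Minimal running mean/std observation normalizer
    (illustrative sketch, not this repo's actual implementation)."""
    def __init__(self, obs_dim, eps=1e-8):
        self.mean = np.zeros(obs_dim)
        self.var = np.ones(obs_dim)
        self.count = eps
        self.eps = eps

    def update(self, obs_batch):
        # Welford-style parallel update of the running statistics from a batch.
        batch_mean = obs_batch.mean(axis=0)
        batch_var = obs_batch.var(axis=0)
        batch_count = obs_batch.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta ** 2 * self.count * batch_count / total) / total
        self.count = total

    def normalize(self, obs):
        # Standardize an observation with the current running statistics.
        return (obs - self.mean) / np.sqrt(self.var + self.eps)
```

Passing --unnormobs disables this kind of preprocessing, so the policy sees raw observations.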
Note for users:
- DDPG, TD3 and NAF should be trained with the "--unnormobs" and "--unnormret" switches turned on. The normalization is not yet optimized for these 3 methods; hence, with observation normalization or return normalization, their performance will be much lower than the baselines.
- All the experimental results compared with baselines will be continuously updated when I have time.
- Supported envs: all the MuJoCo envs.
- Our discrete envs: FlipBit8, FlipBit16, EmptyMaze, FourRoomMaze, FetchReachDiscrete, FetchPushDiscrete, FetchSlideDiscrete, MsPacman.
All the listed names can be used directly on the command line for training policies. BUT NOTE: sparse-reward envs are only supported by HTRPO, and dense-reward envs do not support HTRPO.
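As a quick illustration of what "sparse reward" means for the goal-conditioned envs above, the snippet below inspects FetchPush-v1 through the plain Gym API (this only assumes gym and a working mujoco-py, and is independent of this package's training code):

```python
import gym

# Goal-conditioned robotics env with a sparse reward (requires mujoco-py).
env = gym.make("FetchPush-v1")
obs = env.reset()
print(obs.keys())  # dict with 'observation', 'achieved_goal', 'desired_goal'

# The reward stays -1.0 on every step until the goal is reached, then becomes 0.0.
obs, reward, done, info = env.step(env.action_space.sample())
print(reward)
```

Dense-reward envs (e.g. Hopper-v2) instead return a shaped reward at every step, which is why they are handled by the other algorithms in the table rather than by HTRPO.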
- Zhang, Hanbo, et al. "Hindsight Trust Region Policy Optimization." arXiv preprint arXiv:1907.12439 (2019).
- Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529.
- Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep reinforcement learning with double q-learning." Thirtieth AAAI conference on artificial intelligence. 2016.
- Wang, Ziyu, et al. "Dueling Network Architectures for Deep Reinforcement Learning." International Conference on Machine Learning. 2016.
- Gu, Shixiang, et al. "Continuous deep q-learning with model-based acceleration." International Conference on Machine Learning. 2016.
- Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
- Fujimoto, Scott, Herke van Hoof, and David Meger. "Addressing Function Approximation Error in Actor-Critic Methods." International Conference on Machine Learning. 2018.
- Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.
- Kakade, Sham M. "A natural policy gradient." Advances in neural information processing systems. 2002.
- Schulman, John, et al. "Trust region policy optimization." International conference on machine learning. 2015.
- Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
- Schulman, John, et al. "High-dimensional continuous control using generalized advantage estimation." arXiv preprint arXiv:1506.02438 (2015).