.. _tqc: .. automodule:: sb3_contrib.trpo TRPO ==== `Trust Region Policy Optimization (TRPO) `_ is an iterative approach for optimizing policies with guaranteed monotonic improvement. .. rubric:: Available Policies .. autosummary:: :nosignatures: MlpPolicy CnnPolicy MultiInputPolicy Notes ----- - Original paper: https://arxiv.org/abs/1502.05477 - OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/ Can I use? ---------- - Recurrent policies: ❌ - Multi processing: ✔️ - Gym spaces: ============= ====== =========== Space Action Observation ============= ====== =========== Discrete ✔️ ✔️ Box ✔️ ✔️ MultiDiscrete ✔️ ✔️ MultiBinary ✔️ ✔️ Dict ❌ ✔️ ============= ====== =========== Example ------- .. code-block:: python import gymnasium as gym import numpy as np from sb3_contrib import TRPO env = gym.make("Pendulum-v1", render_mode="human") model = TRPO("MlpPolicy", env, verbose=1) model.learn(total_timesteps=10_000, log_interval=4) model.save("trpo_pendulum") del model # remove to demonstrate saving and loading model = TRPO.load("trpo_pendulum") obs, _ = env.reset() while True: action, _states = model.predict(obs, deterministic=True) obs, reward, terminated, truncated, info = env.step(action) env.render() if terminated or truncated: obs, _ = env.reset() Results ------- Result on the MuJoCo benchmark (1M steps on ``-v3`` envs with MuJoCo v2.1.0) using 3 seeds. The complete learning curves are available in the `associated PR `_. ===================== ============ Environments TRPO ===================== ============ HalfCheetah 1803 +/- 46 Ant 3554 +/- 591 Hopper 3372 +/- 215 Walker2d 4502 +/- 234 Swimmer 359 +/- 2 ===================== ============ How to replicate the results? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Clone RL-Zoo and checkout the branch ``feat/trpo``: .. code-block:: bash git clone https://github.com/cyprienc/rl-baselines3-zoo cd rl-baselines3-zoo/ Run the benchmark (replace ``$ENV_ID`` by the envs mentioned above): .. code-block:: bash python train.py --algo tqc --env $ENV_ID --n-eval-envs 10 --eval-episodes 20 --eval-freq 50000 Plot the results: .. code-block:: bash python scripts/all_plots.py -a trpo -e HalfCheetah Ant Hopper Walker2d Swimmer -f logs/ -o logs/trpo_results python scripts/plot_from_file.py -i logs/trpo_results.pkl -latex -l TRPO Parameters ---------- .. autoclass:: TRPO :members: :inherited-members: .. _trpo_policies: TRPO Policies ------------- .. autoclass:: MlpPolicy :members: :inherited-members: .. autoclass:: stable_baselines3.common.policies.ActorCriticPolicy :members: :noindex: .. autoclass:: CnnPolicy :members: .. autoclass:: stable_baselines3.common.policies.ActorCriticCnnPolicy :members: :noindex: .. autoclass:: MultiInputPolicy :members: .. autoclass:: stable_baselines3.common.policies.MultiInputActorCriticPolicy :members: :noindex: