Fix typo in changelog (#237)

Release v2.3.0 (#236)
Log success rate for PPO variants (#235)
2024-05-06 14:20:28 +01:00
18 changed files with 206 additions and 222 deletions
--- a/10
+++ b/10
@ -12,23 +12,23 @@ type: mypy
 lint:
 	# stop the build if there are Python syntax errors or undefined names
 	# see https://www.flake8rules.com/
-	ruff ${LINT_PATHS} --select=E9,F63,F7,F82 --show-source
+	ruff check ${LINT_PATHS} --select=E9,F63,F7,F82 --output-format=full
 	# exit-zero treats all errors as warnings.
-	ruff ${LINT_PATHS} --exit-zero
+	ruff check ${LINT_PATHS} --exit-zero
 format:
 	# Sort imports
-	ruff --select I ${LINT_PATHS} --fix
+	ruff check --select I ${LINT_PATHS} --fix
 	# Reformat using black
 	black ${LINT_PATHS}
 check-codestyle:
 	# Sort imports
-	ruff --select I ${LINT_PATHS}
+	ruff check --select I ${LINT_PATHS}
 	# Reformat using black
 	black --check ${LINT_PATHS}
 commit-checks: format type lint
 doc:
--- a/docs/conf.py
+++ b/docs/conf.py
@ -11,6 +11,7 @@
 # add these directories to sys.path here. If the directory is relative to the
 # documentation root, use os.path.abspath to make it absolute, like shown here.
 #
 import datetime
 import os
 import sys
 from typing import Dict
@ -44,7 +45,7 @@ with open(version_file) as file_handler:
 # -- Project information -----------------------------------------------------
 project = "Stable Baselines3 - Contrib"
-copyright = "2023, Stable Baselines3"
+copyright = f"2021-{datetime.date.today().year}, Stable Baselines3"
 author = "Stable Baselines3 Contributors"
 # The short X.Y version
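Reviewer note: the new `copyright` line computes the end year when the docs are built instead of hard-coding it. A minimal sketch of what the f-string evaluates to. The `copyright_notice` helper and its `today` parameter are hypothetical and exist only so the result can be checked for a fixed date; `conf.py` itself just assigns the f-string.

```python
import datetime
from typing import Optional


def copyright_notice(today: Optional[datetime.date] = None) -> str:
    """Build the notice the same way the updated conf.py line does.

    `today` is a hypothetical parameter (not present in conf.py) that makes
    the result reproducible; by default the current date is used.
    """
    current = today if today is not None else datetime.date.today()
    return f"2021-{current.year}, Stable Baselines3"


# Fixed date taken from the commit timestamp above
print(copyright_notice(datetime.date(2024, 5, 6)))  # 2021-2024, Stable Baselines3
```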
--- a/docs/guide/examples.rst
+++ b/docs/guide/examples.rst
@ -35,6 +35,12 @@ MaskablePPO
 Train a PPO with invalid action masking agent on a toy environment.
 .. warning::
  You must use ``MaskableEvalCallback`` from ``sb3_contrib.common.maskable.callbacks`` instead of the base ``EvalCallback`` to properly evaluate a model with action masks.
  Similarly, you must use ``evaluate_policy`` from ``sb3_contrib.common.maskable.evaluation`` instead of the SB3 one.
 .. code-block:: python
  from sb3_contrib import MaskablePPO
--- a/docs/misc/changelog.rst
+++ b/docs/misc/changelog.rst
@ -3,13 +3,55 @@
 Changelog
 ==========
 Release 2.3.0 (2024-03-31)
 --------------------------
-Release 2.2.0 (2023-11-16)
+**New default hyperparameters for QR-DQN**
 Breaking Changes:
 ^^^^^^^^^^^^^^^^^
 - Upgraded to Stable-Baselines3 >= 2.3.0
 - The default ``learning_starts`` parameter of ``QRDQN`` has been changed to be consistent with the other off-policy algorithms
 .. code-block:: python
  # SB3 < 2.3.0 default hyperparameters, 50_000 corresponded to Atari default hyperparameters
  # model = QRDQN("MlpPolicy", env, learning_starts=50_000)
  # SB3 >= 2.3.0:
  model = QRDQN("MlpPolicy", env, learning_starts=100)
 New Features:
 ^^^^^^^^^^^^^
 - Added ``rollout_buffer_class`` and ``rollout_buffer_kwargs`` arguments to MaskablePPO
 - Log success rate ``rollout/success_rate`` when available for on-policy algorithms
 Bug Fixes:
 ^^^^^^^^^^
 Deprecations:
 ^^^^^^^^^^^^^
 Others:
 ^^^^^^^
 - Fixed ``train_freq`` type annotation for tqc and qrdqn (@Armandpl)
 - Fixed ``sb3_contrib/common/maskable/*.py`` type annotations
 - Fixed ``sb3_contrib/ppo_mask/ppo_mask.py`` type annotations
 - Fixed ``sb3_contrib/common/vec_env/async_eval.py`` type annotations
 Documentation:
 ^^^^^^^^^^^^^^
 - Add some additional notes about ``MaskablePPO`` (evaluation and multi-process) (@icheered)
 Release 2.2.1 (2023-11-17)
 --------------------------
 Breaking Changes:
 ^^^^^^^^^^^^^^^^^
- Upgraded to Stable-Baselines3 >= 2.2.0
+- Upgraded to Stable-Baselines3 >= 2.2.1
 - Switched to ``ruff`` for sorting imports (isort is no longer needed), black and ruff version now require a minimum version
 - Dropped ``x is False`` in favor of ``not x``, which means that callbacks that wrongly returned None (instead of a boolean) will cause the training to stop (@iwishiwasaneagle)
@ -512,4 +554,4 @@ Contributors:
 -------------
@ku2482 @guyk1971 @minhlong94 @ayeright @kronion @glmcdona @cyprienc @sgillen @Gregwar @rnederstigt @qgallouedec
-@mlodel @CppMaster @burakdmb @honglu2875 @ZikangXiong @AlexPasqua @jonasreiher
+@mlodel @CppMaster @burakdmb @honglu2875 @ZikangXiong @AlexPasqua @jonasreiher @icheered @Armandpl
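Reviewer note: the 2.2.1 entry about dropping ``x is False`` in favor of ``not x`` is easy to misread, so here is a minimal sketch of the difference. Both function names are hypothetical and only model the "should training stop?" decision; they are not the library's code.

```python
from typing import Optional


def stops_training_old(callback_result: Optional[bool]) -> bool:
    """Old-style check: only an explicit False stops training."""
    return callback_result is False


def stops_training_new(callback_result: Optional[bool]) -> bool:
    """New-style check: any falsy value, including None, stops training."""
    return not callback_result


# A callback that forgets its return statement yields None
for result in (True, False, None):
    print(result, stops_training_old(result), stops_training_new(result))
```

A callback whose step method has no ``return`` statement returns ``None``, which the old check treated as "continue" and the new check treats as "stop". Returning ``True`` explicitly avoids the surprise.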
--- a/docs/modules/ppo_mask.rst
+++ b/docs/modules/ppo_mask.rst
@ -45,6 +45,16 @@ MultiBinary   ✔️      ✔️
 Dict          ❌      ✔️
 ============= ====== ===========
 .. warning::
  You must use ``MaskableEvalCallback`` from ``sb3_contrib.common.maskable.callbacks`` instead of the base ``EvalCallback`` to properly evaluate a model with action masks.
  Similarly, you must use ``evaluate_policy`` from ``sb3_contrib.common.maskable.evaluation`` instead of the SB3 one.
 .. warning::
  In order to use ``SubprocVecEnv`` with ``MaskablePPO``, you must implement the ``action_masks`` inside the environment (``ActionMasker`` cannot be used).
  You can have a look at the `built-in environments with invalid action masks <https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/blob/master/sb3_contrib/common/envs/invalid_actions_env.py>`_ to have a working example. 
 Example
 -------
@ -58,6 +68,8 @@ returns the invalid action mask (``True`` if the action is valid, ``False`` othe
  from sb3_contrib.common.envs import InvalidActionEnvDiscrete
  from sb3_contrib.common.maskable.evaluation import evaluate_policy
  from sb3_contrib.common.maskable.utils import get_action_masks
  # This is a drop-in replacement for EvalCallback
  from sb3_contrib.common.maskable.callbacks import MaskableEvalCallback
  env = InvalidActionEnvDiscrete(dim=80, n_invalid_actions=60)
--- a/pyproject.toml
+++ b/pyproject.toml
@ -3,16 +3,18 @@
 line-length = 127
 # Assume Python 3.8
 target-version = "py38"
 [tool.ruff.lint]
 select = ["E", "F", "B", "UP", "C90", "RUF"]
 # B028: Ignore explicit stacklevel`
 # RUF013: Too many false positives (implicit optional)
 ignore = ["B028", "RUF013"]
-[tool.ruff.per-file-ignores]
+[tool.ruff.lint.per-file-ignores]
 # ClassVar, implicit optional check not needed for tests
 "./tests/*.py"= ["RUF012", "RUF013"]
-[tool.ruff.mccabe]
+[tool.ruff.lint.mccabe]
 # Unlike Flake8, ruff default to a complexity level of 10.
 max-complexity = 15
@ -28,12 +30,6 @@ exclude = """(?x)(
 	| sb3_contrib/ars/ars.py$
 	| sb3_contrib/common/recurrent/policies.py$
 	| sb3_contrib/common/recurrent/buffers.py$
 	| sb3_contrib/common/maskable/distributions.py$
 	| sb3_contrib/common/maskable/callbacks.py$
 	| sb3_contrib/common/maskable/policies.py$
 	| sb3_contrib/common/maskable/buffers.py$
 	| sb3_contrib/common/vec_env/async_eval.py$
 	| sb3_contrib/ppo_mask/ppo_mask.py$
 	| tests/test_train_eval_mode.py$
  )"""
--- a/sb3_contrib/common/maskable/buffers.py
+++ b/sb3_contrib/common/maskable/buffers.py
@ -18,7 +18,7 @@ class MaskableRolloutBufferSamples(NamedTuple):
    action_masks: th.Tensor
-class MaskableDictRolloutBufferSamples(MaskableRolloutBufferSamples):
+class MaskableDictRolloutBufferSamples(NamedTuple):
    observations: TensorDict
    actions: th.Tensor
    old_values: th.Tensor
@ -42,6 +42,8 @@ class MaskableRolloutBuffer(RolloutBuffer):
    :param n_envs: Number of parallel environments
    """
    action_masks: np.ndarray
    def __init__(
        self,
        buffer_size: int,
@ -53,7 +55,6 @@ class MaskableRolloutBuffer(RolloutBuffer):
        n_envs: int = 1,
    ):
        super().__init__(buffer_size, observation_space, action_space, device, gae_lambda, gamma, n_envs)
        self.action_masks = None
    def reset(self) -> None:
        if isinstance(self.action_space, spaces.Discrete):
@ -61,6 +62,10 @@ class MaskableRolloutBuffer(RolloutBuffer):
        elif isinstance(self.action_space, spaces.MultiDiscrete):
            mask_dims = sum(self.action_space.nvec)
        elif isinstance(self.action_space, spaces.MultiBinary):
            assert isinstance(self.action_space.n, int), (
                f"Multi-dimensional MultiBinary({self.action_space.n}) action space is not supported. "
                "You can flatten it instead."
            )
            mask_dims = 2 * self.action_space.n  # One mask per binary outcome
        else:
            raise ValueError(f"Unsupported action space {type(self.action_space)}")
@ -79,7 +84,7 @@ class MaskableRolloutBuffer(RolloutBuffer):
        super().add(*args, **kwargs)
-    def get(self, batch_size: Optional[int] = None) -> Generator[MaskableRolloutBufferSamples, None, None]:
+    def get(self, batch_size: Optional[int] = None) -> Generator[MaskableRolloutBufferSamples, None, None]:  # type: ignore[override]
        assert self.full, ""
        indices = np.random.permutation(self.buffer_size * self.n_envs)
        # Prepare the data
@ -105,7 +110,7 @@ class MaskableRolloutBuffer(RolloutBuffer):
            yield self._get_samples(indices[start_idx : start_idx + batch_size])
            start_idx += batch_size
-    def _get_samples(self, batch_inds: np.ndarray, env: Optional[VecNormalize] = None) -> MaskableRolloutBufferSamples:
+    def _get_samples(self, batch_inds: np.ndarray, env: Optional[VecNormalize] = None) -> MaskableRolloutBufferSamples:  # type: ignore[override]
        data = (
            self.observations[batch_inds],
            self.actions[batch_inds],
@ -143,17 +148,18 @@ class MaskableDictRolloutBuffer(DictRolloutBuffer):
    :param n_envs: Number of parallel environments
    """
    action_masks: np.ndarray
    def __init__(
        self,
        buffer_size: int,
-        observation_space: spaces.Space,
+        observation_space: spaces.Dict,
        action_space: spaces.Space,
        device: Union[th.device, str] = "auto",
        gae_lambda: float = 1,
        gamma: float = 0.99,
        n_envs: int = 1,
    ):
        self.action_masks = None
        super().__init__(buffer_size, observation_space, action_space, device, gae_lambda, gamma, n_envs=n_envs)
    def reset(self) -> None:
@ -162,6 +168,10 @@ class MaskableDictRolloutBuffer(DictRolloutBuffer):
        elif isinstance(self.action_space, spaces.MultiDiscrete):
            mask_dims = sum(self.action_space.nvec)
        elif isinstance(self.action_space, spaces.MultiBinary):
            assert isinstance(self.action_space.n, int), (
                f"Multi-dimensional MultiBinary({self.action_space.n}) action space is not supported. "
                "You can flatten it instead."
            )
            mask_dims = 2 * self.action_space.n  # One mask per binary outcome
        else:
            raise ValueError(f"Unsupported action space {type(self.action_space)}")
@ -180,7 +190,7 @@ class MaskableDictRolloutBuffer(DictRolloutBuffer):
        super().add(*args, **kwargs)
-    def get(self, batch_size: Optional[int] = None) -> Generator[MaskableDictRolloutBufferSamples, None, None]:
+    def get(self, batch_size: Optional[int] = None) -> Generator[MaskableDictRolloutBufferSamples, None, None]:  # type: ignore[override]
        assert self.full, ""
        indices = np.random.permutation(self.buffer_size * self.n_envs)
        # Prepare the data
@ -203,7 +213,7 @@ class MaskableDictRolloutBuffer(DictRolloutBuffer):
            yield self._get_samples(indices[start_idx : start_idx + batch_size])
            start_idx += batch_size
-    def _get_samples(self, batch_inds: np.ndarray, env: Optional[VecNormalize] = None) -> MaskableDictRolloutBufferSamples:
+    def _get_samples(self, batch_inds: np.ndarray, env: Optional[VecNormalize] = None) -> MaskableDictRolloutBufferSamples:  # type: ignore[override]
        return MaskableDictRolloutBufferSamples(
            observations={key: self.to_torch(obs[batch_inds]) for (key, obs) in self.observations.items()},
            actions=self.to_torch(self.actions[batch_inds]),
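Reviewer note: both `reset()` methods size the mask buffer from the action space, and the new assertion rejects multi-dimensional `MultiBinary` spaces. The sketch below restates that sizing rule in plain Python. The `mask_dims` function and its `kind`/`size` arguments are hypothetical stand-ins for a gymnasium space, and the `Discrete` branch is an assumption because that line is not shown in the hunks above.

```python
from typing import Sequence, Union


def mask_dims(kind: str, size: Union[int, Sequence[int]]) -> int:
    """Return how many mask entries are stored per step.

    `kind` and `size` are hypothetical stand-ins for a gymnasium space:
    "discrete" with n, "multi_discrete" with nvec, "multi_binary" with n.
    """
    if kind == "discrete":
        # Assumption: one mask entry per discrete action
        assert isinstance(size, int)
        return size
    if kind == "multi_discrete":
        # One entry per choice of every sub-action
        assert not isinstance(size, int)
        return sum(size)
    if kind == "multi_binary":
        # Mirrors the new assertion: only a flat (int) MultiBinary is supported
        assert isinstance(size, int), (
            f"Multi-dimensional MultiBinary({size}) action space is not supported. "
            "You can flatten it instead."
        )
        return 2 * size  # One mask per binary outcome
    raise ValueError(f"Unsupported action space {kind}")


print(mask_dims("discrete", 80))
print(mask_dims("multi_discrete", [3, 4]))
print(mask_dims("multi_binary", 5))
```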
--- a/sb3_contrib/common/maskable/callbacks.py
+++ b/sb3_contrib/common/maskable/callbacks.py
@ -55,7 +55,7 @@ class MaskableEvalCallback(EvalCallback):
            # Note that evaluate_policy() has been patched to support masking
            episode_rewards, episode_lengths = evaluate_policy(
-                self.model,
+                self.model,  # type: ignore[arg-type]
                self.eval_env,
                n_eval_episodes=self.n_eval_episodes,
                render=self.render,
@ -67,6 +67,8 @@ class MaskableEvalCallback(EvalCallback):
            )
            if self.log_path is not None:
                assert isinstance(episode_rewards, list)
                assert isinstance(episode_lengths, list)
                self.evaluations_timesteps.append(self.num_timesteps)
                self.evaluations_results.append(episode_rewards)
                self.evaluations_length.append(episode_lengths)
@ -87,7 +89,7 @@ class MaskableEvalCallback(EvalCallback):
            mean_reward, std_reward = np.mean(episode_rewards), np.std(episode_rewards)
            mean_ep_length, std_ep_length = np.mean(episode_lengths), np.std(episode_lengths)
-            self.last_mean_reward = mean_reward
+            self.last_mean_reward = float(mean_reward)
            if self.verbose > 0:
                print(f"Eval num_timesteps={self.num_timesteps}, " f"episode_reward={mean_reward:.2f} +/- {std_reward:.2f}")
@ -111,7 +113,7 @@ class MaskableEvalCallback(EvalCallback):
                    print("New best mean reward!")
                if self.best_model_save_path is not None:
                    self.model.save(os.path.join(self.best_model_save_path, "best_model"))
-                self.best_mean_reward = mean_reward
+                self.best_mean_reward = float(mean_reward)
                # Trigger callback on new best model, if needed
                if self.callback_on_new_best is not None:
                    continue_training = self.callback_on_new_best.on_step()
--- a/sb3_contrib/common/maskable/distributions.py
+++ b/sb3_contrib/common/maskable/distributions.py
@ -1,5 +1,5 @@
 from abc import ABC, abstractmethod
-from typing import List, Optional, Tuple, TypeVar
+from typing import List, Optional, Tuple, TypeVar, Union
 import numpy as np
 import torch as th
@ -13,6 +13,7 @@ SelfMaskableCategoricalDistribution = TypeVar("SelfMaskableCategoricalDistributi
 SelfMaskableMultiCategoricalDistribution = TypeVar(
    "SelfMaskableMultiCategoricalDistribution", bound="MaskableMultiCategoricalDistribution"
 )
 MaybeMasks = Union[th.Tensor, np.ndarray, None]
 class MaskableCategorical(Categorical):
@ -36,14 +37,14 @@ class MaskableCategorical(Categorical):
        probs: Optional[th.Tensor] = None,
        logits: Optional[th.Tensor] = None,
        validate_args: Optional[bool] = None,
-        masks: Optional[np.ndarray] = None,
+        masks: MaybeMasks = None,
    ):
        self.masks: Optional[th.Tensor] = None
        super().__init__(probs, logits, validate_args)
        self._original_logits = self.logits
        self.apply_masking(masks)
-    def apply_masking(self, masks: Optional[np.ndarray]) -> None:
+    def apply_masking(self, masks: MaybeMasks) -> None:
        """
        Eliminate ("mask out") chosen categorical outcomes by setting their probability to 0.
@ -84,7 +85,7 @@ class MaskableCategorical(Categorical):
 class MaskableDistribution(Distribution, ABC):
    @abstractmethod
-    def apply_masking(self, masks: Optional[np.ndarray]) -> None:
+    def apply_masking(self, masks: MaybeMasks) -> None:
        """
        Eliminate ("mask out") chosen distribution outcomes by setting their probability to 0.
@ -94,6 +95,13 @@ class MaskableDistribution(Distribution, ABC):
            previously applied masking is removed, and the original logits are restored.
        """
    @abstractmethod
    def proba_distribution_net(self, *args, **kwargs) -> nn.Module:
        """Create the layers and parameters that represent the distribution.
        Subclasses must define this, but the arguments and return type vary between
        concrete classes."""
 class MaskableCategoricalDistribution(MaskableDistribution):
    """
@ -154,7 +162,7 @@ class MaskableCategoricalDistribution(MaskableDistribution):
        log_prob = self.log_prob(actions)
        return actions, log_prob
-    def apply_masking(self, masks: Optional[np.ndarray]) -> None:
+    def apply_masking(self, masks: MaybeMasks) -> None:
        assert self.distribution is not None, "Must set distribution parameters"
        self.distribution.apply_masking(masks)
@ -192,7 +200,7 @@ class MaskableMultiCategoricalDistribution(MaskableDistribution):
        reshaped_logits = action_logits.view(-1, sum(self.action_dims))
        self.distributions = [
-            MaskableCategorical(logits=split) for split in th.split(reshaped_logits, tuple(self.action_dims), dim=1)
+            MaskableCategorical(logits=split) for split in th.split(reshaped_logits, list(self.action_dims), dim=1)
        ]
        return self
@ -229,18 +237,16 @@ class MaskableMultiCategoricalDistribution(MaskableDistribution):
        log_prob = self.log_prob(actions)
        return actions, log_prob
-    def apply_masking(self, masks: Optional[np.ndarray]) -> None:
+    def apply_masking(self, masks: MaybeMasks) -> None:
        assert len(self.distributions) > 0, "Must set distribution parameters"
        split_masks = [None] * len(self.distributions)
        if masks is not None:
-            masks = th.as_tensor(masks)
+            masks_tensor = th.as_tensor(masks)
            # Restructure shape to align with logits
-            masks = masks.view(-1, sum(self.action_dims))
+            masks_tensor = masks_tensor.view(-1, sum(self.action_dims))
            # Then split columnwise for each discrete action
-            split_masks = th.split(masks, tuple(self.action_dims), dim=1)
+            split_masks = th.split(masks_tensor, list(self.action_dims), dim=1)  # type: ignore[assignment]
        for distribution, mask in zip(self.distributions, split_masks):
            distribution.apply_masking(mask)
@ -268,10 +274,13 @@ def make_masked_proba_distribution(action_space: spaces.Space) -> MaskableDistri
    """
    if isinstance(action_space, spaces.Discrete):
-        return MaskableCategoricalDistribution(action_space.n)
+        return MaskableCategoricalDistribution(int(action_space.n))
    elif isinstance(action_space, spaces.MultiDiscrete):
-        return MaskableMultiCategoricalDistribution(action_space.nvec)
+        return MaskableMultiCategoricalDistribution(list(action_space.nvec))
    elif isinstance(action_space, spaces.MultiBinary):
        assert isinstance(
            action_space.n, int
        ), f"Multi-dimensional MultiBinary({action_space.n}) action space is not supported. You can flatten it instead."
        return MaskableBernoulliDistribution(action_space.n)
    else:
        raise NotImplementedError(
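Reviewer note: `MaskableMultiCategoricalDistribution.apply_masking` reshapes the mask to align with the logits and then splits it column-wise, one chunk per discrete sub-action. The following is a pure-Python stand-in for that split on a single sample, not the torch code from the diff (which uses `th.as_tensor`, `view` and `th.split` on a batch). The `split_masks` name is hypothetical.

```python
from typing import List, Optional, Sequence


def split_masks(
    flat_mask: Optional[Sequence[bool]], action_dims: Sequence[int]
) -> List[Optional[List[bool]]]:
    """Split one flat mask into one chunk per sub-action (single sample).

    With `None`, every sub-distribution receives `None`, meaning no masking.
    """
    if flat_mask is None:
        return [None] * len(action_dims)
    if len(flat_mask) != sum(action_dims):
        raise ValueError("mask length must equal sum(action_dims)")
    chunks: List[Optional[List[bool]]] = []
    start = 0
    for dim in action_dims:
        # Consecutive slices of width `dim`, in sub-action order
        chunks.append(list(flat_mask[start : start + dim]))
        start += dim
    return chunks


# Two sub-actions with 2 and 3 choices respectively
print(split_masks([True, False, True, True, False], [2, 3]))
```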
--- a/sb3_contrib/common/maskable/policies.py
+++ b/sb3_contrib/common/maskable/policies.py
@ -13,7 +13,7 @@ from stable_baselines3.common.torch_layers import (
    MlpExtractor,
    NatureCNN,
 )
-from stable_baselines3.common.type_aliases import Schedule
+from stable_baselines3.common.type_aliases import PyTorchObs, Schedule
 from torch import nn
 from sb3_contrib.common.maskable.distributions import MaskableDistribution, make_masked_proba_distribution
@ -141,8 +141,8 @@ class MaskableActorCriticPolicy(BasePolicy):
        log_prob = distribution.log_prob(actions)
        return actions, values, log_prob
-    def extract_features(
+    def extract_features(  # type: ignore[override]
-        self, obs: th.Tensor, features_extractor: Optional[BaseFeaturesExtractor] = None
+        self, obs: PyTorchObs, features_extractor: Optional[BaseFeaturesExtractor] = None
    ) -> Union[th.Tensor, Tuple[th.Tensor, th.Tensor]]:
        """
        Preprocess the observation if needed and extract features.
@ -233,7 +233,11 @@ class MaskableActorCriticPolicy(BasePolicy):
                module.apply(partial(self.init_weights, gain=gain))
        # Setup optimizer with initial learning rate
-        self.optimizer = self.optimizer_class(self.parameters(), lr=lr_schedule(1), **self.optimizer_kwargs)
+        self.optimizer = self.optimizer_class(
            self.parameters(),
            lr=lr_schedule(1),  # type: ignore[call-arg]
            **self.optimizer_kwargs,
        )
    def _get_action_dist_from_latent(self, latent_pi: th.Tensor) -> MaskableDistribution:
        """
@ -245,9 +249,9 @@ class MaskableActorCriticPolicy(BasePolicy):
        action_logits = self.action_net(latent_pi)
        return self.action_dist.proba_distribution(action_logits=action_logits)
-    def _predict(
+    def _predict(  # type: ignore[override]
        self,
-        observation: th.Tensor,
+        observation: PyTorchObs,
        deterministic: bool = False,
        action_masks: Optional[np.ndarray] = None,
    ) -> th.Tensor:
@ -284,35 +288,45 @@ class MaskableActorCriticPolicy(BasePolicy):
        # Switch to eval mode (this affects batch norm / dropout)
        self.set_training_mode(False)
-        observation, vectorized_env = self.obs_to_tensor(observation)
+        # Check for common mistake that the user does not mix Gym/VecEnv API
        # Tuple obs are not supported by SB3, so we can safely do that check
        if isinstance(observation, tuple) and len(observation) == 2 and isinstance(observation[1], dict):
            raise ValueError(
                "You have passed a tuple to the predict() function instead of a Numpy array or a Dict. "
                "You are probably mixing Gym API with SB3 VecEnv API: `obs, info = env.reset()` (Gym) "
                "vs `obs = vec_env.reset()` (SB3 VecEnv). "
                "See related issue https://github.com/DLR-RM/stable-baselines3/issues/1694 "
                "and documentation for more information: https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecenv-api-vs-gym-api"
            )
        obs_tensor, vectorized_env = self.obs_to_tensor(observation)
        with th.no_grad():
-            actions = self._predict(observation, deterministic=deterministic, action_masks=action_masks)
+            actions = self._predict(obs_tensor, deterministic=deterministic, action_masks=action_masks)
            # Convert to numpy
            actions = actions.cpu().numpy()
        if isinstance(self.action_space, spaces.Box):
            if self.squash_output:
                # Rescale to proper domain when using squashing
-                actions = self.unscale_action(actions)
+                actions = self.unscale_action(actions)  # type: ignore[assignment, arg-type]
            else:
                # Actions could be on arbitrary scale, so clip the actions to avoid
                # out of bound error (e.g. if sampling from a Gaussian distribution)
-                actions = np.clip(actions, self.action_space.low, self.action_space.high)
+                actions = np.clip(actions, self.action_space.low, self.action_space.high)  # type: ignore[assignment, arg-type]
        if not vectorized_env:
-            if state is not None:
-                raise ValueError("Error: The environment must be vectorized when using recurrent policies.")
+            assert isinstance(actions, np.ndarray)
            actions = actions.squeeze(axis=0)
-        return actions, None
+        return actions, state  # type: ignore[return-value]
    def evaluate_actions(
        self,
        obs: th.Tensor,
        actions: th.Tensor,
-        action_masks: Optional[np.ndarray] = None,
+        action_masks: Optional[th.Tensor] = None,
-    ) -> Tuple[th.Tensor, th.Tensor, th.Tensor]:
+    ) -> Tuple[th.Tensor, th.Tensor, Optional[th.Tensor]]:
        """
        Evaluate actions according to the current policy,
        given the observations.
@ -337,7 +351,7 @@ class MaskableActorCriticPolicy(BasePolicy):
        values = self.value_net(latent_vf)
        return values, log_prob, distribution.entropy()
-    def get_distribution(self, obs: th.Tensor, action_masks: Optional[np.ndarray] = None) -> MaskableDistribution:
+    def get_distribution(self, obs: PyTorchObs, action_masks: Optional[np.ndarray] = None) -> MaskableDistribution:
        """
        Get the current policy distribution given the observations.
@ -352,7 +366,7 @@ class MaskableActorCriticPolicy(BasePolicy):
            distribution.apply_masking(action_masks)
        return distribution
-    def predict_values(self, obs: th.Tensor) -> th.Tensor:
+    def predict_values(self, obs: PyTorchObs) -> th.Tensor:
        """
        Get the estimated values according to the current policy given the observations.
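Reviewer note: the new guard in `predict()` catches the case where the return value of a Gym-style `env.reset()` (an `(obs, info)` tuple) is passed straight in. A minimal sketch of the same condition in isolation; the function name `looks_like_gym_reset_output` is hypothetical.

```python
from typing import Any


def looks_like_gym_reset_output(observation: Any) -> bool:
    """Return True for an `(obs, info)` pair as returned by a Gym-style reset().

    Same condition as the guard in the diff: a 2-tuple whose second
    element is a dict.
    """
    return (
        isinstance(observation, tuple)
        and len(observation) == 2
        and isinstance(observation[1], dict)
    )


print(looks_like_gym_reset_output(([0.0, 1.0], {})))  # Gym API: obs, info
print(looks_like_gym_reset_output([0.0, 1.0]))        # VecEnv API: obs only
```

The check is safe only because, as the diff comment says, tuple observations are not supported by SB3, so a 2-tuple ending in a dict cannot be a legitimate observation.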
--- a/sb3_contrib/common/recurrent/policies.py
+++ b/sb3_contrib/common/recurrent/policies.py
@ -170,7 +170,7 @@ class RecurrentActorCriticPolicy(ActorCriticPolicy):
        Do a forward pass in the LSTM network.
        :param features: Input tensor
-        :param lstm_states: previous cell and hidden states of the LSTM
+        :param lstm_states: previous hidden and cell states of the LSTM, respectively
        :param episode_starts: Indicates when a new episode starts,
            in that case, we need to reset LSTM states.
        :param lstm: LSTM object.
--- a/sb3_contrib/common/vec_env/async_eval.py
+++ b/sb3_contrib/common/vec_env/async_eval.py
@ -59,10 +59,11 @@ def _worker(
                remote.send(vec_env.seed(seed=data))
            elif cmd == "set_options":
                # Note: the options will only be effective at the next reset
-                remote.send(vec_env.set_options(data))
+                remote.send(vec_env.set_options(data))  # type: ignore[func-returns-value]
            elif cmd == "get_obs_rms":
                remote.send(obs_rms)
            elif cmd == "sync_obs_rms":
                assert vec_normalize is not None, "Tried to call `sync_obs_rms` when not using VecNormalize"
                vec_normalize.obs_rms = data
                obs_rms = data
            elif cmd == "close":
@ -130,7 +131,7 @@ class AsyncEval:
                n_eval_episodes,
            )
            # daemon=True: if the main process crashes, we should not cause things to hang
-            process = ctx.Process(target=_worker, args=args, daemon=True)  # pytype:disable=attribute-error
+            process = ctx.Process(target=_worker, args=args, daemon=True)  # type: ignore[attr-defined]
            process.start()
            self.processes.append(process)
            work_remote.close()
@ -157,6 +158,10 @@ class AsyncEval:
        :param seed: The seed for the pseudo-random generators.
        :return:
        """
        if seed is None:
            # Do nothing
            return []
        for idx, remote in enumerate(self.remotes):
            remote.send(("seed", seed + idx))
        return [remote.recv() for remote in self.remotes]
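Reviewer note: with the new early return, `seed(None)` does nothing and returns an empty list; otherwise each worker gets `seed + idx`. A minimal sketch of just the seed values involved. The `worker_seeds` helper is hypothetical, and the real method sends the values through pipes and collects the workers' replies rather than returning the seeds themselves.

```python
from typing import List, Optional


def worker_seeds(seed: Optional[int], n_workers: int) -> List[int]:
    """Return the seed each worker would be sent, or [] when seed is None."""
    if seed is None:
        # Do nothing, matching the new early return
        return []
    return [seed + idx for idx in range(n_workers)]


print(worker_seeds(42, 3))
print(worker_seeds(None, 3))
```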
--- a/sb3_contrib/ppo_mask/ppo_mask.py
+++ b/sb3_contrib/ppo_mask/ppo_mask.py
@ -1,18 +1,14 @@
 import sys
 import time
 from collections import deque
 from typing import Any, ClassVar, Dict, Optional, Tuple, Type, TypeVar, Union
 import numpy as np
 import torch as th
 from gymnasium import spaces
 from stable_baselines3.common import utils
 from stable_baselines3.common.buffers import RolloutBuffer
-from stable_baselines3.common.callbacks import BaseCallback, CallbackList, ConvertCallback, ProgressBarCallback
+from stable_baselines3.common.callbacks import BaseCallback
 from stable_baselines3.common.on_policy_algorithm import OnPolicyAlgorithm
 from stable_baselines3.common.policies import BasePolicy
 from stable_baselines3.common.type_aliases import GymEnv, MaybeCallback, Schedule
-from stable_baselines3.common.utils import explained_variance, get_schedule_fn, obs_as_tensor, safe_mean
+from stable_baselines3.common.utils import explained_variance, get_schedule_fn, obs_as_tensor
 from stable_baselines3.common.vec_env import VecEnv
 from torch.nn import functional as F
@ -74,6 +70,8 @@ class MaskablePPO(OnPolicyAlgorithm):
        "CnnPolicy": CnnPolicy,
        "MultiInputPolicy": MultiInputPolicy,
    }
    policy: MaskableActorCriticPolicy  # type: ignore[assignment]
    rollout_buffer: MaskableRolloutBuffer  # type: ignore[assignment]
    def __init__(
        self,
@ -91,6 +89,8 @@ class MaskablePPO(OnPolicyAlgorithm):
        ent_coef: float = 0.0,
        vf_coef: float = 0.5,
        max_grad_norm: float = 0.5,
        rollout_buffer_class: Optional[Type[RolloutBuffer]] = None,
        rollout_buffer_kwargs: Optional[Dict[str, Any]] = None,
        target_kl: Optional[float] = None,
        stats_window_size: int = 100,
        tensorboard_log: Optional[str] = None,
@ -101,7 +101,7 @@ class MaskablePPO(OnPolicyAlgorithm):
        _init_setup_model: bool = True,
    ):
        super().__init__(
-            policy,
+            policy,  # type: ignore[arg-type]
            env,
            learning_rate=learning_rate,
            n_steps=n_steps,
@ -112,6 +112,8 @@ class MaskablePPO(OnPolicyAlgorithm):
            max_grad_norm=max_grad_norm,
            use_sde=False,
            sde_sample_freq=-1,
            rollout_buffer_class=rollout_buffer_class,
            rollout_buffer_kwargs=rollout_buffer_kwargs,
            stats_window_size=stats_window_size,
            tensorboard_log=tensorboard_log,
            policy_kwargs=policy_kwargs,
@ -140,20 +142,24 @@ class MaskablePPO(OnPolicyAlgorithm):
        self._setup_lr_schedule()
        self.set_random_seed(self.seed)
-        buffer_cls = MaskableDictRolloutBuffer if isinstance(self.observation_space, spaces.Dict) else MaskableRolloutBuffer
-        self.policy = self.policy_class(
+        self.policy = self.policy_class(  # type: ignore[assignment]
            self.observation_space,
            self.action_space,
            self.lr_schedule,
-            **self.policy_kwargs,  # pytype:disable=not-instantiable
+            **self.policy_kwargs,
        )
        self.policy = self.policy.to(self.device)
        if not isinstance(self.policy, MaskableActorCriticPolicy):
            raise ValueError("Policy must subclass MaskableActorCriticPolicy")
-        self.rollout_buffer = buffer_cls(
+        if self.rollout_buffer_class is None:
            if isinstance(self.observation_space, spaces.Dict):
                self.rollout_buffer_class = MaskableDictRolloutBuffer
            else:
                self.rollout_buffer_class = MaskableRolloutBuffer
        self.rollout_buffer = self.rollout_buffer_class(  # type: ignore[assignment]
            self.n_steps,
            self.observation_space,
            self.action_space,
@ -161,6 +167,7 @@ class MaskablePPO(OnPolicyAlgorithm):
            gamma=self.gamma,
            gae_lambda=self.gae_lambda,
            n_envs=self.n_envs,
            **self.rollout_buffer_kwargs,
        )
        # Initialize schedules for policy/value clipping
@ -171,86 +178,6 @@ class MaskablePPO(OnPolicyAlgorithm):
            self.clip_range_vf = get_schedule_fn(self.clip_range_vf)
-    def _init_callback(
-        self,
-        callback: MaybeCallback,
-        use_masking: bool = True,
-        progress_bar: bool = False,
-    ) -> BaseCallback:
-        """
-        :param callback: Callback(s) called at every step with state of the algorithm.
-        :param use_masking: Whether or not to use invalid action masks during evaluation
-        :param progress_bar: Display a progress bar using tqdm and rich.
-        :return: A hybrid callback calling `callback` and performing evaluation.
-        """
-        # Convert a list of callbacks into a callback
-        if isinstance(callback, list):
-            callback = CallbackList(callback)
-        # Convert functional callback to object
-        if not isinstance(callback, BaseCallback):
-            callback = ConvertCallback(callback)
-        # Add progress bar callback
-        if progress_bar:
-            callback = CallbackList([callback, ProgressBarCallback()])
-        callback.init_callback(self)
-        return callback
-    def _setup_learn(
-        self,
-        total_timesteps: int,
-        callback: MaybeCallback = None,
-        reset_num_timesteps: bool = True,
-        tb_log_name: str = "run",
-        use_masking: bool = True,
-        progress_bar: bool = False,
-    ) -> Tuple[int, BaseCallback]:
-        """
-        Initialize different variables needed for training.
-        :param total_timesteps: The total number of samples (env steps) to train on
-        :param callback: Callback(s) called at every step with state of the algorithm.
-        :param reset_num_timesteps: Whether to reset or not the ``num_timesteps`` attribute
-        :param tb_log_name: the name of the run for tensorboard log
-        :param use_masking: Whether or not to use invalid action masks during training
-        :param progress_bar: Display a progress bar using tqdm and rich.
-        :return:
-        """
-        self.start_time = time.time_ns()
-        if self.ep_info_buffer is None or reset_num_timesteps:
-            # Initialize buffers if they don't exist, or reinitialize if resetting counters
-            self.ep_info_buffer = deque(maxlen=self._stats_window_size)
-            self.ep_success_buffer = deque(maxlen=self._stats_window_size)
-        if reset_num_timesteps:
-            self.num_timesteps = 0
-            self._episode_num = 0
-        else:
-            # Make sure training timesteps are ahead of the internal counter
-            total_timesteps += self.num_timesteps
-        self._total_timesteps = total_timesteps
-        self._num_timesteps_at_start = self.num_timesteps
-        # Avoid resetting the environment when calling ``.learn()`` consecutive times
-        if reset_num_timesteps or self._last_obs is None:
-            self._last_obs = self.env.reset()
-            self._last_episode_starts = np.ones((self.env.num_envs,), dtype=bool)
-            # Retrieve unnormalized observation for saving into the buffer
-            if self._vec_normalize_env is not None:
-                self._last_original_obs = self._vec_normalize_env.get_original_obs()
-        # Configure logger's outputs if no logger was passed
-        if not self._custom_logger:
-            self._logger = utils.configure_logger(self.verbose, self.tensorboard_log, tb_log_name, reset_num_timesteps)
-        # Create eval callback if needed
-        callback = self._init_callback(callback, use_masking, progress_bar)
-        return total_timesteps, callback
    def collect_rollouts(
        self,
        env: VecEnv,
@ -312,7 +239,7 @@ class MaskablePPO(OnPolicyAlgorithm):
            if not callback.on_step():
                return False
-            self._update_info_buffer(infos)
+            self._update_info_buffer(infos, dones)
            n_steps += 1
            if isinstance(self.action_space, spaces.Discrete):
@ -341,14 +268,14 @@ class MaskablePPO(OnPolicyAlgorithm):
                log_probs,
                action_masks=action_masks,
            )
-            self._last_obs = new_obs
+            self._last_obs = new_obs  # type: ignore[assignment]
            self._last_episode_starts = dones
        with th.no_grad():
            # Compute value for the last timestep
            # Masking is not needed here, the choice of action doesn't matter.
            # We only want the value of the current observation.
-            values = self.policy.predict_values(obs_as_tensor(new_obs, self.device))
+            values = self.policy.predict_values(obs_as_tensor(new_obs, self.device))  # type: ignore[arg-type]
        rollout_buffer.compute_returns_and_advantage(last_values=values, dones=dones)
@ -356,9 +283,9 @@ class MaskablePPO(OnPolicyAlgorithm):
        return True
-    def predict(
+    def predict(  # type: ignore[override]
        self,
-        observation: np.ndarray,
+        observation: Union[np.ndarray, Dict[str, np.ndarray]],
        state: Optional[Tuple[np.ndarray, ...]] = None,
        episode_start: Optional[np.ndarray] = None,
        deterministic: bool = False,
@ -388,10 +315,10 @@ class MaskablePPO(OnPolicyAlgorithm):
        # Update optimizer learning rate
        self._update_learning_rate(self.policy.optimizer)
        # Compute current clip range
-        clip_range = self.clip_range(self._current_progress_remaining)
+        clip_range = self.clip_range(self._current_progress_remaining)  # type: ignore[operator]
        # Optional: clip range for the value function
        if self.clip_range_vf is not None:
-            clip_range_vf = self.clip_range_vf(self._current_progress_remaining)
+            clip_range_vf = self.clip_range_vf(self._current_progress_remaining)  # type: ignore[operator]
        entropy_losses = []
        pg_losses, value_losses = [], []
@ -499,7 +426,7 @@ class MaskablePPO(OnPolicyAlgorithm):
        if self.clip_range_vf is not None:
            self.logger.record("train/clip_range_vf", clip_range_vf)
-    def learn(
+    def learn(  # type: ignore[override]
        self: SelfMaskablePPO,
        total_timesteps: int,
        callback: MaybeCallback = None,
@ -516,12 +443,13 @@ class MaskablePPO(OnPolicyAlgorithm):
            callback,
            reset_num_timesteps,
            tb_log_name,
            use_masking,
            progress_bar,
        )
        callback.on_training_start(locals(), globals())
        assert self.env is not None
        while self.num_timesteps < total_timesteps:
            continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, self.n_steps, use_masking)
@ -533,16 +461,7 @@ class MaskablePPO(OnPolicyAlgorithm):
            # Display training infos
            if log_interval is not None and iteration % log_interval == 0:
-                time_elapsed = max((time.time_ns() - self.start_time) / 1e9, sys.float_info.epsilon)
-                fps = int((self.num_timesteps - self._num_timesteps_at_start) / time_elapsed)
-                self.logger.record("time/iterations", iteration, exclude="tensorboard")
-                if len(self.ep_info_buffer) > 0 and len(self.ep_info_buffer[0]) > 0:
-                    self.logger.record("rollout/ep_rew_mean", safe_mean([ep_info["r"] for ep_info in self.ep_info_buffer]))
-                    self.logger.record("rollout/ep_len_mean", safe_mean([ep_info["l"] for ep_info in self.ep_info_buffer]))
-                self.logger.record("time/fps", fps)
-                self.logger.record("time/time_elapsed", int(time_elapsed), exclude="tensorboard")
-                self.logger.record("time/total_timesteps", self.num_timesteps, exclude="tensorboard")
-                self.logger.dump(step=self.num_timesteps)
+                self._dump_logs(iteration)
            self.train()
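Note on the buffer selection in this file: the `_setup_model` hunk above only falls back to a default buffer when `rollout_buffer_class` is `None`, and picks that default from the observation-space type. A minimal, library-free sketch of that selection pattern follows. `BoxSpace`, `DictSpace`, `FlatBuffer`, `DictBuffer` and `pick_buffer_class` are hypothetical stand-ins introduced for illustration, not SB3 or sb3_contrib names.

```python
from typing import Optional


class BoxSpace:
    """Hypothetical stand-in for a flat observation space."""


class DictSpace:
    """Hypothetical stand-in for a dictionary observation space."""


class FlatBuffer:
    """Hypothetical stand-in for MaskableRolloutBuffer."""


class DictBuffer:
    """Hypothetical stand-in for MaskableDictRolloutBuffer."""


def pick_buffer_class(observation_space: object, buffer_class: Optional[type] = None) -> type:
    """Return the user-supplied class if given, else a default based on the space type."""
    if buffer_class is not None:
        # An explicit choice is never overridden.
        return buffer_class
    if isinstance(observation_space, DictSpace):
        return DictBuffer
    return FlatBuffer
```

With this shape, a custom buffer can be swapped in without touching the default branch, which appears to be the point of the `rollout_buffer_class` / `rollout_buffer_kwargs` pair.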
--- a/sb3_contrib/ppo_recurrent/ppo_recurrent.py
+++ b/sb3_contrib/ppo_recurrent/ppo_recurrent.py
@ -1,5 +1,3 @@
-import sys
-import time
 from copy import deepcopy
 from typing import Any, ClassVar, Dict, Optional, Type, TypeVar, Union
@ -11,7 +9,7 @@ from stable_baselines3.common.callbacks import BaseCallback
 from stable_baselines3.common.on_policy_algorithm import OnPolicyAlgorithm
 from stable_baselines3.common.policies import BasePolicy
 from stable_baselines3.common.type_aliases import GymEnv, MaybeCallback, Schedule
-from stable_baselines3.common.utils import explained_variance, get_schedule_fn, obs_as_tensor, safe_mean
+from stable_baselines3.common.utils import explained_variance, get_schedule_fn, obs_as_tensor
 from stable_baselines3.common.vec_env import VecEnv
 from sb3_contrib.common.recurrent.buffers import RecurrentDictRolloutBuffer, RecurrentRolloutBuffer
@ -149,7 +147,7 @@ class RecurrentPPO(OnPolicyAlgorithm):
            self.action_space,
            self.lr_schedule,
            use_sde=self.use_sde,
-            **self.policy_kwargs,  # pytype:disable=not-instantiable
+            **self.policy_kwargs,
        )
        self.policy = self.policy.to(self.device)
@ -260,7 +258,7 @@ class RecurrentPPO(OnPolicyAlgorithm):
            if not callback.on_step():
                return False
-            self._update_info_buffer(infos)
+            self._update_info_buffer(infos, dones)
            n_steps += 1
            if isinstance(self.action_space, spaces.Discrete):
@ -453,42 +451,11 @@ class RecurrentPPO(OnPolicyAlgorithm):
        reset_num_timesteps: bool = True,
        progress_bar: bool = False,
    ) -> SelfRecurrentPPO:
-        iteration = 0
-
-        total_timesteps, callback = self._setup_learn(
-            total_timesteps,
-            callback,
-            reset_num_timesteps,
-            tb_log_name,
-            progress_bar,
-        )
-        callback.on_training_start(locals(), globals())
-        while self.num_timesteps < total_timesteps:
-            continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
-            if not continue_training:
-                break
-            iteration += 1
-            self._update_current_progress_remaining(self.num_timesteps, total_timesteps)
-            # Display training infos
-            if log_interval is not None and iteration % log_interval == 0:
-                time_elapsed = max((time.time_ns() - self.start_time) / 1e9, sys.float_info.epsilon)
-                fps = int((self.num_timesteps - self._num_timesteps_at_start) / time_elapsed)
-                self.logger.record("time/iterations", iteration, exclude="tensorboard")
-                if len(self.ep_info_buffer) > 0 and len(self.ep_info_buffer[0]) > 0:
-                    self.logger.record("rollout/ep_rew_mean", safe_mean([ep_info["r"] for ep_info in self.ep_info_buffer]))
-                    self.logger.record("rollout/ep_len_mean", safe_mean([ep_info["l"] for ep_info in self.ep_info_buffer]))
-                self.logger.record("time/fps", fps)
-                self.logger.record("time/time_elapsed", int(time_elapsed), exclude="tensorboard")
-                self.logger.record("time/total_timesteps", self.num_timesteps, exclude="tensorboard")
-                self.logger.dump(step=self.num_timesteps)
-            self.train()
-        callback.on_training_end()
-        return self
+        return super().learn(
+            total_timesteps=total_timesteps,
+            callback=callback,
+            log_interval=log_interval,
+            tb_log_name=tb_log_name,
+            reset_num_timesteps=reset_num_timesteps,
+            progress_bar=progress_bar,
+        )
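Note on `_update_info_buffer(infos, dones)`: both PPO variants now pass `dones` alongside `infos`, which is what lets a success flag be recorded only at episode end. The sketch below illustrates the idea without any SB3 import. It assumes the flag lives under an `is_success` key in each `info` dict; `update_success_buffer` is a hypothetical helper, not the library's method.

```python
from collections import deque
from typing import Any, Deque, Dict, Optional, Sequence


def update_success_buffer(
    success_buffer: Deque[bool],
    infos: Sequence[Dict[str, Any]],
    dones: Optional[Sequence[bool]] = None,
) -> None:
    """Append the success flag of every environment whose episode just ended."""
    if dones is None:
        # Without episode-end information nothing can be attributed to a finished episode.
        dones = [False] * len(infos)
    for info, done in zip(infos, dones):
        is_success = info.get("is_success")  # assumed key name
        if is_success is not None and done:
            success_buffer.append(bool(is_success))


# Three parallel environments: only the first one finished and reports a flag.
buffer: Deque[bool] = deque(maxlen=100)
infos = [{"is_success": True}, {"is_success": False}, {}]
update_success_buffer(buffer, infos, dones=[True, False, True])
success_rate = sum(buffer) / len(buffer)
```

Calling the helper without `dones` records nothing, which is consistent with the old single-argument call never feeding a success rate.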
--- a/sb3_contrib/qrdqn/qrdqn.py
+++ b/sb3_contrib/qrdqn/qrdqn.py
@ -20,7 +20,8 @@ class QRDQN(OffPolicyAlgorithm):
    """
    Quantile Regression Deep Q-Network (QR-DQN)
    Paper: https://arxiv.org/abs/1710.10044
-    Default hyperparameters are taken from the paper and are tuned for Atari games.
+    Default hyperparameters are taken from the paper and are tuned for Atari games
+    (except for the ``learning_starts`` parameter).
    :param policy: The policy model to use (MlpPolicy, CnnPolicy, ...)
    :param env: The environment to learn from (if registered in Gym, can be str)
@ -77,11 +78,11 @@ class QRDQN(OffPolicyAlgorithm):
        env: Union[GymEnv, str],
        learning_rate: Union[float, Schedule] = 5e-5,
        buffer_size: int = 1000000,  # 1e6
-        learning_starts: int = 50000,
+        learning_starts: int = 100,
        batch_size: int = 32,
        tau: float = 1.0,
        gamma: float = 0.99,
-        train_freq: int = 4,
+        train_freq: Union[int, Tuple[int, str]] = 4,
        gradient_steps: int = 1,
        replay_buffer_class: Optional[Type[ReplayBuffer]] = None,
        replay_buffer_kwargs: Optional[Dict[str, Any]] = None,
--- a/sb3_contrib/tqc/tqc.py
+++ b/sb3_contrib/tqc/tqc.py
@ -88,7 +88,7 @@ class TQC(OffPolicyAlgorithm):
        batch_size: int = 256,
        tau: float = 0.005,
        gamma: float = 0.99,
-        train_freq: int = 1,
+        train_freq: Union[int, Tuple[int, str]] = 1,
        gradient_steps: int = 1,
        action_noise: Optional[ActionNoise] = None,
        replay_buffer_class: Optional[Type[ReplayBuffer]] = None,
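Note on the `train_freq` annotation: `Union[int, Tuple[int, str]]` means the argument can be a bare step count or a `(frequency, unit)` pair. A minimal sketch of how such a value could be normalized is shown below, assuming the two units are `"step"` and `"episode"`. `FreqUnit`, `Freq` and `normalize_train_freq` are hypothetical names for illustration, not the SB3 implementation.

```python
from enum import Enum
from typing import NamedTuple, Tuple, Union


class FreqUnit(Enum):
    STEP = "step"
    EPISODE = "episode"


class Freq(NamedTuple):
    frequency: int
    unit: FreqUnit


def normalize_train_freq(train_freq: Union[int, Tuple[int, str]]) -> Freq:
    """Turn either accepted form into an explicit (frequency, unit) pair."""
    if isinstance(train_freq, int):
        # A bare integer is interpreted as a number of environment steps.
        train_freq = (train_freq, "step")
    frequency, unit = train_freq
    # FreqUnit(unit) raises ValueError for anything other than "step" or "episode".
    return Freq(int(frequency), FreqUnit(unit))
```

Under this reading, `train_freq=4` and `train_freq=(4, "step")` are equivalent, while `(1, "episode")` would update once per finished episode.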
--- a/sb3_contrib/version.txt
+++ b/sb3_contrib/version.txt
@ -1 +1 @@
-2.2.0
+2.3.0
--- a/setup.py
+++ b/setup.py
@ -65,7 +65,7 @@ setup(
    packages=[package for package in find_packages() if package.startswith("sb3_contrib")],
    package_data={"sb3_contrib": ["py.typed", "version.txt"]},
    install_requires=[
-        "stable_baselines3>=2.2.0,<3.0",
+        "stable_baselines3>=2.3.0,<3.0",
    ],
    description="Contrib package of Stable Baselines3, experimental code.",
    author="Antonin Raffin",
Author	SHA1	Message	Date
Antonin RAFFIN	adfbeb1b01	Fix typo in changelog (#237 )	2024-05-06 14:20:28 +01:00
Antonin RAFFIN	17cd797d5c	Release v2.3.0 (#236 )	2024-05-06 14:20:28 +01:00
Antonin RAFFIN	34eceaf382	Log success rate for PPO variants (#235 )	2024-05-06 14:20:28 +01:00
Antonin RAFFIN	89d0113037	Update ruff and SB3 dependencies (#232 )	2024-05-06 14:20:28 +01:00
Antonin RAFFIN	7dd6c39fba	Fix MaskablePPO type annotations (#233 ) * Fix `sb3_contrib/common/maskable/callbacks.py` type annotations * Fix `sb3_contrib/common/vec_env/async_eval.py` type annotations * Fix `sb3_contrib/common/maskable/buffers.py` type hints * Fix `sb3_contrib/common/maskable/distributions.py` type hints * Fix `sb3_contrib/common/maskable/policies.py` type hints * Fix `sb3_contrib/ppo_mask/ppo_mask.py` type hints * Update changelog and fix type hints	2024-05-06 14:20:28 +01:00
Armand du Parc Locmaria	cd31e89e26	Fix `train_freq` type annotation for TQC and QR-DQN (#229 ) * fix train_freq type for tqc and qrdn * fix typo * Update changelog --------- Co-authored-by: Antonin Raffin <antonin.raffin@dlr.de>	2024-05-06 14:20:28 +01:00
Tjeerd Bakker	bc3c0a9595	Add notes about MaskablePPO (#227 ) * Add note about MaskableEvalCallback * Update doc --------- Co-authored-by: Antonin Raffin <antonin.raffin@dlr.de>	2024-05-06 14:20:28 +01:00
Antonin RAFFIN	3f0c5088b3	Update QRDQN defaults (#225 )	2024-05-06 14:20:28 +01:00
Rogério Júnior	1553b66ee4	Update `_process_sequence()` docstring (#219 ) Small modification in the documentation of `_process_sequence` to reflect the correct order (`hidden_state`, `cell_state`) of `lstm_states`. See order here: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html	2024-05-06 14:20:28 +01:00
Antonin RAFFIN	94a5daab02	Update SB3 version (#217 )	2024-05-06 14:20:28 +01:00