* MaskablePPO docs
Added a warning about possible crashes caused by check_env in case of invalid actions.
* Reformat with black 23
* Rephrase note on action sampling
* Fix action noise
* Update changelog
---------
Co-authored-by: Antonin Raffin <antonin.raffin@ensta.org>
* `to(device)` to `device=device` and `float()` to `dtype=th.float32`
* Update changelog
* Fix type checking
Co-authored-by: Antonin RAFFIN <antonin.raffin@ensta.org>
* Modified sb3_contrib/common/maskable/policies.py
- Added support for non-shared features extractor in file sb3_contrib/common/maskable/policies.py
- updated changelog
* Modified sb3_contrib/common/recurrent/policies.py
* Modified sb3_contrib/qrdqn/policies.py and sb3_contrib/tqc/policies.py
* Updated test_cnn.py
* Upgrade SB3 version
* Revert changes in formatting
* Remove duplicate normalize_images
* Add test for image-like inputs
* Fixes and add more tests
* Update SB3 version
* Fix ARS warnings
Co-authored-by: Antonin Raffin <antonin.raffin@ensta.org>
* Update contribution.md
* New loop struct to make mypy happy
* Update setup.cfg
* Update changelog
* fix squash_output = False in ARS policy
* Add with_bias parameter to ARSPolicy
* Make ARSLinearPolicy a special case of ARSPolicy
* Remove ars_policy from mypy exclude
* Update changelog
* Update SB3 version
* Fix loading of ARS linear policy saved with sb3-contrib < 1.7.0
* Fix test
* Turn docstring into comment
Co-authored-by: Antonin Raffin <antonin.raffin@ensta.org>
Co-authored-by: Antonin Raffin <antonin.raffin@dlr.de>
* Default device for buffer is auto
* `device=auto` in ARS
* Undo ARS change
* Update changelog
* Update min SB3 version
Co-authored-by: Antonin RAFFIN <antonin.raffin@ensta.org>
* Running (not working yet) version of recurrent PPO
* Fixes for multi envs
* Save WIP, rework the sampling
* Add Box support
* Fix sample order
* Begin cleanup, code is broken (again)
* First working version (no shared lstm)
* Start cleanup
* Try rnn with value function
* Re-enable batch size
* Deactivate vf rnn
* Allow any batch size
* Add support for evaluation
* Add CNN support
* Fix start of sequence
* Allow shared LSTM
* Rename mask to episode_start
* Fix type hint
* Enable LSTM for critic
* Clean code
* Fix for CNN LSTM
* Fix sampling with n_layers > 1
* Add std logger
* Update wording
* Rename and add dict obs support
* Fixes for dict obs support
* Do not run slow tests
* Fix doc
* Update recurrent PPO example
* Update README
* Use Pendulum-v1 for tests
* Fix image env
* Speedup LSTM forward pass (#63)
* added more efficient lstm implementation
* Rename and add comment
Co-authored-by: Antonin Raffin <antonin.raffin@ensta.org>
* Fixes
* Remove OpenAI sampling and improve coverage
* Sync with SB3 PPO
* Pass state shape and allow lstm kwargs
* Update tests
* Add masking for padded sequences
* Update default in perf test
* Remove TODO, mask is now working
* Add helper to remove duplicated code, remove hack for padding
* Enable LSTM critic and raise threshold for cartpole with no vel
* Fix tests
* Update doc and tests
* Doc fix
* Fix for new Sphinx version
* Fix doc note
* Switch to batch first, no more additional swap
* Add comments and mask entropy loss
Co-authored-by: Neville Walo <43504521+Walon1998@users.noreply.github.com>
* Pendulum-v0 -> Pendulum-v1
* Reformat with black
* Update changelog
* Fix dtype bug in TimeFeatureWrapper
* Update version and remove forward calls
* Update CI
* Fix min version
Co-authored-by: Antonin Raffin <antonin.raffin@ensta.org>
* first pass at ars, replicates initial results, still needs more testing, cleanup
* add a few docs and tests, bugfixes for ARS
* debug and comment
* break out dump logs
* rollback so there are no predict workers, some refactoring
* remove callback from self, remove torch multiprocessing
* add module docs
* run formatter
* fix load and rerun formatter
* rename to less mathy variable names, rename _validate_hypers
* refactor to use evaluate_policy, linear policy no longer uses bias or squashing
* move everything to torch, add support for discrete action spaces, bugfix for alive reward offset
* added tests, passing all of them, add support for discrete action spaces
* update documentation
* allow for reward offset when there are multiple envs
* update results again
* Reformat
* Ignore unused imports
* Renaming + Cleanup
* Experimental multiprocessing
* Cleaner multiprocessing
* Reformat
* Fixes for callback
* Fix combining stats
* 2nd way
* Make the implementation cpu only
* Fixes + POC with mp module
* POC Processes
* Cleaner async implementation
* Remove unused arg
* Add typing
* Revert vec normalize offset hack
* Add `squash_output` parameter
* Add more tests
* Add comments
* Update doc
* Add comments
* Add more logging
* Fix TRPO issue on GPU
* Tmp fix for ARS tests on GPU
* Additional tmp fixes for ARS
* update docstrings + formatting, fix bad exception string in ARSPolicy
* Add comments and docstrings
* Fix missing import
* Fix type check
* Add docstrings
* GPU support, first attempt
* Fix test
* Add missing docstring
* Typos
* Update default hyperparameters
Co-authored-by: Antonin RAFFIN <antonin.raffin@ensta.org>
* Feat: adding TRPO algorithm (WIP)
WIP - Trust Region Policy Optimization
Currently the Hessian vector product is not working (see inline comments for more detail)
* Feat: adding TRPO algorithm (WIP)
Adding no_grad block for the line search
Additional assert in the conjugate solver to help debugging
* Feat: adding TRPO algorithm (WIP)
- Adding ActorCriticPolicy.get_distribution
- Using the Distribution object to compute the KL divergence
- Checking for objective improvement in the line search
- Moving magic numbers to instance variables
* Feat: adding TRPO algorithm (WIP)
Improving numerical stability of the conjugate gradient algorithm
Critic updates
* Feat: adding TRPO algorithm (WIP)
Changes around the alpha of the line search
Adding TRPO to __init__ files
* feat: TRPO - addressing PR comments
- renaming cg_solver to conjugate_gradient_solver and renaming parameter Avp_fun to matrix_vector_dot_func + docstring
- extra comments + better variable names in trpo.py
- defining a method for the hessian vector product instead of an inline function
- fix registering correct policies for TRPO and using correct policy base in constructor
* refactor: TRPO - policies
- refactoring sb3_contrib.common.policies to reuse as much code as possible from sb3
* feat: using updated ActorCriticPolicy from SB3
- get_distribution will be added directly to the SB3 version of ActorCriticPolicy, this commit reflects this
* Bump version for `get_distribution` support
* Add basic test
* Reformat
* [ci skip] Fix changelog
* fix: setting train mode for trpo
* fix: batch_size type hint in trpo.py
* style: renaming variables + docstring in trpo.py
* Rename + cleanup
* Move grad computation to separate method
* Remove grad norm clipping
* Remove n epochs and add sub-sampling
* Update defaults
* Add Doc
* Add more test and fixes for CNN
* Update doc + add benchmark
* Add tests + update doc
* Fix doc
* Improve names for conjugate gradient
* Update comments
* Update changelog
Co-authored-by: Antonin Raffin <antonin.raffin@ensta.org>