stable-baselines3-contrib-sacd

Commit Graph

Author	SHA1	Message	Date
Quentin Gallouédec	dec7b5303a	Deprecate ``create_eval_env``, ``eval_env`` and ``eval_freq`` parameter (#105 ) * Deprecate ``eval_env``, ``eval_freq```and ``create_eval_env`` * Update changelog * Typo * Raise deprecation warining in _setup_learn * Upgrade to latest SB3 version and update changelog Co-authored-by: Antonin Raffin <antonin.raffin@ensta.org>	2022-10-10 17:12:40 +02:00
Antonin RAFFIN	75b2de1399	Recurrent PPO (#53 ) * Running (not working yet) version of recurrent PPO * Fixes for multi envs * Save WIP, rework the sampling * Add Box support * Fix sample order * Being cleanup, code is broken (again) * First working version (no shared lstm) * Start cleanup * Try rnn with value function * Re-enable batch size * Deactivate vf rnn * Allow any batch size * Add support for evaluation * Add CNN support * Fix start of sequence * Allow shared LSTM * Rename mask to episode_start * Fix type hint * Enable LSTM for critic * Clean code * Fix for CNN LSTM * Fix sampling with n_layers > 1 * Add std logger * Update wording * Rename and add dict obs support * Fixes for dict obs support * Do not run slow tests * Fix doc * Update recurrent PPO example * Update README * Use Pendulum-v1 for tests * Fix image env * Speedup LSTM forward pass (#63) * added more efficient lstm implementation * Rename and add comment Co-authored-by: Antonin Raffin <antonin.raffin@ensta.org> * Fixes * Remove OpenAI sampling and improve coverage * Sync with SB3 PPO * Pass state shape and allow lstm kwargs * Update tests * Add masking for padded sequences * Update default in perf test * Remove TODO, mask is now working * Add helper to remove duplicated code, remove hack for padding * Enable LSTM critic and raise threshold for cartpole with no vel * Fix tests * Update doc and tests * Doc fix * Fix for new Sphinx version * Fix doc note * Switch to batch first, no more additional swap * Add comments and mask entropy loss Co-authored-by: Neville Walo <43504521+Walon1998@users.noreply.github.com>	2022-05-30 04:31:12 +02:00
Sean Gillen	675304d8fa	Augmented Random Search (ARS) (#42 ) * first pass at ars, replicates initial results, still needs more testing, cleanup * add a few docs and tests, bugfixes for ARS * debug and comment * break out dump logs * rollback so there are now predict workers, some refactoring * remove callback from self, remove torch multiprocessing * add module docs * run formatter * fix load and rerun formatter * rename to less mathy variable names, rename _validate_hypers * refactor to use evaluatate_policy, linear policy no longer uses bias or squashing * move everything to torch, add support for discrete action spaces, bugfix for alive reward offset * added tests, passing all of them, add support for discrete action spaces * update documentation * allow for reward offset when there are multiple envs * update results again * Reformat * Ignore unused imports * Renaming + Cleanup * Experimental multiprocessing * Cleaner multiprocessing * Reformat * Fixes for callback * Fix combining stats * 2nd way * Make the implementation cpu only * Fixes + POC with mp module * POC Processes * Cleaner aync implementation * Remove unused arg * Add typing * Revert vec normalize offset hack * Add `squash_output` parameter * Add more tests * Add comments * Update doc * Add comments * Add more logging * Fix TRPO issue on GPU * Tmp fix for ARS tests on GPU * Additional tmp fixes for ARS * update docstrings + formatting, fix bad exceptioe string in ARSPolicy * Add comments and docstrings * Fix missing import * Fix type check * Add dosctrings * GPU support, first attempt * Fix test * Add missing docstring * Typos * Update defaults hyperparameters Co-authored-by: Antonin RAFFIN <antonin.raffin@ensta.org>	2022-01-18 13:57:27 +01:00
Cyprien	59be198da0	Add Trust Region Policy Optimization (TRPO) (#40 ) * Feat: adding TRPO algorithm (WIP) WIP - Trust Region Policy Algorithm Currently the Hessian vector product is not working (see inline comments for more detail) * Feat: adding TRPO algorithm (WIP) Adding no_grad block for the line search Additional assert in the conjugate solver to help debugging * Feat: adding TRPO algorithm (WIP) - Adding ActorCriticPolicy.get_distribution - Using the Distribution object to compute the KL divergence - Checking for objective improvement in the line search - Moving magic numbers to instance variables * Feat: adding TRPO algorithm (WIP) Improving numerical stability of the conjugate gradient algorithm Critic updates * Feat: adding TRPO algorithm (WIP) Changes around the alpha of the line search Adding TRPO to __init__ files * feat: TRPO - addressing PR comments - renaming cg_solver to conjugate_gradient_solver and renaming parameter Avp_fun to matrix_vector_dot_func + docstring - extra comments + better variable names in trpo.py - defining a method for the hessian vector product instead of an inline function - fix registering correct policies for TRPO and using correct policy base in constructor * refactor: TRPO - policier - refactoring sb3_contrib.common.policies to reuse as much code as possible from sb3 * feat: using updated ActorCriticPolicy from SB3 - get_distribution will be added directly to the SB3 version of ActorCriticPolicy, this commit reflects this * Bump version for `get_distribution` support * Add basic test * Reformat * [ci skip] Fix changelog * fix: setting train mode for trpo * fix: batch_size type hint in trpo.py * style: renaming variables + docstring in trpo.py * Rename + cleanup * Move grad computation to separate method * Remove grad norm clipping * Remove n epochs and add sub-sampling * Update defaults * Add Doc * Add more test and fixes for CNN * Update doc + add benchmark * Add tests + update doc * Fix doc * Improve names for conjugate gradient * Update comments * Update changelog Co-authored-by: Antonin Raffin <antonin.raffin@ensta.org>	2021-12-29 11:58:03 +01:00
kronion	ab24f8039f	PPO variant with invalid action masking (#25 ) * Add wrappers * Add maskable distributions * Add mypy configuration * Add maskable base datastructures * Add ppo_mask package * Fix circular dependency and remove test code that slipped in * Automatically mask vecenv if env is masked * Fix debugging change that slipped in * Workaround for subclassing RolloutBufferSamples * Duplicate lots of policy code in order to swap out the distributions used * Fix pytype error * Maintain py 3.6 compatibility * Fix isort lint errors * Use pyproject.toml to configure black line length * Blacken * Remove mypy.ini * Fully replace RolloutBufferSamples * Drop support for continuous distributions, remove SDE-related code * Eliminate MaskableAlgorithm and MaskableOnPolicyAlgorithm * Fix formatting * Override superclass methods as needed, fix circular import, improve naming * Fix codestyle * Eliminate VecActionMasker, replace with utils * Fix codestyle * Support masking for MultiDiscrete action spaces * Fix codestyle * Don't require the env to provide the mask already flattened * Consistent naming, prefer 'Maskable' to 'Masked' * Register policy * Link to abstract instead of pdf * Allow distribution masking to be unapplied + improved comments and docstrings * Don't use deprecated implicit optional typing * Check codestyle * Add docstring and remove misplaced TODO * Simplify env masking API, error if API unmet. Make use_masking a learn() kwarg * Fix codestyle * Update various internals to be consistent with latest SB3 * Simplify MaskableRolloutBuffer reset * Add docstring and type annotations * Ensure old probs aren't cached * Fix for new logger * Add test + fixes * Start doc * Fix type annotation * Remove abstract class + add test * Fix evaluation (add support for multi envs) * Handle merge conflicts in documentation * Bugfix: mask updates should apply to original logits, not the last masked output * Add test of distribution masking behavior * Reformat * Add MultiBinary support, remove unneeded distribution type checks * Remove unused import * Fix when using multiple envs * Remove addressed TODO * Upgrade for SB3 1.2.0 * Update docs with results + how to replicate * Add action masker tests, move wrapper tests * Move distributions, add more distribution tests * Add MaskablePPO tests, simplify and rename discrete test env * Address TODO * Add tests for MaskableMultiCategoricalDistribution, fix distributions * Add maskable identity envs for all supported action spaces, add tests, fix bug * Formatting fixes * Update doc env * Dict support not ready * Cleanup Co-authored-by: Antonin RAFFIN <antonin.raffin@ensta.org>	2021-09-23 14:50:10 +02:00
Antonin RAFFIN	3665695d1e	Dictionary Observations (#29 ) * Add TQC support for new HER version * Add dict obs support * Add support for dict obs	2021-05-11 13:24:31 +02:00
Toshiki Watanabe	b30397fff5	Add QR-DQN (#13 ) * Add QR-DQN(WIP) * Update docstring * Add quantile_huber_loss * Fix typo * Remove unnecessary lines * Update variable names and comments in quantile_huber_loss * Fix mutable arguments * Update variable names * Ignore import not used warnings * Fix default parameter of optimizer in QR-DQN * Update quantile_huber_loss to have more reasonable interface * update tests * Add assertion to quantile_huber_loss * Update variable names of quantile regression * Update comments * Reduce the number of quantiles during test * Update comment * Update quantile_huber_loss * Fix isort * Add document of QR-DQN without results * Update docs * Fix bugs * Update doc * Add comments about shape * Minor edits * Update comments * Add benchmark * Doc fixes * Update doc * Bug fix in saving/loading + update tests Co-authored-by: Antonin RAFFIN <antonin.raffin@ensta.org>	2020-12-21 11:17:48 +01:00
Antonin RAFFIN	6bafcf6e88	Add TimeFeatureWrapper (#7 ) * Add TimeFeatureWrapper * Update README * Address comments	2020-11-13 13:00:56 +02:00
Antonin RAFFIN	0d9f2e229e	Add TQC and base scripts	2020-09-25 12:47:45 +02:00

9 Commits