Module functional
The functional module implements various functions needed for reinforcement learning calculations.
Exposed functions
Loss functions (functional.loss)
Collection of loss functions necessary for reinforcement learning objective calculations.
- pytorch_seed_rl.functional.loss.entropy(logits: torch.Tensor) → torch.Tensor
Return the entropy loss, i.e., the negative entropy of the policy.
This can be used to discourage an RL model from converging prematurely.
- Parameters
logits (torch.Tensor) – Logits returned by the model's policy network.
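As an illustration of what "negative entropy of the policy" means, here is a minimal sketch, not the library's own implementation. The function name `entropy_loss`, the softmax over the last dimension, and the sum reduction are all assumptions made for this example.

```python
import torch
import torch.nn.functional as F


def entropy_loss(logits: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: negative entropy of a categorical policy."""
    policy = F.softmax(logits, dim=-1)
    log_policy = F.log_softmax(logits, dim=-1)
    # Entropy is -sum(p * log p); the loss is its negative.
    # Summing over all entries is an assumption of this sketch.
    return torch.sum(policy * log_policy)


# A uniform policy has maximal entropy, hence the lowest (most negative) loss.
loss_uniform = entropy_loss(torch.zeros(1, 4))
# A sharply peaked policy has entropy near zero, hence a loss near zero.
loss_peaked = entropy_loss(torch.tensor([[10.0, 0.0, 0.0, 0.0]]))
```

Minimizing this term pushes the policy toward higher entropy, which is why it is typically added to the total loss with a small positive weight.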
- pytorch_seed_rl.functional.loss.policy_gradient(logits: torch.Tensor, actions: torch.Tensor, advantages: torch.Tensor) → torch.Tensor
Compute the policy gradient loss.
- Parameters
logits (torch.Tensor) – Logits returned by the model's policy network.
actions (torch.Tensor) – Actions that were selected from logits.
advantages (torch.Tensor) – Advantages computed for the corresponding states.
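A common way to form a policy gradient loss from these three inputs is to weight the negative log-probability of each selected action by its advantage. The sketch below shows that approach under stated assumptions: the name `policy_gradient_loss`, the `[T, B, A]` / `[T, B]` tensor shapes, the detached advantages, and the sum reduction are illustrative choices, not confirmed details of the library.

```python
import torch
import torch.nn.functional as F


def policy_gradient_loss(logits, actions, advantages):
    """Hypothetical sketch of an advantage-weighted policy gradient loss.

    Assumed shapes: logits [T, B, A], actions [T, B] (int64), advantages [T, B].
    """
    # Negative log-probability of each selected action.
    cross_entropy = F.nll_loss(
        F.log_softmax(logits.flatten(0, 1), dim=-1),
        actions.flatten(0, 1),
        reduction="none",
    ).view_as(advantages)
    # Advantages are treated as constants, so gradients flow only through logits.
    return torch.sum(cross_entropy * advantages.detach())


logits = torch.zeros(2, 1, 3, requires_grad=True)  # uniform policy over 3 actions
actions = torch.tensor([[0], [2]])
advantages = torch.tensor([[1.0], [2.0]])
loss = policy_gradient_loss(logits, actions, advantages)
loss.backward()
```

With uniform logits every selected action has log-probability `-log(3)`, so the loss here is `(1 + 2) * log(3)`.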
Vtrace (functional.vtrace)
Functions to compute V-trace off-policy actor-critic targets.
See also
“IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures” on arXiv by Espeholt, Soyer, Munos et al.
All exposed functions return a VTraceFromLogitsReturns.
- class pytorch_seed_rl.functional.vtrace.VTraceFromLogitsReturns(vs, pg_advantages, log_rhos, behavior_action_log_probs, target_action_log_probs)
Bases: tuple
- property behavior_action_log_probs
Alias for field number 3
- property log_rhos
Alias for field number 2
- property pg_advantages
Alias for field number 1
- property target_action_log_probs
Alias for field number 4
- property vs
Alias for field number 0
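The "alias for field number" entries describe a named tuple: each field can be read by name or by position. The snippet below rebuilds an equivalent type with `collections.namedtuple` purely for illustration; the float values are hypothetical placeholders standing in for tensors.

```python
import collections

# Stand-in with the same field order as documented above (illustration only).
VTraceFromLogitsReturns = collections.namedtuple(
    "VTraceFromLogitsReturns",
    [
        "vs",
        "pg_advantages",
        "log_rhos",
        "behavior_action_log_probs",
        "target_action_log_probs",
    ],
)

# Placeholder floats; the real fields would hold torch.Tensor objects.
result = VTraceFromLogitsReturns(
    vs=0.5,
    pg_advantages=0.1,
    log_rhos=0.0,
    behavior_action_log_probs=-1.2,
    target_action_log_probs=-1.3,
)

# Access by name, by index, or by unpacking.
vs_by_name = result.vs
vs_by_index = result[0]
vs, pg_advantages, *_ = result
```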
- pytorch_seed_rl.functional.vtrace.from_logits(behavior_policy_logits: torch.Tensor, target_policy_logits: torch.Tensor, values: torch.Tensor, bootstrap_value: torch.Tensor, actions: torch.Tensor, discounts: torch.Tensor, rewards: torch.Tensor, clip_rho_threshold: float = 1.0, clip_pg_rho_threshold: float = 1.0) → pytorch_seed_rl.functional.vtrace.VTraceFromLogitsReturns
V-trace for softmax policies.
- Parameters
behavior_policy_logits (torch.Tensor) – The policy logits used for action sampling during interaction with the environment.
target_policy_logits (torch.Tensor) – The policy logits returned by the learning model.
values (torch.Tensor) – The values returned by the learning model.
bootstrap_value (torch.Tensor) – The value used for bootstrapping (usually the most recent value returned by the learning model).
actions (torch.Tensor) – The actions used during interaction with the environment.
discounts (torch.Tensor) – The discount factors applied to the rewards at each step.
rewards (torch.Tensor) – The original rewards.
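clip_rho_threshold (float) – Clipping threshold for the importance weights (rho) used in the V-trace value targets. Defaults to 1.0. See paper for details.
clip_pg_rho_threshold (float) – Clipping threshold for the importance weights (rho) used in the policy gradient advantages. Defaults to 1.0. See paper for details.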
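To show how these parameters interact, here is a compact sketch of the V-trace recursion as described in the IMPALA paper. It is not the library's implementation: the names `action_log_probs` and `vtrace_sketch`, the `[T, B, A]` logits layout, and the three-element return value are assumptions of this example (the documented function additionally returns the behavior and target action log-probabilities).

```python
import torch
import torch.nn.functional as F


def action_log_probs(logits, actions):
    # Log-probability of each taken action; logits [T, B, A], actions [T, B].
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)


@torch.no_grad()
def vtrace_sketch(behavior_policy_logits, target_policy_logits, values,
                  bootstrap_value, actions, discounts, rewards,
                  clip_rho_threshold=1.0, clip_pg_rho_threshold=1.0):
    """Hypothetical sketch of V-trace targets for softmax policies."""
    # Importance weights between target and behavior policy, in log space.
    log_rhos = (action_log_probs(target_policy_logits, actions)
                - action_log_probs(behavior_policy_logits, actions))
    rhos = torch.exp(log_rhos)
    clipped_rhos = torch.clamp(rhos, max=clip_rho_threshold)
    cs = torch.clamp(rhos, max=1.0)

    # Temporal-difference errors, weighted by the clipped importance weights.
    values_t_plus_1 = torch.cat([values[1:], bootstrap_value.unsqueeze(0)], dim=0)
    deltas = clipped_rhos * (rewards + discounts * values_t_plus_1 - values)

    # Backward recursion: (vs - V)_t = delta_t + discount_t * c_t * (vs - V)_{t+1}.
    acc = torch.zeros_like(bootstrap_value)
    vs_minus_v = []
    for t in reversed(range(values.shape[0])):
        acc = deltas[t] + discounts[t] * cs[t] * acc
        vs_minus_v.append(acc)
    vs = torch.stack(vs_minus_v[::-1]) + values

    # Policy gradient advantages use a separately clipped importance weight.
    vs_t_plus_1 = torch.cat([vs[1:], bootstrap_value.unsqueeze(0)], dim=0)
    clipped_pg_rhos = torch.clamp(rhos, max=clip_pg_rho_threshold)
    pg_advantages = clipped_pg_rhos * (rewards + discounts * vs_t_plus_1 - values)
    return vs, pg_advantages, log_rhos


# On-policy toy case: identical logits give importance weights of exactly 1.
T, B, A = 3, 1, 2
logits = torch.zeros(T, B, A)
vs, pg_advantages, log_rhos = vtrace_sketch(
    behavior_policy_logits=logits,
    target_policy_logits=logits,
    values=torch.zeros(T, B),
    bootstrap_value=torch.zeros(B),
    actions=torch.zeros(T, B, dtype=torch.int64),
    discounts=torch.full((T, B), 0.5),
    rewards=torch.ones(T, B),
)
```

In this on-policy case with zero value estimates, `vs` reduces to the discounted return from each step (1.75, 1.5, 1.0 for the three steps), which is a useful sanity check for any V-trace implementation.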