Module functional

The functional module implements various functions needed for reinforcement learning calculations.

Exposed functions:

Loss functions (functional.loss)

Collection of loss functions necessary for reinforcement learning objective calculations.

pytorch_seed_rl.functional.loss.entropy(logits: torch.Tensor) -> torch.Tensor

Return the entropy loss, i.e., the negative entropy of the policy.

This can be used to discourage an RL model from converging prematurely.

Parameters

logits (torch.Tensor) – Logits returned by the model's policy network.
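
As a minimal sketch of what a negative-entropy loss looks like for a softmax policy, consider the following. It is not the library's source: the name `entropy_loss_sketch` and the choice to sum over all entries are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def entropy_loss_sketch(logits: torch.Tensor) -> torch.Tensor:
    """Negative entropy of the softmax policy (hypothetical helper).

    Assumption: the result is summed over all leading dimensions.
    """
    policy = F.softmax(logits, dim=-1)
    log_policy = F.log_softmax(logits, dim=-1)
    # sum_a p(a) * log p(a) is the negative of the entropy.
    return torch.sum(policy * log_policy)


# A uniform policy has maximal entropy, so its loss is the most negative.
uniform_logits = torch.zeros(1, 4)
peaked_logits = torch.tensor([[10.0, 0.0, 0.0, 0.0]])

loss_uniform = entropy_loss_sketch(uniform_logits)
loss_peaked = entropy_loss_sketch(peaked_logits)
```

Minimizing this quantity pushes the policy toward higher entropy, which is why adding it to the objective counteracts premature convergence.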

pytorch_seed_rl.functional.loss.policy_gradient(logits: torch.Tensor, actions: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor

Compute the policy gradient loss.

Parameters
  • logits (torch.Tensor) – Logits returned by the model's policy network.

  • actions (torch.Tensor) – Actions that were selected based on logits.

  • advantages (torch.Tensor) – Advantages computed for the corresponding states.
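
The three parameters above can be combined as in the sketch below. This is a hedged illustration, not the library's source: the name `policy_gradient_loss_sketch`, the sum reduction, and the tensor shapes (`[T, B, A]` logits, `[T, B]` actions and advantages) are assumptions.

```python
import torch
import torch.nn.functional as F


def policy_gradient_loss_sketch(
    logits: torch.Tensor, actions: torch.Tensor, advantages: torch.Tensor
) -> torch.Tensor:
    """Advantage-weighted negative log-likelihood (hypothetical helper)."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Pick the log-probability of the action that was actually taken.
    action_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    # Advantages are treated as constants, so no gradient flows through them.
    return -torch.sum(action_log_probs * advantages.detach())


logits = torch.zeros(2, 1, 3, requires_grad=True)  # T=2, B=1, A=3
actions = torch.tensor([[0], [1]])
advantages = torch.tensor([[1.0], [2.0]])

loss = policy_gradient_loss_sketch(logits, actions, advantages)
loss.backward()  # gradients reach the logits only
```

Detaching the advantages keeps the value estimate out of the policy update, which is the usual design choice when the value function is trained with a separate loss term.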

Vtrace (functional.vtrace)

Functions to compute V-trace off-policy actor-critic targets.

class pytorch_seed_rl.functional.vtrace.VTraceFromLogitsReturns(vs, pg_advantages, log_rhos, behavior_action_log_probs, target_action_log_probs)

Bases: tuple

property behavior_action_log_probs

Alias for field number 3

property log_rhos

Alias for field number 2

property pg_advantages

Alias for field number 1

property target_action_log_probs

Alias for field number 4

property vs

Alias for field number 0
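
Because the class is a named tuple, its fields can be read by name, by index, or by unpacking. The sketch below rebuilds an equivalent tuple with `collections.namedtuple` purely for illustration. The name `VTraceFromLogitsReturnsSketch` and the placeholder float values are assumptions; the real fields hold tensors.

```python
from collections import namedtuple

# Illustrative stand-in with the same field order as documented above.
VTraceFromLogitsReturnsSketch = namedtuple(
    "VTraceFromLogitsReturnsSketch",
    [
        "vs",
        "pg_advantages",
        "log_rhos",
        "behavior_action_log_probs",
        "target_action_log_probs",
    ],
)

returns = VTraceFromLogitsReturnsSketch(
    vs=0.0,
    pg_advantages=1.0,
    log_rhos=2.0,
    behavior_action_log_probs=3.0,
    target_action_log_probs=4.0,
)

# Access by name, by position, or by unpacking the leading fields.
by_name = returns.pg_advantages
by_index = returns[1]
vs, pg_advantages, *rest = returns
```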

pytorch_seed_rl.functional.vtrace.from_logits(behavior_policy_logits: torch.Tensor, target_policy_logits: torch.Tensor, values: torch.Tensor, bootstrap_value: torch.Tensor, actions: torch.Tensor, discounts: torch.Tensor, rewards: torch.Tensor, clip_rho_threshold: float = 1.0, clip_pg_rho_threshold: float = 1.0) -> pytorch_seed_rl.functional.vtrace.VTraceFromLogitsReturns

V-trace for softmax policies.

Parameters
  • behavior_policy_logits (torch.Tensor) – The policy logits used for action sampling during interaction with the environment.

  • target_policy_logits (torch.Tensor) – The policy logits returned by the learning model.

  • values (torch.Tensor) – The values returned by the learning model.

  • bootstrap_value (torch.Tensor) – The value used for bootstrapping (usually the most recent value returned by the learning model).

  • actions (torch.Tensor) – The actions used during interaction with the environment.

  • discounts (torch.Tensor) – The per-step discount factors applied to future rewards.

  • rewards (torch.Tensor) – The original rewards.

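  • clip_rho_threshold (float) – Clipping threshold for the importance weights (rho) used in the V-trace value targets. Defaults to 1.0. See paper for details.

  • clip_pg_rho_threshold (float) – Clipping threshold for the importance weights (rho) used in the policy gradient advantages. Defaults to 1.0. See paper for details.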
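
To make the roles of these parameters concrete, here is a simplified sketch of the V-trace recursion for softmax policies. It is not the library's source. The names `action_log_probs` and `vtrace_from_logits_sketch` are hypothetical, the shapes are assumed to be `[T, B, A]` for logits, `[T, B]` for per-step tensors, and `[B]` for `bootstrap_value`, and `discounts` is treated as per-step discount factors. It returns only the value targets and policy gradient advantages.

```python
import torch
import torch.nn.functional as F


def action_log_probs(policy_logits, actions):
    """Log-probability of each taken action under a softmax policy."""
    log_probs = F.log_softmax(policy_logits, dim=-1)
    return log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)


@torch.no_grad()
def vtrace_from_logits_sketch(
    behavior_policy_logits, target_policy_logits, values, bootstrap_value,
    actions, discounts, rewards,
    clip_rho_threshold=1.0, clip_pg_rho_threshold=1.0,
):
    # Importance weights between target and behavior policy.
    log_rhos = (action_log_probs(target_policy_logits, actions)
                - action_log_probs(behavior_policy_logits, actions))
    rhos = torch.exp(log_rhos)
    clipped_rhos = torch.clamp(rhos, max=clip_rho_threshold)
    cs = torch.clamp(rhos, max=1.0)

    # Temporal-difference errors, weighted by the clipped importance weights.
    values_next = torch.cat([values[1:], bootstrap_value.unsqueeze(0)], dim=0)
    deltas = clipped_rhos * (rewards + discounts * values_next - values)

    # Backward recursion accumulating the corrections vs - V(x).
    acc = torch.zeros_like(bootstrap_value)
    corrections = []
    for t in reversed(range(values.shape[0])):
        acc = deltas[t] + discounts[t] * cs[t] * acc
        corrections.append(acc)
    corrections.reverse()
    vs = torch.stack(corrections) + values

    # Advantages for the policy gradient use a separately clipped weight.
    vs_next = torch.cat([vs[1:], bootstrap_value.unsqueeze(0)], dim=0)
    clipped_pg_rhos = torch.clamp(rhos, max=clip_pg_rho_threshold)
    pg_advantages = clipped_pg_rhos * (rewards + discounts * vs_next - values)
    return vs, pg_advantages


# On-policy check: identical logits give importance weights of exactly 1.
T, B, A = 2, 1, 3
logits = torch.zeros(T, B, A)
vs, pg_advantages = vtrace_from_logits_sketch(
    behavior_policy_logits=logits,
    target_policy_logits=logits,
    values=torch.zeros(T, B),
    bootstrap_value=torch.zeros(B),
    actions=torch.zeros(T, B, dtype=torch.long),
    discounts=torch.full((T, B), 0.5),
    rewards=torch.ones(T, B),
)
```

In the on-policy case the targets reduce to ordinary discounted n-step returns, which makes this a convenient sanity check before experimenting with the two clipping thresholds.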