NevarokML: SAC Algorithm

The NevarokML plugin integrates the Soft Actor-Critic (SAC) algorithm into the Unreal Engine environment. SAC is an off-policy, maximum-entropy deep reinforcement learning algorithm that combines an actor-critic architecture with entropy regularization to balance exploration and exploitation during learning. It is based on the original SAC paper and subsequent improvements.
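
Concretely, SAC augments the standard expected-return objective with the entropy of the policy. In generic maximum-entropy RL notation (this is standard SAC notation, not a NevarokML-specific formula):

    J(π) = Σ_t E_{(s_t, a_t) ~ ρ_π} [ r(s_t, a_t) + α · H(π(· | s_t)) ]

Here α is the entropy (temperature) coefficient, controlled by the entCoef / entCoefAuto parameters described below, and H(π(· | s_t)) is the entropy of the policy π in state s_t.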


SAC Algorithm Overview

The SAC algorithm, as implemented in NevarokML, uses a stochastic actor-critic architecture with double Q-targets (two Q-networks, taking the minimum of their target values) to estimate the action-value function. It incorporates entropy regularization to encourage exploration by maximizing the entropy of the policy distribution alongside the expected return. Here are the key features and parameters of the SAC algorithm used in NevarokML:

  • owner (Owner): The owner of the algorithm object, usually the UObject that creates it.
  • policy (Policy): The policy model to use, such as MlpPolicy, CnnPolicy, etc. It determines the architecture and learning capabilities of the agent's policy network.
  • learningRate (Learning Rate): Set the learning rate for the Adam optimizer. The same learning rate will be used for all networks (Q-Values, Actor, and Value function).
  • bufferSize (Replay Buffer Size): Specify the size of the replay buffer, which stores the agent's experiences for training.
  • learningStarts (Learning Starts): Set how many environment steps to collect transitions for before learning starts. It ensures that enough experiences are collected before training begins.
  • batchSize (Batch Size): Set the minibatch size for each gradient update. It controls the number of experiences sampled from the replay buffer for each training iteration.
  • tau (Soft Update Coefficient): Set the coefficient for the soft update of the target networks. It determines the interpolation weight between the current and target networks during the update.
  • gamma (Discount Factor): Set the discount factor that determines the weight of future rewards compared to immediate rewards. It influences the agent's preference for short-term or long-term rewards.
  • trainFreq (Train Frequency): Update the model every trainFreq steps.
  • gradientSteps (Gradient Steps): Specify how many gradient steps to perform after each rollout. Set to -1 to perform as many gradient steps as steps done in the environment during the rollout.
  • optimizeMemoryUsage (Optimize Memory Usage): Enable a memory-efficient variant of the replay buffer at the cost of increased complexity. See the stable-baselines3 documentation for more details.
  • entCoefAuto (Automatic Entropy Coefficient): Set to 'true' to learn the entropy regularization coefficient automatically ('auto').
  • entCoef (Entropy Coefficient): Set the entropy regularization coefficient, which controls the trade-off between exploration and exploitation. A value less than or equal to '0.0' sets the coefficient to 'auto' (learned automatically); if 'entCoefAuto' is 'true', a value greater than '0.0' sets it to 'auto_{entCoef}', using the value as the initial coefficient (see the sketch after this list).
  • targetUpdateInterval (Target Update Interval): Specify the interval to update the target network. It determines how often the target network is synchronized with the online network.
  • targetEntropyAuto (Automatic Target Entropy): Set to 'true' to determine the target entropy automatically ('auto') when learning the entropy coefficient.
  • targetEntropy (Target Entropy): Set the target entropy used when learning the entropy coefficient. Ignored if 'targetEntropyAuto' is set to 'true'. A value less than or equal to '0.0' sets 'targetEntropy' to 'auto'.
  • useSde (Use Generalized State Dependent Exploration): Enable the use of generalized State Dependent Exploration (gSDE) instead of action noise exploration.
  • sdeSampleFreq (SDE Sample Frequency): Set the frequency to sample a new noise matrix when using gSDE. Set to -1 to sample only at the beginning of the rollout.
  • useSdeAtWarmup (Use SDE at Warmup): Specify whether to use gSDE instead of uniform sampling during the warm-up phase before learning starts.
  • verbose (Verbose Level): Control the verbosity level of the training process. Set it to 0 for no output, 1 for info messages, and 2 for debug messages.
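
The interaction between 'entCoefAuto'/'entCoef' and 'targetEntropyAuto'/'targetEntropy' can be hard to parse, so the sketch below shows how these pairs plausibly map onto the 'ent_coef' and 'target_entropy' settings accepted by stable-baselines3 ('auto', 'auto_<init>', or a fixed value). This is an illustrative assumption based on the parameter descriptions above, not NevarokML source code, and the helper names are hypothetical.

#include "CoreMinimal.h"

// Hypothetical helpers illustrating the entropy parameter mapping described above.
// stable-baselines3 accepts ent_coef as 'auto', 'auto_<init>', or a fixed float,
// and target_entropy as 'auto' or a float.
static FString ResolveEntCoef(const bool entCoefAuto, const float entCoef)
{
    // A non-positive value always requests automatic tuning ('auto').
    if (entCoef <= 0.0f)
    {
        return TEXT("auto");
    }
    // A positive value seeds automatic tuning ('auto_<entCoef>') when
    // entCoefAuto is true; otherwise it is used as a fixed coefficient.
    return entCoefAuto
               ? FString(TEXT("auto_")) + FString::SanitizeFloat(entCoef)
               : FString::SanitizeFloat(entCoef);
}

static FString ResolveTargetEntropy(const bool targetEntropyAuto, const float targetEntropy)
{
    // 'auto' when requested explicitly or when the value is not positive;
    // otherwise the value is passed through as the target entropy.
    return (targetEntropyAuto || targetEntropy <= 0.0f)
               ? FString(TEXT("auto"))
               : FString::SanitizeFloat(targetEntropy);
}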

API

Here is the API for the SAC algorithm in NevarokML, along with the corresponding default parameter settings:

#include "Models/NevarokMLBaseAlgorithm.h"

UFUNCTION(BlueprintPure, Category = "NevarokML|BaseAlgorithm")
static UNevarokMLBaseAlgorithm* SAC(UObject* owner,
                                    const ENevarokMLPolicy policy = ENevarokMLPolicy::MLP_POLICY,
                                    const float learningRate = 3e-4,
                                    const int bufferSize = 1000000,
                                    const int learningStarts = 100,
                                    const int batchSize = 256,
                                    const float tau = 0.005,
                                    const float gamma = 0.99,
                                    const int trainFreq = 1,
                                    const int gradientSteps = 1,
                                    const bool optimizeMemoryUsage = false,
                                    const bool entCoefAuto = true,
                                    const float entCoef = 0.0,
                                    const int targetUpdateInterval = 1,
                                    const bool targetEntropyAuto = true,
                                    const float targetEntropy = 0.0,
                                    const bool useSde = false,
                                    const int sdeSampleFreq = -1,
                                    const bool useSdeAtWarmup = false,
                                    const int verbose = 1);

By setting the appropriate parameter values, you can customize the behavior of the SAC algorithm to suit your specific reinforcement learning problem.
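
As a usage sketch, the call below creates a SAC algorithm with a custom learning rate and replay buffer size while keeping the remaining defaults shown above. It assumes the factory function is declared on UNevarokMLBaseAlgorithm (as the included header suggests) and that a valid UObject owner (for example, an actor) is available; the wrapper function name is hypothetical.

#include "Models/NevarokMLBaseAlgorithm.h"

// Hypothetical wrapper: 'Owner' must be a valid UObject, e.g. the actor or
// component that drives training.
static UNevarokMLBaseAlgorithm* CreateCustomSAC(UObject* Owner)
{
    // Custom learning rate and replay buffer size; all remaining arguments
    // keep the defaults listed in the signature above.
    return UNevarokMLBaseAlgorithm::SAC(
        /* owner */        Owner,
        /* policy */       ENevarokMLPolicy::MLP_POLICY,
        /* learningRate */ 1e-4f,
        /* bufferSize */   500000);
}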

For more details on the stable-baselines3 SAC algorithm, please refer to the SAC paper, the stable-baselines3 documentation page, and the OpenAI Spinning Up introduction to SAC.