NevarokML: TD3 Algorithm

The NevarokML plugin integrates the Twin Delayed DDPG (TD3) algorithm into the Unreal Engine environment. TD3 is an off-policy actor-critic algorithm that addresses function approximation errors in actor-critic methods. It builds on Deep Deterministic Policy Gradient (DDPG), adding clipped double-Q learning, delayed policy updates, and target policy smoothing to improve stability and performance.


TD3 Algorithm Overview

The TD3 algorithm, as implemented in NevarokML, uses a twin-critic architecture and delayed policy updates to improve the learning process. It maintains two Q-value networks and bootstraps from the minimum of their target estimates to reduce overestimation bias (see the sketch after the parameter list below). Here are the key features and parameters of the TD3 algorithm used in NevarokML:

  • owner: The owner of the algorithm object, usually the UObject that creates the algorithm.
  • policy (Policy): The policy model to use, such as MlpPolicy, CnnPolicy, etc. It determines the architecture and learning capabilities of the agent's policy network.
  • learningRate (Learning Rate): Set the learning rate for the Adam optimizer. The same learning rate is used for all networks (the actor and both Q-value critics).
  • bufferSize (Replay Buffer Size): Specify the size of the replay buffer, which stores the agent's experiences for training.
  • learningStarts (Learning Starts): Determine how many environment steps to collect transitions for before learning starts. It ensures that enough experience is gathered before training begins.
  • batchSize (Batch Size): Set the minibatch size for each gradient update. It controls the number of experiences sampled from the replay buffer for each training iteration.
  • tau (Soft Update Coefficient): Set the coefficient for the soft (Polyak) update of the target networks. It determines the interpolation weight between the current and target networks during each update (illustrated in the sketch after this list).
  • gamma (Discount Factor): Set the discount factor that determines the weight of future rewards compared to immediate rewards. It influences the agent's preference for short-term or long-term rewards.
  • trainFreq (Train Frequency): Update the model every trainFreq steps.
  • gradientSteps (Gradient Steps): Specify how many gradient steps to perform after each rollout. Set to -1 to perform as many gradient steps as steps done in the environment during the rollout.
  • optimizeMemoryUsage (Optimize Memory Usage): Enable a memory-efficient variant of the replay buffer at the cost of increased complexity. See the stable-baselines3 documentation for more details.
  • policyDelay (Policy Delay): Set the number of training steps between policy updates. The policy and target networks are only updated once every policyDelay training steps, while the Q-networks are updated at every step.
  • targetPolicyNoise (Target Policy Noise): Set the standard deviation of Gaussian noise added to the target policy (smoothing noise).
  • targetNoiseClip (Target Noise Clip): Set the limit for the absolute value of the target policy smoothing noise.
  • verbose (Verbosity Level): Set the verbosity level for the algorithm's output. Use 0 for no output, 1 for info messages, and 2 for debug messages.
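
To make these parameters more concrete, below is a minimal sketch in plain C++ (not NevarokML or stable-baselines3 source code) of how tau, targetPolicyNoise, targetNoiseClip, and the twin critics are typically used inside a TD3 update. All function and variable names here are illustrative assumptions.

#include <algorithm>
#include <random>

// Soft (Polyak) update controlled by tau:
// targetParam <- tau * onlineParam + (1 - tau) * targetParam
float SoftUpdate(const float onlineParam, const float targetParam, const float tau)
{
    return tau * onlineParam + (1.0f - tau) * targetParam;
}

// Target policy smoothing: clipped Gaussian noise (targetPolicyNoise, targetNoiseClip)
// is added to the target action before the target critics are evaluated.
// Actions are assumed to be normalized to [-1, 1].
float SmoothTargetAction(const float targetAction, const float noiseStd,
                         const float noiseClip, std::mt19937& rng)
{
    std::normal_distribution<float> noiseDist(0.0f, noiseStd);
    const float noise = std::clamp(noiseDist(rng), -noiseClip, noiseClip);
    return std::clamp(targetAction + noise, -1.0f, 1.0f);
}

// Clipped double-Q target: bootstrapping from the minimum of the two target
// critics' estimates reduces overestimation bias.
float ComputeTD3Target(const float reward, const bool done, const float gamma,
                       const float q1TargetValue, const float q2TargetValue)
{
    return reward + (done ? 0.0f : gamma * std::min(q1TargetValue, q2TargetValue));
}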

API

Here is the API for the TD3 algorithm in NevarokML, along with the corresponding default parameter settings:

#include "Models/NevarokMLBaseAlgorithm.h"

UFUNCTION(BlueprintPure, Category = "NevarokML|BaseAlgorithm")
static UNevarokMLBaseAlgorithm* TD3(UObject* owner,
                                    const ENevarokMLPolicy policy = ENevarokMLPolicy::MLP_POLICY,
                                    const float learningRate = 1e-3,
                                    const int bufferSize = 1000000,
                                    const int learningStarts = 100,
                                    const int batchSize = 100,
                                    const float tau = 0.005,
                                    const float gamma = 0.99,
                                    const int trainFreq = 1,
                                    const int gradientSteps = -1,
                                    const bool optimizeMemoryUsage = false,
                                    const int policyDelay = 2,
                                    const float targetPolicyNoise = 0.2,
                                    const float targetNoiseClip = 0.5,
                                    const int verbose = 1);

By setting the appropriate parameter values, you can customize the behavior of the TD3 algorithm to suit your specific reinforcement learning problem.
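
As an example, a C++ call might look like the sketch below. It assumes TD3 is exposed as a static function of UNevarokMLBaseAlgorithm (as suggested by the declaration above) and that the calling object is a valid UObject; adjust the values and wiring to your own project.

#include "Models/NevarokMLBaseAlgorithm.h"

// Hypothetical usage sketch: create a TD3 algorithm with a larger minibatch and a
// longer warm-up phase than the defaults. 'this' is assumed to be the UObject
// (e.g. the actor or component) that owns the training setup.
UNevarokMLBaseAlgorithm* Algorithm = UNevarokMLBaseAlgorithm::TD3(
    this,                            // owner
    ENevarokMLPolicy::MLP_POLICY,    // policy
    1e-3,                            // learningRate
    1000000,                         // bufferSize
    10000,                           // learningStarts: warm up for 10k steps before training
    256,                             // batchSize
    0.005,                           // tau
    0.99,                            // gamma
    1,                               // trainFreq
    -1,                              // gradientSteps: match the steps done during the rollout
    false,                           // optimizeMemoryUsage
    2,                               // policyDelay
    0.2,                             // targetPolicyNoise
    0.5,                             // targetNoiseClip
    1);                              // verbose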

For more details on the Stable Baselines TD3 algorithm, please refer to the TD3 paper, the stable-baselines3 documentation page, and the OpenAI Spinning Up introduction to TD3.