NevarokML: DDPG (TD3) Algorithm

The NevarokML plugin integrates the Deep Deterministic Policy Gradient (DDPG) algorithm into the Unreal Engine environment. In this implementation, DDPG is treated as a special case of its successor, Twin Delayed DDPG (TD3). DDPG is an off-policy actor-critic algorithm for continuous action spaces that combines the benefits of the deterministic policy gradient algorithm and Q-learning.
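As a brief, NevarokML-agnostic illustration of the underlying update rules (standard DDPG, not plugin-specific code), the critic $Q_\phi$ is regressed toward a one-step bootstrapped target computed with the target networks, and the actor $\mu_\theta$ follows the deterministic policy gradient:

$$y = r + \gamma\, Q_{\phi'}\!\left(s',\, \mu_{\theta'}(s')\right), \qquad L(\phi) = \mathbb{E}\!\left[\left(Q_\phi(s, a) - y\right)^2\right], \qquad \nabla_\theta J \approx \mathbb{E}\!\left[\left.\nabla_a Q_\phi(s, a)\right|_{a = \mu_\theta(s)} \nabla_\theta \mu_\theta(s)\right]$$

where the primed parameters denote the slowly updated target networks.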


DDPG Algorithm Overview

The DDPG algorithm, as implemented in NevarokML, consists of an actor-critic architecture where the actor is a deterministic policy and the critic estimates the Q-value function. It uses a replay buffer to store and sample experiences for training. Here are the key features and parameters of the DDPG algorithm used in NevarokML:

  • owner (Owner): The owner of the algorithm object, typically the UObject that creates it.
  • policy (Policy): The policy model to use, such as MlpPolicy, CnnPolicy, etc. It determines the architecture and learning capabilities of the agent's policy network.
  • learningRate (Learning Rate): Set the learning rate for the Adam optimizer. The same learning rate will be used for all networks (Q-Values, Actor, and Value function).
  • bufferSize (Replay Buffer Size): Specify the size of the replay buffer, which stores the agent's experiences for training.
  • learningStarts (Learning Starts): How many steps of the model to collect transitions for before learning starts. It ensures that enough experiences are gathered before training begins.
  • batchSize (Batch Size): Set the minibatch size for each gradient update. It controls the number of experiences sampled from the replay buffer for each training iteration.
  • tau (Soft Update Coefficient): Set the coefficient for the soft update of the target networks. It determines the interpolation weight between the current and target networks during each update (see the update rule after this list).
  • gamma (Discount Factor): Set the discount factor that determines the weight of future rewards compared to immediate rewards. It influences the agent's preference for short-term or long-term rewards.
  • trainFreq (Train Frequency): Update the model every trainFreq steps.
  • gradientSteps (Gradient Steps): Specify how many gradient steps to perform after each rollout. Set to -1 to perform as many gradient steps as steps done in the environment during the rollout.
  • optimizeMemoryUsage (Optimize Memory Usage): Enable a memory-efficient variant of the replay buffer at the cost of increased complexity. See the stable-baselines3 documentation for more details.
  • verbose (Verbose Level): Control the verbosity level of the training process. Set it to 0 for no output, 1 for info messages, and 2 for debug messages.
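The tau and gamma parameters correspond directly to standard DDPG quantities. As a sketch of how tau is used (standard Polyak averaging as in stable-baselines3, not NevarokML-specific code), after each gradient step every target network is nudged toward its online counterpart:

$$\theta' \leftarrow \tau\, \theta + (1 - \tau)\, \theta'$$

With the default tau = 0.005, the target networks change slowly, which stabilizes the bootstrapped critic targets discounted by gamma.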

API

Here is the API for the DDPG (TD3) algorithm in NevarokML, along with the corresponding default parameter settings:

#include "Models/NevarokMLBaseAlgorithm.h"

UFUNCTION(BlueprintPure, Category = "NevarokML|BaseAlgorithm")
static UNevarokMLBaseAlgorithm* DDPG(UObject* owner,
                                     const ENevarokMLPolicy policy = ENevarokMLPolicy::MLP_POLICY,
                                     const float learningRate = 1e-3,
                                     const int bufferSize = 1000000,
                                     const int learningStarts = 100,
                                     const int batchSize = 100,
                                     const float tau = 0.005,
                                     const float gamma = 0.99,
                                     const int trainFreq = 1,
                                     const int gradientSteps = -1,
                                     const bool optimizeMemoryUsage = false,
                                     const int verbose = 1)

By setting the appropriate parameter values, you can customize the behavior of the DDPG (TD3) algorithm to suit your specific reinforcement learning problem.
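For example, a call that mirrors the defaults but uses a larger minibatch might look like the following sketch. Only the DDPG(...) signature above comes from the plugin; the assumption that it is exposed as a static member of UNevarokMLBaseAlgorithm, and how the returned object is handed to a trainer, are illustrative and project-specific.

#include "Models/NevarokMLBaseAlgorithm.h"

// Sketch: build a DDPG (TD3) algorithm object with custom hyperparameters.
static UNevarokMLBaseAlgorithm* CreateTunedDDPG(UObject* Owner)
{
    return UNevarokMLBaseAlgorithm::DDPG(
        Owner,                             // owner: object creating the algorithm
        ENevarokMLPolicy::MLP_POLICY,      // policy: fully connected MLP policy
        1e-3f,                             // learningRate: used for all networks
        1000000,                           // bufferSize: replay buffer capacity
        100,                               // learningStarts: warm-up steps before training
        256,                               // batchSize: larger minibatch than the default 100
        0.005f,                            // tau: soft-update coefficient
        0.99f,                             // gamma: discount factor
        1,                                 // trainFreq: update the model every step
        -1,                                // gradientSteps: as many as env steps in the rollout
        false,                             // optimizeMemoryUsage: standard replay buffer
        1);                                // verbose: info messages
}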

For more details on the Stable Baselines DDPG algorithm, please refer to the Deterministic Policy Gradient paper, DDPG Paper, stable-baselines3 documentation page, and the introduction to DDPG guide.