NevarokML: DQN Algorithm

The NevarokML plugin integrates the Deep Q-Network (DQN) algorithm into the Unreal Engine environment. DQN is an off-policy, value-based algorithm that combines Q-learning with deep neural networks to solve complex tasks. It is based on the original DQN paper and subsequent improvements.


DQN Algorithm Overview

The DQN algorithm, as implemented in NevarokML, uses a combination of a deep neural network and a replay buffer to approximate the optimal action-value function. It employs techniques such as experience replay and target networks to stabilize the learning process. Here are the key features and parameters of the DQN algorithm used in NevarokML:
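
Concretely, for a transition (s, a, r, s') sampled from the replay buffer, the online Q-network is trained toward the standard DQN bootstrap target

    y = r + gamma * max_a' Q_target(s', a')

where Q_target is a separate target network that is synchronized with the online network every targetUpdateInterval steps using the soft-update coefficient tau. Holding this target network fixed between updates is what stabilizes learning.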

  • owner (Owner): The owner of the algorithm object, usually the object that creates it.
  • policy (Policy): The policy model to use, such as MlpPolicy, CnnPolicy, etc. It determines the architecture and learning capabilities of the agent's policy network.
  • learningRate (Learning Rate): Set the learning rate for the Adam optimizer used to train the Q-network.
  • bufferSize (Replay Buffer Size): Specify the size of the replay buffer, which stores the agent's experiences for training.
  • learningStarts (Learning Starts): Set how many environment steps to collect transitions for before learning starts. This ensures that enough experiences are gathered before training begins.
  • batchSize (Batch Size): Set the minibatch size for each gradient update. It controls the number of experiences sampled from the replay buffer for each training iteration.
  • tau (Soft Update Coefficient): Set the coefficient for the soft update of the target network. It determines the interpolation weight between the online and target networks during the update; a value of 1 corresponds to a hard update.
  • gamma (Discount Factor): Set the discount factor that determines the weight of future rewards compared to immediate rewards. It influences the agent's preference for short-term or long-term rewards.
  • trainFreq (Train Frequency): Update the model every trainFreq steps.
  • gradientSteps (Gradient Steps): Specify how many gradient steps to perform after each rollout. Set to -1 to perform as many gradient steps as steps done in the environment during the rollout.
  • optimizeMemoryUsage (Optimize Memory Usage): Enable a memory-efficient variant of the replay buffer at the cost of increased complexity (see the stable-baselines3 documentation for details).
  • targetUpdateInterval (Target Update Interval): Specify the interval to update the target network. It determines how often the target network is synchronized with the online network.
  • explorationFraction (Exploration Fraction): Set the fraction of the entire training period over which the exploration rate is reduced. It controls the balance between exploration and exploitation (see the epsilon schedule sketch after this list).
  • explorationInitialEps (Exploration Initial Epsilon): Set the initial value of the random action probability. It determines the exploration rate at the beginning of the training.
  • explorationFinalEps (Exploration Final Epsilon): Set the final value of the random action probability. It determines the exploration rate at the end of the training.
  • maxGradNorm (Maximum Gradient Norm): Specify the maximum value for gradient clipping. It prevents large updates that could destabilize the training process.
  • verbose (Verbose Level): Control the verbosity level of the training process. Set it to 0 for no output, 1 for info messages, and 2 for debug messages.
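
Taken together, the three exploration parameters define a linear epsilon schedule: epsilon decays from explorationInitialEps to explorationFinalEps over the first explorationFraction of training and then stays constant. The helper below is an illustrative sketch of that schedule (it mirrors the linear schedule used by stable-baselines3's DQN and is not part of the NevarokML API):

float GetExplorationEps(const float progressFraction,     // fraction of total training steps completed, in [0, 1]
                        const float explorationFraction,
                        const float explorationInitialEps,
                        const float explorationFinalEps)
{
    // Once the exploration phase is over, keep epsilon at its final value.
    if (progressFraction >= explorationFraction)
    {
        return explorationFinalEps;
    }
    // Linearly interpolate from the initial to the final epsilon.
    return explorationInitialEps +
           (progressFraction / explorationFraction) * (explorationFinalEps - explorationInitialEps);
}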

API

Here is the API for the DQN algorithm in NevarokML, along with the corresponding default parameter settings:

#include "Models/NevarokMLBaseAlgorithm.h"

UFUNCTION(BlueprintPure, Category = "NevarokML|BaseAlgorithm")
static UNevarokMLBaseAlgorithm* DQN(UObject* owner,
                                    const ENevarokMLPolicy policy = ENevarokMLPolicy::MLP_POLICY,
                                    const float learningRate = 1e-4,
                                    const int bufferSize = 1000000,
                                    const int learningStarts = 50000,
                                    const int batchSize = 32,
                                    const float tau = 1.0,
                                    const float gamma = 0.99,
                                    const int trainFreq = 4,
                                    const int gradientSteps = 1,
                                    const bool optimizeMemoryUsage = false,
                                    const int targetUpdateInterval = 10000,
                                    const float explorationFraction = 0.1,
                                    const float explorationInitialEps = 1.0,
                                    const float explorationFinalEps = 0.05,
                                    const float maxGradNorm = 10,
                                    const int verbose = 1);

By setting the appropriate parameter values, you can customize the behavior of the DQN algorithm to suit your specific reinforcement learning problem.
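
For example, a C++ call that overrides a few of the defaults might look like the sketch below. Only the DQN signature itself comes from the API above; the owner pointer passed as this and the way the returned algorithm object is used afterwards are assumptions:

#include "Models/NevarokMLBaseAlgorithm.h"

// Sketch: create a DQN algorithm with a smaller replay buffer and an earlier
// learning start than the defaults. 'this' is assumed to be a UObject-derived
// owner (e.g. an actor or component) that manages the algorithm's lifetime.
UNevarokMLBaseAlgorithm* Algorithm = UNevarokMLBaseAlgorithm::DQN(
    this,                            // owner
    ENevarokMLPolicy::MLP_POLICY,    // policy
    1e-4f,                           // learningRate
    100000,                          // bufferSize
    1000,                            // learningStarts
    64,                              // batchSize
    1.0f,                            // tau (hard target update)
    0.99f,                           // gamma
    4,                               // trainFreq
    1,                               // gradientSteps
    false,                           // optimizeMemoryUsage
    10000,                           // targetUpdateInterval
    0.1f,                            // explorationFraction
    1.0f,                            // explorationInitialEps
    0.05f,                           // explorationFinalEps
    10.0f,                           // maxGradNorm
    1);                              // verbose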

For more details on the Stable Baselines DQN algorithm, please refer to the DQN Paper, stable-baselines3 documentation page, and the Nature paper.