NevarokML: PPO Algorithm

The NevarokML plugin integrates the Stable Baselines3 Proximal Policy Optimization (PPO) algorithm into the Unreal Engine environment. PPO is a widely used reinforcement learning algorithm that has demonstrated strong performance across a broad range of applications.


PPO Algorithm Overview

The PPO algorithm, as implemented in NevarokML, combines the benefits of policy gradient methods with an iterative optimization approach. It leverages a clipped surrogate objective function to ensure stable and efficient training. Here are the key features and parameters of the PPO algorithm used in NevarokML:

  • owner: The owner of the algorithm object, typically the UObject that creates it.
  • policy (Policy): The policy model to use, such as MlpPolicy, CnnPolicy, etc. It determines the architecture and learning capabilities of the agent's policy network.
  • learningRate (Learning Rate): Set the learning rate for the PPO algorithm. It controls the step size during optimization and affects the convergence speed and stability.
  • nSteps (Number of Steps): Specify the number of steps to run for each environment per update. This determines the size of the rollout buffer and influences the trade-off between bias and variance.
  • batchSize (Batch Size): Set the minibatch size used for training. Larger batch sizes can improve training stability but require more computational resources.
  • nEpochs (Number of Epochs): Define the number of epochs when optimizing the surrogate loss. Each epoch consists of multiple update steps on the collected samples.
  • gamma (Discount Factor): Set the discount factor, which determines the weight of future rewards relative to immediate rewards. It influences the agent's preference for short-term versus long-term rewards.
  • gaeLambda (Generalized Advantage Estimation (GAE) Lambda): Specify the trade-off factor between bias and variance for the Generalized Advantage Estimator. It affects how rewards are accumulated over time and influences the agent's value function estimation.
  • clipRange (Clip Range): Set the clipping parameter for the PPO objective. It restricts each policy update to a limited range, preventing destructively large policy changes (the clipped objective is sketched after this list).
  • entCoef (Entropy Coefficient): Specify the entropy coefficient for the loss calculation. It encourages exploration by adding an entropy term to the objective function.
  • vfCoef (Value Function Coefficient): Set the value function coefficient for the loss calculation. It balances the importance of the value function and the policy gradient during optimization.
  • maxGradNorm (Maximum Gradient Norm): Specify the maximum value for gradient clipping. It prevents large updates that could destabilize the training process.
  • useSde (Use SDE): Enable the use of Generalized State Dependent Exploration (gSDE) instead of action noise exploration.
  • sdeSampleFreq (SDE Sample Frequency): Set the frequency to sample a new noise matrix when using gSDE.
  • verbose (Verbose Level): Control the verbosity level of the training process. Set it to 0 for no output, 1 for info messages, and 2 for debug messages.
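
For reference, clipRange corresponds to the clipping parameter ε in the clipped surrogate objective described in the original PPO paper, while gamma and gaeLambda enter through the GAE advantage estimate. In LaTeX notation:

    L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right],
    \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

    \hat{A}_t = \sum_{l \ge 0} (\gamma \lambda)^l \delta_{t+l},
    \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)

Here ε is clipRange, γ is gamma, and λ is gaeLambda; entCoef and vfCoef weight the entropy bonus and value-function loss that are added to this objective in the total training loss.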

API

Here is the API for the PPO algorithm in NevarokML, along with the corresponding default parameter values:

#include "Models/NevarokMLBaseAlgorithm.h"

UFUNCTION(BlueprintPure, Category = "NevarokML|BaseAlgorithm")
static UNevarokMLBaseAlgorithm* PPO(UObject* owner,
                                    const ENevarokMLPolicy policy = ENevarokMLPolicy::MLP_POLICY,
                                    const float learningRate = 3e-4,
                                    const int nSteps = 2048,
                                    const int batchSize = 64,
                                    const int nEpochs = 10,
                                    const float gamma = 0.99,
                                    const float gaeLambda = 0.95,
                                    const float clipRange = 0.2,
                                    const float entCoef = 0.0,
                                    const float vfCoef = 0.5,
                                    const float maxGradNorm = 0.5,
                                    const bool useSde = false,
                                    const int sdeSampleFreq = -1,
                                    const int verbose = 1);

By setting the appropriate parameter values, you can customize the behavior of the PPO algorithm to suit your specific reinforcement learning problem.
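
As a rough usage sketch in C++: the helper below is illustrative (MakeCustomPPO is not part of NevarokML), and it assumes the static PPO factory is declared on a UNevarokMLBaseAlgorithm class, as the header name above suggests. The returned algorithm object is then handed to the rest of the NevarokML training setup, which is not shown here.

#include "Models/NevarokMLBaseAlgorithm.h"

// Illustrative helper (not part of NevarokML): creates a PPO algorithm for the
// given owner with a few defaults overridden.
static UNevarokMLBaseAlgorithm* MakeCustomPPO(UObject* Owner)
{
    // Shorter rollouts and a small entropy bonus to encourage exploration;
    // the remaining values match the defaults listed above.
    return UNevarokMLBaseAlgorithm::PPO(
        Owner,                         // owner
        ENevarokMLPolicy::MLP_POLICY,  // policy
        3e-4f,                         // learningRate
        1024,                          // nSteps
        64,                            // batchSize
        10,                            // nEpochs
        0.99f,                         // gamma
        0.95f,                         // gaeLambda
        0.2f,                          // clipRange
        0.01f,                         // entCoef
        0.5f,                          // vfCoef
        0.5f,                          // maxGradNorm
        false,                         // useSde
        -1,                            // sdeSampleFreq
        1);                            // verbose
}

In Blueprint, the same factory is exposed as a pure node under the NevarokML|BaseAlgorithm category, per the UFUNCTION specifier shown above.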

For more details on the Stable Baselines3 PPO algorithm, please refer to the original paper, the stable-baselines3 documentation, and the Spinning Up guide.