NevarokML: A2C Algorithm

The NevarokML plugin integrates the Stable Baselines3 Advantage Actor-Critic (A2C) algorithm into the Unreal Engine environment. A2C is an on-policy actor-critic algorithm that combines the benefits of value-based and policy-based methods.


A2C Algorithm Overview

The A2C algorithm, as implemented in NevarokML, uses a policy network and a value-function network, estimating the advantage from discounted cumulative rewards to optimize the policy and the value function simultaneously. The key features and parameters of the A2C algorithm in NevarokML are listed below (a small conceptual sketch of the advantage estimation and loss weighting follows the list):

  • owner (Owner): The UObject that owns the algorithm instance, typically the object creating it.
  • policy (Policy): The policy model to use, such as MlpPolicy, CnnPolicy, etc. It determines the architecture and learning capabilities of the agent's policy network.
  • learningRate (Learning Rate): Set the learning rate for the A2C algorithm. It controls the step size during optimization and affects the convergence speed and stability.
  • nSteps (Number of Steps): Specify the number of steps to run for each environment per update. This determines the size of the rollout buffer and influences the trade-off between bias and variance.
  • gamma (Discount Factor): Set the discount factor that determines the weight of future rewards relative to immediate rewards. It influences the agent's preference for short-term or long-term rewards.
  • gaeLambda (Generalized Advantage Estimation (GAE) Lambda): Specify the trade-off factor between bias and variance for the Generalized Advantage Estimator. It affects how rewards are accumulated over time and influences the agent's value function estimation.
  • entCoef (Entropy Coefficient): Set the entropy coefficient for the loss calculation. It encourages exploration by adding an entropy term to the objective function.
  • vfCoef (Value Function Coefficient): Set the value function coefficient for the loss calculation. It balances the importance of the value function and the policy gradient during optimization.
  • maxGradNorm (Maximum Gradient Norm): Specify the maximum value for gradient clipping. It prevents large updates that could destabilize the training process.
  • rmsPropEps (RMSProp Epsilon): Set the epsilon value used in the RMSProp optimizer. It stabilizes the square root computation in the denominator of the RMSProp update.
  • useRmsProp (Use RMSProp): Choose whether to use RMSProp (default) or Adam as the optimizer.
  • useSde (Use SDE): Enable the use of Generalized State Dependent Exploration (gSDE) instead of action noise exploration.
  • sdeSampleFreq (SDE Sample Frequency): Set the frequency to sample a new noise matrix when using gSDE.
  • normalizeAdvantage (Normalize Advantage): Choose whether to normalize the advantage values during training.
  • verbose (Verbose Level): Control the verbosity level of the training process. Set it to 0 for no output, 1 for info messages, and 2 for debug messages.
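
To make the roles of gamma, gaeLambda, entCoef, and vfCoef concrete, here is a small, self-contained C++ sketch of the advantage estimation and loss weighting described above. It is illustrative only (not NevarokML or Stable Baselines3 source); the toy reward and value numbers are placeholders, and episode-termination handling is omitted for brevity.

// Conceptual sketch: how gamma and gaeLambda combine in Generalized Advantage
// Estimation, and how entCoef and vfCoef weight the A2C loss terms.
// Toy numbers only; done-flag handling is omitted.
#include <cstdio>
#include <vector>

int main()
{
    // One short rollout of nSteps = 5 transitions (toy values).
    const std::vector<float> rewards = {1.0f, 0.0f, 0.0f, 1.0f, 0.0f};
    const std::vector<float> values  = {0.5f, 0.4f, 0.3f, 0.6f, 0.2f}; // V(s_t) from the critic
    const float lastValue = 0.1f;   // bootstrap value V(s_{t+nSteps})
    const float gamma     = 0.99f;  // discount factor
    const float gaeLambda = 1.0f;   // GAE trade-off (1.0 gives the Monte Carlo estimate)

    // Backward pass: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    //                A_t     = delta_t + gamma * gaeLambda * A_{t+1}
    std::vector<float> advantages(rewards.size(), 0.0f);
    float nextAdvantage = 0.0f;
    float nextValue = lastValue;
    for (int t = static_cast<int>(rewards.size()) - 1; t >= 0; --t)
    {
        const float delta = rewards[t] + gamma * nextValue - values[t];
        advantages[t] = delta + gamma * gaeLambda * nextAdvantage;
        nextAdvantage = advantages[t];
        nextValue = values[t];
    }

    // A2C total loss = policy loss + vfCoef * value loss - entCoef * entropy.
    // The per-term values below stand in for quantities the optimizer would
    // compute from the advantages and action log-probabilities.
    const float entCoef = 0.0f, vfCoef = 0.5f;
    const float policyLoss = -0.12f, valueLoss = 0.30f, entropy = 1.05f; // placeholders
    const float totalLoss = policyLoss + vfCoef * valueLoss - entCoef * entropy;

    for (size_t t = 0; t < advantages.size(); ++t)
        std::printf("A_%zu = %.4f\n", t, advantages[t]);
    std::printf("total loss = %.4f\n", totalLoss);
    return 0;
}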

API

Here is the API for the A2C algorithm in NevarokML, along with its default parameter settings:

#include "Models/NevarokMLBaseAlgorithm.h"

UFUNCTION(BlueprintPure, Category = "NevarokML|BaseAlgorithm")
static UNevarokMLBaseAlgorithm* A2C(UObject* owner,
                                    const ENevarokMLPolicy policy = ENevarokMLPolicy::MLP_POLICY,
                                    const float learningRate = 7e-4,
                                    const int nSteps = 5,
                                    const float gamma = 0.99,
                                    const float gaeLambda = 1.0,
                                    const float entCoef = 0.0,
                                    const float vfCoef = 0.5,
                                    const float maxGradNorm = 0.5,
                                    const float rmsPropEps = 1e-5,
                                    const bool useRmsProp = true,
                                    const bool useSde = false,
                                    const int sdeSampleFreq = -1,
                                    const bool normalizeAdvantage = false,
                                    const int verbose = 1);

By setting the appropriate parameter values, you can customize the behavior of the A2C algorithm to suit your specific reinforcement learning problem.
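
As a hedged example, the following sketch shows one way the factory function above might be called from game code, assuming A2C is a static member of UNevarokMLBaseAlgorithm as the header suggests. The surrounding actor class (AMyTrainingActor) and the wiring of the returned algorithm into a training session are hypothetical and not part of the documented API.

// Illustrative usage sketch. AMyTrainingActor is a hypothetical AActor
// subclass declared elsewhere; only the A2C factory call itself comes from
// the header shown above.
#include "Models/NevarokMLBaseAlgorithm.h"

void AMyTrainingActor::SetupAlgorithm()
{
    UNevarokMLBaseAlgorithm* Algorithm = UNevarokMLBaseAlgorithm::A2C(
        /* owner              */ this,
        /* policy             */ ENevarokMLPolicy::MLP_POLICY,
        /* learningRate       */ 7e-4f,
        /* nSteps             */ 8,       // larger rollout than the default 5
        /* gamma              */ 0.99f,
        /* gaeLambda          */ 0.95f,   // trade a little bias for lower variance
        /* entCoef            */ 0.01f,   // mild exploration bonus
        /* vfCoef             */ 0.5f,
        /* maxGradNorm        */ 0.5f,
        /* rmsPropEps         */ 1e-5f,
        /* useRmsProp         */ true,
        /* useSde             */ false,
        /* sdeSampleFreq      */ -1,
        /* normalizeAdvantage */ false,
        /* verbose            */ 1);

    // The returned algorithm object can then be passed to the rest of the
    // NevarokML training setup (environment/trainer wiring not shown here).
}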

For more details on the Stable Baselines3 A2C algorithm, please refer to the original paper, the stable-baselines3 documentation page, and the introduction to A2C guide.