
ReinforcementLearningCore.jl

RLBase.plan!(p::AbstractExplorer, x[, mask])

Define how to select an action based on action values.

AbstractHook

A hook is called at different stages during a run to allow users to inject customized runtime logic. By default, an AbstractHook does nothing. One can customize the behavior by implementing the following methods:

  • Base.push!(hook::YourHook, ::PreActStage, agent, env)

  • Base.push!(hook::YourHook, ::PostActStage, agent, env)

  • Base.push!(hook::YourHook, ::PreEpisodeStage, agent, env)

  • Base.push!(hook::YourHook, ::PostEpisodeStage, agent, env)

  • Base.push!(hook::YourHook, ::PostExperimentStage, agent, env)

By convention, the Base.getindex(h::YourHook) is implemented to extract the metrics we are interested in. Users can compose different AbstractHooks with +.
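A minimal sketch of a custom hook (the name MyRewardHook is illustrative, not part of RLCore); it records the reward after every action and composes with another hook via +:

struct MyRewardHook <: AbstractHook
    rewards::Vector{Float64}
end
MyRewardHook() = MyRewardHook(Float64[])
Base.push!(h::MyRewardHook, ::PostActStage, agent, env) = push!(h.rewards, reward(env))
Base.getindex(h::MyRewardHook) = h.rewards

composed_hook = MyRewardHook() + StepsPerEpisode()  # run both hooks during the same experiment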

AbstractLearner

Abstract type for a learner.

ActorCritic(;actor, critic, optimizer=Adam())

The actor part must return logits (Do not use softmax in the last layer!), and the critic part must return a state value.

Agent(;policy, trajectory) <: AbstractPolicy

A wrapper of an AbstractPolicy. Generally speaking, it does nothing but update the trajectory and the policy appropriately at different stages. Agent is callable; its call method accepts varargs and keyword arguments, which are passed on to the policy.

BatchExplorer(explorer::AbstractExplorer)
BatchStepsPerEpisode(batchsize::Int; tag = "TRAINING")

Similar to StepsPerEpisode, but specific to environments that return a Vector of rewards (a typical case with MultiThreadEnv).

CategoricalNetwork(model)([rng,] state::AbstractArray [, mask::AbstractArray{Bool}]; is_sampling::Bool=false, is_return_log_prob::Bool = false)

CategoricalNetwork wraps a model (typically a neural network) that takes a state input and outputs logits for a categorical distribution. The optional argument mask must be an Array of Bool with the same size as state, except for the first dimension, which must have the length of the action vector. Actions mapped to false by mask have a logit of -Inf and/or a zero probability of being sampled.

  • rng::AbstractRNG=Random.default_rng()

  • is_sampling::Bool=false, whether to sample from the obtained categorical distribution (returns a Flux.OneHotArray z).

  • is_return_log_prob::Bool=false, whether to return the logits (i.e. the unnormalized log-probabilities) of getting the sampled actions in the given state.

This only applies if is_sampling is true, in which case z, logits is returned.

If is_sampling = false, returns only the logits obtained by a simple forward pass through model.
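
A hedged usage sketch, assuming a Flux Chain mapping a 4-dimensional state to 3 action logits (all sizes are illustrative):

using Flux
cn = CategoricalNetwork(Chain(Dense(4, 3)))
s = rand(Float32, 4, 1)                                          # a batch containing one state
logits = cn(s)                                                   # plain forward pass
z, logp = cn(s; is_sampling = true, is_return_log_prob = true)   # one-hot sample and logits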

(model::CategoricalNetwork)([rng::AbstractRNG,] state::AbstractArray{<:Any, 3}, [mask::AbstractArray{Bool},] action_samples::Int)

Sample action_samples actions from each state. Returns a 3D tensor with dimensions (action_size x action_samples x batchsize). The logits of each action are always returned as well, in a tensor with the same dimensions. The optional argument mask must be an Array of Bool with the same size as state, except for the first dimension, which must have the length of the action vector. Actions mapped to false by mask have a logit of -Inf and/or a zero probability of being sampled.

CovGaussianNetwork(;pre=identity, μ, Σ)

Returns μ and Σ when called, where μ is the mean and Σ is a covariance matrix. Unlike GaussianNetwork, the output is 3-dimensional: μ has dimensions (action_size x 1 x batchsize) and Σ has dimensions (action_size x action_size x batchsize). The Σ head of the CovGaussianNetwork should not directly return a square matrix but a vector of length action_size * (action_size + 1) ÷ 2, containing the elements of the upper triangular Cholesky decomposition of the covariance matrix, from which the matrix is then reconstructed. Sample from MvNormal.(μ, Σ).
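
A construction sketch, assuming a 4-dimensional state and a 2-dimensional action (sizes and layer widths are illustrative):

using Flux
ns, na = 4, 2
net = CovGaussianNetwork(
    pre = Dense(ns, 64, relu),
    μ   = Dense(64, na),
    Σ   = Dense(64, na * (na + 1) ÷ 2),   # vector form of the Cholesky factor, as required above
)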

(model::CovGaussianNetwork)(state::AbstractArray, action::AbstractArray)

Return the logpdf of the model sampling action when in state. state must be a 3D tensor with dimensions (state_size x 1 x batchsize). Multiple actions may be taken per state; action must have dimensions (action_size x action_samples_per_state x batchsize). Returns a 3D tensor with dimensions (1 x action_samples_per_state x batchsize).

If given 2D matrices as input, will return a 2D matrix of logpdf. States and actions are paired column-wise, one action per state.

(model::CovGaussianNetwork)(rng::AbstractRNG, state::AbstractArray{<:Any, 3}, action_samples::Int)

Sample action_samples actions per state in state and return actions, logpdf(actions). This function is compatible with a multidimensional action space. The outputs are 3D tensors with dimensions (action_size x action_samples x batchsize) and (1 x action_samples x batchsize) for actions and logpdf respectively.

(model::CovGaussianNetwork)(rng::AbstractRNG, state::AbstractArray{<:Any, 3}; is_sampling::Bool=false, is_return_log_prob::Bool=false)

This function is compatible with a multidimensional action space. To work with covariance matrices, the outputs are 3D tensors. If sampling, returns an actions tensor with dimensions (action_size x action_samples x batchsize) and a logp_π tensor with dimensions (1 x action_samples x batchsize). If not sampling, returns μ with dimensions (action_size x 1 x batchsize) and L, the lower triangular factor of the Cholesky decomposition of the covariance matrix, with dimensions (action_size x action_size x batchsize). The covariance matrices can be retrieved with Σ = stack(map(l -> l*l', eachslice(L, dims=3)); dims=3)

  • rng::AbstractRNG=Random.default_rng()

  • is_sampling::Bool=false, whether to sample from the obtained normal distribution.

  • is_return_log_prob::Bool=false, whether to calculate the conditional probability of getting actions in the given state.

(model::CovGaussianNetwork)(rng::AbstractRNG, state::AbstractMatrix; is_sampling::Bool=false, is_return_log_prob::Bool=false)

Given a Matrix of states, will return actions, μ and logpdf in matrix format. The batch of Σ remains a 3D tensor.

CurrentPlayerIterator(env::E) where {E<:AbstractEnv}

CurrentPlayerIterator is an iterator that iterates over the players in the environment, returning the current_player for each iteration. This is only necessary for MultiAgent environments. After each iteration, RLBase.next_player! is called to advance the current_player. As long as RLBase.next_player! is defined for the environment, this iterator will work correctly in the Base.run function.

DoEveryNEpisodes(f; n=1, t=0)

Execute f(t, agent, env) every n episodes. t is a counter of episodes.

DoEveryNSteps(f; n=1, t=0)

Execute f(t, agent, env) every n steps. t is a counter of steps.
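
For illustration (the logging body is an assumption, not from the docs), a hook that prints the step counter every 1000 steps:

log_hook = DoEveryNSteps((t, agent, env) -> println("step $t"); n = 1_000)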

DoOnExit(f)

Call the lambda function f at the end of an Experiment.

DuelingNetwork(;base, val, adv)

A dueling network automatically produces separate estimates of the state value function and the advantage function. The expected output size of val is 1, and that of adv is the size of the action space.
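
A construction sketch with illustrative sizes (a 4-dimensional state and 2 actions):

using Flux
ns, na = 4, 2
q_net = DuelingNetwork(
    base = Dense(ns, 64, relu),
    val  = Dense(64, 1),    # value head: output size 1
    adv  = Dense(64, na),   # advantage head: one output per action
)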

Nothing but a placeholder.

EpsilonGreedyExplorer{T}(;kwargs...)
EpsilonGreedyExplorer(ϵ) -> EpsilonGreedyExplorer{:linear}(; ϵ_stable = ϵ)

Epsilon-greedy strategy: the best lever is selected for a proportion 1 - ϵ of the trials, and a lever is selected at random (with uniform probability) for a proportion ϵ. (See Multi-armed bandit.)

Two kinds of epsilon-decreasing strategy are implemented here (linear and exp).

Epsilon-decreasing strategy: similar to the epsilon-greedy strategy, except that the value of ϵ decreases as the experiment progresses, resulting in highly explorative behaviour at the start and highly exploitative behaviour at the finish. (See Multi-armed bandit.)

Keywords

  • T::Symbol: defines how to calculate the epsilon in the warmup steps. Supported values are linear and exp.

  • step::Int = 1: record the current step.

  • ϵ_init::Float64 = 1.0: initial epsilon.

  • warmup_steps::Int=0: the number of steps to use ϵ_init.

  • decay_steps::Int=0: the number of steps for epsilon to decay from ϵ_init to ϵ_stable.

  • ϵ_stable::Float64: the epsilon after warmup_steps + decay_steps.

  • is_break_tie=false: if set to true, randomly select one of the actions that share the maximum value.

  • rng=Random.default_rng(): set the internal RNG.

Example

s_lin = EpsilonGreedyExplorer(kind=:linear, ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot([RLCore.get_ϵ(s_lin, i) for i in 1:500], label="linear epsilon")
s_exp = EpsilonGreedyExplorer(kind=:exp, ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot!([RLCore.get_ϵ(s_exp, i) for i in 1:500], label="exp epsilon")

Experiment(policy::AbstractPolicy, env::AbstractEnv, stop_condition::AbstractStopCondition, hook::AbstractHook)

A struct to hold the information of an experiment. It is used to run an experiment with the given policy, environment, stop condition and hook.

FluxApproximator(model, optimiser)

Wraps a Flux trainable model and implements the RLBase.optimise!(::FluxApproximator, ::Gradient) interface. See the RLCore documentation for more information on proper usage.

FluxApproximator(; model, optimiser, usegpu=false)

Constructs a FluxApproximator object for reinforcement learning.

Arguments

  • model: The model used for approximation.

  • optimiser: The optimizer used for updating the model.

  • usegpu: A boolean indicating whether to use GPU for computation. Default is false.

Returns

A FluxApproximator object.
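
A construction sketch; the model architecture is an illustrative assumption:

using Flux
approx = FluxApproximator(
    model     = Chain(Dense(4, 32, relu), Dense(32, 2)),
    optimiser = Adam(),
    usegpu    = false,
)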

(model::GaussianNetwork)(rng::AbstractRNG, state::AbstractArray{<:Any, 3}, action_samples::Int)

Sample action_samples actions from each state. Returns a 3D tensor with dimensions (action_size x action_samples x batchsize). state must be a 3D tensor with dimensions (state_size x 1 x batchsize). The logpdf of each action is always returned as well.

This function is compatible with a multidimensional action space.

  • rng::AbstractRNG=Random.default_rng()

  • is_sampling::Bool=false, whether to sample from the obtained normal distribution.

  • is_return_log_prob::Bool=false, whether to calculate the conditional probability of getting actions in the given state.

MultiAgentHook(hooks::NT) where {NT<: NamedTuple}

MultiAgentHook is a hook struct that contains <:AbstractHook structs indexed by the player’s symbol.

MultiAgentPolicy(agents::NT) where {NT<: NamedTuple}

MultiAgentPolicy is a policy struct that contains <:AbstractPolicy structs indexed by the player’s symbol.

OfflineAgent(policy::AbstractPolicy, trajectory::Trajectory, offline_behavior::OfflineBehavior = OfflineBehavior()) <: AbstractAgent

OfflineAgent is an AbstractAgent that, unlike the usual online Agent, does not interact with the environment during training in order to collect data. Just like Agent, it contains an AbstractPolicy to be trained and a Trajectory that contains the training data. The difference is that the trajectory is filled prior to training and is not updated afterwards. An OfflineBehavior can optionally be provided to supply a second "behavior agent" that will generate the training data at the PreExperimentStage. By default it does nothing.

OfflineBehavior(; agent:: Union{<:Agent, Nothing}, steps::Int, reset_condition)

Used to provide an OfflineAgent with a "behavior agent" that will generate the training data at the PreExperimentStage. If agent is nothing (the default), it does nothing. The trajectory of agent should be the same as that of the parent OfflineAgent. steps is the number of data elements to generate; it defaults to the capacity of the trajectory. reset_condition is the episode reset condition for the data generation (defaults to ResetIfEnvTerminated()).

The behavior agent will interact with the main environment of the experiment to generate the data.

This function accepts state and action, and then outputs actions after disturbance.

PlayerTuple

A NamedTuple that maps players to their respective values.

Stage that is executed after the Agent acts.

Stage that is executed after the Episode is over.

Stage that is executed after the Experiment is over.

Stage that is executed before the Agent acts.

Stage that is executed before the Episode starts.

Stage that is executed before the Experiment starts.

QBasedPolicy(;learner, explorer)

Wraps a learner and an explorer. The learner is a struct that should predict the Q-value of each legal action of an environment at its current state. It is typically a table or a neural network. QBasedPolicy can be queried for an action with RLBase.plan!; the explorer will affect the action selection accordingly.
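
A hedged sketch combining components documented on this page (a tabular Q-learner with an epsilon-greedy explorer; the sizes are illustrative):

policy = QBasedPolicy(
    learner  = TDLearner(; approximator = TabularQApproximator(n_state = 10, n_action = 4), method = :SARS),
    explorer = EpsilonGreedyExplorer(0.1),
)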

RandomPolicy(action_space=nothing; rng=Random.default_rng())

If action_space is nothing, then it will use the legal_action_space at runtime to randomly select an action. Otherwise, a random element within action_space is selected.

You should always set action_space=nothing when dealing with environments of FULL_ACTION_SET.
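
A usage sketch, assuming env is an AbstractEnv you have already constructed:

p = RandomPolicy()            # will sample from legal_action_space(env) at runtime
a = RLBase.plan!(p, env)      # pick a random legal action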

ResetAfterNSteps(n)

A reset condition that resets the environment after n steps.

ResetIfEnvTerminated()

A reset condition that resets the environment if is_terminated(env) is true.

RewardsPerEpisode(; rewards = Vector{Vector{Float64}}())

Store each reward of each step in every episode in the field of rewards.

SoftGaussianNetwork(;pre=identity, μ, σ, min_σ=0f0, max_σ=Inf32, squash = tanh)

Like GaussianNetwork but with a differentiable reparameterization trick. Mainly used for SAC. Returns μ and σ when called. Create a distribution to sample from using Normal.(μ, σ). min_σ and max_σ are used to clip the output from σ. pre is a shared body before the two heads of the NN. σ should be > 0. You may enforce this using a softplus output activation. Actions are squashed by a tanh and a correction is applied to the logpdf.

(model::SoftGaussianNetwork)(rng::AbstractRNG, state::AbstractArray{<:Any, 3}, action_samples::Int)

Sample action_samples actions from each state. Returns a 3D tensor with dimensions (action_size x action_samples x batchsize). state must be a 3D tensor with dimensions (state_size x 1 x batchsize). The logpdf of each action is always returned as well.

This function is compatible with a multidimensional action space.

  • rng::AbstractRNG=Random.default_rng()

  • is_sampling::Bool=false, whether to sample from the obtained normal distribution.

  • is_return_log_prob::Bool=false, whether to calculate the conditional probability of getting actions in the given state.

StackFrames(::Type{T}=Float32, d::Int...)

Use a pre-initialized CircularArrayBuffer to store the latest several states specified by d. Before processing any observation, the buffer is filled with zero(T) by default.
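
A usage sketch with illustrative frame dimensions; pushing a new frame drops the oldest one:

sf = StackFrames(Float32, 84, 84, 4)     # keep the 4 latest 84×84 frames
push!(sf, rand(Float32, 84, 84))         # a new 84×84 frame is appended as the latest slice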

StepsPerEpisode(; steps = Int[], count = 0)

Store steps of each episode in the field of steps.

StopAfterNEpisodes(episode; cur = 0, is_show_progress = true)

Return true after being called episode times. If is_show_progress is true, a ProgressMeter will be used to show the progress.

StopAfterNSeconds

Parameter:

  1. time budget

Stop training after N seconds.

StopAfterNSteps(step; cur = 1, is_show_progress = true)

Return true after being called step times.

StopAfterNoImprovement()

Stop training when a monitored metric has stopped improving.

Parameters:

fn: a closure that returns a scalar value indicating the performance of the policy (the higher the better), e.g.

  1. () -> reward(env)

  2. () -> total_reward_per_episode.reward

patience: Number of epochs with no improvement after which training will be stopped.

δ: minimum change in the monitored quantity to qualify as an improvement; an absolute change of less than δ counts as no improvement.

Return true after the monitored metric has stopped improving.

AnyStopCondition(stop_conditions...)

The result of stop_conditions is reduced by any.

StopIfEnvTerminated()

Return true if the environment is terminated.

StopSignal()

Create a stop signal initialized with a value of false. You can manually set it to true by s[] = true to stop the running loop at any time.
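
A minimal sketch of the pattern described above:

s = StopSignal()
# ... later, e.g. from a hook or another task:
s[] = true    # the run loop stops at its next check of the condition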

TDLearner(;approximator, method, γ=1.0, α=0.01, n=0)

Use the temporal-difference method to estimate a state value or state-action value.

Fields

  • approximator is <:TabularApproximator.

  • γ=1.0, discount rate.

  • method: only :SARS (Q-learning) is supported for the time being.

  • n=0: the number of time steps used minus 1.

TabularApproximator(table<:AbstractArray)

For a 1-d table, it serves as a state value approximator; for a 2-d table, it serves as a state-action value approximator.

For a 2-d table, the first dimension is the action and the second is the state.

TabularQApproximator(; n_state, n_action, init = 0.0)

Create a TabularQApproximator with n_state states and n_action actions.

TargetNetwork(network::FluxApproximator; sync_freq::Int = 1, ρ::Float32 = 0f0)

Wraps a FluxApproximator to hold a target network that is updated towards the model of the approximator.

  • sync_freq is the number of updates of network between each update of the target.

  • ρ (rho) is "how much of the target is kept when updating it".

The two common usages of TargetNetwork are

  • use ρ = 0 to totally replace target with network every sync_freq updates.

  • use ρ < 1 (but close to one) and sync_freq = 1 to let the target follow network with polyak averaging.

Implements the RLBase.optimise!(::TargetNetwork, ::Gradient) interface to update the model with the gradient and the target with weights replacement or Polyak averaging.

Note to developers: model(::TargetNetwork) returns the trainable Flux model, target(::TargetNetwork) returns the target model, and target(::FluxApproximator) returns the non-trainable Flux model. See the RLCore documentation.

TargetNetwork(network; sync_freq = 1, ρ = 0f0, use_gpu = false)

Constructs a target network for reinforcement learning.

Arguments

  • network: The main network used for training.

  • sync_freq: The frequency (in number of calls to optimise!) at which the target network is synchronized with the main network. Default is 1.

  • ρ: The interpolation factor used for updating the target network. Must be in the range [0, 1]. Default is 0 (the old weights are completely replaced by the new ones).

  • use_gpu: Specifies whether to use GPU for the target network. Default is false.

Returns

A TargetNetwork object.
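
A hedged sketch of the two usages listed above; the wrapped model is an illustrative assumption, and model/target may need to be qualified with RLCore:

using Flux
approx = FluxApproximator(model = Chain(Dense(4, 32, relu), Dense(32, 2)), optimiser = Adam())
tn = TargetNetwork(approx; sync_freq = 1, ρ = 0.99f0)   # Polyak-averaged target
# tn = TargetNetwork(approx; sync_freq = 100)           # alternative: hard copy every 100 updates
Q  = model(tn)     # trainable network
Qt = target(tn)    # target network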

TimePerStep(;max_steps=100)
TimePerStep(times::CircularVectorBuffer{Float64}, t::Float64)

Store the time cost in seconds of the latest max_steps steps in the times field.

TotalRewardPerEpisode(; is_display_on_exit = true)

Store the total reward of each episode in the field of rewards. If is_display_on_exit is set to true, a unicode plot will be shown at the PostExperimentStage.
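
A usage sketch, assuming env is an AbstractEnv you have already constructed:

hook = TotalRewardPerEpisode(is_display_on_exit = false)
run(RandomPolicy(), env, StopAfterNEpisodes(10), hook)
hook.rewards    # total reward of each of the 10 episodes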

UCBExplorer(na; c=2.0, ϵ=1e-10, step=1, seed=nothing)

Arguments

  • na is the number of actions, used to create an internal counter.

  • t is used to store the current time step.

  • c is used to control the degree of exploration.

  • seed: sets the seed of the internal RNG.

VAE(;encoder, decoder, latent_dims)
WeightedExplorer(;is_normalized::Bool, rng=Random.default_rng())

is_normalized indicates whether the action values fed to the explorer are already normalized to sum to 1.0.

Elements are assumed to be >=0.

WeightedSoftmaxExplorer(;rng=Random.default_rng())

See also: WeightedExplorer

When pushing a StackFrames into a CircularArrayBuffer of the same dimension, only the latest frame is pushed. If the StackFrames is one dimension lower, then it is treated as a general AbstractArray and is pushed in as a frame.

Base.run(
    multiagent_policy::MultiAgentPolicy,
    env::E,
    stop_condition,
    hook::MultiAgentHook,
    reset_condition,
) where {E<:AbstractEnv, H<:AbstractHook}

This run function dispatches games using MultiAgentPolicy and MultiAgentHook to the appropriate run function based on the Sequential or Simultaneous trait of the environment.

Base.run(
    multiagent_policy::MultiAgentPolicy,
    env::E,
    ::Sequential,
    stop_condition,
    hook::MultiAgentHook,
    reset_condition,
) where {E<:AbstractEnv, H<:AbstractHook}

This run function handles MultiAgent games with the Sequential trait. It iterates over the current_player for each turn in the environment, and runs the full run loop, like in the SingleAgent case. If the stop_condition is met, the function breaks out of the loop and calls optimise! on the policy again. Finally, it calls optimise! on the policy one last time and returns the MultiAgentHook.

Base.run(
    multiagent_policy::MultiAgentPolicy,
    env::E,
    ::Simultaneous,
    stop_condition,
    hook::MultiAgentHook,
    reset_condition,
) where {E<:AbstractEnv, H<:AbstractHook}

This run function handles MultiAgent games with the Simultaneous trait. It iterates over the players in the environment, and for each player, it selects the appropriate policy from the MultiAgentPolicy. All agent actions are collected before the environment is updated. After each player has taken an action, it calls optimise! on the policy. If the stop_condition is met, the function breaks out of the loop and calls optimise! on the policy again. Finally, it calls optimise! on the policy one last time and returns the MultiAgentHook.

RLBase.plan!(x::BatchExplorer, values::AbstractMatrix)

Apply inner explorer to each column of values.

RLBase.plan!(s::EpsilonGreedyExplorer, values; step) where T

If multiple values share the maximum value, a random one among them will be returned when is_break_tie==true.

`NaN` will be filtered out unless all the values are `NaN`; in that case, a random one will be returned.

prob(p::AbstractExplorer, x, mask)

Similar to prob(p::AbstractExplorer, x), but here only the masked elements are considered.

prob(p::AbstractExplorer, x) -> AbstractDistribution

Get the action distribution given action values.

prob(s::EpsilonGreedyExplorer, values) -> Categorical
prob(s::EpsilonGreedyExplorer, values, mask) -> Categorical

Return the probability of selecting each action given the estimated values of each action.

assuming rewards and new_rewards are Vector

assuming rewards and advantages are Vector

bellman_update!(app::TabularApproximator, s::Int, s_plus_one::Int, a::Int, α::Float64, π_::Float64, γ::Float64)

Update the Q-value of the given state-action pair.

Inject customized checks here by overriding this function.

cholesky_matrix_to_vector_index(i, j)

Return the position in a cholesky_vec (of length da) of the element of the lower triangular matrix at coordinates (i,j).

For example if cholesky_vec = [1,2,3,4,5,6], the corresponding lower triangular matrix is

L = [1 0 0
     2 4 0
     3 5 6]

and cholesky_matrix_to_vector_index(3, 2) == 5

diagnormkldivergence(μ1, σ1, μ2, σ2)

GPU-differentiable implementation of the KL divergence between two multivariate Gaussian distributions with mean vectors μ1, μ2 and diagonal standard deviations σ1, σ2 respectively. Arguments must be Vectors or arrays of column vectors.

diagnormlogpdf(μ, σ, x; ϵ = 1.0f-8)

GPU-compatible and automatically differentiable version of the logpdf function for normal distributions with diagonal covariance. An epsilon value is added to guarantee numerical stability if σ is exactly zero (e.g. if relu is used in the output layer). Accepts arguments of the same shape: vectors, matrices, or 3D arrays (with dimension 2 of size 1).

discount_rewards(rewards::VectorOrMatrix, γ::Number;kwargs...)

Calculate the discounted return starting from the current step with discount rate γ. rewards can be a matrix.

Keyword arguments

  • dims=:, if rewards is a Matrix, then dims can only be 1 or 2.

  • terminal=nothing, specify whether each reward is followed by a terminal state. nothing means the game is not terminated yet. If terminal is provided, its size must be the same as rewards.

  • init=nothing, init can be used to provide the reward estimate of the last state.

Example
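
A minimal sketch (not from the original docstring), using the default dims and no terminal:

rewards = [1.0, 1.0, 1.0]
discount_rewards(rewards, 0.9)    # ≈ [2.71, 1.9, 1.0]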

flatten_batch(x::AbstractArray)

Merge the last two dimensions.

Example

julia> x = reshape(1:12, 2, 2, 3)
2×2×3 reshape(::UnitRange{Int64}, 2, 2, 3) with eltype Int64:
[:, :, 1] =
 1  3
 2  4

[:, :, 2] =
 5  7
 6  8

[:, :, 3] =
  9  11
 10  12

julia> flatten_batch(x)
2×6 reshape(::UnitRange{Int64}, 2, 6) with eltype Int64:
 1  3  5  7   9  11
 2  4  6  8  10  12

generalized_advantage_estimation(rewards::VectorOrMatrix, values::VectorOrMatrix, γ::Number, λ::Number;kwargs...)

Calculate the generalized advantage estimate starting from the current step, with discount rate γ and GAE parameter λ. rewards and values can be matrices.

Keyword arguments

  • dims=:, if rewards is a Matrix, then dims can only be 1 or 2.

  • terminal=nothing, specify whether each reward is followed by a terminal state. nothing means the game is not terminated yet. If terminal is provided, its size must be the same as rewards.

Example
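
A hedged sketch; it assumes values carries one extra bootstrap estimate for the state after the last reward (length(values) == length(rewards) + 1):

rewards = [1.0, 1.0]
values  = [0.0, 0.0, 0.0]    # assumed to include the bootstrap value of the final state
generalized_advantage_estimation(rewards, values, 0.9, 0.95)    # ≈ [1.855, 1.0]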

logdetLorU(LorU::AbstractMatrix)

Log-determinant of the positive semi-definite matrix A = L*U (Cholesky lower and upper triangular factors), given L or U. Has a sign uncertainty for non-PSD matrices.

mvnormkldivergence(μ1, L1, μ2, L2)

GPU-differentiable implementation of the KL divergence between two multivariate Gaussian distributions with mean vectors μ1, μ2 and with Cholesky decompositions L1, L2 of their covariance matrices.

mvnormlogpdf(μ::AbstractVecOrMat, L::AbstractMatrix, x::AbstractVecOrMat)

GPU-compatible and automatically differentiable version of the logpdf function for multivariate normal distributions. Takes as inputs μ, the mean vector; L, the lower triangular matrix of the Cholesky decomposition of the covariance matrix; and x, a matrix of samples where each column is a sample. Returns a Vector containing the logpdf of each column of x for the MvNormal parametrized by μ and Σ = L*L'.

mvnormlogpdf(μ::A, LorU::A, x::A; ϵ = 1f-8) where A <: AbstractArray

Batch version that takes 3D tensors as input, where each slice along the 3rd dimension is a batch sample. μ is an (action_size x 1 x batchsize) tensor, L is (action_size x action_size x batchsize), and x is (action_size x action_samples x batchsize). Returns a 3D tensor of size (1 x action_samples x batchsize).

normkldivergence(μ1, σ1, μ2, σ2)

GPU-differentiable implementation of the KL divergence between two univariate Gaussian distributions with means μ1, μ2 and standard deviations σ1, σ2 respectively.

normlogpdf(μ, σ, x; ϵ = 1.0f-8)

GPU-compatible and automatically differentiable version of the logpdf function for a univariate normal distribution. An epsilon value is added to guarantee numerical stability if σ is exactly zero (e.g. if relu is used in the output layer).

Transform a vector containing the non-zero elements of a lower triangular da x da matrix into that matrix.

In addition to containing the run loop, RLCore is a collection of pre-implemented components that are frequently used in RL.

QBasedPolicy

QBasedPolicy is an AbstractPolicy that wraps a Q-Value learner (tabular or approximated) and an explorer. Use this wrapper to implement a policy that directly uses a Q-value function to decide its next action. In that case, instead of creating an AbstractPolicy subtype for your algorithm, define an AbstractLearner subtype and specialize RLBase.optimise!(::YourLearnerType, ::Stage, ::Trajectory). This way you will not have to code the interaction between your policy and the explorer yourself. RLCore provides the most common explorers (such as epsilon-greedy, UCB, etc.). You can find many examples of QBasedPolicies in the DQNs section of RLZoo.

Parametric approximators

Approximator

If your algorithm uses a neural network or a linear approximator trained with Flux.jl to approximate a function, use the Approximator. It wraps a Flux model and an Optimiser (such as Adam or SGD). Your optimise!(::PolicyOrLearner, batch) function will probably consist of computing a gradient and then calling RLBase.optimise!(app::Approximator, gradient::Flux.Grads).

Approximator implements the model(::Approximator) and target(::Approximator) interface. Both return the underlying Flux model. The advantage of this interface is explained in the TargetNetwork section below.

TargetNetwork

The use of a target network is frequent in state- or action-value-based RL. The principle is to hold a main approximator, which is trained using a gradient, and a copy of it that is either only partially updated or updated less frequently. TargetNetwork is constructed by wrapping an Approximator. Set the sync_freq keyword argument to a value greater than one to copy the main model into the target every sync_freq updates, or set the ρ parameter to a value greater than 0 (usually 0.99f0) to let the target be partially updated towards the main model at every update. RLBase.optimise!(tn::TargetNetwork, gradient::Flux.Grads) will take care of updating the target for you.

The other advantage of TargetNetwork is that it uses Julia’s multiple dispatch to let your algorithm be agnostic to the presence or absence of a target network. For example, the DQNLearner in RLZoo has an approximator field typed as Union{Approximator, TargetNetwork}. When computing the temporal difference error, the learner calls Q = model(learner.approximator) and Qt = target(learner.approximator). If learner.approximator is an Approximator, no target network is used because both calls point to the same neural network; if it is a TargetNetwork, the automatically managed target is returned.
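
A sketch of that dispatch pattern; td_prediction is a hypothetical helper, and approximator may be either an Approximator or a TargetNetwork without changing the code:

function td_prediction(approximator, s, s′)
    Q  = model(approximator)     # network used for the current-state prediction
    Qt = target(approximator)    # target network, or the same network when no target is used
    return Q(s), Qt(s′)
end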

Architectures

Common model architectures are also provided, such as the GaussianNetwork for continuous policies with diagonal covariance, and the CovGaussianNetwork for full covariance (very slow on GPUs at the moment).