ReinforcementLearningCore.jl
#
ReinforcementLearningCore.AbstractExplorer — Type
RLBase.plan!(p::AbstractExplorer, x[, mask])
Define how to select an action based on action values.
#
ReinforcementLearningCore.AbstractHook — Type
A hook is called at different stages during a run to allow users to inject customized runtime logic. By default, an AbstractHook does nothing. One can customize the behavior by implementing the following methods:
- Base.push!(hook::YourHook, ::PreActStage, agent, env)
- Base.push!(hook::YourHook, ::PostActStage, agent, env)
- Base.push!(hook::YourHook, ::PreEpisodeStage, agent, env)
- Base.push!(hook::YourHook, ::PostEpisodeStage, agent, env)
- Base.push!(hook::YourHook, ::PostExperimentStage, agent, env)
By convention, Base.getindex(h::YourHook) is implemented to extract the metrics we are interested in. Users can compose different AbstractHooks with +.
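A minimal sketch of a custom hook (the hook name and the logged quantity are illustrative, not part of the API; reward comes from RLBase):
struct EpisodeRewardLogger <: AbstractHook   # hypothetical hook type
    rewards::Vector{Float64}
end
EpisodeRewardLogger() = EpisodeRewardLogger(Float64[])
Base.push!(h::EpisodeRewardLogger, ::PostActStage, agent, env) = push!(h.rewards, reward(env))
Base.getindex(h::EpisodeRewardLogger) = h.rewards
composed_hook = EpisodeRewardLogger() + TotalRewardPerEpisode()   # hooks compose with +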
#
ReinforcementLearningCore.ActorCritic — Type
ActorCritic(;actor, critic, optimizer=Adam())
The actor part must return logits (Do not use softmax in the last layer!), and the critic part must return a state value.
#
ReinforcementLearningCore.Agent — Type
Agent(;policy, trajectory) <: AbstractPolicy
A wrapper of an AbstractPolicy. Generally speaking, it does nothing but update the trajectory and the policy appropriately at different stages. Agent is callable; its call method accepts varargs and keyword arguments that are passed on to the policy.
#
ReinforcementLearningCore.BatchExplorer — Type
BatchExplorer(explorer::AbstractExplorer)
#
ReinforcementLearningCore.BatchStepsPerEpisode — Method
BatchStepsPerEpisode(batchsize::Int; tag = "TRAINING")
Similar to StepsPerEpisode, but is specific to environments which return a Vector of rewards (a typical case with MultiThreadEnv).
#
ReinforcementLearningCore.CategoricalNetwork — Type
CategoricalNetwork(model)([rng,] state::AbstractArray [, mask::AbstractArray{Bool}]; is_sampling::Bool=false, is_return_log_prob::Bool = false)
CategoricalNetwork wraps a model (typically a neural network) that takes a state input and outputs logits for a categorical distribution. The optional argument mask must be an Array of Bool with the same size as state, except for the first dimension, which must have the length of the action vector. Actions mapped to false by mask have a logit equal to -Inf and/or a zero probability of being sampled.
- rng::AbstractRNG=Random.default_rng()
- is_sampling::Bool=false, whether to sample from the obtained categorical distribution (returns a Flux.OneHotArray z).
- is_return_log_prob::Bool=false, whether to return the logits (i.e. the unnormalized log-probabilities) of getting the sampled actions in the given state.
Only applies if is_sampling is true and will return z, logits.
If is_sampling = false, returns only the logits obtained by a simple forward pass into model.
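A hedged usage sketch; the layer and state sizes are made up for illustration:
using Flux, Random
ns, na = 4, 3                      # assumed state and action sizes
net = CategoricalNetwork(Dense(ns, na))
state = rand(Float32, ns, 1)       # a batch of one observation
logits = net(state)                # simple forward pass
z, logits = net(Random.default_rng(), state; is_sampling=true, is_return_log_prob=true)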
#
ReinforcementLearningCore.CategoricalNetwork — Method
(model::CategoricalNetwork)([rng::AbstractRNG,] state::AbstractArray{<:Any, 3}, [mask::AbstractArray{Bool},] action_samples::Int)
Sample action_samples actions from each state. Returns a 3D tensor with dimensions (action_size x action_samples x batchsize). Always returns the logits of each action, in a tensor with the same dimensions. The optional argument mask must be an Array of Bool with the same size as state, except for the first dimension, which must have the length of the action vector. Actions mapped to false by mask have a logit equal to -Inf and/or a zero probability of being sampled.
#
ReinforcementLearningCore.CovGaussianNetwork — Type
CovGaussianNetwork(;pre=identity, μ, Σ)
Returns μ and Σ when called, where μ is the mean and Σ is a covariance matrix. Unlike GaussianNetwork, the output is 3-dimensional. μ has dimensions (action_size x 1 x batchsize) and Σ has dimensions (action_size x action_size x batchsize). The Σ head of the CovGaussianNetwork should not directly return a square matrix but a vector of length action_size x (action_size + 1) ÷ 2. This vector contains the elements of the triangular Cholesky factor of the covariance matrix, which is then reconstructed from it. Sample from MvNormal.(μ, Σ).
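A construction sketch, assuming a 3-dimensional action space; the Σ head outputs the action_size x (action_size + 1) ÷ 2 = 6 Cholesky entries described above:
using Flux
ns, da = 5, 3                      # assumed state and action sizes
net = CovGaussianNetwork(
    pre = Dense(ns, 64, relu),
    μ   = Dense(64, da),
    Σ   = Dense(64, da * (da + 1) ÷ 2),   # vectorized triangular Cholesky factor
)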
#
ReinforcementLearningCore.CovGaussianNetwork — Method
(model::CovGaussianNetwork)(state::AbstractArray, action::AbstractArray)
Return the logpdf of the model sampling action when in state.  State must be a 3D tensor with dimensions (state_size x 1 x batchsize).  Multiple actions may be taken per state, action must have dimensions (action_size x action_samples_per_state x batchsize). Returns a 3D tensor with dimensions (1 x action_samples_per_state x batchsize).
#
ReinforcementLearningCore.CovGaussianNetwork — Method
If given 2D matrices as input, will return a 2D matrix of logpdf. States and actions are paired column-wise, one action per state.
#
ReinforcementLearningCore.CovGaussianNetwork — Method
(model::CovGaussianNetwork)(rng::AbstractRNG, state::AbstractArray{<:Any, 3}, action_samples::Int)
Sample action_samples actions per state in state and return actions, logpdf(actions). This function is compatible with a multidimensional action space. The outputs are 3D tensors with dimensions (action_size x action_samples x batchsize) and (1 x action_samples x batchsize) for the actions and the logpdf respectively.
#
ReinforcementLearningCore.CovGaussianNetwork — Method
(model::CovGaussianNetwork)(rng::AbstractRNG, state::AbstractArray{<:Any, 3}; is_sampling::Bool=false, is_return_log_prob::Bool=false)
This function is compatible with a multidimensional action space. To work with covariance matrices, the outputs are 3D tensors. If sampling, return an actions tensor with dimensions (action_size x action_samples x batchsize) and a logp_π tensor with dimensions (1 x action_samples x batchsize). If not sampling, return μ with dimensions (action_size x 1 x batchsize) and L, the lower triangular factor of the Cholesky decomposition of the covariance matrix, with dimensions (action_size x action_size x batchsize). The covariance matrices can be retrieved with Σ = stack(map(l -> l*l', eachslice(L, dims=3)); dims=3).
- rng::AbstractRNG=Random.default_rng()
- is_sampling::Bool=false, whether to sample from the obtained normal distribution.
- is_return_log_prob::Bool=false, whether to calculate the conditional probability of getting actions in the given state.
#
ReinforcementLearningCore.CovGaussianNetwork — Method
(model::CovGaussianNetwork)(rng::AbstractRNG, state::AbstractMatrix; is_sampling::Bool=false, is_return_log_prob::Bool=false)
Given a Matrix of states, will return actions, μ and logpdf in matrix format. The batch of Σ remains a 3D tensor.
#
ReinforcementLearningCore.CurrentPlayerIterator — Type
CurrentPlayerIterator(env::E) where {E<:AbstractEnv}
CurrentPlayerIterator is an iterator that iterates over the players in the environment, returning the current player for each iteration. This is only necessary for MultiAgent environments. After each iteration, RLBase.next_player! is called to advance the current player. As long as RLBase.next_player! is defined for the environment, this iterator will work correctly in the Base.run function.
#
ReinforcementLearningCore.DoEveryNEpisodes — Type
DoEveryNEpisodes(f; n=1, t=0)
Execute f(t, agent, env) every n episodes. t is a counter of episodes.
#
ReinforcementLearningCore.DoEveryNSteps — Type
DoEveryNSteps(f; n=1, t=0)
Execute f(t, agent, env) every n steps. t is a counter of steps.
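For example, a logging hook (the callback body is illustrative; reward comes from RLBase):
log_hook = DoEveryNSteps(; n = 1_000) do t, agent, env
    println("step ", t, ", current reward: ", reward(env))
end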
#
ReinforcementLearningCore.DoOnExit — Type
DoOnExit(f)
Call the lambda function f at the end of an Experiment.
#
ReinforcementLearningCore.DuelingNetwork — Type
DuelingNetwork(;base, val, adv)
A dueling network produces separate estimates of the state value function and the advantage function. The expected output size of val is 1, and that of adv is the size of the action space.
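A construction sketch with made-up layer sizes:
using Flux
ns, na = 8, 4                      # assumed state and action sizes
q_net = DuelingNetwork(
    base = Dense(ns, 64, relu),    # shared body
    val  = Dense(64, 1),           # state-value head
    adv  = Dense(64, na),          # advantage head, one output per action
)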
#
ReinforcementLearningCore.EmptyHook — Type
Nothing but a placeholder.
#
ReinforcementLearningCore.EpsilonGreedyExplorer — Type
EpsilonGreedyExplorer{T}(;kwargs...)
EpsilonGreedyExplorer(ϵ) -> EpsilonGreedyExplorer{:linear}(; ϵ_stable = ϵ)
Epsilon-greedy strategy: the best lever is selected for a proportion 1 - ϵ of the trials, and a lever is selected at random (with uniform probability) for a proportion ϵ. (See Multi-armed bandit.)
Two kinds of epsilon-decreasing strategies are implemented here (linear and exp).
Epsilon-decreasing strategy: similar to the epsilon-greedy strategy, except that the value of ϵ decreases as the experiment progresses, resulting in highly explorative behaviour at the start and highly exploitative behaviour at the finish. (See Multi-armed bandit.)
Keywords
- T::Symbol: defines how to calculate the epsilon in the warmup steps. Supported values are linear and exp.
- step::Int = 1: record the current step.
- ϵ_init::Float64 = 1.0: initial epsilon.
- warmup_steps::Int=0: the number of steps to use ϵ_init.
- decay_steps::Int=0: the number of steps for epsilon to decay from ϵ_init to ϵ_stable.
- ϵ_stable::Float64: the epsilon after warmup_steps + decay_steps.
- is_break_tie=false: randomly select one of the actions with the same maximum value if set to true.
- rng=Random.default_rng(): set the internal RNG.
Example
s_lin = EpsilonGreedyExplorer(kind=:linear, ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot([RLCore.get_ϵ(s_lin, i) for i in 1:500], label="linear epsilon")
s_exp = EpsilonGreedyExplorer(kind=:exp, ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot!([RLCore.get_ϵ(s_exp, i) for i in 1:500], label="exp epsilon")
#
ReinforcementLearningCore.Experiment — Type
Experiment(policy::AbstractPolicy, env::AbstractEnv, stop_condition::AbstractStopCondition, hook::AbstractHook)
A struct to hold the information of an experiment. It is used to run an experiment with the given policy, environment, stop condition and hook.
#
ReinforcementLearningCore.FluxApproximator — Type
FluxApproximator(model, optimiser)
Wraps a Flux trainable model and implements the RLBase.optimise!(::FluxApproximator, ::Gradient)  interface. See the RLCore documentation for more information on proper usage.
#
ReinforcementLearningCore.FluxApproximator — Method
FluxApproximator(; model, optimiser, usegpu=false)
Constructs a FluxApproximator object for reinforcement learning.
Arguments
- model: The model used for approximation.
- optimiser: The optimiser used for updating the model.
- usegpu: A boolean indicating whether to use the GPU for computation. Default is false.
Returns
A FluxApproximator object.
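A construction sketch (the model and optimiser are arbitrary choices):
using Flux
approx = FluxApproximator(
    model = Chain(Dense(4, 32, relu), Dense(32, 2)),
    optimiser = Adam(1e-3),
    usegpu = false,
)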
#
ReinforcementLearningCore.GaussianNetwork — Method
(model::GaussianNetwork)(rng::AbstractRNG, state::AbstractArray{<:Any, 3}, action_samples::Int)
Sample action_samples actions from each state. Returns a 3D tensor with dimensions (action_size x action_samples x batchsize). state must be a 3D tensor with dimensions (state_size x 1 x batchsize). Always returns the logpdf of each sampled action as well.
#
ReinforcementLearningCore.GaussianNetwork — Method
This function is compatible with a multidimensional action space.
- rng::AbstractRNG=Random.default_rng()
- is_sampling::Bool=false, whether to sample from the obtained normal distribution.
- is_return_log_prob::Bool=false, whether to calculate the conditional probability of getting actions in the given state.
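A hedged usage sketch; the constructor keywords mirror those shown for SoftGaussianNetwork below, and the sizes are illustrative:
using Flux, Random
ns, na = 6, 2
net = GaussianNetwork(pre = Dense(ns, 32, relu), μ = Dense(32, na), σ = Dense(32, na, softplus))
s = rand(Float32, ns, 1)
a, logp = net(Random.default_rng(), s; is_sampling = true, is_return_log_prob = true)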
#
ReinforcementLearningCore.MultiAgentHook — Type
MultiAgentHook(hooks::NT) where {NT<: NamedTuple}
MultiAgentHook is a hook struct that contains <:AbstractHook structs indexed by the player’s symbol.
#
ReinforcementLearningCore.MultiAgentPolicy — Type
MultiAgentPolicy(agents::NT) where {NT<: NamedTuple}
MultiAgentPolicy is a policy struct that contains <:AbstractPolicy structs indexed by the player’s symbol.
#
ReinforcementLearningCore.OfflineAgent — Type
OfflineAgent(policy::AbstractPolicy, trajectory::Trajectory, offline_behavior::OfflineBehavior = OfflineBehavior()) <: AbstractAgent
OfflineAgent is an AbstractAgent that, unlike the usual online Agent, does not interact with the environment during training to collect data. Just like Agent, it contains an AbstractPolicy to be trained and a Trajectory that contains the training data. The difference is that the trajectory is filled prior to training and is not updated. An OfflineBehavior can optionally be provided to supply a second "behavior agent" that will generate the training data at the PreExperimentStage. By default it does nothing.
#
ReinforcementLearningCore.OfflineBehavior — Type
OfflineBehavior(; agent:: Union{<:Agent, Nothing}, steps::Int, reset_condition)
Used to provide an OfflineAgent with a "behavior agent" that will generate the training data at the PreExperimentStage. If agent is nothing (the default), does nothing. The trajectory of agent should be the same as that of the parent OfflineAgent. steps is the number of data elements to generate; it defaults to the capacity of the trajectory. reset_condition is the episode reset condition for the data generation (defaults to ResetIfEnvTerminated()).
The behavior agent will interact with the main environment of the experiment to generate the data.
#
ReinforcementLearningCore.PerturbationNetwork — Method
This function accepts state and action, and outputs the actions after applying a perturbation.
#
ReinforcementLearningCore.PlayerTuple — Type
PlayerTuple
A NamedTuple that maps players to their respective values.
#
ReinforcementLearningCore.PostActStage — Type
Stage that is executed after the Agent acts.
#
ReinforcementLearningCore.PostEpisodeStage — Type
Stage that is executed after the Episode is over.
#
ReinforcementLearningCore.PostExperimentStage — Type
Stage that is executed after the Experiment is over.
#
ReinforcementLearningCore.PreActStage — Type
Stage that is executed before the Agent acts.
#
ReinforcementLearningCore.PreEpisodeStage — Type
Stage that is executed before the Episode starts.
#
ReinforcementLearningCore.PreExperimentStage — Type
Stage that is executed before the Experiment starts.
#
ReinforcementLearningCore.QBasedPolicy — Type
QBasedPolicy(;learner, explorer)
Wraps a learner and an explorer. The learner is a struct that should predict the Q-value of each legal action of an environment at its current state. It is typically a table or a neural network. QBasedPolicy can be queried for an action with RLBase.plan!; the explorer will affect the action selection accordingly.
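A construction sketch combining pieces documented on this page (sizes are illustrative):
policy = QBasedPolicy(
    learner = TDLearner(
        approximator = TabularQApproximator(n_state = 10, n_action = 4),
        method = :SARS,
    ),
    explorer = EpsilonGreedyExplorer(0.1),
)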
#
ReinforcementLearningCore.RandomPolicy — Type
RandomPolicy(action_space=nothing; rng=Random.default_rng())
If action_space is nothing, then it will use the legal_action_space at runtime to randomly select an action. Otherwise, a random element within action_space is selected.
Note: You should always set action_space=nothing when dealing with environments of FULL_ACTION_SET.
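Usage sketch:
p = RandomPolicy()              # samples from legal_action_space(env) at plan! time
p = RandomPolicy(Base.OneTo(4)) # samples uniformly from an explicit action space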
#
ReinforcementLearningCore.ResetAfterNSteps — Type
ResetAfterNSteps(n)
A reset condition that resets the environment after n steps.
#
ReinforcementLearningCore.ResetIfEnvTerminated — Type
ResetIfEnvTerminated()
A reset condition that resets the environment if is_terminated(env) is true.
#
ReinforcementLearningCore.RewardsPerEpisode — Type
RewardsPerEpisode(; rewards = Vector{Vector{Float64}}())
Store each reward of each step in every episode in the field of rewards.
#
ReinforcementLearningCore.SoftGaussianNetwork — Type
SoftGaussianNetwork(;pre=identity, μ, σ, min_σ=0f0, max_σ=Inf32, squash = tanh)
Like GaussianNetwork but with a differentiable reparameterization trick. Mainly used for SAC. Returns μ and σ when called.  Create a distribution to sample from using Normal.(μ, σ). min_σ and max_σ are used to clip the output from σ. pre is a shared body before the two heads of the NN. σ should be > 0.  You may enforce this using a softplus output activation. Actions are squashed by a tanh and a correction is applied to the logpdf.
#
ReinforcementLearningCore.SoftGaussianNetwork — Method
(model::SoftGaussianNetwork)(rng::AbstractRNG, state::AbstractArray{<:Any, 3}, action_samples::Int)
Sample action_samples actions from each state. Returns a 3D tensor with dimensions (action_size x action_samples x batchsize). state must be a 3D tensor with dimensions (state_size x 1 x batchsize). Always returns the logpdf of each sampled action as well.
#
ReinforcementLearningCore.SoftGaussianNetwork — Method
This function is compatible with a multidimensional action space.
- rng::AbstractRNG=Random.default_rng()
- is_sampling::Bool=false, whether to sample from the obtained normal distribution.
- is_return_log_prob::Bool=false, whether to calculate the conditional probability of getting actions in the given state.
#
ReinforcementLearningCore.StackFrames — Type
StackFrames(::Type{T}=Float32, d::Int...)
Use a pre-initialized CircularArrayBuffer to store the latest several states specified by d. Before processing any observation, the buffer is filled with zero(T) by default.
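For example, to keep the last 4 frames of an 84×84 observation (sizes are illustrative):
sf = StackFrames(Float32, 84, 84, 4)   # buffer starts filled with zeros
push!(sf, rand(Float32, 84, 84))       # pushes one new frame, dropping the oldest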
#
ReinforcementLearningCore.StepsPerEpisode — Type
StepsPerEpisode(; steps = Int[], count = 0)
Store steps of each episode in the field of steps.
#
ReinforcementLearningCore.StopAfterNEpisodes — Type
StopAfterNEpisodes(episode; cur = 0, is_show_progress = true)
Return true after being called episode times. If is_show_progress is true, a ProgressMeter will be used to show progress.
#
ReinforcementLearningCore.StopAfterNSeconds — Type
StopAfterNSeconds
Parameter:
- time budget
Stop training after N seconds.
#
ReinforcementLearningCore.StopAfterNSteps — Type
StopAfterNSteps(step; cur = 1, is_show_progress = true)
Return true after being called step times.
#
ReinforcementLearningCore.StopAfterNoImprovement — Type
StopAfterNoImprovement()
Stop training when a monitored metric has stopped improving.
Parameters:
- fn: a closure that returns a scalar value indicating the performance of the policy (the higher the better), e.g. () -> reward(env) or () -> total_reward_per_episode.reward.
- patience: number of epochs with no improvement after which training will be stopped.
- δ: minimum change in the monitored quantity to qualify as an improvement; an absolute change of less than δ counts as no improvement.
Return true after the monitored metric has stopped improving.
#
ReinforcementLearningCore.StopIfAny — Type
AnyStopCondition(stop_conditions...)
The result of stop_conditions is reduced by any.
#
ReinforcementLearningCore.StopIfEnvTerminated — Type
StopIfEnvTerminated()
Return true if the environment is terminated.
#
ReinforcementLearningCore.StopSignal — Type
StopSignal()
Create a stop signal initialized with a value of false. You can manually set it to true by s[] = true to stop the running loop at any time.
#
ReinforcementLearningCore.TDLearner — Type
TDLearner(;approximator, method, γ=1.0, α=0.01, n=0)
Use temporal-difference method to estimate state value or state-action value.
Fields
- approximator is a <:TabularApproximator.
- γ=1.0, discount rate.
- method: only :SARS (Q-learning) is supported for the time being.
- n=0: the number of time steps used minus 1.
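A construction sketch for a tabular Q-learner (sizes are illustrative):
learner = TDLearner(
    approximator = TabularQApproximator(n_state = 100, n_action = 4, init = 0.0),
    method = :SARS,   # the only supported method for now
    γ = 0.99,
    n = 0,
)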
#
ReinforcementLearningCore.TabularApproximator — Method
TabularApproximator(table<:AbstractArray)
For a 1-d table, it serves as a state value approximator. For a 2-d table, it serves as a state-action value approximator.
#
ReinforcementLearningCore.TabularQApproximator — Method
TabularQApproximator(; n_state, n_action, init = 0.0)
Create a TabularQApproximator with n_state states and n_action actions.
#
ReinforcementLearningCore.TargetNetwork — Type
TargetNetwork(network::FluxApproximator; sync_freq::Int = 1, ρ::Float32 = 0f0)
Wraps a FluxApproximator to hold a target network that is updated towards the model of the approximator.
- sync_freq is the number of updates of network between each update of the target.
- ρ (rho) is "how much of the target is kept when updating it".
The two common usages of TargetNetwork are
- use ρ = 0 to totally replace target with network every sync_freq updates.
- use ρ < 1 (but close to one) and sync_freq = 1 to let the target follow network with Polyak averaging.
Implements the RLBase.optimise!(::TargetNetwork, ::Gradient) interface to update the model with the gradient and the target with weights replacement or Polyak averaging.
Note to developers: model(::TargetNetwork) returns the trainable Flux model, target(::TargetNetwork) returns the target model, and target(::FluxApproximator) returns the approximator's underlying Flux model (there is no separate target in that case). See the RLCore documentation.
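Two hedged construction sketches matching the usages described above (the wrapped approximator is arbitrary):
using Flux
approx = FluxApproximator(model = Chain(Dense(4, 32, relu), Dense(32, 2)), optimiser = Adam())
tn_hard = TargetNetwork(approx; sync_freq = 100)            # full copy every 100 updates
tn_soft = TargetNetwork(approx; sync_freq = 1, ρ = 0.99f0)  # Polyak averaging at every update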
#
ReinforcementLearningCore.TargetNetwork — Method
TargetNetwork(network; sync_freq = 1, ρ = 0f0, use_gpu = false)
Constructs a target network for reinforcement learning.
Arguments
- network: The main network used for training.
- sync_freq: The frequency (in number of calls to optimise!) at which the target network is synchronized with the main network. Default is 1.
- ρ: The interpolation factor used for updating the target network. Must be in the range [0, 1]. Default is 0 (the old weights are completely replaced by the new ones).
- use_gpu: Specifies whether to use the GPU for the target network. Default is false.
Returns
A TargetNetwork object.
#
ReinforcementLearningCore.TimePerStep — Type
TimePerStep(;max_steps=100)
TimePerStep(times::CircularVectorBuffer{Float64}, t::Float64)
Store the time cost in seconds of each of the latest max_steps steps in the times field.
#
ReinforcementLearningCore.TotalRewardPerEpisode — Type
TotalRewardPerEpisode(; is_display_on_exit = true)
Store the total reward of each episode in the field of rewards. If is_display_on_exit is set to true, a unicode plot will be shown at the PostExperimentStage.
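Typical usage in a run loop (env is assumed to be some AbstractEnv you have constructed):
hook = TotalRewardPerEpisode()
run(RandomPolicy(), env, StopAfterNEpisodes(10), hook)
hook.rewards   # total reward of each finished episode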
#
ReinforcementLearningCore.UCBExplorer — Method
UCBExplorer(na; c=2.0, ϵ=1e-10, step=1, seed=nothing)
Arguments
- na is the number of actions, used to create an internal counter.
- step is used to store the current time step.
- c is used to control the degree of exploration.
- seed sets the seed of the internal RNG.
#
ReinforcementLearningCore.VAE — Type
VAE(;encoder, decoder, latent_dims)
#
ReinforcementLearningCore.WeightedExplorer — Type
WeightedExplorer(;is_normalized::Bool, rng=Random.default_rng())
is_normalized is used to indicate whether the provided action values are already normalized to sum to 1.0.
Note: Elements are assumed to be >= 0.
See also: WeightedSoftmaxExplorer
#
ReinforcementLearningCore.WeightedSoftmaxExplorer — Type
WeightedSoftmaxExplorer(;rng=Random.default_rng())
See also: WeightedExplorer
#
Base.push! — Method
When pushing a StackFrames into a CircularArrayBuffer of the same dimension, only the latest frame is pushed. If the StackFrames is one dimension lower, then it is treated as a general AbstractArray and is pushed in as a frame.
#
Base.run — Method
Base.run(
    multiagent_policy::MultiAgentPolicy,
    env::E,
    stop_condition,
    hook::MultiAgentHook,
    reset_condition,
) where {E<:AbstractEnv, H<:AbstractHook}
This run function dispatches games using MultiAgentPolicy and MultiAgentHook to the appropriate run function based on the Sequential or Simultaneous trait of the environment.
#
Base.run — Method
Base.run(
    multiagent_policy::MultiAgentPolicy,
    env::E,
    ::Sequential,
    stop_condition,
    hook::MultiAgentHook,
    reset_condition,
) where {E<:AbstractEnv, H<:AbstractHook}
This run function handles MultiAgent games with the Sequential trait. It iterates over the current_player for each turn in the environment, and runs the full run loop, like in the SingleAgent case. If the stop_condition is met, the function breaks out of the loop and calls optimise! on the policy again. Finally, it calls optimise! on the policy one last time and returns the MultiAgentHook.
#
Base.run — Method
Base.run(
    multiagent_policy::MultiAgentPolicy,
    env::E,
    ::Simultaneous,
    stop_condition,
    hook::MultiAgentHook,
    reset_condition,
) where {E<:AbstractEnv, H<:AbstractHook}
This run function handles MultiAgent games with the Simultaneous trait. It iterates over the players in the environment, and for each player, it selects the appropriate policy from the MultiAgentPolicy. All agent actions are collected before the environment is updated. After each player has taken an action, it calls optimise! on the policy. If the stop_condition is met, the function breaks out of the loop and calls optimise! on the policy again. Finally, it calls optimise! on the policy one last time and returns the MultiAgentHook.
#
ReinforcementLearningBase.plan! — Method
RLBase.plan!(x::BatchExplorer, values::AbstractMatrix)
Apply inner explorer to each column of values.
#
ReinforcementLearningBase.plan! — Method
RLBase.plan!(s::EpsilonGreedyExplorer, values; step) where T
Note: If multiple values share the same maximum, a random one among them will be returned when is_break_tie is true.
`NaN` will be filtered unless all the values are `NaN`. In that case, a random one will be returned.
#
ReinforcementLearningBase.prob — Method
prob(p::AbstractExplorer, x, mask)
Similar to prob(p::AbstractExplorer, x), but here only the masked elements are considered.
#
ReinforcementLearningBase.prob — Method
prob(p::AbstractExplorer, x) -> AbstractDistribution
Get the action distribution given action values.
#
ReinforcementLearningBase.prob — Method
prob(s::EpsilonGreedyExplorer, values) -> Categorical
prob(s::EpsilonGreedyExplorer, values, mask) -> Categorical
Return the probability of selecting each action given the estimated values of each action.
#
ReinforcementLearningCore._discount_rewards! — Method
assuming rewards and new_rewards are Vector
#
ReinforcementLearningCore._generalized_advantage_estimation! — Method
assuming rewards and advantages are Vector
#
ReinforcementLearningCore.bellman_update! — Method
bellman_update!(app::TabularApproximator, s::Int, s_plus_one::Int, a::Int, α::Float64, π_::Float64, γ::Float64)
Update the Q-value of the given state-action pair.
#
ReinforcementLearningCore.check — Method
Inject customized checks here by overriding this function.
#
ReinforcementLearningCore.cholesky_matrix_to_vector_index — Method
cholesky_matrix_to_vector_index(i, j)
Return the position in a cholesky_vec (the flattened lower triangular elements of a da x da matrix) of the element at coordinates (i,j).
For example if cholesky_vec = [1,2,3,4,5,6], the corresponding lower triangular matrix is
L = [1 0 0
     2 4 0
     3 5 6]
and cholesky_matrix_to_vector_index(3, 2) == 5
#
ReinforcementLearningCore.diagnormkldivergence — Method
diagnormkldivergence(μ1, σ1, μ2, σ2)
GPU differentiable implementation of the kl_divergence between two multivariate Gaussian distributions with mean vectors μ1, μ2 respectively and diagonal standard deviations σ1, σ2. Arguments must be Vectors or arrays of column vectors.
#
ReinforcementLearningCore.diagnormlogpdf — Method
diagnormlogpdf(μ, σ, x; ϵ = 1.0f-8)
GPU-compatible and automatically differentiable version of the logpdf function of normal distributions with diagonal covariance. An epsilon value is added to guarantee numeric stability if σ is exactly zero (e.g. if relu is used in the output layer). Accepts arguments of the same shape: vectors, matrices or 3D arrays (with dimension 2 of size 1).
#
ReinforcementLearningCore.discount_rewards — Method
discount_rewards(rewards::VectorOrMatrix, γ::Number;kwargs...)
Calculate the gain started from the current step with discount rate of γ. rewards can be a matrix.
Keyword arguments
- dims=:, if rewards is a Matrix, then dims can only be 1 or 2.
- terminal=nothing, specify whether each reward is followed by a terminal. nothing means the game is not terminated yet. If terminal is provided, its size must be the same as rewards.
- init=nothing, init can be used to provide the reward estimation of the last state.
Example
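A minimal sketch (not the original doc example): with γ = 0.9 and no terminal, each gain is rₜ + γ·Gₜ₊₁, starting from the last reward.
gains = discount_rewards([1.0, 1.0, 1.0], 0.9)
# gains ≈ [1 + 0.9 * 1.9, 1 + 0.9 * 1.0, 1.0] == [2.71, 1.9, 1.0]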
#
ReinforcementLearningCore.flatten_batch — Method
flatten_batch(x::AbstractArray)
Merge the last two dimension.
Example
julia> x = reshape(1:12, 2, 2, 3)
2×2×3 reshape(::UnitRange{Int64}, 2, 2, 3) with eltype Int64:
[:, :, 1] =
 1  3
 2  4
[:, :, 2] =
 5  7
 6  8
[:, :, 3] =
  9  11
 10  12
julia> flatten_batch(x)
2×6 reshape(::UnitRange{Int64}, 2, 6) with eltype Int64:
 1  3  5  7   9  11
 2  4  6  8  10  12
#
ReinforcementLearningCore.generalized_advantage_estimation — Method
generalized_advantage_estimation(rewards::VectorOrMatrix, values::VectorOrMatrix, γ::Number, λ::Number;kwargs...)
Calculate the generalized advantage estimate started from the current step, with discount rate γ and GAE-Lambda parameter λ. rewards and values can be matrices.
Keyword arguments
- dims=:, if rewards is a Matrix, then dims can only be 1 or 2.
- terminal=nothing, specify whether each reward is followed by a terminal. nothing means the game is not terminated yet. If terminal is provided, its size must be the same as rewards.
Example
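A hedged sketch (not the original doc example); here values is assumed to contain one extra entry, the bootstrap value of the state reached after the last reward:
rewards = [1.0, 1.0, 1.0]
values  = [0.5, 0.5, 0.5, 0.5]   # assumed length: length(rewards) + 1
adv = generalized_advantage_estimation(rewards, values, 0.99, 0.95)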
#
ReinforcementLearningCore.logdetLorU — Method
logdetLorU(LorU::AbstractMatrix)
Log-determinant of the positive semi-definite matrix A = L*U (Cholesky lower and upper triangular factors), given L or U. Has a sign uncertainty for non-PSD matrices.
#
ReinforcementLearningCore.mvnormkldivergence — Method
mvnormkldivergence(μ1, L1, μ2, L2)
GPU differentiable implementation of the kl_divergence between two multivariate Gaussian distributions with mean vectors μ1, μ2 respectively and with the Cholesky decompositions L1, L2 of their covariance matrices.
#
ReinforcementLearningCore.mvnormlogpdf — Method
mvnormlogpdf(μ::AbstractVecOrMat, L::AbstractMatrix, x::AbstractVecOrMat)
GPU-compatible and automatically differentiable version of the logpdf function of multivariate normal distributions. Takes as inputs μ the mean vector, L the lower triangular factor of the Cholesky decomposition of the covariance matrix, and x a matrix of samples where each column is a sample. Returns a Vector containing the logpdf of each column of x for the MvNormal parametrized by μ and Σ = L*L'.
#
ReinforcementLearningCore.mvnormlogpdf — Method
mvnormlogpdf(μ::A, LorU::A, x::A; ϵ = 1f-8) where A <: AbstractArray
Batch version that takes 3D tensors as input, where each slice along the 3rd dimension is a batch sample. μ is an (action_size x 1 x batchsize) array, L is (action_size x action_size x batchsize), and x is (action_size x action_samples x batchsize). Returns a 3D array of size (1 x action_samples x batchsize).
#
ReinforcementLearningCore.normkldivergence — Method
normkldivergence(μ1, σ1, μ2, σ2)
GPU differentiable implementation of the kl_divergence between two univariate Gaussian  distributions with means μ1, μ2 and standard deviations σ1, σ2 respectively.
#
ReinforcementLearningCore.normlogpdf — Method
 normlogpdf(μ, σ, x; ϵ = 1.0f-8)
GPU-compatible and automatically differentiable version of the logpdf function of a univariate normal distribution. An epsilon value is added to guarantee numeric stability if σ is exactly zero (e.g. if relu is used in the output layer).
#
ReinforcementLearningCore.vec_to_tril — Method
Transform a vector containing the non-zero elements of a lower triangular da x da matrix into that matrix.
In addition to containing the run loop, RLCore is a collection of pre-implemented components that are frequently used in RL.
QBasedPolicy
QBasedPolicy is an AbstractPolicy that wraps a Q-Value learner (tabular or approximated) and an explorer. Use this wrapper to implement a policy that directly uses a Q-value function to  decide its next action. In that case, instead of creating an AbstractPolicy subtype for your algorithm, define an AbstractLearner subtype and specialize RLBase.optimise!(::YourLearnerType, ::Stage, ::Trajectory). This way you will not have to code the interaction between your policy and the explorer yourself.  RLCore provides the most common explorers (such as epsilon-greedy, UCB, etc.). You can find many examples of QBasedPolicies in the DQNs section of RLZoo.
Parametric approximators
Approximator
If your algorithm uses a neural network or a linear approximator to approximate a function trained with Flux.jl, use the Approximator. It wraps a Flux model and an Optimiser (such as Adam or SGD). Your optimise!(::PolicyOrLearner, batch) function will probably consist of computing a gradient and then calling RLBase.optimise!(app::Approximator, gradient::Flux.Grads).
Approximator implements the model(::Approximator) and target(::Approximator) interface. Both return the underlying Flux model. The advantage of this interface is explained in the TargetNetwork section below.
TargetNetwork
The use of a target network is frequent in state- or action-value-based RL. The principle is to hold a main approximator, which is trained using a gradient, and a copy of it that is either only partially updated or updated less frequently. TargetNetwork is constructed by wrapping an Approximator. Set the sync_freq keyword argument to a value greater than one to copy the main model into the target every sync_freq updates, or set the ρ parameter to a value greater than 0 (usually 0.99f0) to let the target be partially updated towards the main model at every update. RLBase.optimise!(tn::TargetNetwork, gradient::Flux.Grads) will take care of updating the target for you.
The other advantage of TargetNetwork is that it uses Julia's multiple dispatch to let your algorithm be agnostic to the presence or absence of a target network. For example, the DQNLearner in RLZoo has an approximator field typed as Union{Approximator, TargetNetwork}. When computing the temporal difference error, the learner calls Q = model(learner.approximator) and Qt = target(learner.approximator). If learner.approximator is an Approximator, then no target network is used because both calls point to the same neural network; if it is a TargetNetwork, then the automatically managed target is returned.
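A sketch of that dispatch pattern (MyLearner and the function body are illustrative, not RLZoo's actual DQNLearner):
struct MyLearner{A<:Union{FluxApproximator, TargetNetwork}}
    approximator::A
end

function q_values(learner::MyLearner, s, s′)
    Q  = model(learner.approximator)    # trainable network
    Qt = target(learner.approximator)   # same network, or the managed target copy
    return Q(s), Qt(s′)
end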