Thanks to J. Peters et al. for their great work, A Survey on Policy Search for Robotics.
Now let's discuss the different ways of updating the policy used in policy search. Typical policy update methods for model-free policy search are policy gradient methods, expectation-maximization-based methods, information-theoretic methods, and methods derived from path integral theory.
Policy gradient methods use gradient ascent for maximizing the expected return $J_\theta$:

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J_\theta$$

where the policy gradient is given by:

$$\nabla_\theta J_\theta = \int_\tau \nabla_\theta p_\theta(\tau) R(\tau)\, \mathrm{d}\tau$$
We will discuss several ways to estimate the gradient $\nabla_\theta J_\theta$. You can also refer to here for a simpler discussion.
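To make the update rule concrete, here is a minimal sketch of the ascent loop; `estimate_policy_gradient` is a hypothetical placeholder for any of the estimators discussed below:

```python
import numpy as np

def gradient_ascent(theta, estimate_policy_gradient, alpha=0.01, n_iterations=100):
    """Generic ascent loop: theta_{k+1} = theta_k + alpha * grad J_theta."""
    for _ in range(n_iterations):
        grad = estimate_policy_gradient(theta)   # plug in any estimator from below
        theta = theta + alpha * grad
    return theta

# Toy check: maximize J(theta) = -||theta - 1||^2, whose exact gradient is -2 (theta - 1).
print(gradient_ascent(np.zeros(3), lambda th: -2.0 * (th - 1.0)))  # approaches [1, 1, 1]
```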
Finite Difference Methods
The finite difference method is among the simplest ways of obtaining the policy gradient and is typically used with an episode-based evaluation strategy and an exploration strategy in parameter space. It estimates the gradient by applying small perturbations $\delta\theta^{[i]}$ to the parameter vector $\theta_k$. We may either perturb each parameter value separately or use a probability distribution with small variance to create the perturbations.
The gradient $\nabla^{\text{FD}}_\theta J_\theta$ can be obtained by using a first-order Taylor expansion of $J_\theta$ and solving for the gradient in a least-squares sense:

$$\nabla^{\text{FD}}_\theta J_\theta = \left(\delta\Theta^T \delta\Theta\right)^{-1} \delta\Theta^T \delta R$$

where $\delta\Theta = \left[\delta\theta^{[1]}, \ldots, \delta\theta^{[N]}\right]^T$ and $\delta R = \left[\delta R^{[1]}, \ldots, \delta R^{[N]}\right]^T$. This method is also called the Least-Squares-based Finite Difference (LSFD) scheme.
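As a quick illustration, here is a minimal LSFD sketch. The black-box `evaluate_return(theta)`, the Gaussian perturbations, and the toy objective are assumptions of the example, not part of the survey:

```python
import numpy as np

def lsfd_gradient(theta, evaluate_return, n_perturbations=50, sigma=0.01, rng=None):
    """Least-squares finite-difference policy gradient.

    `evaluate_return(theta)` is assumed to run one (or several averaged) episodes
    with lower-level policy parameters `theta` and return the expected return J_theta.
    """
    rng = np.random.default_rng() if rng is None else rng
    J_ref = evaluate_return(theta)
    delta_Theta = sigma * rng.standard_normal((n_perturbations, theta.size))  # perturbations
    delta_R = np.array([evaluate_return(theta + d) - J_ref for d in delta_Theta])
    # Solve delta_Theta @ grad ≈ delta_R in the least-squares sense:
    # grad = (delta_Theta^T delta_Theta)^{-1} delta_Theta^T delta_R
    grad, *_ = np.linalg.lstsq(delta_Theta, delta_R, rcond=None)
    return grad

# Toy usage on a known objective J(theta) = -||theta||^2 (true gradient: -2 * theta).
theta = np.array([1.0, -2.0, 0.5])
print(lsfd_gradient(theta, lambda th: -np.sum(th ** 2)))
```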
Likelihood-ratio Policy Gradients
The likelihood-ratio methods make use of the so-called likelihood-ratio trick, given by the identity $\nabla p_\theta(y) = p_\theta(y) \nabla \log p_\theta(y)$. By using it:

$$\nabla_\theta J_\theta = \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) R(\tau)\, \mathrm{d}\tau = \mathbb{E}_{p_\theta(\tau)}\!\left[\nabla_\theta \log p_\theta(\tau) R(\tau)\right]$$

where the expectation over $p_\theta(\tau)$ is approximated by a sum over the sampled trajectories $\tau^{[i]} = \left(x^{[i]}_0, u^{[i]}_0, x^{[i]}_1, u^{[i]}_1, \ldots\right)$.
Due to the inherently noisy Monte-Carlo estimates, the resulting gradient estimates suffer from a large variance. The variance can be reduced by subtracting a baseline $b$:

$$\nabla_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\!\left[\nabla_\theta \log p_\theta(\tau)\left(R(\tau) - b\right)\right]$$

which does not bias the policy gradient estimate. The baseline $b$ is a free parameter, so we can choose it such that it minimizes the variance of the gradient estimate.
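A minimal sketch of this Monte-Carlo estimate, assuming we already have per-trajectory quantities $\nabla_\theta \log p_\theta(\tau^{[i]})$ and $R(\tau^{[i]})$ from sampled rollouts (the arrays in the usage snippet are fabricated stand-ins):

```python
import numpy as np

def likelihood_ratio_gradient(grad_log_probs, returns, baseline=None):
    """Monte-Carlo likelihood-ratio gradient estimate.

    grad_log_probs: array (N, d); the i-th row is grad_theta log p_theta(tau^[i])
    returns:        array (N,);   the i-th entry is R(tau^[i])
    baseline:       scalar b subtracted from the returns (0 if None)
    """
    b = 0.0 if baseline is None else baseline
    return np.mean(grad_log_probs * (returns - b)[:, None], axis=0)

# Usage with fabricated sample statistics (purely illustrative numbers).
rng = np.random.default_rng(0)
glp = rng.standard_normal((100, 4))      # stand-ins for grad log p_theta(tau^[i])
R = rng.standard_normal(100) + 5.0       # stand-ins for R(tau^[i])
print(likelihood_ratio_gradient(glp, R, baseline=R.mean()))
```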
Step-based likelihood-ratio methods
Step-based algorithms exploit the structure of the trajectory distribution:
$$p_\theta(\tau) = p(x_0) \prod_{t=0}^{T-1} p(x_{t+1} \mid x_t, u_t)\, \pi_\theta(u_t \mid x_t, t)$$

Taking the logarithm and then the gradient with respect to $\theta$:

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t \mid x_t, t)$$
Note that the gradient does not depend on the transition model, since the transition probabilities do not depend on $\theta$. However, this trick only works for stochastic policies.
REINFORCE
REINFORCE is one of the first policy gradient algorithms. The REINFORCE policy gradient is given by:

$$\nabla^{\text{RF}}_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t \mid x_t, t)\left(R(\tau) - b\right)\right]$$

The optimal baseline minimizes the variance of $\nabla^{\text{RF}}_\theta J_\theta$ and is computed per gradient component $h$:

$$b^{\text{RF}}_h = \frac{\mathbb{E}_{p_\theta}\!\left[\left(\sum_{t=0}^{T-1} \nabla_{\theta_h} \log \pi_\theta(u_t \mid x_t, t)\right)^2 R(\tau)\right]}{\mathbb{E}_{p_\theta}\!\left[\left(\sum_{t=0}^{T-1} \nabla_{\theta_h} \log \pi_\theta(u_t \mid x_t, t)\right)^2\right]}$$
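A sketch of the REINFORCE estimate with the component-wise optimal baseline, assuming the per-step gradients $\nabla_\theta \log \pi_\theta(u_t^{[i]} \mid x_t^{[i]}, t)$ have already been collected into an array:

```python
import numpy as np

def reinforce_gradient(step_grad_log_pis, returns):
    """REINFORCE gradient with the component-wise optimal baseline.

    step_grad_log_pis: array (N, T, d); entry [i, t] is
                       grad_theta log pi_theta(u_t^[i] | x_t^[i], t).
    returns:           array (N,); R(tau^[i]) for each sampled episode.
    """
    g = step_grad_log_pis.sum(axis=1)                 # (N, d): sum_t grad log pi per episode
    g_sq = g ** 2
    # Component-wise optimal baseline b_h = E[g_h^2 R] / E[g_h^2]
    baseline = (g_sq * returns[:, None]).mean(axis=0) / (g_sq.mean(axis=0) + 1e-12)
    return (g * (returns[:, None] - baseline)).mean(axis=0)
```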
G(PO)MDP
Despite using a step-based policy evaluation strategy, REINFORCE uses the return $R(\tau)$ of the whole episode (recall that $R(\tau) = r_T(x_T) + \sum_{t=0}^{T-1} r_t(x_t, u_t)$) to evaluate single actions. Note that rewards from the past do not depend on actions in the future, and, hence, $\mathbb{E}_{p_\theta}\!\left[\partial_\theta \log \pi_\theta(u_t \mid x_t, t)\, r_j\right] = 0$ for $j < t$. Hence, the policy gradient of G(PO)MDP is given by:
$$\nabla^{\text{GMDP}}_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\!\left[\sum_{j=0}^{T-1} \sum_{t=0}^{j} \nabla_\theta \log \pi_\theta(u_t \mid x_t, t)\left(r_j - b_j\right)\right]$$

The optimal baseline is again component-wise and now also depends on the time step $j$:

$$b^{\text{GMDP}}_{h,j} = \frac{\mathbb{E}_{p_\theta}\!\left[\left(\sum_{t=0}^{j} \nabla_{\theta_h} \log \pi_\theta(u_t \mid x_t, t)\right)^2 r_j\right]}{\mathbb{E}_{p_\theta}\!\left[\left(\sum_{t=0}^{j} \nabla_{\theta_h} \log \pi_\theta(u_t \mid x_t, t)\right)^2\right]}$$
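A minimal G(PO)MDP sketch under the same assumptions as the REINFORCE example (per-step log-policy gradients and per-step rewards are given as arrays):

```python
import numpy as np

def gpomdp_gradient(step_grad_log_pis, rewards):
    """G(PO)MDP gradient with the per-time-step optimal baseline.

    step_grad_log_pis: array (N, T, d); [i, t] = grad_theta log pi_theta(u_t^[i] | x_t^[i], t)
    rewards:           array (N, T);    [i, j] = r_j(x_j^[i], u_j^[i])
    """
    cum_g = np.cumsum(step_grad_log_pis, axis=1)       # (N, T, d): sum_{t<=j} grad log pi
    cum_g_sq = cum_g ** 2
    # b_{h,j} = E[(sum_{t<=j} grad_h log pi)^2 r_j] / E[(sum_{t<=j} grad_h log pi)^2]
    b = (cum_g_sq * rewards[:, :, None]).mean(axis=0) / (cum_g_sq.mean(axis=0) + 1e-12)
    # Sum over time steps j, average over episodes
    return (cum_g * (rewards[:, :, None] - b)).sum(axis=1).mean(axis=0)
```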
Policy Gradient Theorem Algorithm
We can use the expected reward to come at time step $t$, rather than the return $R(\tau)$, to evaluate an action $u_t$:

$$\nabla^{\text{PG}}_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t \mid x_t)\, Q^{\pi}_t(x_t, u_t)\right]$$

We can also subtract an arbitrary baseline $b_t(x)$ from $Q^{\pi}_t(x_t, u_t)$. This method can be used in combination with function approximation, and $Q^{\pi}_t(x_t, u_t)$ can be estimated by Monte-Carlo rollouts.
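A sketch of the policy-gradient-theorem estimate, where $Q^{\pi}_t(x_t, u_t)$ is approximated by the Monte-Carlo reward to come of each rollout (no baseline is subtracted in this minimal version):

```python
import numpy as np

def pg_theorem_gradient(step_grad_log_pis, rewards):
    """Policy-gradient-theorem estimate with Monte-Carlo Q values.

    Q_t(x_t, u_t) is approximated by the reward to come, sum_{j >= t} r_j,
    of the same rollout (a single-sample Monte-Carlo estimate).
    """
    # Reward to come: reverse cumulative sum over the time axis.
    q_mc = np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1]            # (N, T)
    return (step_grad_log_pis * q_mc[:, :, None]).sum(axis=1).mean(axis=0)
```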
Episode-based likelihood-ratio methods
Episode-based likelihood-ratio methods directly update the upper-level policy $\pi_\omega(\theta)$ that chooses the parameters $\theta$ of the lower-level policy $\pi_\theta(u_t \mid x_t, t)$. Refer to here for more details about the upper-level policy.

$$\nabla^{\text{PE}}_\omega J_\omega = \mathbb{E}_{\pi_\omega(\theta)}\!\left[\nabla_\omega \log \pi_\omega(\theta)\left(R(\theta) - b\right)\right]$$

The optimal baseline is given by:

$$b^{\text{PGPE}}_h = \frac{\mathbb{E}_{\pi_\omega(\theta)}\!\left[\left(\nabla_{\omega_h} \log \pi_\omega(\theta)\right)^2 R(\theta)\right]}{\mathbb{E}_{\pi_\omega(\theta)}\!\left[\left(\nabla_{\omega_h} \log \pi_\omega(\theta)\right)^2\right]}$$
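A sketch for the common special case of a Gaussian upper-level policy $\pi_\omega(\theta) = \mathcal{N}(\theta \mid \mu, \sigma^2 I)$ with fixed $\sigma$, so that $\nabla_\mu \log \pi_\omega(\theta) = (\theta - \mu)/\sigma^2$; the Gaussian choice is an assumption of the example:

```python
import numpy as np

def pgpe_gradient(mu, sigma, thetas, returns):
    """Episode-based likelihood-ratio (PGPE-style) gradient for the mean of a
    Gaussian upper-level policy pi_omega(theta) = N(theta | mu, sigma^2 I).

    thetas:  array (N, d) of sampled lower-level parameters theta^[i]
    returns: array (N,)   of returns R(theta^[i])
    """
    grad_log = (thetas - mu) / (sigma ** 2)            # grad_mu log pi_omega(theta^[i])
    g_sq = grad_log ** 2
    # Component-wise optimal baseline b_h = E[(grad_h log pi)^2 R] / E[(grad_h log pi)^2]
    b = (g_sq * returns[:, None]).mean(axis=0) / (g_sq.mean(axis=0) + 1e-12)
    return (grad_log * (returns[:, None] - b)).mean(axis=0)
```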
Natural Gradients
Natural gradients often achieve faster convergence than traditional gradients. Traditional gradients use a Euclidean metric $\delta\theta^T \delta\theta$ to determine the direction of the update $\delta\theta$, i.e., they assume that all parameter dimensions have similarly strong effects on the resulting distribution. However, small changes in $\theta$ might result in large changes of the resulting distribution $p_\theta(y)$. To achieve a stable learning process, it is desirable to enforce that the distribution $p_\theta(y)$ does not change too much in one update step, which is the key intuition behind the natural gradient: it limits the distance between the distributions $p_\theta(y)$ and $p_{\theta+\delta\theta}(y)$.
The Kullback-Leibler (KL) divergence is used to measure the distance between $p_\theta(y)$ and $p_{\theta+\delta\theta}(y)$, and the Fisher information matrix can be used to approximate the KL divergence for sufficiently small $\delta\theta$. Refer to here and here for more details on the KL divergence and the Fisher information matrix.
The Fisher information matrix is defined as:

$$F_\theta = \mathbb{E}_{p_\theta(y)}\!\left[\nabla_\theta \log p_\theta(y) \nabla_\theta \log p_\theta(y)^T\right]$$

and the KL divergence can be approximated as:

$$\mathrm{KL}\!\left(p_{\theta+\delta\theta}(y) \,\|\, p_\theta(y)\right) \approx \delta\theta^T F_\theta\, \delta\theta$$

The natural gradient update has a bounded distance:

$$\mathrm{KL}\!\left(p_{\theta+\delta\theta}(y) \,\|\, p_\theta(y)\right) \le \epsilon$$

in the distribution space. Hence, we can formulate the following optimization problem:

$$\delta\theta_{\text{NG}} = \arg\max_{\delta\theta}\ \delta\theta^T \delta\theta_{\text{VG}} \quad \text{s.t.} \quad \delta\theta^T F_\theta\, \delta\theta \le \epsilon$$
where $\delta\theta_{\text{VG}}$ is the traditional vanilla gradient update. For brevity, we do not look deeper into this optimization problem but only show its application. The natural policy gradient is given by:

$$\nabla^{\text{NG}}_\theta J_\theta = F_\theta^{-1} \nabla_\theta J_\theta$$

where $\nabla_\theta J_\theta$ can be estimated by any traditional policy gradient method.
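A minimal sketch of this two-step recipe: estimate $F_\theta$ from sampled log-probability gradients, then premultiply the vanilla gradient by $F_\theta^{-1}$ (the small ridge term is a numerical-stability assumption, not part of the theory):

```python
import numpy as np

def natural_gradient(grad_log_probs, vanilla_grad, reg=1e-6):
    """Natural gradient: premultiply the vanilla gradient by the inverse Fisher matrix.

    grad_log_probs: array (N, d) of sampled grad_theta log p_theta(y^[i]),
                    used to estimate F_theta = E[grad log p  grad log p^T].
    vanilla_grad:   vanilla policy gradient estimate (d,), from any method above.
    """
    F = grad_log_probs.T @ grad_log_probs / grad_log_probs.shape[0]
    F += reg * np.eye(F.shape[0])               # small ridge term for numerical stability
    return np.linalg.solve(F, vanilla_grad)     # F^{-1} grad_theta J_theta
```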
Step-based Natural Gradient Methods
The Fisher information matrix of the trajectory distribution can be written as the average of the Fisher information matrices for the single time steps:

$$F_\theta = \mathbb{E}_{p_\theta(\tau)}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t \mid x_t, t) \nabla_\theta \log \pi_\theta(u_t \mid x_t, t)^T\right]$$
However, estimating $F_\theta$ may be difficult as it contains $O(d^2)$ parameters. If we use compatible function approximation, $F_\theta$ does not need to be estimated explicitly. For simplicity, we do not derive compatible function approximation in detail but point out that it builds on the Policy Gradient Theorem algorithm:
Let $\tilde{A}_w(x_t, u_t, t) = \psi_t(x_t, u_t)^T w \approx Q^{\pi}_t(x_t, u_t) - b_t(x)$ be a function approximation. A good function approximation does not change the gradient in expectation, i.e., it does not introduce a bias. We use $\psi_t(x_t, u_t) = \nabla_\theta \log \pi_\theta(u_t \mid x_t, t)$ as basis functions, which is also called compatible function approximation, as the function approximation is compatible with the policy parameterization. The policy gradient using compatible function approximation can then be written as:

$$\nabla^{\text{FA}}_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t \mid x_t, t) \nabla_\theta \log \pi_\theta(u_t \mid x_t, t)^T\right] w = G_\theta w$$
Hence, in order to compute the traditional gradient, we have to estimate the weight parameters $w$ of the advantage function $\tilde{A}_w(x_t, u_t, t)$ and the matrix $G_\theta$.
Note that $G_\theta = F_\theta$. Hence, the step-based natural gradient simplifies to:

$$\nabla^{\text{NG}}_\theta J_\theta = F_\theta^{-1} \nabla^{\text{FA}}_\theta J_\theta = w$$
However, the natural gradient still requires estimating the function $\tilde{A}_w$. Due to the baseline $b_t(x)$, the function $\tilde{A}_w(x_t, u_t, t) \approx Q^{\pi}_t(x_t, u_t) - V_t(x_t)$ can be interpreted as the advantage function. The advantage function can be estimated by temporal difference methods, which also require an estimate of the value function $V_t(x_t)$.
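A sketch of the resulting procedure: regress advantage targets onto the compatible features $\psi_t = \nabla_\theta \log \pi_\theta(u_t \mid x_t, t)$ and read off the natural gradient as the fitted weights $w$. How the advantage targets are obtained (e.g. by temporal-difference learning) is assumed given here:

```python
import numpy as np

def compatible_natural_gradient(step_grad_log_pis, advantage_targets):
    """Step-based natural gradient via compatible function approximation.

    With basis functions psi_t = grad_theta log pi_theta(u_t | x_t, t), fitting
    A_w(x_t, u_t, t) ~= psi_t^T w by least squares yields w, and the natural
    gradient is exactly w (since G_theta = F_theta cancels).

    step_grad_log_pis:  array (N, T, d) of psi_t for each sampled step
    advantage_targets:  array (N, T) of advantage estimates, e.g. from
                        temporal-difference learning (assumed given here).
    """
    Psi = step_grad_log_pis.reshape(-1, step_grad_log_pis.shape[-1])   # stack all steps
    A = advantage_targets.reshape(-1)
    w, *_ = np.linalg.lstsq(Psi, A, rcond=None)    # w = (Psi^T Psi)^{-1} Psi^T A
    return w                                       # natural gradient = w
```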
Episodic Natural Actor Critic
While the advantage function is easy to learn, as its basis functions $\psi_t(x_t, u_t)$ are given by the compatible function approximation, appropriate basis functions for the value function are more difficult to specify. We would therefore like to find algorithms that avoid estimating a value function. One such algorithm is the episodic Natural Actor-Critic (eNAC) algorithm, where the estimation of the value function $V_t$ can be avoided by considering whole sample paths.
For simplicity, we omit some intermediate steps and show the final outcome directly:
Rewrite the Bellman equation:

$$\tilde{A}_w(x_t, u_t, t) + V_t(x_t) = r_t(x_t, u_t) + \int p(x' \mid x_t, u_t)\, V_{t+1}(x')\, \mathrm{d}x'$$

Summing over the whole episode then yields:

$$\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t \mid x_t, t)^T w + V_0(x_0) = \sum_{t=0}^{T-1} r_t(x_t, u_t) + r_T(x_T)$$
Now the value function $V_t$ needs to be estimated only for the first time step. For a single start state $x_0$, estimating $V_0(x_0) = v_0$ reduces to estimating a constant $v_0$. For multiple start states $x_0$, $V_0$ needs to be parameterized, $V_0(x_0) \approx \tilde{V}_v(x_0) = \varphi(x_0)^T v$. By using multiple sample paths $\tau^{[i]}$, we get a regression problem:

$$\begin{pmatrix} w \\ v \end{pmatrix} = \left(\Psi^T \Psi\right)^{-1} \Psi^T R$$

where

$$\psi^{[i]} = \left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\!\left(u^{[i]}_t \,\middle|\, x^{[i]}_t, t\right)^T,\ \varphi\!\left(x^{[i]}_0\right)^T\right], \quad \Psi = \left[\psi^{[1]}, \ldots, \psi^{[N]}\right]^T$$

and $R = \left[R^{[1]}, \ldots, R^{[N]}\right]^T$ collects the episode returns.
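A minimal eNAC sketch of this regression, assuming the summed log-policy gradients, the start-state features $\varphi(x_0^{[i]})$, and the returns $R^{[i]}$ have already been collected:

```python
import numpy as np

def enac(step_grad_log_pis, start_state_features, returns):
    """Episodic Natural Actor-Critic (eNAC) regression.

    Solves [sum_t grad log pi_theta(u_t|x_t,t)]^T w + phi(x_0)^T v = R(tau^[i])
    in the least-squares sense over N sampled episodes.

    step_grad_log_pis:    array (N, T, d)
    start_state_features: array (N, k) of phi(x_0^[i]); for a single start state
                          use a column of ones, so v reduces to a constant v_0.
    returns:              array (N,) of episode returns R^[i]
    """
    d = step_grad_log_pis.shape[-1]
    Psi = np.hstack([step_grad_log_pis.sum(axis=1), start_state_features])  # (N, d + k)
    sol, *_ = np.linalg.lstsq(Psi, returns, rcond=None)
    w, v = sol[:d], sol[d:]
    return w, v    # w is the natural gradient, v parameterizes V_0(x_0)
```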
Natural Actor Critic
eNAC uses the returns $R^{[i]}$ for evaluating the policy and, consequently, becomes less accurate for large time horizons due to the large variance of the returns. The convergence speed can be improved by directly estimating the value function. To do so, temporal difference methods first have to be adapted to learn the advantage function.

Episode-based Natural Policy Gradients
The beneficial properties of the natural gradient can also be exploited for episode-based algorithms. Such methods come from the area of evolutionary algorithms. They perform gradient ascent on a fitness function, which in the reinforcement learning context is the expected long-term reward $J_\omega$ of the upper-level policy:

$$\nabla^{\text{NES}}_\omega J_\omega = F_\omega^{-1} \nabla^{\text{PE}}_\omega J_\omega$$
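A sketch combining the episode-based likelihood-ratio gradient with an empirical Fisher matrix of the upper-level policy; the simple mean baseline and the ridge regularization are assumptions of the example:

```python
import numpy as np

def episode_based_natural_gradient(grad_log_pi_omega, returns, reg=1e-6):
    """Episode-based natural policy gradient (NES-style).

    grad_log_pi_omega: array (N, d) of grad_omega log pi_omega(theta^[i])
    returns:           array (N,)   of R(theta^[i])
    """
    R_centered = returns - returns.mean()                              # simple mean baseline
    vanilla = (grad_log_pi_omega * R_centered[:, None]).mean(axis=0)   # nabla^PE J_omega
    F = grad_log_pi_omega.T @ grad_log_pi_omega / grad_log_pi_omega.shape[0]
    return np.linalg.solve(F + reg * np.eye(F.shape[0]), vanilla)      # F^{-1} nabla^PE J_omega
```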