Thanks to J. Peters et al. for their great work, A Survey on Policy Search for Robotics.
Now let's discuss the different ways of updating the policy used in policy search. Typical policy update methods for model-free policy search are policy gradient methods, expectation-maximization-based methods, information-theoretic methods, and methods derived from path integral theory.
Policy gradient methods use gradient ascent for maximizing the expected return $J_\theta$:

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J_\theta$$

where the policy gradient is given by:

$$\nabla_\theta J_\theta = \int_\tau \nabla_\theta p_\theta(\tau) R(\tau)\, \mathrm{d}\tau$$
We will discuss several ways to estimate the gradient $\nabla_\theta J_\theta$. You can also refer to here for a simpler discussion.
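To make the update rule concrete, here is a minimal sketch of the ascent loop; `estimate_policy_gradient` is a hypothetical placeholder for any of the estimators discussed below:

```python
import numpy as np

def gradient_ascent(theta, estimate_policy_gradient, alpha=0.01, n_iterations=100):
    """Generic ascent loop: theta_{k+1} = theta_k + alpha * grad J_theta."""
    for _ in range(n_iterations):
        grad = estimate_policy_gradient(theta)   # plug in any estimator from below
        theta = theta + alpha * grad
    return theta

# Toy check: maximize J(theta) = -||theta - 1||^2, whose exact gradient is -2 (theta - 1).
print(gradient_ascent(np.zeros(3), lambda th: -2.0 * (th - 1.0)))  # approaches [1, 1, 1]
```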
Finite Difference Methods
The finite difference method is among the simplest ways of obtaining the policy gradient and is typically used with an episode-based evaluation strategy and an exploration strategy in parameter space. It estimates the gradient by applying small perturbations $\delta\theta^{[i]}$ to the parameter vector $\theta_k$. We may either perturb each parameter value separately or use a probability distribution with small variance to create the perturbations.
The gradient $\nabla^{\text{FD}}_\theta J_\theta$ can be obtained by using a first-order Taylor expansion of $J_\theta$ and solving for the gradient in a least-squares sense:

$$\nabla^{\text{FD}}_\theta J_\theta = \left(\delta\Theta^T \delta\Theta\right)^{-1} \delta\Theta^T \delta R$$

where $\delta\Theta = \left[\delta\theta^{[1]}, \ldots, \delta\theta^{[N]}\right]^T$ and $\delta R = \left[\delta R^{[1]}, \ldots, \delta R^{[N]}\right]^T$. This method is also called the Least-Squares-based Finite Difference (LSFD) scheme.
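As a quick illustration, here is a minimal LSFD sketch. The black-box `evaluate_return(theta)`, the Gaussian perturbations, and the toy objective are assumptions of the example, not part of the survey:

```python
import numpy as np

def lsfd_gradient(theta, evaluate_return, n_perturbations=50, sigma=0.01, rng=None):
    """Least-squares finite-difference policy gradient.

    `evaluate_return(theta)` is assumed to run one (or several averaged) episodes
    with lower-level policy parameters `theta` and return the expected return J_theta.
    """
    rng = np.random.default_rng() if rng is None else rng
    J_ref = evaluate_return(theta)
    delta_Theta = sigma * rng.standard_normal((n_perturbations, theta.size))  # perturbations
    delta_R = np.array([evaluate_return(theta + d) - J_ref for d in delta_Theta])
    # Solve delta_Theta @ grad ≈ delta_R in the least-squares sense:
    # grad = (delta_Theta^T delta_Theta)^{-1} delta_Theta^T delta_R
    grad, *_ = np.linalg.lstsq(delta_Theta, delta_R, rcond=None)
    return grad

# Toy usage on a known objective J(theta) = -||theta||^2 (true gradient: -2 * theta).
theta = np.array([1.0, -2.0, 0.5])
print(lsfd_gradient(theta, lambda th: -np.sum(th ** 2)))
```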
Likelihood-ratio Policy Gradients
The likelihood-ratio methods make use of the so-called likelihood-ratio trick, given by the identity $\nabla p_\theta(y) = p_\theta(y) \nabla \log p_\theta(y)$. By using it:

$$\nabla_\theta J_\theta = \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) R(\tau)\, \mathrm{d}\tau = \mathbb{E}_{p_\theta(\tau)}\!\left[\nabla_\theta \log p_\theta(\tau) R(\tau)\right]$$

where the expectation over $p_\theta(\tau)$ is approximated by a sum over the sampled trajectories $\tau^{[i]} = \left(x^{[i]}_0, u^{[i]}_0, x^{[i]}_1, u^{[i]}_1, \ldots\right)$.
Due to the inherently noisy Monte-Carlo estimates, the resulting gradient estimates suffer from a large variance. The variance can be reduced by subtracting a baseline $b$:

$$\nabla_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\!\left[\nabla_\theta \log p_\theta(\tau)\left(R(\tau) - b\right)\right]$$

which does not bias the policy gradient estimate. The baseline $b$ is a free parameter, so we can choose it such that it minimizes the variance of the gradient estimate.
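A minimal sketch of this Monte-Carlo estimate, assuming we already have per-trajectory quantities $\nabla_\theta \log p_\theta(\tau^{[i]})$ and $R(\tau^{[i]})$ from sampled rollouts (the arrays in the usage snippet are fabricated stand-ins):

```python
import numpy as np

def likelihood_ratio_gradient(grad_log_probs, returns, baseline=None):
    """Monte-Carlo likelihood-ratio gradient estimate.

    grad_log_probs: array (N, d); the i-th row is grad_theta log p_theta(tau^[i])
    returns:        array (N,);   the i-th entry is R(tau^[i])
    baseline:       scalar b subtracted from the returns (0 if None)
    """
    b = 0.0 if baseline is None else baseline
    return np.mean(grad_log_probs * (returns - b)[:, None], axis=0)

# Usage with fabricated sample statistics (purely illustrative numbers).
rng = np.random.default_rng(0)
glp = rng.standard_normal((100, 4))      # stand-ins for grad log p_theta(tau^[i])
R = rng.standard_normal(100) + 5.0       # stand-ins for R(tau^[i])
print(likelihood_ratio_gradient(glp, R, baseline=R.mean()))
```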
Step-based likelihood-ratio methods
Step-based algorithms exploit the structure of the trajectory distribution:
$$p_\theta(\tau) = p(x_0) \prod_{t=0}^{T-1} p(x_{t+1} \mid x_t, u_t)\, \pi_\theta(u_t \mid x_t, t)$$

Taking the logarithm and then the gradient with respect to $\theta$:

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t \mid x_t, t)$$
Note that the gradient does not depend on the transition model, since the transition probabilities do not depend on $\theta$. However, this trick only works for stochastic policies.
REINFORCE
REINFORCE is one of the first policy gradient algorithms. The REINFORCE policy gradient is given by:

$$\nabla^{\text{RF}}_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t \mid x_t, t)\left(R(\tau) - b\right)\right]$$

The optimal baseline minimizes the variance of $\nabla^{\text{RF}}_\theta J_\theta$ and is computed per gradient component $h$:

$$b^{\text{RF}}_h = \frac{\mathbb{E}_{p_\theta}\!\left[\left(\sum_{t=0}^{T-1} \nabla_{\theta_h} \log \pi_\theta(u_t \mid x_t, t)\right)^2 R(\tau)\right]}{\mathbb{E}_{p_\theta}\!\left[\left(\sum_{t=0}^{T-1} \nabla_{\theta_h} \log \pi_\theta(u_t \mid x_t, t)\right)^2\right]}$$
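A sketch of the REINFORCE estimate with the component-wise optimal baseline, assuming the per-step gradients $\nabla_\theta \log \pi_\theta(u_t^{[i]} \mid x_t^{[i]}, t)$ have already been collected into an array:

```python
import numpy as np

def reinforce_gradient(step_grad_log_pis, returns):
    """REINFORCE gradient with the component-wise optimal baseline.

    step_grad_log_pis: array (N, T, d); entry [i, t] is
                       grad_theta log pi_theta(u_t^[i] | x_t^[i], t).
    returns:           array (N,); R(tau^[i]) for each sampled episode.
    """
    g = step_grad_log_pis.sum(axis=1)                 # (N, d): sum_t grad log pi per episode
    g_sq = g ** 2
    # Component-wise optimal baseline b_h = E[g_h^2 R] / E[g_h^2]
    baseline = (g_sq * returns[:, None]).mean(axis=0) / (g_sq.mean(axis=0) + 1e-12)
    return (g * (returns[:, None] - baseline)).mean(axis=0)
```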
G(PO)MDP
Despite using a step-based policy evaluation strategy, REINFORCE uses the return $R(\tau)$ of the whole episode (recall that $R(\tau) = r_T(x_T) + \sum_{t=0}^{T-1} r_t(x_t, u_t)$) to evaluate single actions. Note that rewards from the past do not depend on actions in the future, and, hence, $\mathbb{E}_{p_\theta}\!\left[\partial_\theta \log \pi_\theta(u_t \mid x_t, t)\, r_j\right] = 0$ for $j < t$. Hence, the policy gradient of G(PO)MDP is given by:
$$\nabla^{\text{GMDP}}_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\!\left[\sum_{j=0}^{T-1} \sum_{t=0}^{j} \nabla_\theta \log \pi_\theta(u_t \mid x_t, t)\left(r_j - b_j\right)\right]$$

The optimal baseline is again component-wise and now also depends on the time step $j$:

$$b^{\text{GMDP}}_{h,j} = \frac{\mathbb{E}_{p_\theta}\!\left[\left(\sum_{t=0}^{j} \nabla_{\theta_h} \log \pi_\theta(u_t \mid x_t, t)\right)^2 r_j\right]}{\mathbb{E}_{p_\theta}\!\left[\left(\sum_{t=0}^{j} \nabla_{\theta_h} \log \pi_\theta(u_t \mid x_t, t)\right)^2\right]}$$
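A minimal G(PO)MDP sketch under the same assumptions as the REINFORCE example (per-step log-policy gradients and per-step rewards are given as arrays):

```python
import numpy as np

def gpomdp_gradient(step_grad_log_pis, rewards):
    """G(PO)MDP gradient with the per-time-step optimal baseline.

    step_grad_log_pis: array (N, T, d); [i, t] = grad_theta log pi_theta(u_t^[i] | x_t^[i], t)
    rewards:           array (N, T);    [i, j] = r_j(x_j^[i], u_j^[i])
    """
    cum_g = np.cumsum(step_grad_log_pis, axis=1)       # (N, T, d): sum_{t<=j} grad log pi
    cum_g_sq = cum_g ** 2
    # b_{h,j} = E[(sum_{t<=j} grad_h log pi)^2 r_j] / E[(sum_{t<=j} grad_h log pi)^2]
    b = (cum_g_sq * rewards[:, :, None]).mean(axis=0) / (cum_g_sq.mean(axis=0) + 1e-12)
    # Sum over time steps j, average over episodes
    return (cum_g * (rewards[:, :, None] - b)).sum(axis=1).mean(axis=0)
```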
Policy Gradient Theorem Algorithm
We can use the expected reward to come at time step $t$, rather than the return $R(\tau)$, to evaluate an action $u_t$:

$$\nabla^{\text{PG}}_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t \mid x_t)\, Q^{\pi}_t(x_t, u_t)\right]$$

We can also subtract an arbitrary baseline $b_t(x)$ from $Q^{\pi}_t(x_t, u_t)$. This method can be used in combination with function approximation, and $Q^{\pi}_t(x_t, u_t)$ can be estimated by Monte-Carlo rollouts.
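A sketch of the policy-gradient-theorem estimate, where $Q^{\pi}_t(x_t, u_t)$ is approximated by the Monte-Carlo reward to come of each rollout (no baseline is subtracted in this minimal version):

```python
import numpy as np

def pg_theorem_gradient(step_grad_log_pis, rewards):
    """Policy-gradient-theorem estimate with Monte-Carlo Q values.

    Q_t(x_t, u_t) is approximated by the reward to come, sum_{j >= t} r_j,
    of the same rollout (a single-sample Monte-Carlo estimate).
    """
    # Reward to come: reverse cumulative sum over the time axis.
    q_mc = np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1]            # (N, T)
    return (step_grad_log_pis * q_mc[:, :, None]).sum(axis=1).mean(axis=0)
```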
Episode-based likelihood-ratio methods
Episode-based likelihood-ratio methods directly update the upper-level policy $\pi_\omega(\theta)$ that chooses the parameters $\theta$ of the lower-level policy $\pi_\theta(u_t \mid x_t, t)$. Refer to here for more details about the upper-level policy.

$$\nabla^{\text{PE}}_\omega J_\omega = \mathbb{E}_{\pi_\omega(\theta)}\!\left[\nabla_\omega \log \pi_\omega(\theta)\left(R(\theta) - b\right)\right]$$

The optimal baseline is given by:

$$b^{\text{PGPE}}_h = \frac{\mathbb{E}_{\pi_\omega(\theta)}\!\left[\left(\nabla_{\omega_h} \log \pi_\omega(\theta)\right)^2 R(\theta)\right]}{\mathbb{E}_{\pi_\omega(\theta)}\!\left[\left(\nabla_{\omega_h} \log \pi_\omega(\theta)\right)^2\right]}$$
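A sketch for the common special case of a Gaussian upper-level policy $\pi_\omega(\theta) = \mathcal{N}(\theta \mid \mu, \sigma^2 I)$ with fixed $\sigma$, so that $\nabla_\mu \log \pi_\omega(\theta) = (\theta - \mu)/\sigma^2$; the Gaussian choice is an assumption of the example:

```python
import numpy as np

def pgpe_gradient(mu, sigma, thetas, returns):
    """Episode-based likelihood-ratio (PGPE-style) gradient for the mean of a
    Gaussian upper-level policy pi_omega(theta) = N(theta | mu, sigma^2 I).

    thetas:  array (N, d) of sampled lower-level parameters theta^[i]
    returns: array (N,)   of returns R(theta^[i])
    """
    grad_log = (thetas - mu) / (sigma ** 2)            # grad_mu log pi_omega(theta^[i])
    g_sq = grad_log ** 2
    # Component-wise optimal baseline b_h = E[(grad_h log pi)^2 R] / E[(grad_h log pi)^2]
    b = (g_sq * returns[:, None]).mean(axis=0) / (g_sq.mean(axis=0) + 1e-12)
    return (grad_log * (returns[:, None] - b)).mean(axis=0)
```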
Natural Gradients
Natural gradients often achieve faster convergence than traditional gradients. Traditional gradients use a Euclidean metric $\delta\theta^T \delta\theta$ to determine the direction of the update $\delta\theta$, i.e., they assume that all parameter dimensions have similarly strong effects on the resulting distribution. However, small changes in $\theta$ might result in large changes of the resulting distribution $p_\theta(y)$. To achieve a stable learning process, it is desirable to enforce that the distribution $p_\theta(y)$ does not change too much in one update step, which is the key intuition behind the natural gradient: it limits the distance between the distributions $p_\theta(y)$ and $p_{\theta+\delta\theta}(y)$.
The Kullback-Leibler (KL) divergence is used to measure the distance between $p_\theta(y)$ and $p_{\theta+\delta\theta}(y)$, and the Fisher information matrix can be used to approximate the KL divergence for sufficiently small $\delta\theta$. Refer to here and here for more details on the KL divergence and the Fisher information matrix.
The Fisher information matrix is defined as:

$$F_\theta = \mathbb{E}_{p_\theta(y)}\!\left[\nabla_\theta \log p_\theta(y) \nabla_\theta \log p_\theta(y)^T\right]$$

and the KL divergence can be approximated as:

$$\mathrm{KL}\!\left(p_{\theta+\delta\theta}(y) \,\|\, p_\theta(y)\right) \approx \delta\theta^T F_\theta\, \delta\theta$$

The natural gradient update has a bounded distance:

$$\mathrm{KL}\!\left(p_{\theta+\delta\theta}(y) \,\|\, p_\theta(y)\right) \le \epsilon$$

in the distribution space. Hence, we can formulate the following optimization problem:

$$\delta\theta_{\text{NG}} = \arg\max_{\delta\theta}\ \delta\theta^T \delta\theta_{\text{VG}} \quad \text{s.t.} \quad \delta\theta^T F_\theta\, \delta\theta \le \epsilon$$
where $\delta\theta_{\text{VG}}$ is the traditional vanilla gradient update. For brevity, we do not look deeper into this optimization problem but only show its application. The natural policy gradient is given by:

$$\nabla^{\text{NG}}_\theta J_\theta = F_\theta^{-1} \nabla_\theta J_\theta$$

where $\nabla_\theta J_\theta$ can be estimated by any traditional policy gradient method.
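A minimal sketch of this two-step recipe: estimate $F_\theta$ from sampled log-probability gradients, then premultiply the vanilla gradient by $F_\theta^{-1}$ (the small ridge term is a numerical-stability assumption, not part of the theory):

```python
import numpy as np

def natural_gradient(grad_log_probs, vanilla_grad, reg=1e-6):
    """Natural gradient: premultiply the vanilla gradient by the inverse Fisher matrix.

    grad_log_probs: array (N, d) of sampled grad_theta log p_theta(y^[i]),
                    used to estimate F_theta = E[grad log p  grad log p^T].
    vanilla_grad:   vanilla policy gradient estimate (d,), from any method above.
    """
    F = grad_log_probs.T @ grad_log_probs / grad_log_probs.shape[0]
    F += reg * np.eye(F.shape[0])               # small ridge term for numerical stability
    return np.linalg.solve(F, vanilla_grad)     # F^{-1} grad_theta J_theta
```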
Step-based Natural Gradient Methods
The Fisher information matrix of the trajectory distribution can be written as the average of the Fisher information matrices for the single time steps:

$$F_\theta = \mathbb{E}_{p_\theta(\tau)}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t \mid x_t, t) \nabla_\theta \log \pi_\theta(u_t \mid x_t, t)^T\right]$$
However, estimating $F_\theta$ may be difficult as it contains $O(d^2)$ parameters. If we use compatible function approximation, $F_\theta$ does not need to be estimated explicitly. For simplicity, we do not derive compatible function approximation in detail but point out that it builds on the Policy Gradient Theorem algorithm:
Let $\tilde{A}_w(x_t, u_t, t) = \psi_t(x_t, u_t)^T w \approx Q^{\pi}_t(x_t, u_t) - b_t(x)$ be a function approximation. A good function approximation does not change the gradient in expectation, i.e., it does not introduce a bias. We use $\psi_t(x_t, u_t) = \nabla_\theta \log \pi_\theta(u_t \mid x_t, t)$ as basis functions, which is also called compatible function approximation, as the function approximation is compatible with the policy parameterization. The policy gradient using compatible function approximation can then be written as:

$$\nabla^{\text{FA}}_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t \mid x_t, t) \nabla_\theta \log \pi_\theta(u_t \mid x_t, t)^T\right] w = G_\theta w$$
Hence, in order to compute the traditional gradient, we have to estimate the weight parameters $w$ of the advantage function $\tilde{A}_w(x_t, u_t, t)$ and the matrix $G_\theta$.
Note that $G_\theta = F_\theta$. Hence, the step-based natural gradient simplifies to:

$$\nabla^{\text{NG}}_\theta J_\theta = F_\theta^{-1} \nabla^{\text{FA}}_\theta J_\theta = w$$
However, the natural gradient still requires estimating the function $\tilde{A}_w$. Due to the baseline $b_t(x)$, the function $\tilde{A}_w(x_t, u_t, t) \approx Q^{\pi}_t(x_t, u_t) - V_t(x_t)$ can be interpreted as the advantage function. The advantage function can be estimated by temporal difference methods, which also require an estimate of the value function $V_t(x_t)$.
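A sketch of the resulting procedure: regress advantage targets onto the compatible features $\psi_t = \nabla_\theta \log \pi_\theta(u_t \mid x_t, t)$ and read off the natural gradient as the fitted weights $w$. How the advantage targets are obtained (e.g. by temporal-difference learning) is assumed given here:

```python
import numpy as np

def compatible_natural_gradient(step_grad_log_pis, advantage_targets):
    """Step-based natural gradient via compatible function approximation.

    With basis functions psi_t = grad_theta log pi_theta(u_t | x_t, t), fitting
    A_w(x_t, u_t, t) ~= psi_t^T w by least squares yields w, and the natural
    gradient is exactly w (since G_theta = F_theta cancels).

    step_grad_log_pis:  array (N, T, d) of psi_t for each sampled step
    advantage_targets:  array (N, T) of advantage estimates, e.g. from
                        temporal-difference learning (assumed given here).
    """
    Psi = step_grad_log_pis.reshape(-1, step_grad_log_pis.shape[-1])   # stack all steps
    A = advantage_targets.reshape(-1)
    w, *_ = np.linalg.lstsq(Psi, A, rcond=None)    # w = (Psi^T Psi)^{-1} Psi^T A
    return w                                       # natural gradient = w
```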
Episodic Natural Actor Critic
While the advantage function is easy to learn, as its basis functions $\psi_t(x_t, u_t)$ are given by the compatible function approximation, appropriate basis functions for the value function are more difficult to specify. We would therefore like to find algorithms that avoid estimating a value function. One such algorithm is the episodic Natural Actor-Critic (eNAC) algorithm, where the estimation of the value function $V_t$ can be avoided by considering whole sample paths.
For simplicity, we omit some intermediate steps and show the final outcome directly:
Rewrite the Bellman equation:

$$\tilde{A}_w(x_t, u_t, t) + V_t(x_t) = r_t(x_t, u_t) + \int p(x' \mid x_t, u_t)\, V_{t+1}(x')\, \mathrm{d}x'$$

Summing over the whole episode then yields:

$$\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t \mid x_t, t)^T w + V_0(x_0) = \sum_{t=0}^{T-1} r_t(x_t, u_t) + r_T(x_T)$$
Now the value function $V_t$ needs to be estimated only for the first time step. For a single start state $x_0$, estimating $V_0(x_0) = v_0$ reduces to estimating a constant $v_0$. For multiple start states $x_0$, $V_0$ needs to be parameterized, $V_0(x_0) \approx \tilde{V}_v(x_0) = \varphi(x_0)^T v$. By using multiple sample paths $\tau^{[i]}$, we get a regression problem:

$$\begin{pmatrix} w \\ v \end{pmatrix} = \left(\Psi^T \Psi\right)^{-1} \Psi^T R$$

where

$$\psi^{[i]} = \left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\!\left(u^{[i]}_t \,\middle|\, x^{[i]}_t, t\right)^T,\ \varphi\!\left(x^{[i]}_0\right)^T\right], \quad \Psi = \left[\psi^{[1]}, \ldots, \psi^{[N]}\right]^T$$

and $R = \left[R^{[1]}, \ldots, R^{[N]}\right]^T$ collects the episode returns.
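A minimal eNAC sketch of this regression, assuming the summed log-policy gradients, the start-state features $\varphi(x_0^{[i]})$, and the returns $R^{[i]}$ have already been collected:

```python
import numpy as np

def enac(step_grad_log_pis, start_state_features, returns):
    """Episodic Natural Actor-Critic (eNAC) regression.

    Solves [sum_t grad log pi_theta(u_t|x_t,t)]^T w + phi(x_0)^T v = R(tau^[i])
    in the least-squares sense over N sampled episodes.

    step_grad_log_pis:    array (N, T, d)
    start_state_features: array (N, k) of phi(x_0^[i]); for a single start state
                          use a column of ones, so v reduces to a constant v_0.
    returns:              array (N,) of episode returns R^[i]
    """
    d = step_grad_log_pis.shape[-1]
    Psi = np.hstack([step_grad_log_pis.sum(axis=1), start_state_features])  # (N, d + k)
    sol, *_ = np.linalg.lstsq(Psi, returns, rcond=None)
    w, v = sol[:d], sol[d:]
    return w, v    # w is the natural gradient, v parameterizes V_0(x_0)
```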
Natural Actor Critic
eNAC uses the returns $R^{[i]}$ for evaluating the policy and, consequently, becomes less accurate for large time horizons due to the large variance of the returns. The convergence speed can be improved by directly estimating the value function. To do so, temporal difference methods first have to be adapted to learn the advantage function.

Episode-based Natural Policy Gradients
The beneficial properties of the natural gradient can also be exploited for episode-based algorithms. Such methods come from the area of evolutionary algorithms. They perform gradient ascent on a fitness function, which in the reinforcement learning context is the expected long-term reward $J_\omega$ of the upper-level policy:

$$\nabla^{\text{NES}}_\omega J_\omega = F_\omega^{-1} \nabla^{\text{PE}}_\omega J_\omega$$
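A sketch combining the episode-based likelihood-ratio gradient with an empirical Fisher matrix of the upper-level policy; the simple mean baseline and the ridge regularization are assumptions of the example:

```python
import numpy as np

def episode_based_natural_gradient(grad_log_pi_omega, returns, reg=1e-6):
    """Episode-based natural policy gradient (NES-style).

    grad_log_pi_omega: array (N, d) of grad_omega log pi_omega(theta^[i])
    returns:           array (N,)   of R(theta^[i])
    """
    R_centered = returns - returns.mean()                              # simple mean baseline
    vanilla = (grad_log_pi_omega * R_centered[:, None]).mean(axis=0)   # nabla^PE J_omega
    F = grad_log_pi_omega.T @ grad_log_pi_omega / grad_log_pi_omega.shape[0]
    return np.linalg.solve(F + reg * np.eye(F.shape[0]), vanilla)      # F^{-1} nabla^PE J_omega
```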