Proof of Policy Gradient Theorem

Introduction

Recall that policy gradient methods aim to directly optimize a parameterized policy $\pi_{\theta}(a|s)$. The objective function is defined as:

\begin{array}{cc} J(\theta) = \sum_{s\in\mathcal{S}}d^{\pi}(s)V^{\pi}(s) = \sum_{s\in \mathcal{S}}d^{\pi}(s)\sum_{a\in\mathcal{A}}\pi_{\theta}(a|s)Q^{\pi}(s,a) \end{array}

where $d^{\pi}$ is the stationary distribution of the Markov chain induced by $\pi_{\theta}$ ($d^{\pi}(s)=\lim_{t\rightarrow\infty} P(s_{t}=s|s_{0},\pi_{\theta})$ is the probability that $s_{t}=s$ when starting from $s_{0}$ and following policy $\pi_{\theta}$).
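Equivalently (assuming the induced chain is ergodic, so that this limit exists and does not depend on $s_{0}$), $d^{\pi}$ is the unique distribution satisfying the balance equation:

\begin{array}{l} d^{\pi}(s^{'})=\sum_{s\in\mathcal{S}}d^{\pi}(s)\sum_{a\in\mathcal{A}}\pi_{\theta}(a|s)P(s^{'}|s,a),\qquad \sum_{s\in\mathcal{S}}d^{\pi}(s)=1 \end{array}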

Intuitively, we can move $\theta$ in the direction suggested by the gradient $\nabla_{\theta}J(\theta)$, and the policy gradient theorem gives a nice reformulation of the derivative of $J(\theta)$ that does not involve the derivative of $d^{\pi}$:

\begin{array}{l} \nabla_{\theta}J(\theta) \propto \sum_{s\in\mathcal{S}}d^{\pi}(s)\sum_{a\in\mathcal{A}}Q^{\pi}(s,a)\nabla_{\theta}\pi_{\theta}(a|s) \end{array}

Here I'll show two different ways to prove the policy gradient theorem.

Method 1

First, expand the derivative of the state value function:

\begin{array}{l} \nabla_{\theta}V^{\pi}(s)\\ = \nabla_{\theta}\left(\sum_{a\in\mathcal{A}}\pi_{\theta}(a|s)Q^{\pi}(s,a)\right)\\ = \sum_{a\in\mathcal{A}}\left(\nabla_{\theta}\pi_{\theta}(a|s)Q^{\pi}(s,a)+\pi_{\theta}(a|s)\nabla_{\theta}Q^{\pi}(s,a)\right)\\ = \sum_{a\in\mathcal{A}}\left(\nabla_{\theta}\pi_{\theta}(a|s)Q^{\pi}(s,a)+\pi_{\theta}(a|s)\nabla_{\theta}\sum_{s^{'},r}P(s^{'},r|s,a)(r+V^{\pi}(s^{'}))\right)\\ = \sum_{a\in\mathcal{A}}\left(\nabla_{\theta}\pi_{\theta}(a|s)Q^{\pi}(s,a)+\pi_{\theta}(a|s)\sum_{s^{'},r}P(s^{'},r|s,a)\nabla_{\theta}V^{\pi}(s^{'})\right)\\ = \sum_{a\in\mathcal{A}}\left(\nabla_{\theta}\pi_{\theta}(a|s)Q^{\pi}(s,a)+\pi_{\theta}(a|s)\sum_{s^{'}}P(s^{'}|s,a)\nabla_{\theta}V^{\pi}(s^{'})\right) \end{array}

Note that this equation is a recursive form of $\nabla_{\theta}V^{\pi}(s)$ (the last two steps use the fact that $r$ and $P(s^{'},r|s,a)$ do not depend on $\theta$, so the gradient passes through them). In order to further expand $\nabla_{\theta}V^{\pi}(s)$, consider the following visitation sequence:

\begin{array}{l} s\stackrel{a\sim \pi_{\theta}(\cdot|s)}{\longrightarrow} s^{'} \stackrel{a\sim \pi_{\theta}(\cdot|s^{'})}{\longrightarrow} s^{''} \stackrel{a\sim \pi_{\theta}(\cdot|s^{''})}{\longrightarrow} ... \end{array}

Let $\rho^{\pi}(s\rightarrow x,k)$ denote the probability of transitioning from state $s$ to state $x$ after $k$ steps under policy $\pi_{\theta}$. Then it is easy to see that $\rho^{\pi}(s\rightarrow s^{'}, k=1)=\sum_{a}\pi_{\theta}(a|s)P(s^{'}|s,a)$; and, recursively, $\rho^{\pi}(s\rightarrow x, k+1)=\sum_{s^{'}}\rho^{\pi}(s\rightarrow s^{'},k)\rho^{\pi}(s^{'}\rightarrow x, 1)$.
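In matrix form (just a restatement of the recursion above), write $P^{\pi}$ for the state transition matrix induced by $\pi_{\theta}$, i.e. $P^{\pi}(s^{'}|s)=\sum_{a}\pi_{\theta}(a|s)P(s^{'}|s,a)$; then $\rho^{\pi}$ is given by matrix powers:

\begin{array}{l} \rho^{\pi}(s\rightarrow x,k)=\left[(P^{\pi})^{k}\right]_{s,x},\qquad \rho^{\pi}(s\rightarrow x,0)=\mathbb{1}[x=s] \end{array}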

For simplicity, denote $\phi(s)=\sum_{a\in\mathcal{A}}\nabla_{\theta}\pi_{\theta}(a|s)Q^{\pi}(s,a)$, then we expand $\nabla_{\theta}V^{\pi}(s)$ :

\begin{array}{l} \nabla_{\theta}V^{\pi}(s)\\ = \phi(s)+\sum_{a}\pi_{\theta}(a|s)\sum_{s^{'}}P(s^{'}|s,a)\nabla_{\theta}V^{\pi}(s^{'})\\ = \phi(s)+\sum_{s^{'}}\sum_{a}\pi_{\theta}(a|s)P(s^{'}|s,a)\nabla_{\theta}V^{\pi}(s^{'})\\ = \phi(s)+\sum_{s^{'}}\rho^{\pi}(s\rightarrow s^{'},1)\nabla_{\theta}V^{\pi}(s^{'})\\ = \phi(s)+\sum_{s^{'}}\rho^{\pi}(s\rightarrow s^{'},1)[ \phi(s^{'})+\sum_{s^{''}}\rho^{\pi}(s^{'}\rightarrow s^{''},1)\nabla_{\theta}V^{\pi}(s^{''})]\\ = \phi(s)+ \sum_{s^{'}}\rho^{\pi}(s\rightarrow s^{'},1)\phi(s^{'}) +\sum_{s^{''}}\rho^{\pi}(s\rightarrow s^{''},2)\nabla_{\theta}V^{\pi}(s^{''})\\ = \cdots\;(\text{unrolling the recursion})\\ = \sum_{x\in\mathcal{S}}\sum_{t=0}^{\infty}\rho^{\pi}(s\rightarrow x,t)\phi(x) \end{array}

Plugging this back into the gradient of $J(\theta)$:

\begin{array}{l} \nabla_{\theta}J(\theta) = \nabla_{\theta}V^{\pi}(s_{0})\;\;(\text{taking } J(\theta)=V^{\pi}(s_{0}) \text{ in the episodic case, with start state } s_{0})\\ = \sum_{s\in\mathcal{S}}\sum_{t=0}^{\infty}\rho^{\pi}(s_{0}\rightarrow s,t)\phi(s)\\ = \sum_{s\in\mathcal{S}}\eta(s)\phi(s)\\ = \left(\sum_{s^{'}}\eta(s^{'})\right)\sum_{s}\frac{\eta(s)}{\sum_{s^{'}}\eta(s^{'})}\phi(s)\\ \propto \sum_{s}\frac{\eta(s)}{\sum_{s^{'}}\eta(s^{'})}\phi(s)\\ = \sum_{s}d^{\pi}(s)\sum_{a}\nabla_{\theta}\pi_{\theta}(a|s)Q^{\pi}(s,a)\\ = \sum_{s}d^{\pi}(s)\sum_{a}\pi_{\theta}(a|s)Q^{\pi}(s,a)\frac{\nabla_{\theta}\pi_{\theta}(a|s)}{\pi_{\theta}(a|s)}\\ = \mathbb{E}_{s\sim d^{\pi};a\sim \pi_{\theta}}[ Q^{\pi}(s,a)\nabla_{\theta}\log \pi_{\theta}(a|s)] \end{array}

where $\eta(s) = \sum_{t=0}^{\infty} \rho^{\pi}(s_{0}\rightarrow s,t)$ is the expected number of visits to state $s$, and $d^{\pi}(s) = \frac{\eta(s)}{\sum_{s^{'}}\eta(s^{'})}$ is the stationary distribution.
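As a quick numerical sanity check (not part of the proof), the identity can be verified exactly in the discounted setting, where $\nabla_{\theta}V^{\pi}(s_{0})=\sum_{s}\eta(s)\sum_{a}Q^{\pi}(s,a)\nabla_{\theta}\pi_{\theta}(a|s)$ holds with equality for $\eta(s)=\sum_{t=0}^{\infty}\gamma^{t}\rho^{\pi}(s_{0}\rightarrow s,t)$. The sketch below assumes a small random tabular MDP and a tabular softmax policy; all names (P, R, value, etc.) are illustrative, and only NumPy is used.

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, s0 = 4, 3, 0.9, 0

# Random tabular MDP: transition tensor P[s, a, s'] and reward table R[s, a].
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))
theta = rng.normal(size=(nS, nA))  # parameters of a tabular softmax policy

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)  # pi[s, a] = pi_theta(a|s)

def value(theta):
    """Exact V^pi via the Bellman equation (I - gamma * P^pi) V = r^pi."""
    pi = softmax_policy(theta)
    P_pi = np.einsum('sa,sat->st', pi, P)  # P^pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    return V, pi, P_pi

V, pi, P_pi = value(theta)
Q = R + gamma * P @ V  # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
eta = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, np.eye(nS)[s0])  # eta[s] = sum_t gamma^t rho(s0 -> s, t)

# Softmax Jacobian: dpi[s, a, s, b] = d pi(a|s) / d theta[s, b] = pi(a|s) (1{a=b} - pi(b|s)).
dpi = np.zeros((nS, nA, nS, nA))
for s in range(nS):
    dpi[s, :, s, :] = np.diag(pi[s]) - np.outer(pi[s], pi[s])

# Policy gradient theorem (discounted form): grad[t, b] = sum_s eta(s) sum_a Q(s, a) * dpi[s, a, t, b].
grad_pg = np.einsum('s,sa,satb->tb', eta, Q, dpi)

# Central finite-difference gradient of J(theta) = V^pi(s0) for comparison.
grad_fd = np.zeros_like(theta)
eps = 1e-6
for s in range(nS):
    for a in range(nA):
        d = np.zeros_like(theta)
        d[s, a] = eps
        grad_fd[s, a] = (value(theta + d)[0][s0] - value(theta - d)[0][s0]) / (2 * eps)

print(np.max(np.abs(grad_pg - grad_fd)))  # tiny (finite-difference error only)

The printed gap between the policy-gradient formula and the finite-difference gradient should be on the order of the finite-difference error, confirming the identity on this toy example.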

Method 2 : Proof by Performance Difference Lemma

Performance Difference Lemma for Average Reward Setting

\begin{array}{l} J(w)-J(\theta) = \sum_{s\in\mathcal{S}}d^{w}(s)\sum_{a\in\mathcal{A}}(\pi_{w}(a|s)-\pi_{\theta}(a|s))Q_{\theta}(s,a) \end{array}

Performance Difference Lemma for Discounted Reward Setting

For all parameterized policies $\pi_{\theta}, \pi_{w}$ and all states $s_{0}$,

\begin{array}{l} V^{\pi_{w}}(s_{0}) - V^{\pi_{\theta}}(s_{0})=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi_{w}}}\mathbb{E}_{a\sim\pi_{w}(\cdot|s)}[ A^{\pi_{\theta}}(s,a)] \end{array}

where $A^{\pi}$ is the advantage function $A^{\pi}(s,a)=Q^{\pi}(s,a)-V^{\pi}(s)$.
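A quick sketch of why the discounted lemma holds (following Kakade & Langford, 2002), with $d^{\pi_{w}}$ understood as the normalized discounted state visitation distribution $d^{\pi_{w}}(s)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}P(s_{t}=s|s_{0},\pi_{w})$: write the return of $\pi_{w}$ as a telescoping sum of one-step advantages of $\pi_{\theta}$,

\begin{array}{l} V^{\pi_{w}}(s_{0})-V^{\pi_{\theta}}(s_{0}) = \mathbb{E}_{\tau\sim\pi_{w}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\right]-V^{\pi_{\theta}}(s_{0})\\ = \mathbb{E}_{\tau\sim\pi_{w}}\left[\sum_{t=0}^{\infty}\gamma^{t}\left(r_{t}+\gamma V^{\pi_{\theta}}(s_{t+1})-V^{\pi_{\theta}}(s_{t})\right)\right]\\ = \mathbb{E}_{\tau\sim\pi_{w}}\left[\sum_{t=0}^{\infty}\gamma^{t}A^{\pi_{\theta}}(s_{t},a_{t})\right]\\ = \frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi_{w}}}\mathbb{E}_{a\sim\pi_{w}(\cdot|s)}\left[ A^{\pi_{\theta}}(s,a)\right] \end{array}

where the second equality uses the telescoping identity $\sum_{t=0}^{\infty}\gamma^{t}\left(\gamma V^{\pi_{\theta}}(s_{t+1})-V^{\pi_{\theta}}(s_{t})\right)=-V^{\pi_{\theta}}(s_{0})$ and the third uses $\mathbb{E}\left[r_{t}+\gamma V^{\pi_{\theta}}(s_{t+1})\,|\,s_{t},a_{t}\right]=Q^{\pi_{\theta}}(s_{t},a_{t})$.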

Proof of Policy Gradient Theorem

The proof that $Q_{w}(s,a)$ is continuous in $w$ (which justifies evaluating $\lim_{\epsilon\rightarrow 0}Q_{w_{\epsilon,i}}(s,a)=Q_{w}(s,a)$ below) is omitted. Let $\theta=w_{\epsilon,i}=w-\epsilon e_{i}$, where $e_{i}$ is the $i$-th coordinate vector. Then, by the performance difference lemma for the average reward setting,

\begin{array}{l} \frac{\partial J(w)}{\partial w_{i}}=\underset{\epsilon\rightarrow 0}{\lim}\frac{J(w)-J(w_{\epsilon,i})}{\epsilon}\\ = \underset{\epsilon\rightarrow 0}{\lim}\frac{\sum_{s\in\mathcal{S}}d^{w}(s)\sum_{a\in\mathcal{A}}(\pi_{w}(a|s)-\pi_{w_{\epsilon,i}}(a|s))Q_{w_{\epsilon,i}}(s,a)}{\epsilon}\\ =\underset{\epsilon\rightarrow 0}{\lim}\sum_{s\in\mathcal{S}}d^{w}(s)\sum_{a\in\mathcal{A}}\frac{\pi_{w}(a|s)-\pi_{w_{\epsilon,i}}(a|s)}{\epsilon}Q_{w_{\epsilon,i}}(s,a)\\ =\sum_{s\in\mathcal{S}}d^{w}(s)\sum_{a\in\mathcal{A}}\left(\underset{\epsilon\rightarrow 0}{\lim}\frac{\pi_{w}(a|s)-\pi_{w_{\epsilon,i}}(a|s)}{\epsilon}\right)\left(\underset{\epsilon\rightarrow 0}{\lim} Q_{w_{\epsilon,i}}(s,a)\right)\\ =\sum_{s\in\mathcal{S}}d^{w}(s)\sum_{a\in\mathcal{A}}\frac{\partial \pi_{w}(a|s)}{\partial w_{i}}Q_{w}(s,a) \end{array}

Hence $\nabla_{w}J(w) = \sum_{s\in\mathcal{S}}d^{w}(s)\sum_{a\in\mathcal{A}}\nabla_{w}\pi_{w}(a|s)Q_{w}(s,a)$.
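As with Method 1, this average-reward statement can be checked numerically. The sketch below assumes a small random tabular MDP and a tabular softmax policy, takes $d^{w}$ to be the stationary distribution of $\pi_{w}$ and $Q_{w}$ to be the differential action-value function (obtained from the Poisson equation), and compares the formula above, with the softmax gradient plugged in, against a finite-difference gradient of $J(w)=\sum_{s}d^{w}(s)\sum_{a}\pi_{w}(a|s)r(s,a)$. Variable names are illustrative; only NumPy is used.

import numpy as np

rng = np.random.default_rng(1)
nS, nA = 4, 3

# Random (ergodic) tabular MDP: transitions P[s, a, s'] and rewards R[s, a].
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))
w = rng.normal(size=(nS, nA))  # parameters of a tabular softmax policy

def softmax_policy(w):
    z = np.exp(w - w.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def average_reward(w):
    """Average reward J(w), stationary distribution d^w, policy pi_w, differential Q_w."""
    pi = softmax_policy(w)
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = (pi * R).sum(axis=1)
    evals, evecs = np.linalg.eig(P_pi.T)  # d^w: left eigenvector of P^pi for eigenvalue 1
    d = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
    d = d / d.sum()
    J = d @ r_pi
    # Differential values from the Poisson equation (I - P^pi) V = r^pi - J, normalized so d @ V = 0.
    V = np.linalg.solve(np.eye(nS) - P_pi + np.outer(np.ones(nS), d), r_pi - J)
    Q = R - J + P @ V  # differential action values Q_w[s, a]
    return J, d, pi, Q

J0, d, pi, Q = average_reward(w)
# Theorem above with the tabular-softmax gradient plugged in:
# dJ/dw[s, b] = d(s) * pi(b|s) * (Q(s, b) - sum_a pi(a|s) Q(s, a)).
grad_pg = d[:, None] * pi * (Q - (pi * Q).sum(axis=1, keepdims=True))

# Central finite-difference gradient of J(w) for comparison.
grad_fd = np.zeros_like(w)
eps = 1e-6
for s in range(nS):
    for a in range(nA):
        e = np.zeros_like(w)
        e[s, a] = eps
        grad_fd[s, a] = (average_reward(w + e)[0] - average_reward(w - e)[0]) / (2 * eps)

print(np.max(np.abs(grad_pg - grad_fd)))  # tiny (finite-difference error only)

Note that adding a constant to $Q_{w}(s,\cdot)$ does not change the gradient formula, since $\sum_{a}\nabla_{w}\pi_{w}(a|s)=0$; this is why the normalization of the differential values is irrelevant here.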

References

Gergely Neu. Twitter post, 2019. https://twitter.com/neu_rips/status/1180466116444987392

Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes. arXiv preprint arXiv:1908.00261, 2019.

Sham Kakade and John Langford. Approximately Optimal Approximate Reinforcement Learning. In Proceedings of the 19th International Conference on Machine Learning, pages 267–274, 2002.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, 2nd Edition. MIT Press, 2018.
