An Operator View of Policy Gradient Methods
Introduction
As I wrote in my recent post, in PG algorithms the update direction is not actually a gradient, so it is not obvious what a single update achieves or why the procedure eventually converges to a good policy. A recent work by Ghosh et al. answers this question by viewing PG as the repeated application of two operators: a policy improvement operator $\mathcal{I}$, which maps any policy to a policy achieving a strictly larger return, and a projection operator $\mathcal{P}$, which finds the best approximation of this new policy in the space of realizable policies. This perspective is valuable because it bridges the gap between policy-based and value-based methods.
Preliminaries
Notations
An infinite-horizon discounted MDP is defined by the tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, p, r, d_{0}, \gamma\rangle$, where $\mathcal{S}$ is a finite state set, $\mathcal{A}$ is a finite action set, $p : \mathcal{S}\times \mathcal{A}\rightarrow \Delta(\mathcal{S})$ is the transition probability function, $r: \mathcal{S}\times \mathcal{A}\rightarrow [0, R_{\max}]$ is the reward function, $d_{0}$ is the initial state distribution, and $\gamma\in[0,1)$ is the discount factor. Here $\Delta(\cdot)$ denotes the probability simplex. The agent's goal is to learn a policy $\pi: \mathcal{S}\rightarrow \Delta(\mathcal{A})$ that maximizes the expected discounted return.
Trajectory Formulation
The expected discounted return can be defined as $J(\pi) = \mathbb{E}_{s_{0}, a_{0}, ...}[\sum_{t=0}^{\infty} \gamma^{t}r(s_{t}, a_{t})]$, where $s_{0}\sim d_{0}, a_{t}\sim \pi(a_{t}|s_{t})$, and $s_{t+1}\sim p(s_{t+1}|s_{t}, a_{t})$. Let $\tau = \langle s_{0}, a_{0}, s_{1}, ... \rangle$ denote a specific trajectory, and let $R(\tau)=\sum_{t=0}^{\infty}\gamma^{t}r(s_{t}, a_{t})$ denote the return of trajectory $\tau$.
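As a quick concrete illustration (a minimal sketch with made-up rewards, not an example from the paper), the return $R(\tau)$ of a truncated trajectory can be computed as:

```python
# Minimal sketch: the discounted return R(tau) of a (truncated) trajectory.
# The reward values below are made up purely for illustration.
gamma = 0.99
rewards = [1.0, 0.0, 0.5, 2.0]  # r(s_t, a_t) along a sampled trajectory

R_tau = sum(gamma**t * r for t, r in enumerate(rewards))
print(R_tau)  # 1.0 + 0.99*0.0 + 0.99**2*0.5 + 0.99**3*2.0
```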
PG methods aim to find the policy $\pi^{*} = \arg\max_{\pi}\int_{\tau} R(\tau)\pi(\tau)d\tau$. In practice, the search is restricted to a class $\Pi$ parametrized by $\theta\in\mathbb{R}^{d}$, so the problem becomes $\theta^{*} = \arg\max_{\theta}J(\pi_{\theta}) = \arg\max_{\theta}\int_{\tau}R(\tau)\pi_{\theta}(\tau)d\tau$.
One of the traditional PG methods, REINFORCE (by Williams), performs the following update:
\(\theta_{t+1} = \theta_{t} + \epsilon_{t}\int_{\tau}\pi_{\theta_{t}}(\tau)R(\tau)\frac{\partial \log \pi_{\theta}(\tau)}{\partial \theta}|_{\theta = \theta_{t}}d\tau\)
where $\epsilon_{t}$ is a stepsize.
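To make the update concrete, here is a minimal sample-based sketch of the REINFORCE step for a softmax policy on a one-state (bandit-like) problem; the toy environment, `n_actions`, and the reward means are my own assumptions for illustration, not part of the paper.

```python
import numpy as np

# Minimal sketch of a sample-based REINFORCE update for a softmax policy
# on a single-state (bandit) problem. The environment is assumed/made up.
rng = np.random.default_rng(0)
n_actions = 3
true_means = np.array([0.2, 1.0, 0.5])   # hypothetical expected rewards

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.zeros(n_actions)
eps = 0.1                                 # step size epsilon_t

for t in range(2000):
    pi = softmax(theta)
    a = rng.choice(n_actions, p=pi)
    R = true_means[a] + rng.normal(0, 0.1)   # sampled return R(tau)
    grad_log_pi = -pi.copy()
    grad_log_pi[a] += 1.0                    # d log pi(a) / d theta for softmax
    theta += eps * R * grad_log_pi           # REINFORCE step

print(softmax(theta))  # most mass should end up on the best action (index 1)
```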
Value-Based Formulation
The state value function is $V^{\pi}(s_{t}) = \mathbb{E}_{a_{t}, s_{t+1}, ...}[\sum_{k=0}^{\infty} \gamma^{k}r(s_{t+k},a_{t+k})]$, and the state-action value function is $Q^{\pi}(s_{t}, a_{t}) = \mathbb{E}_{s_{t+1}, a_{t+1}, ...}[\sum_{k=0}^{\infty} \gamma^{k}r(s_{t+k}, a_{t+k})]$, where $a_{t}\sim \pi(a_{t}|s_{t})$ and $s_{t+1}\sim p(s_{t+1}|s_{t}, a_{t})$. The agent aims to maximize $\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi}(s_{0})]$, and Sutton et al. gave an update for this state-action formulation (equivalent to REINFORCE above):
\(\theta_{t+1} = \theta_{t} + \epsilon\sum_{s}d^{\pi_{t}}(s)\sum_{a}\pi_{t}(a|s)Q^{\pi_{t}}(s, a)\frac{\partial \log \pi_{\theta}(a|s)}{\partial \theta}\Big|_{\theta = \theta_{t}}\)
where $d^{\pi}$ is the discounted stationary distribution induced by policy $\pi$.
An Operator View of REINFORCE
In short, all of these approaches can be cast as minimizing a divergence measure between the current policy $\pi$ and a fixed policy $\mu$ which achieves a higher return than $\pi$. Moving from $\pi$ to $\mu$ can be seen as a policy improvement step $\mu = \mathcal{I}(\pi)$, where $\mathcal{I}$ is the improvement operator. Since $\mu$ might not lie in the set of realizable policies, the divergence minimization acts as a projection step using a projection operator $\mathcal{P}$. Which improvement/projection operators should we use? They should satisfy (a) the optimal policy $\pi(\theta^{*})$ is a stationary point of the composition $\mathcal{P}\circ \mathcal{I}$, and (b) an approximate projection step applied to $\mathcal{I}\pi$ always leads to a better policy than $\pi$. Next, an operator view of REINFORCE is presented, for both the trajectory formulation and the value-based formulation.
Trajectory Formulation
Proposition 1 Assuming all returns $R(\tau)$ are positive, REINFORCE (trajectory formulation) can be seen as doing a gradient step to minimize $KL(R\pi_{t}||\pi)$, where the distribution $R\pi_{t}$ is defined by \(R\pi_{t}(\tau) = \frac{1}{J(\pi_{t})} R(\tau)\pi_{t}(\tau).\) Hence, the two operators associated with OP-REINFORCE are \(\mathcal{I}_{\tau}(\pi)(\tau) = R\pi(\tau);\quad \mathcal{P}_{\tau}(\mu) = \arg\min_{\pi\in \Pi}KL(\mu||\pi),\) where $\Pi$ is the set of realizable policies.
Proof of Proposition 1: Let $\mu$ denote the distribution over trajectories such that $\mu(\tau)\propto R(\tau)\pi_{t}(\tau)$. Since $J(\pi_{t}) = \int_{\tau}R(\tau)\pi_{t}(\tau)d\tau$, we have $\mu(\tau) = \frac{1}{J(\pi_{t})}R(\tau)\pi_{t}(\tau)$. Then
\(KL(\mu||\pi) = \int_{\tau}\mu(\tau)\log\frac{\mu(\tau)}{\pi(\tau)}d\tau = -\int_{\tau}\mu(\tau)\log\frac{\pi(\tau)}{\mu(\tau)}d\tau \\ = -\int_{\tau}\mu(\tau)\log\pi(\tau)d\tau+\int_{\tau}\mu(\tau)\log\mu(\tau)d\tau.\)
Fixing $\mu$ and taking the derivative with respect to $\theta$ (the projection step), we have
\(\frac{\partial KL(\mu||\pi)}{\partial \theta} = -\int_{\tau} \mu(\tau)\nabla_{\theta}\log \pi(\tau)d\tau\\ \propto -\int_{\tau}R(\tau)\pi(\tau)\nabla_{\theta}\log \pi(\tau) d\tau\)
which is exactly the update direction of REINFORCE (in the trajectory formulation).
Note, however, that OP-REINFORCE is different from REINFORCE, since it solves the projection exactly rather than doing just one step of gradient descent. Hence it remains to show that OP-REINFORCE actually converges to the optimum:
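To see the difference concretely, here is a minimal sketch (a toy construction of my own, not from the paper) of OP-REINFORCE on a bandit where trajectories are single actions: the improvement step reweights the current policy by the positive returns, and the projection step fits a restricted softmax class to that target by minimizing the KL to (near) convergence instead of taking a single gradient step.

```python
import numpy as np

# Minimal sketch of OP-REINFORCE on a toy bandit (all numbers assumed for
# illustration). Trajectories are single actions and returns R(a) are positive.
R = np.array([1.0, 2.0, 4.0, 3.0])          # positive "returns" per action
phi = np.array([-1.5, -0.5, 0.5, 1.5])      # fixed features; the realizable class
                                            # is pi_theta(a) propto exp(theta * phi[a])

def policy(theta):
    z = np.exp(theta * phi)
    return z / z.sum()

def project(mu, theta=0.0, lr=0.5, n_steps=500):
    """P(mu): minimize KL(mu || pi_theta) over the restricted class by gradient descent."""
    for _ in range(n_steps):
        pi = policy(theta)
        # For this exponential family, d/dtheta KL(mu || pi_theta) = E_pi[phi] - E_mu[phi]
        theta -= lr * (pi @ phi - mu @ phi)
    return theta

theta = 0.0
for _ in range(10):
    pi = policy(theta)
    mu = R * pi / (R @ pi)                     # improvement step: I_tau(pi) = R pi
    theta = project(mu, theta)                 # projection step, solved to (near) convergence
    print(round(float(R @ policy(theta)), 3))  # J(pi_theta); should trend upward
```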
Proposition 2 $\pi(\theta^{*})$ is a fixed point of $\mathcal{P}_{\tau}\circ \mathcal{I}_{\tau}$.
Proof of Proposition 2:
\(\nabla_{\theta}KL(R\pi^{*}||\pi_{\theta})\big|_{\theta=\theta^{*}} \propto -\int_{\tau}R(\tau)\pi^{*}(\tau)\nabla_{\theta}\log\pi_{\theta}(\tau)\big|_{\theta=\theta^{*}}d\tau = 0,\) where the last equality holds because $\theta^{*}$ maximizes $J(\pi_{\theta})$.
Value-Based Formulation
Proposition 3 If all $Q^{\pi}(s, a)$ are positive, then REINFORCE (in the value-based formulation) can be seen as doing a gradient step to minimize \(D_{d^{\pi_{t}}V^{\pi_{t}}}(Q^{\pi_{t}}\pi_{t}||\pi) = \sum_{s}d^{\pi_{t}}(s)V^{\pi_{t}}(s)KL(Q^{\pi_{t}}\pi_{t}(\cdot|s)||\pi(\cdot|s)),\) where the weighted divergence $D_{z}$ and the distribution $Q^{\pi}\pi$ over actions are defined as: \(D_{z}(\mu||\pi) = \sum_{s}z(s)KL(\mu(\cdot|s)||\pi(\cdot|s)),\) \(Q^{\pi}\pi(a|s) = \frac{1}{\sum_{a'}Q^{\pi}(s,a')\pi(a'|s)}Q^{\pi}(s,a)\pi(a|s)=\frac{1}{V^{\pi}(s)}Q^{\pi}(s,a)\pi(a|s).\) Hence the two operators associated with the state-action formulation are: \(\mathcal{I}_{V}(\pi)(s, a) = \left(\frac{1}{\mathbb{E}_{\pi}[V^{\pi}]}d^{\pi}(s)V^{\pi}(s)\right)Q^{\pi}\pi(a|s),\) \(\mathcal{P}_{V}(\mu) = \arg\min_{z\in\Pi}\sum_{s}\mu(s)KL(\mu(\cdot|s)||z(\cdot|s)).\)
Proof of Proposition 3: First, note that $\mathcal{I}_{V}(\pi)$ is a valid joint distribution over state-action pairs:
\(\sum_{s}\sum_{a}\mathcal{I}_{V}(\pi)(s,a) = \sum_{s}\frac{d^{\pi}(s)V^{\pi}(s)}{\mathbb{E}_{\pi}[V^{\pi}]}\sum_{a}\frac{Q^{\pi}(s,a)\pi(a|s)}{V^{\pi}(s)}=\sum_{s}\frac{d^{\pi}(s)V^{\pi}(s)}{\mathbb{E}_{\pi}[V^{\pi}]}=1,\) where we used $V^{\pi}(s)=\sum_{a}Q^{\pi}(s,a)\pi(a|s)$ and $\mathbb{E}_{\pi}[V^{\pi}]=\sum_{s}d^{\pi}(s)V^{\pi}(s)$.
So by applying the improvement operator $\mathcal{I}_{V}$ to $\pi$, we get a distribution $\mu$ over state-action pairs which puts more weight on states $s$ with large value $V^{\pi}(s)$, and on actions $a$ with large value $Q^{\pi}(s, a)$.
Next, fixing $Q^{\pi_{t}}\pi_{t}$ and taking the derivative of $D_{d^{\pi_{t}}V^{\pi_{t}}}(Q^{\pi_{t}}\pi_{t}||\pi)$ with respect to $\theta$ (the projection step),
\(\sum_{s}d^{\pi_{t}}(s)V^{\pi_{t}}(s)\frac{\partial KL(Q^{\pi_{t}}\pi_{t}||\pi)}{\partial \theta} \\ =-\sum_{s}d^{\pi_{t}}(s)V^{\pi_{t}}(s)\sum_{a}Q^{\pi_{t}}\pi_{t}(a|s)\nabla_{\theta}\log\pi(a|s)\\ =-\sum_{s}d^{\pi_{t}}(s)\sum_{a}Q^{\pi_{t}}(s,a)\pi_{t}(a|s)\nabla_{\theta}\log\pi(a|s),\)
which gives the update for the state-action formulation.
As before, this is different from the original algorithm because it solves the projection exactly, so it remains to show that it converges to the optimum.
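For the value-based operators, here is a minimal sketch on a hand-built 2-state / 2-action MDP (all numbers are assumptions for illustration): it computes $d^{\pi}$, $V^{\pi}$, $Q^{\pi}$ in closed form, builds $\mathcal{I}_{V}(\pi)(s,a)$, and checks that it is a valid joint distribution which upweights state-action pairs with large values.

```python
import numpy as np

# Minimal sketch (toy MDP, numbers assumed): compute d^pi, V^pi, Q^pi for a
# 2-state / 2-action MDP with positive rewards, build I_V(pi)(s, a), and check
# it is a joint distribution over state-action pairs.
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.2],                   # r[s, a] > 0, so Q^pi > 0
              [0.5, 2.0]])
d0 = np.array([0.5, 0.5])
pi = np.array([[0.5, 0.5],                  # pi[s, a], the current policy
               [0.5, 0.5]])

P_pi = np.einsum('sa,sat->st', pi, P)       # state-to-state kernel under pi
r_pi = (pi * r).sum(axis=1)
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)                  # V^pi
Q = r + gamma * np.einsum('sat,t->sa', P, V)                         # Q^pi
d = (1 - gamma) * np.linalg.solve((np.eye(2) - gamma * P_pi).T, d0)  # d^pi

EV = d @ V                                                    # E_{d^pi}[V^pi]
I_V = (d[:, None] * V[:, None]) * (Q * pi / V[:, None]) / EV  # I_V(pi)(s, a)

print(I_V.sum())                # ~1.0: a valid joint distribution over (s, a)
print(I_V / (d[:, None] * pi))  # ratio > 1 exactly where Q^pi(s,a) > E_{d^pi}[V^pi]
```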
Proposition 4 $\pi(\theta^{*})$ is a fixed point of $\mathcal{P}_{V}\circ \mathcal{I}_{V}$.
Proof of Proposition 4: By definition of $\pi^{*}$, we have
\(\nabla_{\theta}\sum_{s}d^{\pi^{*}}(s)V^{\pi^{*}}(s)KL(Q^{\pi^{*}}\pi^{*}||\pi)|_{\pi=\pi^{*}}\\ = \sum_{s}d^{\pi^{*}}(s)\sum_{a}\pi^{*}(a|s)Q^{\pi^{*}}(s,a)\frac{\partial \log\pi_{\theta}(a|s)}{\partial \theta}|_{\theta=\theta^{*}}=0.\)
Properties of Operators
First, it can be shown that
Proposition 5 \(J(\mathcal{I}_{\tau}(\pi))=J(\pi)\left(1+\frac{Var_{\pi}(R)}{(\mathbb{E}_{\pi}[R])^{2}}\right)\geq J(\pi).\)
Proof of Proposition 5: Define $z\pi(\tau)=\frac{1}{\int_{\tau'}z(\tau')\pi(\tau')d\tau'}z(\tau)\pi(\tau)$ for any positive function $z$ over trajectories.
\(J(z\pi)=\int_{\tau}R(\tau)(z\pi)(\tau)d\tau\\ = \int_{\tau}\frac{R(\tau)z(\tau)\pi(\tau)}{\int_{\tau^{'}}z(\tau^{'})\pi(\tau^{'})d\tau^{'}}d\tau\\ = \left(\int_{\tau^{'}}R(\tau^{'})\pi(\tau^{'})d\tau^{'}\right)\frac{\int_{\tau}R(\tau)z(\tau)\pi(\tau)d\tau}{\int_{\tau^{'}}R(\tau^{'})\pi(\tau^{'})d\tau^{'}\int_{\tau^{'}}z(\tau^{'})\pi(\tau^{'})d\tau^{'}}\\ = J(\pi)\frac{\mathbb{E}_{\pi}[Rz]}{\mathbb{E}_{\pi}[R]\mathbb{E}_{\pi}[z]} \)
Since $Cov_{\pi}(R,z)=\mathbb{E}_{\pi}[Rz]-\mathbb{E}_{\pi}[R]\mathbb{E}_{\pi}[z]$, then
\(J(z\pi)=J(\pi)\left(1+\frac{Cov_{\pi}(R,z)}{\mathbb{E}_{\pi}[R]\mathbb{E}_{\pi}[z]}\right)\)
Let $z=R$ (in Proposition 1), then we have $J(R\pi)=J(\pi)\left(1+\frac{Var_{\pi}(R)}{(\mathbb{E}_{\pi}[R])^{2}}\right)\geq J(\pi)$.
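A quick numerical sanity check of this identity (on a made-up bandit where trajectories are single actions, not an example from the paper):

```python
import numpy as np

# Sanity check of Proposition 5 on a toy bandit where trajectories = actions.
# Returns and the policy below are arbitrary positive numbers for illustration.
R = np.array([1.0, 3.0, 0.5, 2.0])
pi = np.array([0.4, 0.1, 0.3, 0.2])

J = R @ pi                                  # J(pi) = E_pi[R]
var = (R**2) @ pi - J**2                    # Var_pi(R)
R_pi = R * pi / J                           # improvement step: I_tau(pi) = R pi

print(R @ R_pi)                             # J(I_tau(pi))
print(J * (1 + var / J**2))                 # matches: J(pi) * (1 + Var/E^2)
```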
This property intuitively implies that if $Var_{\pi}(R)\approx 0$, i.e., the environment and the policy are (almost) deterministic, then $J(\mathcal{I}_{\tau}(\pi))\approx J(\pi)$, so the improvement step barely improves the policy. (This echoes the issues raised in 'Is the Policy Gradient a Gradient?': near-deterministic policies can be problematic for PG methods.)
However, the projection step may decrease the return, so $\mathcal{P}\circ \mathcal{I}(\pi)$ could end up with a smaller expected return than $\pi$. Luckily, the proposition below gives a lower bound on the expected return after projection.
Proposition 6 For any two policies $\pi$ and $\mu$ such that the support of $\mu$ covers that of $\pi$, we have \(J(\pi)\geq J(\mu)+\mathbb{E}_{\mu}[V^{\mu}(s)]\left(D_{\mu}(\mathcal{I}_{V}(\mu)||\mu) - D_{\mu}(\mathcal{I}_{V}(\mu)||\pi)\right).\) Hence, any policy $\pi$ such that $D_{\pi_{t}}(\mathcal{I}_{V}(\pi_{t})||\pi) < D_{\pi_{t}}(\mathcal{I}_{V}(\pi_{t})||\pi_{t})$ satisfies $J(\pi)>J(\pi_{t})$.
An Operator View of PPO
Recall that PPO maximizes a surrogate objective: at each state it tries to maximize $\sum_{a}\pi(a|s)Q^{\mu}(s,a)$, while the distribution over states and the $Q$ values are kept fixed (computed under the old policy $\mu$). It can be shown that PPO corresponds to the operators below:
\(\mathcal{I}_{V}(\pi)(s,a)=d^{\pi}(s)\frac{\exp(\beta Q^{\pi}(s,a))}{\sum_{a'}\exp(\beta Q^{\pi}(s,a'))}\)
\(\mathcal{P}_{V}(\mu)=\arg\min_{z\in\Pi}\sum_{s}\mu(s)KL(\mathrm{clip}(z(\cdot|s))||\mu(\cdot|s))\)
Note that $\mathcal{I}_{V}$ does not contain $V(s)$, so this policy improvement operator does not increase the probability of 'good states'. Also, the KL in the operator $\mathcal{P}_{V}$ is reversed; since the reversed KL is mode-seeking, the resulting policy will concentrate its mass on the mode of $\mathcal{I}_{V}$, which can quickly lead to (nearly) deterministic policies, and this is why the clipping is necessary. When I first read this, I wondered whether it conflicts with the conclusion of 'Implementation Matters in Deep RL: A Case Study on PPO and TRPO', but the point is: if this operator version of PPO were executed exactly, it would perform well; in practice it is not executed exactly, since we do not solve the $\arg\min$ and we use an approximate critic.
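For intuition, here is a tiny sketch (assumed Q-values, not from the paper) of this improvement operator at a single state: it is just a Boltzmann distribution over $Q^{\pi}(s,\cdot)$ with inverse temperature $\beta$, ignoring both $\pi(a|s)$ and $V(s)$; for larger $\beta$ it becomes nearly deterministic, which is the behaviour the clipping is meant to keep in check.

```python
import numpy as np

# Toy illustration of the PPO-style improvement operator at one state:
# a Boltzmann distribution over assumed Q-values, ignoring pi(a|s) and V(s).
Q_s = np.array([1.0, 2.0, 1.5])             # hypothetical Q^pi(s, .)

def ppo_improvement(Q_s, beta):
    w = np.exp(beta * Q_s)
    return w / w.sum()

print(ppo_improvement(Q_s, beta=1.0))       # moderately peaked
print(ppo_improvement(Q_s, beta=10.0))      # nearly deterministic for large beta
```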
An Operator View of Control-as-Inference and MPO
The policy improvement operator that recovers MPO is:
\(\mathcal{I}_{V}(\pi)(s,a) = d^{\pi}(s)\frac{\pi(a|s)\exp(\beta Q^{\pi}(s,a))}{\sum_{a'}\pi(a'|s)\exp(\beta Q^{\pi}(s,a'))}\)
and the projection operator is
\(\mathcal{P}_{V}(\mu) = \arg\min_{\pi\in\Pi} KL(\mu||\pi)\)
Notice that the $\mathcal{I}_{V}$ of MPO is an interpolation between those of OP-REINFORCE and PPO: it does not contain $V(s)$, so it does not upweight good states, but it does contain $\pi(a|s)$, so it will not quickly collapse to a deterministic policy, and clipping is not necessary here.
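A small side-by-side sketch (toy numbers assumed, not from the paper) of the per-state action weights used by the three improvement operators: OP-REINFORCE reweights by $Q\cdot\pi$, PPO by $\exp(\beta Q)$ alone, and MPO by $\pi\cdot\exp(\beta Q)$.

```python
import numpy as np

# Toy comparison of the per-state action reweighting used by the three
# improvement operators (Q-values and pi below are made up for illustration).
Q_s = np.array([1.0, 2.0, 1.5])
pi_s = np.array([0.6, 0.1, 0.3])
beta = 1.0

def normalize(w):
    return w / w.sum()

print(normalize(Q_s * pi_s))                 # OP-REINFORCE: Q * pi
print(normalize(np.exp(beta * Q_s)))         # PPO: exp(beta Q), ignores pi
print(normalize(pi_s * np.exp(beta * Q_s)))  # MPO: pi * exp(beta Q)
```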
The original MPO algorithm is rather complicated to implement, and this operator view may hint at a simpler, more practical way to realize MPO.
Summary and Reflections
In my view, this work is a milestone in the analysis of PG methods: it explains what a 'gradient ascent step' actually achieves, and it sheds light on further algorithm design. It also yields some nice insights, e.g., that entropy regularization helps by increasing the variance of the returns, and why the clipping trick in PPO is necessary. This operator perspective also allows us to further bridge the gap between policy-based and value-based methods, e.g., REINFORCE and the Bellman optimality operator can be seen as instances of the same operator scheme.
References
Dibya Ghosh, Marlos C. Machado, Nicolas Le Roux. An Operator View of Policy Gradient Methods. arXiv preprint arXiv:2006.11266, 2020.
Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, Aleksander Madry. Implementation Matters in Deep RL: A Case Study on PPO and TRPO. https://openreview.net/forum?id=r1etN1rtPB.
Chris Nota, Philip S. Thomas. Is the Policy Gradient a Gradient? arXiv preprint arXiv:1906.07073, 2019.