MPO

Maximum a Posteriori Policy Optimization

Background

Policy gradient algorithms like TRPO or PPO in practice require carefully tuned entropy regularization to prevent policy collapse. Moreover, there is work showing that much of PPO's strong performance comes from code-level optimizations. We cannot help but wonder: is 'linearization/quadratization + trust region' really a good idea? As an alternative to policy gradient methods, V-MPO is an on-policy adaptation of MPO (Maximum a Posteriori Policy Optimization) that performs policy iteration based on a learned state-value function.

In this post I'll concentrate only on MPO, a data-efficient off-policy method.

Introduction to MPO

A key inspiration of this method is the duality between control and estimation: replacing the question "what are the actions which maximize future rewards?" with the question "assuming future success in maximizing rewards, what are the actions most likely to have been taken?". By casting the RL problem as inference in a particular probabilistic model, we can apply the EM (Expectation-Maximization) algorithm to solve control problems.

Review on EM algorithm

In a probabilistic model, there are visible variables ($y$), latent variables ($z$) and associated parameters ($\theta$). We aim to maximize the likelihood $p(y|\theta)$. Let $q(z)$ be an arbitrary distribution over the latent variable $z$. By Bayes' rule and some rearranging, we have \(\begin{array}{cc} p(z|y,\theta) = \frac{p(y|z,\theta)p(z|\theta)}{p(y|\theta)}\\ p(y|\theta) = \frac{p(y|z,\theta)p(z|\theta)}{p(z|y,\theta)}\\ p(y|\theta)=\frac{p(y|z,\theta)p(z|\theta)}{q(z)}\cdot\frac{q(z)}{p(z|y,\theta)} \end{array}\)

By taking logarithms on both sides, we get

\(\begin{array}{cc} \log p(y|\theta) = \log\frac{p(y|z,\theta)p(z|\theta)}{q(z)}+\log\frac{q(z)}{p(z|y,\theta)} \end{array}\)

By taking the expectation of both sides w.r.t. $q(z)$ (the left-hand side does not depend on $z$), we get

\(\begin{array}{cc} \mathbb{E}[\log p(y|\theta)] = \int q(z)\log\frac{p(y|z,\theta)p(z|\theta)}{q(z)}dz + \int q(z)\log\frac{q(z)}{p(z|y,\theta)}dz \end{array}\)

Equivalently, since the log function is concave, Jensen's inequality gives

\(\begin{array}{cc} \log p(y|\theta) = \log \mathbb{E}_{q}\left[\frac{p(y|z,\theta)p(z|\theta)}{q(z)}\right]\geq \mathbb{E}_{q}\left[\log\frac{p(y|z,\theta)p(z|\theta)}{q(z)}\right], \end{array}\)

which leads to the same lower bound.

Since the term $\int q(z)\log\frac{q(z)}{p(z|y,\theta)}dz$ is the KL divergence between $q(z)$ and $p(z|y,\theta)$ and hence always non-negative, the log likelihood of $y$ given $\theta$ has the lower bound \(\begin{array}{cc} F(q(z),\theta)=\int q(z)\log\frac{p(y|z,\theta)p(z|\theta)}{q(z)}dz \end{array}\)

Note that the maximum of the lower bound is not, in general, the maximum of the actual log likelihood, but it can be shown that by alternating the E-step and M-step below, the log likelihood itself is non-decreasing at every iteration.

E-step

The E-step starts from a fixed $\theta(t)$ and maximizes $F(q(z),\theta)$ w.r.t. $q(z)$. This maximum is attained when the lower bound $F(q(z),\theta)$ meets the objective log likelihood, i.e., when the KL divergence between $q(z)$ and $p(z|y,\theta)$ is 0.

Hence the E-step gives $q(z)=p(z|y,\theta(t))$.

M-step

The M-step maximizes $F(q(z),\theta)$ w.r.t. $\theta$ with $q(z)$ fixed. We can ignore the $q(z)$ in the denominator of $F(q(z),\theta)$ since it is independent of $\theta$, so the M-step can be written as \(\begin{array}{cc} \theta_{t}= \underset{\theta}{arg\; max} \int q_{t}(z)\log(p(y|z,\theta)p(z|\theta)) dz \end{array}\)
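
To make the E-step/M-step alternation concrete, here is a minimal sketch of EM for a two-component 1-D Gaussian mixture, where the component assignment plays the role of the latent $z$; the data, initialization and iteration count are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians; the latent z is the component id.
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

# Initial parameters theta = (mixing weights, means, standard deviations).
pi_k = np.array([0.5, 0.5])
mu_k = np.array([-1.0, 1.0])
sd_k = np.array([1.0, 1.0])

def gauss(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

for _ in range(100):
    # E-step: q(z) = p(z | y, theta), the posterior responsibilities.
    resp = pi_k * gauss(y[:, None], mu_k, sd_k)          # shape (N, 2)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: maximize E_q[log p(y, z | theta)] w.r.t. theta (closed form for a mixture).
    Nk = resp.sum(axis=0)
    pi_k = Nk / len(y)
    mu_k = (resp * y[:, None]).sum(axis=0) / Nk
    sd_k = np.sqrt((resp * (y[:, None] - mu_k) ** 2).sum(axis=0) / Nk)

print(pi_k, mu_k, sd_k)   # should recover roughly (0.5, 0.5), (-2, 3), (1, 1)
```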

Convergence of EM

Repeatedly executing the E-step and M-step until the solution no longer changes, EM is guaranteed to converge to a point with zero gradient (which may be a local minimum, a local maximum, or a saddle point).

Intuitively, at each iteration $t$, EM requires the lower bound $F(q_{t}(z),\theta)$ to touch the likelihood function $L(\theta,y)= \log p(y|\theta)$ at the current solution $\theta_{t-1}$, so their gradients there are equal, i.e., $g=\nabla F(q_{t}(z),\theta_{t-1})=\nabla L(\theta_{t-1})$. Hence $\theta_{t}$ is at least as good as $\theta_{t-1}+\eta g$ for a small step size $\eta$, so EM makes at least as much progress as gradient ascent; if EM converges to $\theta^{*}$, then $\theta^{*}$ is a convergent point for gradient ascent too.

MPO Algorithm

Conventional formulations of RL aim to find a trajectory that maximizes expected reward. In contrast, inference formulations start from a prior distribution over trajectories, condition on a desired outcome (such as achieving a goal state), and then estimate the posterior distribution over trajectories consistent with this outcome.

Let $O$ be the event of succeeding at the RL task: $O=1$ if the task succeeds, and $0$ otherwise. A finite-horizon undiscounted reward formulation can be cast as an inference problem by constructing a suitable probabilistic model via the likelihood function

\(\begin{array}{cc}p(O=1|\tau) \propto \exp(\sum_{t}\frac{r_{t}}{\alpha})\end{array}\)

where $\alpha$ is a temperature parameter and $r_{t}$ is shorthand for $r(s_{t},a_{t})$. Then, following the idea of EM, we can define a lower bound on the likelihood of optimality under the policy $\pi$:

\(\begin{array}{cc} \log p_{\pi}(O=1) = \log \int p_{\pi}(\tau)p(O=1|\tau) d\tau\\ \geq \int q(\tau) [\log p(O=1|\tau)+\log\frac{p_{\pi}(\tau)}{q(\tau)}] d\tau \\ = \mathbb{E}_{q}[\sum_{t} \frac{r_{t}}{\alpha}] - KL(q(\tau) || p_{\pi}(\tau))\\ = \mathcal{J}(q,\pi) \end{array}\)

where $p_{\pi}$ is the trajectory distribution induced by policy $\pi(a|s)$ : \(\begin{array}{cc} p_{\pi}(\tau) = p(s_{0})\prod_{t\geq 0}p(s_{t+1}|s_{t},a_{t})\pi(a_{t}|s_{t}) \end{array}\) and $q(\tau)$ is an auxiliary distribution over trajectories.
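
As a small illustration of the likelihood $p(O=1|\tau) \propto \exp(\sum_{t} r_{t}/\alpha)$, the sketch below (a toy example of mine, not code from the paper) turns the returns of a batch of sampled trajectories into normalized 'probability of optimality' weights; a smaller $\alpha$ concentrates the weight on the highest-return trajectories.

```python
import numpy as np

def optimality_weights(returns, alpha=1.0):
    """Normalized weights proportional to exp(return / alpha), i.e. a softmax over trajectories."""
    logits = np.asarray(returns, dtype=float) / alpha
    logits -= logits.max()                     # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

returns = [1.0, 2.0, 5.0, 0.5]                 # undiscounted returns of four sampled trajectories
print(optimality_weights(returns, alpha=1.0))  # mild preference for the best trajectory
print(optimality_weights(returns, alpha=0.1))  # nearly all weight on the return-5 trajectory
```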

Note that optimizing $\mathcal{J}$ w.r.t. $q$ can be seen as a KL-regularized RL problem, and $\mathcal{J}$ can be optimized with the family of EM algorithms, which alternate between improving $\mathcal{J}$ with respect to $q$ and with respect to $\pi$. Recall that EM does not require the gradient of the objective function; so, in contrast to typical off-policy value-gradient algorithms, MPO does not require the gradient of the Q-function. Instead, it uses samples from the Q-function to compare different actions in a given state and updates the policy such that better actions in that state will have higher probability of being chosen.

Now we derive an infinite-horizon analogue of the KL-regularized expected reward objective. Let $q(\tau) = p(s_{0})\prod_{t>0} p(s_{t+1}|s_{t},a_{t})q(a_{t}|s_{t})$. Since the structure of $q(\tau)$ is the same as that of $p_{\pi}$, the KL over trajectories decomposes into a KL over the individual state-conditional action distributions, which yields:

\(\begin{array}{cc} \mathcal{J}(q, \theta) = \mathbb{E}_{q}\left[\sum_{t=0}^{\infty} \gamma^{t}[r_{t} - \alpha KL(q(a|s_{t}) || \pi(a|s_{t},\theta))]\right] + \log p(\theta). \end{array}\)

The additional term $\log p(\theta)$ is a prior over policy parameters and can be motivated by a maximum a-posteriori estimation problem.

In shorthand, $KL(q_{t} || \pi_{t}) = KL(q(a|s_{t}) || \pi(a|s_{t},\theta))$. Associated with $\mathcal{J}(q,\theta)$, we also define the regularized Q-value function: \(\begin{array}{cc} Q_{\theta}^{q}(s,a) = r_{0} + \mathbb{E}_{q(\tau),s_{0}=s,a_{0}=a}\left[\sum_{t= 1}^{\infty} \gamma^{t}(r_{t} - \alpha KL(q_{t}||\pi_{t}))\right]. \end{array}\)
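
For concreteness, a Monte Carlo estimate of this regularized return from a single rollout under $q$ could look like the short sketch below; the reward and KL sequences are placeholder numbers.

```python
import numpy as np

def regularized_return(rewards, kls, alpha, gamma):
    """Estimate of Q^q_theta from one rollout: r_0 + sum_{t>=1} gamma^t * (r_t - alpha * KL_t)."""
    rewards, kls = np.asarray(rewards, dtype=float), np.asarray(kls, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    tail = discounts[1:] * (rewards[1:] - alpha * kls[1:])
    return rewards[0] + tail.sum()

# Example: a 5-step rollout with per-step KL(q_t || pi_t) values.
print(regularized_return([1.0, 0.5, 0.2, 0.0, 1.0],
                         [0.0, 0.1, 0.05, 0.2, 0.1], alpha=0.1, gamma=0.99))
```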

The MPO algorithm treats $\pi$ as the primary object of interest and $q$ as an auxiliary distribution; analogous to the EM method, it optimizes $\mathcal{J}$ by alternating coordinate ascent in $q$ and $\pi_{\theta}$. (Note that EM and MPO are 'methods' rather than specific algorithms; different optimizations in the E-step and M-step lead to different algorithms.)

E-step of MPO

The E-step (of iteration i) improves $\mathcal{J}(q, \theta)$ w.r.t. $q$ given $\theta = \theta_{i}$.

Start by setting $q = \pi_{\theta_{i}}$, so that $KL(q||\pi_{i}) = 0$, and estimate the unregularized action-value function: \(\begin{array}{cc} Q_{\theta_{i}}^{q}(s,a) = Q_{\theta_{i}}(s,a) = \mathbb{E}_{\tau_{\pi_{i}},s_{0}=s,a_{0}=a}[\sum_{t= 0}^{\infty}\gamma^{t}r_{t}]. \end{array}\) $Q_{\theta_{i}}$ can be estimated from off-policy data, which greatly increases the data efficiency of the algorithm.
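
The paper estimates this Q-function from replay data with an off-policy correction (Retrace-style returns); the sketch below is a deliberately simplified stand-in that fits a Q-network with a plain 1-step TD target on replayed transitions. All network sizes, names and the dummy batch are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 4, 2, 0.99   # illustrative sizes, not from the paper

q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
q_target = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
q_target.load_state_dict(q_net.state_dict())  # target net; in practice updated slowly/periodically
optimizer = torch.optim.Adam(q_net.parameters(), lr=3e-4)

def td_step(s, a, r, s_next, a_next):
    """One 1-step TD update on a replayed batch; a_next is sampled from the current policy."""
    with torch.no_grad():
        target = r + gamma * q_target(torch.cat([s_next, a_next], dim=-1)).squeeze(-1)
    pred = q_net(torch.cat([s, a], dim=-1)).squeeze(-1)
    loss = ((pred - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch of replayed transitions, just to show the call shapes.
B = 32
td_step(torch.randn(B, state_dim), torch.randn(B, action_dim), torch.randn(B),
        torch.randn(B, state_dim), torch.randn(B, action_dim))
```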

A partial E-step can be implemented by optimizing the 'one-step' KL regularized objective :

\(\begin{array}{cc} \underset{q}{\max} \bar{\mathcal{J}}_{s}(q,\theta_{i}) = \underset{q}{\max} T^{\pi,q} Q_{\theta_{i}}(s,a) = \underset{q}{\max}\mathbb{E}_{\mu(s)}[\mathbb{E}_{q(\cdot|s)}[Q_{\theta_{i}}(s,a)] - \alpha KL(q||\pi_{i})] \end{array}\)

where $T^{\pi,q}$ is the regularized Bellman operator :

\(\begin{array}{cc} T^{\pi,q}Q_{\theta_{i}}(s,a) = \mathbb{E}_{q(a|s)}[r(s,a) - \alpha KL(q||\pi_{i}) + \gamma\mathbb{E}_{p(s^{'}|s,a)}[V_{\theta_{i}}(s^{'})]] \end{array}\)

We can view $\pi$ here as the current best policy, and $q$ is regularized towards it.

Maximizing the above objective gives $q_{i} = \underset{q}{arg\; max}\; \bar{\mathcal{J}}_{s}(q,\theta_{i})$. However, note that we treat $Q_{\theta_{i}}$ as a constant w.r.t. $q$, which is not exactly true, so $q_{i}$ does not fully optimize $\mathcal{J}$; hence this is a partial E-step.

Constrained E-step

Note that in the step above, optimizing the KL-regularized objective requires choosing the temperature parameter $\alpha$; in practice, however, the reward and KL terms are on arbitrary relative scales, which makes $\alpha$ difficult to pick. The soft KL regularization can instead be replaced with a hard constraint:

\(\begin{array}{cc} \underset{q}{\max}\mathbb{E}_{\mu(s)}[\mathbb{E}_{q(a|s)}[Q_{\theta_{i}}(s,a)]]\\ s.t.\;\mathbb{E}_{\mu(s)}[KL(q(a|s)\,||\,\pi(a|s,\theta_{i}))]<\epsilon \end{array}\)

Unlike in TRPO/PPO, we can choose a non-parametric representation of $q(a|s)$, given by a sample-based distribution over actions for a state $s$: \(\begin{array}{cc} q_{i}(a|s)\propto \pi(a|s,\theta_{i})\exp(\frac{Q_{\theta_{i}}(s,a)}{\eta^{*}}) \end{array}\) where $\eta^{*}$ is obtained by minimizing the convex dual function: \(\begin{array}{cc} g(\eta) = \eta\epsilon +\eta\int \mu(s)\log \int\pi(a|s,\theta_{i})\exp(\frac{Q_{\theta_{i}}(s,a)}{\eta})da\;ds \end{array}\)

In the E-step, we fix $\mu_{q}(s)$ to the stationary distribution given by previously collected experience, and we use the Q-function of the old policy to evaluate the integral over $a$; this lets us estimate the integral over actions with multiple samples without additional environment interaction. As mentioned before, treating $Q_{\theta_{i}}$ as a constant w.r.t. $q$ makes the optimization of $\mathcal{J}$ only partial, but it greatly reduces the variance of the estimate and allows for fully off-policy learning, making this step both scalable and data efficient.
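
In practice the integrals above are replaced by sample averages: for each state we draw $K$ actions from $\pi(\cdot|s,\theta_{i})$, so the inner integral becomes a mean over the sampled actions and the non-parametric $q_{i}$ reduces to a softmax over their Q-values with temperature $\eta^{*}$. Below is a minimal numpy/scipy sketch of this computation; the function name, the bounds on $\eta$ and the example numbers are my own illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def solve_eta_and_weights(q_values, eps):
    """q_values: array of shape (num_states, K) holding Q_theta_i(s, a_j) for K actions
    sampled from pi(.|s, theta_i). Returns eta* and the weights q_i(a_j|s) per state."""
    q_max = q_values.max(axis=1, keepdims=True)           # for numerical stability

    def dual(eta):
        # Sample-based estimate of g(eta) = eta*eps + eta * E_mu(s)[ log E_pi(a|s)[ exp(Q/eta) ] ].
        log_mean_exp = np.log(np.mean(np.exp((q_values - q_max) / eta), axis=1)) + q_max[:, 0] / eta
        return eta * eps + eta * np.mean(log_mean_exp)

    eta_star = minimize_scalar(dual, bounds=(1e-3, 1e3), method="bounded").x
    # Since the actions were sampled from pi, the pi(a|s) factor is implicit in the samples.
    w = np.exp((q_values - q_max) / eta_star)
    return eta_star, w / w.sum(axis=1, keepdims=True)

# Example: 2 states, 4 sampled actions each.
Q = np.array([[1.0, 2.0, 0.5, 3.0],
              [0.0, 0.1, 0.2, 0.3]])
eta_star, weights = solve_eta_and_weights(Q, eps=0.1)
print(eta_star, weights)
```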

M-step of MPO

The M-step optimizes the lower bound $\mathcal{J}$ w.r.t. $\theta$ given $q=q_{i}$ (from the E-step). Recall that \(\begin{array}{cc} \mathcal{J}(q,\theta) = \int q(\tau) [\log p(O = 1|\tau) + \log \frac{p_{\pi_{\theta}}(\tau)}{q(\tau)}]d\tau + \log p(\theta). \end{array}\)

Since $p(O=1|\tau)$ and $q$ are independent of $\theta$, and the transition terms in $p_{\pi_{\theta}}(\tau)$ cancel with those in $q(\tau)$, we only need to solve:

\(\begin{array}{cc} \theta^{*} = \underset{\theta}{arg\;max}\;\mathbb{E}_{\mu_{q}(s)}[\mathbb{E}_{q(a|s)} [\log \pi(a|s,\theta)]] + \log p(\theta). \end{array}\)

Note that this is actually a Maximum a Posteriori (MAP) estimation problem in which the samples are weighted by the variational distribution from the E-step. Since it is a supervised learning problem, it allows us to employ various regularization techniques.

We set $p(\theta)$ to be a Gaussian prior around the current policy parameters $\theta_{i}$, which suggests the following generalized M-step: \(\begin{array}{cc} \underset{\theta}{\max} \mathbb{E}_{\mu_{q}(s)}[\mathbb{E}_{q(a|s)}[\log \pi(a|s,\theta)] - \lambda KL(\pi(a|s,\theta_{i})\,||\,\pi(a|s,\theta))] \end{array}\)

We can also enforce a hard KL constraint instead:

\(\begin{array}{cc} \underset{\theta}{\max}\mathbb{E}_{\mu_{q}(s)}[\mathbb{E}_{q(a|s)}[\log \pi(a|s,\theta)]]\\ s.t.\;\mathbb{E}_{\mu_{q}(s)}[KL(\pi(a|s,\theta_{i})\,||\,\pi(a|s,\theta))]<\epsilon \end{array}\)

This constraint reduces the risk of overfitting and thus increases the stability of the algorithm.
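
A minimal PyTorch sketch of the soft-penalty version of this M-step (the $\lambda$ form above) for a diagonal Gaussian policy might look as follows; the network sizes, learning rate and $\lambda$ are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

state_dim, action_dim, lam = 4, 2, 0.1      # illustrative values, not from the paper

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy pi(a|s, theta)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh())
        self.mean = nn.Linear(64, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, s):
        return Normal(self.mean(self.body(s)), self.log_std.exp())

policy, old_policy = GaussianPolicy(), GaussianPolicy()
old_policy.load_state_dict(policy.state_dict())          # pi(a|s, theta_i), kept fixed
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def m_step(states, actions, weights):
    """Weighted maximum likelihood with a soft KL(pi_old || pi_new) penalty."""
    with torch.no_grad():
        old_dist = old_policy(states)
    dist = policy(states)
    weighted_logp = (weights * dist.log_prob(actions).sum(-1)).mean()
    kl = kl_divergence(old_dist, dist).sum(-1).mean()
    loss = -(weighted_logp - lam * kl)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# states/actions come from the replay buffer; weights are the E-step weights q_i(a|s).
B = 32
m_step(torch.randn(B, state_dim), torch.randn(B, action_dim), torch.rand(B))
```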

Summary

Compared to typical off-policy value-gradient algorithms, MPO does not require the gradient of the Q-function, and it has several desirable properties, including low variance, low sample complexity, robustness, and policy updates via supervised learning in the M-step (which admits various regularization techniques).

Moreover, if the prior on $\theta$ is uninformative, the algorithm has a monotonic improvement guarantee (the math is involved, so I will not reproduce it here).

References

Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018.

Bi, C. (2019, February 7). The EM Algorithm Explained. Retrieved from https://medium.com/@chloebee/the-em-algorithm-explained-52182dbb19d9
