
Leveraging Non-uniformity in First-order Non-convex Optimization

This is joint work with my collaborators Jincheng, Bo, Dale, and Csaba. It is part of my MSc thesis, and it was accepted to ICML 2021.

Motivation

Policy gradient methods are a widely used approach to solving RL problems: we explicitly build a representation of the policy, keep it in memory during learning, parametrize it, and use gradient descent to directly optimize the objective function. As a result, they also suffer from the common challenges of gradient descent: the iterates may get stuck on plateaus for a long time, it is hard to choose an appropriate learning rate, and there may be convergence issues.

In what follows, I will focus on the softmax policy gradient algorithm in the tabular case, where there is exactly one parameter per state-action pair and, at each state, the policy is parametrized by the softmax function.
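To make the setup concrete, here is a minimal sketch of the tabular softmax parametrization and a vanilla policy gradient (ascent) update on the simplest tabular case, a single-state MDP (a bandit). The reward vector, step size, and iteration count below are my own illustrative choices, not taken from the paper.

```python
import numpy as np

# A minimal sketch of tabular softmax policy gradient on a one-state MDP
# (a bandit with reward vector r). Everything below is illustrative.

def softmax(theta):
    z = theta - np.max(theta)      # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def value(theta, r):
    # V(theta) = E_{a ~ pi_theta}[r(a)]
    return softmax(theta) @ r

def policy_gradient(theta, r):
    # dV/dtheta_a = pi_theta(a) * (r(a) - pi_theta^T r) for the softmax parametrization
    pi = softmax(theta)
    return pi * (r - pi @ r)

r = np.array([1.0, 0.9, 0.1])      # mean rewards, one per action
theta = np.zeros(3)                # one parameter per (state, action) pair
eta = 0.4
for t in range(2000):
    theta += eta * policy_gradient(theta, r)   # gradient *ascent* on V

print(value(theta, r), softmax(theta))   # pi_theta puts most of its mass on the best action
```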

As shown in the figure below, the policy gradient iterates may get stuck on plateaus for a long time. Well-known methods for escaping plateaus faster include Normalized Gradient Descent, RMSProp, and so on, but each has its own drawbacks: Normalized Gradient Descent does not converge in many cases, and RMSProp often does not accelerate enough.

Policy Gradient vs. Normalized Policy Gradient
But according to the plot above, normalized softmax policy gradient converges to the optimal solution quite fast. So what is the secret?

Non-uniform Properties & GNPG

A natural starting point is to analyze the geometric landscape of the softmax policy gradient objective.

Hessian Spectral Radius and Gradient Norm
This figure plots the policy gradient updates starting from a point far away from $\theta^*$: the iterates first enter a sub-optimal plateau, stay stuck there for a long time, and finally escape the plateau and approach the optimal $\theta^*$. A reasonable guess is that, in the MDP case, on plateaus where the gradient is close to zero, normalized gradient descent is equivalent to normalizing the gradient by the non-uniform smoothness parameter.

So what is the non-uniform smoothness parameter? We say that a function is non-uniformly $\beta(\theta)$-smooth if $\Big|f(\theta^{\prime})-f(\theta)-\Big\langle\frac{d f(\theta)}{d \theta},\theta^{\prime}-\theta\Big\rangle\Big|\leq \frac{\color{red}{\beta(\theta)}}{2}\cdot \left\|\theta^{\prime}-\theta\right\|_2^2$.

And we say that a function satisfies the non-uniform Łojasiewicz inequality if $\left\|\frac{d f(\theta)}{d \theta}\right\|_2 \geq \color{red}{C(\theta)}\color{black}\cdot |f(\theta)-f(\theta^*)|^{1-\xi}$, where $C(\theta)> 0$ and $\xi \in (-\infty,1]$. For example, a $\mu$-strongly convex function satisfies this inequality with the constant $C(\theta)=\sqrt{2\mu}$ and $\xi = 1/2$, which is exactly the Polyak-Łojasiewicz condition.

Motivated by these non-uniform geometric properties, we propose Geometry-aware Normalized Policy Gradient (GNPG): at each iteration, it normalizes the policy gradient by the local non-uniform smoothness parameter $\beta(\theta)$. The algorithm generalizes to Geometry-aware Normalized Gradient Descent (GNGD).

Geometry-aware Normalized Policy Gradient (GNPG): $\theta_{t+1}\gets \theta_t +\eta\cdot \frac{\partial V^{\pi_{\theta_t}}(\mu)}{\partial\theta_t}\Big/\color{red}\beta(\theta_t)$

Geometry-aware Normalized Gradient Descent (GNGD): $\theta_{t+1} \gets \theta_{t} - \eta\cdot \frac{\nabla f(\theta_t)}{\color{red}\beta(\theta_t)}$
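As a rough illustration of the GNGD update (not the exact code from the paper), the sketch below normalizes each step by a local smoothness estimate $\beta(\theta)$; the test function $f(x) = x^4$ and the choice $\beta(x) = |f''(x)| = 12x^2$ are my own illustrative assumptions.

```python
import numpy as np

# A minimal sketch of GNGD, theta <- theta - eta * grad(theta) / beta(theta).
# For f(x) = x^4 the curvature far from the minimum is huge, so plain GD with a
# moderate fixed step size overshoots from a distant start, whereas dividing by
# the local smoothness beta(x) = |f''(x)| turns every step into a fixed
# fractional contraction: x <- x - (eta / 3) * x.

def gngd(grad, beta, x0, eta=0.3, num_iters=50, eps=1e-12):
    x = float(x0)
    for _ in range(num_iters):
        x -= eta * grad(x) / (beta(x) + eps)   # geometry-normalized step
    return x

grad = lambda x: 4.0 * x ** 3       # f(x) = x^4
beta = lambda x: 12.0 * x ** 2      # local smoothness |f''(x)|
print(gngd(grad, beta, x0=10.0))    # shrinks toward 0 at a linear rate
```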

My co-authors and I rigorously proved that, anywhere on the landscape of an MDP, the non-uniform smoothness parameter $\beta(\theta)$ is bounded by a constant times the gradient norm, and Jincheng proved the non-uniform Łojasiewicz inequality for MDPs:

$\left\|\frac{\partial^2 V^{\pi_{\theta}}(\mu)}{\partial \theta^2}\right\|_2\leq \left[6+\frac{4\cdot(\underset{\mu}{\max}\left\|d^\pi_\mu / \mu\right\|_{\infty}-(1-\gamma))}{(1-\gamma)\cdot \gamma}\right]\cdot \left\|\frac{\partial V^{\pi_{\theta}}(\mu)}{\partial \theta}\right\|_2$

$\left\|\frac{\partial V^{\pi_{\theta}}(\mu)}{\partial \theta}\right\|_2\geq \frac{\min_s \pi_{\theta}(a^*(s)|s)}{\sqrt{S}\cdot \left\|d^{\pi^*}_\rho/d^{\pi_{\theta}}_{\mu}\right\|_{\infty}}\cdot \left(V^*(\rho)-V^{\pi_\theta}(\rho)\right)$

Combining these two non-uniform properties, it can be shown theoretically that GNPG converges at a linear rate:

$V^*(\rho)-V^{\pi_{\theta_t}}(\rho)\leq \frac{(V^*(\rho)-V^{\pi_{\theta_1}}(\rho))\cdot \underset{\mu}{\max}\left\|d^\pi_\rho/\mu\right\|_{\infty}}{1-\gamma} \cdot e^{-C\cdot (t-1)}$
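Heuristically, the linear rate comes from plugging the two properties into a one-step descent-lemma argument. The sketch below is my own simplification: it considers a generic objective $f$ to be maximized with the update $\theta_{t+1} = \theta_t + \eta\cdot\nabla f(\theta_t)/\beta(\theta_t)$, assumes $\beta(\theta)\le c\cdot\|\nabla f(\theta)\|_2$ and $\|\nabla f(\theta)\|_2\ge C\cdot(f^*-f(\theta))$, and glosses over the fact that the smoothness must control the whole update segment and that $C(\theta)$ varies with $\theta$.

$$
\begin{aligned}
f(\theta_{t+1}) &\ge f(\theta_t) + \Big\langle \nabla f(\theta_t),\, \theta_{t+1}-\theta_t \Big\rangle - \frac{\beta(\theta_t)}{2}\,\big\|\theta_{t+1}-\theta_t\big\|_2^2 \\
&= f(\theta_t) + \Big(\eta - \frac{\eta^2}{2}\Big)\cdot \frac{\|\nabla f(\theta_t)\|_2^2}{\beta(\theta_t)} \;\ge\; f(\theta_t) + \Big(\eta - \frac{\eta^2}{2}\Big)\cdot \frac{\|\nabla f(\theta_t)\|_2}{c} \\
&\ge f(\theta_t) + \Big(\eta - \frac{\eta^2}{2}\Big)\cdot \frac{C}{c}\cdot \big(f^* - f(\theta_t)\big),
\end{aligned}
$$

so $f^* - f(\theta_{t+1}) \le \big(1 - (\eta - \frac{\eta^2}{2})\frac{C}{c}\big)\cdot\big(f^* - f(\theta_t)\big)$, and iterating this one-step contraction gives the exponential bound above.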

Here $C$ is an instance-dependent constant. In general $C$ can be very small, although empirically it is large in many parts of the landscape, and it is still worth investigating whether the lower bound on $C$ is reasonably large for GNPG. As shown in the figure below, where the y-axis is the log of the sub-optimality, GNPG attains a linear convergence rate even though $C$ is very small in the initial phase, so the empirical result verifies the theory.

Log of sub-optimality of PG vs. GNPG

GNGD on GLM

Beyond MDPs, GNGD can also be applied to other problems such as the generalized linear model (GLM). Hazan et al. showed that normalized gradient descent achieves an $O(\frac{1}{\sqrt{t}})$ convergence rate with an adaptive learning rate of $\Theta(\frac{1}{\sqrt{t}})$. By analyzing the non-uniform properties of the GLM, it can be shown theoretically that both gradient descent and GNGD achieve linear convergence rates, but GNGD has a strictly better constant than GD (where $0<C<1$):

GNGD : $\mathcal{L}(\theta_t)\leq \mathcal{L}(\theta_1)\cdot e^{-C\cdot (t-1)}$

GD : $\mathcal{L}(\theta_t)\leq \mathcal{L}(\theta_1)\cdot e^{-C^2\cdot (t-1)}$

GD \& GNGD \& NGD on GLM
The figure above verifies the theoretical result. In subfigure (a), the y-axis is the log of the sub-optimality; it can be observed that both GD and GNGD achieve linear convergence rates, and that GNGD has a better constant.
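For a rough numerical counterpart of this comparison (not the experiment from the paper), the sketch below fits a small synthetic GLM, i.e. squared loss on sigmoid outputs, and compares plain GD with a GNGD-style update that divides the gradient by a finite-difference estimate of the local smoothness $\beta(\theta)$. The data, step sizes, and the Hessian-norm proxy for $\beta(\theta)$ are all illustrative assumptions.

```python
import numpy as np

# GLM with sigmoid outputs and squared loss: compare plain GD with a
# GNGD-style update that normalizes by an estimate of the local smoothness.

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_star = np.array([2.0, -1.0, 0.5])
y = 1.0 / (1.0 + np.exp(-X @ theta_star))        # realizable targets

def loss(theta):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return np.mean((p - y) ** 2)

def grad(theta):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return 2.0 * X.T @ ((p - y) * p * (1.0 - p)) / len(y)

def local_beta(theta, eps=1e-4):
    # crude estimate of the Hessian spectral norm via finite differences
    d = len(theta)
    H = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d); e[i] = eps
        H[:, i] = (grad(theta + e) - grad(theta - e)) / (2 * eps)
    return np.linalg.norm(H, 2)

theta_gd = np.zeros(3)
theta_gn = np.zeros(3)
for t in range(500):
    theta_gd -= 0.5 * grad(theta_gd)                        # plain GD
    g = grad(theta_gn)
    theta_gn -= 0.1 * g / (local_beta(theta_gn) + 1e-12)    # GNGD-style step
print(loss(theta_gd), loss(theta_gn))                       # compare final losses
```

Estimating $\beta(\theta)$ by a finite-difference Hessian norm is only practical in tiny dimensions; the point of the analysis above is that for specific problem classes, such as MDPs, $\beta(\theta)$ admits explicit bounds (e.g., a constant times the gradient norm).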

General Non-uniform Analysis

My collaborators and I also carried out a general non-uniform analysis of GNGD: we classified functions into categories according to their non-uniform smoothness, their non-uniform Łojasiewicz property, and convexity, and analyzed the convergence rates of GNGD and GD for each category. In some cases, GNGD achieves a convergence rate faster than the lower bound for convex-smooth optimization, which is $\Omega(1/t^2)$. This is not a contradiction, since the lower bound is a worst-case statement; for some objective functions the rate of GNGD also just matches the $\Omega(1/t^2)$ lower bound, and it is still unknown whether GNGD enjoys optimal worst-case rates. Zhang et al. showed that the lower bound for $(L_0, L_1)$ optimization is $\Omega(\frac{1}{\sqrt{t}})$, but in some cases, for example the GLM presented above, GNGD achieves a faster rate; again, this is not a contradiction, since it is a lower bound. Moreover, in some cases where gradient descent diverges, GNGD converges. For example, as shown in the figure below, for the function $f(x)=|x|^{1.5}$, gradient descent diverges while GNGD converges at a linear rate.

GD \& GNGD
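As a quick numerical check of this example (with step size and iteration count that are my own illustrative choices): for $f(x)=|x|^{1.5}$ we have $f'(x)=1.5\sqrt{|x|}\,\mathrm{sign}(x)$ and local smoothness $\beta(x)=|f''(x)|=0.75/\sqrt{|x|}$, which blows up at the minimizer, so the GNGD step simplifies to $\eta \cdot f'(x)/\beta(x) = 2\eta x$, a fixed fractional contraction.

```python
import numpy as np

# f(x) = |x|^{1.5}: fixed-step GD keeps overshooting the minimizer and does not
# settle at it in this run, while GNGD (dividing by the local smoothness
# beta(x)) contracts linearly: x <- x - 2 * eta * x.

grad = lambda x: 1.5 * np.sqrt(np.abs(x)) * np.sign(x)
beta = lambda x: 0.75 / np.sqrt(np.abs(x) + 1e-12)   # local |f''(x)|

x_gd, x_gngd, eta = 1.0, 1.0, 0.1
for t in range(100):
    x_gd -= eta * grad(x_gd)                        # plain gradient descent
    x_gngd -= eta * grad(x_gngd) / beta(x_gngd)     # geometry-normalized step
print(x_gd, x_gngd)   # GD ends up away from 0; GNGD is numerically ~0
```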

Further Work

  1. Stochastic GNGD: It would be interesting to analyze the convergence of stochastic GNGD. I am looking forward to Jincheng's new work, Understanding the Effect of Stochasticity in Policy Optimization.
  2. Approximators: Another direction is to push the analysis further to other settings with more complex function approximators, including neural networks.

Reference

  1. A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan, "Optimality and approximation with policy gradient methods in Markov decision processes," in Conference on Learning Theory, PMLR, 2020, pp. 64-66.
  2. Z. Allen-Zhu, Y. Li, and Z. Song, "A convergence theory for deep learning via over-parameterization," in International Conference on Machine Learning, PMLR, 2019, pp. 242-252.
  3. J. Bhandari and D. Russo, "A note on the linear convergence of policy gradient methods," arXiv preprint arXiv:2007.11120, 2020.
  4. S. Bubeck, "Convex optimization: Algorithms and complexity," arXiv preprint arXiv:1405.4980, 2014.
  5. S. Cen, C. Cheng, Y. Chen, Y. Wei, and Y. Chi, "Fast global convergence of natural policy gradient methods with entropy regularization," arXiv preprint arXiv:2007.06558, 2020.
  6. E. Hazan, K. Levy, and S. Shalev-Shwartz, "Beyond convexity: Stochastic quasi-convex optimization," Advances in Neural Information Processing Systems, vol. 28, pp. 1594-1602, 2015.
  7. C. Igel and M. Hüsken, "Improving the Rprop learning algorithm," 2000.
  8. S. Kakade and J. Langford, "Approximately optimal approximate reinforcement learning," in ICML, vol. 2, 2002, pp. 267-274.
  9. H. Karimi, J. Nutini, and M. Schmidt, "Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2016, pp. 795-811.
  10. D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
  11. K. Kurdyka, "On gradients of functions definable in o-minimal structures," in Annales de l'institut Fourier, vol. 48, 1998, pp. 769-783.
  12. G. Li, Y. Wei, Y. Chi, Y. Gu, and Y. Chen, "Softmax policy gradient methods can take exponential time to converge," arXiv preprint arXiv:2102.11270, 2021.
  13. Y. Li, C. Szepesvari, and D. Schuurmans, "Learning exercise policies for American options," in Artificial Intelligence and Statistics, PMLR, 2009, pp. 352-359.
  14. J. Mei, Y. Gao, B. Dai, C. Szepesvari, and D. Schuurmans, Leveraging non-uniformity in first-order non-convex optimization, 2021. arXiv: 2105.06072 [cs.LG].
  15. J. Mei, C. Xiao, B. Dai, L. Li, C. Szepesvari, and D. Schuurmans, "Escaping the gravitational pull of softmax," Advances in Neural Information Processing Systems, vol. 33, 2020.
  16. J. Mei, C. Xiao, C. Szepesvari, and D. Schuurmans, "On the global convergence rates of softmax policy gradient methods," in International Conference on Machine Learning, PMLR, 2020, pp. 6820-6829.
  17. R. Murray, B. Swenson, and S. Kar, "Revisiting normalized gradient descent: Fast evasion of saddle points," IEEE Transactions on Automatic Control, vol. 64, no. 11, pp. 4818-4824, Nov. 2019, issn: 2334-3303. doi: 10.1109/tac.2019.2914998. [Online]. Available: http://dx.doi.org/10.1109/TAC.2019.2914998.
  18. A. S. Nemirovski and D. B. Yudin, "Problem complexity and method efficiency in optimization," 1983.
  19. Y. Nesterov, Introductory lectures on convex optimization: A basic course. Springer Science & Business Media, 2003, vol. 87.
  20. B. T. Polyak, "Gradient methods for minimizing functionals," Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, vol. 3, no. 4, pp. 643-653, 1963.
  21. R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems, 2000, pp. 1057-1063.
  22. A. Wilson, L. Mackey, and A. Wibisono, "Accelerating rescaled gradient descent: Fast optimization of smooth functions," arXiv preprint arXiv:1902.08825, 2019.
  23. J. Zhang, T. He, S. Sra, and A. Jadbabaie, "Why gradient clipping accelerates training: A theoretical justification for adaptivity," in International Conference on Learning Representations, 2019.