This year, one of the most impactful papers delves into the fascinating idea that AI models are converging toward a shared understanding of reality, much like Plato’s concept of an ideal world.
Retrieval Augmented Generation (RAG) - Unveiling the Simplicity and Magic of an LLM Innovation In today’s landscape, large language models (LLMs) are catalyzing transformative applications across various industries.
Desired characteristics for real-world RL agents - My research internship project at BorealisAI This post describes my project from my research internship (with my collaborators Pablo and Kry) at BorealisAI, which also forms part of my MSc thesis.
Introduction As I wrote in my recent post, since the update direction in PG algorithms is not actually the gradient, it is not quite clear what an update actually does and why it eventually converges to a good policy.
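To pin down the concern, here is a minimal sketch in standard notation (assumed here, not quoted from the post): the update direction that implementations typically follow is
$$g(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)\, G_t\right], \qquad G_t = \sum_{k \ge t} \gamma^{k-t} r_k,$$
which omits the $\gamma^{t}$ weighting that the gradient of the discounted objective carries, so $g(\theta)$ is in general not that gradient.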
Introduction The policy gradient theorem is the cornerstone of policy gradient methods; in the last post, I presented its proof, which describes the gradient of the discounted objective w.r.t. the policy parameters.
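For reference, a standard statement of the theorem (in commonly used notation, which may differ slightly from the post's) is
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim d^{\pi_{\theta}},\, a \sim \pi_{\theta}(\cdot|s)}\!\left[\nabla_{\theta} \log \pi_{\theta}(a|s)\, Q^{\pi_{\theta}}(s, a)\right],$$
where $d^{\pi_{\theta}}$ denotes the discounted state distribution induced by $\pi_{\theta}$ and $Q^{\pi_{\theta}}$ its action-value function.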
The Proof of Policy Gradient Theorem Introduction Recall that policy gradient methods aim to directly optimize parameterized policies $\pi_{\theta}(a|s)$.
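As a reminder (again in standard notation that I am assuming here), the discounted objective being optimized over the policy parameters $\theta$ is
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],$$
and policy gradient methods ascend an estimate of $\nabla_{\theta} J(\theta)$.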
MPO Extension -- A more intuitive interpretation of MPO Introduction As stated in the last post, MPO is motivated by the "RL as inference" perspective.
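In the standard "RL as inference" setup (sketched here in commonly used notation, which may not match the post exactly), one introduces an optimality variable $O$ with $p(O=1 \mid \tau) \propto \exp\!\big(\sum_t r_t / \alpha\big)$ and maximizes the evidence lower bound
$$\log p_{\pi}(O=1) \;\ge\; \mathbb{E}_{\tau \sim q}\!\left[\sum_t r_t/\alpha\right] - D_{\mathrm{KL}}\!\big(q(\tau)\,\|\,p_{\pi}(\tau)\big),$$
alternating between optimizing the variational distribution $q$ and the policy, which is where MPO's E-step and M-step come from.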
Maximum a Posteriori Policy Optimization Background In practice, policy gradient algorithms like TRPO or PPO require carefully tuned entropy regularization to prevent policy collapse; moreover, there is work showing that much of PPO's performance comes from code-level optimizations.
TRPO and PPO -- A Reading Summary Introduction Generally speaking, the goal of reinforcement learning is to find an optimal behaviour strategy that maximizes rewards.
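For orientation, the constrained surrogate that TRPO optimizes (in its standard form; notation here is assumed rather than taken from the post) is
$$\max_{\theta}\; \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}\, A^{\pi_{\theta_{\text{old}}}}(s,a)\right] \quad \text{s.t.}\quad \mathbb{E}_{s}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot|s)\,\|\,\pi_{\theta}(\cdot|s)\big)\right] \le \delta,$$
while PPO replaces the hard KL constraint with a clipped ratio objective that can be optimized with ordinary stochastic gradient methods.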