Reinforcement Learning

This lecture (9 ECTS) will lay the foundations of reinforcement learning (RL). The lecture is divided into three parts: multi-armed bandits, tabular RL, and non-tabular RL.

  • Multi-armed bandits (mostly from the algorithmic point of view) – 5 lectures
    • Explore-then-commit
    • Greedy
    • UCB (upper confidence bound; see the Python sketch after this list)
    • Boltzmann exploration
    • Softmax (policy gradient)
  • Tabular MDP basics – 5 lectures
    • Foundations of dynamic programming
    • value iteration
    • policy iteration
  • Tabular Q-learning and TD-learning – 6 lectures
    • Monte Carlo evaluation
    • Tsitsiklis' convergence proof for stochastic fixed point iterations
    • One-step approximate dynamic programming (TD(0), SARSA, Q-learning)
    • Double Q-learning
    • Multi-step approximate dynamic programming (n-step methods, forward and backward views of TD(λ))
  • Policy Gradient Schemes – 10 lectures
    • policy gradient theorems
    • variance reduction tricks such as baselines and actor-critic methods
    • gradient descent and stochastic gradient descent
    • neural networks in RL
    • SAC, TRPO, PPO
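
To give a first flavour of the coding part, here is a minimal Python sketch of the UCB1 bandit algorithm mentioned above; the Bernoulli arm means, the horizon, and the exploration constant are illustrative choices, not material from the lecture.

```python
import numpy as np

def ucb1(means, horizon=10_000, rng=None):
    """Minimal UCB1 sketch for a Bernoulli bandit with the given arm means."""
    rng = np.random.default_rng(rng)
    k = len(means)
    counts = np.zeros(k)          # number of pulls per arm
    sums = np.zeros(k)            # sum of observed rewards per arm
    regret = 0.0                  # accumulated pseudo-regret
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1           # pull every arm once to initialise
        else:
            bonus = np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(sums / counts + bonus))
        reward = float(rng.random() < means[arm])    # Bernoulli reward
        counts[arm] += 1
        sums[arm] += reward
        regret += max(means) - means[arm]
    return regret

if __name__ == "__main__":
    print("pseudo-regret:", ucb1([0.4, 0.5, 0.6], horizon=10_000, rng=0))
```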

We will prove everything that we consider necessary for a proper understanding of the algorithms, but we will also get into the coding (Python). For many RL algorithms convergence proofs are still open (even worse, some algorithms are known to diverge). We will cover theoretical results around RL, which sometimes lead to good educated guesses for RL algorithms even when the theoretical assumptions behind the techniques cannot be checked (or are violated).
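
As a small illustration of what the Python part could look like, here is a rough sketch of tabular Q-learning with ε-greedy exploration on a toy chain MDP; the environment and the parameters (learning rate, discount factor, exploration rate, number of episodes) are placeholders, not the settings used in the lecture.

```python
import numpy as np

def q_learning(n_states=6, episodes=2_000, alpha=0.1, gamma=0.95, eps=0.1, rng=0):
    """Tabular Q-learning on a toy chain MDP: move left/right, reward 1 at the right end."""
    rng = np.random.default_rng(rng)
    n_actions = 2                                   # 0 = left, 1 = right
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = 0, False                          # every episode starts at the left end
        while not done:
            # epsilon-greedy behaviour policy with random tie-breaking
            if rng.random() < eps:
                a = int(rng.integers(n_actions))
            else:
                a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            done = (s_next == n_states - 1)
            r = 1.0 if done else 0.0
            # one-step Q-learning update (off-policy TD target)
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

if __name__ == "__main__":
    Q = q_learning()
    print("greedy policy (1 = move right):", np.argmax(Q, axis=1))
```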