This lecture (10 ECTS) will lay the foundations of reinforcement learning (RL). The lecture is divided into three parts: multi-armed bandits, tabular RL, and non-tabular RL.
We will prove everything that we think is needed for a proper understanding of the algorithms, but we will also go into the coding (Python). In many instances of RL, convergence proofs are still open (even worse, some algorithms are known to diverge). We will cover theoretical results around RL that sometimes lead to good educated guesses for RL algorithms, even though the theoretical assumptions of the techniques cannot be checked (or are violated).
Reinforcement learning is a type of machine learning that involves training an agent to make a sequence of decisions in an environment in order to maximize a reward. It is often used to control complex, dynamic systems or to optimize performance. Some applications of reinforcement learning include:
Overall, reinforcement learning offers a way to optimize complex systems by learning how to act in certain situations in order to maximize rewards.
Attention: This text was written by ChatGPT, an AI tool that is itself based on reinforcement learning (and transformer networks). I do not quite agree with ChatGPT; financial markets do not seem particularly well suited to ML methods. Anyway, since we will encounter RL on many occasions in our future lives, it will be useful to know how RL works.
Students from the study programs Mathematics, WiMa, WiFo, and MMDS. We will cover the mathematical background of reinforcement learning; coding (in Python) will be part of the exercises.
Prof. Dr. Leif Döring, Sara Klein, Bene Wille
Lecture: Tuesday and Wednesday, B2 (10:15–11:45), in B6, D007 Seminarraum 2 (in the garden of B6)
Exercise Classes: Thursday, B4 (13:45–15:15), in B6, B3.01 (Mathelounge)
Exams will be oral; here are some hints.
Python code to try some bandit algorithms.
Upper confidence bound: Change the constant in the exploration bonus and see what happens. Also change the distribution of the arms to Bernoulli and explore with simulations how the constant in the exploration bonus should be adapted. Plot the average over different realisations; the code only plots one realisation.
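To get started, here is a minimal sketch of UCB on a Gaussian bandit (our own toy version, not the course code: the arm means, the unit reward variance, the constant c in the exploration bonus, and the averaging over 50 runs are all assumptions to play with):

```python
import numpy as np

def ucb(means, horizon, c=2.0, rng=None):
    """UCB on a Gaussian bandit; c is the constant in the exploration bonus."""
    rng = np.random.default_rng(rng)
    n_arms = len(means)
    counts = np.zeros(n_arms)   # number of pulls per arm
    sums = np.zeros(n_arms)     # sum of observed rewards per arm
    rewards = np.zeros(horizon)
    for t in range(horizon):
        if t < n_arms:
            arm = t             # pull every arm once to initialise
        else:
            bonus = np.sqrt(c * np.log(t + 1) / counts)
            arm = np.argmax(sums / counts + bonus)
        r = rng.normal(means[arm], 1.0)   # Gaussian rewards, variance 1
        counts[arm] += 1
        sums[arm] += r
        rewards[t] = r
    return rewards

# average over several realisations instead of plotting just one
means = np.array([0.0, 0.3, 0.7])
runs = np.array([ucb(means, 2000) for _ in range(50)])
regret = np.cumsum(means.max() - runs.mean(axis=0))
print(regret[-1])
```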
Policy gradient: Play with the code to get a feeling for what changes if you change the step sizes (learning rates) of gradient descent or the initialisation of the logits (Boltzmann weights of the probabilities).
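For orientation, a minimal sketch of a policy gradient (REINFORCE) method with a softmax (Boltzmann) policy on a Gaussian bandit; the learning rate lr, the initial logits theta0, and the running-average baseline are our own choices, not necessarily those of the course code:

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def policy_gradient_bandit(means, steps=5000, lr=0.1, theta0=None, rng=None):
    """REINFORCE on a Gaussian bandit with a softmax policy over logits theta."""
    rng = np.random.default_rng(rng)
    theta = np.zeros(len(means)) if theta0 is None else np.array(theta0, float)
    baseline = 0.0
    for t in range(steps):
        pi = softmax(theta)
        arm = rng.choice(len(means), p=pi)
        r = rng.normal(means[arm], 1.0)
        baseline += (r - baseline) / (t + 1)   # running-average baseline
        grad = -pi                             # d log pi(arm) / d theta ...
        grad[arm] += 1.0                       # ... equals e_arm - pi
        theta += lr * (r - baseline) * grad    # gradient ascent step
    return softmax(theta)

print(policy_gradient_bandit([0.0, 0.3, 0.7]))  # mass concentrates on arm 2
```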
Standard grid world: Value iteration runs on the standard grid world, with the target in the lower right corner and the trap diagonally above. Try to see what happens if you change the discount factor, the size, etc., to get a feeling for how the algorithm adapts. How does the number of iterations change? How do the state-action values and state values change? What policy should be played according to the algorithm? A minimal sketch of value iteration follows after the code links below.
Value Iteration for GridWorld (using V)
Value Iteration for GridWorld (directly using Q)
Grid world without termination in the target
Policy Iteration for GridWorld (without termination)
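As announced above, here is a minimal value iteration sketch for such a grid world; the reward convention (+1 for entering the target, -1 for entering the trap, both terminal, moves into a wall stay in place) is our own assumption and may differ from the course code:

```python
import numpy as np

def value_iteration(n=5, gamma=0.9, tol=1e-8):
    """Value iteration on an n x n grid; target lower right, trap diagonally above."""
    target, trap = (n - 1, n - 1), (n - 2, n - 2)
    actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
    V = np.zeros((n, n))
    iterations = 0
    while True:
        V_new = np.zeros_like(V)
        for i in range(n):
            for j in range(n):
                if (i, j) in (target, trap):
                    continue                        # terminal states keep value 0
                q = []
                for di, dj in actions:
                    ni = min(max(i + di, 0), n - 1) # bumping into walls: stay put
                    nj = min(max(j + dj, 0), n - 1)
                    r = 1.0 if (ni, nj) == target else -1.0 if (ni, nj) == trap else 0.0
                    q.append(r + gamma * V[ni, nj])
                V_new[i, j] = max(q)                # Bellman optimality update
        iterations += 1
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, iterations
        V = V_new

V, k = value_iteration()
print(k)               # number of sweeps until convergence
print(np.round(V, 2))  # state values; change n and gamma and watch both
```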
Windy cliff walk: Cliff walk is similar to grid world, with the difference that there are bombs all along one edge of a square. We start next to the cliff and want to reach the goal on the other side of the cliff. In windy cliff walk we additionally assume there is some wind in the steps: in every step there is a fixed probability that we are pushed one step north. Without wind, an optimal (i.e. shortest) path runs right along the cliff. With wind that pushes us towards the cliff, an optimal path keeps some safety distance. Here is code to see a visualisation of the result of plain vanilla Q-learning; the current estimated policy (greedy with respect to the current Q) is plotted after every episode. A sketch of the algorithm follows after the code link below.
Q-learning for windy cliff walk
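As referenced above, a sketch of plain vanilla tabular Q-learning on a windy cliff walk; the grid size, the wind probability, the rewards (-1 per step, -100 for falling off and returning to the start), and the ε-greedy exploration are our own assumptions, not necessarily those of the course code:

```python
import numpy as np

def q_learning_windy_cliff(h=4, w=8, wind=0.2, episodes=500,
                           alpha=0.5, gamma=1.0, eps=0.1, rng=None):
    """Tabular Q-learning; the cliff occupies the northern edge between
    start (h-1, 0) and goal (h-1, w-1), and with probability `wind` every
    step is pushed one row further north, i.e. towards the cliff."""
    rng = np.random.default_rng(rng)
    moves = [(1, 0), (-1, 0), (0, -1), (0, 1)]   # north, south, west, east
    Q = np.zeros((h, w, 4))
    start, goal = (h - 1, 0), (h - 1, w - 1)
    cliff = {(h - 1, j) for j in range(1, w - 1)}

    for _ in range(episodes):
        s = start
        while s != goal:
            # epsilon-greedy action selection from the current Q
            a = rng.integers(4) if rng.random() < eps else int(np.argmax(Q[s]))
            di, dj = moves[a]
            i = min(max(s[0] + di, 0), h - 1)
            j = min(max(s[1] + dj, 0), w - 1)
            if rng.random() < wind:               # pushed towards the cliff
                i = min(i + 1, h - 1)
            if (i, j) in cliff:
                r, s2 = -100.0, start             # fell off: back to start
            else:
                r, s2 = -1.0, (i, j)
            done = s2 == goal
            target = r if done else r + gamma * np.max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])  # Q-learning update
            s = s2
    return Q

Q = q_learning_windy_cliff()
arrows = np.array(list('^v<>'))[np.argmax(Q, axis=2)]  # greedy policy
print(arrows[::-1])   # northern (cliff) row printed on top
```

With wind > 0 the greedy policy learns to detour away from the cliff, exactly the safety distance described above.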
Plain vanilla gradient descent: Python code for gradient descent on a quadratic function. Play around with the code by changing the learning rates and the function to be optimised!
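A minimal version of what such code might look like (our own toy example on f(x) = x²; swap in any differentiable function and its gradient):

```python
def gradient_descent(grad, x0, lr=0.1, steps=50):
    """Plain vanilla gradient descent with constant learning rate."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)      # move against the gradient
    return x

f = lambda x: x ** 2
grad_f = lambda x: 2 * x       # derivative of the quadratic

x_min = gradient_descent(grad_f, x0=5.0)
print(x_min, f(x_min))         # close to the minimiser 0
```

Try lr = 1.1: the iterates satisfy x_{k+1} = (1 - 2·lr)·x_k and diverge, since convergence for this function requires 0 < lr < 1 (i.e. lr < 2/L with smoothness constant L = 2).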
Sutton & Barto: “Reinforcement Learning – an Introduction” is available online. This covers all major ideas but skipps essentially all details. In essence, this lecture course follows the core ideas of Sutton & Barto but tries to include as much of the missing mathematics as possible.