# Reinforcement Learning

This lecture (10 ECTS) lays the foundations of reinforcement learning (RL). It is divided into three parts: multi-armed bandits, tabular RL, and non-tabular RL.

• Multi-armed bandits (mostly from the algorithmic point of view) – 5 lectures
  • Explore-then-commit
  • Greedy
  • UCB
  • Boltzmann exploration
  • Softmax (policy gradient)
• Tabular MDP basics – 5 lectures
  • Foundations of dynamic programming
  • Value iteration
  • Policy iteration
• Tabular Q-learning, TD-learning – 6 lectures
  • Monte Carlo evaluation
  • Tsitsiklis' convergence proof for stochastic fixed-point iterations
  • One-step approximate dynamic programming (TD(0), SARSA, Q-learning)
  • Double Q-learning
  • Multi-step approximate dynamic programming (n-step, forward & backward TD(λ))
• Policy gradient schemes – 10 lectures
  • Policy gradient theorems
  • Variance reduction tricks such as baselines and actor-critic
  • Gradient descent and stochastic gradient descent
  • Neural networks in RL
  • SAC, TRPO, PPO

We will prove everything that we think is needed for a proper understanding of the algorithms, but we will also go into the coding (Python). In many instances of RL, convergence proofs are still open (even worse, some algorithms are known to diverge). We will cover theoretical results around RL that sometimes lead to good educated guesses for RL algorithms, even when the theoretical assumptions of the techniques cannot be checked (or are violated).

#### Why is reinforcement learning useful?

Reinforcement learning is a type of machine learning that involves training an agent to make a sequence of decisions in an environment in order to maximize a reward. It is often used to control complex, dynamic systems or to optimize performance. Some applications of reinforcement learning include:

• Robotics: Reinforcement learning can be used to teach robots how to perform tasks by rewarding successful execution and punishing mistakes.
• Financial markets: Reinforcement learning can be used to develop trading strategies by learning how to take advantage of market conditions.
• Games: Reinforcement learning has been used successfully to play computer games by learning from play against human or other computer opponents.
• Web optimization: Reinforcement learning can be used to optimize websites by learning how to control traffic on the site in order to achieve certain goals.

Overall, reinforcement learning offers a way to optimize complex systems by learning how to act in certain situations in order to maximize rewards.

Attention: This text was written by ChatGPT, an AI tool that is itself based on reinforcement learning (and transformer networks). I do not quite agree with ChatGPT; financial markets do not seem to be very well suited to ML methods. In any case, since we will encounter RL on many occasions in our future lives, it is useful to know how RL works.

#### Target group

Students from the study programs Mathematics, WiMa, WiFo, and MMDS. We will cover the mathematical background of reinforcement learning; coding (in Python) will be part of the exercises.

#### Team

Prof. Dr. Leif Döring, Sara Klein, Bene Wille

#### Weekly schedule

Lecture: Tuesday and Wednesday, B2 (10:15–11:45), in B6, D007 Seminarraum 2 (in the garden of B6)

Exercise Classes: Thursday, B4 (13:45–15:15), in B6, B3.01 (Mathelounge)

#### Oral exams

Exams will be oral; here are some hints.

#### Exercises

• Exercise Sheet 1, Solution Sheet 1
• Exercise Sheet 2, Solution Sheet 2
• Exercise Sheet 3, Solution Sheet 3
• Exercise Sheet 4, Solution Sheet 4
• Exercise Sheet 5, Solution Sheet 5
• Exercise Sheet 6, Solution Sheet 6, Code
• Exercise Sheet 7, Solution Sheet 7
• Exercise Sheet 8, Solution Sheet 8
• Exercise Sheet 9, Solution Sheet 9, Code
• Exercise Sheet 10, Solution Sheet 10, Code
• Exercise Sheet 11, Solution Sheet 11
• Exercise Sheet 12, Solution Sheet 12
#### Lecture notes and Python code

Lecture Notes

Python code to try some bandit algorithms.

Upper confidence bound: Change the constant in the exploration bonus and see what happens. Also change the distribution of the arms to Bernoulli and explore with simulations how the constant in the exploration bonus should be adapted. Plot the average over different realisations; the code only plots one realisation.

UCB
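To get started, here is a minimal UCB sketch for Gaussian arms with unit variance (the function names, the default constant `c` in the exploration bonus, and the horizon are illustrative choices, not the course code):

```python
import numpy as np

def ucb(means, horizon=2000, c=2.0, rng=None):
    """Minimal UCB sketch: play each arm once, then maximise
    estimate + sqrt(c * log(t) / pulls). Arms are N(mean, 1)."""
    rng = rng or np.random.default_rng(0)
    k = len(means)
    counts = np.zeros(k)        # number of pulls per arm
    estimates = np.zeros(k)     # running mean reward per arm
    rewards = []
    for t in range(horizon):
        if t < k:               # initial round: play every arm once
            arm = t
        else:                   # maximise estimate plus exploration bonus
            bonus = np.sqrt(c * np.log(t + 1) / counts)
            arm = int(np.argmax(estimates + bonus))
        r = rng.normal(means[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (r - estimates[arm]) / counts[arm]
        rewards.append(r)
    return np.array(rewards), counts

rewards, counts = ucb([0.2, 0.5, 0.9])
```

Lowering `c` shifts the balance from exploration towards exploitation; averaging `rewards` over many seeds shows the effect on regret.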

Policy gradient: Play with the code to get a feeling for what changes if you change the step sizes (learning rates) of gradient descent or the initialisation of the logits (Boltzmann weights of the probabilities).

Policy Gradient for bandits
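A compact sketch of softmax policy gradient for a bandit, with a running-average baseline (the gradient of log softmax at the sampled arm is the indicator of that arm minus the probability vector; all parameter values below are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_bandit(means, horizon=10000, lr=0.1, rng=None):
    """Stochastic softmax policy gradient: theta are the logits,
    lr the learning rate, baseline a running average of rewards."""
    rng = rng or np.random.default_rng(0)
    k = len(means)
    theta = np.zeros(k)                  # logit initialisation: uniform policy
    baseline = 0.0
    for t in range(horizon):
        p = softmax(theta)
        arm = rng.choice(k, p=p)
        r = rng.normal(means[arm], 1.0)
        baseline += (r - baseline) / (t + 1)
        grad = -p * (r - baseline)       # grad of log pi(arm) is e_arm - p
        grad[arm] += (r - baseline)
        theta += lr * grad               # gradient ascent on expected reward
    return softmax(theta)

p = policy_gradient_bandit([0.1, 0.4, 0.8])
```

Try removing the baseline or initialising `theta` far from zero to see how the trajectory of the policy changes.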

Standard grid world: Value iteration runs on a standard grid world with the target in the lower-right corner and the trap diagonally above it. Try to see what happens if you change the discount factor, the size, etc., to get a feeling for how the algorithm adapts. How does the number of iterations change? How do the state-action values and state values change? What policy should be played according to the algorithm?

Value Iteration for GridWorld (directly using Q)
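The following is a minimal sketch of value iteration directly on Q for a toy version of this grid world (deterministic moves, episodes terminate in target and trap, reward paid on entering them; sizes and constants are illustrative assumptions):

```python
import numpy as np

def value_iteration_grid(n=4, gamma=0.9, tol=1e-8):
    """Value iteration on an n x n grid: target +1 in the lower-right
    corner, trap -1 diagonally above it, both terminal."""
    target, trap = (n - 1, n - 1), (n - 2, n - 2)
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]       # up, down, left, right
    Q = np.zeros((n, n, 4))
    iters = 0
    while True:
        Q_new = np.zeros_like(Q)
        for i in range(n):
            for j in range(n):
                if (i, j) in (target, trap):
                    continue                          # terminal states
                for a, (di, dj) in enumerate(moves):
                    ni = min(max(i + di, 0), n - 1)   # bump into walls
                    nj = min(max(j + dj, 0), n - 1)
                    if (ni, nj) == target:
                        Q_new[i, j, a] = 1.0
                    elif (ni, nj) == trap:
                        Q_new[i, j, a] = -1.0
                    else:                             # Bellman update
                        Q_new[i, j, a] = gamma * Q[ni, nj].max()
        iters += 1
        if np.abs(Q_new - Q).max() < tol:
            return Q_new, iters
        Q = Q_new

Q, iters = value_iteration_grid()
V = Q.max(axis=2)
```

A state at distance d from the target has value gamma^(d-1), so the state values display the discounted shortest-path structure directly.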

Grid world without termination in the target/trap: The algorithm continues running and collects +1 in the target and -1 in the trap. Run the algorithm and see how it learns backwards from the target. Keeping in mind that the algorithm only needs to learn the shortest paths from states to the target, compare what you see to the famous Bellman-Ford algorithm. Try to understand why the algorithm behaves like Bellman-Ford, and why the learnt (approximately) optimal state-action values are perhaps not exactly what one might expect (starting in the target is not the best option).

Policy Iteration for GridWorld (without termination)
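As a sketch of policy iteration on such a non-terminating grid world, here is a version with exact policy evaluation via a linear solve. Note the assumption (mine, not necessarily the course code's) that the reward is paid on arrival in target/trap, so in this toy version the best behaviour at the target is to bump into the wall and re-enter forever:

```python
import numpy as np

def policy_iteration_grid(n=4, gamma=0.9):
    """Policy iteration: target +1 and trap -1 paid on entering,
    no termination. Evaluation solves (I - gamma P_pi) v = r_pi."""
    target, trap = (n - 1, n - 1), (n - 2, n - 2)
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # up, down, left, right
    S = n * n

    def step(s, a):                                    # deterministic dynamics
        i, j = divmod(s, n)
        di, dj = moves[a]
        ni = min(max(i + di, 0), n - 1)
        nj = min(max(j + dj, 0), n - 1)
        r = 1.0 if (ni, nj) == target else (-1.0 if (ni, nj) == trap else 0.0)
        return ni * n + nj, r

    policy = np.zeros(S, dtype=int)
    while True:
        # exact policy evaluation
        P, r = np.zeros((S, S)), np.zeros(S)
        for s in range(S):
            s2, rew = step(s, policy[s])
            P[s, s2], r[s] = 1.0, rew
        v = np.linalg.solve(np.eye(S) - gamma * P, r)
        # greedy policy improvement
        new_policy = np.array([
            max(range(4), key=lambda a: step(s, a)[1] + gamma * v[step(s, a)[0]])
            for s in range(S)
        ])
        if np.array_equal(new_policy, policy):
            return policy.reshape(n, n), v.reshape(n, n)
        policy = new_policy

policy, v = policy_iteration_grid()
```

With gamma = 0.9 the value at the target is 1/(1 - gamma) = 10, the maximum achievable, which illustrates why the exact values can look different from the naive shortest-path intuition.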

Windy cliff walk: Cliff walk is similar to grid world with the difference that there are bombs all along one edge of a square. We start next to the cliff and want to reach the goal on the other side of the cliff. In windy cliff walk we additionally assume there is some wind in the steps: in every step there is a fixed probability that we are pushed one step north. Without wind an optimal path (optimal means shortest) runs just along the cliff. With wind that pushes us towards the cliff, an optimal path keeps some safety distance. Here is code to see a visualisation of the result from plain vanilla Q-learning. The currently estimated policy (greedy with respect to the current Q) is plotted after every episode/rollout; the exploration of state-action pairs is epsilon-greedy.

Q-learning for windy cliff walk
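A self-contained sketch of tabular Q-learning with epsilon-greedy exploration on a windy cliff walk (grid orientation, sizes, rewards of -1 per step and -100 for the cliff, and all parameter values are illustrative assumptions; here the wind pushes towards the bottom row, where the cliff lies):

```python
import numpy as np

def windy_cliff_q_learning(width=8, height=4, wind=0.3, episodes=500,
                           alpha=0.5, gamma=0.95, eps=0.1, rng=None):
    """Tabular Q-learning: start bottom-left, goal bottom-right,
    cliff in between; with probability `wind` the agent is pushed
    one row towards the cliff after each move."""
    rng = rng or np.random.default_rng(0)
    start, goal = (height - 1, 0), (height - 1, width - 1)
    cliff = {(height - 1, c) for c in range(1, width - 1)}
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # up, down, left, right
    Q = np.zeros((height, width, 4))

    def step(s, a):
        di, dj = moves[a]
        i = min(max(s[0] + di, 0), height - 1)
        j = min(max(s[1] + dj, 0), width - 1)
        if rng.random() < wind:                     # pushed towards the cliff
            i = min(i + 1, height - 1)
        if (i, j) in cliff:
            return start, -100.0, False             # fall off, reset to start
        if (i, j) == goal:
            return goal, -1.0, True
        return (i, j), -1.0, False

    for _ in range(episodes):
        s, done, t = start, False, 0
        while not done and t < 500:                 # cap episode length
            t += 1
            a = rng.integers(4) if rng.random() < eps else int(np.argmax(Q[s]))
            s2, r, done = step(s, a)
            target = r if done else r + gamma * Q[s2].max()
            Q[s][a] += alpha * (target - Q[s][a])   # one-step Q-learning update
            s = s2
    return Q

Q = windy_cliff_q_learning()
```

In this sketch, stepping right from the start leads straight into the cliff, so the learnt greedy action at the start is to move away from the cliff first, exactly the safety-distance effect described above.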

Plain vanilla gradient descent: Python code for gradient descent on a quadratic function. Play around with the code by changing the learning rates and the function to be optimised!

Gradient descent on the quadratic
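The iteration can be sketched in a few lines (function and parameter names are illustrative):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Plain vanilla gradient descent: x <- x - lr * grad(x)."""
    x = x0
    path = [x]                      # keep the whole trajectory for plotting
    for _ in range(steps):
        x = x - lr * grad(x)
        path.append(x)
    return x, path

# minimise f(x) = x**2, whose gradient is 2x
x_min, path = gradient_descent(lambda x: 2 * x, x0=5.0, lr=0.1, steps=100)
```

For f(x) = x² the update is x ↦ (1 - 2·lr)·x, so the iterates contract for learning rates below 1 and diverge above 1; this is the effect to look for when playing with the step sizes.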

#### Further reading

Sutton & Barto: “Reinforcement Learning – An Introduction” is available online. It covers all the major ideas but skips essentially all details. In essence, this lecture course follows the core ideas of Sutton & Barto but tries to include as much of the missing mathematics as possible.