Introduction
A few weeks ago, the ACM announced the 2024 Turing Award (the Nobel equivalent in the field of computer science) for Andrew Barto and Richard Sutton, for their contributions to the field of reinforcement learning (RL).
I remember the year 2020 when I began exploring RL and bought a copy of the book Reinforcement Learning by Sutton and Barto, which is still on my nightstand today. I still believe it’s one of the best technology books I’ve ever read for its approach, completeness, and depth in such a complex and fascinating subject.
In this article, I aim to introduce the theory behind RL and how we can create agents that interact with their environment and learn—by failing, exploring, and improving, attempt after attempt—until they master the target task and maximize their reward.

What is RL?
In the field of machine learning, learning tasks are typically divided into three types:
- Supervised learning: We have a set of features \(x\) and a target \(y\), and we seek a function \(f\) such that \(y=f(x)\). For example, building a classifier that determines whether a forearm X-ray shows a fracture.
- Unsupervised learning: We don’t have labels \(y\), only a set of features \(x\) in our dataset, and we try to explain it. For example, clustering clients based on similar consumption patterns.
- Reinforcement learning: In this scenario, we have an automated agent that interacts with its environment by taking actions and receives feedback based on its current state and selected action. These feedback signals are rewards that the agent aims to maximize. For instance, controlling the steering, acceleration, and braking of an autonomous vehicle on a highway.
In this article, we will focus on the last type, which we will model mathematically to produce algorithms that can learn based on the reinforcement signals received. RL presents several challenges, such as delayed reward signals (in a chess game, you only find out if you won, lost, or drew at the end), continuous environments (an automated trading agent doesn’t operate in finite episodes, but continuously interacts), the exploration/exploitation trade-off, and the infamous curse of dimensionality (as dimensional space grows, the amount of training data needed increases exponentially).
Markov Decision Processes (MDPs)
We begin by modeling our agent's interactions with its environment using a Markov Decision Process (MDP).
These are stochastic processes that satisfy the Markov property: the probability of reaching a new state after taking an action depends only on the current state, not the history of previous states.
This may initially seem restrictive, but it’s not. It’s a simplification for mathematical modeling that can represent the states we work with in real environments. Sometimes this is evident, as in chess, where the next best move depends only on the current board layout. In other cases, like trading agents, the evolution of asset prices over time matters. However, we can encode this information into the state, such as using technical indicators or learned features from history, or even a list of recent closing prices.
Formally, an MDP consists of a set of states \(S = \{s_{1},\ldots,s_{n}\}\), actions \(A=\{a_{1},\ldots,a_{m}\}\), rewards \(R=\{r_{1},\ldots,r_{k}\}\), and a probability function \(P(s', r | s, a)\) describing the likelihood, when in state \(s\) and taking action \(a\), of reaching state \(s'\) with reward \(r\). By the law of total probability, for every pair \((s,a)\) we must have \(\sum\limits_{s'}\sum\limits_{r}P(s', r|s, a) = 1\).
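To make the definition concrete, here is a minimal sketch of an MDP encoded as a Python dictionary (the states, actions, and numbers are illustrative, not from the article), together with a check of the total-probability condition above:

```python
# A tiny illustrative MDP: P[(s, a)] is a list of
# (next_state, reward, probability) outcome triples.
P = {
    ("sunny", "walk"): [("sunny", 1.0, 0.8), ("rainy", -1.0, 0.2)],
    ("sunny", "stay"): [("sunny", 0.5, 1.0)],
    ("rainy", "walk"): [("sunny", 1.0, 0.3), ("rainy", -1.0, 0.7)],
    ("rainy", "stay"): [("rainy", 0.0, 1.0)],
}

# Law of total probability: for every (s, a), the outcome
# probabilities must sum to 1.
for (s, a), outcomes in P.items():
    total = sum(prob for _, _, prob in outcomes)
    assert abs(total - 1.0) < 1e-9, f"P(.|{s},{a}) sums to {total}"
```

The Markov property is built into this representation: the outcome distribution depends only on the current \((s, a)\) pair, never on earlier history.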
While interacting with its environment, our agent generates a sequence of states, actions, and rewards: \(s_{1}, a_{1}, r_{2}, s_{2}, a_{2}, r_{3}, s_{3}, a_{3}, r_{4}, \ldots\) But what is our goal? To maximize immediate rewards \(r_{2}, r_{3}, r_{4}\)? Clearly not. We aim to maximize the total future return, defined at time \(t\) as \(G_{t} = r_{t+1} + r_{t+2}+\ldots = \sum\limits_{i=t+1}^{\infty}r_{i}\).
This definition, however, creates a problem: in continuous, non-episodic environments, the agent may continue to accumulate rewards forever, leading to infinite returns over infinite time. For our trader, earning \(0.1\%\) annually is mathematically the same as earning \(10\%\), given infinite time to exploit it. A clever solution is to introduce a geometric discount factor that transforms the infinite sum into a finite, optimizable objective. Recall the geometric series: \(\sum\limits_{i=0}^{\infty}\gamma^{i} = \frac{1}{1-\gamma}\) if \(0\leq\gamma<1\). Applying this discount factor, we get: \(G_{t} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3}+\ldots = \sum\limits_{i=t+1}^{\infty}\gamma^{i-t-1}r_{i}\).
And an important identity from this is: \(G_{t} = r_{t+1}+\gamma(r_{t+2} + \gamma r_{t+3} + \ldots) = r_{t+1} + \gamma G_{t+1}\).
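As a short sketch of how this looks in code (the reward sequence is illustrative), iterating backwards over the rewards applies the identity \(G_{t} = r_{t+1} + \gamma G_{t+1}\) at every step:

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted return G_0 for a finite sequence of rewards r_1, r_2, ..."""
    g = 0.0
    for r in reversed(rewards):
        # The recursive identity: G_t = r_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

g0 = discounted_return([1.0, 0.0, 2.0])  # 1 + 0.9*0 + 0.81*2 = 2.62
```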
Policies and Value Functions
In the previous section, we defined a mathematical model for our interactions with the environment, but we have not yet defined our learning objective (we’re getting there). What is it that we need to achieve? How can we distinguish a better state from a worse one? Which chessboard configuration is better if I present two configurations? Clearly, the one that gives me a higher probability of winning. This is what we will formalize now.
We now define a policy as the probability of taking an action given a state: \(\pi(a|s) = p(a|s)\). We define the policy stochastically. Given a state, we don’t always need to take the same action deterministically—it may be better to act probabilistically to maximize our total gain. This is evident in game theory (perhaps a topic for a future article).
Now that we have our learning goal (a policy, still to be determined), we define the value of states and actions under that policy. The value of a state is not absolute; it depends on the policy the agent follows to extract that value. We define the value of a state \(s\) as the expected total return obtained by being in state \(s\) and following policy \(\pi\):
\(V_{\pi}(s) = E_{\pi}[G_{t}|s]=\sum\limits_{a}p(a|s)\sum\limits_{s'}\sum\limits_{r}P(s',r|s,a)(r + \gamma V_{\pi}(s'))\)
This equation weights the actions available in state \(s\) by their probabilities under the policy to compute the expected return of taking them. That return accounts for every destination state and reward according to the MDP dynamics, combining the immediate reward with the discounted value of the state we land in. Note that, just as we did earlier with the return, the current value is defined in terms of the next value. This is a dynamic-programming definition, and it yields a system of linear equations (one per state) that can be solved to find \(V_{\pi}\).
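As a sketch of this idea (the two-state MDP and the uniform random policy are illustrative), iterative policy evaluation simply applies the equation above as an update rule until the values stop changing:

```python
# Illustrative two-state MDP: P[(s, a)] -> list of (s', reward, prob).
P = {
    ("A", "go"):   [("B", 1.0, 1.0)],
    ("A", "stay"): [("A", 0.0, 1.0)],
    ("B", "go"):   [("A", 0.0, 1.0)],
    ("B", "stay"): [("B", 2.0, 1.0)],
}
states, actions, gamma = ["A", "B"], ["go", "stay"], 0.9
pi = {s: {a: 0.5 for a in actions} for s in states}  # uniform random policy

V = {s: 0.0 for s in states}
for _ in range(2000):  # sweep until the update stops changing V
    delta = 0.0
    for s in states:
        v = sum(pi[s][a] * sum(p * (r + gamma * V[s2])
                               for s2, r, p in P[(s, a)])
                for a in actions)
        delta, V[s] = max(delta, abs(v - V[s])), v
    if delta < 1e-12:
        break
```

Because the update is a \(\gamma\)-contraction, the sweep converges geometrically; alternatively, the same system of linear equations can be solved directly.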
Just as we define the value of a state, we can define the value of an action given a state, commonly known as the Q-value in RL. This definition follows from the previous equation, simply by rearranging terms.
\(Q_{\pi}(s,a) = E_{\pi}[G_{t}|s,a] = \sum\limits_{s'}\sum\limits_{r}P(s', r|s, a)(r + \gamma\sum\limits_{a'}p(a'|s')Q_{\pi}(s',a'))\)
Optimality
The previous equations are known as the Bellman equations, and they allow us to solve an MDP, either as a system of linear equations or numerically, approximating the values with good convergence guarantees and reasonable runtime.
Finally, the optimal policy is the one that maximizes \(V(s)\) or \(Q(s,a)\) for all states and actions. This policy is denoted \(\pi^{*}\), and its associated value functions are \(V^{*}(s)\) and \(Q^{*}(s,a)\). That is, \(V^{*}(s)=\max\limits_{\pi}V_{\pi}(s)\) and \(Q^{*}(s,a)=\max\limits_{\pi}Q_{\pi}(s,a)\).
With these definitions, we can define the Bellman optimality equations as the values generated by taking the best possible actions, as follows.
\(V^{*}(s) = \max\limits_{a}\sum\limits_{s'}\sum\limits_{r}P(s', r|s,a)(r + \gamma V^{*}(s'))\)
\(Q^{*}(s,a) = \sum\limits_{s'}\sum\limits_{r}P(s',r|s,a)(r + \gamma\max\limits_{a'}Q^{*}(s', a'))\)
The previous equations no longer represent a system of linear equations (since we introduced the nonlinear max function). However, those familiar with optimization problems know that they can still be solved using linear programming techniques. A much more common approach, though, is to use iterative methods to solve the optimal Bellman equations, since the performance of linear programming algorithms tends to degrade with the number of constraints (in our case, the number of possible states). Iterative methods aim to solve the Bellman equations by applying them repeatedly to approximate and improve policies until reaching the optimum—or by directly approximating the value function and then deriving the optimal policy. These techniques come with convergence guarantees, are very simple to implement (we’ve already done the hard work of deriving the equation—now it’s just about applying it), and they typically converge in reasonable time.
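A sketch of one such iterative method, value iteration, which repeatedly applies the Bellman optimality backup for \(V^{*}\) (the two-state MDP below is illustrative):

```python
def value_iteration(P, states, actions, gamma=0.9, tol=1e-12):
    """Repeatedly apply the Bellman optimality backup until convergence."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(sum(p * (r + gamma * V[s2]) for s2, r, p in P[(s, a)])
                       for a in actions)
            delta, V[s] = max(delta, abs(best - V[s])), best
        if delta < tol:
            return V

# Illustrative MDP: staying in "B" pays 2 per step, so the optimal
# values are V*(B) = 2 / (1 - 0.9) = 20 and V*(A) = 1 + 0.9 * 20 = 19.
P = {
    ("A", "go"):   [("B", 1.0, 1.0)],
    ("A", "stay"): [("A", 0.0, 1.0)],
    ("B", "go"):   [("A", 0.0, 1.0)],
    ("B", "stay"): [("B", 2.0, 1.0)],
}
V = value_iteration(P, ["A", "B"], ["go", "stay"])
```

The greedy policy with respect to the converged \(V^{*}\) is then an optimal policy.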
So, are we done? Not at all. Solving the Bellman equations assumes a very demanding requirement, one that rarely holds in real-world RL problems: knowing the dynamics of the MDP, that is, the probability distribution \(P(s', r|s, a)\). Up to this point, everything we've seen has been planning techniques that assume knowledge of the environment’s dynamics and optimize the actions to take. But we haven’t yet entered the world of learning. An agent truly learns when it doesn’t know the environment’s dynamics and must explore it. Through this interaction, it generates the experience needed to discover the optimal policy.
In the following sections, we will move from planning problems to learning, and derive one of the most important algorithms in RL.
Exploring an Unknown Environment
In practice, we often don’t know the environment's dynamics. We must explore to learn them while also trying to find the optimal policy.
How much should we explore versus exploit? This is addressed by GLIE algorithms (greedy in the limit with infinite exploration), which explore all state-action pairs infinitely often but eventually converge to optimal actions. A common GLIE method is \(\epsilon\)-greedy: with a small probability \(\epsilon\), take a random action; with probability \(1-\epsilon\), take the best-known action. Decaying \(\epsilon\) over time achieves the balance needed for convergence.
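A minimal sketch of \(\epsilon\)-greedy action selection (the `Q` table layout, keyed by state-action pairs, is the illustrative one used in these examples):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Explore with probability epsilon; otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(actions)  # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```

A common decay schedule sets \(\epsilon = 1/k\) at episode \(k\), which satisfies the GLIE conditions for finite MDPs.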
Temporal Difference Learning
With exploration defined and our goal in sight, we now formulate learning iteratively. We build a \(Q(s,a)\) table of state-action values and update it as follows:
\(Q(s_{t},a_{t}) \leftarrow (1-\alpha)Q(s_{t},a_{t}) + \alpha G_{t} = Q(s_{t}, a_{t}) + \alpha[G_{t} - Q(s_{t}, a_{t})]\)
Here, \(\alpha\) is the learning rate, which blends the newly observed return with past estimates. Since environments are stochastic, we must average over many observations. In theory, \(\alpha\) must satisfy certain conditions to guarantee convergence, which we won’t explore here; in practice, a small constant \(\alpha\) usually suffices.
The term that complicates the previous equation is \(G_{t}\), because it requires updating the value table at each step based on the future return I will obtain in the episode (or to infinity for continuous problems!). This would make learning very slow, as it must wait for complete episodes to finish before starting to update the values. To solve this problem, Sutton proposed in a famous 1988 paper the Temporal Difference learning methods (TD-learning). Basically, temporal difference learning updates new estimates based on the approximations of the next step (or the \(n\) next steps), which is known in RL as bootstrapping. At the equation level, the modification is simple, and Sutton showed that although this estimate is biased, it converges to the correct value in the limit.
\(Q(s_{t},a_{t}) \leftarrow Q(s_{t}, a_{t}) + \alpha[r_{t+1} + \gamma\max\limits_{a} Q(s_{t+1}, a) - Q(s_{t}, a_{t})]\)
This update rule forms the basis of more advanced RL algorithms such as DQN, which achieved human-level play on Atari games, and of the methods behind AlphaGo, which defeated Lee Sedol at Go.
Q-learning
Finally, we present the Q-learning algorithm by Watkins (1989), which learns optimal policies in finite MDPs without knowing the dynamics. Pseudocode for Q-learning with \(\epsilon\)-greedy exploration:
Initialize \(Q(s, a)\) arbitrarily (terminal states \(\to 0\))
For each episode:
Start in state \(s = s_{0}\)
Until the episode ends:
Choose action \(a\) via an \(\epsilon\)-greedy policy derived from \(Q\)
Take action \(a\), observe \(r\) and \(s'\)
\(Q(s,a) \leftarrow Q(s,a) + \alpha(r + \gamma\max\limits_{a'}Q(s', a') - Q(s,a))\)
\(s \leftarrow s'\)
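The pseudocode translates almost directly into Python. Below is a sketch on a hypothetical two-state MDP (states, rewards, and hyperparameters are all illustrative): note that the agent interacts only through `step` and never reads the dynamics `P` directly.

```python
import random

random.seed(0)  # for reproducibility of this sketch

# Hidden dynamics: P[(s, a)] -> list of (next_state, reward, probability).
P = {
    ("A", "go"):   [("B", 1.0, 1.0)],
    ("A", "stay"): [("A", 0.0, 1.0)],
    ("B", "go"):   [("A", 0.0, 1.0)],
    ("B", "stay"): [("B", 2.0, 1.0)],
}
states, actions = ["A", "B"], ["go", "stay"]
gamma, alpha = 0.9, 0.1

def step(s, a):
    """Sample (s', r) from the environment; the agent never sees P."""
    outcomes = P[(s, a)]
    s2, r, _ = random.choices(outcomes, weights=[p for _, _, p in outcomes])[0]
    return s2, r

Q = {(s, a): 0.0 for s in states for a in actions}
for episode in range(500):
    s, epsilon = "A", 1.0 / (episode + 1)  # GLIE-style decay
    for _ in range(50):  # truncate the continuing task into episodes
        if random.random() < epsilon:
            a = random.choice(actions)  # explore
        else:
            a = max(actions, key=lambda a_: Q[(s, a_)])  # exploit
        s2, r = step(s, a)
        # The Q-learning update: bootstrap on the best next action
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, a_)] for a_ in actions)
                              - Q[(s, a)])
        s = s2

# The greedy policy extracted from the learned Q-table.
policy = {s: max(actions, key=lambda a_: Q[(s, a_)]) for s in states}
```

With enough episodes, `policy` should learn to choose `go` in A and `stay` in the high-reward state B, matching what planning on the known dynamics would prescribe.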
In less than ten lines of code, we can learn any finite MDP without knowing its dynamics. For high-dimensional or continuous problems, Q-learning is often impractical. Instead, we use function approximators (like neural networks) or policy optimization techniques directly.
This opens the door to deep RL, such as DQN for high-dimensional state spaces, material for a future post. For now, I recommend watching the AlphaGo documentary and David Silver’s RL lectures on YouTube. Most of all, read Sutton and Barto’s Reinforcement Learning book; it inspired this article.