Offsiteteam
Reinforcement Learning
July 24, 2025

The Simplest Reinforcement Learning setup: Multi-Armed Bandits

People usually distinguish between three types of AI algorithms: supervised, unsupervised, and Reinforcement Learning.

The separation between the first two is typically straightforward:

  • Supervised learning requires a clearly labeled dataset — a set of training examples where we know the correct answer. For example, given a house’s size, location, and number of rooms, we know whether someone bought it or not.
  • Unsupervised learning, on the other hand, deals with data that has no labels. The goal is to uncover hidden structure in the data. For example, we might have a dataset of drugs (or associated targets, genes, etc.), each with several associated properties, and we want to group them meaningfully, even though we don’t know how many groups there are or what those groups represent. The result is a set of groups, and later, by manual investigation, we might decide that one group is, say, a metabolic drug group, another is a cardiovascular drug group, and so on.

Reinforcement Learning (RL) is different. In RL, we do have a notion of feedback (like labels), but this feedback is delayed, not immediate as in supervised learning. Imagine navigating a maze or playing a game: you can observe the final outcome (win or lose), but you don’t know whether each individual step was correct or not. You need a policy that tells you how good or bad your current step is, based on the ultimate goal (winning or losing).


Multi-Armed Bandits: The Simplest RL Setup

The journey into the world of RL often begins with the so-called k-armed bandit problem — named after the classic casino slot machines known as “one-armed bandits”, which are operated by pulling a lever on the side.

Originally, these machines had only one arm, but in RL, people generalize this idea to k arms. Each arm provides a monetary reward drawn from its own (unknown) probability distribution — usually assumed to be normal.

You know neither the true expectation/variance nor even the empirical mean/sample variance of those distributions. Your goal is to gradually discover a good policy — a strategy that tells you which arm to pull in order to maximize your total monetary reward over time.

K-armed bandit

The challenge here lies in the randomness of the reward. Suppose you try each arm once and receive the following rewards: 1, 2, 1, 3. A naive greedy approach would suggest always choosing the arm that gave you a reward of 3. But that value might have come from the tail of the distribution and be an extremely lucky case. It might never happen again, and future rewards from that arm may be much lower. So you need a more robust strategy that handles this uncertainty.
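To make this concrete, here is a small simulation (a throwaway sketch; the four normal reward distributions and their means are made-up assumptions) showing how often a player who pulls each arm once and then commits to the best-looking one ends up on a suboptimal arm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true mean rewards of four arms (unknown to the player).
true_means = np.array([1.0, 2.0, 1.5, 1.2])
best_arm = int(np.argmax(true_means))

misled = 0
trials = 10_000
for _ in range(trials):
    # Pull each arm once; rewards are noisy samples around the true means.
    first_rewards = rng.normal(loc=true_means, scale=2.0)
    # A naive greedy player commits to the arm with the single best sample.
    if int(np.argmax(first_rewards)) != best_arm:
        misled += 1

print(f"Greedy-after-one-pull picks a suboptimal arm in "
      f"{100 * misled / trials:.1f}% of runs")
```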

This leads to the introduction of essential RL concepts like exploration vs. exploitation — that is, the trade-off between trying out new arms to gather information (exploration) and using the arm that currently seems best (exploitation).

No states in bandits: An important feature of the bandit problem is that it involves no notion of state. The reward distributions are either fixed (stationary) or might evolve over time (non-stationary), but they do so independently of your actions — pulling a particular arm doesn’t affect the environment’s state.

Therefore, in the k-armed bandit problem, we only need to worry about rewards, not states (this is what differentiates bandits from full RL problems).

In contrast, in general reinforcement learning problems, the concept of state is fundamental. For example, in a game-playing scenario, the agent must consider the current board configuration (the state) before choosing an action. However, the simplicity of the bandit setup makes it ideal for learning the core ideas of RL, such as the exploration/exploitation trade-off, state values, and action values.


Methods of Solving the k-Arm Bandit Problem

Action-Value Methods

We start from defining an important RL concept: Action Value. Let’s imagine we know the true distributions of all arms. In that case, solving the bandit problem would be trivial — just choose the arm with the highest expected reward.

We can formulate this idea with the following equation:

\( \LARGE{ q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a] } \)


This equation basically means: we define the true value of an action, \( q_*(a) \), as the expected reward at time t, given that action a was chosen. How much money do we expect from pulling arm a? In other words, the value of an action is the expectation (or average) of all rewards received when always choosing the same action a.

But the problem is that we don’t know the true \( q_*(a) \), so we must interact with the bandit (try each arm) and estimate the expected reward from data. We call this estimate at time \( t \) \( Q_t(a) \). The algorithms used to estimate these expectations effectively are called action-value methods.
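For instance (a made-up sequence of pulls, just for illustration): if arm a has been pulled three times and returned rewards 1, 3, and 2, the estimate is

\( \large{ Q_t(a) = \frac{1 + 3 + 2}{3} = 2 } \)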

We could take a brute-force approach and just try every action many times to get a reliable estimate of each arm’s reward distribution, but there are two problems with this:

  1. In real-life settings, we may have many arms, and trying each one many times may not be feasible.

  2. Remember, we are learning on the go — i.e., we are playing and learning at the same time. So by choosing poor arms too often, we lose rewards (our money!). Therefore, we need to learn quickly with minimal loss.

An effective way of solving the bandit problem has two parts:

  • an update rule to update Action Value estimates after each step

  • a strategy to choose the next action effectively.

Let’s start from the update rule.

After each interaction with our bandit (selecting an action), we could recalculate the empirical mean for each arm anew as

\( \LARGE{ Q_n \doteq \frac{R_1 + R_2 + \cdots + R_{n-1}}{n - 1} } \)

That would be correct but inefficient, as it requires summing all previous rewards again and again.

So, it makes sense to reuse the previously calculated value instead of summing all rewards again and again. We can do it with the following update rule (I will not show the derivation here, but it can be derived directly from the equation above):

\( \LARGE{ Q_{n+1} = \frac{1}{n} \sum_{i=1}^n R_i = Q_n + \frac{1}{n} [R_n - Q_n] } \)

Each term in this equation has its own name:

\( \large{ \textit{NewEstimate} \leftarrow \textit{OldEstimate} + \textit{StepSize} \left[ \textit{Target} - \textit{OldEstimate} \right] } \)

Note that \( StepSize = 1/n \), so it changes at every step, effectively becoming smaller at later stages.
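As a quick sanity check (a minimal sketch in Python; the reward sequence is made up), the incremental rule gives exactly the same value as recomputing the full average:

```python
# Minimal sketch: incremental sample-average update vs. full recomputation.
rewards = [1.0, 3.0, 2.0, 2.0, 5.0]

q = 0.0  # current estimate Q_n (starts at 0, before any reward is seen)
for n, r in enumerate(rewards, start=1):
    step_size = 1.0 / n
    q = q + step_size * (r - q)     # NewEstimate = Old + StepSize * (Target - Old)

print(q)                            # 2.6
print(sum(rewards) / len(rewards))  # 2.6 -- identical to the full average
```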

Now, how do we choose the next action?

If we knew the true expectations for each arm, the task would be trivial: just be greedy — always select the arm with the largest expected reward. But since we do not know the true expected rewards and only have estimates based on a few samples, a large estimated mean might arise simply by chance, not because the arm is actually better. If we always exploit that arm, we might get stuck with a suboptimal choice until enough samples bring its estimated mean back down.

To address this, people use the so-called ε-greedy approach: once in a while, forget about being greedy and pull a random arm, doing so with a small probability ε.

This allows the algorithm to improve its estimates of all arms, even those not currently believed to be best, some of which might turn out to be better.

This strategy reflects the exploration vs. exploitation trade-off — a central idea in Reinforcement Learning.

Congratulations — we’ve just built the simplest reinforcement learning algorithm!
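Here is a minimal sketch of that ε-greedy agent in Python (the four arms, their normal reward distributions, and ε = 0.1 are illustrative assumptions, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 4-armed bandit: each arm's reward ~ Normal(true_mean, 1).
true_means = np.array([1.0, 2.0, 1.5, 1.2])
k = len(true_means)

epsilon = 0.1          # probability of exploring a random arm
steps = 5_000

q = np.zeros(k)        # action-value estimates Q(a)
pulls = np.zeros(k)    # how many times each arm has been pulled
total_reward = 0.0

for _ in range(steps):
    # Epsilon-greedy selection: explore with probability epsilon, otherwise exploit.
    if rng.random() < epsilon:
        a = int(rng.integers(k))
    else:
        a = int(np.argmax(q))

    reward = rng.normal(true_means[a], 1.0)
    total_reward += reward

    # Incremental sample-average update of Q(a).
    pulls[a] += 1
    q[a] += (reward - q[a]) / pulls[a]

print("Estimated Q:", np.round(q, 2))
print("True means :", true_means)
print("Average reward per step:", total_reward / steps)
```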


Gradient Bandit Algorithm

A fundamentally different approach: instead of estimating the Action Value \( Q(a) \) at each step and making decisions based on it, we can use another quantity — the so-called preference \( H_t(a) \) — to choose our next action.

The preference of each action at a given time, \( H_t(a) \), is an artificial quantity: just a number assigned to every action at each step to indicate which action is preferred at that moment.


Policy: Making a Decision over the Next Action

To clarify this approach, let’s start by defining a Policy as follows:

\( \LARGE{ \Pr\{A_t = a\} \doteq } \huge{ \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} } \LARGE{ \doteq \pi_t(a) } \)

A Policy is a probability distribution over all actions that reflects our preference for each action at a given step. In simpler terms, we are more likely to take actions with higher probabilities. (Note: during action selection, we still choose actions randomly according to this distribution — not always the one with the highest probability. This inherently provides some level of exploration.)

You may also notice that this newly introduced concept — Preference, \( H(a) \) — is similar to how people use logits in Logistic regression: first, we compute the logit scores and then convert them into real probabilities using the softmax function. You can think of Preferences as just some kind of scores, which we then pass through softmax to produce actual action probabilities.

With this policy defined, we are fully equipped to make decisions — we simply generate the next action according to the probabilities defined by the policy.
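A minimal sketch of this selection step (the preference values here are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical preferences H_t(a) for a 4-armed bandit.
H = np.array([0.5, 1.5, 0.0, -0.3])

# Softmax turns preferences into a probability distribution (the policy).
pi = np.exp(H - H.max())   # subtract the max for numerical stability
pi /= pi.sum()

# Sample the next action according to the policy, not greedily.
action = rng.choice(len(H), p=pi)

print("policy pi :", np.round(pi, 3))
print("chosen arm:", action)
```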


Learning the Policy

Now, we need a way to learn this policy.

To optimize our policy, we define an objective function, and in our case (not surprisingly!) it is the expected reward under a given policy. This makes sense because the goal in Reinforcement Learning is to maximize the cumulative reward (we need more money!). So, by the definition of expectation, we can formulate it by summing over all action values \( Q(a) \), weighted by their probabilities under the policy.

\( \LARGE{ \mathbb{E}[R_t] = \sum_{a} \pi_t(a) \cdot Q(a) } \)

Do you feel the difference from the previous method? Previously we just chose the arm with the largest \( Q(a) \), with an occasional random exploration move, but now we are going to learn a policy that will not only pick the best action but also balance exploration and exploitation automatically!


Gradient Ascent for Policy Optimization

To improve the policy over time, we aim to adjust the preferences \( H_t(a) \) (remember, they sit inside the policy definition) so that the policy gradually shifts toward actions with higher expected rewards.

Why do we need that preference \( H(a) \) at all? It is a trick, actually: it gives us differentiable action probabilities, which is essential for applying gradient-based methods.

OK, now we want to maximize this objective function with respect to the preferences \( H_t(a) \), using gradient ascent.

Remember the idea of Gradient descent? It is a method used to minimize a function by moving in the direction of the negative gradient — that is, the direction of steepest decrease.

For example, the gradient descent method for logistic regression is distilled into a parameter update rule like this:

\( \large{ \textit{NextParameter} = \textit{CurrentParameter} - \alpha * \nabla (\textit{ObjectiveFunction}) } \)

In contrast, Gradient Ascent moves in the direction of the positive gradient to maximize a function:

\( \large{ \textit{NextParameter} = \textit{CurrentParameter} + \alpha * \nabla (\textit{ObjectiveFunction}) } \)

So, for the gradient bandit algorithm and our policy formulation with preferences, the update rule in general form is

\( \LARGE{ H(a) \leftarrow H(a) + \alpha \cdot \nabla_{H(a)} \mathbb{E}[R_t] } \)

Remember, the expected reward in our formulation has the policy inside, and the policy contains the softmax function. The derivative of the softmax has two cases (derivation omitted):

\( \Huge{ \frac{\partial \pi(a)}{\partial H(b)} } = \LARGE{ \begin{cases} \pi(a)(1 - \pi(a)) & \text{if } a = b \\ -\pi(a)\pi(b) & \text{if } a \ne b \end{cases} } \)
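If you want to check these two cases without redoing the algebra, a quick finite-difference comparison (a throwaway sketch, not part of the algorithm itself) confirms that the analytic Jacobian matches numerical derivatives:

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

H = np.array([0.2, -1.0, 0.7])
pi = softmax(H)

# Analytic Jacobian: d pi(a)/d H(b) = pi(a)(1 - pi(a)) if a == b, else -pi(a) pi(b).
analytic = np.diag(pi) - np.outer(pi, pi)

# Numerical Jacobian via central finite differences.
eps = 1e-6
numeric = np.zeros((3, 3))
for b in range(3):
    h_plus, h_minus = H.copy(), H.copy()
    h_plus[b] += eps
    h_minus[b] -= eps
    numeric[:, b] = (softmax(h_plus) - softmax(h_minus)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # ~1e-10: the two agree
```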

This derivative leads us to two update rules:

  • one for the selected action a (we want to increase its probability, so gradient ascent)

  • one for every other action b (we want to decrease their probabilities, so descent)

\( \LARGE{ \begin{align} H_{t+1}(A_t) &\doteq H_t(A_t) + \alpha (R_t - \bar{R}_t)(1 - \pi_t(A_t)) \\ H_{t+1}(a) &\doteq H_t(a) - \alpha (R_t - \bar{R}_t)\pi_t(a), \quad a \ne A_t \end{align} } \)

Here

  • \( R_t \) - Actual reward received
  • \( \bar{R}_t \) - the baseline reward. It is another trick: often the running average of previous rewards, it reduces the variance of the updates (was this reward higher or lower than what we normally get?). It can be either zero or the average reward so far.
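Putting it all together, here is a minimal sketch of one gradient bandit learning loop (the arm distributions, α = 0.1, and the running-average baseline are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical 4-armed bandit with normally distributed rewards.
true_means = np.array([1.0, 2.0, 1.5, 1.2])
k = len(true_means)

alpha = 0.1            # step size
H = np.zeros(k)        # preferences H_t(a)
baseline = 0.0         # running average reward, used as the baseline
steps = 5_000

for t in range(1, steps + 1):
    # Policy: softmax over preferences.
    pi = np.exp(H - H.max())
    pi /= pi.sum()

    # Sample an action from the policy and observe a reward.
    a = rng.choice(k, p=pi)
    reward = rng.normal(true_means[a], 1.0)

    # Gradient bandit update: raise the chosen action's preference, lower the rest.
    one_hot = np.zeros(k)
    one_hot[a] = 1.0
    H += alpha * (reward - baseline) * (one_hot - pi)

    # Incremental running average as the baseline.
    baseline += (reward - baseline) / t

print("Final policy:", np.round(pi, 3))  # should concentrate on the best arm (index 1)
```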

Conclusion

Gradient bandits represent the simplest form of a Reinforcement Learning (RL) setup, but they allow us to introduce many important RL concepts, such as action values, policies, the exploration/exploitation trade-off, and gradient-based policy optimization.

However, it’s important to remember that in the bandit formulation, the RL setup does not include a state. This makes it unsuitable for most real-world scenarios — like playing games or navigating environments — where the situation (state) changes over time.

Still, gradient bandits are extremely useful for understanding the foundational principles of Reinforcement Learning in a clear and focused setting.


References

  1. Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction. 2018