The Multi-Armed Bandit (MAB) problem is a foundational setup in reinforcement learning that strips the environment down to a single, repeatedly visited decision point with multiple actions (arms), each associated with an unknown reward distribution. The agent's task is to select arms over many rounds in a way that maximizes cumulative reward. There are no states or state transitions to reason about, only actions and rewards.
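To make the setup concrete, below is a minimal sketch of such an environment in Python, assuming Bernoulli (0/1) rewards; the class name `BernoulliBandit` and its interface are illustrative choices, not part of any standard library.

```python
import random


class BernoulliBandit:
    """A k-armed bandit: each arm pays 1 with a fixed, hidden probability."""

    def __init__(self, probs):
        self.probs = probs      # hidden reward probability of each arm
        self.k = len(probs)

    def pull(self, arm):
        """Return a stochastic reward (0 or 1) for the chosen arm."""
        return 1 if random.random() < self.probs[arm] else 0
```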
The term "multi-armed bandit" comes from the analogy to a gambler facing several slot machines (each a different "arm"), where each machine provides random rewards and the odds are unknown. The agent must choose between exploration (trying different arms to learn their value) and exploitation (choosing the best-known arm so far).
Despite its simplicity, the bandit problem captures a critical aspect of reinforcement learning: the exploration-exploitation trade-off. It also serves as a testbed for many core RL strategies, such as ε-greedy, UCB (Upper Confidence Bound), and Thompson Sampling. Variants of the bandit problem, like contextual bandits, introduce limited state information and are used in applications like recommendation systems and online advertising.
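To give one example of how such a strategy differs from ε-greedy, a common UCB variant (UCB1) scores each arm by its estimated value plus an optimism bonus that shrinks as the arm is sampled more often, so under-explored arms get tried without relying on random choice. The sketch below is one illustrative implementation; the exploration constant `c` is an assumed tunable parameter.

```python
import math


def ucb_select(values, counts, t, c=2.0):
    """Pick the arm maximizing estimated value plus a UCB1-style exploration bonus at round t."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm              # try every arm at least once
    return max(
        range(len(values)),
        key=lambda a: values[a] + c * math.sqrt(math.log(t) / counts[a]),
    )
```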