The Multi-Armed Bandit Problem—A Beginner-Friendly Guide
Understanding the exploitation-exploration trade-off with an example
A Multi-Armed Bandit (MAB) is a classic decision-making problem in which an agent must repeatedly choose between multiple options (called “arms”) to maximize its total reward over a series of trials. The problem gets its name from a metaphor involving a gambler at a row of slot machines (one-armed bandits), each with a different but unknown probability of paying out. The goal is to find the best strategy for pulling the arms (selecting actions) so as to maximize the gambler’s overall reward over time. At its heart, the MAB problem is a formalization of the exploration-exploitation trade-off.
The Multi-Armed Bandit problem is a foundational problem that arises in numerous industrial applications. Let’s explore it and examine interesting strategies for solving it.

Example
You’ve just arrived in a new city. You’re a spy and plan to stay for 120 days to complete your next assignment. There are three restaurants in town: Italian, Chinese, and Mexican. You want to maximize your dining satisfaction during your stay. However, you don’t know which restaurant will be the best for you. Here’s how the three restaurants stack up:
- Italian restaurant: Average satisfaction score of 8/10
- Chinese restaurant: Average satisfaction score of 6/10
- Mexican restaurant: Average satisfaction score of 9/10
The catch is that you don’t know these satisfaction scores when you start. What would be your strategy to pick the best restaurant over your 120 dinners?
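Before comparing strategies, it helps to have a simulator for this scenario. The sketch below is a minimal, hypothetical model: the article only gives average scores, so here each dinner is assumed to draw a noisy satisfaction score around those averages (the noise model is my assumption, not part of the original example).

```python
import random

# True average satisfaction per restaurant (unknown to the diner).
TRUE_MEANS = {"Italian": 8.0, "Chinese": 6.0, "Mexican": 9.0}

def visit(restaurant, rng=random):
    """Simulate one dinner: the true average plus some Gaussian noise.

    The noise (standard deviation 1) is an assumption for illustration;
    the article only specifies the averages.
    """
    return TRUE_MEANS[restaurant] + rng.gauss(0, 1)
```

Any strategy you try over the 120 days can now be evaluated by calling `visit` once per dinner and summing the scores.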
Exploration
Let’s say you explore all three restaurants equally; in other words, you visit each restaurant for 40 days. The expected total satisfaction is (40 * 8 + 40 * 6 + 40 * 9) = 920, which works out to an average satisfaction of about 7.67 per day. Is this an optimal strategy? If you had picked only the Mexican restaurant, you would have an average satisfaction of 9!
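The arithmetic above can be checked in a couple of lines:

```python
# Pure exploration: 40 dinners at each restaurant over 120 days.
means = {"Italian": 8, "Chinese": 6, "Mexican": 9}

explore_total = sum(40 * m for m in means.values())  # 40*8 + 40*6 + 40*9
explore_avg = explore_total / 120                    # average per day

# Always picking the best restaurant (Mexican) would average 9 per day.
best_avg = max(means.values())
```

The gap between `explore_avg` (about 7.67) and `best_avg` (9) is the price of spending a third of your dinners at each restaurant regardless of quality.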
Exploration and Exploitation
You don’t want to explore too much. At the same time, you don’t want to choose one restaurant randomly and visit it all the time. You…
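One common way to balance the two (not necessarily the strategy this article goes on to describe) is epsilon-greedy: with a small probability you explore a random restaurant, and otherwise you exploit the one with the best average score so far. A minimal sketch, where `pull` is any function mapping an arm index to a reward:

```python
import random

def epsilon_greedy(pull, n_arms, n_rounds, epsilon=0.1, rng=random):
    """Epsilon-greedy bandit strategy.

    With probability `epsilon`, pick a random arm (explore); otherwise
    pick the arm with the highest average reward so far (exploit).
    Each arm is tried once before any exploitation.
    """
    counts = [0] * n_arms    # times each arm was pulled
    totals = [0.0] * n_arms  # cumulative reward per arm
    for _ in range(n_rounds):
        untried = [a for a in range(n_arms) if counts[a] == 0]
        if untried:
            arm = untried[0]                 # try every arm at least once
        elif rng.random() < epsilon:
            arm = rng.randrange(n_arms)      # explore
        else:                                # exploit the best average so far
            arm = max(range(n_arms), key=lambda a: totals[a] / counts[a])
        reward = pull(arm)
        counts[arm] += 1
        totals[arm] += reward
    return counts, totals
```

Run on the restaurant example (arms 0, 1, 2 with averages 8, 6, 9), the Mexican restaurant ends up with the large majority of the 120 visits, while the occasional random dinner keeps checking that the other two haven’t been misjudged.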