
Love multi-armed bandits? Meet their smarter cousins

Written by George Khachatryan
Published 6 May 2024


Many sophisticated marketers rely on multi-armed bandits, which can be thought of as “smart” A/B tests that dynamically redistribute customers as they go. While powerful, multi-armed bandits have some significant limitations. A more general type of AI testing, contextual bandits, can deliver substantial performance improvements over what is possible with traditional multi-armed bandits.

What is a multi-armed bandit (MAB)?

Imagine you’re at a casino, facing a row of ten slot machines. The machines vary in how likely they are to pay out: some of them might be excellent bets, others are a rotten deal, and you don’t know yet which is which. You have 1,000 tokens, and your goal is to make as much money as you can. Of course in real life, the way to maximize your winnings is probably to turn around and walk out of the casino. But in this hypothetical problem, we imagine that some of the machines – we don’t know which! – have a positive expected return. 

This scenario is called the classic multi-armed bandit problem, since one slot machine is a one-armed bandit (it has one arm and takes your money away), and here you have a whole row.

You’d probably start by putting one token in each machine. Maybe one of them pays out! What do you do now? One approach is to just put the remaining 990 tokens into that winning machine. This is called a “greedy algorithm,” since it immediately jumps at the first sign of promise. But as you can guess, the greedy algorithm is usually a bad idea: maybe that winning machine is actually an underperformer that got lucky; you can’t know without more data points.

The opposite extreme of the greedy algorithm is to put 100 tokens into each machine regardless of their performance. This is called “pure exploration.” Of course, that’s not optimal either – you will at some point have enough data to know that some machines are likely better than others, so pure exploration leaves money on the table – or in this case, in the machine!

This tension is called the exploration-exploitation tradeoff. An optimal solution to the multi-armed bandit problem is one that navigates this tradeoff in the best possible way. An algorithm designed to solve this problem is called a multi-armed bandit, or MAB for short.
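
To make this concrete, here is a minimal sketch of one widely used MAB algorithm, Thompson sampling, applied to the casino scenario above. The payout probabilities are invented for illustration; in a real problem they are unknown and the bandit only learns about them by playing.

```python
import numpy as np

# A minimal sketch of a multi-armed bandit using Thompson sampling.
# The payout probabilities below are made up for illustration.
rng = np.random.default_rng(0)

true_payout_probs = np.array([0.02, 0.05, 0.01, 0.12, 0.04,
                              0.03, 0.08, 0.02, 0.06, 0.01])  # 10 machines
n_arms = len(true_payout_probs)
n_tokens = 1_000

# Beta(1, 1) priors: one (successes, failures) pair per machine.
successes = np.ones(n_arms)
failures = np.ones(n_arms)

for _ in range(n_tokens):
    # Explore and exploit in one step: sample a plausible payout rate for each
    # machine from its posterior, then play the machine with the best sample.
    sampled_rates = rng.beta(successes, failures)
    arm = int(np.argmax(sampled_rates))

    # Pull the arm and observe whether it pays out.
    reward = float(rng.random() < true_payout_probs[arm])
    successes[arm] += reward
    failures[arm] += 1 - reward

print("Estimated payout rates:", successes / (successes + failures))
print("Best machine so far:", int(np.argmax(successes / (successes + failures))))
```

Machines the algorithm is still unsure about occasionally produce high posterior samples and get tried, while machines that keep paying out get pulled more and more often – exploration and exploitation handled in a single rule.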

This field of research isn’t new. One of the seminal works on the topic, Herbert Robbins’s Some aspects of the sequential design of experiments, was published in the Bulletin of the American Mathematical Society in 1952.

MABs are a very natural tool for marketers, since they improve on A/B testing. In a marketing context, the “arms” of the multi-armed bandit problem are the set of choices available to the marketer. In a traditional A/B test, the marketer divides customers equally between two or more treatments (or “arms”), runs an experiment, and then checks at the end to see which arm won. By contrast, MABs dynamically adjust traffic as they go, reaching the right answer sooner and with less waste from exposing customers to low-performing arms.

MABs are powerful, but they don’t personalize

MABs are certainly an improvement over A/B testing, but they suffer from the same limitation: they only find the overall “winner.” In marketing, however, different customers need different things. You respond to emails, I ignore email but act on text messages; you’re easiest to reach on Tuesdays, I’m most available on Saturdays; you’re ready to buy at list price, I won’t buy unless I get a 20% discount. Finding the arm that is the best on average is a good start, but it’s a far cry from the marketing dream of the best choice for each individual.

You can try to get around this problem by dividing customers into segments and running a separate MAB for each segment. For example, you might break customers into five groups by purchase frequency, and then use MABs to find the best choice for each group. That’s better than no personalization at all! But it’s only scratching the surface: two customers in the “high frequency” group might be different in a hundred other ways, yet the MAB-by-segment approach ignores these differences.

MABs struggle when there are too many arms

A second problem with MABs is that they learn very slowly when faced with a large number of choices. Imagine you’d like to test 24 different email send times, one per hour of the day. An MAB sees these 24 times as a set of 24 completely independent choices, none of which gives you information about the others. But in real life, if you send an email at 3 pm and the customer converts, this should make you a bit more optimistic not just about 3 pm, but also about 2 pm and 4 pm, since they are “close.” An approach that used this sort of structural information about the arms would be more efficient than a traditional MAB.

Meet the contextual bandit

To overcome the limitations of MABs, you can use what is called a contextual multi-armed bandit – contextual bandit for short. A contextual bandit can learn to act differently in different situations. More precisely, a contextual bandit takes a context vector and outputs a decision on which action to take (i.e., which arm to pull):

[Diagram: a contextual bandit takes a context vector as input and outputs which arm to pull]

Whereas MABs are a simple, many-decades-old technique, contextual bandits are much more complex and have been an area of active research in the machine learning community throughout the 2000s, 2010s, and 2020s.

Contextual bandits make 1:1 personalized decisions

In the example above, we had only four variables in the context vector – customer country, years as customer, purchases in last 90 days, and day of week. But in real life, contextual bandits can take into account dozens or hundreds of variables. This flexibility allows you to take quite rich first-party data, convert it into context vector variables (called customer features in the machine learning community), and then use a contextual bandit to make personalized decisions which take all of this information into account.
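
To show the mechanics, below is a minimal sketch of one well-known contextual bandit algorithm, disjoint LinUCB, choosing among a handful of arms based on a customer context vector. The number of arms, the features, and the simulated responses are illustrative assumptions, not a description of OfferFit’s system or any particular production setup.

```python
import numpy as np

# A minimal sketch of a contextual bandit (disjoint LinUCB) choosing among
# three arms based on a customer context vector. All numbers are illustrative.
rng = np.random.default_rng(0)

n_arms, d = 3, 4          # e.g., 3 candidate offers, 4 customer features
alpha = 1.0               # exploration strength

# Per-arm ridge-regression state: A = X^T X + I, b = X^T r
A = [np.eye(d) for _ in range(n_arms)]
b = [np.zeros(d) for _ in range(n_arms)]

def choose_arm(x):
    """Pick the arm with the highest upper-confidence-bound score for context x."""
    scores = []
    for a in range(n_arms):
        A_inv = np.linalg.inv(A[a])
        theta = A_inv @ b[a]                               # arm a's estimated weights
        ucb = theta @ x + alpha * np.sqrt(x @ A_inv @ x)   # optimism bonus
        scores.append(ucb)
    return int(np.argmax(scores))

def update(arm, x, reward):
    """Fold the observed reward back into the chosen arm's model."""
    A[arm] += np.outer(x, x)
    b[arm] += reward * x

# Example loop with simulated customers; x might encode country, tenure,
# recent purchases, and day of week (scaled to comparable ranges).
for _ in range(1_000):
    x = rng.normal(size=d)
    arm = choose_arm(x)
    reward = float(rng.random() < 0.1 + 0.05 * arm * (x[0] > 0))  # toy response model
    update(arm, x, reward)
```

Because each arm’s score depends on the full context vector, two customers in the same segment can still receive different decisions whenever their other features differ.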

This is a much stronger approach than “MABs by segment,” which ignores all the information you have about customers apart from which segment they are in.

Contextual bandits are better than MABs at handling many arms

The context vector isn’t limited to just customer variables: it can also contain variables describing the arms! Data scientists call these variables action features, because they are features of the actions (arms) the contextual bandit can take. For example, a contextual bandit testing send times can represent each candidate send time as a numerical variable. This allows the contextual bandit to learn more efficiently than an MAB, since it “knows” that 3 pm is close to 4 pm but far from 11 pm.

This capability may seem like a subtle advantage, but it can have massive implications for how quickly bandits learn. For example, if a contextual bandit is choosing among different emails, it can take email length, imagery, subject line characteristics, and so forth as action features; it will then automatically apply what it learns about each arm to its understanding of other (structurally similar) arms.
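
As a rough illustration of the idea, the sketch below describes each of 24 candidate send hours with numerical action features (a cyclic sin/cos encoding of the hour) and fits a single shared model over customer and action features, here with simple epsilon-greedy exploration. The encoding, the reward simulation, and the learning rule are all illustrative assumptions.

```python
import numpy as np

# A minimal sketch of how action features let one model share learning across
# arms: each candidate send hour is described numerically, so feedback about
# 3 pm also informs 2 pm and 4 pm. All details here are illustrative.
rng = np.random.default_rng(0)

hours = np.arange(24)                       # the 24 "arms"

def action_features(hour):
    angle = 2 * np.pi * hour / 24
    return np.array([1.0, np.sin(angle), np.cos(angle)])

def joint_features(customer, hour):
    # Concatenate customer features with action features describing the arm.
    return np.concatenate([customer, action_features(hour)])

d = 2 + 3                                   # 2 toy customer features + 3 action features
weights = np.zeros(d)
learning_rate, epsilon = 0.1, 0.1

for _ in range(5_000):
    customer = rng.normal(size=2)
    # Epsilon-greedy over send hours: usually pick the highest-scoring hour,
    # occasionally try a random one to keep exploring.
    if rng.random() < epsilon:
        hour = int(rng.integers(24))
    else:
        scores = [weights @ joint_features(customer, h) for h in hours]
        hour = int(np.argmax(scores))

    # Toy world: conversions peak around 3 pm (hour 15).
    p_convert = 0.05 + 0.1 * np.exp(-((hour - 15) ** 2) / 8)
    reward = float(rng.random() < p_convert)

    # One squared-error SGD step on the shared model.
    x = joint_features(customer, hour)
    weights += learning_rate * (reward - weights @ x) * x
```

Because nearby hours have similar action features, a conversion at 3 pm nudges the model’s scores for 2 pm and 4 pm upward as well – exactly the kind of structure a plain MAB cannot exploit.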

Contextual bandits are powerful, but require specialized tooling and expertise to get right

MABs are relatively simple machine learning models, which can be implemented and maintained by small teams of qualified data scientists. The same is not true of contextual bandits, which are quite complex and notoriously difficult to successfully execute. Part of the difficulty is that there is no “one size fits all” configuration of a contextual bandit – the best approaches depend on the specifics of the data used, the arms (and their structure), and the business situation.

For example, a leading rideshare company built contextual bandits to personalize promotions to riders and drivers. The project was very successful, but required 50 permanent, full-time senior engineers. In nearly all situations – including at companies with large, sophisticated data science teams – attempting to build contextual bandits in-house is a high-cost, high-risk approach to the problem.

OfferFit’s AI testing platform is built on contextual bandits. We’ve found that marketers who don’t want the costs and risks of building in-house need a product which offers three things.

  1. Flexibility. Marketers need AI testing solutions which can use all of their rich first-party data, make decisions along multiple dimensions (including frequency, channel, timing, messaging, and incentive), integrate with any system, maximize custom KPIs (e.g., profit or LTV), and stay within marketer-defined guardrails.

  2. Robustness. Because contextual bandits are challenging to execute, products using them should come with a track record of success (specifically with contextual bandits) and rich built-in diagnostic tools to ensure their configuration can be quickly optimized.

  3. Visibility. Of course marketers using any machine learning model, including contextual bandits, are aiming for uplift in the metrics that matter to them – revenue, margin, conversions, etc. But marketers also need reporting that goes a level deeper, allowing them to understand how the model is getting lift and gain novel insights about their customers.

In conclusion, contextual bandits are a huge upgrade over MABs in their capabilities and performance, but require a commensurately greater level of specialized expertise and investment to implement properly. As this capability becomes more broadly available through products like OfferFit’s AI testing platform, we will see contextual bandits gradually move from a cutting-edge technique available only to a small number of AI-first tech companies to an essential tool deployed by every sophisticated enterprise marketer.

George Khachatryan is OfferFit’s co-founder and CEO. He previously served as an Associate Partner at McKinsey & Company. He co-founded his first startup, whose educational software products are used today by millions of students, when he was in high school. He holds a PhD in mathematics from Cornell University.
