Bandits
Main idea: making repeated decisions based on feedback, factoring in the tradeoff between exploring new options and exploiting decisions that are already known to be good
Multi-Armed Bandit Framework #
The Multi-Armed Bandit problem arises when the following are true:
- We need to gather the data as part of the decision-making process (we only observe the reward of the arm we actually choose)
- Exploration/exploitation tradeoff (both have a cost)
- Stochastic: rewards are random
Setup:
- $T$ selection rounds
- $K$ arms (choices)
- $\nu_i$: reward distribution for arm $i$
- $\mu_i$: mean reward for distribution $\nu_i$
- At each round $t$, choose an arm $a_t$ and receive a reward $r_t \sim \nu_{a_t}$
- Define pseudo-regret at time $T$ as $\bar{R}_T = \mathbb{E}\left[\sum_{t=1}^{T} (\mu^* - \mu_{a_t})\right]$, where $\mu^* = \max_i \mu_i$ is the best mean possible.
- The term in the sum is also known as the optimality gap, $\Delta_{a_t} = \mu^* - \mu_{a_t}$ (how much worse arm $a_t$ is than the best arm).
Goal: maximize total expected reward
Known: only $T$ and $K$ (the reward distributions $\nu_i$ and their means $\mu_i$ are unknown)
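To make the setup concrete, here is a minimal simulation sketch in Python. The Gaussian reward distributions, the round-robin policy, and every name here are illustrative assumptions rather than part of any standard library; the point is just the interaction loop and how pseudo-regret accumulates the optimality gaps of the chosen arms.

```python
import numpy as np

def run_bandit(means, policy, T, seed=0):
    """Simulate T rounds of a stochastic bandit and return the pseudo-regret.

    means  : true mean reward of each arm (hidden from the policy)
    policy : function (round t, history of (arm, reward) pairs) -> arm index
    """
    rng = np.random.default_rng(seed)
    mu_star = max(means)                       # best mean possible
    history, pseudo_regret = [], 0.0
    for t in range(T):
        arm = policy(t, history)               # the policy only sees past feedback
        reward = rng.normal(means[arm], 1.0)   # stochastic reward drawn from nu_arm
        history.append((arm, reward))
        pseudo_regret += mu_star - means[arm]  # accumulate the optimality gap
    return pseudo_regret

# A policy that ignores feedback entirely (round-robin): its pseudo-regret grows linearly in T.
round_robin = lambda t, history: t % 3
print(run_bandit(means=[0.1, 0.5, 0.4], policy=round_robin, T=1000))
```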
Examples:
- A/B testing
- Advertising
- Gambling
- Optimizing blog posts
- Tuning hyperparameters for ML models
Algorithms (Python sketches of each appear after this list):
- Explore then commit (ETC): pull each arm a fixed number of times, then commit to the arm with the highest sample mean for the remaining rounds
- not optimal: once it commits, it will never choose the true best arm if the exploration phase failed to identify it
- Upper Confidence Bound (UCB): choose the arm with the highest upper confidence bound
- Confidence interval calculation (derived from Hoeffding's Inequality): $UCB_i(t) = \hat{\mu}_i(t) + \sqrt{\frac{\log(1/\delta)}{2\,n_i(t)}}$, where $\hat{\mu}_i(t)$ is the sample mean of arm $i$ and $n_i(t)$ is the number of times arm $i$ has been pulled so far (a short derivation appears after this list)
- $\delta$ (something like .05) specifies how wide our confidence interval should be: with probability at least $1-\delta$, the true mean $\mu_i$ lies below the bound
- Choose $\delta$ as a decreasing function of $t$; the confidence interval for an arm still gets narrower as we explore it more, since $n_i(t)$ in the denominator grows faster than the $\log(1/\delta)$ term
- Hoeffding requires variables to be independent, which isn't actually true for the UCB algorithm (which arm we choose depends on which arms we chose before). However, the result still holds.
- UCB regret is bounded by $O\!\left(\sum_{i:\Delta_i > 0} \frac{\log T}{\Delta_i}\right)$, which says that the regret grows logarithmically with respect to $T$.
- Thompson Sampling: draw a sample $\tilde{\mu}_i$ from the posterior for each arm, and choose the arm according to $a_t = \arg\max_i \tilde{\mu}_i$
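For reference, the width of the UCB interval falls out of Hoeffding's inequality in a couple of lines; this is the standard derivation, written with the same symbols as above and assuming rewards bounded in $[0, 1]$.

```latex
% Hoeffding's inequality for the n_i(t) i.i.d. rewards observed from arm i (rewards in [0, 1]):
%   P( mu_i >= \hat{mu}_i(t) + eps ) <= exp(-2 n_i(t) eps^2)
% Setting the right-hand side equal to delta and solving for eps gives the interval width:
\begin{align*}
\exp\bigl(-2\, n_i(t)\, \varepsilon^2\bigr) = \delta
\quad\Longrightarrow\quad
\varepsilon = \sqrt{\frac{\log(1/\delta)}{2\, n_i(t)}},
\end{align*}
% so with probability at least 1 - delta,  mu_i <= \hat{mu}_i(t) + sqrt(log(1/delta) / (2 n_i(t))) = UCB_i(t).
```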
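As a concrete illustration, here is one way the three algorithms could be sketched, assuming Bernoulli reward distributions so that Thompson Sampling can use a conjugate Beta posterior. The exploration length `m` in ETC and the $\sqrt{2\log t / n_i}$ bonus in UCB are common textbook choices rather than canonical constants, and all names are illustrative.

```python
import numpy as np

def etc(means, T, m=20, seed=0):
    """Explore-then-commit: pull every arm m times, then commit to the best sample mean.
    Assumes T >= m * len(means)."""
    rng, K = np.random.default_rng(seed), len(means)
    sample_means = [np.mean([rng.binomial(1, means[i]) for _ in range(m)]) for i in range(K)]
    best = int(np.argmax(sample_means))
    return [i for i in range(K) for _ in range(m)] + [best] * (T - m * K)

def ucb(means, T, seed=0):
    """UCB: play the arm whose sample mean plus exploration bonus is largest."""
    rng, K = np.random.default_rng(seed), len(means)
    counts, sums, choices = np.zeros(K), np.zeros(K), []
    for t in range(T):
        if t < K:
            arm = t  # pull every arm once so the bonus is well defined
        else:
            # Hoeffding-style width; equivalent to delta = 1/t^4 in the interval formula above.
            bonus = np.sqrt(2 * np.log(t + 1) / counts)
            arm = int(np.argmax(sums / counts + bonus))
        r = rng.binomial(1, means[arm])
        counts[arm] += 1
        sums[arm] += r
        choices.append(arm)
    return choices

def thompson(means, T, seed=0):
    """Thompson Sampling with a Beta(1, 1) prior on each Bernoulli arm."""
    rng, K = np.random.default_rng(seed), len(means)
    alpha, beta, choices = np.ones(K), np.ones(K), []
    for t in range(T):
        samples = rng.beta(alpha, beta)   # one draw from each arm's posterior
        arm = int(np.argmax(samples))     # play the arm with the largest draw
        r = rng.binomial(1, means[arm])
        alpha[arm] += r                   # conjugate Beta posterior update
        beta[arm] += 1 - r
        choices.append(arm)
    return choices

means = [0.3, 0.5, 0.6]
for algo in (etc, ucb, thompson):
    choices = algo(means, T=2000)
    print(algo.__name__, "pseudo-regret:", round(sum(max(means) - means[a] for a in choices), 1))
```

The qualitative behavior to expect from a run like this is that UCB and Thompson Sampling pull the suboptimal arms less and less often as $t$ grows, consistent with the logarithmic regret bound above, while ETC's regret depends heavily on the choice of `m`.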
Other Bandit Problems #
- Adversarial bandits: rewards are chosen by an adversarial actor
- Contextual bandits: rewards are correlated with observable context variables (side information seen before each choice)
- Linear bandits: arms are vectors and the expected reward is a linear function of the chosen vector (online linear regression)
- Non-stationary bandits: arm reward distributions change over time
- Structured bandits: previous choices affect future choices (such as navigation on a road network)