
# Deep Bayesian Bandits Library
This library corresponds to the *[Deep Bayesian Bandits Showdown: An Empirical
Comparison of Bayesian Deep Networks for Thompson
Sampling](https://arxiv.org/abs/1802.09127)* paper, published in
[ICLR](https://iclr.cc/) 2018. We provide a benchmark to test decision-making
algorithms for contextual bandits. In particular, the current library implements
algorithms for contextual-bandits. In particular, the current library implements
a variety of algorithms (many of them based on approximate Bayesian Neural
Networks and Thompson sampling), and a number of real and synthetic data
problems exhibiting a diverse set of properties.
It is a Python library that uses [TensorFlow](https://www.tensorflow.org/).
We encourage contributors to add new approximate Bayesian Neural Networks or,
more generally, contextual bandit algorithms to the library. Also, we would
like to extend the data sources over time, so we warmly encourage contributions
on this front too!
Please use the following when citing the code or the paper:
```
@article{riquelme2018deep,
  title={Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling},
  author={Riquelme, Carlos and Tucker, George and Snoek, Jasper},
  journal={International Conference on Learning Representations, ICLR},
  year={2018}
}
```
**Contact**. This repository is maintained by [Carlos Riquelme](http://rikel.me) ([rikel](https://github.com/rikel)). Feel free to reach out directly at [rikel@google.com](mailto:rikel@google.com) with any questions or comments.
We first briefly introduce contextual bandits and Thompson sampling, then
enumerate the implemented algorithms and the available data sources. Finally,
we provide a simple, complete example illustrating how to use the library.
## Contextual Bandits
Contextual bandits are a rich decision-making framework where an algorithm has
to choose among a set of *k* actions at every time step *t*, after observing
a context (or side-information) denoted by *X<sub>t</sub>*. The general pseudocode for
the process when we use algorithm **A** is as follows:
```
At time t = 1, ..., T:
1. Observe new context: X_t
2. Choose action: a_t = A.action(X_t)
3. Observe reward: r_t
4. Update internal state of the algorithm: A.update((X_t, a_t, r_t))
```
The goal is to maximize the total sum of rewards: ∑<sub>t</sub> r<sub>t</sub>.
For example, each *X<sub>t</sub>* could encode the properties of a specific user (and
the time or day), and we may have to choose an ad, discount coupon, treatment,
hyper-parameters, or version of a website to show or provide to the user.
Hopefully, over time, we will learn how to match each type of user to the most
beneficial personalized action under some metric (the reward).
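The loop above translates directly into code. Here is a minimal Python sketch;
the `action`/`update` interface mirrors the pseudocode and is purely
illustrative, not necessarily the exact API of the algorithms in this library:
```
def run_contextual_bandit(algo, contexts, reward_fn):
    """Generic contextual bandit loop.

    algo: any object exposing .action(context) and .update(context, action, reward).
    contexts: sequence of T contexts X_1, ..., X_T.
    reward_fn: maps (t, action) to the reward observed at step t.
    """
    total_reward = 0.0
    for t, x_t in enumerate(contexts):
        a_t = algo.action(x_t)        # 2. choose an action
        r_t = reward_fn(t, a_t)       # 3. observe the reward
        algo.update(x_t, a_t, r_t)    # 4. update the algorithm's state
        total_reward += r_t           # objective: maximize the sum of rewards
    return total_reward
```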
## Thompson Sampling
Thompson Sampling is a meta-algorithm that chooses an action for the contextual
bandit in a statistically efficient manner, simultaneously finding the best arm
while attempting to incur low cost. Informally speaking, we assume the expected
reward of each action given the context is governed by some unknown function.
Thompson Sampling maintains a posterior distribution over plausible such
functions; at each time step it samples one function from the posterior, and
plays the action that is best according to the sampled function.
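As a concrete (non-contextual) illustration of the idea, here is a toy NumPy
sketch of Thompson Sampling for Bernoulli rewards, where the posterior over
each arm's mean reward is a Beta distribution. This example is ours, for
intuition only, and is not part of the library:
```
import numpy as np

def bernoulli_thompson_sampling(arm_probs, num_steps, seed=0):
    """Toy Thompson Sampling with a Beta(1, 1) prior on each arm's mean reward."""
    rng = np.random.default_rng(seed)
    k = len(arm_probs)
    alpha = np.ones(k)  # Beta posterior parameters: 1 + observed successes
    beta = np.ones(k)   # Beta posterior parameters: 1 + observed failures
    total_reward = 0.0
    for _ in range(num_steps):
        # Sample one plausible mean reward per arm from the current posterior...
        sampled_means = rng.beta(alpha, beta)
        # ...and play the arm that looks best under the sample.
        a = int(np.argmax(sampled_means))
        r = float(rng.random() < arm_probs[a])  # Bernoulli reward
        alpha[a] += r
        beta[a] += 1.0 - r
        total_reward += r
    return total_reward

# Arms with unknown success probabilities; the sampler should find the best one.
print(bernoulli_thompson_sampling([0.1, 0.5, 0.7], num_steps=1000))
```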
8.  **Bootstrapped Networks**. This algorithm trains **q** neural networks
    simultaneously and in parallel, based on different datasets D<sub>1</sub>, ..., D<sub>q</sub>.
    These datasets are collected by adding each new datapoint
    (X<sub>t</sub>, a<sub>t</sub>, r<sub>t</sub>) to each dataset *D<sub>i</sub>*
    independently with probability p ∈ (0, 1]. Therefore, the main
    hyperparameters of the algorithm are **(q, p)**. In order to choose an
    action for a new context, one of the **q** networks is first selected with
    uniform probability (i.e., *1/q*). Then, the best action according to the
    *selected* network is played.

    See [Deep Exploration via Bootstrapped
    DQN](https://arxiv.org/abs/1602.04621).

    The algorithm is implemented in *bootstrapped_bnn_sampling.py*, and we
    instantiate it as sketched below, where *my_hparams* contains both **q**
    and **p**.
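    The import path and class name in this sketch are assumed from the file's
    naming; check *bootstrapped_bnn_sampling.py* for the exact constructor:

    ```
    from bandits.algorithms.bootstrapped_bnn_sampling import BootstrappedBNNSampling

    # my_hparams must define q (number of networks) and p (the probability
    # that each new datapoint is added to each of the q datasets).
    bootstrapped_nn = BootstrappedBNNSampling('MyBootstrappedNN', my_hparams)
    ```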
5.  **Adult data**. The Adult Dataset
    comprises personal information from the US Census Bureau database, and the
    standard prediction task is to determine whether a person makes over 50K a
    year or not. However, we consider the *k = 14* different occupations as
    feasible actions, based on *d = 94* covariates (many of them binarized).
    As in previous datasets, the agent obtains a reward of 1 for making the
    right prediction, and 0 otherwise. The total number of contexts is *n =
    45222*. Data is available [here](https://storage.googleapis.com/bandits_datasets/adult.full) or alternatively
    [here](https://archive.ics.uci.edu/ml/datasets/adult); use the *adult.data*
    file.
6.  **Census data**. The US Census (1990) Dataset (Asuncion & Newman, 2007)
    contains a number of personal features (age, native language, education...)
    which we summarize in *d = 389* covariates, including binary dummy
    variables for categorical features. Our goal again is to predict the
    occupation of the individual among *k = 9* classes. The agent obtains
    reward 1 for making the right prediction, and 0 otherwise. Data is available
    [here](https://storage.googleapis.com/bandits_datasets/USCensus1990.data.txt) or alternatively [here](https://archive.ics.uci.edu/ml/datasets/US+Census+Data+\(1990\)); use the
    *USCensus1990.data.txt* file.
7.  **Covertype data**. The Covertype Dataset (Asuncion & Newman, 2007)
    classifies the cover type of northern Colorado forest areas in *k = 7*
    classes, based on *d = 54* features, including elevation, slope, aspect,
    and soil type. Again, the agent obtains reward 1 if the correct class is
    selected, and 0 otherwise. Data is available [here](https://storage.googleapis.com/bandits_datasets/covtype.data) or alternatively
    [here](https://archive.ics.uci.edu/ml/datasets/covertype); use the
    *covtype.data* file.
In datasets 4-7, each feature is normalized first.
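As a rough illustration of this preprocessing (our own sketch, not the
library's code; the file name matches the Covertype download above, and the
exact preprocessing in the library may differ):
```
import numpy as np
import pandas as pd

# Covertype: 54 features followed by the class label in the last column.
data = pd.read_csv('covtype.data', header=None)
contexts = data.iloc[:, :-1].to_numpy(dtype=np.float64)
labels = data.iloc[:, -1].to_numpy()

# Normalize each feature (column) to zero mean and unit variance.
mean, std = contexts.mean(axis=0), contexts.std(axis=0)
std[std == 0.0] = 1.0  # guard against constant columns
contexts = (contexts - mean) / std
```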
## Usage: Basic Example
This library requires TensorFlow, NumPy, and Pandas.
The file *example_main.py* provides a complete example of how to use the
library. We run the code:
```
python example_main.py
```
**Do not forget to** configure the paths to the data files at the top of *example_main.py*.
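As a rough illustration, that configuration looks something like the following
(the variable names here are an assumption, not necessarily those in
*example_main.py*; adapt them to wherever you stored the downloaded files):
```
import os

# Hypothetical illustration: point the library at your local copies of the
# datasets downloaded from the links above.
base_route = os.getcwd()
data_route = 'contextual_bandits/datasets'
```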
For example, we can run the Mushroom bandit for 2000 contexts on a few