Contextual Bandit algorithms
The contextual bandit learning algorithms in VW consist of two broad classes. The first class covers settings where the maximum number of actions is known ahead of time and the semantics of these actions stay fixed across examples. A more advanced setting allows the semantics to change from example to example; in this latter setting, the actions are specified via features, with different features associated with each action. We refer to this as the ADF setting, for action-dependent features.
When the number of actions is known ahead of time, suppose we have a file train.dat consisting of examples in the contextual bandit format discussed here. Then we train VW on this data by invoking:
./vw -d train.dat --cb_explore 4
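For reference, each example in such a file uses the standard contextual bandit label format action:cost:probability followed by the context features. A small, purely hypothetical train.dat for 4 actions (the feature names are made up for illustration) might look like:

```
1:2:0.4 | user_age:25 region_east
3:0.5:0.2 | user_age:43 region_west
2:1:0.3 | user_age:31 region_east
```

Here the first field is the action that was taken, the second is the observed cost, and the third is the probability with which the logging policy chose that action.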
In the command above, --cb_explore indicates that we are training a contextual bandit algorithm, and its argument is the (fixed) maximum number of actions. Providing no further arguments uses the default exploration strategy, which is epsilon-greedy (epsilon = 0.05). In general, we support a number of exploration algorithms, defined below:
- Explore-first: This simplest scheme takes in a parameter tau. On the first tau examples, we take each of the k actions with probability 1/k. This data is then used to learn a good predictor, which is used to pick actions for the remaining examples. Usage: ./vw -d train.dat --cb_explore 4 --first 2
- Epsilon-greedy: This default exploration strategy trains a good policy online as it sees examples. At each example, the prediction of the current learned policy is taken with probability 1-epsilon, and with the remaining epsilon probability an action is chosen uniformly at random. Usage: ./vw -d train.dat --cb_explore 4 --epsilon 0.1
- Bagging Explorer: This exploration rule is based on an ensemble approach. It takes in an argument m and trains m different policies. The policies differ by being trained on different random subsets of the data, with each example going to a subset of the m policies; the training itself uses the bootstrap feature of VW described here. The exploration algorithm picks the action of one of these policies chosen uniformly at random. This is a simple and effective approach that rules out obviously bad actions while exploring amongst the plausibly good ones whenever the variation amongst the m policies is adequate. Usage: ./vw -d train.dat --cb_explore 4 --bag 5
- Online Cover: This is a theoretically optimal exploration algorithm based on this paper. Like bagging, many different policies are trained, with the number specified as an argument m. Unlike bagging, the training of these policies is explicitly optimized to result in a diverse set of predictions, choosing all the actions which are not already learned to be bad in a given context. This is the most sophisticated of the available options for contextual bandit learning. Usage: ./vw -d train.dat --cb_explore 4 --cover 3
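Whichever exploration strategy is chosen, the trained exploration policy can be saved and later applied to held-out data using VW's standard model-saving flags. A minimal hypothetical workflow (the file names are illustrative, and the exact format of the prediction output may differ across VW versions) might be:

```
# train with epsilon-greedy exploration over 4 actions and save the model
./vw -d train.dat --cb_explore 4 --epsilon 0.1 -f cb.model

# apply the saved model to new data in test-only mode, writing the
# predicted action distribution to preds.txt
./vw -d test.dat -t -i cb.model -p preds.txt
```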
In many applications, the set of actions is richer: it may change over time, or we may have rich information about each action. In either case, it is a good idea to create features for every (context, action) pair rather than features associated only with the context and shared across all actions. We call this setting ADF, and it builds on the csoaa_ldf learner that supports cost-sensitive classification with label-dependent features. An example dataset for this setting might look like:
| a:1 b:0.5
0:0.1:0.75 | a:0.5 b:1 c:2

shared | s_1 s_2
0:1.0:0.5 | a:1 b:1 c:1
| a:0.5 b:2 c:1
Each example now spans multiple lines, with one line per action. For each action, we have the label information (a,c,p), if known, as before. The action field a is now ignored, since actions are identified by line numbers, and is typically set to 0. The semantics of cost and probability are the same as before. Each example is also allowed to specify the label information on precisely one action. An empty line signals the end of a multiline example. Additionally, we can specify contextual features that are shared across all actions on a line at the beginning of the example; this line always carries the tag shared, as in the second multiline example above. Since the shared line is not associated with any action, it should never contain label information.
The ADF learning mode is enabled by invoking VW as follows on the above file train_adf.dat:
./vw -d train_adf.dat --cb_explore_adf
Note that unlike --cb_explore, we do not specify the number of actions with --cb_explore_adf, as it is inferred from the number of lines in each example. We again have a number of choices for the exploration algorithm:
- Explore-first: Same as before: it takes in a parameter tau, and on the first tau examples we take each of the actions with equal probability. This data is then used to learn a good predictor, which is used to pick actions for the remaining examples. Usage: ./vw -d train_adf.dat --cb_explore_adf --first 2
- Epsilon-greedy: This is again the default strategy. At each example, the prediction of the current learned policy is taken with probability 1-epsilon, and with the remaining epsilon probability an action is chosen uniformly at random. Usage: ./vw -d train_adf.dat --cb_explore_adf --epsilon 0.1
- Bagging Explorer: Again, the same as before. Usage: ./vw -d train_adf.dat --cb_explore_adf --bag 5
- Softmax Explorer: This is a different explorer, which uses the policy not only to predict an action but also a score indicating the quality of each action. A distribution over actions is then created, with the probability of action a being proportional to exp(lambda*score(x,a)) (written out below this list). Here lambda is a parameter which yields uniform exploration for lambda = 0 and stops exploring as lambda approaches infinity. In general, this provides another nice knob for controlled exploration based on the uncertainty in the learned policy. Usage: ./vw -d train_adf.dat --cb_explore_adf --softmax --lambda 10
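For concreteness, the distribution described for the softmax explorer can be written in the standard softmax form, where score(x,a) is the policy's score for action a in context x and lambda is the parameter passed via --lambda (this is the form implied by the description above, not a verbatim excerpt from the VW source):

```
P(a \mid x) = \frac{\exp(\lambda \cdot \mathrm{score}(x,a))}{\sum_{a'} \exp(\lambda \cdot \mathrm{score}(x,a'))}
```

Setting lambda = 0 makes every action equally likely, while letting lambda grow concentrates the probability mass on the highest-scoring action.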
The online cover algorithm is not currently supported in the ADF mode.
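When applying a trained ADF model to new contexts, the test examples use the same multiline format but carry no label information on any action line. A hypothetical held-out file, say test_adf.dat, could contain:

```
shared | s_1 s_2
| a:1 b:0.5 c:2
| a:0.5 b:2 c:1
```

Assuming a model was saved during training with -f cb_adf.model, it can then be applied in test-only mode (the exact format of the prediction output may differ across VW versions):

```
./vw -d test_adf.dat -t -i cb_adf.model -p preds_adf.txt
```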