Loss functions
Given a prediction (p) and a label (y), a loss function measures the discrepancy between the algorithm's prediction and the desired output. VW currently supports the following loss functions, with squared loss being the default:
Loss | Function | Minimizer | Example usage |
---|---|---|---|
Squared | `(p - y)^2` | Expectation (mean) | Regression: expected return on stock |
Quantile | `tau * (y - p)` if `y >= p`, else `(1 - tau) * (p - y)` | Median | Regression: what is a typical price for a house? |
Logistic | `log(1 + exp(-y * p))` | Probability | Classification: probability of click on ad |
Hinge | `max(0, 1 - y * p)` | 0-1 approximation | Classification: is the digit a 7? |
Poisson | `exp(p) - y * p` (`p` is the log of the mean) | Counts (log mean) | Regression: number of call events to call center |
Classic | Squared loss without importance weight aware updates | Expectation (mean) | Regression: squared loss often performs better than classic |
Expectile | Asymmetric squared loss (parameter `q`) | Expectile(q) | Regression: risk-averse contextual bandits (CB) |
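For concreteness, here is a small standalone C++ sketch that evaluates the standard formulas listed above for a prediction `p` and a label `y`. This is illustrative code only, not VW's internal implementation; the quantile parameter is written `tau`:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

// Standard loss formulas corresponding to the table above (illustrative only).
float squared_loss(float p, float y) { return (p - y) * (p - y); }

float quantile_loss(float p, float y, float tau)  // tau in (0, 1); 0.5 targets the median
{
  return (y >= p) ? tau * (y - p) : (1.f - tau) * (p - y);
}

float logistic_loss(float p, float y) { return std::log(1.f + std::exp(-y * p)); }  // y in {-1, +1}
float hinge_loss(float p, float y) { return std::max(0.f, 1.f - y * p); }           // y in {-1, +1}
float poisson_loss(float p, float y) { return std::exp(p) - y * p; }                // p is the log of the mean

int main()
{
  float p = 0.3f, y = 1.f;
  std::printf("squared=%.3f quantile(0.5)=%.3f logistic=%.3f hinge=%.3f\n",
              squared_loss(p, y), quantile_loss(p, y, 0.5f), logistic_loss(p, y), hinge_loss(p, y));
}
```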
To select a loss function in VW, see the Command line arguments guide. The Logistic and Hinge losses are for binary classification only, so every example must have a label of "-1" or "1". More information on loss function semantics can be found in these slides (pdf) from an online learning course.
The Python wrapper overrides the default squared loss with logistic loss when using VWClassifier.
- If the problem is binary classification (i.e. labels are -1 and +1), your choices should be Logistic or Hinge loss (although Squared loss may work as well). If you want VW to report the 0-1 loss instead of the logistic/hinge loss, add `--binary`. Examples: spam vs. non-spam, click vs. no-click.
- For binary classification where you need to know the posterior probabilities, use `--loss_function logistic --link logistic`.
- If the problem is a regression problem, meaning the target label you're trying to predict is a real value, use Squared, Quantile, or Expectile loss. Example target labels: revenue, height, weight.
  - Logistic loss can also work with real-valued labels in the range [0,1], via the interpolation `(1-y)*{loss for -1} + y*{loss for 1}` (see the sketch after this list).
  - If you're trying to minimize the mean error, use squared loss. See: http://en.wikipedia.org/wiki/Least_squares
  - If, on the other hand, you're trying to predict rank/order and you don't mind the mean error increasing as long as the relative order is correct, you need to minimize the error against the median (or any other quantile); in this case, use quantile loss. See: http://en.wikipedia.org/wiki/Quantile_regression
  - If you are risk-averse (underestimation is less costly than overestimation), use Expectile loss. See: https://arxiv.org/abs/2210.13573
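As a minimal sketch of the [0,1]-label trick mentioned above (illustrative code, not VW's internal implementation), the loss for a fractional label `y` is the interpolation of the logistic losses for the two hard labels:

```cpp
#include <cmath>
#include <cstdio>

// Logistic loss for a hard label in {-1, +1}.
float logistic_loss(float p, float hard_label) { return std::log(1.f + std::exp(-hard_label * p)); }

// Loss for a soft label y in [0,1]: (1-y) * {loss for -1} + y * {loss for +1}.
float soft_logistic_loss(float p, float y)
{
  return (1.f - y) * logistic_loss(p, -1.f) + y * logistic_loss(p, 1.f);
}

int main()
{
  // A click-through-rate style label: this example was "clicked" 30% of the time.
  std::printf("loss at p=0.5 for y=0.3: %.3f\n", soft_logistic_loss(0.5f, 0.3f));
}
```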
`loss_functions.h` provides the interface for implementing custom loss functions. See `classic_squaredloss` in `loss_functions.cc` for an example implementation of the classic squared loss function. The methods, especially `getLoss()` and `getUpdate()`, are used for gradient descent in `gd.cc`. To enable the use of your loss function from the command line, you also need to add an additional loss function type in `parse_args.cc`.
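A rough standalone sketch of what such an implementation involves is shown below. The method names mirror `getLoss()` and `getUpdate()`, but the simplified signatures are assumptions for exposition only; a real implementation must derive from the loss interface actually declared in `loss_functions.h`.

```cpp
#include <cstdio>

// Illustrative standalone sketch in the spirit of classic_squaredloss.
// NOTE: the signatures here are simplified assumptions; match the real
// virtual methods in loss_functions.h when implementing a custom loss.
struct my_classic_squared_loss
{
  // Loss value for prediction p and label y.
  float getLoss(float p, float y) const { return (p - y) * (p - y); }

  // Classic (non importance-weight-aware) update: the negative gradient of the
  // loss with respect to the prediction, scaled by the learning rate and the
  // example's importance weight.
  float getUpdate(float p, float y, float eta_times_importance) const
  {
    return 2.f * (y - p) * eta_times_importance;
  }
};

int main()
{
  my_classic_squared_loss loss;
  std::printf("loss=%.2f update=%.2f\n", loss.getLoss(0.2f, 1.f), loss.getUpdate(0.2f, 1.f, 0.1f));
}
```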
Note that there are two versions of the methods for computing the weight update, `getUpdate()` and `getUnsafeUpdate()`. Their differences are explained in the importance weight aware updates paper. The paper observes that the standard approach of multiplying the gradient by the importance weight can be problematic, and that a better approach is to account for the cumulative effect of the importance weight by computing a line integral of many infinitesimally small updates. Table 1 in the paper, derived from Theorem 1, presents closed-form updates for common loss functions, which are already implemented in the code.