Contextual Bandits¶
This is the documentation page for the Python package contextualbandits. For more details, see the project's home page.
Getting started¶
You can find user guides with detailed examples in the following links:
Online Contextual Bandits¶
-
class contextualbandits.online.ActiveExplorer(nchoices, C=None, explore_prob=0.15, decay=0.9997, beta_prior='auto')¶
Bases: object
Active Explorer
Logistic Regression which selects a proportion of actions according to an active-learning heuristic based on the gradient.
Note
Here, for the predictions made according to the active-learning heuristic (these are selected at random, just like in Epsilon-Greedy), the guiding criterion is the gradient that the observation would produce on each arm's classifier under either label, given that classifier's current coefficients. The gradients for the two possible labels are either weighted by the estimated probability, or the maximum or minimum of the two is taken.
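As an illustration of this criterion only (not the package's internal code; the helper name and the use of the gradient's norm are assumptions), the score for one arm's logistic model could be computed as:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def active_learning_score(x, coefs, how='weighted'):
        # Gradient of the logistic loss w.r.t. the coefficients for one
        # observation x under label y is (sigmoid(coefs @ x) - y) * x
        p = sigmoid(coefs @ x)
        grad_pos = np.linalg.norm((p - 1.0) * x)   # magnitude if the label were 1
        grad_neg = np.linalg.norm(p * x)           # magnitude if the label were 0
        if how == 'weighted':                      # weight by estimated probability
            return p * grad_pos + (1.0 - p) * grad_neg
        return max(grad_pos, grad_neg) if how == 'max' else min(grad_pos, grad_neg)

The arm whose model yields the largest such score would be the one favoured by the active-learning step.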
Parameters: - nchoices (int) – Number of arms/labels to choose from.
- C (float) – Inverse of the regularization parameter for Logistic regression. For more details see sklearn.linear_model.LogisticRegression.
- explore_prob (float (0,1)) – Probability of selecting an action according to active learning criteria.
- decay (float (0,1)) – After each prediction, the probability of selecting an arm according to active learning criteria is set to p = p*decay
- beta_prior (str ‘auto’, None, or tuple ((a,b), n)) – If not None, when there are fewer than ‘n’ positive samples from a class (actions from that arm that resulted in a reward), it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to ‘auto’, it will be calculated as: beta_prior = ((3/nchoices,4), 1)
-
fit
(X, a, r)¶ Fits the base algorithm (one per class) to partially labeled data with actions chosen by this same policy.
Parameters: - X (array (n_samples, n_features)) – Matrix of covariates for the available data.
- a (array (n_samples), int type) – Arms or actions that were chosen for each observation.
- r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.
Returns: self – Copy of this same object
Return type: obj
-
predict
(X, exploit=False, gradient_calc='weighted')¶ Selects actions according to this policy for new data.
Parameters: - X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.
- exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.
- gradient_calc (str, one of ‘weighted’, ‘max’ or ‘min’) – How to calculate the gradient that an observation would have on the loss function for each classifier, given that it could be of either class (positive or negative) for the classifier that predicts each arm. If ‘weighted’, the two gradients are weighted by the probability estimates from the base algorithm.
Returns: pred – Actions chosen by the policy.
Return type: array (n_samples,)
-
class contextualbandits.online.AdaptiveGreedy(base_algorithm, nchoices, window_size=500, percentile=30, decay=0.9998, decay_type='threshold', initial_thr='auto', fixed_thr=False, beta_prior='auto')¶
Bases: object
Adaptive Greedy
Takes the action with highest estimated reward, unless that estimation falls below a certain moving threshold, in which case it takes a random action.
Note
The threshold for the reward probabilities can be set to a hard-coded number, or it can be calculated dynamically by keeping track of the predictions the policy makes and taking a fixed percentile of that distribution as the threshold. In the latter case, the percentile is computed over separate batches of predictions rather than over a sliding window.
The original idea was taken from the paper in the references and adapted to the contextual-bandits setting in this way.
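A minimal sketch of this selection and threshold-update logic, with illustrative variable names (not the class's internals):

    import numpy as np

    rng = np.random.default_rng(0)

    def adaptive_greedy_choice(scores, threshold):
        # Greedy if the best estimate clears the threshold, random otherwise
        best = int(np.argmax(scores))
        return best if scores[best] >= threshold else int(rng.integers(len(scores)))

    def decay_step(threshold, percentile, decay, decay_type):
        # Applied after each prediction (see the 'decay' and 'decay_type' parameters)
        if decay_type == 'threshold':
            threshold *= decay
        else:
            percentile *= decay
        return threshold, percentile

    def batch_threshold_update(batch_predictions, percentile):
        # In dynamic-threshold mode, applied once every window_size predictions
        return float(np.percentile(batch_predictions, percentile))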
Parameters: base_algorithm (obj) – Base binary classifier for which each sample for each class will be fit.
nchoices (int) – Number of arms/labels to choose from.
window_size (int) – Number of predictions after which the threshold will be updated to the desired percentile. Ignored when passing fixed_thr=True.
percentile (int [0,100]) – Percentile of the predictions sample to set as threshold, below which actions are random. Ignored in fixed threshold mode.
decay (float (0,1)) – After each prediction, either the threshold or the percentile gets adjusted to val_{t+1} = val_t * decay. Ignored when passing fixed_thr=True.
decay_type (str, either ‘percentile’ or ‘threshold’) – Whether to decay the threshold itself or the percentile of the predictions to take after each prediction. If set to ‘threshold’ and fixed_thr=False, the threshold will be recalculated to the same percentile the next time it is updated, but with the latest predictions. Ignored when passing fixed_thr=True.
initial_thr (str ‘auto’ or float (0,1)) – Initial threshold for the prediction below which a random action is taken. If set to ‘auto’, it will be calculated as initial_thr = 1.5/nchoices
fixed_thr (bool) – Whether the threshold is to be kept fixed, or updated to a percentile after N predictions.
beta_prior (str ‘auto’, None, or tuple ((a,b), n)) – If not None, when there are fewer than ‘n’ positive samples from a class (actions from that arm that resulted in a reward), it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to ‘auto’, it will be calculated as: beta_prior = ((3/nchoices,4), 1)
References
[1] Mortal multi-armed bandits (2009)
-
decision_function
(X)¶ Get the estimated probability for each arm from the classifier that predicts it.
Note
This is quite different from the decision_function of the other policies, as it doesn’t follow the policy in assigning random choices with some probability. A sigmoid function is applied to the decision_function of the classifier if it doesn’t have a predict_proba method.
Parameters: X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm. Returns: scores – Scores following this policy for each arm. Return type: array (n_samples, n_choices)
-
fit
(X, a, r)¶ Fits the base algorithm (one per class) to partially labeled data.
Parameters: - X (array (n_samples, n_features)) – Matrix of covariates for the available data.
- a (array (n_samples), int type) – Arms or actions that were chosen for each observation.
- r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.
Returns: self – Copy of this same object
Return type: obj
-
predict
(X, exploit=False)¶ Selects actions according to this policy for new data.
Parameters: - X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.
- exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.
Returns: pred – Actions chosen by the policy.
Return type: array (n_samples,)
-
class contextualbandits.online.BayesianTS(nchoices, method='advi', beta_prior=((1, 1), 3))¶
Bases: object
Bayesian Thompson Sampling
Performs Thompson Sampling by sampling a set of Logistic Regression coefficients for each class, then predicting the class with the highest estimate.
Note
The implementation here uses PyMC3’s GLM formula with default parameters and ADVI. You might want to try building a different one yourself with PyMC3 or Edward. The method as implemented here is not scalable to high-dimensional or big datasets.
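Purely as an illustration of the Thompson-sampling step (a Gaussian approximation of each arm's coefficient posterior is assumed below, which is not what the class itself does internally):

    import numpy as np

    def thompson_sampling_predict(X, coef_means, coef_covs, rng=np.random.default_rng()):
        # coef_means[k], coef_covs[k]: approximate coefficient posterior for arm k
        n_arms = len(coef_means)
        scores = np.empty((X.shape[0], n_arms))
        for k in range(n_arms):
            w = rng.multivariate_normal(coef_means[k], coef_covs[k])  # one draw per arm
            scores[:, k] = 1.0 / (1.0 + np.exp(-X @ w))               # sampled reward estimate
        return scores.argmax(axis=1)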
Parameters: - nchoices (int) – Number of arms/labels to choose from.
- method (str, either ‘advi’ or ‘nuts’) – Method used to sample coefficients (see PyMC3’s documentation for more details).
- beta_prior (None, or tuple ((a,b), n)) – If not None, when there are fewer than ‘n’ positive samples from a class (actions from that arm that resulted in a reward), it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’.
References
[1] An empirical evaluation of Thompson sampling (2011)
-
decision_function
(X)¶ Get the scores for each arm following this policy’s action-choosing criteria.
Parameters: X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm. Returns: scores – Scores following this policy for each arm. Return type: array (n_samples, n_choices)
-
fit
(X, a, r)¶ Samples coefficients for Logistic Regression models from partially-labeled data, with actions chosen by this same policy.
Parameters: - X (array (n_samples, n_features)) – Matrix of covariates for the available data.
- a (array (n_samples), int type) – Arms or actions that were chosen for each observation.
- r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.
Returns: self – Copy of this same object
Return type: obj
-
predict
(X, exploit=False)¶ Selects actions according to this policy for new data.
Parameters: - X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.
- exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.
Returns: pred – Actions chosen by the policy.
Return type: array (n_samples,)
-
class contextualbandits.online.BayesianUCB(nchoices, percentile=80, method='advi', nsamples=None, beta_prior=((3, 1), 3))¶
Bases: object
Bayesian Upper-Confidence Bound
Gets an upper-confidence bound by Bayesian Logistic Regression estimates.
Note
The implementation here uses PyMC3’s GLM formula with default parameters and ADVI. You might want to try building a different one yourself from PyMC3 or Edward. The method as implemented here is not scalable to high-dimensional or big datasets.
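For illustration only, the difference with respect to Thompson Sampling is that each arm's score is a percentile over many sampled estimates rather than a single draw (the posterior draws below are assumed inputs, not the class's PyMC3 code):

    import numpy as np

    def bayesian_ucb_scores(X, coef_samples, percentile=80):
        # coef_samples[k]: array (n_draws, n_features) of posterior draws for arm k
        n_arms = len(coef_samples)
        scores = np.empty((X.shape[0], n_arms))
        for k in range(n_arms):
            p = 1.0 / (1.0 + np.exp(-X @ coef_samples[k].T))   # (n_samples, n_draws)
            scores[:, k] = np.percentile(p, percentile, axis=1)
        return scores   # the chosen actions would be scores.argmax(axis=1)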
Parameters: - nchoices (int) – Number of arms/labels to choose from.
- percentile (int [0,100]) – Percentile of the predictions sample to take as the upper confidence bound.
- method (str, either ‘advi’ or ‘nuts’) – Method used to sample coefficients (see PyMC3’s documentation for more details).
- beta_prior (None, or tuple ((a,b), n)) – If not None, when there are fewer than ‘n’ positive samples from a class (actions from that arm that resulted in a reward), it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’.
-
decision_function
(X)¶ Get the scores for each arm following this policy’s action-choosing criteria.
Parameters: X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm. Returns: scores – Scores following this policy for each arm. Return type: array (n_samples, n_choices)
-
fit
(X, a, r)¶ Samples Logistic Regression coefficients for partially labeled data with actions chosen by this policy.
Parameters: - X (array (n_samples, n_features)) – Matrix of covariates for the available data.
- a (array (n_samples), int type) – Arms or actions that were chosen for each observation.
- r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.
Returns: self – Copy of this same object
Return type: obj
-
predict
(X, exploit=False)¶ Selects actions according to this policy for new data.
Parameters: - X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.
- exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.
Returns: pred – Actions chosen by the policy.
Return type: array (n_samples,)
-
class contextualbandits.online.BootstrappedTS(base_algorithm, nchoices, nsamples=10, beta_prior='auto')¶
Bases: object
Bootstrapped Thompson Sampling
Performs Thompson Sampling by fitting several models per class on bootstrapped samples, then making predictions by taking one of them at random for each class.
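A rough sketch of that procedure for a single arm, using a scikit-learn classifier as the base algorithm (illustrative only, with assumed variable names):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def fit_bootstrapped_models(X_arm, y_arm, nsamples=10):
        # Fit one model per bootstrap resample of this arm's data
        models = []
        for _ in range(nsamples):
            idx = rng.integers(0, X_arm.shape[0], size=X_arm.shape[0])
            models.append(LogisticRegression().fit(X_arm[idx], y_arm[idx]))
        return models

    def thompson_score(models, x):
        # Thompson step: use one bootstrapped model chosen at random for this arm
        model = models[rng.integers(len(models))]
        return model.predict_proba(x.reshape(1, -1))[0, 1]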
Parameters: - base_algorithm (obj) – Base binary classifier for which each sample for each class will be fit.
- nchoices (int) – Number of arms/labels to choose from.
- nsamples (int) – Number of bootstrapped samples per class to take.
- beta_prior (str ‘auto’, None, or tuple ((a,b), n)) – If not None, when there are fewer than ‘n’ positive samples from a class (actions from that arm that resulted in a reward), it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to ‘auto’, it will be calculated as: beta_prior = ((3/nchoices,4), 1)
References
[1] An empirical evaluation of Thompson sampling (2011)
-
decision_function
(X)¶ Get the scores for each arm following this policy’s action-choosing criteria.
Parameters: X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm. Returns: scores – Scores following this policy for each arm. Return type: array (n_samples, n_choices)
-
fit
(X, a, r)¶ Fits the base algorithm (one per sample per class) to partially labeled data, with the actions having been determined by this same policy.
Parameters: - X (array (n_samples, n_features)) – Matrix of covariates for the available data.
- a (array (n_samples), int type) – Arms or actions that were chosen for each observation.
- r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.
Returns: self – Copy of this same object
Return type: obj
-
predict
(X, exploit=False, output_score=False, apply_sigmoid_scores=True)¶ Selects actions according to this policy for new data.
Parameters: - X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.
- exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.
- output_score (bool) – Whether to output the score that this method predicted, in case it is desired to use it with this package’s offpolicy and evaluation modules.
- apply_sigmoid_scores (bool) – If passing output_score=True, whether to apply a sigmoid function to the scores from the decision function of the classifier that predicts each class.
Returns: pred – Actions chosen by the policy. If passing output_score=True, it will be an array with the first column indicating the action and the second one indicating the score that the classifier gave to that class.
Return type: array (n_samples,) or (n_samples, 2)
-
class contextualbandits.online.BootstrappedUCB(base_algorithm, nchoices, nsamples=10, percentile=80, beta_prior='auto')¶
Bases: object
Bootstrapped Upper-Confidence Bound
Obtains an upper confidence bound for each arm by fitting several models per class on bootstrapped samples and taking a percentile of their predictions as the score.
-
decision_function
(X)¶ Get the scores for each arm following this policy’s action-choosing criteria.
Parameters: X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm. Returns: scores – Scores following this policy for each arm. Return type: array (n_samples, n_choices)
-
fit
(X, a, r)¶ Fits the base algorithm (one per sample per class) to partially labeled data, with the actions having been determined by this same policy.
Parameters: - X (array (n_samples, n_features)) – Matrix of covariates for the available data.
- a (array (n_samples), int type) – Arms or actions that were chosen for each observation.
- r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.
Returns: self – Copy of this same object
Return type: obj
-
predict
(X, exploit=False, output_score=False, apply_sigmoid_scores=True)¶ Selects actions according to this policy for new data.
Parameters: - X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.
- exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.
- output_score (bool) – Whether to output the score that this method predicted, in case it is desired to use it with this package’s offpolicy and evaluation modules.
- apply_sigmoid_scores (bool) – If passing output_score=True, whether to apply a sigmoid function to the scores from the decision function of the classifier that predicts each class.
Returns: pred – Actions chosen by the policy. If passing output_score=True, it will be an array with the first column indicating the action and the second one indicating the score that the classifier gave to that class.
Return type: array (n_samples,) or (n_samples, 2)
-
class contextualbandits.online.EpsilonGreedy(base_algorithm, nchoices, explore_prob=0.2, decay=0.9999, beta_prior='auto')¶
Bases: object
Epsilon Greedy
Takes a random action with probability p, or the action with highest estimated reward with probability 1-p.
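The selection rule can be sketched as follows (illustrative only, not the class's internals):

    import numpy as np

    rng = np.random.default_rng(0)

    def epsilon_greedy_choice(scores, explore_prob):
        # Random arm with probability explore_prob, otherwise the best estimate
        if rng.random() < explore_prob:
            return int(rng.integers(len(scores)))
        return int(np.argmax(scores))

    explore_prob = 0.2
    decay = 0.9999
    explore_prob *= decay   # applied after each prediction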
Parameters: - base_algorithm (obj) – Base binary classifier for which each sample for each class will be fit.
- nchoices (int) – Number of arms/labels to choose from.
- explore_prob (float (0,1)) – Probability of taking a random action at each round.
- decay (float (0,1)) – After each prediction, the explore probability reduces to p = p*decay
- beta_prior (str ‘auto’, None, or tuple ((a,b), n)) – If not None, when there are fewer than ‘n’ positive samples from a class (actions from that arm that resulted in a reward), it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to ‘auto’, it will be calculated as: beta_prior = ((3/nchoices,4), 1)
References
[1] The k-armed dueling bandits problem (2010)
-
decision_function
(X)¶ Get the decision function for each arm from the classifier that predicts it.
Note
This is quite different from the decision_function of the other policies, as it doesn’t follow the policy in assigning random choices with some probability.
Parameters: X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm. Returns: scores – Scores following this policy for each arm. Return type: array (n_samples, n_choices)
-
fit
(X, a, r)¶ Fits the base algorithm (one per class) to partially labeled data.
Parameters: - X (array (n_samples, n_features)) – Matrix of covariates for the available data.
- a (array (n_samples), int type) – Arms or actions that were chosen for each observation.
- r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.
Returns: self – Copy of this same object
Return type: obj
-
predict
(X, exploit=False, output_score=False, apply_sigmoid_scores=True)¶ Selects actions according to this policy for new data.
Parameters: - X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.
- exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.
- output_score (bool) – Whether to output the score that this method predicted, in case it is desired to use it with this package’s offpolicy and evaluation modules.
- apply_sigmoid_scores (bool) – If passing output_score=True, whether to apply a sigmoid function to the scores from the decision function of the classifier that predicts each class.
Returns: pred – Actions chosen by the policy. If passing output_score=True, it will be an array with the first column indicating the action and the second one indicating the score that the classifier gave to that class.
Return type: array (n_samples,) or (n_samples, 2)
-
class contextualbandits.online.ExploreFirst(base_algorithm, nchoices, explore_rounds=2500, beta_prior='auto')¶
Bases: object
Explore First, a.k.a. Explore-Then-Exploit
Selects random actions for the first N predictions, after which it selects the best arm only according to its estimates.
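Illustrative selection rule (not the class's code):

    import numpy as np

    rng = np.random.default_rng(0)

    def explore_first_choice(scores, rounds_so_far, explore_rounds=2500):
        # Purely random during the first explore_rounds predictions, greedy afterwards
        if rounds_so_far < explore_rounds:
            return int(rng.integers(len(scores)))
        return int(np.argmax(scores))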
Parameters: - base_algorithm (obj) – Base binary classifier for which each sample for each class will be fit.
- nchoices (int) – Number of arms/labels to choose from.
- explore_rounds (int) – Number of rounds to wait before switching to exploitation mode; the policy will switch after making this many predictions.
- beta_prior (str ‘auto’, None, or tuple ((a,b), n)) – If not None, when there are fewer than ‘n’ positive samples from a class (actions from that arm that resulted in a reward), it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to ‘auto’, it will be calculated as: beta_prior = ((3/nchoices,4), 1)
References
[1] The k-armed dueling bandits problem (2012)
-
decision_function
(X)¶ Get the decision function for each arm from the classifier that predicts it.
Note
This is quite different from the decision_function of the other policies, as it doesn’t follow the policy in assigning random choices with equal probability during the initial exploration rounds.
Parameters: X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm. Returns: scores – Scores following this policy for each arm. Return type: array (n_samples, n_choices)
-
fit
(X, a, r)¶ Fits the base algorithm (one per class) to partially labeled data with actions chosen by this same policy.
Parameters: - X (array (n_samples, n_features)) – Matrix of covariates for the available data.
- a (array (n_samples), int type) – Arms or actions that were chosen for each observation.
- r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.
Returns: self – Copy of this same object
Return type: obj
-
predict
(X, exploit=False)¶ Selects actions according to this policy for new data.
Parameters: - X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.
- exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.
Returns: pred – Actions chosen by the policy.
Return type: array (n_samples,)
-
class contextualbandits.online.LinUCB(nchoices, alpha=1.0)¶
Bases: object
Note
The formula described in the paper where this algorithm first appeared had dimensions that didn’t match an array of predictions. It was assumed here that the (n x n) matrix that results inside the square root is to be summed by rows.
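For reference, the standard per-observation LinUCB score for an arm with accumulated matrix A and vector b is theta.x + alpha*sqrt(x' A^-1 x), with theta = A^-1 b. A sketch of that calculation and of the rank-one updates is below (illustrative only, and not necessarily how the batch case is handled here, as per the note above):

    import numpy as np

    def linucb_score(x, A_inv, b, alpha=1.0):
        # A_inv: inverse of A = lambda*I + sum(x x'); b: sum of r*x for this arm
        theta = A_inv @ b
        return float(theta @ x + alpha * np.sqrt(x @ A_inv @ x))

    def linucb_update(A, b, x, r):
        # Rank-one update after observing reward r for this arm
        A += np.outer(x, x)
        b += r * x
        return A, b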
Parameters: - nchoices (int) – Number of arms/labels to choose from.
- alpha (float) – Parameter to control the upper-confidence bound (more is higher).
References
[1] A contextual-bandit approach to personalized news article recommendation (2010)
-
fit
(X, a, r)¶ Fits the linear models (one per arm) for the first time to partially labeled data. Overwrites previously fitted coefficients, if there were any (see partial_fit for adding more data in batches).
Parameters: - X (array (n_samples, n_features)) – Matrix of covariates for the available data.
- a (array (n_samples), int type) – Arms or actions that were chosen for each observation.
- r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.
Returns: self – Copy of this same object
Return type: obj
-
partial_fit
(X, a, r)¶ Updates each linear model with a new batch of data, with actions chosen by this same policy.
Parameters: - X (array (n_samples, n_features)) – Matrix of covariates for the available data.
- a (array (n_samples), int type) – Arms or actions that were chosen for each observation.
- r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.
Returns: self – Copy of this same object
Return type: obj
-
predict
(X, exploit=False)¶ Selects actions according to this policy for new data.
Parameters: - X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.
- exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.
Returns: pred – Actions chosen by the policy.
Return type: array (n_samples,)
-
class contextualbandits.online.SeparateClassifiers(base_algorithm, nchoices, beta_prior=None)¶
Bases: object
Separate Classifiers per arm
Fits one classifier per arm using only the data on which that arm was chosen. Predicts as One-Vs-Rest.
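A minimal usage sketch with a scikit-learn base classifier, assuming the package is installed and using synthetic logged data in the formats described below:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from contextualbandits.online import SeparateClassifiers

    nchoices = 5
    X = np.random.random((1000, 10))              # covariates
    a = np.random.randint(nchoices, size=1000)    # arms that were chosen
    r = np.random.randint(2, size=1000)           # binary rewards observed

    policy = SeparateClassifiers(LogisticRegression(), nchoices)
    policy.fit(X, a, r)
    actions = policy.predict(np.random.random((10, 10)))   # arms for new observations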
Parameters: - base_algorithm (obj) – Base binary classifier for which each sample for each class will be fit.
- nchoices (int) – Number of arms/labels to choose from.
- beta_prior (str ‘auto’, None, or tuple ((a,b), n)) – If not None, when there are fewer than ‘n’ positive samples from a class (actions from that arm that resulted in a reward), it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to ‘auto’, it will be calculated as: beta_prior = ((3/nchoices,4), 1)
-
decision_function
(X)¶ Get the scores for each arm following this policy’s action-choosing criteria.
Parameters: X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm. Returns: scores – Scores following this policy for each arm. Return type: array (n_samples, n_choices)
-
decision_function_std
(X)¶ Get the predicted probabilities for each arm from the classifier that predicts it, standardized so that they sum up to 1.
Note
Classifiers are all fit on different data, so their raw probabilities will not add up to 1; here they are standardized so that they do.
Parameters: X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm. Returns: scores – Scores following this policy for each arm. Return type: array (n_samples, n_choices)
-
fit
(X, a, r)¶ Fits the base algorithm (one per class) to partially labeled data.
Parameters: - X (array (n_samples, n_features)) – Matrix of covariates for the available data.
- a (array (n_samples), int type) – Arms or actions that were chosen for each observation.
- r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.
Returns: self – Copy of this same object
Return type: obj
-
predict
(X, exploit=False, output_score=False, apply_sigmoid_scores=True)¶ Selects actions according to this policy for new data.
Parameters: - X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.
- exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.
- output_score (bool) – Whether to output the score that this method predicted, in case it is desired to use it with this package’s offpolicy and evaluation modules.
- apply_sigmoid_scores (bool) – If passing output_score=True, whether to apply a sigmoid function to the scores from the decision function of the classifier that predicts each class.
Returns: pred – Actions chosen by the policy. If passing output_score=True, it will be an array with the first column indicating the action and the second one indicating the score that the classifier gave to that class.
Return type: array (n_samples,) or (n_samples, 2)
-
predict_proba_separate
(X)¶ Get the predicted probabilities for each arm from the classifier that predicts it.
Note
Classifiers are all fit on different data, so the probabilities will not add up to 1.
Parameters: X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm. Returns: scores – Scores following this policy for each arm. Return type: array (n_samples, n_choices)
-
class contextualbandits.online.SoftmaxExplorer(base_algorithm, nchoices, beta_prior='auto')¶
Bases: object
Soft-Max Explorer
Selects an action according to probabilities determined by a softmax transformation of the scores from the decision function that predicts each class.
Note
If the base algorithm has ‘predict_proba’, but no ‘decision_function’, it will calculate the ‘probabilities’ with a simple scaling by sum rather than by a softmax.
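Illustrative selection step (a numerically stable softmax is assumed here; the class's internal calculation may differ):

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax_choice(decision_scores):
        # decision_scores: array (n_choices,) of raw decision-function values
        z = decision_scores - decision_scores.max()   # for numerical stability
        probs = np.exp(z) / np.exp(z).sum()
        return int(rng.choice(len(probs), p=probs))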
Parameters: - base_algorithm (obj) – Base binary classifier for which each sample for each class will be fit.
- nchoices (int) – Number of arms/labels to choose from.
- beta_prior (str ‘auto’, None, or tuple ((a,b), n)) – If not None, when there are fewer than ‘n’ positive samples from a class (actions from that arm that resulted in a reward), it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to ‘auto’, it will be calculated as: beta_prior = ((3/nchoices,4), 1)
-
decision_function
(X, output_score=False, apply_sigmoid_score=True)¶ Get the scores for each arm following this policy’s action-choosing criteria.
Parameters: X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm. Returns: scores – Scores following this policy for each arm. Return type: array (n_samples, n_choices)
-
fit
(X, a, r)¶ Fits the base algorithm (one per class) to partially labeled data.
Parameters: - X (array (n_samples, n_features)) – Matrix of covariates for the available data.
- a (array (n_samples), int type) – Arms or actions that were chosen for each observation.
- r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.
Returns: self – Copy of this same object
Return type: obj
-
predict
(X, exploit=False, output_score=False)¶ Selects actions according to this policy for new data.
Parameters: - X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.
- exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.
- output_score (bool) – Whether to output the score that this method predicted, in case it is desired to use it with this package’s offpolicy and evaluation modules.
Returns: pred – Actions chosen by the policy. If passing output_score=True, it will be an array with the first column indicating the action and the second one indicating the score that the classifier gave to that class.
Return type: array (n_samples,) or (n_samples, 2)
Off-policy learning¶
-
class contextualbandits.offpolicy.DoublyRobustEstimator(base_algorithm, reward_estimator, nchoices, method='rovr', handle_invalid=True, c=None, pmin=1e-05)¶
Bases: object
Doubly-Robust Estimator
Estimates the expected reward for each arm, applies a correction for the actions that were chosen, and converts the problem to cost-sensitive classification, on which the base algorithm is then fit.
This technique converts the problem into a cost-sensitive classification problem by calculating a matrix of expected rewards and turning it into costs. The base algorithm is then fit to this data, using either the Weighted All-Pairs approach, which requires a binary classifier with sample weights as base algorithm, or the Regression One-Vs-Rest approach, which requires a regressor as base algorithm.
In the Weighted All-Pairs approach, this technique will fail if there are actions that were never taken by the exploration policy, as it cannot construct a model for them.
The expected rewards are estimated with the imputer algorithm passed here, which should output a number in the range [0,1].
This technique is meant for the case of continuous rewards in the [0,1] interval, but here it is used for the case of discrete rewards {0,1}, under which it performs poorly. Its use is not recommended; it is provided for comparison purposes only.
This method requires forming reward estimates of all arms for each observation. In order to do so, you can either provide the estimates as an array (see Parameters), or pass a model.
One method to obtain reward estimates is to fit a model to the data and use its predictions as reward estimates. You can do so by passing an object of class contextualbandits.online.SeparateClassifiers which should already be fitted, or by passing a classifier with a ‘predict_proba’ method, which will be put into a ‘SeparateClassifiers’
object and fit to the same data passed to this function to obtain reward estimates. The estimates can make invalid predictions if there are some arms which, every time they were chosen, resulted in a reward, or never resulted in a reward. In such cases, this function includes the option to impute the “predictions” for them (which would otherwise always be exactly zero or one regardless of the context) by replacing them with random numbers ~Beta(3,1) or ~Beta(1,3) for the cases of always-good and always-bad.
This is just a wild idea though, and it doesn’t guarantee reasonable results in such a situation.
Note that, if you are using the ‘SeparateClassifiers’ class from the online module in this same package, it comes with a method ‘predict_proba_separate’ that can be used to get reward estimates. It still can suffer from the same problem of always-one and always-zero predictions though.
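As a sketch of the correction described above (variable names are illustrative), the doubly-robust reward estimate for each observation and arm is the model's estimate plus, for the arm that was actually chosen, the propensity-weighted residual:

    import numpy as np

    def doubly_robust_rewards(rhat, a, r, p, pmin=1e-5):
        # rhat: array (n_samples, n_choices) of estimated rewards for every arm
        # a, r, p: chosen arms, observed rewards, and scores from the logging policy
        dr = rhat.astype(float)
        rows = np.arange(dr.shape[0])
        p = np.clip(p, pmin, None)            # avoid dividing by very small scores
        dr[rows, a] += (r - dr[rows, a]) / p
        return dr   # these estimates are turned into costs for the base algorithm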
Parameters: base_algorithm (obj) – Base algorithm to be used for cost-sensitive classification.
reward_estimator (obj or array (n_samples, n_choices)) –
- One of the following:
- An array (n_samples, n_choices) with the estimated rewards for each arm and each observation (see Note for details).
- An already-fit object of class ‘contextualbandits.online.SeparateClassifiers’, which will be used to make predictions on the actions chosen and the actions that the new policy would choose.
- A classifier with a ‘predict_proba’ method, which will be fit to the same test data passed here in order to obtain reward estimates (see Note 2 for details).
nchoices (int) – Number of arms/labels to choose from. Only used when passing a classifier object to ‘reward_estimator’.
method (str, either ‘rovr’ or ‘wap’) – Whether to use Regression One-Vs-Rest or Weighted All-Pairs (see Note 1)
handle_invalid (bool) – Whether to replace 0/1 estimated rewards with randomly-generated numbers (see Note 2)
c (None or float) – Constant by which to multiply all scores from the exploration policy.
pmin (None or float) – Scores from the exploration policy will be clipped so that they are never below pmin.
References
[1] Doubly robust policy evaluation and learning (2011)
[2] Doubly robust policy evaluation and optimization (2014)
-
decision_function
(X)¶ Get the score distribution for the arms’ rewards.
Note
For details on how this is calculated, see the documentation of the RegressionOneVsRest and WeightedAllPairs classes in the costsensitive package.
Parameters: X (array (n_samples, n_features)) – New observations for which to evaluate actions. Returns: pred – Score assigned to each arm for each observation (see Note). Return type: array (n_samples, n_choices)
-
fit
(X, a, r, p)¶ Fits the Doubly-Robust estimator to partially-labeled data collected from a different policy.
Parameters: - X (array (n_samples, n_features)) – Matrix of covariates for the available data.
- a (array (n_samples), int type) – Arms or actions that were chosen for each observation.
- r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.
- p (array (n_samples)) – Reward estimates for the actions that were chosen by the policy.
-
predict
(X)¶ Predict best arm for new data.
Parameters: X (array (n_samples, n_features)) – New observations for which to choose an action. Returns: pred – Actions chosen by this technique. Return type: array (n_samples,)
-
class contextualbandits.offpolicy.OffsetTree(base_algorithm, nchoices, c=None, pmin=1e-05)¶
Bases: object
Offset Tree
Parameters: - base_algorithm (obj) – Binary classifier to be used for each classification sub-problem in the tree.
- nchoices (int) – Number of arms/labels to choose from.
References
[1] The offset tree for learning with partial labels (2009)
-
fit
(X, a, r, p)¶ Fits the Offset Tree estimator to partially-labeled data collected from a different policy.
Parameters: - X (array (n_samples, n_features)) – Matrix of covariates for the available data.
- a (array (n_samples), int type) – Arms or actions that were chosen for each observation.
- r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.
- p (array (n_samples)) – Reward estimates for the actions that were chosen by the policy.
-
predict
(X)¶ Predict best arm for new data.
Note
While in theory making predictions from this algorithm should be faster than from others, the implementation here uses a Python loop for each observation, which is slow compared to NumPy array lookups, so the predictions will be slower to calculate than those from other algorithms.
Parameters: X (array (n_samples, n_features)) – New observations for which to choose an action. Returns: pred – Actions chosen by this technique. Return type: array (n_samples,)
Policy Evaluation¶
-
contextualbandits.evaluation.evaluateDoublyRobust(pred, X, a, r, p, reward_estimator, nchoices=None, handle_invalid=True, c=None, pmin=1e-05)¶
Doubly-Robust Policy Evaluation
Evaluates rewards of arm choices of a policy from data collected by another policy.
Note
This method requires forming reward estimates of the arms that were chosen and of the arms that the policy being evaluated would choose. In order to do so, you can either provide the estimates as an array (see Parameters), or pass a model.
One method to obtain reward estimates is to fit a model to both the training and test data and use its predictions as reward estimates. You can do so by passing an object of class contextualbandits.online.SeparateClassifiers which should be already fitted.
Another method is to fit a model to the test data, in which case you can pass a classifier with a ‘predict_proba’ method here, which will be fit to the same test data passed to this function to obtain reward estimates.
The last two options can suffer from invalid predictions if there are some arms which, every time they were chosen, resulted in a reward, or never resulted in a reward. In such cases, this function includes the option to impute the “predictions” for them (which would otherwise always be exactly zero or one regardless of the context) by replacing them with random numbers ~Beta(3,1) or ~Beta(1,3) for the cases of always-good and always-bad.
This is just a wild idea though, and it doesn’t guarantee reasonable results in such a situation.
Note that, if you are using the ‘SeparateClassifiers’ class from the online module in this same package, it comes with a method ‘predict_proba_separate’ that can be used to get reward estimates. It still can suffer from the same problem of always-one and always-zero predictions though.
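For reference, the doubly-robust value estimate computed here is roughly the following (variable names are illustrative):

    import numpy as np

    def dr_policy_value(pred, a, r, p, rhat_new, rhat_old, pmin=1e-5):
        # rhat_new: estimated reward of the action the evaluated policy would choose
        # rhat_old: estimated reward of the action that was actually chosen
        p = np.clip(p, pmin, None)
        matches = (pred == a).astype(float)
        return float(np.mean(rhat_new + matches * (r - rhat_old) / p))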
Parameters: pred (array (n_samples,)) – Arms that would be chosen by the policy to evaluate.
X (array (n_samples, n_features)) – Matrix of covariates for the available data.
a (array (n_samples), int type) – Arms or actions that were chosen for each observation.
r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.
p (array (n_samples)) – Scores or reward estimates from the policy that generated the data for the actions that were chosen by it.
reward_estimator (obj or array (n_samples, 2)) –
- One of the following:
- An array with the first column corresponding to the reward estimates for the action chosen by the new policy, and the second column corresponding to the reward estimates for the action chosen in the data (see Note for details).
- An already-fit object of class ‘contextualbandits.online.SeparateClassifiers’, which will be used to make predictions on the actions chosen and the actions that the new policy would choose.
- A classifier with a ‘predict_proba’ method, which will be fit to the same test data passed here in order to obtain reward estimates (see Note for details).
nchoices (int) – Number of arms/labels to choose from. Only used when passing a classifier object to ‘reward_estimator’.
handle_invalid (bool) – Whether to replace 0/1 estimated rewards with randomly-generated numbers (see Note)
c (None or float) – Constant by which to multiply all scores from the exploration policy.
pmin (None or float) – Scores from the exploration policy will be clipped so that they are never below pmin.
References
[1] Doubly robust policy evaluation and learning (2011)
-
contextualbandits.evaluation.evaluateRejectionSampling(policy, X, a, r, online=False, start_point_online='random', batch_size=10)¶
Evaluate a policy using rejection sampling on test data.
Note
In order for this method to be unbiased, the actions on the test sample must have been collected at random and not according to some other policy.
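The idea behind the evaluation in the offline (online=False) case can be sketched as follows, assuming the logged actions were chosen uniformly at random:

    import numpy as np

    def rejection_sampling_value(policy, X, a, r):
        # Keep only the cases where the policy would have chosen the same arm
        # that was logged, and average the observed rewards over those cases
        pred = policy.predict(X)
        matches = pred == a
        if matches.sum() == 0:
            return np.nan, 0
        return float(r[matches].mean()), int(matches.sum())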
Parameters: - policy (obj) – Policy to be evaluated (already fitted to data). Must have a ‘predict’ method. If it is an online policy, it must also have a ‘fit’ method.
- X (array (n_samples, n_features)) – Matrix of covariates for the available data.
- a (array (n_samples), int type) – Arms or actions that were chosen for each observation.
- r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.
- online (bool) – Whether this is an online policy to be evaluated by refitting it to the data as it makes choices on it.
- start_point_online (either str ‘random’ or int in [0, n_samples-1]) – Point at which to start evaluating cases in the sample. Only used when passing online=True.
- batch_size (int) – After how many rounds to refit the policy being evaluated. Only used when passing online=True.
Returns: result – Estimated mean reward and number of observations taken.
Return type: tuple (float, int)
References
[1] A contextual-bandit approach to personalized news article recommendation (2010)