upliftml package
PySpark-based Estimators
- class upliftml.models.pyspark.CVTEstimator(base_model_class: Any, base_model_params: Dict, predictors_colname: str = 'features', treatment_colname: str = 'treatment', target_colname: str = 'outcome', output_colname: str = 'score')¶
Bases:
object
Estimates treatment effect by transforming the target variable into a new target variable Z, such that the treatment effect tau(X) = 2 * E[Z | X] - 1. This transformation results in a classification problem and is, thus, slightly different from the TransformedOutcomeEstimator, which results in a regression problem. Can only be used with 50-50 treatment vs. control RCT data.
The Class Variable Transformation technique was proposed in Jaskowski and Jaroszewicz (2012) (http://people.cs.pitt.edu/~milos/icml_clinicaldata_2012/Papers/Oral_Jaroszewitz_ICML_Clinical_2012.pdf).
- fit(df_train: pyspark.sql.DataFrame, df_val: Optional[Any] = None) None ¶
Trains the CVT model by transforming the target variable and fitting a classifier on the transformed targets.
- Args:
df_train (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes, and predictors
df_val (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes, and predictors
- predict(df: pyspark.sql.DataFrame) pyspark.sql.DataFrame ¶
Applies the CVT model and returns treatment effect predictions.
- Args:
df (pyspark.sql.DataFrame): a dataframe containing predictors
- Returns:
df (pyspark.sql.DataFrame): a dataframe containing treatment effect predictions
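To illustrate the class variable transformation behind CVTEstimator, the sketch below assumes the definition from the cited paper: Z = 1 for treated responders and control non-responders, and Z = 0 otherwise. It uses no predictors, so E[Z | X] reduces to the mean of Z. The toy data and the helper name cvt_target are illustrative and not part of upliftml.

```python
# Toy 50-50 RCT sample as (treatment, outcome) pairs. All values are made up.
rows = [(1, 1), (1, 1), (1, 0), (1, 1), (0, 1), (0, 0), (0, 0), (0, 1)]


def cvt_target(treatment: int, outcome: int) -> int:
    """Return Z = 1 for treated responders and control non-responders, else 0."""
    return int(treatment == outcome)


z = [cvt_target(t, y) for t, y in rows]

# Without predictors, E[Z | X] is simply the mean of Z.
uplift_from_z = 2 * sum(z) / len(z) - 1

# Reference value: difference in response rates between the two arms.
treated = [y for t, y in rows if t == 1]
control = [y for t, y in rows if t == 0]
diff_in_means = sum(treated) / len(treated) - sum(control) / len(control)

print(uplift_from_z, diff_in_means)
```

On this balanced sample the two quantities agree. With an unbalanced split the identity tau(X) = 2 * E[Z | X] - 1 no longer holds in general, which is consistent with the 50-50 requirement stated above.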
- class upliftml.models.pyspark.PropensityEstimator(base_model_class: Optional[Any] = None, base_model_params: Optional[Dict] = None, predictors_colname: Optional[str] = None, treatment_colname: str = 'treatment', treatment_value: int = 1, control_value: int = 0, output_colname: str = 'propensity')¶
Bases:
object
Estimates treatment propensities, either as the simple treatment proportions E[T] or by training a model for E[T | X].
- fit(df_train: pyspark.sql.DataFrame, df_val: Optional[Any] = None) None ¶
Fits a propensity model. If self.model is None, uses the proportion of treated instances in df_train to estimate E[T], independent of X. If self.model is instantiated, fits a full propensity model E[T | X].
- Args:
df_train (pyspark.sql.DataFrame): a dataframe containing the treatment indicators and predictors
df_val (pyspark.sql.DataFrame): a dataframe containing the treatment indicators and predictors
- predict(df: pyspark.sql.DataFrame) pyspark.sql.DataFrame ¶
Applies the propensity model and returns treatment assignment predictions. If self.model is None, uses the pre-calculated treatment proportion for all instances. If self.model is instantiated, applies the model to get estimates E[T | X].
- Args:
df (pyspark.sql.DataFrame): a dataframe containing predictors
- Returns:
df (pyspark.sql.DataFrame): a dataframe containing treatment assignment predictions
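The model-free branch of PropensityEstimator described above (no base model, E[T] estimated as a proportion) can be sketched in plain Python as follows. The toy list and variable names are illustrative and do not reflect the library's internals.

```python
# Toy treatment indicators. All values are made up.
treatments = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
treatment_value = 1

# "fit": proportion of treated instances, independent of X.
propensity = sum(1 for t in treatments if t == treatment_value) / len(treatments)

# "predict": every instance receives the same pre-calculated proportion.
scores = [propensity for _ in treatments]

print(propensity)
```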
- class upliftml.models.pyspark.RetrospectiveEstimator(base_model_class: Any, base_model_params: Dict, predictors_colname: str = 'features', treatment_colname: str = 'treatment', target_colname: str = 'outcome', positive_outcome_value: int = 1, output_colname: str = 'score')¶
Bases:
object
Estimates E[T | Y=1, X], which corresponds to estimating the relative treatment effect E[Y | T=1, X] / E[Y | T=0, X] in case of 50-50 treatment vs. control RCT data.
This estimator can also be used as the greedy solution for maximizing incrementality under ROI constraints, as described in Goldenberg et al. (2020) (preprint: https://drive.google.com/file/d/1E0KQ_sT09q1bpnlt9gZFFSbrx-YgGcqF/view).
- fit(df_train: pyspark.sql.DataFrame, df_val: Optional[Any] = None) None ¶
Trains the Retrospective Estimator E[T | Y=1, X].
- Args:
df_train (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes, and predictors
df_val (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes, and predictors
- predict(df: pyspark.sql.DataFrame) pyspark.sql.DataFrame ¶
Applies the Retrospective Estimator model and returns predictions for E[T | Y=1, X].
- Args:
df (pyspark.sql.DataFrame): a dataframe containing predictors
- Returns:
df (pyspark.sql.DataFrame): a dataframe containing predictions for E[T | Y=1, X]
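The link between the retrospective score and the relative treatment effect follows from Bayes' rule: with a 50-50 split, P(T=1 | Y=1, X) = E[Y | T=1, X] / (E[Y | T=1, X] + E[Y | T=0, X]), so the odds of the score equal E[Y | T=1, X] / E[Y | T=0, X]. The sketch below checks this on a toy sample without predictors. The data and variable names are illustrative only.

```python
# Toy 50-50 RCT sample as (treatment, outcome) pairs. All values are made up.
rows = [(1, 1), (1, 1), (1, 0), (1, 1), (0, 1), (0, 0), (0, 0), (0, 0)]

# Retrospective score: share of treated instances among positive outcomes.
positives = [t for t, y in rows if y == 1]
p_treated_given_positive = sum(positives) / len(positives)

# The odds of the score give the relative treatment effect.
relative_effect_from_score = p_treated_given_positive / (1 - p_treated_given_positive)

# Reference value: ratio of response rates in the two arms.
rate_treated = sum(y for t, y in rows if t == 1) / sum(1 for t, _ in rows if t == 1)
rate_control = sum(y for t, y in rows if t == 0) / sum(1 for t, _ in rows if t == 0)
relative_effect_direct = rate_treated / rate_control

print(relative_effect_from_score, relative_effect_direct)
```

Because the odds are a monotone function of the score, ranking instances by E[T | Y=1, X] gives the same order as ranking by the relative effect under this assumption.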
- class upliftml.models.pyspark.SLearnerEstimator(base_model_class: Any, base_model_params: Dict, predictors_colname: str = 'features', treatment_colname: str = 'treatment', target_colname: str = 'outcome', treatment_value: int = 1, control_value: int = 0, output_colname: str = 'score')¶
Bases:
object
Estimates treatment effect by training a single model for E[Y | T, X], applying the model with T=1 and T=0 and using the difference in these estimates as the estimated treatment effect.
The name S-learner originates from Künzel et al. (2019) (https://arxiv.org/pdf/1706.03461.pdf).
- fit(df_train: pyspark.sql.DataFrame, df_val: Optional[Any] = None) None ¶
Trains the S-learner.
- Args:
df_train (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes, and predictors
df_val (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes, and predictors
- predict(df: pyspark.sql.DataFrame) pyspark.sql.DataFrame ¶
Applies the S-learner and returns treatment effect predictions.
- Args:
df (pyspark.sql.DataFrame): a dataframe containing predictors
- Returns:
df (pyspark.sql.DataFrame): a dataframe containing treatment effect predictions
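A minimal sketch of the S-learner idea in plain Python: one model is fit on the predictors together with the treatment indicator, and is then queried twice per instance. The base learner here is a hypothetical stand-in (mean outcome per feature tuple), not a Spark ML model, and all names and data are illustrative.

```python
from collections import defaultdict

# Toy rows as (x, treatment, outcome); x is a single categorical predictor.
rows = [
    ("a", 1, 1), ("a", 1, 1), ("a", 0, 0), ("a", 0, 1),
    ("b", 1, 0), ("b", 1, 1), ("b", 0, 1), ("b", 0, 0),
]


def fit_mean_model(samples):
    """Stand-in base learner: mean outcome per distinct feature tuple."""
    sums, counts = defaultdict(float), defaultdict(int)
    for features, outcome in samples:
        sums[features] += outcome
        counts[features] += 1
    return {key: sums[key] / counts[key] for key in sums}


# A single model for E[Y | T, X]: the treatment is just another feature.
model = fit_mean_model([((x, t), y) for x, t, y in rows])


def predict_uplift(x):
    """Score with T=1 and T=0 and return the difference."""
    return model[(x, 1)] - model[(x, 0)]


print(predict_uplift("a"), predict_uplift("b"))
```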
- class upliftml.models.pyspark.TLearnerEstimator(base_model_class: Any, base_model_params: Dict, predictors_colname: str = 'features', treatment_colname: str = 'treatment', target_colname: str = 'outcome', treatment_value: int = 1, control_value: int = 0, output_colname: str = 'score')¶
Bases:
object
Estimates treatment effect as the difference in estimates from two separate models: E[Y | T=1, X] - E[Y | T=0, X].
The two-model approach is widely used for treatment effect estimation. The name T-learner originates from Künzel et al. (2019) (https://arxiv.org/pdf/1706.03461.pdf).
- fit(df_train: pyspark.sql.DataFrame, df_val: Optional[Any] = None) None ¶
Trains the T-learner.
- Args:
df_train (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes, and predictors
df_val (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes, and predictors
- predict(df: pyspark.sql.DataFrame) pyspark.sql.DataFrame ¶
Applies the T-learner and returns treatment effect predictions.
- Args:
df (pyspark.sql.DataFrame): a dataframe containing predictors
- Returns:
df (pyspark.sql.DataFrame): a dataframe containing treatment effect predictions
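The two-model approach of the T-learner can be sketched in plain Python as follows. Each arm gets its own model, and the estimate is the difference of their predictions. The base learner is a hypothetical stand-in (mean outcome per predictor value), and the data and names are illustrative.

```python
from collections import defaultdict

# Toy rows as (x, treatment, outcome); x is a single categorical predictor.
rows = [
    ("a", 1, 1), ("a", 1, 1), ("a", 0, 0), ("a", 0, 1),
    ("b", 1, 0), ("b", 1, 1), ("b", 0, 1), ("b", 0, 0),
]


def fit_mean_model(samples):
    """Stand-in base learner: mean outcome per predictor value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, outcome in samples:
        sums[x] += outcome
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}


# Two separate models: E[Y | T=1, X] and E[Y | T=0, X].
treated_model = fit_mean_model([(x, y) for x, t, y in rows if t == 1])
control_model = fit_mean_model([(x, y) for x, t, y in rows if t == 0])


def predict_uplift(x):
    """Difference between the treated-arm and control-arm predictions."""
    return treated_model[x] - control_model[x]


print(predict_uplift("a"), predict_uplift("b"))
```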
- class upliftml.models.pyspark.TransformedOutcomeEstimator(base_model_class: Any, base_model_params: Dict, predictors_colname: str = 'features', propensity_model_class: Optional[Any] = None, propensity_model_params: Optional[Dict] = None, treatment_colname: str = 'treatment', target_colname: str = 'outcome', treatment_value: int = 1, control_value: int = 0, output_colname: str = 'score')¶
Bases:
object
Estimates treatment effect by transforming the outcome, such that the expectation of the transformed outcome corresponds to the treatment effect. This transformation results in a regression problem and is, thus, slightly different from the CVTEstimator, which results in a classification problem.
The Transformed Outcome technique was proposed in Athey and Imbens (2015) (https://pdfs.semanticscholar.org/86ce/004214845a1683d59b64c4363a067d342cac.pdf).
- fit(df_train: pyspark.sql.DataFrame, df_val: Optional[Any] = None) None ¶
Trains the Transformed Outcome model by first fitting a propensity model, retrieving the propensity scores for each instance, computing the transformed outcomes, and finally fitting a regressor on the transformed outcomes.
- Args:
df_train (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes, and predictors
df_val (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes, and predictors
- predict(df: pyspark.sql.DataFrame) pyspark.sql.DataFrame ¶
Applies the Transformed Outcome Estimator and returns treatment effect predictions.
- Args:
df (pyspark.sql.DataFrame): a dataframe containing predictors
- Returns:
df (pyspark.sql.DataFrame): a dataframe containing treatment effect predictions
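As a sketch of the transformed outcome used by TransformedOutcomeEstimator, the example below assumes the form given in the cited paper, Z = Y * (T - p) / (p * (1 - p)), where p is the treatment propensity. Here p is a single constant (the treated share) and there are no predictors, so E[Z | X] reduces to the mean of Z. The data and the helper name transform_outcome are illustrative.

```python
# Toy RCT sample as (treatment, outcome) pairs with an uneven split.
# All values are made up.
rows = [(1, 3.0), (1, 1.0), (1, 2.0), (0, 1.0), (0, 0.0), (0, 2.0), (0, 1.0), (0, 0.0)]

# Constant propensity: share of treated instances.
p = sum(t for t, _ in rows) / len(rows)


def transform_outcome(treatment: int, outcome: float, propensity: float) -> float:
    """Transformed outcome whose expectation equals the treatment effect."""
    return outcome * (treatment - propensity) / (propensity * (1 - propensity))


z = [transform_outcome(t, y, p) for t, y in rows]
effect_from_z = sum(z) / len(z)

# Reference value: difference in mean outcomes between the two arms.
treated = [y for t, y in rows if t == 1]
control = [y for t, y in rows if t == 0]
diff_in_means = sum(treated) / len(treated) - sum(control) / len(control)

print(effect_from_z, diff_in_means)
```

Unlike the class variable transformation, the propensity enters the formula explicitly, so the sample does not need to be split 50-50 for the two values to agree.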
- class upliftml.models.pyspark.XLearnerEstimator(base_model_class_1: Any, base_model_params_1: Dict, base_model_class_2: Any, base_model_params_2: Dict, predictors_colname_2: str = 'features', predictors_colname_1: str = 'features', treatment_colname: str = 'treatment', target_colname: str = 'outcome', treatment_value: int = 1, control_value: int = 0, output_colname: str = 'score')¶
Bases:
object
Estimates treatment effect in three stages:
1. Train a T-learner to get scores Y_hat_1 and Y_hat_0.
2. Train regression models to predict the residuals: tau1 = E[Y(1) - Y_hat_1 | X] and tau0 = E[Y_hat_0 - Y(0) | X].
3. Estimate the treatment effect as a weighted average: tau(X) = p(X) * tau0(X) + (1 - p(X)) * tau1(X).
Our implementation sets p(X) = 0.5 for all X.
X-learner was proposed in Künzel et al. (2019) (https://arxiv.org/pdf/1706.03461.pdf).
- fit(df_train: pyspark.sql.DataFrame, df_val: Optional[Any] = None) None ¶
Trains the X-learner.
- Args:
df_train (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes, and predictors
df_val (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes, and predictors
- predict(df: pyspark.sql.DataFrame) pyspark.sql.DataFrame ¶
Applies the X-learner and returns treatment effect predictions.
- Args:
df (pyspark.sql.DataFrame): a dataframe containing predictors
- Returns:
df (pyspark.sql.DataFrame): a dataframe containing treatment effect predictions
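The third X-learner stage described above, the weighted average with p(X) = 0.5, can be sketched as follows. The function name combine_effects and the second-stage values are hypothetical. The first two stages are not reproduced here.

```python
def combine_effects(tau0: float, tau1: float, p: float = 0.5) -> float:
    """Weighted average tau(X) = p(X) * tau0(X) + (1 - p(X)) * tau1(X)."""
    return p * tau0 + (1 - p) * tau1


# Hypothetical per-instance second-stage predictions as (tau0, tau1) pairs.
second_stage = [(0.10, 0.30), (0.00, 0.20), (-0.10, 0.10)]

# With the default p = 0.5, the score is the plain average of both estimates.
scores = [combine_effects(tau0, tau1) for tau0, tau1 in second_stage]

print(scores)
```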
H2O-based Estimators
- class upliftml.models.h2o.CVTEstimator(base_model_class: Any, base_model_params: Dict, predictor_colnames: List[str], treatment_colname: str = 'treatment', target_colname: str = 'outcome', output_colname: str = 'score')¶
Bases:
object
Estimates treatment effect by transforming the target variable into a new target variable Z, such that the treatment effect tau(X) = 2 * E[Z | X] - 1. This transformation results in a classification problem and is, thus, slightly different from the TransformedOutcomeEstimator, which results in a regression problem. Can only be used with 50-50 treatment vs. control RCT data.
The Class Variable Transformation technique was proposed in Jaskowski and Jaroszewicz (2012) (http://people.cs.pitt.edu/~milos/icml_clinicaldata_2012/Papers/Oral_Jaroszewitz_ICML_Clinical_2012.pdf).
- fit(df_h2o_train: h2o.H2OFrame, df_h2o_val: Optional[h2o.H2OFrame] = None) None ¶
Trains the CVT model by transforming the target variable and fitting a classifier on the transformed targets.
- Args:
df_h2o_train (h2o.H2OFrame): a dataframe containing the treatment indicators, the observed outcomes, and predictors
df_h2o_val (h2o.H2OFrame, optional): a dataframe containing the treatment indicators, the observed outcomes, and predictors
- predict(df_h2o: h2o.H2OFrame) h2o.H2OFrame ¶
Applies the CVT model and returns treatment effect predictions.
- Args:
df_h2o (h2o.H2OFrame): a dataframe containing predictors
- Returns:
predictions (h2o.H2OFrame): a single column containing treatment effect predictions
- class upliftml.models.h2o.PropensityEstimator(base_model_class: Optional[Any] = None, base_model_params: Optional[Dict] = None, predictor_colnames: Optional[List[str]] = None, treatment_colname: str = 'treatment', treatment_value: Union[int, str] = 1, output_colname: str = 'propensity')¶
Bases:
object
Estimates treatment propensities, either as the simple treatment proportions E[T] or by training a model for E[T | X].
- fit(df_h2o_train: h2o.H2OFrame, df_h2o_val: Optional[h2o.H2OFrame] = None) None ¶
Fits a propensity model. If self.model is None, uses the proportion of treated instances in df_h2o_train to estimate E[T], independent of X. If self.model is instantiated, fits a full propensity model E[T | X].
- Args:
df_h2o_train (h2o.H2OFrame): a dataframe containing the treatment indicators and predictors
df_h2o_val (h2o.H2OFrame, optional): a dataframe containing the treatment indicators and predictors
- predict(df_h2o: h2o.H2OFrame) h2o.H2OFrame ¶
Applies the propensity model and returns treatment assignment predictions. If self.model is None, uses the pre-calculated treatment proportion for all instances. If self.model is instantiated, applies the model to get estimates E[T | X].
- Args:
df_h2o (h2o.H2OFrame): a dataframe containing predictors
- Returns:
propensities (h2o.H2OFrame): a single column containing treatment assignment predictions
- class upliftml.models.h2o.RLearnerEstimator(target_model_class: Any, target_model_params: Dict, final_model_class: Any, final_model_params: Dict, predictor_colnames: List[str], propensity_model_class: Optional[Any] = None, propensity_model_params: Optional[Any] = None, treatment_colname: str = 'treatment', target_colname: str = 'outcome', treatment_value: int = 1, control_value: int = 0, categorical_outcome: bool = False, fold_colname: Optional[str] = None, n_folds: Optional[int] = None, output_colname: str = 'score')¶
Bases:
object
Estimates treatment effect in two stages:
1. Using cross-fitting, train a marginal target estimator on the training set to get scores Y_hat and a propensity estimator to get scores T_hat, and calculate the residuals on the validation set.
2. Train a final estimator on the residuals: tau = E[(Y - Y_hat) / (T - T_hat) | X], using (T - T_hat)^2 as weights.
R-learner was proposed in Nie and Wager (2019) (https://arxiv.org/abs/1712.04912).
- fit(df_h2o_train: h2o.H2OFrame, df_h2o_val: Optional[h2o.H2OFrame] = None) None ¶
Trains the R-learner.
- Args:
df_h2o_train (h2o.H2OFrame): a dataframe containing the treatment indicators, the observed outcomes, and predictors
df_h2o_val (h2o.H2OFrame, optional): a dataframe containing the treatment indicators, the observed outcomes, and predictors
- predict(df_h2o: h2o.H2OFrame) h2o.H2OFrame ¶
Applies the R-learner and returns treatment effect predictions.
- Args:
df_h2o (h2o.H2OFrame): a dataframe containing predictors
- Returns:
predictions (h2o.H2OFrame): a single column containing treatment effect predictions
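The second R-learner stage, a regression of (Y - Y_hat) / (T - T_hat) with weights (T - T_hat)^2, can be illustrated with a heavily simplified sketch. It assumes no predictors, skips cross-fitting, and uses the sample means as Y_hat and T_hat, so the final estimator collapses to a weighted mean. None of this reflects the H2O implementation. Names and data are illustrative.

```python
# Toy 50-50 RCT sample as (treatment, outcome) pairs. All values are made up.
rows = [(1, 1), (1, 1), (1, 0), (1, 1), (0, 1), (0, 0), (0, 0), (0, 1)]
n = len(rows)

# Simplified nuisance estimates: constants instead of cross-fitted models.
t_hat = sum(t for t, _ in rows) / n
y_hat = sum(y for _, y in rows) / n

# Residuals of treatment and outcome.
t_res = [t - t_hat for t, _ in rows]
y_res = [y - y_hat for _, y in rows]

# Pseudo-outcomes and weights for the final stage.
# Note: this division requires T - T_hat to be non-zero for every row.
pseudo_outcomes = [yr / tr for yr, tr in zip(y_res, t_res)]
weights = [tr ** 2 for tr in t_res]

# A constant final model is the weighted mean of the pseudo-outcomes.
tau = sum(w * z for w, z in zip(weights, pseudo_outcomes)) / sum(weights)

print(tau)
```

In this degenerate setting the result equals the difference in response rates between the arms (0.75 versus 0.5), which gives a quick sanity check on the weighting scheme.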
- class upliftml.models.h2o.RetrospectiveEstimator(base_model_class: Any, base_model_params: Dict, predictor_colnames: List[str], treatment_colname: str = 'treatment', target_colname: str = 'outcome', positive_outcome_value: Union[str, int] = 1, output_colname: str = 'score')¶
Bases:
object
Estimates E[T | Y=1, X], which corresponds to estimating the relative treatment effect E[Y | T=1, X] / E[Y | T=0, X] in case of 50-50 treatment vs. control RCT data.
This estimator can also be used as the greedy solution for maximizing incrementality under ROI constraints, as described in Goldenberg et al. (2020) (preprint: https://drive.google.com/file/d/1E0KQ_sT09q1bpnlt9gZFFSbrx-YgGcqF/view).
- fit(df_h2o_train: h2o.H2OFrame, df_h2o_val: Optional[h2o.H2OFrame] = None) None ¶
Trains the Retrospective Estimator E[T | Y=1, X].
- Args:
df_h2o_train (h2o.H2OFrame): a dataframe containing the treatment indicators, the observed outcomes, and predictors
df_h2o_val (h2o.H2OFrame, optional): a dataframe containing the treatment indicators, the observed outcomes, and predictors
- predict(df_h2o: h2o.H2OFrame) h2o.H2OFrame ¶
Applies the Retrospective Estimator model and returns predictions for E[T | Y=1, X].
- Args:
df_h2o (h2o.H2OFrame): a dataframe containing predictors
- Returns:
predictions (h2o.H2OFrame): a single column containing predictions for E[T | Y=1, X]
- class upliftml.models.h2o.SLearnerEstimator(base_model_class: Any, base_model_params: Dict, predictor_colnames: List[str], treatment_colname: str = 'treatment', target_colname: str = 'outcome', treatment_value: int = 1, control_value: int = 0, categorical_outcome: bool = False, output_colname: str = 'score')¶
Bases:
object
Estimates treatment effect by training a single model for E[Y | T, X], applying the model with T=1 and T=0 and using the difference in these estimates as the estimated treatment effect.
The name S-learner originates from Künzel et al. (2019) (https://arxiv.org/pdf/1706.03461.pdf).
- fit(df_h2o_train: h2o.H2OFrame, df_h2o_val: Optional[h2o.H2OFrame] = None) None ¶
Trains the S-learner.
- Args:
df_h2o_train (h2o.H2OFrame): a dataframe containing the treatment indicators, the observed outcomes, and predictors
df_h2o_val (h2o.H2OFrame, optional): a dataframe containing the treatment indicators, the observed outcomes, and predictors
- predict(df_h2o: h2o.H2OFrame) h2o.H2OFrame ¶
Applies the S-learner and returns treatment effect predictions.
- Args:
df_h2o (h2o.H2OFrame): a dataframe containing predictors
- Returns:
predictions (h2o.H2OFrame): a single column containing treatment effect predictions
- class upliftml.models.h2o.TLearnerEstimator(base_model_class: Any, base_model_params: Dict, predictor_colnames: List[str], treatment_colname: str = 'treatment', target_colname: str = 'outcome', treatment_value: int = 1, control_value: int = 0, categorical_outcome: bool = False, output_colname: str = 'score')¶
Bases:
object
Estimates treatment effect as the difference in estimates from two separate models: E[Y | T=1, X] - E[Y | T=0, X].
The two-model approach is widely used for treatment effect estimation. The name T-learner originates from Künzel et al. (2019) (https://arxiv.org/pdf/1706.03461.pdf).
- fit(df_h2o_train: h2o.H2OFrame, df_h2o_val: Optional[h2o.H2OFrame] = None) None ¶
Trains the T-learner.
- Args:
df_h2o_train (h2o.H2OFrame): a dataframe containing the treatment indicators, the observed outcomes, and predictors
df_h2o_val (h2o.H2OFrame, optional): a dataframe containing the treatment indicators, the observed outcomes, and predictors
- predict(df_h2o: h2o.H2OFrame) h2o.H2OFrame ¶
Applies the T-learner and returns treatment effect predictions.
- Args:
df_h2o (h2o.H2OFrame): a dataframe containing predictors
- Returns:
predictions (h2o.H2OFrame): a single column containing treatment effect predictions
- class upliftml.models.h2o.TransformedOutcomeEstimator(base_model_class: Any, base_model_params: Dict, predictor_colnames: List[str], propensity_model_class: Optional[Any] = None, propensity_model_params: Optional[Dict] = None, treatment_colname: str = 'treatment', target_colname: str = 'outcome', treatment_value: int = 1, control_value: int = 0, output_colname: str = 'score')¶
Bases:
object
Estimates treatment effect by transforming the outcome, such that the expectation of the transformed outcome corresponds to the treatment effect. This transformation results in a regression problem and is, thus, slightly different from the CVTEstimator, which results in a classification problem.
The Transformed Outcome technique was proposed in Athey and Imbens (2015) (https://pdfs.semanticscholar.org/86ce/004214845a1683d59b64c4363a067d342cac.pdf).
- fit(df_h2o_train: h2o.H2OFrame, df_h2o_val: Optional[h2o.H2OFrame] = None) None ¶
Trains the Transformed Outcome model by first fitting a propensity model, retrieving the propensity scores for each instance, computing the transformed outcomes, and finally fitting a regressor on the transformed outcomes.
- Args:
df_h2o_train (h2o.H2OFrame): a dataframe containing the treatment indicators, the observed outcomes, and predictors
df_h2o_val (h2o.H2OFrame, optional): a dataframe containing the treatment indicators, the observed outcomes, and predictors
- predict(df_h2o: h2o.H2OFrame) h2o.H2OFrame ¶
Applies the Transformed Outcome Estimator and returns treatment effect predictions.
- Args:
df_h2o (h2o.H2OFrame): a dataframe containing predictors
- Returns:
predictions (h2o.H2OFrame): a single column containing treatment effect predictions
- class upliftml.models.h2o.XLearnerEstimator(base_model_class_1: Any, base_model_params_1: Dict, predictor_colnames_1: List[str], base_model_class_2: Any, base_model_params_2: Dict, predictor_colnames_2: List[str], treatment_colname: str = 'treatment', target_colname: str = 'outcome', treatment_value: int = 1, control_value: int = 0, categorical_outcome: bool = False, output_colname: str = 'score')¶
Bases:
object
Estimates treatment effect in three stages:
1. Train a T-learner to get scores Y_hat_1 and Y_hat_0.
2. Train regression models to predict the residuals: tau1 = E[Y(1) - Y_hat_1 | X] and tau0 = E[Y_hat_0 - Y(0) | X].
3. Estimate the treatment effect as a weighted average: tau(X) = p(X) * tau0(X) + (1 - p(X)) * tau1(X).
Our implementation sets p(X) = 0.5 for all X.
X-learner was proposed in Künzel et al. (2019) (https://arxiv.org/pdf/1706.03461.pdf).
- fit(df_h2o_train: h2o.H2OFrame, df_h2o_val: Optional[h2o.H2OFrame] = None) None ¶
Trains the X-learner.
- Args:
df_h2o_train (h2o.H2OFrame): a dataframe containing the treatment indicators, the observed outcomes, and predictors
df_h2o_val (h2o.H2OFrame, optional): a dataframe containing the treatment indicators, the observed outcomes, and predictors
- predict(df_h2o: h2o.H2OFrame) h2o.H2OFrame ¶
Applies the X-learner and returns treatment effect predictions.
- Args:
df_h2o (h2o.H2OFrame): a dataframe containing predictors
- Returns:
predictions (h2o.H2OFrame): a single column containing treatment effect predictions
Evaluation and Plotting Functions
- upliftml.evaluation.compute_auuc(df_qini: pandas.DataFrame) float ¶
Computes the Area Under the Uplift Curve.
- Args:
df_qini (pandas.DataFrame): a dataframe containing the Qini estimates
- Returns:
A scalar representing the AUUC score
- upliftml.evaluation.compute_qini_coefficient(df_qini: pandas.DataFrame) float ¶
Computes the Qini coefficient.
- Args:
df_qini (pandas.DataFrame): a dataframe containing the Qini estimates
- Returns:
A scalar representing the Qini coefficient
- upliftml.evaluation.estimate_and_plot_cate_lift(df: pyspark.sql.DataFrame, n_buckets: int = 30, score_colname: str = 'cate_outcome', treatment_colname: str = 'treatment', target_colname: str = 'outcome', treatment_value: Union[str, int] = 1, control_value: Union[str, int] = 0, label: Optional[str] = None, bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None, ax: Optional[Any] = None) Tuple ¶
Divides the data into buckets based on model score quantiles, cumulatively estimates CATE lift (with or without confidence intervals), and plots the estimates as a line plot.
- Args:
df (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes, and real-valued model scores
n_buckets (int, optional): the number of quantiles to generate from the column <score_colname>
score_colname (str, optional): the column name in df that contains the model scores
treatment_colname (str, optional): the column name in df that contains the treatment indicators
target_colname (str, optional): the column name in df that contains the target
treatment_value (str or int, optional): the value in column <treatment_colname> that refers to the treatment group
control_value (str or int, optional): the value in column <treatment_colname> that refers to the control group
label (str, optional): name of the score estimation method to be shown on the legend. Defaults to None.
bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals
n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True
ci_quantiles (list of float, optional): the lower and upper confidence bounds. Only has an effect if bootstrap=True
ax (matplotlib.axes._subplots.AxesSubplot, optional): if specified, the plot will be plotted on this ax. Useful when creating a figure with subplots.
- Returns:
A tuple, containing:
(pandas.DataFrame): a dataframe containing the CATE lift estimates (with or without confidence intervals), cumulative population sizes and fractions
(matplotlib.axes._subplots.AxesSubplot): the axis of the plot
- upliftml.evaluation.estimate_and_plot_cate_per_bucket(df: pyspark.sql.DataFrame, bucket_colname: str = 'bucket', target_colname: str = 'outcome', treatment_colname: str = 'treatment', treatment_value: Union[str, int] = 1, control_value: Union[str, int] = 0, bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None, sort_x: bool = True, ax: Optional[Any] = None) Tuple ¶
Estimates conditional average treatment effects per bucket in a Spark DataFrame and plots the estimates.
- Args:
df (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes, and real-valued model scores
bucket_colname (str, optional): column name in df that contains the bucket assignments
target_colname (str, optional): the column name in df that contains the target
treatment_colname (str, optional): the column name in df that contains the treatment indicators
treatment_value (str or int, optional): the value in column <treatment_colname> that refers to the treatment group
control_value (str or int, optional): the value in column <treatment_colname> that refers to the control group
bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals
n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True
ci_quantiles (list of float, optional): the lower and upper confidence bounds. Only has an effect if bootstrap=True
sort_x (bool, optional): if True, x-axis will be sorted from highest metric value to lowest
ax (matplotlib.axes._subplots.AxesSubplot, optional): if specified, the plot will be plotted on this ax. Useful when creating a figure with subplots.
- Returns:
A tuple, containing:
(pandas.DataFrame): a dataframe containing CATE estimates (with or without confidence intervals), population sizes and fractions within each bucket
(matplotlib.axes._subplots.AxesSubplot): the axis of the plot
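As a rough illustration of a per-bucket CATE estimate, the sketch below assumes the estimate is the difference in mean outcomes between the treatment and control groups within each bucket, which is the usual estimator on RCT data. The source does not state the exact formula used by estimate_and_plot_cate_per_bucket, and the function name cate_per_bucket, the plain-Python data layout, and the values are all illustrative.

```python
from collections import defaultdict

# Toy rows as (bucket, treatment, outcome). All values are made up.
rows = [
    ("low", 1, 0), ("low", 1, 1), ("low", 0, 0), ("low", 0, 1),
    ("high", 1, 1), ("high", 1, 1), ("high", 0, 0), ("high", 0, 1),
]


def cate_per_bucket(rows, treatment_value=1, control_value=0):
    """Difference in mean outcomes (treated minus control) per bucket."""
    outcomes = defaultdict(lambda: {"treated": [], "control": []})
    for bucket, treatment, outcome in rows:
        if treatment == treatment_value:
            outcomes[bucket]["treated"].append(outcome)
        elif treatment == control_value:
            outcomes[bucket]["control"].append(outcome)
    return {
        bucket: sum(groups["treated"]) / len(groups["treated"])
        - sum(groups["control"]) / len(groups["control"])
        for bucket, groups in outcomes.items()
    }


estimates = cate_per_bucket(rows)
print(estimates)
```

The sketch assumes every bucket contains at least one treated and one control row; an empty group would raise a ZeroDivisionError.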
- upliftml.evaluation.estimate_and_plot_cate_per_quantile(df: pyspark.sql.DataFrame, n_buckets: int = 30, score_colname: str = 'cate_outcome', target_colname: str = 'outcome', treatment_colname: str = 'treatment', treatment_value: Union[str, int] = 1, control_value: Union[str, int] = 0, add_labels: bool = False, bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None, sort_x: bool = True, ax: Optional[Any] = None) Tuple ¶
Divides the data into buckets based on model score quantiles, estimates CATE values per quantile, and plots them.
- Args:
df (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes, and real-valued model scores
n_buckets (int, optional): the number of quantiles to generate from the column <score_colname>
score_colname (str, optional): the column name in df that contains the model scores
target_colname (str, optional): the column name in df that contains the target
treatment_colname (str, optional): the column name in df that contains the treatment indicators
treatment_value (str or int, optional): the value in column <treatment_colname> that refers to the treatment group
control_value (str or int, optional): the value in column <treatment_colname> that refers to the control group
add_labels (bool, optional): indicates whether the bucket labels are added in the form [start, end). Defaults to False, meaning that only the ids of the buckets are returned.
bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals
n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True
ci_quantiles (list of float, optional): the lower and upper confidence bounds. Only has an effect if bootstrap=True
sort_x (bool, optional): if True, x-axis will be sorted from highest metric value to lowest
ax (matplotlib.axes._subplots.AxesSubplot, optional): if specified, the plot will be plotted on this ax. Useful when creating a figure with subplots.
- Returns:
A tuple, containing:
(pandas.DataFrame): a dataframe containing CATE estimates (with or without confidence intervals), population sizes and fractions within each bucket
(matplotlib.axes._subplots.AxesSubplot): the axis of the plot
- upliftml.evaluation.estimate_and_plot_cum_iroi(df: pyspark.sql.DataFrame, n_buckets: int = 30, score_colname: str = 'score', benefit_colname: str = 'revenue', cost_colname: str = 'cost', treatment_colname: str = 'treatment', treatment_value: Union[int, str] = 1, control_value: Union[int, str] = 0, label: Optional[str] = None, plot_overall: bool = True, bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None, ax: Optional[Any] = None) Tuple ¶
Divides the data into buckets based on model score quantiles, estimates cumulative iROI (with or without confidence intervals), and plots the estimates as a line plot.
- Args:
df (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes, and real-valued model scores
n_buckets (int, optional): the number of quantiles to generate from the column <score_colname>
score_colname (str, optional): the column name in df that contains the model scores
benefit_colname (str, optional): the column name in df that contains the benefit
cost_colname (str, optional): the column name in df that contains the cost
treatment_colname (str, optional): the column name in df that contains the treatment indicators
treatment_value (str or int, optional): the value in column <treatment_colname> that refers to the treatment group
control_value (str or int, optional): the value in column <treatment_colname> that refers to the control group
label (str, optional): name of the score estimation method to be shown on the legend. Defaults to None.
plot_overall (bool, optional): indicates whether the overall iROI line should be plotted. Defaults to True.
bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals
n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True
ci_quantiles (list of float, optional): the lower and upper confidence bounds. Only has an effect if bootstrap=True
ax (matplotlib.axes._subplots.AxesSubplot, optional): if specified, the plot will be plotted on this ax. Useful when creating a figure with subplots.
- Returns:
A tuple, containing:
(pandas.DataFrame): a dataframe containing cumulative iROI estimates (with or without confidence intervals), cumulative population sizes and fractions
(matplotlib.axes._subplots.AxesSubplot): the axis of the plot
- upliftml.evaluation.estimate_and_plot_iroi_per_bucket(df: pyspark.sql.DataFrame, bucket_colname: str = 'bucket', benefit_colname: str = 'revenue', cost_colname: str = 'cost', treatment_colname: str = 'treatment', treatment_value: Union[str, int] = 1, control_value: Union[str, int] = 0, bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None, sort_x: bool = True, ax: Optional[Any] = None) Tuple ¶
Estimates incremental ROI per bucket in a Spark DataFrame and plots the estimates.
- Args:
df (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, benefit and cost, and real-valued model scores
bucket_colname (str, optional): column name in df that contains the bucket assignments
benefit_colname (str, optional): the column name in df that contains the benefit
cost_colname (str, optional): the column name in df that contains the cost
treatment_colname (str, optional): the column name in df that contains the treatment indicators
treatment_value (str or int, optional): the value in column <treatment_colname> that refers to the treatment group
control_value (str or int, optional): the value in column <treatment_colname> that refers to the control group
bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals
n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True
ci_quantiles (list of float, optional): the lower and upper confidence bounds. Only has an effect if bootstrap=True
sort_x (bool, optional): if True, x-axis will be sorted from highest metric value to lowest
ax (matplotlib.axes._subplots.AxesSubplot, optional): if specified, the plot will be plotted on this ax. Useful when creating a figure with subplots.
- Returns:
A tuple, containing:
(pandas.DataFrame): a dataframe containing iROI estimates (with or without confidence intervals), population sizes and fractions within each bucket
(matplotlib.axes._subplots.AxesSubplot): the axis of the plot
- upliftml.evaluation.estimate_and_plot_iroi_per_quantile(df: pyspark.sql.DataFrame, n_buckets: int = 30, score_colname: str = 'score', benefit_colname: str = 'revenue', cost_colname: str = 'cost', treatment_colname: str = 'treatment', treatment_value: Union[str, int] = 1, control_value: Union[str, int] = 0, add_labels: bool = False, bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None, sort_x: bool = True, ax: Optional[Any] = None) Tuple ¶
- Divides the data into buckets based on model score quantiles, estimates iROI per quantile,
and plots the estimates.
- Args:
- df (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, benefit and cost,
and real-valued model scores
n_buckets (int, optional): the number of quantiles to generate from the column <score_colname> score_colname (str, optional): the column name in df that contains the model scores benefit_colname (str, optional): the column name in df that contains the benefit cost_colname (str, optional): the column name in df that contains the cost treatment_colname (str, optional): the column name in df that contains the treatment indicators treatment_value (str or int, optional): the value in column <treatment_colname>
that refers to the treatment group
control_value (str or int, optional): the value in column <treatment_colname> that refers to the control group add_labels (bool, optional): indicates whether the bucket labels are added in the form [start, end).
Defaults to False, meaning that only the ids of the buckets are returned.
bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True ci_quantiles (list of float, optional): the lower and upper confidence bounds.
Only has an effect if bootstrap=True
sort_x (bool, optional): if True, x-axis will be sorted from highest metric value to lowest ax (matplotlib.axes._subplots.AxesSubplot, optional): if specified, the plot will be plotted on this ax. Useful when creating a figure with subplots.
- Returns:
A tuple, containing: (pandas.DataFrame): a dataframe containing iROI estimates (with or without confidence intervals),
population sizes and fractions within each bucket
(matplotlib.axes._subplots.AxesSubplot): the axis of the plot
- upliftml.evaluation.estimate_and_plot_qini(df: pyspark.sql.DataFrame, n_buckets: int = 30, score_colname: str = 'cate_outcome', treatment_colname: str = 'treatment', target_colname: str = 'outcome', treatment_value: Union[str, int] = 1, control_value: Union[str, int] = 0, label: str = 'Qini coefficient', plot_random: bool = True, bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None, ax: Optional[Any] = None) Tuple ¶
- Divides the data into buckets based on model score quantiles, estimates the Qini values
(with or without confidence intervals), and plots them as a lineplot.
- Args:
- df (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes,
and real-valued model scores
n_buckets (int, optional): the number of quantiles to generate from the column <score_colname> score_colname (str, optional): the column name in df that contains the model scores treatment_colname (str, optional): the column name in df that contains the treatment indicators target_colname (str, optional): the column name in df that contains the target treatment_value (str or int, optional): the value in column <treatment_colname> that
refers to the treatment group
control_value (str or int, optional): the value in column <treatment_colname> that refers to the control group label (str, optional): name of the score estimation method to be shown on the legend. Defaults to None. plot_random (bool, optional): Indicator whether the random targeting line should be plotted. Defaults to True. bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True ci_quantiles (list of float, optional): the lower and upper confidence bounds.
Only has an effect if bootstrap=True
ax (matplotlib.axes._subplots.AxesSubplot, optional): if specified, the plot will be plotted on this ax. Useful when creating a figure with subplots.
- Returns:
A tuple, containing: (pandas.DataFrame): a dataframe containing Qini estimates (with or without confidence intervals),
cumulative population sizes and fractions
(matplotlib.axes._subplots.AxesSubplot): the axis of the plot
- upliftml.evaluation.estimate_and_plot_target_rate_per_bucket(df: pyspark.sql.DataFrame, bucket_colname: str = 'bucket', target_colname: str = 'outcome', bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None, sort_x: bool = True, ax: Optional[Any] = None) Tuple ¶
Estimates target rates per bucket in a Spark DataFrame and plots the estimates.
- Args:
- df (pyspark.sql.DataFrame): a dataframe containing the observed outcomes
and the bucket assignments
bucket_colname (str, optional): column name in df that contains the bucket assignments
target_colname (str, optional): the column name in df that contains the target
bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True ci_quantiles (list of float, optional): the lower and upper confidence bounds.
Only has an effect if bootstrap=True
sort_x (bool, optional): if True, x-axis will be sorted from highest metric value to lowest ax (matplotlib.axes._subplots.AxesSubplot, optional): if specified, the plot will be plotted on this ax. Useful when creating a figure with subplots.
- Returns:
A tuple, containing: (pandas.DataFrame): a dataframe containing target rate estimates (with or without confidence intervals),
population sizes and fractions within each bucket
(matplotlib.axes._subplots.AxesSubplot): the axis of the plot
- upliftml.evaluation.estimate_and_plot_target_rate_per_quantile(df: pyspark.sql.DataFrame, n_buckets: int = 30, score_colname: str = 'cate_outcome', target_colname: str = 'outcome', add_labels: bool = False, bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None, sort_x: bool = True, ax: Optional[Any] = None) Tuple ¶
- Divides the data into buckets based on model score quantiles, estimates target rates per quantile,
and plots them.
- Args:
- df (pyspark.sql.DataFrame): a dataframe containing the observed outcomes
and real-valued model scores
n_buckets (int, optional): the number of quantiles to generate from the column <score_colname> score_colname (str, optional): the column name in df that contains the model scores target_colname (str, optional): the column name in df that contains the target add_labels (bool, optional): indicates whether the bucket labels are added in the form [start, end).
Defaults to False, meaning that only the ids of the buckets are returned.
bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True ci_quantiles (list of float, optional): the lower and upper confidence bounds.
Only has an effect if bootstrap=True
sort_x (bool, optional): if True, x-axis will be sorted from highest metric value to lowest ax (matplotlib.axes._subplots.AxesSubplot, optional): if specified, the plot will be plotted on this ax. Useful when creating a figure with subplots.
- Returns:
A tuple, containing: (pandas.DataFrame): a dataframe containing the target rate estimates (with or without confidence intervals),
population sizes and fractions within each bucket
(matplotlib.axes._subplots.AxesSubplot): the axis of the plot
- upliftml.evaluation.estimate_ate(df: pyspark.sql.DataFrame, target_colname: str = 'outcome', treatment_colname: str = 'treatment', treatment_value: Union[str, int] = 1, control_value: Union[str, int] = 0, bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None) Dict ¶
Estimates the average treatment effect in a Spark DataFrame.
- Args:
df (pyspark.sql.DataFrame): a dataframe containing the treatment indicators and observed outcomes target_colname (str, optional): the column name in df that contains the target treatment_colname (str, optional): the column name in df that contains the treatment indicators treatment_value (str or int, optional): the value in column <treatment_colname>
that refers to the treatment group
control_value (str or int, optional): the value in column <treatment_colname> that refers to the control group bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True ci_quantiles (list of float, optional): the lower and upper confidence bounds.
Only has an effect if bootstrap=True
- Returns:
- Dict with estimates of the target rate in the control group, the target rate in the treatment group, and the ATE,
all with or without lower and upper bounds depending on whether bootstrapping is performed.
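The following is a minimal, library-independent sketch of what an average treatment effect estimate with bootstrapped confidence bounds can look like. It assumes the ATE is the difference in mean outcome between the treatment and control groups and that the bounds are percentile bounds over resampled estimates. The helper names `ate` and `bootstrap_ci` are hypothetical and are not part of upliftml; the library's Spark implementation may differ.

```python
import random
from statistics import mean


def ate(rows):
    """Difference in mean outcome between treated (1) and control (0) rows.

    rows: list of (treatment, outcome) pairs.
    """
    treated = [y for t, y in rows if t == 1]
    control = [y for t, y in rows if t == 0]
    return mean(treated) - mean(control)


def bootstrap_ci(rows, stat, n_bootstraps=100, ci_quantiles=(0.025, 0.975), seed=0):
    """Percentile bootstrap: resample rows with replacement and re-estimate.

    Assumes each resample contains both groups, which holds with very high
    probability for reasonably sized, balanced data.
    """
    rng = random.Random(seed)
    estimates = sorted(
        stat(rng.choices(rows, k=len(rows))) for _ in range(n_bootstraps)
    )
    lower = estimates[int(ci_quantiles[0] * (n_bootstraps - 1))]
    upper = estimates[int(ci_quantiles[1] * (n_bootstraps - 1))]
    return lower, upper


# Toy data: 30% target rate in treatment, 20% in control.
ate_rows = [(1, 1)] * 30 + [(1, 0)] * 70 + [(0, 1)] * 20 + [(0, 0)] * 80
point_estimate = ate(ate_rows)  # approximately 0.1
lower_bound, upper_bound = bootstrap_ci(ate_rows, ate)
```

The same resampling idea applies to every function in this module that accepts bootstrap, n_bootstraps, and ci_quantiles.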
- upliftml.evaluation.estimate_cate_lift(df: pyspark.sql.DataFrame, n_buckets: int = 30, score_colname: str = 'cate_outcome', target_colname: str = 'outcome', treatment_colname: str = 'treatment', treatment_value: Union[str, int] = 1, control_value: Union[str, int] = 0, bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None) pandas.DataFrame ¶
- Divides the data into buckets based on model score quantiles and cumulatively estimates CATE lift
(with or without confidence intervals).
- Args:
- df (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes,
and real-valued model scores
n_buckets (int, optional): the number of quantiles to generate from the column <score_colname> score_colname (str, optional): the column name in df that contains the model scores target_colname (str, optional): the column name in df that contains the target treatment_colname (str, optional): the column name in df that contains the treatment indicators treatment_value (str or int, optional): the value in column <treatment_colname> that refers to the
treatment group
control_value (str or int, optional): the value in column <treatment_colname> that refers to the control group bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True ci_quantiles (list of float, optional): the lower and upper confidence bounds.
Only has an effect if bootstrap=True
- Returns:
- (pandas.DataFrame): a dataframe containing the CATE lift estimates (with or without confidence intervals),
cumulative population sizes and fractions
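As an illustration of the cumulative estimate described above, here is a small sketch in plain Python. It assumes the instances are ranked from highest to lowest score and that the lift at each step is the treatment-minus-control difference in mean outcome among all instances targeted so far. The function name `cumulative_cate_lift` is hypothetical, and the quantile handling in the library may differ.

```python
from statistics import mean


def cumulative_cate_lift(rows, n_buckets):
    """Return [(fraction, cum_cate), ...] for the top 1..n_buckets buckets.

    rows: list of (score, treatment, outcome) tuples.
    """
    ranked = sorted(rows, key=lambda r: r[0], reverse=True)
    n = len(ranked)
    curve = []
    for b in range(1, n_buckets + 1):
        top = ranked[: n * b // n_buckets]  # everyone targeted so far
        treated = [y for _, t, y in top if t == 1]
        control = [y for _, t, y in top if t == 0]
        curve.append((len(top) / n, mean(treated) - mean(control)))
    return curve


# Toy data: the treatment only helps the high-score half.
lift_low = [(0.1, 1, y) for y in (1, 0, 0, 0)] + [(0.1, 0, y) for y in (1, 0, 0, 0)]
lift_high = [(0.9, 1, y) for y in (1, 1, 1, 0)] + [(0.9, 0, y) for y in (1, 0, 0, 0)]
lift_curve = cumulative_cate_lift(lift_low + lift_high, n_buckets=2)
print(lift_curve)  # [(0.5, 0.5), (1.0, 0.25)]
```

The last point of the curve always equals the overall ATE, since by then the whole population is included.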
- upliftml.evaluation.estimate_cate_per_bucket(df: pyspark.sql.DataFrame, bucket_colname: str = 'bucket', target_colname: str = 'outcome', treatment_colname: str = 'treatment', treatment_value: Union[str, int] = 1, control_value: Union[str, int] = 0, bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None) pandas.DataFrame ¶
Estimates the conditional average treatment effects per bucket in a Spark DataFrame.
- Args:
df (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes, and the bucket assignments bucket_colname (str, optional): column name in df that contains the bucket assignments target_colname (str, optional): the column name in df that contains the target treatment_colname (str, optional): the column name in df that contains the treatment indicators treatment_value (str or int, optional): the value in column <treatment_colname> that refers to the
treatment group
control_value (str or int, optional): the value in column <treatment_colname> that refers to the control group bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True ci_quantiles (list of float, optional): the lower and upper confidence bounds.
Only has an effect if bootstrap=True
- Returns:
- (pandas.DataFrame): a dataframe containing CATE estimates (with or without confidence intervals),
population sizes and fractions within each bucket
- upliftml.evaluation.estimate_cate_per_quantile(df: pyspark.sql.DataFrame, n_buckets: int = 30, score_colname: str = 'cate_outcome', target_colname: str = 'outcome', treatment_colname: str = 'treatment', treatment_value: Union[str, int] = 1, control_value: Union[str, int] = 0, add_labels: bool = False, bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None) pandas.DataFrame ¶
- Divides the data into buckets based on model score quantiles and estimates average treatment
effects per bucket.
- Args:
- df (pyspark.sql.DataFrame): a dataframe containing the treatment indicators,
the observed outcomes, and real-valued model scores
n_buckets (int, optional): the number of quantiles to generate from the column <score_colname> score_colname (str, optional): the column name in df that contains the model scores target_colname (str, optional): the column name in df that contains the target treatment_colname (str, optional): the column name in df that contains the treatment indicators treatment_value (str or int, optional): the value in column <treatment_colname>
that refers to the treatment group
control_value (str or int, optional): the value in column <treatment_colname> that refers to the control group add_labels (bool, optional): indicates whether the bucket labels are added in the form [start, end).
Defaults to False, meaning that only the ids of the buckets are returned.
bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True ci_quantiles (list of float, optional): the lower and upper confidence bounds.
Only has an effect if bootstrap=True
- Returns:
- (pandas.DataFrame): a dataframe containing CATE estimates (with or without confidence intervals),
population sizes and fractions within each bucket
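The per-quantile estimate can be sketched without Spark as follows. This is only an illustration under stated assumptions: buckets are assigned by score rank (bucket 0 holds the lowest scores), and the CATE in a bucket is the difference in mean outcome between its treated and control instances. `assign_buckets` and `cate_per_bucket` are hypothetical helpers, not upliftml functions.

```python
from statistics import mean


def assign_buckets(scores, n_buckets):
    """Rank-based bucket ids in 0..n_buckets-1; bucket 0 has the lowest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    buckets = [0] * len(scores)
    for rank, i in enumerate(order):
        buckets[i] = rank * n_buckets // len(scores)
    return buckets


def cate_per_bucket(rows, n_buckets):
    """rows: list of (score, treatment, outcome). Returns {bucket: cate}."""
    buckets = assign_buckets([r[0] for r in rows], n_buckets)
    result = {}
    for b in range(n_buckets):
        treated = [r[2] for r, bb in zip(rows, buckets) if bb == b and r[1] == 1]
        control = [r[2] for r, bb in zip(rows, buckets) if bb == b and r[1] == 0]
        result[b] = mean(treated) - mean(control)
    return result


# Toy data: no effect in the low-score half, +0.5 in the high-score half.
cate_low = [(0.1, 1, y) for y in (1, 0, 0, 0)] + [(0.1, 0, y) for y in (1, 0, 0, 0)]
cate_high = [(0.9, 1, y) for y in (1, 1, 1, 0)] + [(0.9, 0, y) for y in (1, 0, 0, 0)]
cate_by_bucket = cate_per_bucket(cate_low + cate_high, n_buckets=2)
print(cate_by_bucket)  # {0: 0.0, 1: 0.5}
```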
- upliftml.evaluation.estimate_cum_iroi(df: pyspark.sql.DataFrame, n_buckets: int = 30, score_colname: str = 'cate_outcome', benefit_colname: str = 'revenue', cost_colname: str = 'cost', treatment_colname: str = 'treatment', treatment_value: Union[str, int] = 1, control_value: Union[str, int] = 0, bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None) pandas.DataFrame ¶
- Divides the data into buckets based on model score quantiles and estimates cumulative iROI
(with or without confidence intervals).
- Args:
- df (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, benefit and cost,
and real-valued model scores
n_buckets (int, optional): the number of quantiles to generate from the column <score_colname> score_colname (str, optional): the column name in df that contains the model scores benefit_colname (str, optional): the column name in df that contains the benefit cost_colname (str, optional): the column name in df that contains the cost treatment_colname (str, optional): the column name in df that contains the treatment indicators treatment_value (str or int, optional): the value in column <treatment_colname> that
refers to the treatment group
control_value (str or int, optional): the value in column <treatment_colname> that refers to the control group bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True ci_quantiles (list of float, optional): the lower and upper confidence bounds.
Only has an effect if bootstrap=True
- Returns:
- (pandas.DataFrame): a dataframe containing cumulative iROI estimates (with or without confidence intervals),
cumulative population sizes and fractions
- upliftml.evaluation.estimate_iroi(df: pyspark.sql.DataFrame, benefit_colname: str = 'revenue', cost_colname: str = 'cost', treatment_colname: str = 'treatment', treatment_value: Union[str, int] = 1, control_value: Union[str, int] = 0, bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None) Dict ¶
Estimates the incremental return on investment in a Spark DataFrame.
- Args:
- df (pyspark.sql.DataFrame): a dataframe containing the treatment indicators,
the cost and the benefit for each instance
benefit_colname (str, optional): the column name in df that contains the benefit cost_colname (str, optional): the column name in df that contains the cost treatment_colname (str, optional): the column name in df that contains the treatment indicators treatment_value (str or int, optional): the value in column <treatment_colname>
that refers to the treatment group
control_value (str or int, optional): the value in column <treatment_colname> that refers to the control group bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True ci_quantiles (list of float, optional): the lower and upper confidence bounds.
Only has an effect if bootstrap=True
- Returns:
Dict of estimates of the iROI, incremental benefit, and incremental cost, all with or without lower and upper bounds depending on whether bootstrapping is performed.
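A rough sketch of an incremental ROI computation follows. It assumes iROI is the incremental benefit divided by the incremental cost, with each increment taken as the treatment-group mean minus the control-group mean. The function name `iroi` and the dictionary keys are illustrative assumptions, not the keys returned by upliftml.

```python
from statistics import mean


def iroi(rows):
    """rows: list of (treatment, benefit, cost) tuples with treatment in {0, 1}."""
    inc_benefit = (
        mean(b for t, b, _ in rows if t == 1) - mean(b for t, b, _ in rows if t == 0)
    )
    inc_cost = (
        mean(c for t, _, c in rows if t == 1) - mean(c for t, _, c in rows if t == 0)
    )
    return {
        "iroi": inc_benefit / inc_cost,  # undefined when inc_cost is zero
        "incremental_benefit": inc_benefit,
        "incremental_cost": inc_cost,
    }


# Toy data: treatment adds 2.0 benefit and 1.0 cost per instance on average.
iroi_rows = [(1, 14.0, 2.0), (1, 10.0, 2.0), (0, 11.0, 1.0), (0, 9.0, 1.0)]
iroi_estimate = iroi(iroi_rows)
print(iroi_estimate["iroi"])  # 2.0
```

An iROI above 1 in this toy setting means each extra unit of cost spent on the treatment brings back more than one unit of benefit.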
- upliftml.evaluation.estimate_iroi_per_bucket(df: pyspark.sql.DataFrame, bucket_colname: str = 'bucket', benefit_colname: str = 'revenue', cost_colname: str = 'cost', treatment_colname: str = 'treatment', treatment_value: Union[int, str] = 1, control_value: Union[int, str] = 0, bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None) pandas.DataFrame ¶
Estimates incremental ROI per bucket in a Spark DataFrame.
- Args:
- df (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, benefit and cost,
and the bucket assignments
bucket_colname (str, optional): column name in df that contains the bucket assignments benefit_colname (str, optional): the column name in df that contains the benefit cost_colname (str, optional): the column name in df that contains the cost treatment_colname (str, optional): the column name in df that contains the treatment indicators treatment_value (str or int, optional): the value in column <treatment_colname> that refers to the
treatment group
control_value (str or int, optional): the value in column <treatment_colname> that refers to the control group bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True ci_quantiles (list of float, optional): the lower and upper confidence bounds.
Only has an effect if bootstrap=True
- Returns:
- (pandas.DataFrame): a dataframe containing iROI estimates (with or without confidence intervals),
population sizes and fractions within each bucket
- upliftml.evaluation.estimate_iroi_per_quantile(df: pyspark.sql.DataFrame, n_buckets: int = 30, score_colname: str = 'score', benefit_colname: str = 'revenue', cost_colname: str = 'cost', treatment_colname: str = 'treatment', treatment_value: Union[str, int] = 1, control_value: Union[str, int] = 0, add_labels: bool = False, bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None) pandas.DataFrame ¶
Divides the data into buckets based on model score quantiles and estimates iROI per bucket.
- Args:
- df (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, benefit and cost,
and real-valued model scores
n_buckets (int, optional): the number of quantiles to generate from the column <score_colname> score_colname (str, optional): the column name in df that contains the model scores benefit_colname (str, optional): the column name in df that contains the benefit cost_colname (str, optional): the column name in df that contains the cost treatment_colname (str, optional): the column name in df that contains the treatment indicators treatment_value (str or int, optional): the value in column <treatment_colname> that
refers to the treatment group
control_value (str or int, optional): the value in column <treatment_colname> that refers to the control group add_labels (bool, optional): indicates whether the bucket labels are added in the form [start, end).
Defaults to False, meaning that only the ids of the buckets are returned.
bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True ci_quantiles (list of float, optional): the lower and upper confidence bounds.
Only has an effect if bootstrap=True
- Returns:
- (pandas.DataFrame): a dataframe containing iROI estimates (with or without confidence intervals),
population sizes and fractions within each bucket
- upliftml.evaluation.estimate_qini(df: pyspark.sql.DataFrame, n_buckets: int = 30, score_colname: str = 'cate_outcome', treatment_colname: str = 'treatment', target_colname: str = 'outcome', treatment_value: Union[str, int] = 1, control_value: Union[str, int] = 0, bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None) pandas.DataFrame ¶
- Divides the data into buckets based on model score quantiles and estimates Qini values
(with or without confidence intervals).
- Args:
- df (pyspark.sql.DataFrame): a dataframe containing the treatment indicators, the observed outcomes,
and real-valued model scores
n_buckets (int, optional): the number of quantiles to generate from the column <score_colname> score_colname (str, optional): the column name in df that contains the model scores treatment_colname (str, optional): the column name in df that contains the treatment indicators target_colname (str, optional): the column name in df that contains the target treatment_value (str or int, optional): the value in column <treatment_colname>
that refers to the treatment group
control_value (str or int, optional): the value in column <treatment_colname> that refers to the control group bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True ci_quantiles (list of float, optional): the lower and upper confidence bounds.
Only has an effect if bootstrap=True
- Returns:
- (pandas.DataFrame): a dataframe containing Qini estimates (with or without confidence intervals),
cumulative population sizes and fractions
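For intuition, the sketch below computes a Qini curve in plain Python using one common definition: among the top-ranked instances, the number of treated positives minus the number of control positives rescaled to the treated group size. This is an assumption for illustration; the normalization used by upliftml may differ, and `qini_curve` is a hypothetical name.

```python
def qini_curve(rows, n_buckets):
    """Return [(fraction, qini), ...], ranking rows from highest to lowest score.

    rows: list of (score, treatment, outcome) tuples with binary outcomes.
    """
    ranked = sorted(rows, key=lambda r: r[0], reverse=True)
    n = len(ranked)
    curve = []
    for b in range(1, n_buckets + 1):
        top = ranked[: n * b // n_buckets]
        n_t = sum(1 for _, t, _ in top if t == 1)
        n_c = sum(1 for _, t, _ in top if t == 0)
        y_t = sum(y for _, t, y in top if t == 1)
        y_c = sum(y for _, t, y in top if t == 0)
        # Rescale control positives to the treated group size before subtracting.
        qini = y_t - y_c * n_t / n_c if n_c else float(y_t)
        curve.append((len(top) / n, qini))
    return curve


# Toy data: all incremental positives sit in the high-score half.
qini_low = [(0.1, 1, y) for y in (1, 0, 0, 0)] + [(0.1, 0, y) for y in (1, 0, 0, 0)]
qini_high = [(0.9, 1, y) for y in (1, 1, 1, 0)] + [(0.9, 0, y) for y in (1, 0, 0, 0)]
qini_points = qini_curve(qini_low + qini_high, n_buckets=2)
print(qini_points)  # [(0.5, 2.0), (1.0, 2.0)]
```

Here the curve reaches its final value after targeting only half of the population, which is the pattern a well-ranked model produces; random targeting would reach 1.0 at the halfway point instead.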
- upliftml.evaluation.estimate_roi(df: pyspark.sql.DataFrame, benefit_colname: str = 'revenue', cost_colname: str = 'cost', bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None) Dict ¶
Estimates the return on investment in a Spark DataFrame.
- Args:
df (pyspark.sql.DataFrame): a dataframe containing the cost and the benefit for each instance benefit_colname (str, optional): the column name in df that contains the benefit cost_colname (str, optional): the column name in df that contains the cost bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True ci_quantiles (list of float, optional): the lower and upper confidence bounds.
Only has an effect if bootstrap=True
- Returns:
Dict with estimate of the ROI, with or without lower and upper bounds.
- upliftml.evaluation.estimate_target_rate_per_bucket(df: pyspark.sql.DataFrame, bucket_colname: str = 'bucket', target_colname: str = 'outcome', bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None) pandas.DataFrame ¶
Estimates target rates per bucket in a Spark DataFrame.
- Args:
- df (pyspark.sql.DataFrame): a dataframe containing the observed outcomes
and the bucket assignments
bucket_colname (str, optional): column name in df that contains the bucket assignments target_colname (str, optional): the column name in df that contains the target bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True ci_quantiles (list of float, optional): the lower and upper confidence bounds.
Only has an effect if bootstrap=True
- Returns:
- (pandas.DataFrame): a dataframe containing target rate estimates (with or without confidence intervals),
population sizes and fractions within each bucket
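A per-bucket target rate needs no treatment information: it is the mean observed outcome in each bucket. The sketch below shows this in plain Python together with the population size and fraction of each bucket. The function name `target_rate_per_bucket` and the output keys are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict
from statistics import mean


def target_rate_per_bucket(rows):
    """rows: iterable of (bucket, outcome) pairs. Returns {bucket: summary dict}."""
    outcomes = defaultdict(list)
    for bucket, y in rows:
        outcomes[bucket].append(y)
    total = sum(len(v) for v in outcomes.values())
    return {
        bucket: {
            "target_rate": mean(values),
            "count": len(values),
            "fraction": len(values) / total,
        }
        for bucket, values in outcomes.items()
    }


rate_rows = [("a", 1), ("a", 0), ("b", 1), ("b", 1), ("b", 1), ("b", 0)]
rates = target_rate_per_bucket(rate_rows)
print(rates["a"]["target_rate"], rates["b"]["target_rate"])  # 0.5 0.75
```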
- upliftml.evaluation.estimate_target_rate_per_quantile(df: pyspark.sql.DataFrame, n_buckets: int = 30, score_colname: str = 'score', target_colname: str = 'outcome', add_labels: bool = False, bootstrap: bool = False, n_bootstraps: int = 100, ci_quantiles: Optional[List[float]] = None) pandas.DataFrame ¶
- Divides the data into buckets based on model score quantiles and estimates target rates
per bucket.
- Args:
- df (pyspark.sql.DataFrame): a dataframe containing the observed outcomes
and real-valued model scores
n_buckets (int, optional): the number of quantiles to generate from the column <score_colname> score_colname (str, optional): the column name in df that contains the model scores target_colname (str, optional): the column name in df that contains the target add_labels (bool, optional): indicates whether the bucket labels are added in the form [start, end).
Defaults to False, meaning that only the ids of the buckets are returned.
bootstrap (bool, optional): if True, will perform bootstrapping and return confidence intervals n_bootstraps (int, optional): the number of bootstraps to perform. Only has an effect if bootstrap=True ci_quantiles (list of float, optional): the lower and upper confidence bounds.
Only has an effect if bootstrap=True
- Returns:
- (pandas.DataFrame): a dataframe containing target rate estimates (with or without confidence intervals),
population sizes and fractions within each bucket
- upliftml.evaluation.plot_cate_lift(df: pandas.DataFrame, x: str = 'fraction', y: str = 'cum_cate', label: Optional[str] = None, bootstrap: bool = False, ax: Optional[Any] = None) Any ¶
Plots the CATE lift estimates as a lineplot.
- Args:
- df (pandas.DataFrame): a dataframe containing CATE lift estimates and cumulative population sizes or fractions.
If bootstrap=True, df should also contain upper and lower bounds for the lift estimates.
- x (str, optional): column name in df that contains the cumulative population sizes or fractions.
This defines the x-axis.
y (str, optional): column name in df that contains the CATE lift estimates. This defines the y-axis. label (str, optional): name of the score estimation method to be shown on the legend. Defaults to None. bootstrap (bool, optional): indicates whether to use lower and upper bound values from df and plot a
scatterplot with errorbars. If False, plots a barplot.
ax (matplotlib.axes._subplots.AxesSubplot, optional): if specified, the plot will be plotted on this ax. Useful when creating a figure with subplots.
- Returns:
(matplotlib.axes._subplots.AxesSubplot): the axis of the plot
- upliftml.evaluation.plot_cate_random(overall_ate: float, ax: Optional[Any] = None) Any ¶
Plots the random targeting line on a Qini plot.
- Args:
overall_ate (float): the overall treatment effect across all instances, if all instances were targeted ax (matplotlib.axes._subplots.AxesSubplot, optional): if specified, the plot will be plotted on this ax. Useful when creating a figure with subplots.
- upliftml.evaluation.plot_cum_iroi(df: pandas.DataFrame, x: str = 'fraction', y: str = 'iroi', label: Optional[str] = None, plot_overall: bool = True, bootstrap: bool = False, ax: Optional[Any] = None) Any ¶
Plots the cumulative iROI curve.
- Args:
- df (pandas.DataFrame): a dataframe containing iROI estimates and cumulative population sizes or fractions.
If bootstrap=True, df should also contain upper and lower bounds for the iROI estimates.
- x (str, optional): column name in df that contains the cumulative population sizes or fractions.
This defines the x-axis.
y (str, optional): column name in df that contains the iROI estimates. This defines the y-axis. label (str, optional): name of the score estimation method to be shown on the legend. Defaults to None. plot_overall (bool, optional): Indicator whether the overall iROI line should be plotted. Defaults to True. bootstrap (bool, optional): indicates whether to use lower and upper bound values from df and plot a scatterplot
with errorbars. If False, plots a barplot.
ax (matplotlib.axes._subplots.AxesSubplot, optional): if specified, the plot will be plotted on this ax. Useful when creating a figure with subplots.
- Returns:
(matplotlib.axes._subplots.AxesSubplot): the axis of the plot
- upliftml.evaluation.plot_metric_per_bucket(df: pandas.DataFrame, x: str = 'bucket', y: str = 'cate', bootstrap: bool = False, sort_x: bool = True, ax: Optional[Any] = None) Any ¶
Plots metric values per bucket as a barplot or scatterplot with errorbars.
- Args:
- df (pandas.DataFrame): a dataframe containing metric values per bucket. If bootstrap=True,
df should also contain upper and lower bounds.
x (str): column name in df that contains the bucket names. This defines the x-axis. y (str): column name in df that contains the metric values. This defines the y-axis. bootstrap (bool, optional): indicates whether to use lower and upper bound values from df and plot a
scatterplot with errorbars. If False, plots a barplot.
sort_x (bool, optional): if True, x-axis will be sorted from highest metric value to lowest ax (matplotlib.axes._subplots.AxesSubplot, optional): if specified, the plot will be plotted on this ax. Useful when creating a figure with subplots.
- Returns:
(matplotlib.axes._subplots.AxesSubplot): the axis of the plot
- upliftml.evaluation.plot_qini(df: pandas.DataFrame, x: str = 'fraction', y: str = 'ate', label: Optional[str] = None, plot_random: bool = True, bootstrap: bool = False, ax: Optional[Any] = None) Any ¶
Plots the Qini curve.
- Args:
- df (pandas.DataFrame): a dataframe containing Qini estimates and cumulative population sizes or fractions.
If bootstrap=True, df should also contain upper and lower bounds for the Qini estimates.
- x (str, optional): column name in df that contains the cumulative population sizes or fractions.
This defines the x-axis.
y (str, optional): column name in df that contains the Qini estimates. This defines the y-axis. label (str, optional): name of the score estimation method to be shown on the legend. Defaults to None. plot_random (bool, optional): Indicator whether the random targeting line should be plotted. Defaults to True. bootstrap (bool, optional): indicates whether to use lower and upper bound values from df and plot a
scatterplot with errorbars. If False, plots a barplot.
ax (matplotlib.axes._subplots.AxesSubplot, optional): if specified, the plot will be plotted on this ax. Useful when creating a figure with subplots.
- Returns:
(matplotlib.axes._subplots.AxesSubplot): the axis of the plot
Data Simulators¶
- upliftml.datasets.simulate_randomized_trial(n: int = 1000, p: int = 5, sigma: float = 1.0, binary_outcome: bool = False, add_cost_benefit: bool = False) pandas.DataFrame ¶
- Simulates a synthetic dataset corresponding to a randomized trial.
The version with continuous outcome and without cost/benefit columns corresponds to Setup B in Nie X. and Wager S. (2018) ‘Quasi-Oracle Estimation of Heterogeneous Treatment Effects’ and is aligned with the implementation in the CausalML package.
- Args:
n (int, optional): number of observations to generate
p (int, optional): number of covariates. Should be >= 5, since treatment heterogeneity is determined based on the first 5 features.
sigma (float): standard deviation of the error term
binary_outcome (bool): whether the outcome should be binary or continuous
add_cost_benefit (bool): whether to generate cost and benefit columns
- Returns:
- (pandas.DataFrame): a dataframe containing the following columns:
treatment
outcome
propensity
expected_outcome
actual_cate
benefit (only if add_cost_benefit=True)
cost (only if add_cost_benefit=True)
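To make the data-generating process more concrete, here is a stdlib-only sketch of Setup B as described in Nie and Wager (2018), to the best of my reading of that paper: standard normal covariates, a constant propensity of 0.5, baseline b(x) = max(x1 + x2, x3, 0) + max(x4 + x5, 0), and treatment effect tau(x) = x1 + log(1 + exp(x2)). It covers only the continuous-outcome case without cost/benefit columns, omits the expected_outcome column, and is not the upliftml implementation; `simulate_setup_b` is a hypothetical name.

```python
import math
import random


def simulate_setup_b(n=1000, p=5, sigma=1.0, seed=0):
    """Generate n rows following Setup B (continuous outcome, 50-50 assignment)."""
    if p < 5:
        raise ValueError("p must be >= 5; the first 5 features drive the outcome")
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        baseline = max(x[0] + x[1], x[2], 0.0) + max(x[3] + x[4], 0.0)
        tau = x[0] + math.log1p(math.exp(x[1]))  # heterogeneous treatment effect
        w = 1 if rng.random() < 0.5 else 0  # randomized assignment
        y = baseline + (w - 0.5) * tau + sigma * rng.gauss(0.0, 1.0)
        rows.append(
            {
                "features": x,
                "treatment": w,
                "outcome": y,
                "propensity": 0.5,
                "actual_cate": tau,
            }
        )
    return rows


sim_rows = simulate_setup_b(n=1000, p=5, sigma=1.0, seed=0)
```

Because actual_cate is known for every row, simulated data of this kind allows model scores to be compared against the ground-truth effect, which is not possible with real trial data.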