Vitali Set

Evaluating ranking in regression

2024-11-17T00:00:00+00:00

In supervised regression problems, the focus is usually on metrics that measure how close the predicted value is to the true value of each sample. The classic regression metrics are all variations built on the error $| \hat{y}_i - y_i |$.

However, predicting the exact value of the target is not always essential to the final objective. In many applications, a good ranking of the predictions is enough to meet the demands of the business problem. This depends on the context, but with a proper ranking we can treat the problem much like choosing a threshold in binary classification or, more generally, as a policy problem: additional analysis identifies the most appropriate cutoff for applying a desired treatment or action, such as targeting individuals with an expected credit card expense greater than $\delta$ in a marketing campaign for a new product. In most companies, the policy is structured around buckets of relevant percentiles, which are inherently based on ranking.

In other scenarios, such as income estimation, the output of the regression model is often used as a feature in subsequent models. These models, frequently ensembles of decision trees, disregard the exact values of their input variables and consider only their ordering. If the final model is, for instance, a logistic regression or a neural network, simple transformations are typically applied beforehand, altering the distribution of the values while preserving monotonicity. Again, the exact values matter much less than the ranking.

From this perspective, it becomes clear that regression problems may require specific metrics that evaluate the quality of the ranking, rather than relying solely on metrics built on variations of $| \hat{y}_i - y_i |$.

$\oint$ It is worth emphasizing that ranking-oriented metrics are particularly relevant in domains such as recommendation systems, where the primary objective is to provide an optimal ranking of items rather than precise value predictions. I believe that adapting recommendation system metrics could also be highly effective in addressing challenges in other domains. However, these adaptations might not be as straightforward as those discussed in this post.

To illustrate our metrics, let’s assume we built three different models that produced various scores for the same prediction problem, with the test set defined by y_true.

import numpy as np

random_state_0, random_state_1, random_state_2, random_state_3 = np.random.RandomState(42).randint(low=0, high=2**32 - 1, size=4)

y_true = np.random.RandomState(random_state_0).normal(size=1_000)

y_score_1 = np.exp(3 + y_true) + np.random.RandomState(random_state_1).normal(size=len(y_true))
y_score_2 = 3 * y_true + np.random.RandomState(random_state_2).normal(size=len(y_true))
y_score_3 = np.random.RandomState(random_state_3).normal(size=1_000)

SCORES = dict(zip(['y_score_1', 'y_score_2', 'y_score_3'], [y_score_1, y_score_2, y_score_3]))

Without delving into the specifics of how these scores were generated, the most natural and well-known way to evaluate these models would be using metrics such as $R^2$, $\textrm{RMSE}$, or some variation of these. These metrics are very useful but do not necessarily provide much insight into ranking.

In our example, judging by the $\textrm{RMSE}$ alone, y_score_3 appears to be the best predictor of the three.

from sklearn import metrics

for score_name, y_score in SCORES.items():
    rmse = np.sqrt(metrics.mean_squared_error(y_true=y_true, y_pred=y_score))
    print(f"RMSE for {score_name}: {rmse:6.3f}")

RMSE for y_score_1: 58.205
RMSE for y_score_2:  2.242
RMSE for y_score_3:  1.400

Spearman’s Correlation

When we want to evaluate how a regression model ranks the data, it is natural to consider measures of correlation between two continuous variables. Consequently, an initial idea might be to use Pearson's correlation. However, Pearson's correlation focuses solely on linear relationships and does not account for the relative order of the values. Thus, even if the model accurately reproduces the order of the predicted values, if the transformation between the values is not close to linear, Pearson's correlation may not adequately reflect the quality of the ranking.

This is where Spearman's correlation ($\rho$) becomes an interesting metric, as it measures the similarity between the rankings of these variables [1]. In other words, it evaluates whether the order of the values is consistent between the two. This makes Spearman's correlation particularly useful in problems where the relative position of the values is more important than their magnitudes.

$\oint$ Spearman's correlation can be seen as a version of Pearson's correlation applied to the ranks of the variables instead of their original values. Under the hood, Spearman transforms the data by replacing each value with its position in the ranking and then calculates Pearson's correlation on these ranks.

If there are no ties in the ranks, the simplified formula for Spearman's correlation is given by

\[\rho = 1 - \frac{6 \sum_{i=1}^n d_i^2}{n(n^2 - 1)},\]

where $n$ is the total number of observations, and $d_i$ is the difference between the ranks of the same observation in the two variables. To calculate $d_i$, we first assign a rank to each value of the variables. For example, given a set of values $\{w_i\}_{i=1}^n$ and $\{z_i\}_{i=1}^n$, we sort each set separately and replace the values with their respective ranks. Then, for each observation $i$, we compute

\[d_i = \text{rank}(w_i) - \text{rank}(z_i).\]

The values of $\rho$ range between -1 and 1. A value of 1 indicates perfect ranking agreement, -1 indicates a complete inversion of the ranking, and 0 indicates no ranking relationship between the variables.
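As a quick sanity check of the two descriptions above, the sketch below computes $\rho$ in three ways on synthetic data: with the simplified formula, as Pearson's correlation applied to the ranks, and with scipy.stats.spearmanr. The variables `w` and `z` and the way they are generated are illustrative assumptions, not part of the experiment in this post. Since the data are continuous, there are no ties, so the three values should agree up to floating-point error.

```python
import numpy as np
from scipy import stats

# Illustrative data (assumption): a monotonic but non-linear relationship plus noise.
rng = np.random.default_rng(0)
w = rng.normal(size=200)
z = np.exp(w) + rng.normal(scale=0.5, size=200)

# Replace each value by its rank (1 = smallest).
rank_w = stats.rankdata(w)
rank_z = stats.rankdata(z)

# 1) Simplified formula, valid when there are no ties.
n = len(w)
d = rank_w - rank_z
rho_formula = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

# 2) Pearson's correlation computed on the ranks.
rho_pearson_on_ranks = np.corrcoef(rank_w, rank_z)[0, 1]

# 3) Reference value from SciPy.
rho_scipy, _ = stats.spearmanr(w, z)

print(rho_formula, rho_pearson_on_ranks, rho_scipy)
```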

from scipy import stats

for score_name, y_score in SCORES.items():
    spearman = stats.spearmanr(y_true, y_score).statistic
    print(f"Spearman for {score_name}: {spearman:7.5f}")

Spearman for y_score_1: 0.99759
Spearman for y_score_2: 0.94718
Spearman for y_score_3: 0.01447

With this new metric, y_score_1 and y_score_2 stand out for their ability to order y_true, while y_score_3 shows essentially no rank agreement.

Kendall’s Tau Correlation

Another common metric for evaluating ranking is Kendall's Tau ($\tau$) concordance index. This metric measures the strength of association between two rankings by comparing pairs of observations and determining whether they are concordant or discordant [2].

Two pairs $(w_i, z_i)$ and $(w_j, z_j)$ are considered:

concordant: if the ranking of $w_i$ relative to $w_j$ is the same as the ranking of $z_i$ relative to $z_j$. Formally, this occurs when

\[(w_i - w_j)(z_i - z_j) > 0.\]

discordant: if the ranking of $w_i$ relative to $w_j$ is the opposite of that of $z_i$ relative to $z_j$. In other words,

\[(w_i - w_j)(z_i - z_j) < 0.\]

The formula for Kendall's Tau is

\[\tau = \frac{C - D}{\frac{1}{2} n(n-1)},\]

where $C$ is the number of concordant pairs and $D$ is the number of discordant pairs. The denominator, $\frac{1}{2} n(n-1)$, represents the total number of possible pairs among $n$ observations.

Similar to Spearman's correlation, Kendall's Tau ranges between -1 and 1, with the same interpretation: when $\tau$ approaches 1, the rankings are highly concordant; when it approaches -1, the rankings are reversed; and when $\tau \approx 0$, there is no association between the rankings.
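To make the definition concrete, here is a minimal sketch that counts concordant and discordant pairs by brute force and compares the result with scipy.stats.kendalltau. The data and variable names are illustrative assumptions. SciPy computes the tau-b variant by default, which coincides with the formula above when there are no ties, as is the case for continuous data.

```python
import numpy as np
from scipy import stats

# Illustrative data (assumption): two noisy, positively related variables.
rng = np.random.default_rng(0)
w = rng.normal(size=100)
z = w + rng.normal(size=100)

# Enumerate every unordered pair (i, j) with i < j.
i, j = np.triu_indices(len(w), k=1)
products = (w[i] - w[j]) * (z[i] - z[j])

concordant = np.sum(products > 0)
discordant = np.sum(products < 0)

n = len(w)
tau_manual = (concordant - discordant) / (0.5 * n * (n - 1))

# Reference value from SciPy (tau-b, identical to the formula without ties).
tau_scipy, _ = stats.kendalltau(w, z)

print(tau_manual, tau_scipy)
```

The brute-force version needs $\mathcal{O}(n^2)$ comparisons, so it is only meant for checking the definition on small samples.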

for score_name, y_score in SCORES.items():
    kendall = stats.kendalltau(y_true, y_score).statistic
    print(f"kendall's tau for {score_name}: {kendall:7.5f}")

kendall's tau for y_score_1: 0.96163
kendall's tau for y_score_2: 0.80227
kendall's tau for y_score_3: 0.00976

$\oint$ It is possible to adapt the metric to account for sample weights by assigning the weight of a pair as the product of the weights of the samples.
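One possible way to write this adaptation is sketched below. The function name `weighted_kendall_tau` is hypothetical, and the normalization by the total pair weight is my assumption about how to keep the result in $[-1, 1]$. With uniform weights it reduces to the unweighted formula.

```python
import numpy as np
from scipy import stats

def weighted_kendall_tau(w, z, sample_weight):
    """Kendall's tau where pair (i, j) has weight sample_weight[i] * sample_weight[j].

    Hypothetical helper for illustration; brute force over all pairs, O(n^2).
    """
    w, z, sample_weight = map(np.asarray, (w, z, sample_weight))
    i, j = np.triu_indices(len(w), k=1)
    pair_weight = sample_weight[i] * sample_weight[j]
    # +1 for concordant pairs, -1 for discordant pairs, 0 for ties.
    signs = np.sign((w[i] - w[j]) * (z[i] - z[j]))
    return np.sum(pair_weight * signs) / np.sum(pair_weight)

# Illustrative data (assumption).
rng = np.random.default_rng(0)
w = rng.normal(size=100)
z = w + rng.normal(size=100)

tau_uniform = weighted_kendall_tau(w, z, np.ones(len(w)))
tau_weighted = weighted_kendall_tau(w, z, rng.uniform(0.5, 2.0, size=len(w)))
tau_scipy, _ = stats.kendalltau(w, z)

print(tau_uniform, tau_weighted, tau_scipy)
```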

ROCAUC for Classification

ROCAUC is a very good metric for measuring ranking in binary classification [3]. It is invariant to class imbalance and has a direct probabilistic interpretation for the ranking problem. In my experience, it is the primary metric used in industry for binary classification problems when ranking is the main goal.

It is possible to prove that in a binary classification problem with explanatory variables $X \in \mathcal{X}$ and $Y \in \{0, 1\}$, given a scoring/ranking function $f:\mathcal{X} \to \mathbb{R}$, then

\[\text{ROCAUC}(f) = \mathbb{P}\left( f(X_i) > f(X_j) \mid Y_i = 1, Y_j = 0 \right).\]

In other words, if we select a random sample from class 1 and a random sample from class 0 in our binary classification problem, the ROCAUC coincides with the probability that the score given to the class 1 sample is greater than the score given to the class 0 sample.

Because of this probabilistic interpretation of the metric, a good ROCAUC for your classifier is equivalent to a good ranking when using your classifier as a means of ordering.

Estimating the ROCAUC via the Wilcoxon-Mann-Whitney statistic

The previous definition refers to the true ROCAUC value, rather than the estimated value we calculate using sklearn.metrics.roc_auc_score. In an observed random sample of $(X, Y)$, $\{(x_i, y_i)\}_{i=1}^n$, the probabilistic version can be estimated using the Wilcoxon-Mann-Whitney statistic as

\[\frac{1}{n_0 n_1} \sum_{i : y_i = 1} \sum_{j : y_j = 0} \mathbb{1}\left( f(x_i) > f(x_j) \right),\]

where $n_0$ and $n_1$ are the numbers of elements in classes $0$ and $1$, respectively, and $\mathbb{1}\left(S\right)$ is the indicator function. $\mathbb{1}\left(S\right)$ is equal to 1 when the condition $S$ is true and 0 otherwise.
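The estimator above can be checked directly against sklearn.metrics.roc_auc_score. The sketch below uses synthetic labels and scores (an assumption made only for illustration) and compares every positive sample with every negative one. Because the scores are continuous, there are no ties and the strict inequality in the indicator does not matter.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative data (assumption): positives tend to receive higher scores.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
scores = y + rng.normal(size=500)

pos = scores[y == 1]  # scores of class 1, n_1 values
neg = scores[y == 0]  # scores of class 0, n_0 values

# Wilcoxon-Mann-Whitney estimate: fraction of (positive, negative) pairs
# in which the positive sample receives the larger score.
wmw = np.mean(pos[:, None] > neg[None, :])

auc = roc_auc_score(y, scores)
print(wmw, auc)
```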

Computed in this form, the statistic requires $\mathcal{O}(n_0 n_1)$ comparisons, or $\mathcal{O}(n^2)$ if $n_1 \approx n_0$, which can be impractical, so there are variations designed for more efficient computation [3]. The simplest one is to draw only a sufficiently large number $N$ of random pairs $(i, j)$ with $y_i = 1$ and $y_j = 0$, and average the indicator over the sampled pairs:

\[\widehat{\text{ROCAUC}}(f) = \frac{1}{N} \sum_{(i,j) : y_i = 1, y_j = 0} \mathbb{1}\left( f(x_i) > f(x_j) \right).\]
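A minimal sketch of this sampled estimator is shown below, under the assumption that pairs are drawn uniformly and with replacement from the two classes. The data, the seed, and the choice of $N$ are illustrative. The exact all-pairs value is computed as well, only to show how close the approximation gets.

```python
import numpy as np

# Illustrative data (assumption).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2_000)
scores = y + rng.normal(size=y.size)

pos = scores[y == 1]
neg = scores[y == 0]

# Exact value: all n_0 * n_1 comparisons.
exact = np.mean(pos[:, None] > neg[None, :])

# Sampled value: only N random (positive, negative) pairs.
N = 20_000
sampled_pos = rng.choice(pos, size=N, replace=True)
sampled_neg = rng.choice(neg, size=N, replace=True)
approx = np.mean(sampled_pos > sampled_neg)

print(exact, approx)
```

The sampling error shrinks at a rate of roughly $1/\sqrt{N}$, so $N$ can be chosen according to the precision needed rather than the dataset size.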

ROCAUC for Regression

This probabilistic interpretation motivates a simple variation for the regression problem [4]. If we replace the condition $y_i = 1, y_j = 0$ with $y_i > y_j$, we can construct a generic ranking probability metric for regression problems as

\[\widehat{\text{ROCAUC}}(f) = \frac{1}{N} \sum_{(i,j): y_i > y_j} \mathbb{1}\left( f(x_i) > f(x_j) \right),\]

where $N$ is now the number of pairs $(i, j)$ with $y_i > y_j$ included in the sum.

$\oint$ Just like with Kendall's tau, it's possible to adapt the metric to account for sample weights by assigning the weight of a pair as the product of the weights of the samples.

from sklearn import utils

def regression_roc_auc(y_true, y_score):
    """Compute the generalized ROC AUC for regression tasks.

    This function calculates the probability that the predicted values maintain
    the correct order relative to the true values, specifically for pairs where
    y_true[i] > y_true[j].

    Parameters
    ----------
    y_true : array-like of shape (n_samples,)
        True continuous target values.

    y_score : array-like of shape (n_samples,)
        Predicted continuous target values.

    Returns
    -------
    score : float
        The computed generalized ROC AUC score for regression.
    """
    y_true = utils.check_array(y_true, ensure_2d=False, dtype=None)
    y_score = utils.check_array(y_score, ensure_2d=False)

    # Vectorized pairwise comparison (no explicit Python loops). It builds
    # n x n matrices, so memory use grows as O(n^2).
    # Create a mask for all pairs where y_true[i] > y_true[j]
    diff_matrix = y_true[:, None] - y_true[None, :]
    valid_pairs = diff_matrix > 0

    # Count total valid pairs
    total_pairs = np.sum(valid_pairs)

    if total_pairs == 0:
        # If no valid pairs, return 0.5 (equivalent to random ordering)
        return 0.5

    # Compare predictions for valid pairs
    pred_diff = y_score[:, None] - y_score[None, :]
    correct_orderings = np.sum((pred_diff > 0) & valid_pairs)

    score = correct_orderings / total_pairs
    return score

for score_name, y_score in SCORES.items():
    roc_auc = regression_roc_auc(y_true, y_score)
    print(f"ROCAUC for {score_name}: {roc_auc:7.5f}")

ROCAUC for y_score_1: 0.98081
ROCAUC for y_score_2: 0.90113
ROCAUC for y_score_3: 0.50488
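It is worth noting how close this metric is to Kendall's Tau. When there are no ties, every pair is either concordant or discordant, so the fraction of correctly ordered pairs equals $(\tau + 1)/2$. The values printed above are consistent with this, for example $(0.96163 + 1)/2 \approx 0.98081$. The sketch below checks the relationship on synthetic data. The helper `pairwise_ranking_probability` is a compact, hypothetical re-implementation of the pairwise computation, and the data are an assumption made for illustration.

```python
import numpy as np
from scipy import stats

def pairwise_ranking_probability(y_true, y_score):
    """Fraction of pairs with y_true[i] > y_true[j] that also have
    y_score[i] > y_score[j]. Hypothetical helper, O(n^2) memory."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    valid = (y_true[:, None] - y_true[None, :]) > 0
    correct = (y_score[:, None] - y_score[None, :]) > 0
    return np.sum(correct & valid) / np.sum(valid)

# Illustrative data (assumption): continuous, hence without ties.
rng = np.random.default_rng(0)
y = rng.normal(size=300)
score = 3 * y + rng.normal(size=300)

auc = pairwise_ranking_probability(y, score)
tau, _ = stats.kendalltau(y, score)

print(auc, (tau + 1) / 2)
```

With many ties in y_true the two metrics diverge, because tied pairs are dropped from the denominator here but are handled differently by the tau variants.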

These metrics are quite useful, but if your regression problem is highly imbalanced, you may encounter some difficulties. I have worked on regression problems where over 99.5% of the data had values equal to 0, with only a small fraction having any associated value. Since many values will be tied, the previous correlation metrics may become artificially inflated or deflated depending on how the implementation you use handles ties, with no clear signal to help identify the issue. In the case of ROCAUC, many ties $y_i = y_j$ mean that numerous pairs are discarded, which may result in a less reliable, high-variance estimate.

Ranking Curve

The ranking curve (I’m not sure if this curve has an official name) is interesting because it is very simple and intuitive. The process involves ranking your sample based on the predicted variable, dividing it into "buckets" according to percentiles, and then calculating the mean or another relevant positional statistic for each bucket. For example, if you divide the sample into 10 buckets, the third bucket would contain the elements with values falling between the 20th and 30th percentiles, and you would compute the mean of these values.

The idea is that, if your score ranks the sample well, then the elements with the highest values will cluster at one end, and those with the lowest values will cluster at the other. As a result, the resulting graph will have a steep slope.

$\oint$ I usually divide the buckets into 10, but this number is a parameter you can adjust as desired, depending on the level of detail you want to observe. The issue is that the greater the detail, the noisier the result will be due to smaller sample sizes. However, by using a bootstrap method, you can plot a confidence interval for analysis.

$\oint$ This construction is not a QQ-plot, but understanding how a QQ-plot works may help you grasp the construction of this metric, even though this curve is much simpler.

def ranking_curve(y_true, y_score, n_buckets=10, statistic='mean'):
    """Compute the ranking curve for a regression task.

    Calculates statistics of `y_true` values across `n_buckets`  of `y_score`
    values, ordered by the predicted scores. It can be used to  assess the
    distribution or trends of true values as a function of predicted scores.

    Parameters
    ----------
    y_true : array-like of shape (n_samples,)
        True continuous target values.

    y_score : array-like of shape (n_samples,)
        Predicted continuous target values.

    n_buckets : int, default=10
        The number of buckets to divide the sorted `y_score` values into.

    statistic : {'mean', 'median'} or callable, default='mean'
        The statistic to compute for `y_true` values in each bucket.
        - If 'mean', computes the mean of `y_true` in each bucket.
        - If 'median', computes the median of `y_true` in each bucket.
        - If callable, applies the callable function to the `y_true` values in each bucket.

    Returns
    -------
    bucket_positions : ndarray of shape (n_buckets,)
        The positions of the buckets, indexed from 1 to `n_buckets`.

    bucket_values : ndarray of shape (n_buckets,)
        The computed statistic for `y_true` values in each bucket.
    """
    # Convert to an array so that list inputs also support fancy indexing
    y_true = np.asarray(y_true)
    sorted_indices = np.argsort(y_score)
    y_true_sorted = y_true[sorted_indices]

    bucket_edges = np.linspace(0, len(y_true), n_buckets + 1, dtype=int)
    bucket_values = []

    if statistic == 'mean':
        stat_func = np.mean
    elif statistic == 'median':
        stat_func = np.median
    elif callable(statistic):
        stat_func = statistic
    else:
        raise ValueError("statistic must be 'mean', 'median', or a callable")

    for i in range(n_buckets):
        start, end = bucket_edges[i], bucket_edges[i + 1]
        bin_values = y_true_sorted[start:end]
        if len(bin_values) > 0:
            bucket_stat = stat_func(bin_values)
        else:
            bucket_stat = np.nan
        bucket_values.append(bucket_stat)

    bucket_positions = np.arange(1, n_buckets + 1)
    return bucket_positions, np.array(bucket_values)

N_BUCKETS = 10
ordering_curve_dict = {}

for score_name, y_score in SCORES.items():
    bins, ordering_curve = ranking_curve(y_true, y_score, N_BUCKETS, 'mean')
    ordering_curve_dict[score_name] = ordering_curve

It’s useful to compare the curve with a random model that would uniformly distribute y_true across all bins, meaning that the mean for every bucket would be the same, as there would be no relationship between the order and y_true.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(5, 3), dpi=130)

ax.hlines(y_true.mean(), min(bins), max(bins), label='random ordering', colors='k', alpha=0.5)
for score_name, ordering_curve in ordering_curve_dict.items():
    ax.plot(bins, ordering_curve, '-o', markeredgecolor='k', markeredgewidth=0.5, label=score_name)

ax.set_xticks(bins)
ax.legend()
ax.set_ylabel("Mean of y for each bucket")
ax.set_xlabel("buckets")
plt.tight_layout()
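The bootstrap confidence interval mentioned earlier can be sketched as follows. This is a minimal percentile bootstrap under my own assumptions: the helper `bucket_means` is a compact, hypothetical stand-in for the mean-based ranking curve, the data are synthetic, and 200 resamples with a 95% interval are arbitrary choices.

```python
import numpy as np

def bucket_means(y_true, y_score, n_buckets=10):
    """Mean of y_true in each bucket of samples ordered by y_score.
    Hypothetical compact helper used only for this illustration."""
    order = np.argsort(y_score)
    y_sorted = np.asarray(y_true)[order]
    edges = np.linspace(0, len(y_sorted), n_buckets + 1, dtype=int)
    return np.array([y_sorted[a:b].mean() for a, b in zip(edges[:-1], edges[1:])])

# Illustrative data (assumption).
rng = np.random.default_rng(0)
y_true = rng.normal(size=1_000)
y_score = 3 * y_true + rng.normal(size=1_000)

n_bootstrap = 200
curves = np.empty((n_bootstrap, 10))
for b in range(n_bootstrap):
    # Resample (y_true, y_score) pairs with replacement and recompute the curve.
    idx = rng.integers(0, len(y_true), size=len(y_true))
    curves[b] = bucket_means(y_true[idx], y_score[idx])

# Percentile interval for each bucket.
lower, upper = np.percentile(curves, [2.5, 97.5], axis=0)
point_estimate = bucket_means(y_true, y_score)

print(np.round(lower, 2))
print(np.round(point_estimate, 2))
print(np.round(upper, 2))
```

The arrays `lower` and `upper` can then be drawn as a shaded band around the curve, for example with matplotlib's `fill_between`.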

It is also very useful to transform this plot into numerical values that can be used to compare models during hyperparameter optimization. Some of the metrics I like to use include:

The value of the last bucket.
The value of the first bucket.
The difference between the last and the first bucket (by a telescoping sum, this equals the sum of the bucket-to-bucket variations, that is, their mean multiplied by the number of buckets minus one).
The slope of a linear regression fitted to the points.

def last_bucket_ordering_curve(y_true, y_score, n_buckets=10, statistic='mean'):
    _, ordering_curve = ranking_curve(y_true, y_score, n_buckets=n_buckets, statistic=statistic)
    *_, last_bucket = ordering_curve
    return last_bucket

def first_bucket_ordering_curve(y_true, y_score, n_buckets=10, statistic='mean'):
    _, ordering_curve = ranking_curve(y_true, y_score, n_buckets=n_buckets, statistic=statistic)
    first_bucket, *_ = ordering_curve
    return first_bucket

def diff_bucket_ordering_curve(y_true, y_score, n_buckets=10, statistic='mean'):
    _, ordering_curve = ranking_curve(y_true, y_score, n_buckets=n_buckets, statistic=statistic)
    first_bucket, *_, last_bucket = ordering_curve
    return last_bucket - first_bucket

from sklearn import linear_model

def linear_regression_coefficient_ordering_curve(y_true, y_score, n_buckets=10, statistic='mean'):
    _, ordering_curve = ranking_curve(y_true, y_score, n_buckets=n_buckets, statistic=statistic)
    x_values = np.arange(n_buckets).reshape(-1, 1)
    y_values = np.array(ordering_curve).reshape(-1, 1)

    model = linear_model.LinearRegression().fit(x_values, y_values)
    return model.coef_[0][0]

The higher the value of the last bucket, the more concentrated the selected values of y_true are in the higher range.

for score_name, y_score in SCORES.items():
    last_bucket = last_bucket_ordering_curve(y_true, y_score)
    print(f"Last bucket for {score_name}: {last_bucket:7.5f}")

Last bucket for y_score_1: 1.79617
Last bucket for y_score_2: 1.70048
Last bucket for y_score_3: 0.12308

The lower the value of the first bucket, the more concentrated the selected values of y_true are in the lower range.

for score_name, y_score in SCORES.items():
    first_bucket = first_bucket_ordering_curve(y_true, y_score)
    print(f"First bucket for {score_name}: {first_bucket:8.5f}")

First bucket for y_score_1: -1.76345
First bucket for y_score_2: -1.70674
First bucket for y_score_3:  0.07232

The greater the difference between the last bucket and the first bucket, the better separated the values with low scores are from those with higher scores.

for score_name, y_score in SCORES.items():
    diff_bucket = diff_bucket_ordering_curve(y_true, y_score)
    print(f"Diff bucket for {score_name}: {diff_bucket:7.5f}")

Diff bucket for y_score_1: 3.55962
Diff bucket for y_score_2: 3.40723
Diff bucket for y_score_3: 0.05076

The steeper the slope of the line fitted to the points, the better the ranking.

for score_name, y_score in SCORES.items():
    lr_bucket = linear_regression_coefficient_ordering_curve(y_true, y_score)
    print(f"Linear regression coefficient for {score_name}: {lr_bucket:7.5f}")

Linear regression coefficient for y_score_1: 0.34367
Linear regression coefficient for y_score_2: 0.32808
Linear regression coefficient for y_score_3: 0.00722

$\oint$ Adding sample weights to this curve is considerably more tedious, as you need to split the percentiles based on the sum of the weights, but it’s not impossible. :)
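A rough sketch of that weighted version is given below. The function name `weighted_ranking_curve` and the rule used to assign samples to buckets (the midpoint of each sample's slice of the cumulative weight) are my own assumptions, and only the weighted mean is covered. With uniform weights and a sample size divisible by the number of buckets, it reproduces the unweighted curve.

```python
import numpy as np

def weighted_ranking_curve(y_true, y_score, sample_weight, n_buckets=10):
    """Weighted mean of y_true per bucket, with buckets holding (approximately)
    equal total weight. Hypothetical helper for illustration."""
    y_true, y_score, sample_weight = map(np.asarray, (y_true, y_score, sample_weight))
    order = np.argsort(y_score)
    y_sorted, w_sorted = y_true[order], sample_weight[order]

    # Position of each sample as a fraction of the total weight, measured
    # at the midpoint of the weight interval the sample occupies.
    cumulative = np.cumsum(w_sorted)
    midpoint_fraction = (cumulative - w_sorted / 2) / cumulative[-1]
    bucket = np.minimum((midpoint_fraction * n_buckets).astype(int), n_buckets - 1)

    # Weighted mean per bucket (an empty bucket would yield nan).
    weighted_sum = np.bincount(bucket, weights=w_sorted * y_sorted, minlength=n_buckets)
    total_weight = np.bincount(bucket, weights=w_sorted, minlength=n_buckets)
    return weighted_sum / total_weight

# Illustrative data (assumption).
rng = np.random.default_rng(0)
y_true = rng.normal(size=1_000)
y_score = 3 * y_true + rng.normal(size=1_000)

uniform = weighted_ranking_curve(y_true, y_score, np.ones(1_000))
unweighted = y_true[np.argsort(y_score)].reshape(10, 100).mean(axis=1)
weighted = weighted_ranking_curve(y_true, y_score, rng.uniform(0.5, 2.0, size=1_000))

print(np.round(uniform, 3))
print(np.round(weighted, 3))
```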

$\oint$ This curve is also really good for evaluating ranking performance for classification problems.

Final considerations

Although regression models often optimize metrics based on $| \hat{y}_i - y_i |$, I hope this discussion has inspired reflection on the limitations of such metrics. They may not always be the most appropriate choice and can sometimes obscure the true problem of interest.

The ranking metrics introduced here are each highly valuable, complementing one another depending on the specific context and problem at hand. Instead of striving for a single, universally applicable metric, it is often more effective to evaluate these metrics collectively. In practice, they tend to align and reinforce each other, offering a richer and more nuanced understanding of model performance.

Moreover, I encourage you to tweak existing metrics or develop custom variations, which can often uncover fresh perspectives on a problem. The ultimate goal is not merely to assign a score to a model but to ensure it aligns with the problem's objectives and delivers outcomes that are meaningful and actionable.

Bibliography

[1] Spearman's rank correlation coefficient. Wikipedia.

[2] Kendall rank correlation coefficient. Wikipedia.

[3] Imbalanced Binary Classification - A survey with code. Alessandro Morita, Juan Pablo Ibieta, Carlo Lemos.

[4] You Can Compute ROC Curve Also for Regression Models. Samuele Mazzanti.

You can find all files and environments for reproducing the experiments in the repository of this post.

The R² score does not vary between 0 and 1

2023-10-12T00:00:00+00:00

A Portuguese version of this text can be found in the experiments repository.

The coefficient of determination, known as $R^2$, is a fundamental metric in regression analyses. However, its definition and interpretation are not always straightforward. Indeed, there are several ways to define the $R^2$ and, although they coincide in the classical setting of a least-squares linear regression with an intercept evaluated on its training data, each offers a different interpretative nuance. Some of these interpretations are more intuitive, facilitating an immediate understanding of the possible values, while others can lead to misunderstandings.

The current version of scikit-learn, in its docstring for sklearn.metrics.r2_score, mentions that the $R^2$ can range from negative infinity to 1. However, it's not uncommon to find data scientists claiming that the range of possible values for $R^2$ is strictly between 0 and 1. One of the reasons for this discrepancy lies in the classical interpretation of $R^2$, which is traditionally understood as the proportion of variance explained by the model relative to the total variance of the target variable [1].

Throughout this text, I will address the interpretation that I consider most enlightening and relevant. With it, I hope to clarify some peculiarities of the $R^2$ and highlight its importance as a robust metric, frequently referred to in regression problems.

Mean Squared Error and the choice of a constant model

The $R^2$ is a common metric in regression. However, often the first metric introduced for regression problems is the Mean Squared Error (MSE). The MSE of a model $h$ on a dataset $S = \{ (x_i, y_i) \}_{i=1}^n$ is defined by

\[\textrm{MSE}(h) = \frac{1}{n} \sum_{i=1}^n \left(y_i - h(x_i)\right)^2,\]

where we chose not to denote the dependence on $S$ in order to keep the notation more streamlined.

Given this definition, an intriguing question arises: if you had to create a model that was merely a constant, which value would you choose? Many might answer that they would choose the mean, which is indeed one of the correct answers. However, why not consider the median, mode, or some other descriptive statistic?

The answer to this question is intrinsically linked to the cost function we wish to optimize. This choice is, in fact, a problem of decision theory [2]. For instance, if the goal is to optimize the MSE, then we would need to choose an $\alpha \in \mathbb{R}$ such that $h_\alpha(x) = \alpha$ minimizes the $\textrm{MSE}(h_\alpha)$. Mathematically, this is expressed as

\[\alpha^* = \arg\min_{\alpha \in \mathbb{R}} \textrm{MSE}(h_\alpha) = \arg\min_{\alpha \in \mathbb{R}} \left( \frac{1}{n} \sum_{i=1}^n \left(y_i - \alpha\right)^2 \right).\]

This function may seem complex at first glance, but it becomes simpler when considering only $\alpha$ as the free variable, which is how we approach this optimization problem. By expanding the square and performing the summation, we have a polynomial function of degree 2 in $\alpha$ in the form

\[\frac{1}{n} \sum_{i=1}^n \left(y_i - \alpha\right)^2 = \frac{1}{n} \sum_{i=1}^n \left(y_i^2 -2\alpha y_i + \alpha^2 \right) = \alpha^2 + \left(\frac{-2}{n} \sum_{i=1}^n y_i\right) \alpha+ \left(\frac{1}{n} \sum_{i=1}^n y_i^2\right).\]

In a quadratic function of the form $(a\,\alpha^2 + b\,\alpha + c)$, where $a>0$, the minimum occurs at the vertex of the parabola, located at $\frac{-b}{2a}$. Thus, in our context, the minimum is

\[\alpha^* = \frac{\left(\frac{-2}{n} \sum_{i=1}^n y_i\right)}{-2} = \frac{1}{n} \sum_{i=1}^n y_i = \bar{y}.\]

This means that, when minimizing the MSE, the optimal constant value is the average of the target $\bar{y}$ for this set. I encourage validating this result using other unconstrained optimization techniques such as identifying critical points followed by analyzing the concavity of the function.
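As one way of validating the result numerically, the sketch below minimizes the MSE of a constant prediction with scipy.optimize.minimize_scalar and compares the minimizer with the sample mean. The target values are synthetic and deliberately skewed (an assumption for illustration), so that the mean and the median differ.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative, skewed target (assumption).
rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=500)

def mse_of_constant(alpha):
    """MSE of the constant model h_alpha(x) = alpha on the sample y."""
    return np.mean((y - alpha) ** 2)

result = minimize_scalar(mse_of_constant)
alpha_star = result.x

print(alpha_star, y.mean(), np.median(y))
```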

This behavior changes when considering other metrics [3]. For example, to minimize the Mean Absolute Error (MAE), the constant value that optimizes it is the median, while the value that optimizes accuracy is the mode, and for pinball loss, it's the associated quantile. It's important to emphasize that if we consider sample_weight, all these statistics should be computed in a weighted manner.
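These claims can also be checked by brute force. The sketch below evaluates the MAE and the pinball loss of a constant prediction at every observed value and picks the best one. Restricting the candidates to observed values is enough here, because both losses are piecewise linear and convex in the constant. The data, the odd sample size (so that the median is unique), and the quantile level `q = 0.9` are illustrative assumptions.

```python
import numpy as np

# Illustrative, skewed target (assumption); odd size gives a unique median.
rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=1_001)

# Candidate constants: the observed values themselves.
candidates = np.sort(y)
# Row k holds the residuals y_i - candidate_k for every sample i.
residuals = y[None, :] - candidates[:, None]

# MAE of each constant model.
mae = np.mean(np.abs(residuals), axis=1)
best_mae_constant = candidates[np.argmin(mae)]

# Pinball loss at level q of each constant model.
q = 0.9
pinball = np.mean(np.maximum(q * residuals, (q - 1) * residuals), axis=1)
best_pinball_constant = candidates[np.argmin(pinball)]

print(best_mae_constant, np.median(y))
print(best_pinball_constant, np.quantile(y, q))
```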

$\oint$ This is used in defining prediction values for the nodes of decision trees. Looking at the scikit-learn code for trees, we notice that, depending on the criterion, the node_value can vary. It's adjusted to reflect the value that minimizes the loss when the node makes a constant prediction. For example, for the MSE criterion, the leaf's prediction is the average of the target of the training samples that fall in that leaf, while for the MAE criterion, it's the median.

$\oint$ In practice, a model that predicts the average of the evaluation set isn't feasible, because computing that average requires knowing the $y_i$ values of the very samples being predicted. However, this perspective is useful for comparing a basic model with your model, as we will discuss next.

R² as a comparison between your model and a constant model

Suppose I develop a model to predict a person's age based on their online behavior and obtain an MSE of 25 years squared. This number on its own might not be very informative. One way to interpret it is to calculate the Root Mean Squared Error, that is, $\textrm{RMSE} = \sqrt{\textrm{MSE}}$, resulting in an error of about 5 years. This value is more intuitive (I admit that, internally, I tend to think in terms of MAE), but it still doesn't provide a relative comparison like "is it possible to get a value significantly lower than this?". The $R^2$ might not answer this question directly, but it aids in this evaluation.

We've already discussed a simple model that can serve as a benchmark. Imagine that the mean-based model already produces an MSE of 30 years squared. Suddenly, our previous model, which might have seemed excellent, doesn't stand out as much. If a simple model already achieves an MSE just slightly higher than the current model, is it worth implementing the more complex model in a production environment?

The interpretation I have of $R^2$ is precisely this comparison. Its formula is

\[R^2(h) = 1 - \frac{\textrm{MSE}(h)}{\textrm{MSE}(\bar{y})},\]

where $\bar{y}$ represents the average of the target in the set $S$ in which we are evaluating the model.

With this, we can understand the possible values of $R^2$:

If $R^2 = 1$, it means that $\textrm{MSE}(h) = 0$; that is, the model is perfect.
If $R^2 = 0$, we have $\textrm{MSE}(h) = \textrm{MSE}(\bar{y})$, indicating that our model is as effective as a model that simply provides the target's average.
For an $R^2$ between 0 and 1, we have $0 < \textrm{MSE}(h) < \textrm{MSE}(\bar{y})$. This indicates that the model has an error greater than zero, but less than that of a constant model based on the average.
A negative $R^2$ suggests that $\textrm{MSE}(h) > \textrm{MSE}(\bar{y})$, meaning our model is less accurate than one that always provides the average.

This interpretation helps in understanding the values obtained when using the function sklearn.metrics.r2_score. In the previous example, we would have an $R^2$ of $(1 - 25/30) \approx 0.17$, indicating a model that surpasses the simple model, but not very significantly.
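The formula can be verified directly against sklearn.metrics.r2_score. In the sketch below, the ages and predictions are synthetic numbers invented for illustration and are not the figures from the example above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative data (assumption): "ages" and noisy predictions of them.
rng = np.random.default_rng(0)
y_true = rng.normal(loc=40, scale=5, size=300)
y_pred = y_true + rng.normal(scale=4, size=300)

# MSE of the model and of the constant model that predicts the mean of y_true.
mse_model = mean_squared_error(y_true, y_pred)
mse_baseline = mean_squared_error(y_true, np.full_like(y_true, y_true.mean()))

r2_manual = 1 - mse_model / mse_baseline

print(r2_manual, r2_score(y_true, y_pred))
```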

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *fetch_california_housing(return_X_y=True),
    test_size=0.33,
    random_state=42,
)

lr = LinearRegression().fit(X_train, y_train)

def evaluate_model(y_true, y_pred):
    print(f"MSE: {mean_squared_error(y_true, y_pred)}")
    print(f"R^2: {r2_score(y_true, y_pred)}")
    
y_pred_lr = lr.predict(X_test)
evaluate_model(y_test, y_pred_lr)

MSE: 0.5369686543372444
R^2: 0.5970494128783965

y_mean_test = y_test.mean() * np.ones_like(y_test)
evaluate_model(y_test, y_mean_test)

MSE: 1.3325918152222385
R^2: 0.0

y_pred_terrible_model = np.zeros_like(y_test)
evaluate_model(y_test, y_pred_terrible_model)

MSE: 5.6276808369101445
R^2: -3.2231092616846126

Although a model with an $R^2$ of zero might seem like the lowest achievable threshold, in reality, this metric uses a baseline model with data leakage. In practice, we build our models using training data, and in scenarios subject to "dataset shift," there can be significant changes in fundamental statistics, such as the average.

y_mean_train = y_train.mean() * np.ones_like(y_test)
evaluate_model(y_test, y_mean_train)

MSE: 1.3326257277946882
R^2: -2.5448582275933163e-05

Regardless of these nuances, interpreting the $R^2$ in this way offers a valuable comparative mindset. It's always essential to compare your model with simple baselines, whether with established business rules or with more basic models, like a constant.

Generalization of R² beyond MSE

The notion of comparison with a basic or simple model can easily be generalized to other metrics, as long as we know which statistics to use as a baseline. Considering this, I propose extending this idea to the MAE using the median $\tilde{y}$ as the baseline model

\[R^2_{\textrm{MAE}}(h) = 1 - \frac{\textrm{MAE}(h)}{\textrm{MAE}(\tilde{y})},\]

where

\[\textrm{MAE}(h) = \frac{1}{n} \sum_{i=1}^n \left| y_i - h(x_i) \right|.\]

Thus, the $R^2_{\textrm{MAE}}$ provides a way to evaluate the model's performance relative to a simple baseline, using the MAE as the error metric.

from sklearn.metrics import mean_absolute_error

def r2_score_mae(y_true, y_pred, **kwargs):
    # Extra keyword arguments (e.g. sample_weight) are forwarded to both MAE calls.
    mae_model = mean_absolute_error(y_true, y_pred, **kwargs)
    # Baseline: a constant prediction equal to the median of the targets.
    y_median_true = np.median(y_true) * np.ones_like(y_true)
    mae_median = mean_absolute_error(y_true, y_median_true, **kwargs)
    return 1 - mae_model / mae_median

def evaluate_model_mae(y_true, y_pred):
    print(f"MAE: {mean_absolute_error(y_true, y_pred)}")
    print(f"R^2_MAE: {r2_score_mae(y_true, y_pred)}")

evaluate_model_mae(y_test, y_pred_lr)

MAE: 0.5295710106684688
R^2_MAE: 0.40256278728026484

y_median_test = np.median(y_test) * np.ones_like(y_test)
evaluate_model_mae(y_test, y_median_test)

MAE: 0.8864044612448619
R^2_MAE: 0.0
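The same recipe applies to any loss whose optimal constant predictor is known. As a further illustration, which goes beyond what this post evaluates, the sketch below uses the pinball loss at level $q$, whose optimal constant is an empirical $q$-quantile of the target. The helper names are hypothetical and the code is plain Python on toy data.

```python
import math


def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss at level q."""
    total = 0.0
    for yt, yp in zip(y_true, y_pred):
        residual = yt - yp
        total += max(q * residual, (q - 1) * residual)
    return total / len(y_true)


def empirical_quantile(values, q):
    """Lower empirical quantile: smallest value with at least a share q at or below it."""
    ordered = sorted(values)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[index]


def r2_pinball(y_true, y_pred, q):
    """1 - pinball(model) / pinball(constant q-quantile baseline).

    Undefined (division by zero) when the target is constant.
    """
    baseline = [empirical_quantile(y_true, q)] * len(y_true)
    return 1 - pinball_loss(y_true, y_pred, q) / pinball_loss(y_true, baseline, q)


y_toy = [1.0, 2.0, 3.0, 4.0, 5.0]
score_perfect = r2_pinball(y_toy, y_toy, q=0.9)        # perfect predictions
score_baseline = r2_pinball(y_toy, [5.0] * 5, q=0.9)   # the baseline itself
```

As with $R^2$ and $R^2_{\textrm{MAE}}$, a perfect model scores 1 and the baseline scores 0.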

Final considerations

The misconception that $R^2$ varies only between 0 and 1 originates from a simplified interpretation of its most common meaning: the proportion of the target's variance that is explained by the independent variables, which suggests that the value lies between 0% and 100%. In practice, $R^2$ does fall within this range in many cases. However, when the model is worse than a simple horizontal model (i.e., a constant prediction equal to the average), $R^2$ is negative. This scenario is often overlooked by the statistical community because it is usually associated with overfitting: a linear regression, which tends to underfit, will rarely be worse than the horizontal model, since that model is itself included in the hypothesis space of linear regression.

Throughout this post, we analyzed some of the reasons why $R^2$ is such an interesting metric and widely used in regression problems. By understanding the implicit comparison with a baseline model, we gain a valuable perspective on the relative performance of our model, normalizing the less informative values of MSE when viewed in isolation. Moreover, the interpretation proposed here truly allows us to understand the resulting values in a clear and objective manner.

Bibliography

[1] Coefficient of determination. Wikipedia.

[2] Introdução à Teoria da Decisão. Fundamentos de Inferência Bayesiana. Victor Fossaluza e Luís Gustavo Esteves.

[3] Estimação Pontual. Fundamentos de Inferência Bayesiana. Victor Fossaluza e Luís Gustavo Esteves.

You can find all files and environments for reproducing the experiments in the repository of this post.

Conformal prediction in CATE estimation

2023-07-17T00:00:00+00:00

As we've discussed in the post about Conditional Density Estimation, having a sense of confidence associated with your prediction is important for decision making [1], and this is no different in applications of causal inference. Here, estimating confidence intervals for the Conditional Average Treatment Effect (CATE) can greatly enhance the validity of causal inference studies.

In the binary treatment $T\in\{0, 1\}$ scenario, CATE is defined as the expected difference in outcomes $Y$ when an individual with certain observable characteristics is treated versus when the same individual is not treated. Mathematically, depending on the school of causal inference you come from, we can write "the average difference in expected potential outcomes conditional on the same covariates $Z=z$" as [2, 3, 4]

\[\begin{align*} \textrm{CATE}_{T, Y}(z) &= \mathbb{E}(Y| do(T=1), Z=z) - \mathbb{E}(Y| do(T=0), Z=z)\\ &= \mathbb{E}(Y_1 | Z=z) - \mathbb{E}(Y_0 | Z=z). \end{align*}\]

CATE helps to estimate the effect of a treatment at an individual level, taking into account the specific characteristics of each instance. This is incredibly valuable in many fields of industry where understanding the effect of a treatment ($T$) on different subpopulations ($Z$) helps in creating personalized treatment plans depending on the desired outcome ($Y$).

Brief review of confounder control

It's common to use as $Z$ a set of variables that, in the CATE conditionals, satisfies the backdoor criterion — or, in Rubin's theory, renders $T$ conditionally ignorable — to measure the causal effect of $T$ on $Y$, i.e., $(Y_0, Y_1) \, \bot \, T \, | \, Z$. This is important because, in this scenario, $Z$ controls confounders [2], and we have the causal identification given by

\[f(z|do(T=t)) = f(z)\textrm{, and }f(y|do(T=t), Z=z) = f(y|T=t, Z=z).\]

Consequently [2]

\[\mathbb{E}(Y|do(T=t), Z=z) = \mathbb{E}(Y|T=t, Z=z).\]

This relationship is crucial as it enables us to estimate this quantity using any supervised machine learning model. This technique is known as the adjustment formula and has different flavors such as meta-learners and matching [2, 3].
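As a minimal illustration of the adjustment formula, the sketch below assumes a discrete confounder and uses per-stratum sample means as the "model" for $\mathbb{E}(Y|T=t, Z=z)$. It is a toy stand-in for a supervised learner, the data are made up, and the function name `cate_by_stratum` is hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def cate_by_stratum(rows):
    """Estimate E[Y | T=1, Z=z] - E[Y | T=0, Z=z] for each discrete z.

    `rows` is an iterable of (treatment, confounder, outcome) tuples.
    Strata lacking either treatment arm are skipped (no overlap there).
    """
    outcomes = defaultdict(list)
    for t, z, y in rows:
        outcomes[(t, z)].append(y)

    strata = sorted({z for _, z in outcomes})
    return {
        z: fmean(outcomes[(1, z)]) - fmean(outcomes[(0, z)])
        for z in strata
        if (0, z) in outcomes and (1, z) in outcomes
    }


toy_rows = [
    (0, "a", 1.0), (0, "a", 3.0), (1, "a", 5.0),
    (0, "b", 10.0), (1, "b", 12.0), (1, "b", 16.0),
    (1, "c", 7.0),  # stratum "c" has no untreated unit, so it is dropped
]
cate_toy = cate_by_stratum(toy_rows)
print(cate_toy)  # {'a': 3.0, 'b': 4.0}
```

The estimate is only a causal effect if the stratifying variable satisfies the backdoor criterion, as discussed above. The dropped stratum anticipates the positivity assumption.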

Despite its utility, applying conformal prediction for estimating CATE in the above scenario is not straightforward. Since binary CATE involves estimating two quantities, it is necessary to combine the prediction intervals of these two estimates in some way. We will discuss how we can do this without any parametric assumptions.

$\oint$ In continuous treatment scenarios, my experience has shown that $\mathbb{E}(Y| do(T=t), Z=z)$ provides more information than CATE, which is defined as the derivative of this expectation with respect to $t$. It is easier to directly use conformal prediction in $\mathbb{E}(Y| do(T=t), Z=z)$ as this scenario can be interpreted just as a regression, when using the adjustment formula. On the other hand, if you really need to use CATE, this interval estimate is much more complicated, and bootstrap strategies would be the approach I would use. If you have another idea, please reach out!

Creating the dataset

To illustrate our application, we will use a simple causal graph where $Z$ will act as a confounder, serving as a set that satisfies the backdoor criterion.

The corresponding structural causal model is given by

\[U_Z \sim \textrm{Uniform}(-\pi, \pi)\textrm{, with }g_Z(u_Z) = u_Z,\] \[U_T \sim \textrm{Uniform}(0, 1)\textrm{, with }\] \[g_T(u_T, z) = \mathbb{1}(u_T \leq 0.05 + 0.9\, \sigma(z))\textrm{, where }\sigma(x) = \frac{1}{1 + \exp(-x)},\] \[U_Y \sim \mathcal{N}(0, 1)\textrm{, with }\] \[g_Y(u_Y, z, t) = \mathbb{1}(t=0) (10 \sin(z)) + \mathbb{1}(t=1) (10 \cos(z)) + 0.5 (1 + t\,|z|)\,u_Y.\]

Note that we are in a suitable scenario to apply causal inference as the positivity assumption [5] is guaranteed; in other words, it holds that

\[0 < \mathbb{P}(T=t | Z=z) < 1 \textrm{, }\forall t \in \textrm{Im}(T)= \{ 0, 1\}, z \in \textrm{Im}(Z) = (-\pi, \pi).\]
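For this data-generating process the claim can be checked numerically. With $Z \sim \textrm{Uniform}(-\pi, \pi)$ and $\mathbb{P}(T=1|Z=z) = 0.05 + 0.9\,\sigma(z)$, the sketch below evaluates the propensity on a grid over the support of $Z$. The grid size is an arbitrary choice and the function name `propensity` is only illustrative.

```python
import math


def propensity(z):
    """P(T=1 | Z=z) = 0.05 + 0.9 * sigmoid(z), as in g_T above."""
    return 0.05 + 0.9 / (1.0 + math.exp(-z))


# Evaluate the propensity on a grid over the support of Z, (-pi, pi).
grid = [-math.pi + 2 * math.pi * i / 1000 for i in range(1001)]
propensities = [propensity(z) for z in grid]
p_min, p_max = min(propensities), max(propensities)

print(round(p_min, 3), round(p_max, 3))  # about 0.087 and 0.913
```

Both treatment arms therefore have a probability of at least roughly 8.7% everywhere on the support, which is what positivity requires.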

def adapted_sigmoid(x):
    return 0.05 + 0.9 / (1 + np.exp(-x))

def func_0(Z):
    return 10 * np.sin(Z)

def func_1(Z):
    return 10 * np.cos(Z)

def generate_data(size=100, obs=True, random_state=None):
    rs = np.random.RandomState(random_state).randint(
        0, 2**32 - 1, dtype=np.int64, size=4
    )

    Z_obs = np.random.RandomState(rs[0]).uniform(low=-np.pi, high=np.pi, size=size)

    def g_T_noised(Z):
        return (
            np.random.RandomState(rs[1])
            .binomial(n=1, p=adapted_sigmoid(Z))
            .astype(bool)
        )

    T_obs = g_T_noised(Z_obs)

    noise = np.random.RandomState(rs[3]).normal(size=size)

    def g_Y(T, Z, noise):
        return (
            np.select(condlist=[T], choicelist=[func_1(Z)], default=func_0(Z))
            + 0.5 * (1 + T * np.abs(Z)) * noise
        )

    Y_obs = g_Y(T_obs, Z_obs, noise)
    Y_cf = g_Y(~T_obs, Z_obs, noise)

    def generate_df(T, Z, Y):
        return pd.DataFrame(
            np.vstack([T.astype(int), Z, Y]).T,
            columns=["treatment", "confounder", "target"],
        )

    df_obs = generate_df(T_obs, Z_obs, Y_obs)
    df_cf = generate_df(~T_obs, Z_obs, Y_cf)

    return df_obs, df_cf

df_obs, df_cf = generate_data(size=50_000, obs=True, random_state=42)

Since we are dealing with synthetic data, we can observe both the observational and the counterfactual scenarios. In this instance, we can actually derive $Y_1 - Y_0$ for each example. Thus, we will be able to evaluate our estimates using a test set that's separate from the training set, as is typical in supervised scenarios.

from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(
    df_obs.assign(target_cf=df_cf.target),
    test_size=0.2,
    random_state=42,
)

df_train_t0 = df_train.query("treatment == 0")
df_train_t1 = df_train.query("treatment == 1")

def return_TZ_y(df, backdoor_set_list):
    return df.filter(backdoor_set_list), np.array(df.target)

backdoor_set = ["confounder"]

XZ_train_t0, y_train_t0 = return_TZ_y(df_train_t0, backdoor_set)
XZ_train_t1, y_train_t1 = return_TZ_y(df_train_t1, backdoor_set)

XZ_test, y_test = return_TZ_y(df_test, backdoor_set)

Positivity assumption

One assumption, often overlooked in Pearl's theory but crucial to test for good estimation, is the positivity assumption. As we observed earlier, this assumption is satisfied in our synthetic data, but in a real-life scenario, it would require validation.

$\oint$ If you are in a situation where you are applying a "$\varepsilon$-greedy strategy" in your population to have randomization, then this assumption is ensured. This emphasizes the importance of a continuous experimentation process in a product based on causal inference.

The importance of the positivity assumption being satisfied is immediate: How do we predict what happens with $Y$ when $T$ has a certain value in regions of $Z$ where no individual has received such treatment? Naturally, the problem becomes impossible, or your approximation becomes very bad because it uses distant examples to make predictions for that point.

The common approach to ensure this is to employ a model that estimates $T$ using $Z$ and then evaluate it. If this model demonstrates exceptional performance, it implies that the relationship is likely deterministic, thereby violating the positivity assumption. In the case of binary treatment, which is our scenario, we can assess a reasonably well-calibrated model (or calibrate the model ourselves [6]) and examine the distribution of probabilities.

from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

positivity_assumption_check_estimator = LogisticRegression(
    random_state=42,
).fit(df_train.drop(columns=["treatment", "target", "target_cf"]), df_train.treatment)

roc_auc_score(
    df_test.treatment, positivity_assumption_check_estimator.predict_proba(XZ_test)[:, 1]
)

0.8370462957096292

The ROC AUC of about 0.84 (computed with sklearn.metrics.roc_auc_score) already suggests that the positivity assumption is plausible: the treatment depends on $Z$, but not deterministically. When there are deterministic regions in the relationship between $T$ and $Z$, this typically results in a ROC AUC close to 1.

from calibration_stuff import calibration_curve

probs = positivity_assumption_check_estimator.predict_proba(XZ_test)[:, 1]
prob_true, prob_pred, size_bin = calibration_curve(df_test.treatment, probs, n_bins=10)

fig, ax = plt.subplots(ncols=2, figsize=(10, 3))
ax[0].plot([0, 1], "--")
ax[0].scatter(prob_true, prob_pred, s=(0.1 * size_bin).astype(int), edgecolor="k")
ax[0].set_xlabel("True probability of bin")
ax[0].set_ylabel("Mean predicted probability of bin")
ax[1].hist(
    probs, bins=np.linspace(0, 1, 21), weights=np.ones_like(probs) / probs.shape[0]
)
ax[1].set_xlabel("Histogram of predicted probability")
plt.tight_layout()

Indeed, after confirming that the model is reasonably calibrated, we can observe that the probability histograms do not contain examples with probabilities close to 0 or 1. This suggests that we are in an appropriate scenario for estimating CATE.
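The visual check can be complemented by a couple of summary numbers. The sketch below is a small helper, with the hypothetical name `positivity_report` and an arbitrary tolerance `eps`, that reports the range of the estimated propensities and the share that falls near 0 or 1. The probabilities used here are made up for illustration.

```python
def positivity_report(probs, eps=0.05):
    """Summarise how close estimated propensities get to 0 or 1.

    `eps` is an arbitrary tolerance chosen for illustration.
    """
    probs = list(probs)
    n = len(probs)
    return {
        "min": min(probs),
        "max": max(probs),
        "share_below_eps": sum(p < eps for p in probs) / n,
        "share_above_1_minus_eps": sum(p > 1 - eps for p in probs) / n,
    }


# Made-up probabilities: one value near 0 and one near 1 out of eight.
toy_probs = [0.02, 0.10, 0.35, 0.50, 0.65, 0.90, 0.99, 0.40]
report = positivity_report(toy_probs)
print(report)
```

In the notebook it could be called on the `probs` array computed above. Large shares near either extreme would be a warning sign, provided the model is reasonably calibrated.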

$\oint$ The scenario of continuous treatment is slightly more complex, but evaluating regression metrics can provide a good intuition of this relationship. Another viable technique is to discretize the treatment and analyze these probabilities in a manner similar to the approach used for the binary case.

Conformalized Quantile Regression

Quantile regression with pinball loss [7] is a suitable method for predicting conditional quantiles of a target variable. However, the estimates $Q_{\beta}$ and $Q_{1-\beta}$ of the conditional quantiles at levels $\beta \in (0, 1/2)$ and $1 - \beta$, respectively, usually do not satisfy the coverage property, which requires $\mathbb{P}((Y|Z=z) \in (Q_{\beta}, Q_{1-\beta})) \geq 1 - 2 \beta$ [8].

Conformalized Quantile Regression utilizes the previous quantile regression approach, but with a correction in these predictions of conditional quantiles, thereby ensuring marginal coverage [1, 8].
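The correction step can be sketched in isolation, following the split-conformal recipe of [8] as I understand it: compute conformity scores on a held-out calibration set, take a finite-sample-adjusted quantile of those scores, and widen (or shrink) both quantile predictions by that amount. The code is plain Python on toy numbers, and the function names are hypothetical.

```python
import math


def cqr_correction(lower_cal, upper_cal, y_cal, alpha):
    """Conformal correction from calibration predictions and targets.

    The score is positive when y falls outside [lower, upper] and
    negative when it falls inside. The rank ceil((1 - alpha) * (n + 1))
    is the finite-sample-adjusted quantile level.
    """
    scores = sorted(
        max(lo - y, y - hi) for lo, hi, y in zip(lower_cal, upper_cal, y_cal)
    )
    n = len(scores)
    rank = math.ceil((1 - alpha) * (n + 1))
    if rank > n:
        # Too few calibration points for the requested coverage.
        return math.inf
    return scores[rank - 1]


def conformalize(lower, upper, correction):
    """Apply the same correction to both ends of each interval."""
    return [(lo - correction, hi + correction) for lo, hi in zip(lower, upper)]


# Toy calibration set: the raw quantile models always predict [0, 1].
y_cal_toy = [0.5, 1.25, -0.25, 0.75, 1.5, 0.25, 2.0]
correction_toy = cqr_correction([0.0] * 7, [1.0] * 7, y_cal_toy, alpha=0.25)
intervals_toy = conformalize([0.0], [1.0], correction_toy)
print(correction_toy, intervals_toy)  # 0.5 [(-0.5, 1.5)]
```

With a large calibration set, the adjusted rank is very close to a plain $1-\alpha$ empirical quantile of the scores.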

We can implement a version of Conformalized Quantile Regression using the aforementioned strategy, trying to follow the scikit-learn standards and using lightgbm.LGBMRegressor with `objective="quantile"` as the quantile regressor.

from functools import partial
from lightgbm import LGBMRegressor
from scipy.stats import loguniform
from sklearn.base import BaseEstimator
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV, KFold
from sklearn.utils.validation import check_X_y, check_is_fitted, _check_sample_weight
from statsmodels.stats.weightstats import DescrStatsW

class ConformalizedQuantileRegression(BaseEstimator):
    """
    Conformalized Quantile Regression with LGBMRegressor.

    This estimator provides prediction intervals for one dimension
    regression tasks by using CQR with LightGBM.

    Parameters
    ----------
    alpha : float, default=0.05
        Determines the size of the prediction interval. For example,
        alpha=0.05 results in a 95% coverage prediction interval.

    calibration_size : float, default=0.2
        The proportion of the dataset to be used for the calibration set
        which computes the conformity scores.

    random_state : int, RandomState instance or None, default=None
        Controls the randomness for reproducibility.

    n_iter_cv : int, default=10
        Number of parameter settings that are sampled in RandomizedSearchCV
        for the LightGBM model during fit.
    """

    def __init__(
        self, alpha=0.05, calibration_size=0.2, random_state=None, n_iter_cv=10
    ):
        self.alpha = alpha
        self.calibration_size = calibration_size
        self.random_state = random_state
        self.n_iter_cv = n_iter_cv

    def _quantile_loss(self, y_true, y_pred, quantile=None, sample_weights=None):
        weighted_errors = (y_true - y_pred) * (quantile - (y_true < y_pred))
        if sample_weights is not None:
            weighted_errors *= sample_weights
        return np.mean(weighted_errors)

    def _return_quantile_model(self, quantile):
        quantile_scorer = make_scorer(
            partial(self._quantile_loss, quantile=quantile), greater_is_better=False
        )

        return RandomizedSearchCV(
            estimator=LGBMRegressor(
                random_state=self.random_state, objective="quantile", alpha=quantile
            ),
            cv=KFold(shuffle=True, random_state=self.random_state),
            param_distributions={
                "learning_rate": loguniform.rvs(
                    random_state=self.random_state, a=0.0001, b=1, size=1000
                ),
                "n_estimators": [50, 100, 200],
                "num_leaves": [31, 63, 127],
            },
            scoring=quantile_scorer,
            n_iter=self.n_iter_cv,
            random_state=self.random_state,
            n_jobs=-1,
        )

    def fit(self, X, y, sample_weight=None):
        X, y = check_X_y(X, y)
        sample_weight = _check_sample_weight(sample_weight, X)

        (
            X_train,
            X_cal,
            y_train,
            y_cal,
            sample_weight_train,
            sample_weight_cal,
        ) = train_test_split(
            X,
            y,
            sample_weight,
            test_size=self.calibration_size,
            random_state=self.random_state,
        )

        self.model_lower_ = self._return_quantile_model(quantile=self.alpha / 2).fit(
            X_train, y_train, sample_weight=sample_weight_train
        )
        self.model_upper_ = self._return_quantile_model(
            quantile=1 - self.alpha / 2
        ).fit(X_train, y_train, sample_weight=sample_weight_train)

        self.y_cal_conformity_scores_ = np.maximum(
            self.model_lower_.predict(X_cal) - y_cal,
            y_cal - self.model_upper_.predict(X_cal),
        )
        wq = DescrStatsW(data=self.y_cal_conformity_scores_, weights=sample_weight_cal)
        self.quantile_conformity_scores_ = wq.quantile(
            probs=1 - self.alpha, return_pandas=False
        )[0]

        return self

    def predict(self, X):
        """
        Predicts conformalized quantile regression intervals for X.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The input samples.

        Returns
        -------
        y_test_interval_pred_cqr : ndarray of shape (n_samples, 2)
            Returns the predicted lower and upper bound for each sample in X.
        """
        check_is_fitted(self)
        y_test_interval_pred_cqr = np.column_stack(
            [
                self.model_lower_.predict(X) - self.quantile_conformity_scores_,
                self.model_upper_.predict(X) + self.quantile_conformity_scores_,
            ]
        )
        return y_test_interval_pred_cqr

Using the T-learner

In this example, we will utilize the T-learner technique [3, 9], building a model to estimate each $\mathbb{E}(Y|do(T=t), Z)$ for $t\in\{0, 1\}$. We will set alpha=0.05 to construct prediction sets with 95% coverage.

model_t0 = ConformalizedQuantileRegression(
    random_state=42, alpha=0.05, n_iter_cv=30
).fit(XZ_train_t0, y_train_t0)
y_test_interval_pred_cqr_t0 = model_t0.predict(XZ_test)

model_t1 = ConformalizedQuantileRegression(
    random_state=42, alpha=0.05, n_iter_cv=30
).fit(XZ_train_t1, y_train_t1)
y_test_interval_pred_cqr_t1 = model_t1.predict(XZ_test)

$\oint$ It's worth noting that you may want to implement an importance weighting strategy here to achieve a better prediction set in regions where $P(T=t | Z=z)$ is close to zero (naturally, these being regions with fewer examples). We can interpret this as being in a covariate shift environment, where the covariates of the population to which we are applying the model are different from those of the population on which we are training it. However, if you can ensure the positivity assumption, it may be less critical (especially with models that don't underfit, such as tree ensembles [10]).

def return_sample_weight_treatment_i(df_train, df_test):
    # Covariate shift concerns the covariates only, so the outcome columns
    # are dropped together with the treatment indicator.
    non_covariates = ["treatment", "target", "target_cf"]

    df_ood_ti = pd.concat(
        [
            df.assign(train_or_test=j)
            for j, df in enumerate(
                [
                    df_train.drop(columns=non_covariates),
                    df_test.drop(columns=non_covariates),
                ]
            )
        ]
    )

    # Domain classifier: predicts whether a row comes from train (0) or test (1).
    ood_sample_correction_ti = LogisticRegression(
        random_state=42,
    ).fit(df_ood_ti.drop(columns=["train_or_test"]), df_ood_ti.train_or_test)

    roc = roc_auc_score(
        df_ood_ti.train_or_test,
        ood_sample_correction_ti.predict_proba(
            df_ood_ti.drop(columns=["train_or_test"])
        )[:, 1],
    )

    probs = ood_sample_correction_ti.predict_proba(
        df_train.drop(columns=non_covariates)
    )
    # Equivalent to `probs[:, 1]/probs[:, 0]`.
    sample_weights_ti = 1 / probs[:, 0] - 1

    return roc, sample_weights_ti

_, sw_0 = return_sample_weight_treatment_i(df_train=df_train_t0, df_test=df_test)
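To give an idea of how such weights would enter the conformal step, the sketch below computes a weighted empirical quantile of conformity scores: calibration points that resemble the target population count for more. This is a simplification of weighted conformal prediction, which, as far as I understand, also assigns some mass to the test point itself. The function name and toy values are hypothetical.

```python
def weighted_quantile(values, weights, level):
    """Smallest value whose cumulative normalised weight reaches `level`."""
    pairs = sorted(zip(values, weights))
    threshold = level * sum(weights)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= threshold:
            return value
    # Only reached through floating-point rounding at level close to 1.
    return pairs[-1][0]


scores_toy = [1.0, 2.0, 3.0, 4.0]

# Uniform weights reproduce the ordinary (lower) empirical median.
median_uniform = weighted_quantile(scores_toy, [1.0, 1.0, 1.0, 1.0], level=0.5)
# Upweighting the largest score pulls the quantile towards it.
median_shifted = weighted_quantile(scores_toy, [1.0, 1.0, 1.0, 5.0], level=0.5)
print(median_uniform, median_shifted)  # 2.0 4.0
```

Passing `sample_weight` to `ConformalizedQuantileRegression.fit` plays a similar role in the class above, where the weighted quantile is delegated to `DescrStatsW`.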

Evaluating the conformal regression

With the interval estimates calculated in y_test_interval_pred_cqr_t0 and y_test_interval_pred_cqr_t1, we can assess the effectiveness of our predictions. To do this, we will examine factors such as the coverage of our predictions, in both the observational and counterfactual scenarios (given that we also have this value for evaluation) and the size of these intervals.

df_val = (
    df_test.assign(pred_lower_t_0=y_test_interval_pred_cqr_t0[:, 0])
    .assign(pred_upper_t_0=y_test_interval_pred_cqr_t0[:, 1])
    .assign(ic_size_t_0=lambda df_: df_.pred_upper_t_0 - df_.pred_lower_t_0)
    .assign(pred_lower_t_1=y_test_interval_pred_cqr_t1[:, 0])
    .assign(pred_upper_t_1=y_test_interval_pred_cqr_t1[:, 1])
    .assign(ic_size_t_1=lambda df_: df_.pred_upper_t_1 - df_.pred_lower_t_1)
    .assign(
        prob=lambda df_: positivity_assumption_check_estimator.predict_proba(
            df_.filter(backdoor_set)
        )[:, 1]
    )
    .assign(prob_cut=lambda df_: pd.cut(df_.prob, bins=np.linspace(0, 1, 6)))
    .assign(
        coverage=lambda df_: np.select(
            condlist=[df_.treatment == 0],
            choicelist=[
                (df_.target > df_.pred_lower_t_0) & (df_.target < df_.pred_upper_t_0)
            ],
            default=(df_.target > df_.pred_lower_t_1)
            & (df_.target < df_.pred_upper_t_1),
        )
    )
    .assign(
        coverage_cf=lambda df_: np.select(
            condlist=[df_.treatment != 0],
            choicelist=[
                (df_.target_cf > df_.pred_lower_t_0)
                & (df_.target_cf < df_.pred_upper_t_0)
            ],
            default=(df_.target_cf > df_.pred_lower_t_1)
            & (df_.target_cf < df_.pred_upper_t_1),
        )
    )
)

df_val.coverage.mean()

0.9497

It's important to highlight that conformal prediction guarantees marginal coverage, which doesn't always translate into conditional coverage [1]. We could be generating overly wide intervals for certain regions of $Z$ and overly narrow ones for the rest and still have good marginal coverage, because the over- and under-coverage would cancel out. To examine this, we would need to study

\[P((Y|Z=z)\in \tau(Z=z) \,|\, T=t, Z=z),\]

where $\tau(Z=z)$ is the prediction set for $Z=z$.

One method to visualize this is by partitioning, for instance, the regions using $P(T=1 | Z=z)$ (from the same model as used in the positivity assumption check) to construct buckets where we can calculate coverage estimates, i.e., the mean of $(Y|Z=z)\in \tau(Z=z)$. If we further break it down by treatment, we will be measuring something similar to the conditional coverage.

from scipy.stats import bootstrap

def bootstrap_ci(x, ci=0.95):
    boot = bootstrap((x,), np.mean, confidence_level=ci)
    return np.round(boot.confidence_interval, 5)

df_val_cond_aux1 = (
    df_val.groupby(["prob_cut", "treatment"])
    .coverage.apply(bootstrap_ci)
    .to_frame()
    .rename(columns={"coverage": "coverage_confidence_interval"})
)

df_val_cond_aux2 = (
    df_val.groupby(["prob_cut", "treatment"])
    .coverage_cf.apply(bootstrap_ci)
    .to_frame()
    .rename(columns={"coverage_cf": "coverage_cf_confidence_interval"})
)

df_val_cond_aux3 = (
    df_val.groupby(["prob_cut", "treatment"])
    .agg(
        {
            "coverage": "mean",
            "coverage_cf": "mean",
            "ic_size_t_0": "mean",
            "ic_size_t_1": "mean",
        }
    )
    .rename(columns=lambda col: col + "_mean")
)

pd.concat(
    [df_val_cond_aux1, df_val_cond_aux2, df_val_cond_aux3], axis=1
).reset_index().sort_values(["treatment", "prob_cut"])

	prob_cut	treatment	coverage_confidence_interval	coverage_cf_confidence_interval	coverage_mean	coverage_cf_mean	ic_size_t_0_mean	ic_size_t_1_mean
0	(0.0, 0.2]	0.0	[0.93669, 0.95635]	[0.93046, 0.95108]	0.947242	0.941487	1.980407	6.920811
2	(0.2, 0.4]	0.0	[0.94852, 0.96972]	[0.93565, 0.95988]	0.959879	0.948524	2.032159	4.205569
4	(0.4, 0.6]	0.0	[0.93891, 0.96946]	[0.94024, 0.96946]	0.956175	0.956175	2.054778	2.642784
6	(0.6, 0.8]	0.0	[0.92321, 0.96071]	[0.94464, 0.97679]	0.944643	0.962500	1.949311	4.066495
8	(0.8, 1.0]	0.0	[0.91579, 0.96842]	[0.92982, 0.97895]	0.947368	0.957895	2.180449	6.671260
1	(0.0, 0.2]	1.0	[0.90459, 0.96466]	[0.92226, 0.97527]	0.939929	0.954064	1.985375	6.573730
3	(0.2, 0.4]	1.0	[0.8998, 0.94499]	[0.91749, 0.95874]	0.925344	0.941061	2.043951	4.025180
5	(0.4, 0.6]	1.0	[0.93103, 0.96296]	[0.93103, 0.96296]	0.948914	0.948914	2.054981	2.636064
7	(0.6, 0.8]	1.0	[0.94165, 0.96353]	[0.9329, 0.95697]	0.953319	0.945295	1.973618	4.291990
9	(0.8, 1.0]	1.0	[0.94, 0.95902]	[0.94049, 0.95951]	0.950244	0.950732	2.214706	6.862401

Indeed, the conditional coverage also looks reasonable: it stays very close to 95%, the coverage requested from ConformalizedQuantileRegression. Even in regions with fewer examples with treatment $T=0$ (for instance, where prob_cut=(0.8, 1.0]), the coverage remains close to the target.

$\oint$ Since $P((Y|Z=z)\in \tau(Z=z) \,|\, T=t, Z=z)$ shares many characteristics of a classification problem, another viable strategy might be to explore what the probabilistic output of a classifier, tasked with predicting the coverage, would yield.

probs_coverage = (
    LogisticRegression()
    .fit(df_val.filter(["treatment", "confounder"]), df_val.coverage.astype(int))
    .predict_proba(df_val.filter(["treatment", "confounder"]))[:, 1]
)

roc_auc_score(df_val.coverage.astype(int), probs_coverage)

0.5152119817684396

By executing this, we can observe that the classifier is incapable of identifying regions where there is poor coverage. We can see that the minimum of these estimated conditional probabilities (without extensive verification of calibration) remains reasonably high.

min(probs_coverage), max(probs_coverage)

(0.9405329900858612, 0.9577096822516356)

$\oint$ It's also common to evaluate the conditional coverage in relation to the size of the predicted interval (partitioning the intervals into "small", "medium", and "large") [1]. In a real application, I would undertake this, but I wish to avoid overloading this text with code, so the above already illustrates the exercise adequately.
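For reference, a minimal version of that check could look like the sketch below: sort the intervals by width, split them into equally sized groups, and compute the coverage within each group. It is plain Python on toy numbers, and the function name `coverage_by_interval_size` is hypothetical.

```python
def coverage_by_interval_size(lower, upper, y, n_groups=3):
    """Coverage within groups of intervals ordered from narrowest to widest."""
    records = sorted(
        ((hi - lo, lo < yi < hi) for lo, hi, yi in zip(lower, upper, y)),
        key=lambda record: record[0],
    )
    n = len(records)
    if n < n_groups:
        raise ValueError("Need at least one interval per group.")

    coverages = []
    for g in range(n_groups):
        chunk = records[g * n // n_groups : (g + 1) * n // n_groups]
        coverages.append(sum(covered for _, covered in chunk) / len(chunk))
    return coverages


# Toy intervals: two narrow, two medium, two wide.
lower_toy = [0.0] * 6
upper_toy = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
y_toy = [0.5, 1.5, 1.0, 1.5, 2.0, 5.0]
coverage_groups = coverage_by_interval_size(lower_toy, upper_toy, y_toy)
print(coverage_groups)  # [0.5, 1.0, 0.5]
```

Coverage that differs markedly between the "small", "medium", and "large" groups would indicate poor conditional coverage even when the marginal coverage looks fine.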

Joining confidence intervals

While our estimates appear to be coherent, what we ultimately aim to estimate is what happens when we subtract the predicted intervals. Combining intervals while maintaining coverage isn't a straightforward task. Let's delve into this scenario a bit more.

Let's assume we have two random variables with given probabilities of being within certain intervals:

\[\mathbb{P}(A \in (m_a, M_a)) \geq 1 - \alpha, \mathbb{P}(B \in (m_b, M_b)) \geq 1 - \beta.\]

Observe that the intersection of these two events implies that the sum of the random variables lies within the interval derived from the summation of the ends of the intervals. In other words,

\[\{A \in (m_a, M_a)\} \cap \{ B \in (m_b, M_b)\} \subset \{A + B \in (m_a + m_b, M_a + M_b)\}.\]

By monotonicity of probability, the probability of an event contained in another is bounded by the probability of the larger event, so

\[\mathbb{P}(\{A \in (m_a, M_a)\} \cap \{ B \in (m_b, M_b)\}) \leq \mathbb{P}(\{A + B \in (m_a + m_b, M_a + M_b)\}).\]

From here, let's develop an inequality starting from the left term. The probability of the complement can be calculated as

\[\begin{align*} \mathbb{P}(\left(\{A \in (m_a, M_a)\} \cap \{ B \in (m_b, M_b)\}\right)^C) &= \mathbb{P}(\{A \in (m_a, M_a)\}^C \cup \{ B \in (m_b, M_b)\}^C)\\ &\leq \mathbb{P}(\{A \in (m_a, M_a)\}^C) + \mathbb{P}(\{ B \in (m_b, M_b)\}^C), \end{align*}\]

using De Morgan's laws and an overestimation of the probability of the union as the sum of the probabilities.

Following this, we can conclude that

\[\begin{align*} \mathbb{P}(\left(\{A \in (m_a, M_a)\} \cap \{ B \in (m_b, M_b)\}\right)^C) &\leq 1 - \mathbb{P}(\{A \in (m_a, M_a)\}) + 1 - \mathbb{P}(\{ B \in (m_b, M_b)\})\\ &\leq 1 - (1 - \alpha) + 1 - (1 - \beta) = \alpha + \beta. \end{align*}\]

$\oint$ This inequality is loose because $\{A \in (m_a, M_a)\}^C $ and $ \{ B \in (m_b, M_b)\}^C$ may overlap substantially. When we bound the probability of the union by the sum of the probabilities, we ignore this intersection, which amounts to treating the two events as disjoint.

Since

\[\mathbb{P}(\left(\{A \in (m_a, M_a)\} \cap \{ B \in (m_b, M_b)\}\right)^C) \leq \alpha + \beta,\]

we find

\[\mathbb{P}(\left(\{A \in (m_a, M_a)\} \cap \{ B \in (m_b, M_b)\}\right)) = 1 - \mathbb{P}(\left(\{A \in (m_a, M_a)\} \cap \{ B \in (m_b, M_b)\}\right)^C) \geq 1 - (\alpha + \beta).\]

From this, we can deduce that since

\[\mathbb{P}(\{A \in (m_a, M_a)\} \cap \{ B \in (m_b, M_b)\}) \leq \mathbb{P}(\{A + B \in (m_a + m_b, M_a + M_b)\}),\]

we obtain an inequality for the interval resulting from the sum of the ends of the initial intervals:

\[\mathbb{P}(\{A + B \in (m_a + m_b, M_a + M_b)\}) \geq 1 - (\alpha + \beta).\]

$\oint$ This method is generally used in hypothesis testing with a Bonferroni correction derived from Boole's inequality [11].
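A quick Monte Carlo check makes the bound concrete. The sketch below assumes, purely for illustration, that $A$ and $B$ are independent standard normal variables, each with its central 95% interval. The sum then follows $\mathcal{N}(0, 2)$, and the summed interval should cover it with probability of about 0.994, well above the guaranteed $1 - (0.05 + 0.05) = 0.9$.

```python
import random
from statistics import NormalDist

rng = random.Random(0)
z_975 = NormalDist().inv_cdf(0.975)  # about 1.96
n_draws = 20_000

hits_a = hits_b = hits_sum = 0
for _ in range(n_draws):
    a = rng.gauss(0.0, 1.0)
    b = rng.gauss(0.0, 1.0)
    hits_a += -z_975 < a < z_975
    hits_b += -z_975 < b < z_975
    # Interval for the sum: add the lower ends and add the upper ends.
    hits_sum += -2 * z_975 < a + b < 2 * z_975

coverage_a = hits_a / n_draws
coverage_b = hits_b / n_draws
coverage_sum = hits_sum / n_draws
print(coverage_a, coverage_b, coverage_sum)
```

The gap between the guaranteed and the simulated coverage depends on the dependence between $A$ and $B$, which the union bound deliberately ignores.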

Prediction interval of CATE

In our particular scenario, we are working with $A = \mathbb{E}(Y|do(T=1), Z=z)$ and $B = - \mathbb{E}(Y|do(T=0), Z=z)$. As a result, the limits of the interval for $B$ are the ones we have in y_test_interval_pred_cqr_t0, negated and swapped: if the interval for $T=0$ is $(m_0, M_0)$, the interval for $B$ is $(-M_0, -m_0)$.
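In code, subtracting two intervals amounts to pairing opposite ends. The helper below is a minimal sketch with a hypothetical name, applied to made-up numbers.

```python
def subtract_intervals(interval_1, interval_0):
    """Interval for A - B given intervals for A and B.

    The lower end pairs the lower end of A with the upper end of B,
    and the upper end pairs the upper end of A with the lower end of B.
    """
    lower_1, upper_1 = interval_1
    lower_0, upper_0 = interval_0
    return lower_1 - upper_0, upper_1 - lower_0


# Toy example: interval (1, 3) under treatment and (0, 1) under control.
cate_interval = subtract_intervals((1.0, 3.0), (0.0, 1.0))
print(cate_interval)  # (0.0, 3.0)
```

The width of the result is the sum of the two original widths, which is why the combined intervals are necessarily wider than either input.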

Once again, it would be valuable to assess the coverage and size of the intervals that we have now created.

df_val_cate = (
    df_val.assign(
        cate_actual=lambda df_: np.select(
            condlist=[(df_.treatment == 0)],
            choicelist=[df_.target_cf - df_.target],
            default=df_.target - df_.target_cf,
        )
    )
    .assign(cate_ci_lower=lambda df_: df_.pred_lower_t_1 - df_.pred_upper_t_0)
    .assign(cate_ci_upper=lambda df_: df_.pred_upper_t_1 - df_.pred_lower_t_0)
    .assign(cate_ci_size=lambda df_: df_.cate_ci_upper - df_.cate_ci_lower)
    .assign(
        coverage_cate=lambda df_: (df_.cate_actual > df_.cate_ci_lower)
        & (df_.cate_actual < df_.cate_ci_upper)
    )
)

As expected, the prediction intervals are larger than the ones found earlier.

fig, ax = plt.subplots(ncols=3, figsize=(9, 2))
aux_hist = np.hstack([df_val.ic_size_t_0, df_val.ic_size_t_1])
min_hist, max_hist = np.min(aux_hist), np.max(aux_hist)
ax[0].hist(
    df_val.ic_size_t_0,
    bins=np.linspace(min_hist, max_hist, 16),
    weights=np.ones_like(df_val.ic_size_t_0) / df_val.shape[0],
)
ax[1].hist(
    df_val.ic_size_t_1,
    bins=np.linspace(min_hist, max_hist, 16),
    weights=np.ones_like(df_val.ic_size_t_1) / df_val.shape[0],
)
ax[2].hist(
    df_val_cate.cate_ci_size,
    bins=16,
    weights=np.ones_like(df_val_cate.cate_ci_size) / df_val_cate.shape[0],
)
ax[0].set_title(
    r"Histogram of interval size for $\mathbb{E}(Y | do(T=0), Z=z)$", fontsize="medium"
)
ax[1].set_title(
    r"Histogram of interval size for $\mathbb{E}(Y | do(T=1), Z=z)$", fontsize="medium"
)
ax[2].set_title("Histogram of interval size for CATE(Z=z)", fontsize="medium")
plt.tight_layout()

Even though each individual prediction interval was constructed for a coverage of $1 - \alpha = 0.95$, the bound derived above only guarantees a coverage of at least $1 - (0.05 + 0.05) = 0.9$ for the CATE intervals. However, as we discussed before, this bound is loose, and the actual coverage is substantially higher.

df_val_cate.coverage_cate.mean()

0.9997
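A back-of-the-envelope calculation suggests why the coverage ends up so close to 1 in this synthetic setting. In `generate_data` the same noise draw $u_Y$ enters both potential outcomes, so $Y_1 - Y_0 = 10\cos(z) - 10\sin(z) + 0.5\,|z|\,u_Y$. The sketch below assumes oracle intervals, meaning each arm's interval equals its true central 95% interval, which is an idealisation of the fitted models. It averages the resulting CATE coverage over $Z \sim \textrm{Uniform}(-\pi, \pi)$.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()
z_975 = std_normal.inv_cdf(0.975)  # about 1.96


def oracle_cate_coverage(z):
    """Coverage of the combined interval at Z=z under oracle per-arm intervals."""
    # Half-widths of the two arms add up when the intervals are subtracted.
    half_width = z_975 * 0.5 * (1 + abs(z)) + z_975 * 0.5
    # Y_1 - Y_0 only keeps the 0.5 * |z| * u_Y part of the noise.
    noise_std = 0.5 * abs(z)
    return 2 * std_normal.cdf(half_width / noise_std) - 1


# Midpoint grid over (-pi, pi); it never hits z = 0 exactly.
n_grid = 10_000
midpoints = [-math.pi + 2 * math.pi * (i + 0.5) / n_grid for i in range(n_grid)]
expected_coverage = sum(oracle_cate_coverage(z) for z in midpoints) / n_grid
print(expected_coverage)
```

Under these assumptions the expected coverage is well above 0.99, in line with the empirical value above. With independent noise in the two arms, the gap to the 0.9 guarantee would be smaller.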

Given that we are dealing with $Z\in\mathbb{R}$, we can visually evaluate our conformal estimator by plotting the prediction intervals for the meta-estimators and for our estimate of the CATE. In addition, since we have control over the noise variance, we can also plot the real 95% confidence interval.

confounder_plot = np.linspace(XZ_test.confounder.min(), XZ_test.confounder.max(), 10_001)
ci_t1_plot = model_t1.predict(confounder_plot.reshape(-1, 1)).T
ci_t0_plot = model_t0.predict(confounder_plot.reshape(-1, 1)).T
ci_cate_plot = ci_t1_plot - ci_t0_plot[::-1,]

fig, ax = plt.subplots(figsize=(8, 4))

ax.plot(
    confounder_plot,
    func_0(confounder_plot) + 1.96 * 0.5,
    alpha=0.5,
    c="C0",
    label=r"Real confidence interval for $\mathbb{E}(Y | do(T=0), Z=z)$",
)
ax.plot(confounder_plot, func_0(confounder_plot) - 1.96 * 0.5, alpha=0.5, c="C0")

ax.plot(
    confounder_plot,
    func_1(confounder_plot) + 1.96 * (0.5 * (1 + np.abs(confounder_plot))),
    alpha=0.5,
    c="C1",
    label=r"Real confidence interval for $\mathbb{E}(Y | do(T=1), Z=z)$",
)
ax.plot(
    confounder_plot,
    func_1(confounder_plot) - 1.96 * (0.5 * (1 + np.abs(confounder_plot))),
    alpha=0.5,
    c="C1",
)

# The noise scale (standard deviation) of CATE(Z=z) is 0.5 * |z| because the
# term related to 1 * u_Y cancels when we compute
# \mathbb{E}(g_Y(u_Y, z, 1)) - \mathbb{E}(g_Y(u_Y, z, 0)).
ax.plot(
    confounder_plot,
    func_1(confounder_plot)
    - func_0(confounder_plot)
    + 1.96 * (0.5 * (np.abs(confounder_plot))),
    alpha=0.5,
    c="C2",
    label="Confidence interval for CATE(Z=z)",
)
ax.plot(
    confounder_plot,
    func_1(confounder_plot)
    - func_0(confounder_plot)
    - 1.96 * (0.5 * (np.abs(confounder_plot))),
    alpha=0.5,
    c="C2",
)

ax.fill_between(
    confounder_plot,
    *ci_t0_plot,
    alpha=0.5,
    label=r"Prediction interval for $\mathbb{E}(Y | do(T=0), Z=z)$",
    color="C0",
)
ax.fill_between(
    confounder_plot,
    *ci_t1_plot,
    alpha=0.5,
    label=r"Prediction interval for $\mathbb{E}(Y | do(T=1), Z=z)$",
    color="C1",
)
ax.fill_between(
    confounder_plot,
    *ci_cate_plot,
    alpha=0.5,
    label="Prediction interval for CATE(Z=z)",
    color="C2",
)

ax.set_xlabel("z")
ax.legend()
plt.tight_layout()

In fact, all our prediction intervals seem to align closely with the theoretical confidence intervals, with the exception of the CATE interval, which we overestimate.

Final considerations

The CATE is an extremely interesting quantity to have in various scenarios of applied causal inference. The ability to integrate the concepts of conformal prediction into CATE estimation serves as a powerful tool, ensuring that we leverage the full potential of uncertainty quantification in our analyses and decisions. In this exploration, Conformalized Quantile Regression demonstrated its aptitude as a robust method for estimating the CATE while also offering reliable uncertainty quantification despite some overestimation.

$\oint$ After writing this post, I took a closer look at the discussions connecting causal inference with conformal predictions and found the article Conformal Inference of Counterfactuals and Individual Treatment Effects very interesting. There, they also experiment with variations of CQR, but with the doubly robust estimator. They seem to pay special attention to the scenario of conformal prediction with covariate shift — the exact scenario we are addressing here — and demonstrate heightened caution when deploying CQR in this context. In this post, I only implemented a sample_weight that is also used when calculating the quantiles of the conformal prediction calibration set.

Bibliography

[1] A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. Anastasios N. Angelopoulos, Stephen Bates.

[2] Class notes on Causal Inference (PTBR). Rafael Bassi Stern.

[3] Causal Inference for The Brave and True. Matheus Facure.

[4] Causal Inference Course. Brady Neal.

[5] Causal Inference on Observational Data: It's All About the Assumptions. Jean-Yves Gérardy.

[6] Probability calibration. Scikit-Learn User Guide.

[7] Estimating conditional quantiles with the help of the pinball loss. Ingo Steinwart, Andreas Christmann.

[8] How to Predict Risk-Proportional Intervals with Conformal Quantile Regression. Samuele Mazzanti.

[9] T-learners, S-learners and X-learners. Statistical Odds & Ends.

[10] Analysis of Kernel Mean Matching under Covariate Shift. Yaoliang Yu, Csaba Szepesvari.

[11] Bonferroni correction. Wikipedia.

You can find all files and environments for reproducing the experiments in the repository of this post.

Conditional Density Estimation

2023-06-16T00:00:00+00:00

Typically, when we seek to model the relationship between a target variable $Y\in\mathbb{R}$ and one or more covariates $X$, our goal is to establish a conditional-expectation type association. Mathematically, if we define our loss as the mean squared error, our explicit aim is to identify the function $\mathbb{E} \left( Y \,|\, X=x\right)$. This function intuitively gives a prediction of the average value of $Y$ given that the covariates are $X=x$. Despite the straightforward and simplified summary provided by point estimates, they often fail to encapsulate the inherent intricacies and uncertainties prevalent in most real-world predictive scenarios. This prompts us to ask: Is the variance around this average value extensive, or can we confidently anticipate the value to be in close proximity to the predicted one?

Diverging from the conventional approach of a single point estimation, Conditional Density Estimation (CDE) aims to understand the plausibility of an entire range of potential outcomes given specific input data. In mathematical terms, we are estimating the probability density function $f \left( y \,|\, X=x \right)$.

The holistic nature of CDE affords a deeper understanding of data characteristics and proves beneficial in addressing two fundamental aspects: evaluating model trustworthiness and accommodating multi-modal outcomes.

Model trustworthiness: Unlike point estimation predictions, which offer no insight into their own reliability or uncertainty, CDE provides a full distribution of potential outcomes, thereby inherently conveying information about prediction confidence. The variance of the predicted distribution can act as a measure of uncertainty or confidence, affording users a more comprehensive understanding of the predictions. Such an understanding proves critical when making decisions based on these predictions. For instance, in the healthcare sector, a prediction about patient outcomes accompanied by an understanding of its confidence or uncertainty could lead to more informed and suitable medical decisions.
Multi-modal outcomes: Traditional regression or classification problems, generally focused on single point predictions, often fall short in capturing the full complexity of real-world phenomena. This shortfall becomes particularly apparent when a single input could feasibly yield multiple valid outputs, a situation termed multi-modality. Consider the task of predicting a salary from certain features when we are unsure whether the individual resides in a state with a high or a low average salary. In such a context, a more nuanced salary estimate shouldn't merely be an average drawn from both regions. Rather, it would be more fitting to present a bi-modal distribution with two distinct peaks. Each peak would denote a plausible salary range for the individual, depending on whether they live in one state or the other.
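To make the salary example concrete, here is a minimal sketch with made-up numbers: an equal-weight mixture of two Gaussians stands in for the two states. The means (40 and 80) and the scale (5) are purely illustrative assumptions. The point estimate under squared loss is the mixture mean, which lands in a region where almost no salary is actually observed.

```python
import numpy as np


def normal_pdf(y, loc, scale):
    """Density of a normal distribution with the given mean and standard deviation."""
    return np.exp(-0.5 * ((y - loc) / scale) ** 2) / (scale * np.sqrt(2 * np.pi))


def mixture_pdf(y):
    # Equal-weight mixture of two hypothetical "state" salary distributions.
    return 0.5 * normal_pdf(y, 40.0, 5.0) + 0.5 * normal_pdf(y, 80.0, 5.0)


mixture_mean = 0.5 * 40.0 + 0.5 * 80.0  # what a squared-loss point estimate targets
density_at_mean = mixture_pdf(mixture_mean)
density_at_peak = mixture_pdf(40.0)

print(mixture_mean, density_at_mean / density_at_peak)
```

The ratio printed at the end compares the density at the mean with the density at one of the peaks; under these assumptions it is tiny, which is exactly the information a single point prediction hides.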

$\oint$ The field of conformal prediction aims to address this uncertainty by estimating prediction sets $\tau(X=x)$, such that $\mathbb{P}\left(\left(Y\,|\,X=x\right) \in \tau(X=x)\right) \geq 1 - \alpha$ for a desired coverage level $1 - \alpha$ [1]. Interpreting the prediction sets, for instance by inspecting their size, begins to address some of the questions we raised earlier. However, in regression tasks, the prediction set is usually framed as an interval. Having only the interval extremes, which naturally attempt to estimate conditional quantiles, does not fully portray the uncertainty associated with the prediction. This limitation is particularly evident when dealing with multi-modal densities. Likewise, if you have a utility metric associated with your predictions and aim to examine the average utility for an individual, the logical approach would be to integrate the utility against the individual's probability density.
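As a rough illustration of that last point, the sketch below approximates an expected utility by integrating a utility function against a density on a grid. Both ingredients are assumptions made for the example: a Gaussian stands in for an estimated $\hat{f}(y \,|\, X=x)$, and the quadratic utility is hypothetical.

```python
import numpy as np


def normal_pdf(y, loc, scale):
    """Density of a normal distribution with the given mean and standard deviation."""
    return np.exp(-0.5 * ((y - loc) / scale) ** 2) / (scale * np.sqrt(2 * np.pi))


def density(y):
    # Stand-in for an estimated conditional density of one individual.
    return normal_pdf(y, loc=1.0, scale=0.5)


def utility(y):
    # Hypothetical utility attached to each possible outcome.
    return y**2


def expected_utility(utility, density, y_grid):
    """Approximate the integral of utility(y) * density(y) with the trapezoidal rule."""
    values = utility(y_grid) * density(y_grid)
    return np.sum((values[1:] + values[:-1]) / 2 * np.diff(y_grid))


y_grid = np.linspace(-4, 6, 2001)
value = expected_utility(utility, density, y_grid)
print(value)
```

For this particular choice the exact answer is known, $\mathbb{E}(Y^2) = 1^2 + 0.5^2 = 1.25$, which gives a simple sanity check on the numerical integral.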

Creating the dataset

Let's construct a simple illustrative problem to explore the application of non-parametric techniques in the context of CDE. Consider a data generating process of the following form:

\[X\sim\textrm{Uniform}(0, 1),\] \[\left(Y \,|\, X=x\right) \sim \sin\left(2\pi x\right) + \mathcal{N}\left(0, \sigma\left(x\right)\right),\]

where the standard deviation is $\sigma(x) = 0.3 + 0.25 \sin(2\pi x)$.

In this instance, $X$ is one-dimensional primarily for the purpose of visualization, although our discussion is applicable regardless of the dimensionality of $X$.

def mean_function(X):
    return np.sin(2 * np.pi * X)

def deviation_function(X):
    return 0.3 + 0.25 * mean_function(X)

def generate_data_with_normal_noise(
    mean_generator, deviation_generator, size=5_000, random_state=None
):
    def normal_noise_generator(X, deviation_generator, random_state=None):
        noise = np.random.RandomState(random_state).normal(
            loc=0, scale=deviation_generator(X)
        )
        return noise

    rs = np.random.RandomState(random_state).randint(
        0, 2**32 - 1, dtype=np.int64, size=2
    )
    X = np.random.RandomState(rs[0]).uniform(size=size)
    y_pred = mean_generator(X=X)
    noise = normal_noise_generator(
        X=X, deviation_generator=deviation_generator, random_state=rs[1]
    )
    y_pred_noisy = y_pred + noise

    return X, y_pred_noisy

X, y = generate_data_with_normal_noise(
    mean_generator=mean_function,
    deviation_generator=deviation_function,
    random_state=42,
)

By the design of the data, the conditional density is influenced by the covariates in both the mean and the variance.

x_grid = np.linspace(0, 1, 1000)

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(x_grid, mean_function(x_grid), color="C0", label="Mean function")
ax.fill_between(
    x_grid,
    mean_function(x_grid) - 1.96 * deviation_function(x_grid),
    mean_function(x_grid) + 1.96 * deviation_function(x_grid),
    color="C0",
    alpha=0.2,
    label="95% confidence interval given x",
)
ax.scatter(X, y, s=2, color="C0", alpha=0.2)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Scatter plot of generated data")
ax.legend()
plt.tight_layout()

Histograms

The task of density estimation may initially seem daunting, but in reality, it becomes quite intuitive once we recognize that a histogram (normalized to have an integral of 1) is effectively a technique aimed at achieving this objective. By counting the number of examples in each bin, we discretize the distribution, enabling us to estimate the probability of the regions and thus obtain a "low-resolution" density estimation.

However, employing all samples only yields a density estimate of $Y$ without imposing any condition on $X$.

We can easily condition this strategy on $X=x$ by only including points in proximity to $X=x$ when generating the histogram that will represent the conditional density. The definition of "proximity" can be flexible. For instance, we could use a strategy like sklearn.neighbors.NearestNeighbors.radius_neighbors, which selects only the examples that reside within a radius of $\varepsilon$ from point $x$, or we could select a fixed number of nearest neighbors using a method like sklearn.neighbors.NearestNeighbors.kneighbors.

from scipy.stats import rv_histogram, norm

hist = np.histogram(y, bins=np.linspace(-1.5, 3, 51))
hist_dist = rv_histogram(hist)

def plot_conditional_y_using_near_data(ax, x_value, X, y, c, eps=0.05):
    ax.plot(
        y_grid_refined,
        norm(loc=mean_function(x_value), scale=deviation_function(x_value)).pdf(
            y_grid_refined
        ),
        "--",
        color=c,
        label=f"real $f(y | x = {x_value})$",
    )
    ax.hist(
        y[(X < x_value + eps) & (X > x_value - eps)],
        alpha=0.3,
        bins=y_grid,
        density=True,
        color=c,
        label=f"estimated $f(y | x = {x_value})$ using near data",
    )

min_y, max_y = min(y), max(y)
y_grid = np.linspace(min_y, max_y, 20)
y_grid_refined = np.linspace(min_y, max_y, 1000)

fig, ax = plt.subplots(ncols=2, figsize=(10, 3))
ax[0].bar(y_grid, hist_dist.pdf(y_grid), label="estimated $f(y)$")
ax[0].set_title("Density of y")
ax[0].set_xlabel("y")
ax[0].legend()

plot_conditional_y_using_near_data(ax=ax[1], x_value=0.2, X=X, y=y, c="C1")
plot_conditional_y_using_near_data(ax=ax[1], x_value=0.6, X=X, y=y, c="C2")
ax[1].set_title("Conditional density of y given X=x")
ax[1].set_xlabel("y")
ax[1].legend()
plt.tight_layout()

Kernel Density Estimation

While histograms serve as excellent baselines, they can pose challenges for more intricate distributions. Determining the appropriate number of bins can prove difficult, and we may end up with stair-step functions that aren't the most manageable to work with.

In general, the problem of non-parametric density estimation is frequently tackled using Kernel Density Estimation (KDE), and it is logical to use it here too, aligning it with a strategy to convert the problem into a conditional estimation. The essential concept of KDE is to place "bumps" around observed points (shaped like a Gaussian, for instance) and then sum these bumps, normalizing them to yield a density estimate.
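The "sum of bumps" idea fits in a few lines of NumPy. This is only a sketch for intuition, assuming a Gaussian kernel and a hand-picked bandwidth; the function name `gaussian_kde_pdf` is made up for the example and is not how sklearn.neighbors.KernelDensity is implemented internally.

```python
import numpy as np


def gaussian_kde_pdf(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at every grid point."""
    # Standardized distances between grid points and samples: (n_grid, n_samples).
    u = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * u**2) / (bandwidth * np.sqrt(2 * np.pi))
    # Averaging (instead of summing) keeps the total area equal to 1.
    return bumps.mean(axis=1)


rng = np.random.default_rng(0)
samples = rng.normal(size=200)  # toy data
grid = np.linspace(-6, 6, 1201)

pdf = gaussian_kde_pdf(samples, grid, bandwidth=0.4)
# Trapezoidal rule to check that the estimate integrates to roughly 1.
area = np.sum((pdf[1:] + pdf[:-1]) / 2 * np.diff(grid))
print(area)
```

Each bump integrates to 1, so their average does too, and the estimate is non-negative by construction.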

$\oint$ The nature of the bump (which is called a kernel) and the width (bandwidth) of these bumps are hyperparameters that can be tuned using cross-validation with a likelihood-style metric to assess the likelihood of a test sample having been drawn from your estimated density [2].
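A minimal sketch of that tuning, assuming toy Gaussian data and an arbitrary bandwidth grid: since KernelDensity.score returns the total log-likelihood of the held-out data, GridSearchCV can use it directly without a custom scorer.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
samples = rng.normal(size=(300, 1))  # toy data, one feature

search = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    # Illustrative grid of candidate bandwidths (an assumption of this sketch).
    param_grid={"bandwidth": np.logspace(-1, 0.5, 10)},
    cv=5,
)
# With no `scoring` argument, the held-out log-likelihood from
# KernelDensity.score is used to rank the candidates.
search.fit(samples)

best_bandwidth = search.best_params_["bandwidth"]
print(best_bandwidth)
```

The kernel shape could be added to `param_grid` in the same way, through the `kernel` parameter.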

To condition our KDE, we can once again use a neighbor search. Utilizing sklearn.neighbors.NearestNeighbors and sklearn.neighbors.KernelDensity (without being overly concerned about this model's hyperparameters), we can identify the neighbors closest to a specific point, say $X=0.2$, and then estimate the density using these neighbors.

from sklearn.neighbors import NearestNeighbors, KernelDensity

x_value = 0.2
knn = NearestNeighbors(n_neighbors=100).fit(X.reshape(-1, 1))
_, ind_x_value = knn.kneighbors([[x_value]])

kde = KernelDensity(kernel="gaussian", bandwidth="scott").fit(
    y[ind_x_value].reshape(-1, 1)
)

fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(
    y_grid_refined,
    norm(loc=mean_function(x_value), scale=deviation_function(x_value)).pdf(
        y_grid_refined
    ),
    "--",
    color="C0",
    label=f"real $f(y | x = {x_value})$",
)
ax.hist(
    y[ind_x_value].ravel(),
    alpha=0.3,
    bins=y_grid,
    density=True,
    color="C0",
    label=f"estimated $f(y | x = {x_value})$ using a histogram of nearest neighbors",
)
ax.plot(
    y_grid_refined,
    np.exp(kde.score_samples(y_grid_refined.reshape(-1, 1))),
    color="C0",
    label=f"estimated $f(y | x = {x_value})$ using a kde with nearest neighbors",
)
ax.set_title("Conditional density of y given X=x")
ax.set_xlabel("y")
ax.set_ylim(0, 1.3)
ax.legend()
plt.tight_layout()

Notice that this method provides a much smoother estimate compared to the histogram.

We can encapsulate this logic within a class that tries to follow the scikit-learn API, so that the .predict method applies the aforementioned logic for each requested value. In other words, it initially searches for the neighbors, and then employs a KDE to obtain the estimates for each example.

from sklearn.base import BaseEstimator, clone

class ConditionalNearestNeighborsKDE(BaseEstimator):
    """Conditional Kernel Density Estimation using nearest neighbors.

    This class implements a Conditional Kernel Density Estimation by applying
    the Kernel Density Estimation algorithm after a nearest neighbors search.

    It allows the use of user-specified nearest neighbor and kernel density
    estimators or, if not provided, defaults will be used.

    Parameters
    ----------
    nn_estimator : NearestNeighbors instance, default=None
        A pre-configured instance of a `~sklearn.neighbors.NearestNeighbors` class
        to use for finding nearest neighbors. If not specified, a
        `~sklearn.neighbors.NearestNeighbors` instance with `n_neighbors=100`
        will be used.

    kde_estimator : KernelDensity instance, default=None
        A pre-configured instance of a `~sklearn.neighbors.KernelDensity` class
        to use for estimating the kernel density. If not specified, a
        `~sklearn.neighbors.KernelDensity` instance with `bandwidth="scott"`
        will be used.
    """

    def __init__(self, nn_estimator=None, kde_estimator=None):
        self.nn_estimator = nn_estimator
        self.kde_estimator = kde_estimator

    def fit(self, X, y=None):
        if self.nn_estimator is None:
            self.nn_estimator_ = NearestNeighbors(n_neighbors=100)
        else:
            self.nn_estimator_ = clone(self.nn_estimator)
        self.nn_estimator_.fit(X, y)
        self.y_train_ = y
        return self

    def predict(self, X):
        """Predict the conditional density estimation of new samples.

        The predicted density of the target for each sample in X is returned.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Vector to be estimated, where `n_samples` is the number of samples
            and `n_features` is the number of features.

        Returns
        -------
        kernel_density_list : list of len n_samples of KernelDensity instances
            Estimated conditional density estimations in the form of
            `~sklearn.neighbors.KernelDensity` instances.
        """
        _, ind_X = self.nn_estimator_.kneighbors(X)
        if self.kde_estimator is None:
            kernel_density_list = [
                KernelDensity(bandwidth="scott").fit(self.y_train_[ind].reshape(-1, 1))
                for ind in ind_X
            ]
        else:
            kernel_density_list = [
                clone(self.kde_estimator).fit(self.y_train_[ind].reshape(-1, 1))
                for ind in ind_X
            ]
        return kernel_density_list

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

ckde = ConditionalNearestNeighborsKDE().fit(X_train.reshape(-1, 1), y_train)
ckde_preds = ckde.predict(X_test.reshape(-1, 1))

Evaluation metrics for conditional density estimation methods

Clearly, applying traditional regression metrics directly here can be challenging, necessitating an approach specific to the problem we're addressing. This discussion is a bit more involved, but it's critical for evaluating our estimators.

$\oint$ Certain metrics from conformal prediction could be utilized here, like "how often the observed target falls within a confidence interval", if you construct confidence intervals from the estimated conditional densities. However, metrics inherently suited to the nature of the problem are more suitable.

Let's denote the true conditional probability density of the problem as $f(y \,|\, X=x)$, and the estimated density as $\hat{f}(y \,|\, X=x)$. We want to gauge how close these two functions are, even though we don't have $f(y | x)$. A smart way to handle this is to compute the mean square error of the difference [3]

\[L(f, \hat{f}) = \mathbb{E}_X\left( \int \left( \hat{f}(y \,|\, X) - f(y \,|\, X) \right)^2 dy \right) = \int \int \left( \hat{f}(y \,|\, X=x) - f(y \,|\, X=x) \right)^2 dy \, f(x) \, dx.\]

$\oint$ This metric differs somewhat from the mean square error as empirical risk for our point estimates $h(x)$. When we calculate $\frac{1}{n} \sum_{i=1}^n \left( h(x_i) - y_i \right)^2$, we're effectively estimating

\[\mathbb{E}_{(X, Y)}\left( (h(X) - Y)^2 \right) = \int \int \left( h(x) - y \right)^2 f(x,y) \, dx \, dy.\]

In the metric $L$, we average only with respect to $X$, so that, for a fixed $X=x$, we want $\hat{f}(y \,|\, X=x)$ to approximate $f(y \,|\, X=x)$ well for all possible $y$ values uniformly in $\mathbb{R}$.

Upon expanding $L$, we obtain

\[L(f, \hat{f}) = \int \int \left( \hat{f}(y \,|\, X=x) \right)^2 f(x) \, dy \, dx - 2 \int \int \hat{f}(y \,|\, X=x) f(x, y) \, dx \, dy + C,\]

where $f(x,y) = f(y \,|\, x) f(x)$ and $C$ is defined as $C = \int \int \left( f(y \,|\, x) \right)^2 f(x) \, dy \, dx$. As $C$ is a constant independent of the estimation method of $\hat{f}$, it can be disregarded when comparing models.

The first term can be written as

\[\int \left( \int \left( \hat{f}(y \,|\, X=x) \right)^2\, dy \right) f(x) \, dx,\]

and the inner integral can be calculated using a numerical integration method, while the integral over $x$ can be estimated with an empirical average over a validation sample $S=(x_i, y_i)_{i=1}^n$. Specifically, we have

\[\frac{1}{n} \sum_{i=1}^n \left( \int \left( \hat{f}(y \,|\, X=x_i) \right)^2 \, dy \right).\]

The second term can be directly estimated as the empirical average

\[\frac{-2}{n} \sum_{i=1}^n \hat{f}(y_i \,|\, X=x_i),\]

also using $S$.

Our estimates enable us to calculate a model comparison metric given by

\[L(f, \hat{f}) \approx \hat{L}(f, \hat{f}, S) = \frac{1}{n} \sum_{i=1}^n \left( \int \left( \hat{f}(y \,|\, X=x_i) \right)^2 \, dy \right) - \frac{2}{n} \sum_{i=1}^n \hat{f}(y_i \,|\, X=x_i),\]

where a good model should yield as small a value as possible [3].

$\oint$ It's intriguing to casually interpret this final expression we obtained. As we aim to minimize $\hat{L}(f, \hat{f}, S)$, we want the integrals of the conditional densities squared to be small, while the likelihoods of the observed samples are large (rendering the second term highly negative). That is, we desire our function to be well-behaved and not explode, while we want the observed points to have a high likelihood of being sampled according to our prediction.
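A quick numerical check of $\hat{L}$ on a case where the true density is known can help build trust in the formula. The sketch below assumes Gaussian predicted densities, so that no fitted estimator is needed, and compares the true conditional density with a deliberately poor unconditional one. The helper name `empirical_cde_loss` and all constants are illustrative.

```python
import numpy as np


def normal_pdf(y, loc, scale):
    """Density of a normal distribution with the given mean and standard deviation."""
    return np.exp(-0.5 * ((y - loc) / scale) ** 2) / (scale * np.sqrt(2 * np.pi))


def empirical_cde_loss(y_obs, locs, scales, y_grid):
    """Estimate L (up to the constant C) for Gaussian predicted densities."""
    # Predicted densities on the grid, one row per validation sample.
    dens = normal_pdf(y_grid[None, :], locs[:, None], scales[:, None])
    squared = dens**2
    # Inner integral of the squared density (trapezoidal rule), per sample.
    integral = np.sum(
        (squared[:, 1:] + squared[:, :-1]) / 2 * np.diff(y_grid), axis=1
    )
    # Predicted density evaluated at each observed target.
    likelihood = normal_pdf(y_obs, locs, scales)
    return np.mean(integral) - 2 * np.mean(likelihood)


rng = np.random.default_rng(0)
x = rng.uniform(size=1000)
true_loc = np.sin(2 * np.pi * x)
true_scale = np.full_like(x, 0.3)
y_obs = rng.normal(true_loc, true_scale)
y_grid = np.linspace(-5, 5, 1001)

# Prediction 1: the true conditional density.
loss_conditional = empirical_cde_loss(y_obs, true_loc, true_scale, y_grid)
# Prediction 2: one wide density for everyone, ignoring x.
loss_marginal = empirical_cde_loss(
    y_obs, np.zeros_like(x), np.full_like(x, 0.8), y_grid
)
print(loss_conditional, loss_marginal)
```

Under these assumptions the true conditional density should obtain the lower value, in line with the claim that a good model makes $\hat{L}$ as small as possible.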

We can implement this so that it accepts pre-calculated density estimates and performs the necessary operations (both integration and summation). For the integral, we explicitly ask for a y_grid, over which it is approximated with sklearn.metrics.auc (the trapezoidal rule).

from sklearn.metrics import auc, make_scorer
from joblib import Parallel, delayed

def squared_loss(y_true, cde_preds, y_grid, n_jobs=-1):
    """
    Average squared loss between the true conditional density and predicted one.

    This method can be used to assess the quality of the conditional probability
    density function fit.

    Parameters
    ----------
    y_true : array-like of shape (n_samples,)
        The true values of y for each sample.

    cde_preds : list of len n_samples of KernelDensity instances
        The predicted conditional densities. Each instance should be a fitted
        KernelDensity instance.

    y_grid : ndarray of shape (n_grid_points,)
        The grid of y values used for computing the area under the curve (AUC)
        for the squared probability density function.

    n_jobs : int, optional
        The number of jobs to run in parallel. '-1' means using all processors.

    Returns
    -------
    average_squared_loss: float
        The average squared loss between the true and predicted conditional
        probability density functions. Note that it is always off by C.
    """

    def _compute_individual_loss(y_, cde_pred):
        # The score_samples and score methods return values on a log scale,
        # so we have to exponentiate them.
        squared_auc = auc(
            y_grid, np.exp(cde_pred.score_samples(y_grid.reshape(-1, 1))) ** 2
        )
        expected_value = np.exp(cde_pred.score([[y_]]))
        return squared_auc - 2 * expected_value

    individual_squared_loss = Parallel(n_jobs=n_jobs)(
        delayed(_compute_individual_loss)(y_, cde_pred)
        for y_, cde_pred in zip(y_true, cde_preds)
    )

    average_squared_loss = sum(individual_squared_loss) / len(y_true)
    return average_squared_loss

Applying this to the previous data provides us with a method to quantify our performance in CDE.

squared_loss(y_test, ckde_preds, np.linspace(-5, 5, 1000))

-0.837595643080642

For the sake of comparison, we could contrast it with the density estimation of $Y$ without considering conditionality, that is, training the KDE on all the training data.

squared_loss(
    y_test,
    len(y_test) * [KernelDensity(bandwidth="scott").fit(y_train.reshape(-1, 1))],
    np.linspace(-5, 5, 1000),
)

-0.38725117712967094

Since the loss of the conditional estimator (about -0.84) is lower than that of the unconditional one (about -0.39), we can conclude that it provides a better density estimation, as anticipated.

$\oint$ While this metric is useful for comparing models, it might be difficult to interpret from a business perspective. In this case, it might be helpful to convert your distribution forecast into a point forecast to calculate a more traditional metric, such as sklearn.metrics.mean_absolute_error (or even conformal prediction metrics), to provide a more digestible interpretation.

With a method to compare models in place, it's natural to want to optimize hyperparameters using a tool like sklearn.model_selection.GridSearchCV. Given that we've designed the ConditionalNearestNeighborsKDE to comply with the scikit-learn standard, and the metric in a way that it accepts the output from a .predict method, we can readily employ sklearn.model_selection.GridSearchCV to optimize our usage of sklearn.neighbors.NearestNeighbors.

from functools import partial
from sklearn.model_selection import GridSearchCV

squared_loss_score = make_scorer(
    partial(squared_loss, y_grid=np.linspace(-5, 5, 1000)), greater_is_better=False
)
param_grid = {
    "nn_estimator": [
        NearestNeighbors(n_neighbors=n_neighbors) for n_neighbors in [100, 500, 1000]
    ],
}
gs = GridSearchCV(
    ConditionalNearestNeighborsKDE(), param_grid=param_grid, scoring=squared_loss_score
).fit(X_train.reshape(-1, 1), y_train)

(
    pd.DataFrame(gs.cv_results_)
    .filter(
        ["param_nn_estimator", "mean_score_time", "mean_test_score", "std_test_score"]
    )
    .sort_values(by="mean_test_score", ascending=False)
    .reset_index(drop=True)
)

	param_nn_estimator	mean_score_time	mean_test_score	std_test_score
0	NearestNeighbors(n_neighbors=500)	3.838123	0.890078	0.021875
1	NearestNeighbors(n_neighbors=100)	1.243606	0.859500	0.016847
2	NearestNeighbors(n_neighbors=1000)	6.416354	0.711722	0.018199

squared_loss(
    y_test, gs.best_estimator_.predict(X_test.reshape(-1, 1)), np.linspace(-5, 5, 1000)
)

-0.9058110146877884

In this case, the selected model (500 neighbors) achieves a better test loss than the default of 100 neighbors (-0.906 versus -0.838). However, we could still be interested in aspects of kernel estimation, which could further enhance the result.

The ConditionalNearestNeighborsKDE structure was proposed as it is more intuitive. Nonetheless, in higher dimensions or in scenarios with a lot of data, the neighbor search can encounter certain issues. Firstly, it's computationally costly due to the requirement for distance comparisons. Secondly, we might easily be at the mercy of varying scales of variables, potentially including categorical variables. Thirdly, we might have many less informative variables in $X$ and consequently suffer from the curse of dimensionality, with our neighbors becoming increasingly distant and less representative. In a real-world problem, you might have hundreds of covariates you wish to condition on and millions of examples, making this strategy possibly less suitable.

LeafNeighbors

A potential way to bypass the complications posed by neighbor searches in high dimensions, such as irrelevant variables, and varied scales and types is to formulate a more suitable distance metric robust to these challenges.

The manner in which tree training is conducted naturally equips it to tackle these issues effectively because: tree models learn what the important features are through the process of choosing the best splits; and they are not concerned with the scale of variables, as they focus only on the ordering during training.

When training a bagging of trees, we see variability in splits across the feature space, enabling us to use co-occurrence in the same leaves as a measure of similarity between examples [4].

Hence, if we train a bagging model of regression trees like sklearn.ensemble.RandomForestRegressor or sklearn.ensemble.ExtraTreesRegressor to predict $Y$ from $X$, we are inherently constructing trees that create splits in relevant variables for predicting $Y$. At the same time, we disregard different scales by considering all instances that occur in the same leaf as similar, achieved by counting the co-occurrences of leaves across different models in the bagging [5].
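The co-occurrence idea can be sketched without training anything. The leaf indices below are invented for the example; they mimic what the `apply` method of a forest returns, with one row per sample and one column per tree. The Hamming distance between two rows is then the fraction of trees in which the two samples fall into different leaves.

```python
import numpy as np

# Hypothetical leaf indices: rows are training samples, columns are trees.
leaves_train = np.array(
    [
        [3, 7, 2, 5],
        [3, 7, 2, 6],
        [1, 4, 2, 6],
        [0, 4, 9, 8],
    ]
)
# Leaf indices of a single query point in the same four trees.
leaves_query = np.array([[3, 7, 2, 5]])

# Hamming distance: share of trees where the leaves differ.
# Resulting shape is (n_queries, n_train).
dist = (leaves_query[:, None, :] != leaves_train[None, :, :]).mean(axis=2)

# Training samples ordered from most to least similar to the query.
order = np.argsort(dist, axis=1, kind="stable")
print(dist, order)
```

A distance of 0 means the two points share a leaf in every tree, while 1 means they never do.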

We can design a neighbor search class following this rationale, in accordance with the scikit-learn standards.

from sklearn.neighbors._base import NeighborsBase
from sklearn.ensemble import RandomForestRegressor

class LeafNeighbors(NeighborsBase):
    """Neighbors search using leaf nodes coincidence in a tree ensemble as a
    similarity measure.

    This class implements a supervised neighbor search using the leaves of an
    ensemble tree estimator as a measure of distance. Examples that occur
    simultaneously in several leaves are naturally close in variables relevant
    to the target.

    Parameters
    ----------
    tree_ensemble_estimator : ForestRegressor instance, default=None
        The ensemble tree estimator to use. If None, a
        `~sklearn.ensemble.RandomForestRegressor` with `max_depth=10` will be
        used.

    n_neighbors : int, default=5
        Number of neighbors to use in the neighbor-based learning method.

    random_state : int, RandomState instance or None, default=None
        Controls the randomness of the ensemble tree estimator. Pass an int
        for reproducible output across multiple function calls.
    """

    def __init__(self, tree_ensemble_estimator=None, n_neighbors=5, random_state=None):
        self.tree_ensemble_estimator = tree_ensemble_estimator
        self.n_neighbors = n_neighbors
        self.random_state = random_state

    def fit(self, X, y=None):
        # Store the fitted forest in a trailing-underscore attribute so that the
        # constructor parameter is left untouched, as scikit-learn expects.
        if self.tree_ensemble_estimator is None:
            self.tree_ensemble_estimator_ = RandomForestRegressor(
                max_depth=10, random_state=self.random_state
            )
        else:
            self.tree_ensemble_estimator_ = clone(self.tree_ensemble_estimator)

        self.nn_estimator_ = NearestNeighbors(
            n_neighbors=self.n_neighbors, metric="hamming"
        )

        self.tree_ensemble_estimator_.fit(X, y)
        leafs_X = self.tree_ensemble_estimator_.apply(X)
        self.nn_estimator_.fit(leafs_X)
        return self

    def kneighbors(self, X):
        leafs_X = self.tree_ensemble_estimator_.apply(X)
        return self.nn_estimator_.kneighbors(leafs_X)

    def radius_neighbors(self, X):
        leafs_X = self.tree_ensemble_estimator_.apply(X)
        return self.nn_estimator_.radius_neighbors(leafs_X)

And use it in the ConditionalNearestNeighborsKDE, defining the parameter nn_estimator with the custom search method.

crfkde = ConditionalNearestNeighborsKDE(
    nn_estimator=LeafNeighbors(n_neighbors=100)
).fit(X_train.reshape(-1, 1), y_train)
crfkde_preds = crfkde.predict(X_test.reshape(-1, 1))

squared_loss(y_test, crfkde_preds, np.linspace(-5, 5, 1000))

-0.8328820889563571

In this scenario, the loss (-0.833) ended up similar to the one obtained earlier with the Euclidean neighbor search (-0.838), as the dimensionality is low. Consequently, the neighbors identified through shared leaves closely align with those found by the conventional search with Euclidean distance.

FlexCode

FlexCode takes a fundamentally different approach to the CDE problem by employing arguments from linear algebra to estimate the conditional probability density function using a function basis.

The space of square integrable functions ($L^2(\mathbb{R})$) is a vector space equipped with an inner product defined as $\left\langle g, h\right\rangle = \int_{\mathbb{R}} g(t)\, h(t) \, dt$. Similar to finite-dimensional vector spaces, it possesses a (in this case, countably infinite) basis $\{ \phi_i \in L^2(\mathbb{R}) : i \in \mathbb{N}\}$, where any function $g \in L^2(\mathbb{R})$ can be expressed as a linear combination of the basis elements: $g(t) = \sum_{i=1}^\infty \beta_i \phi_i(t)$, for all $t \in \mathbb{R}$. Furthermore, it is possible to impose an orthonormal condition on the basis, such that $\left\langle \phi_i, \phi_j\right\rangle = \delta_{i,j}$, where $\delta_{i,j}$ equals $1$ if $i = j$ and $0$ otherwise [3]. To help illustrate this concept, if you are unfamiliar with it, consider the analogy to the application of Fourier series.

With any fixed orthonormal basis $\{ \phi_i \}$, it is possible to express the conditional probability density function as follows [3]:

\[f(y \,|\, X=x) = \sum_{i=1}^\infty \beta_i(x)\, \phi_i(y).\]

In this formulation, we explicitly incorporate the dependence of $X=x$ within the coefficients of the linear combination.

It is worth noting that due to the orthonormality of the basis $\{ \phi_i \}$, we have that

\[\begin{align*} \mathbb{E}\left( \phi_j(Y) \,|\, X=x \right) &= \int_\mathbb{R} \phi_j(y) \,f(y \,|\, X=x) \,dy\\ &= \int_\mathbb{R} \phi_j(y) \sum_{i=1}^\infty \beta_i(x)\, \phi_i(y) \,dy\\ &= \sum_{i=1}^\infty \beta_i(x) \int_\mathbb{R} \phi_j(y) \, \phi_i(y) \,dy\\ &= \sum_{i=1}^\infty \beta_i(x) \,\delta_{i,j} = \beta_j(x). \end{align*}\]

Hence, $\beta_j(x)$ is a conditional expectation, and its estimate $\hat{\beta}_j(x)$ can be obtained through regression, using $X$ as predictors and $\phi_j(Y)$ as the target. Note that interchanging the summation and the integral in the third step is justified by Fubini's Theorem.

The FlexCode algorithm [3] adopts this approach. Given a chosen basis_system (a hyperparameter of the model), the algorithm estimates the coefficients through regressions of $\phi_j(Y)$ on $X$. Since computing the infinite sum is not practical, it is truncated at a specified value max_basis, denoted $I$ (which can be chosen through cross-validation as a hyperparameter). Consequently, we obtain

\[\hat{f}(y \,|\, X=x) = \sum_{i=1}^I \hat{\beta}_i(x) \, \phi_i(y).\]
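
To make the derivation concrete, here is a minimal sketch of the unconditional case, where the regression of $\phi_j(Y)$ on $X$ reduces to a plain sample mean. It assumes the cosine basis on $[0, 1]$ ($\phi_0 = 1$ and $\phi_j(y) = \sqrt{2}\cos(j \pi y)$) and a Beta(2, 5) sample. The helper names cosine_basis and trapezoid are hypothetical, and this is not how FlexCode itself is implemented.

```python
import numpy as np


def cosine_basis(y, n_basis):
    """Evaluate phi_0, ..., phi_{n_basis - 1} of the cosine basis on [0, 1]."""
    y = np.asarray(y, dtype=float)
    j = np.arange(n_basis)
    basis = np.sqrt(2.0) * np.cos(np.pi * j[None, :] * y[:, None])
    basis[:, 0] = 1.0  # phi_0 is the constant function
    return basis  # shape (len(y), n_basis)


def trapezoid(values, grid):
    """Trapezoidal rule along the first axis of `values`."""
    widths = np.diff(grid)
    heights = 0.5 * (values[1:] + values[:-1])
    return np.tensordot(widths, heights, axes=(0, 0))


grid = np.linspace(0.0, 1.0, 2001)
phi_grid = cosine_basis(grid, n_basis=15)

# Orthonormality: the matrix of inner products <phi_i, phi_j> should be the identity.
gram = trapezoid(phi_grid[:, :, None] * phi_grid[:, None, :], grid)

# beta_j = E[phi_j(Y)], estimated here by a sample mean (there are no covariates).
rng = np.random.default_rng(0)
y_sample = rng.beta(2, 5, size=20_000)
beta_hat = cosine_basis(y_sample, n_basis=15).mean(axis=0)

# Truncated expansion evaluated on the grid, compared with the true density.
density_hat = phi_grid @ beta_hat
true_density = 30 * grid * (1 - grid) ** 4  # Beta(2, 5) pdf
print(np.abs(density_hat - true_density).mean())
```

Because every $\phi_j$ with $j \geq 1$ integrates to zero, the truncated estimate integrates to one, but nothing forces it to be non-negative everywhere.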

Using FlexCode in Python

To utilize FlexCode, we first need to define the regression model along with its parameters, as well as the previously mentioned hyperparameters.

from flexcode.regression_models import RandomForest
from flexcode import FlexCodeModel

flexcode_model = FlexCodeModel(
    RandomForest,
    basis_system="cosine",
    max_basis=31,
    regression_params={"max_depth": 5, "n_estimators": 100},
)
flexcode_model.fit(X_train, y_train)

As implemented, the estimator returns the value of $\hat{f}(y \,|\, X=x)$ on a grid of $y$ values.

cdes, y_grid_flexcode = flexcode_model.predict(X_test, n_grid=400)
y_grid_flexcode = y_grid_flexcode.reshape(-1)

fig, ax = plt.subplots(figsize=(8, 3))
for c, sample_index in enumerate(np.random.RandomState(13).choice(len(X_test), size=3)):
    x_value = np.round(X_test[sample_index], 4)
    ax.plot(
        y_grid_refined,
        norm(loc=mean_function(x_value), scale=deviation_function(x_value)).pdf(
            y_grid_refined
        ),
        "--",
        color=f"C{c}",
        label=f"real $f(y | x = {x_value})$",
    )
    ax.plot(
        y_grid_flexcode,
        cdes[sample_index],
        color=f"C{c}",
        label=f"estimated $f(y | x = {x_value})$ using flexcode",
    )
ax.set_title("Conditional density of y given X=x")
ax.set_xlabel("y")
ax.legend()
plt.tight_layout()

To evaluate the estimator, recall that we built our metric to work with objects similar to sklearn.neighbors.KernelDensity. We therefore need to adapt the output of flexcode.FlexCodeModel into an object that exposes the same methods.

class FlexCode_return_to_DensityEstimator:
    def __init__(self, y_grid, pdf_values):
        self.y_grid = y_grid
        self.pdf_values = pdf_values
        self.density = rv_histogram(
            (pdf_values, np.hstack([y_grid, [y_grid[-1] + y_grid[-1] - y_grid[-2]]]))
        )

    def score_samples(self, X):
        return np.log(self.density.pdf(X))

    def score(self, X):
        return np.sum(self.score_samples(X))

density_estimation_preds_flexcode = [
    FlexCode_return_to_DensityEstimator(y_grid=y_grid_flexcode, pdf_values=cde)
    for cde in cdes
]
squared_loss(y_test, density_estimation_preds_flexcode, np.linspace(-5, 5, 1000))

-1.5436164449474372

In this scenario, FlexCode achieves a lower (better) squared loss than the KDE based on nearest neighbors that we used previously.

Practical application

Let's apply these various techniques to a real regression problem, namely sklearn.datasets.fetch_california_housing, to evaluate the performance of the different approaches discussed.

from sklearn.datasets import fetch_california_housing

X_california, y_california = fetch_california_housing(return_X_y=True)
(
    X_california_train,
    X_california_test,
    y_california_train,
    y_california_test,
) = train_test_split(X_california, y_california, test_size=0.33, random_state=42)
print(f"X dimension: {X_california.shape[1]}")

X dimension: 8

ckde_california = ConditionalNearestNeighborsKDE().fit(
    X_california_train, y_california_train
)
ckde_california_preds = ckde_california.predict(X_california_test)

squared_loss(y_california_test, ckde_california_preds, np.linspace(0, 6, 1000))

-0.2948159711962537

crfkde_california = ConditionalNearestNeighborsKDE(
    nn_estimator=LeafNeighbors(n_neighbors=100)
).fit(X_california_train, y_california_train)
crfkde_california_preds = crfkde_california.predict(X_california_test)

squared_loss(y_california_test, crfkde_california_preds, np.linspace(0, 6, 1000))

-0.6802084650191235

model_california = FlexCodeModel(
    RandomForest, max_basis=31, regression_params={"max_depth": 10, "n_estimators": 100}
)
model_california.fit(X_california_train, y_california_train)

cdes_california, y_grid_california = model_california.predict(
    X_california_test, n_grid=400
)
y_grid_california = y_grid_california.reshape(-1)
density_estimation_preds_flexcode_california = [
    FlexCode_return_to_DensityEstimator(y_grid=y_grid_california, pdf_values=cde)
    for cde in cdes_california
]

squared_loss(
    y_california_test,
    density_estimation_preds_flexcode_california,
    np.linspace(0, 6, 1000),
)

-1.2739272081741533

We can observe that the neighbor search using LeafNeighbors outperforms the conventional neighbor search in our ConditionalNearestNeighborsKDE method. However, the flexcode.FlexCodeModel yields even better results compared to both methods in this example.

Final considerations

Delving deeper into regression problems, beyond simple point estimates, can be challenging. However, this approach provides a wealth of insightful information that can enhance your decision-making process. While it's a significant area of study, it has not yet become a primary focus within the community. Nonetheless, I anticipate a surge of interest as more individuals realize its value.

Currently, the libraries designed to address these intricate problems are being refined, with issues being resolved over time. Therefore, when utilizing these tools, it's essential to exercise caution and promptly report any anomalous behavior observed.

$\oint$ I wanted to mention the tree-based neighbor method because it is possible to adapt the tree training in a specific way for the CDE problems. Typically, in regression problems, a decision tree would aim to optimize a particular metric such as sklearn.metrics.mean_squared_error when establishing its splits. However, for CDE problems, there's a possibility to optimize a CDE-specific metric directly within the splits. One such metric could be the CDE squared_loss we implemented earlier. This approach is what the RFCDE (Random Forests for Conditional Density Estimation) method suggests [6].

Bibliography

[1] A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. Anastasios N. Angelopoulos, Stephen Bates. 2021.

[2] Python Data Science Handbook: In-Depth Kernel Density Estimation. Jake VanderPlas. 2016.

[3] Converting high-dimensional regression to high-dimensional conditional density estimation. Rafael Izbicki, Ann B. Lee. Electron. J. Statist. 2017.

[4] Supervised clustering and forest embeddings. Guilherme Duarte Marmerola. 2018.

[5] Quantile Regression Forests. Nicolai Meinshausen. Journal of Machine Learning Research. 2006.

[6] RFCDE: Random Forests for Conditional Density Estimation. Taylor Pospisil, Ann B. Lee. 2018.

You can find all files and environments for reproducing the experiments in the repository of this post.

Hyperparameter search with threshold-dependent metrics

2023-01-06T00:00:00+00:00

In a binary classification problem, you probably shouldn't ever use the .predict method from scikit-learn (and consequently from libraries that follow its design pattern). In scikit-learn, the implementation of .predict, in general, follows the logic implemented for sklearn.ensemble.RandomForestClassifier:

def predict(self, X):
    ...
    proba = self.predict_proba(X)
    ...
    return self.classes_.take(np.argmax(proba, axis=1), axis=0)

In the case where we only have two classes (0 or 1), .predict, by picking the class with the highest "probability", is equivalent to the rule "if .predict_proba > 0.5, then predict 1; otherwise, predict 0". That is, under the hood, we are using a threshold of 0.5 without it being visible.

Up to now, nothing new. However, we will show in an example how this can be harmful to superficial analyses that don't take this into account.

Optimizing f1 in a naive way

To exemplify this issue, we will use a dataset from imbalanced-learn, a library from the scikit-learn-contrib ecosystem with several implementations of techniques that deal with imbalanced problems. So, let's build a model that ideally has the best possible sklearn.metrics.f1_score.

from imblearn.datasets import fetch_datasets

dataset = fetch_datasets()["coil_2000"]
X, y = dataset.data, (dataset.target==1).astype(int)

print(f"Percentage of y=1 is {np.round(y.mean(), 5)*100}%.")
print(f"Number of rows is {X.shape[0]}.")

Percentage of y=1 is 5.966%.
Number of rows is 9822.

I'm going to divide the dataset (taking care of the stratification because we are in an imbalanced problem) into a set for training the model, a second set for choosing the threshold, and a last one for validation. We will not be dealing with the second set for now, but we will show some ways of optimizing the threshold that will need this extra set.

from sklearn.model_selection import train_test_split

X_train_model, X_test, y_train_model, y_test = train_test_split(X, y, random_state=0, stratify=y)
X_train_model, X_train_threshold, y_train_model, y_train_threshold = \
train_test_split(X_train_model, y_train_model, random_state=0, stratify=y_train_model)

Suppose we want to optimize the hyperparameters of a sklearn.ensemble.RandomForestClassifier getting the best possible sklearn.metrics.f1_score (as we anticipated just now).

I'm going to create an auxiliary function to run this search for hyperparameters because we're going to do this several times (using a sklearn.model_selection.GridSearchCV, but it could be any other way to search for hyperparameters).

from sklearn.model_selection import StratifiedKFold

params = {
    "max_depth": [2, 4, 10, None],
    "n_estimators": [10, 50, 100],
}

skfold = StratifiedKFold(n_splits=3,
                         shuffle=True,
                         random_state=0)

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

def run_experiment(estimator, scoring, X, y, params, cv):
    gscv = (
        GridSearchCV(estimator=estimator,
                     param_grid=params,
                     scoring=scoring,
                     cv=cv)
        .fit(X, y)
    )

    return (
        pd.DataFrame(gscv.cv_results_)
        .pipe(lambda df:
              df[list(map(lambda x: "param_" + x,  params.keys())) + ["mean_test_score", "std_test_score"]])
    )

With this auxiliary function built, we run our search trying to optimize scoring="f1".

run_experiment(estimator=RandomForestClassifier(random_state=0),
               scoring="f1", X=X_train_model, y=y_train_model, params=params, cv=skfold)

	param_max_depth	param_n_estimators	mean_test_score	std_test_score
0	2	10	0.000000	0.000000
1	2	50	0.000000	0.000000
2	2	100	0.000000	0.000000
3	4	10	0.000000	0.000000
4	4	50	0.000000	0.000000
5	4	100	0.000000	0.000000
6	10	10	0.059510	0.039552
7	10	50	0.040333	0.016119
8	10	100	0.034938	0.014265
9	None	10	0.097418	0.007834
10	None	50	0.105050	0.022298
11	None	100	0.096360	0.016211

Some combinations of hyperparameters seem to have an sklearn.metrics.f1_score of 0. Weird.

This happens because sklearn.metrics.f1_score is a threshold-dependent metric (in the sense that it needs hard predictions instead of predicted probabilities), so scikit-learn uses .predict instead of .predict_proba (and consequently "uses the threshold of 0.5", following the equivalence we discussed earlier).
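
A quick way to see this is to call the scorer behind scoring="f1" by hand. The sketch below uses a small synthetic dataset from make_classification (an assumption for illustration, not the coil_2000 data) and checks that the scorer returns the same value as f1_score applied to .predict.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, get_scorer

# Synthetic imbalanced data, roughly 95% of class 0 (illustrative assumption).
X_demo, y_demo = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_fit, y_fit = X_demo[:1000], y_demo[:1000]
X_eval, y_eval = X_demo[1000:], y_demo[1000:]

model_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# The scorer that GridSearchCV looks up when given scoring="f1".
scorer_value = get_scorer("f1")(model_demo, X_eval, y_eval)

# The same metric computed by hand from the hard predictions of .predict.
manual_value = f1_score(y_eval, model_demo.predict(X_eval))

print(scorer_value, manual_value)
```

The predicted probabilities never reach the scorer, so no threshold other than the implicit 0.5 is ever evaluated during the search.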

As our problem is imbalanced, a threshold of 0.5 is usually suboptimal, and that is the case here. Almost any model will concentrate its .predict_proba values close to 0, so a threshold closer to 0 seems more reasonable for our problem.

from collections import Counter
out_of_the_box_model = RandomForestClassifier(random_state=0).fit(X_train_model, y_train_model)

predict_proba = out_of_the_box_model.predict_proba(X_train_threshold)[:, 1]
predict = out_of_the_box_model.predict(X_train_threshold)

# Just to check. ;)
assert ((predict_proba > 0.5).astype(int) == predict).all()

fig, ax = plt.subplots(ncols=2, figsize=(6, 2.5))

ax[0].hist(predict_proba, bins=np.linspace(0, 1, 26))
ax[0].set_title("Histogram of .predict_proba(X)", fontsize=10)

count_predict = Counter(predict)
ax[1].bar(count_predict.keys(), count_predict.values(), label=".predict(X)", width=0.4)
count_y = Counter(y_train_threshold)
ax[1].bar(np.array(list(count_y.keys())) + 0.4, count_y.values(), label="y", width=0.4)
ax[1].set_xticks([0.2, 1.2])
ax[1].set_xticklabels([0, 1])
ax[1].tick_params(bottom = False)
ax[1].set_yscale("log")
ax[1].set_title("Count of 0's and 1's", fontsize=10)
ax[1].legend(fontsize=7)

plt.tight_layout()

Very few examples pass the 0.5 threshold, significantly fewer than the actual number of class 1 samples. This tells us that a softer threshold (less than 0.5) makes more sense in this problem.

This is often the case in imbalanced learning scenarios. For instance, if 1% of the people in your population have some disease and your model predicts that a given person has a 10% chance of having it, then chances are that you should treat them as someone with a high likelihood of being ill.

Tuning the threshold

To find the optimal threshold, we can bootstrap a set separate from the one used in training to find the best threshold for that model by optimizing some metric (threshold-dependent) of interest, such as, in our case, sklearn.metrics.f1_score.

from tqdm import tqdm

def optmize_threshold_metric(model, X, y, metric, threshold_grid, n_bootstrap=20):
    metric_means, metric_stds = [], []
    for t in tqdm(threshold_grid):
        metrics = []
        for i in range(n_bootstrap):
            ind_bootstrap = np.random.RandomState(i).choice(len(y), len(y), replace=True)
            metric_val = metric(y[ind_bootstrap],
                          (model.predict_proba(X[ind_bootstrap])[:, 1] > t).astype(int))
            metrics.append(metric_val)
        metric_means.append(np.mean(metrics))
        metric_stds.append(np.std(metrics))

    metric_means, metric_stds = np.array(metric_means), np.array(metric_stds)
    best_threshold = threshold_grid[np.argmax(metric_means)]

    return metric_means, metric_stds, best_threshold

For each threshold value, the bootstrap gives us an estimate of the mean sklearn.metrics.f1_score we expect to obtain with that choice, together with its standard deviation, which gives an idea of the variability of the score. We choose the final threshold as the one with the best estimated sklearn.metrics.f1_score.

threshold_grid = np.linspace(0, 1, 101)
from sklearn.metrics import f1_score

f1_means_ootb, f1_stds_ootb, best_threshold_ootb = \
optmize_threshold_metric(out_of_the_box_model, X_train_threshold, y_train_threshold, f1_score, threshold_grid)

fig, ax = plt.subplots(figsize=(5, 2.5))
ax.plot(threshold_grid, f1_means_ootb)
ax.fill_between(threshold_grid, f1_means_ootb - 1.96 * f1_stds_ootb, f1_means_ootb + 1.96 * f1_stds_ootb, alpha=0.5)
ax.vlines(best_threshold_ootb, min(f1_means_ootb - 1.96 * f1_stds_ootb), max(f1_means_ootb + 1.96 * f1_stds_ootb), "k", label="Chosen threshold")
ax.set_xticks(np.linspace(0, 1, 11))
ax.set_xlabel("Threshold")
ax.set_ylabel("f1_score")
ax.legend()
plt.tight_layout()

100%|██████████| 101/101 [02:00<00:00,  1.19s/it]

f1_score(y_test, (out_of_the_box_model.predict_proba(X_test)[:, 1] > best_threshold_ootb).astype(int))

0.1878453038674033

f1_score(y_test, out_of_the_box_model.predict(X_test))

0.043478260869565216

With the threshold chosen through optimization, we ended up with a much better sklearn.metrics.f1_score than the one we get with .predict, with the 0.5 threshold.

$\oint$ Here we are directly choosing the threshold that, on average, has the best metric value of interest, but there are other possibilities [1]. We could, for example, play with the "confidence interval" (which, in this case, I'm just plotting to give an order of magnitude of the variance), optimizing for the upper or lower limit, or even use the threshold that maximizes Youden's J statistic (which is equivalent to taking the threshold that gives the most significant separation of the KS curves between the .predict_proba(X[y==0]) and .predict_proba(X[y==1])).
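
As a sketch of the Youden's J alternative, the snippet below picks the threshold that maximizes TPR - FPR with sklearn.metrics.roc_curve and compares the maximum with the two-sample KS statistic from scipy.stats.ks_2samp. The scores are synthetic (an assumption for illustration), drawn so that positives tend to score higher than negatives.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import roc_curve

# Synthetic scores: 900 negatives and 100 positives (illustrative assumption).
rng = np.random.default_rng(0)
y_true = np.r_[np.zeros(900, dtype=int), np.ones(100, dtype=int)]
scores = np.r_[rng.beta(2, 5, size=900), rng.beta(5, 2, size=100)]

# Youden's J at every candidate threshold: J = TPR - FPR.
fpr, tpr, thresholds = roc_curve(y_true, scores, drop_intermediate=False)
j_statistic = tpr - fpr
youden_threshold = thresholds[np.argmax(j_statistic)]

# Largest gap between the empirical CDFs of the scores of each class.
ks_statistic = ks_2samp(scores[y_true == 0], scores[y_true == 1]).statistic

print(youden_threshold, j_statistic.max(), ks_statistic)
```

Note that roc_curve treats a sample as positive when its score is greater than or equal to the threshold, while the code in this post uses a strict inequality; with continuous scores the difference is immaterial.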

Back to hyperparameters search

But what to do now? How can we get around this if optimizing the sklearn.metrics.f1_score directly doesn't look like a good idea since scikit-learn will use .predict? We will discuss three possibilities of how to get around this issue. One case is not necessarily better than the other, and the idea is to show some options for facing the problem.

1. Use a threshold-independent metric

The most common approach is to run the hyperparameter search with a threshold-independent metric, even if you are interested in a threshold-dependent one, and only at the end use something like optmize_threshold_metric to optimize the metric of genuine interest.

$\oint$ This sounds sub-optimal, but we do this all the time in Machine Learning. Even if you're interested in optimizing sklearn.metrics.roc_auc_score on a credit default classification problem, your sklearn.ensemble.RandomForestClassifier will be optimizing for criterion="gini" or something related to sklearn.metrics.roc_auc_score, but that is different. Here the idea is the same. Optimizing for sklearn.metrics.roc_auc_score or sklearn.metrics.average_precision_score is not the same as optimizing for sklearn.metrics.f1_score, for example, but models that are good at the former will be good at the latter too.

run_experiment(estimator=RandomForestClassifier(random_state=0),
               scoring="roc_auc", X=X_train_model, y=y_train_model, params=params, cv=skfold)

	param_max_depth	param_n_estimators	mean_test_score	std_test_score
0	2	10	0.719377	0.008165
1	2	50	0.746675	0.007476
2	2	100	0.742196	0.007105
3	4	10	0.733715	0.013691
4	4	50	0.744482	0.010491
5	4	100	0.747113	0.007466
6	10	10	0.695511	0.018646
7	10	50	0.703767	0.019845
8	10	100	0.708600	0.022674
9	None	10	0.652099	0.031056
10	None	50	0.682542	0.017131
11	None	100	0.685519	0.020818

2. Leak the threshold search

But what if, for some reason, we want to explicitly optimize our metric of interest within the grid search? In that case, we need a bigger workaround. A reasonable proxy for how your model will perform once its threshold is tuned is to tune the threshold directly on the validation fold. Since the chosen threshold is the one that optimizes the metric on that fold, the resulting score is the best possible value, so you can directly take the max (or the min) over the threshold grid.

from sklearn.metrics import make_scorer

def make_threshold_independent(metric, threshold_grid=np.linspace(0, 1, 101), greater_is_better=True):
    opt_fun = {True: max, False: min}
    opt = opt_fun[greater_is_better]
    def threshold_independent_metric(y_true, y_pred, *args, **kwargs):
        return opt([metric(y_true, (y_pred > t).astype(int), *args, **kwargs) for t in threshold_grid])
    return threshold_independent_metric

f1_threshold_independent_score = make_threshold_independent(f1_score)
f1_threshold_independent_scorer = make_scorer(f1_threshold_independent_score, needs_proba=True)

As this is a threshold-independent metric (because we passed needs_proba=True), we will no longer have the problem of scikit-learn using .predict.

df_best_f1 = run_experiment(estimator=RandomForestClassifier(random_state=0),
                            scoring=f1_threshold_independent_scorer,
                            X=X_train_model, y=y_train_model, params=params, cv=skfold)

df_best_f1

	param_max_depth	param_n_estimators	mean_test_score	std_test_score
0	2	10	0.253281	0.009199
1	2	50	0.267678	0.005953
2	2	100	0.257495	0.002502
3	4	10	0.241877	0.017142
4	4	50	0.257753	0.014293
5	4	100	0.263571	0.011393
6	10	10	0.202218	0.016497
7	10	50	0.225597	0.032149
8	10	100	0.230246	0.025504
9	None	10	0.181869	0.015010
10	None	50	0.213798	0.037220
11	None	100	0.209927	0.034730

On the other hand, we are leaking information from the validation set and consequently overestimating our metric, since we choose the best threshold on the validation fold of the cross-validation.
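
The size of this optimism can be illustrated with a small sketch. It uses synthetic, weakly informative scores (an assumption for illustration, not the coil_2000 model) and compares the score obtained when the threshold is picked on the validation half itself with the score obtained when it is picked on a separate half.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
threshold_grid = np.linspace(0, 1, 101)

# Synthetic labels (about 6% positives) and noisy scores (illustrative assumption).
y_demo = rng.binomial(1, 0.06, size=4000)
scores = np.clip(rng.normal(0.10 + 0.10 * y_demo, 0.08), 0, 1)

# Two disjoint halves: one to tune the threshold, one playing the validation fold.
tune, valid = np.arange(2000), np.arange(2000, 4000)


def f1_by_threshold(y, s):
    """F1 score for every threshold of the grid."""
    return np.array(
        [f1_score(y, (s > t).astype(int), zero_division=0) for t in threshold_grid]
    )


# Leaked: the threshold is picked on the validation fold itself (method 2).
leaked_f1 = f1_by_threshold(y_demo[valid], scores[valid]).max()

# Honest: the threshold is picked on separate data, then applied to the validation fold.
honest_threshold = threshold_grid[np.argmax(f1_by_threshold(y_demo[tune], scores[tune]))]
honest_f1 = f1_score(
    y_demo[valid], (scores[valid] > honest_threshold).astype(int), zero_division=0
)

print(leaked_f1, honest_f1)
```

By construction the leaked value can never be lower than the honest one, because the honest threshold is just one of the candidates the leaked search maximizes over.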

3. Tuning the threshold during gridsearch on a chunk of the training set

A better way to do this (in terms of correctly evaluating the performance during cross-validation) is to modify our estimator's training function so that it also calculates the best threshold. To clarify what we are doing without having to look at the class details we will implement, it is worth comparing the difference between methods 2 and 3.

In each step of our cross-validation, we will have a training set and a validation set that we will use to evaluate the performance of the classifier trained in that training set. That is what we were doing in method 1, for instance.

In solution 2, we optimize the threshold on the validation set by taking the best possible metric value for the different thresholds of our threshold grid. But, as we are leaking the threshold search, we will overestimate our metric, which can be harmful.

In solution 3, during the training stage, we hold out part of the training set to optimize the threshold, and the optimal threshold is then used when evaluating on the validation set.

A rough implementation of a class that does this logic is as follows:

import inspect
def dic_without_keys(dic, keys):
    return {x: dic[x] for x in dic if x not in keys}

class ThresholdOptimizerRandomForestBinaryClassifier(RandomForestClassifier):

    def __init__(self, n_bootstrap=20, metric=f1_score, threshold_grid=np.linspace(0, 1, 101), *args, **kwargs,):

        kwargs_without_extra = dic_without_keys(kwargs, ("n_bootstrap", "metric", "threshold_grid"))
        super().__init__(*args, **kwargs_without_extra)
        self.metric = metric
        self.threshold_grid = threshold_grid
        self.n_bootstrap = n_bootstrap

    @classmethod
    def _get_param_names(cls):
        init = getattr(super().__init__, "deprecated_original", super().__init__)
        init_signature = inspect.signature(init)
        parameters = [p for p in init_signature.parameters.values() if p.name != "self" and p.kind != p.VAR_KEYWORD]
        return sorted([p.name for p in parameters] + ["n_bootstrap", "metric", "threshold_grid"])

    def fit(self, X, y, sample_weight=None):

        # Split by index so that sample_weight (if given) follows the same hold-out split.
        ind_model, ind_threshold = train_test_split(
            np.arange(len(y)), random_state=self.random_state, stratify=y)
        weight_model = None if sample_weight is None else np.asarray(sample_weight)[ind_model]

        super().fit(X[ind_model], y[ind_model], sample_weight=weight_model)
        _, _, self.best_threshold_ = self.optmize_threshold_metric(X[ind_threshold], y[ind_threshold])

        return self

    def optmize_threshold_metric(self, X, y):
        metric_means, metric_stds = [], []
        for t in self.threshold_grid:
            metrics = []
            for i in range(self.n_bootstrap):
                ind_bootstrap = np.random.RandomState(i).choice(len(y), len(y), replace=True)
                metric_val = self.metric(y[ind_bootstrap],
                                         (self.predict_proba(X[ind_bootstrap])[:, 1] > t).astype(int))
                metrics.append(metric_val)
            metric_means.append(np.mean(metrics))
            metric_stds.append(np.std(metrics))

        metric_means, metric_stds = np.array(metric_means), np.array(metric_stds)
        best_threshold = self.threshold_grid[np.argmax(metric_means)]

        return metric_means, metric_stds, best_threshold

    def predict(self, X):
        preds = self.predict_proba(X)[:, 1]
        return (preds > self.best_threshold_).astype(int)

$\oint$ scikit-learn doesn't like you using args and kwargs on your estimator's init because of how they designed the way they deal with hyperparameter optimization. But as I didn't want my init to look like this, I decided to change the _get_param_names from the sklearn.base.BaseEstimator to call only the parameters of the class I'm inheriting from (sklearn.ensemble.RandomForestClassifier, a.k.a. super()). If you want to design it properly, you should do this.

$\oint$ Note that although I'm inheriting from sklearn.ensemble.RandomForestClassifier, I don't use any sklearn.ensemble.RandomForestClassifier-specific logic here, and actually, you can do the same with any scikit-learn estimator.

We are basically using the same optimization function we had discussed earlier on the part of the set that is given in .fit by doing a sklearn.model_selection.train_test_split. This implementation is computationally expensive, mainly because of bootstrap. So we can lower the number of bootstrap samples to make it faster.

%%time

df_best = run_experiment(
    estimator=ThresholdOptimizerRandomForestBinaryClassifier(random_state=0, n_bootstrap=5,
                                                             metric=f1_score, threshold_grid=threshold_grid),
    scoring="f1", X=X_train_model, y=y_train_model, params=params, cv=skfold)

df_best

CPU times: total: 5min 25s
Wall time: 5min 28s

	param_max_depth	param_n_estimators	mean_test_score	std_test_score
0	2	10	0.238970	0.011282
1	2	50	0.238447	0.016450
2	2	100	0.243230	0.022790
3	4	10	0.203598	0.039442
4	4	50	0.226371	0.023246
5	4	100	0.249048	0.007759
6	10	10	0.200635	0.034000
7	10	50	0.199724	0.050758
8	10	100	0.176026	0.042777
9	None	10	0.175387	0.015105
10	None	50	0.158617	0.015450
11	None	100	0.179195	0.036804

Tuning the threshold for the best hyperparameters combination

With the best hyperparameter combination from method 3 chosen, we can repeat the procedure we discussed earlier to find the best threshold for this model.

best_params_values = df_best.sort_values("mean_test_score", ascending=False).iloc[0][list(map(lambda x: "param_" + x,  params.keys()))].values
best_params = dict(zip(params.keys(), best_params_values))
best_params

{'max_depth': 4, 'n_estimators': 100}

best_model = (
    RandomForestClassifier(random_state=0)
    .set_params(**best_params)
    .fit(X_train_model, y_train_model)
)

f1_means_best, f1_stds_best, best_threshold_best = \
optmize_threshold_metric(best_model, X_train_threshold, y_train_threshold, f1_score, threshold_grid)

fig, ax = plt.subplots(figsize=(5, 2.5))
ax.plot(threshold_grid, f1_means_best)
ax.fill_between(threshold_grid, f1_means_best - 1.96 * f1_stds_best, f1_means_best + 1.96 * f1_stds_best, alpha=0.5)
ax.vlines(best_threshold_best, min(f1_means_best - 1.96 * f1_stds_best), max(f1_means_best + 1.96 * f1_stds_best), "k", label="Chosen threshold")
ax.set_xticks(np.linspace(0, 1, 11))
ax.set_xlabel("Threshold")
ax.set_ylabel("f1_score")
ax.legend()
plt.tight_layout()

100%|██████████| 101/101 [01:13<00:00,  1.37it/s]

f1_score(y_test, (best_model.predict_proba(X_test)[:, 1] > best_threshold_best).astype(int))

0.24038461538461534

f1_score(y_test, best_model.predict(X_test))

0.0

Notice that we got a much better sklearn.metrics.f1_score than the initial search was telling us we would get!

tl;dr

When optimizing hyperparameters, threshold-dependent metrics make sklearn.model_selection.GridSearchCV-like search methods use the estimator's .predict method instead of .predict_proba. This can be harmful as 0.5 might not be the best threshold, especially in imbalanced learning scenarios.

Always prioritize the threshold-independent metrics, but if you need to use a threshold-dependent metric, you can try to make it threshold-independent by getting the optimal value for it (max or min depending on if greater_is_better=True or False) for a threshold grid of options. As this is the same as optimizing it for the validation set, it can slightly overestimate your results.

A more honest way to do this is to explicitly optimize the threshold on a part of your training set for each cross-validation fold. This mimics reality better but is more time-consuming as this optimization takes time if you want it to be robust (for instance, using bootstrap to better estimate the performance value).

$\oint$ This is the current state of this topic, in version 1.2.0 of scikit-learn. In a future release, there will be a sklearn.model_selection.CutoffClassifier (from PR #16525) that will behave very closely to my ThresholdOptimizerRandomForestBinaryClassifier. One significant change will be that it will receive the estimator during initialization instead of inheriting it.

Bibliography

[1] A Gentle Introduction to Threshold-Moving for Imbalanced Classification by Jason Brownlee.

You can find all files and environments for reproducing the experiments in the repository of this post. In addition, I recorded a video version of this post in Portuguese.

Meta K-Means: an ensemble of K-Means

2022-10-23T00:00:00+00:00

After hearing superficially about ensembles of clustering algorithms [1], I asked myself: what would be a smart way to aggregate the individual decisions of each clustering into a final value? The answer is not immediate, mainly because the labels of each cluster can differ even when the clusterings agree on the partitions.

For example, given a set of eight samples, the segmentations [0, 0, 1, 0, 2, 2, 2, 1] and [1, 1, 0, 1, 3, 3, 3, 0] are identical up to a permutation of names. That is, it suffices to call cluster 0 "1" and cluster 1 "0" in one of the lists, and to call cluster 3 "2" in the second list (or cluster 2 "3" in the first list). It is important to be clear that these clusterings do in fact agree, since the names carry no meaning: we are not in a classification problem.

This motivates "clustering evaluation" metrics such as sklearn.metrics.rand_score, which answers the question: how similar are two clusterings? A value close to 1 means that the groupings largely agree (up to possible renaming).

from sklearn.metrics import rand_score

rand_score([0, 0, 1, 0, 2, 2, 2, 1], [1, 1, 0, 1, 3, 3, 3, 0])

1.0

$\oint$ The idea behind the (unadjusted) Rand index is quite intuitive, and to explain it, let's think of a specific example. Imagine that we have a dataset [a, b, c, d] and two possible clusterings: A = [1, 1, 0, 0] and B = [1, 1, 1, 2].

First, we list all possible pairs of elements in our set. In our example, these are (a, b), (a, c), (a, d), (b, c), (b, d), and (c, d).
Next, we count how many of these pairs agree between clusterings A and B. A pair agrees if its two elements are in the same cluster in both A and B, or in different clusters in both. In our case, the pair (a, b) agrees because its elements share a cluster in A and in B. The pairs (a, d) and (b, d) also agree, because their elements are placed in different clusters in both clusterings.
Finally, we divide the number of agreeing pairs by the total number of pairs to obtain the unadjusted Rand index, our measure of similarity between groupings. In our case, 3/6=0.5.

rand_score([1, 1, 0, 0], [1, 1, 1, 2])

0.5
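The pair-counting recipe above can be written down directly. The following is a minimal pure-Python sketch, not the scikit-learn implementation; the function name rand_index is chosen here only for illustration.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Unadjusted Rand index: fraction of pairs on which two clusterings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same examples")
    pairs = list(combinations(range(len(labels_a)), 2))
    agreeing = 0
    for i, j in pairs:
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        # A pair agrees when it is together in both clusterings or apart in both.
        if same_in_a == same_in_b:
            agreeing += 1
    return agreeing / len(pairs)


print(rand_index([1, 1, 0, 0], [1, 1, 1, 2]))  # 0.5
print(rand_index([0, 0, 1, 0, 2, 2, 2, 1], [1, 1, 0, 1, 3, 3, 3, 0]))  # 1.0
```

Because only "same cluster or not" is compared, the label names never enter the computation, which is why the score is invariant to renaming.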

These permutations make the problem far more challenging than in a supervised ensemble, and there is an extensive literature [1] that tries to tackle it, since we would like to use ensemble ideas here as well.

Talking with Alessandro, we tried to tackle a more compact version of this problem by looking at the specific case of an ensemble of sklearn.cluster.KMeans (although simpler, this case could still bring practical gains given how popular the method is). The hypothesis is that the centroids can be used to find the agreements between the individual estimators. From this came the idea of clustering the centroids of the individual sklearn.cluster.KMeans models in order to rename the final clusters consistently across estimators.

An example helps to illustrate the idea: if we have two sklearn.cluster.KMeans with n_clusters=3, then we have three centroids $K_1, K_2, K_3$ from the first sklearn.cluster.KMeans and the centroids $C_1, C_2, C_3$ from the second. If, by clustering these six centroids (with the same number of clusters n_clusters), we found the metaclusters $G_1 = \{ K_1, C_1 \}$, $G_2 = \{ K_2, K_3, C_3\}$ and $G_3 = \{ C_2\}$, then we would have a mapping to use when aggregating the results of the individual sklearn.cluster.KMeans models.

An example that falls in the cluster of centroid $K_1$ in the first clustering and in that of $C_3$ in the second is assigned to cluster $G_1$ with weight $1/2=0.5$ (since one of the two base K-Means models assigned it to that group), to cluster $G_2$ with weight $1/2=0.5$ (for the same reason) and to cluster $G_3$ with weight $0/2=0$ (since neither base K-Means assigned it to that group). An example that falls in $K_3$ and $C_3$ in the individual clusterings, on the other hand, is assigned to group $G_2$ with weight $2/2=1$ and to the other $G_i$ with weight $0$. Other cases are analogous. In this form, we are back to the idea of voting in a classic classification ensemble, used here to build a membership score of each example for each cluster, as in a soft clustering algorithm.
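The voting rule in this example can be sketched in a few lines of plain Python. The function name soft_membership and the 0-based encoding of $K_i$, $C_i$ and $G_i$ are assumptions made for this illustration only.

```python
def soft_membership(base_labels, meta_maps, n_meta_clusters):
    """Average the votes of several base clusterings for a single example.

    base_labels[i] is the cluster that the i-th base model assigns to the example;
    meta_maps[i] translates that model's cluster ids into metacluster ids.
    """
    votes = [0.0] * n_meta_clusters
    for label, mapping in zip(base_labels, meta_maps):
        votes[mapping[label]] += 1.0
    return [v / len(base_labels) for v in votes]


# Hypothetical 0-based encoding of the example in the text:
# first model:  K_1 -> G_1, K_2 -> G_2, K_3 -> G_2
# second model: C_1 -> G_1, C_2 -> G_3, C_3 -> G_2
meta_maps = [{0: 0, 1: 1, 2: 1}, {0: 0, 1: 2, 2: 1}]

print(soft_membership([0, 2], meta_maps, 3))  # K_1 and C_3 -> [0.5, 0.5, 0.0]
print(soft_membership([2, 2], meta_maps, 3))  # K_3 and C_3 -> [0.0, 1.0, 0.0]
```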

Testing the idea on the digits dataset

To run an experiment with this model, let's play with the set of low-resolution images of handwritten digits, which we can load with the function sklearn.datasets.load_digits.

from sklearn.datasets import load_digits

digits = load_digits(n_class=9)
X = digits.data
X.shape

(1617, 64)

To introduce variance into the individual clusterings so that they do not fully agree (up to some permutation), we can either change the training strategy of sklearn.cluster.KMeans (for example, by reducing the number of initializations it runs to find the best partition in terms of inertia) or bootstrap our training set (inspired by how bagging works in the supervised case). In this experiment, we go with the second option.

import numpy as np
from sklearn.cluster import KMeans
from tqdm import tqdm

n_estimators = 250
n_clusters = 9

km_list = \
[KMeans(n_clusters=n_clusters, random_state=i)
 .fit(X[np.random.RandomState(i).choice(X.shape[0], X.shape[0])]) 
 for i in tqdm(range(n_estimators))]

After training the different sklearn.cluster.KMeans models, we need to train the "Meta K-Means", which uses their centroids as training data.

cluster_centers = np.vstack([km.cluster_centers_ for km in km_list])

meta_kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(cluster_centers)

This lets us build the mappings that group the centroids, translating the individual clusters so that they agree according to the grouping criterion of the "Meta K-Means".

meta_clusters_map = \
[{j: meta_kmeans.labels_[n_clusters*i+j] for j in range(n_clusters)} for i in range(n_estimators)]
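The index arithmetic in this mapping relies on how the centroids were stacked: since each model contributes n_clusters consecutive rows, centroid j of model i sits at row n_clusters*i + j. A small sketch with hypothetical labels (the helper name build_meta_maps is an assumption for illustration) makes this explicit:

```python
def build_meta_maps(meta_labels, n_estimators, n_clusters):
    """Split a flat list of metacluster labels into one dict per base model."""
    if len(meta_labels) != n_estimators * n_clusters:
        raise ValueError("expected one metacluster label per stacked centroid")
    return [
        {j: meta_labels[n_clusters * i + j] for j in range(n_clusters)}
        for i in range(n_estimators)
    ]


# Hypothetical metacluster labels for 2 models with 3 centroids each,
# in stacking order: model 0's centroids first, then model 1's.
meta_labels = [0, 1, 1, 0, 2, 1]
maps = build_meta_maps(meta_labels, n_estimators=2, n_clusters=3)
print(maps)  # [{0: 0, 1: 1, 2: 1}, {0: 0, 1: 2, 2: 1}]
```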

To combine the individual clusterings, we aggregate them in some way (for instance with the mean, which amounts to a simple vote) to obtain a membership score of each example for each cluster.

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer().fit(list(range(n_clusters)))

aggregated_predicts = \
np.array([lb.transform(np.array(list(map(map_dic.get, km.predict(X)))))
          for km, map_dic in zip(km_list, meta_clusters_map)]).mean(axis=0)

aggregated_predicts.shape

(1617, 9)

To check whether what we found makes sense, let's try to interpret the metacentroids (that is, the centroids found when we run sklearn.cluster.KMeans on the centroids of the base sklearn.cluster.KMeans models). Since we are working with the digits dataset, we can look at the image given by plotting the metacentroid of each final cluster.

from itertools import product

import matplotlib.pyplot as plt

fig, ax = plt.subplots(ncols=3, nrows=3, figsize=(4, 4))

plt.gray()
for i, j in product(range(3), range(3)):
    ax[i, j].matshow(meta_kmeans.cluster_centers_[3*i+j].reshape(8, 8))
    ax[i, j].set_xticks([])
    ax[i, j].set_yticks([])
    ax[i, j].set_title(f"Cluster {3*i+j} centroid", fontsize=8)
plt.tight_layout()

Visual inspection lets us name the clusters after the shape of the digits, building the following dictionary:

dict_cluster = {0: 2, 1: 4, 2: 8, 3: 6, 4: 0, 5: 5, 6: 3, 7: 1, 8: 7}

To see the final clusters and where in space the points assigned to uncertain clusters lie, we apply sklearn.manifold.MDS followed by sklearn.manifold.TSNE to reduce the dimensionality of our data.

from sklearn.manifold import MDS, TSNE

X_emb = \
(TSNE(random_state=42).fit_transform(MDS(random_state=42).fit_transform(X)))

It is nice to see that our clusters are consistent with the original digit labels, but the most important plot here is the last one: there are indeed examples that seem harder to assign to a cluster with certainty (such as the images of the digit 8, which are easily confused with other digits, and examples that appear to lie "on the border" between two groups).

from matplotlib import colors

fig, ax = plt.subplots(ncols=4, figsize=(12, 3))

im0 = ax[0].scatter(X_emb[:, 0], X_emb[:, 1], s=3, c=digits.target, cmap="Set1")
cbar0 = plt.colorbar(im0, ax=ax[0], ticks=np.linspace(0.5, 7.5, 9))
cbar0.ax.set_yticklabels(np.arange(0, 9))
ax[0].set_title("Real number class", fontsize=11)

im1 = ax[1].scatter(X_emb[:, 0], X_emb[:, 1], s=3,
                    c=list(map(dict_cluster.get, aggregated_predicts.argmax(axis=1))),
                    cmap="Set1")
cbar1 = plt.colorbar(im1, ax=ax[1], ticks=np.linspace(0.5, 7.5, 9))
cbar1.ax.set_yticklabels(np.arange(0, 9))
ax[1].set_title("Cluster class", fontsize=11)

cmap2 = colors.ListedColormap(["#e41a1c", "#4daf4a"])
im2 = ax[2].scatter(X_emb[:, 0], X_emb[:, 1], s=3,
                    c=(aggregated_predicts.max(axis=1)==1).astype(int), cmap=cmap2)
im2.set_clim(0, 1)
cbar2 = plt.colorbar(im2, ax=ax[2], ticks=[0.25, 0.75])
cbar2.ax.set_yticklabels(["Some uncertainty", "No uncertainty"],
                         rotation=270, ha="center", rotation_mode="anchor", fontsize=9)
cbar2.ax.tick_params(pad=10)
ax[2].set_title("Certainty about the assigned cluster", fontsize=11)

cmap3 = colors.LinearSegmentedColormap.from_list('', colors=["#e41a1c", "#4daf4a"])
im3 = ax[3].scatter(X_emb[:, 0], X_emb[:, 1], s=3,
                    c=aggregated_predicts.max(axis=1), cmap=cmap3, norm=colors.LogNorm())
im3.set_clim(0.73, 1.02)
cbar3 = plt.colorbar(im3, ax=ax[3], ticks=[0.75, 0.8, 0.85, 0.9, 0.95, 1])
cbar3.ax.set_yticklabels([r'$\leq$0.75', '0.80', '0.85', '0.90', '0.95', '1.00'])
ax[3].set_title('Maximum of "predict_proba"', fontsize=11)

for axs in ax:
    clean_axes(axs)
plt.tight_layout()

Looking at the histogram of the maximum of our ".predict_proba", we see that for a reasonable number of examples the clusters found by the individual clusterings disagree slightly, which gives a view of the uncertainty and robustness of their cluster assignment (the central idea of soft clustering algorithms). For most examples, however, the individual sklearn.cluster.KMeans models agree completely.

fig, ax = plt.subplots(figsize=(5, 2.5))
ax.hist(aggregated_predicts.max(axis=1), bins=np.linspace(0, 1, 25))
ax.set_yscale("log")
ax.set_xlabel('Maximum of "predict_proba" per instance')
ax.set_ylabel("Frequency (log scale)")
ax.set_title("Histogram of assigned cluster certainty")
plt.tight_layout()

This view lets us see the examples that are hardest to cluster, giving a notion of instance hardness for our clustering problem which, in our example, seems to be associated with digits that resemble an 8.

import pandas as pd

(pd.DataFrame(aggregated_predicts)[(aggregated_predicts<0.45).all(axis=1)]
 .rename(columns=dict_cluster).T.sort_index().T)

	1	2	3	5	6	7	8
630	0.084	0.06	0.408	0.000	0.000	0.000	0.448
1385	0.204	0.00	0.164	0.196	0.000	0.424	0.012
1386	0.088	0.00	0.060	0.228	0.000	0.312	0.312
1533	0.076	0.00	0.388	0.000	0.196	0.000	0.340
1616	0.032	0.00	0.420	0.000	0.308	0.000	0.240

fig, ax = plt.subplots(ncols=5, figsize=(5, 2.5))

plt.gray()
for axs, i in zip(ax, pd.DataFrame(aggregated_predicts)[(aggregated_predicts<0.45).all(axis=1)].index):
    axs.matshow(X[i].reshape(8,8))
    axs.set_xticks([])
    axs.set_yticks([])
    axs.set_title(f"{i} - Target: {digits.target[i]}", fontsize=7)
plt.tight_layout()

Final remarks

This idea of clustering centroids is not new and can even be used to define the initialization of K-Means. That algorithm is called Refined K-Means [1]; however, it does not seem to have a clear advantage over K-Means++ with multiple initializations (the approach followed by sklearn.cluster.KMeans).

Although there are clearly applications where this view is worth testing, in the experiments run for this discussion the individually found clusters rarely disagree much (we can see this from the significant number of examples with aggregated_predicts.max(axis=1) equal to 1), and the hard clusters obtained at the end of our soft clustering strategy (taking the .argmax) are very similar to the clusters found by a usual K-Means. So I do not think this is an extremely promising technique, although it is worth a try whenever you are interested in K-Means, given the low additional effort.

unique_km_labels = KMeans(random_state=42).fit(X).labels_

(rand_score(unique_km_labels, aggregated_predicts.argmax(axis=1)),
 (aggregated_predicts.max(axis=1)==1).mean())

(0.9799745280650514, 0.6951144094001237)

Finally, it is easy to generalize these ideas to any other centroid-based clustering algorithm, such as K-Medians or K-Medoids. This means we are not necessarily tied to the Euclidean distance, which is the distance used by K-Means.

Rough implementation of the estimator class

If you are interested in using these ideas, they should work with something along the lines of the class implemented below, which is compatible with libraries that follow the scikit-learn code conventions. Just watch out for the case n_clusters=2: sklearn.preprocessing.LabelBinarizer keeps a single column instead of creating two, so the output of .predict_proba will have only one column.

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelBinarizer

class MetaKMeans(BaseEstimator):
    """Meta K-Means clustering.

    A Meta K-Means is a meta estimator that fits several K-Means
    on various sub-samples of the dataset and uses averaging to
    measure uncertainty related to predicted clusters.

    Parameters
    ----------
    n_clusters : int, default=8
        The number of clusters to form as well as the number of
        metacentroids to generate.

    n_estimators : int, default=100
        The number of K-Means in the ensemble.

    random_state : int, default=42
        Controls both the randomness of the bootstrapping of the samples used
        when building the individual K-Means and the randomness of the
        choice of initial centroids of each K-Means.

    KMeans_params : dict, default={}
        Explicitly set some of the base K-Means parameters as **KMeans_params.
    """
    
    def __init__(self, n_clusters=8, n_estimators=100, random_state=42, KMeans_params={}):
        self.n_clusters = n_clusters
        self.n_estimators = n_estimators
        self.random_state = random_state
        self.KMeans_params = KMeans_params

    def fit(self, X, y=None):
        self.estimators_ = \
        [KMeans(n_clusters=self.n_clusters, random_state=i+self.random_state, **self.KMeans_params)
         .fit(X[np.random.RandomState(i).choice(X.shape[0], X.shape[0])]) 
         for i in range(self.n_estimators)]
        
        cluster_centers = np.vstack([km.cluster_centers_ for km in self.estimators_])

        self.meta_kmeans_ = KMeans(n_clusters=self.n_clusters, random_state=42).fit(cluster_centers)
        
        self.metacluster_centers_ = self.meta_kmeans_.cluster_centers_
        
        self.meta_clusters_map_ = \
        [{j: self.meta_kmeans_.labels_[self.n_clusters*i+j] for j in range(self.n_clusters)} for i in range(self.n_estimators)]
        
        self.lb_ = LabelBinarizer().fit(list(range(self.n_clusters)))
        
        return self
    
    def predict_proba(self, X):
        return \
        np.array([self.lb_.transform(np.array(list(map(map_dic.get, km.predict(X)))))
                  for km, map_dic in zip(self.estimators_, self.meta_clusters_map_)]).mean(axis=0)
    
    def predict(self, X):
        return self.predict_proba(X).argmax(axis=1)

class_meta_kmeans_with_params = \
MetaKMeans(n_clusters=9, n_estimators=10, random_state=0, KMeans_params={"init": "random"}).fit(X)

class_meta_kmeans = \
MetaKMeans(n_clusters=9, n_estimators=250, random_state=0).fit(X)
class_predict_probas = class_meta_kmeans.predict_proba(X)

# As I'm choosing the same random_state, I expect results of the class
# to match the ones we did above.
((class_predict_probas == aggregated_predicts).all(),
 (class_meta_kmeans.predict(X) == aggregated_predicts.argmax(axis=1)).all())

(True, True)
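Regarding the n_clusters=2 caveat mentioned before the class: one possible workaround is to replace the binarizer with an encoding that always emits n_clusters columns. The sketch below is written in plain Python to stay dependency-free; the helper name one_hot is hypothetical and is not part of the class above.

```python
def one_hot(labels, n_clusters):
    """Encode integer labels as rows with exactly n_clusters columns."""
    rows = []
    for label in labels:
        if not 0 <= label < n_clusters:
            raise ValueError(f"label {label} outside range(0, {n_clusters})")
        rows.append([1 if k == label else 0 for k in range(n_clusters)])
    return rows


print(one_hot([0, 1, 1], n_clusters=2))  # [[1, 0], [0, 1], [0, 1]]
```

With NumPy, the same encoding could be obtained with np.eye(n_clusters)[labels], assuming labels is an integer array with values in range(n_clusters). Swapping this into predict_proba has not been tested against the results above.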

References

[1] Cluster ensembles: A survey of approaches with recent extensions and applications. Tossapon Boongoen, Natthakan Iam-On. Computer Science Review, Volume 28, 2018.

All files and the environment needed to reproduce the experiments can be found in this post's repository.

A critical use of Boruta

2022-09-05T00:00:00+00:00

If we hold predictive power on the development set fixed, a model with fewer features tends to be less prone to exploiting noise and spurious relations in its training set, which can lead to performance gains outside the lab. A well-done feature selection is therefore an important data-centric tool when modeling supervised machine learning problems.

$\oint$ To illustrate the previous claim, consider that the VC dimension (a measure of the complexity of a hypothesis family) of a perceptron (a linear classifier) is $d+1$, where $d$ is the number of variables used in the model [1]. A larger VC dimension means you need more data to guarantee that the performance measured in training is close to the real performance. In practice, this means that the larger the VC dimension, the higher the risk of overfitting. Consequently, in this example, if two perceptrons have similar training performance but one uses more variables than the other, the one with more variables is more likely to overfit [1].

However, feature selection does not get the care it deserves in most Machine Learning courses. Few methods are presented, and only superficially. The few places that discuss the topic generally still focus on techniques that scale poorly as the number of variables grows and are therefore impractical in most industry applications (such as the greedy strategies of sklearn.feature_selection.SequentialFeatureSelector).

At Serasa Experian's DataLab, feature selection becomes extremely relevant because of the nature of the problems we work on. In the vast majority of cases we have a few thousand variables available in Serasa's data bureau, and it is not easy to tell in advance which features will bring the most gains. We need techniques that are robust to the sheer number of variables while still guaranteeing a selection that makes sense.

In this post, we motivate the construction of Boruta [2], one of the techniques most used by DataLab's data scientists for feature selection, with some tips for practical use. We also illustrate the use of the boruta.BorutaPy class, from the scikit-learn-contrib ecosystem (that is, compatible with libraries that follow the scikit-learn code conventions).

To illustrate the feature selection problem, we use sklearn.datasets.make_classification to create a generic classification problem in which the number of variables useful for prediction is a parameter of the function.

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

N_FEATURES = 20

X, y = \
make_classification(n_samples=1000,
                    n_features=N_FEATURES,
                    n_informative=2,
                    n_redundant=2,
                    n_classes=2,
                    flip_y=0.1,
                    shuffle=False,
                    random_state=42)

X = pd.DataFrame(X, columns=[f'column_{i+1}' for i in range(N_FEATURES)])

X.head()

	column_1	column_2	column_3	...	column_18	column_19	column_20
0	-1.050478	-1.323568	0.912474	...	1.238946	0.209659	-0.491636
1	-1.580834	-2.747104	1.777419	...	0.152355	-0.822420	1.121031
2	-0.885704	-0.614600	0.501004	...	0.193590	0.850898	-0.137372
3	-1.525438	-2.967793	1.884777	...	-0.316073	0.615771	1.203884
4	-1.076826	-1.014619	0.752233	...	0.300474	0.622207	-1.138833

5 rows × 20 columns

Since we chose 2 informative features and 2 redundant features, the 4 most important features are the columns column_1, column_2, column_3 and column_4 (with shuffle=False, make_classification places the informative features first, followed by the redundant ones).

Motivating the construction of Boruta

Measuring the importance of a variable

One of the most common techniques for selecting variables is to take advantage of models that, in some way, select them during training itself. Trees and, consequently, tree ensembles are perhaps the best example: through the greedy strategy of making the best possible split at each step (according to some criterion, usually related to the purity of the resulting leaves in the case of classification), we are always choosing relevant variables. Weakly discriminative variables are used far less often than the variables that actually help the prediction [3].

This process naturally yields importance measures for the variables, such as the number of times a variable is used in splits (the default for the .feature_importances_ attribute of LightGBM ensembles, such as lightgbm.LGBMClassifier) or a weighted sum of the impurity reduction obtained in the splits on that feature (the default for scikit-learn tree ensembles, such as sklearn.ensemble.RandomForestClassifier, sklearn.ensemble.ExtraTreesClassifier and sklearn.ensemble.GradientBoostingClassifier, and also what the LightGBM attribute returns when we set importance_type='gain').

With one of these natural measures of importance, it is reasonable to sort our variables from most to least important.

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=42).fit(X, y)

df_imp = \
(pd.DataFrame(list(zip(X.columns, rfc.feature_importances_)),
              columns=['feature_name', 'feature_importance'])
 .sort_values(by='feature_importance', ascending=False)
 .reset_index(drop=True)
)

df_imp

	feature_name	feature_importance
0	column_2	0.278748
1	column_3	0.201150
2	column_4	0.092612
3	column_1	0.085144
...	...	...
16	column_5	0.018714
17	column_16	0.018641
18	column_18	0.017565
19	column_20	0.016912

20 rows × 2 columns

$\oint$ There are other ways to measure the importance of a variable, for example using its SHAP value contributions. Since shap.Explainer(model).shap_values(X) returns a measure of how much each variable contributed to each prediction, averaging it over all examples tells us how useful the variable was for discriminating the examples as a whole. So that the values do not cancel out (imagine a variable that pushes the prediction up for some values and down for others), we take the absolute value before averaging. Note that the ordering of importances given by SHAP can differ from the ordering given by the estimator's usual .feature_importances_ attribute, as happens in our example.

import shap

explainer = shap.TreeExplainer(rfc)
shap_vals = explainer.shap_values(X)

df_imp_shap = \
(pd.DataFrame(list(zip(X.columns, np.abs(shap_vals[0]).mean(axis=0))),
              columns=['feature_name', 'shap_importance'])
 .sort_values(by='shap_importance', ascending=False)
 .reset_index(drop=True)
)

df_imp_shap

	feature_name	shap_importance
0	column_2	0.197645
1	column_3	0.107211
2	column_4	0.043797
3	column_1	0.041570
...	...	...
16	column_18	0.005851
17	column_16	0.005268
18	column_5	0.005099
19	column_20	0.005019

20 rows × 2 columns
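The reason for taking the absolute value before averaging can be seen with a tiny worked example. The contribution values below are hypothetical and do not come from the model above; the helper names are chosen for illustration.

```python
def mean_signed(values):
    """Plain average: positive and negative contributions cancel out."""
    return sum(values) / len(values)


def mean_absolute(values):
    """Average of absolute values: measures how much the feature moves predictions."""
    return sum(abs(v) for v in values) / len(values)


# Hypothetical per-example contributions of a single feature: it pushes
# some predictions up and others down by the same amounts.
contributions = [0.5, -0.5, 0.25, -0.25]

print(mean_signed(contributions))    # 0.0
print(mean_absolute(contributions))  # 0.375
```

The signed mean would rank this feature as useless even though it changes every prediction, while the mean absolute value reflects its actual influence.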

We have not talked about Boruta yet, but it relies on this ordering for its analysis and is usually implemented using the estimator's own importance measure (the .feature_importances_ attribute, or .coef_ for linear algorithms). This difference motivated some contributors to implement Boruta-Shap. However, incorporating SHAP into the Boruta procedure does not seem trivial, and the library tends to be slow.

A possible alternative is to adapt your classifier's .feature_importances_ attribute by hand, storing X at training time so it can be used to compute the SHAP values, as implemented here:

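import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.validation import check_is_fitted


class SHAPImportanceRandomForestClassifier(RandomForestClassifier):
    def fit(self, X, y, sample_weight=None):
        # Keep the training data so SHAP values can be computed later.
        self.X_ = X
        super().fit(X, y, sample_weight=sample_weight)
        return self

    @property
    def feature_importances_(self):
        check_is_fitted(self)
        explainer = shap.TreeExplainer(self)
        shap_vals = explainer.shap_values(self.X_)
        # Mean absolute SHAP value per feature, using the first class's values.
        return np.abs(shap_vals[0]).mean(axis=0)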

from shap_feature_importances_ import SHAPImportanceRandomForestClassifier

rfc_shap = SHAPImportanceRandomForestClassifier(random_state=42).fit(X, y)
rfc_shap.feature_importances_

array([0.04156985, 0.19764501, 0.10721142, 0.04379691, 0.00509938,
       0.00967927, 0.00900892, 0.00769202, 0.01053711, 0.00973848,
       0.00764462, 0.00725161, 0.00690175, 0.00718789, 0.00600269,
       0.00526766, 0.00659648, 0.00585107, 0.00726538, 0.00501896])

Note that this implementation uses the training set itself to compute the SHAP values. There is some debate here, but keep in mind that importance values computed with SHAP (mean absolute value) on the test set can differ from those computed on the training set. If you want that level of rigor, you may want to set aside part of your dataset to compute the SHAP values. I implement this idea in the class XSHAPImportanceRandomForestClassifier in the file shap_feature_importances_.py in this post's repository. However, to sleep soundly, keep in mind that the usual .feature_importances_ of tree-based algorithms is computed on the training set, so computing SHAP on the training set is not such a great blasphemy.

Selecting the K "best variables"

If we want our model to have only the K most useful features, the natural way to choose them is to take the K variables with the highest importance values.

K = 4

(df_imp
 .head(K)
 .feature_name
 .to_list()
)

['column_2', 'column_3', 'column_4', 'column_1']

This is one of the most common feature selection strategies in industry, but it raises some questions. The first and most immediate one is how to choose the ideal number of variables K. In this illustrative case we know that 4 is the correct number, but in most real applications it is unrealistic to know this in advance.

$\oint$ A widely used strategy, which we will not focus on much, is to grow the model's feature list following the ordering given by the model trained on all features, treating K as a hyperparameter to be optimized. In the example below, we do this with sklearn.model_selection.GridSearchCV, building a SelectKTop class that follows the pattern required of scikit-learn feature selectors, that is, the interface that sklearn.feature_selection.SelectorMixin requires. You can see the implementation of this class in the file selectktop_selector.py in this post's repository.

PS: The SelectKTop class is roughly equivalent to sklearn.feature_selection.SelectFromModel, whose existence I discovered after I finished writing the post!

from selectktop_selector import SelectKTop

import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline

grid = (
    GridSearchCV(
        make_pipeline(SelectKTop(random_state=42),
                      RandomForestClassifier(random_state=42)),
        param_grid={'selectktop__K': np.arange(1,N_FEATURES+1)},
        cv=RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42),
        scoring='roc_auc')
    .fit(X, y))

df_cv = (
    pd.DataFrame(grid.cv_results_)[[
        'param_selectktop__K',
        'mean_test_score',
        'std_test_score'
    ]])

cv_best = (
    df_cv
    .sort_values(by='mean_test_score', ascending=False)
    .reset_index(drop=True)
    .loc[0])

plt.errorbar(df_cv.param_selectktop__K, df_cv.mean_test_score, 1.96*df_cv.std_test_score)
plt.scatter(cv_best.param_selectktop__K, cv_best.mean_test_score, s=100)
plt.ylim(0.75, 1)
plt.xlabel('K of SelectKTop')
plt.xticks(df_cv.param_selectktop__K.astype(int))
plt.ylabel('Performance (ROCAUC)')
plt.show()

In our controlled experiment, we end up with slightly more variables than the correct number (one extra, while keeping all the useful ones).

grid.best_estimator_.steps[0][1].get_feature_names_out()

array(['column_1', 'column_2', 'column_3', 'column_4', 'column_10'],
      dtype=object)

It is worth mentioning that we can make this method more robust by varying the random_state of the base_estimator, obtaining a distribution of importances for each variable instead of a single (naturally noisier) value. Using this technique with SHAP as the importance measure (for example, passing SHAPImportanceRandomForestClassifier as the base_estimator of SelectKTop) is widely used by some DataLab data scientists as an alternative to Boruta which, as we will see, tends to be very slow.

Selecting the K best variables with a cutoff suggested by a random variable

Creating a noise variable, that is, one known to be useless for prediction, gives us a cutoff for filtering the variables that actually help the prediction. The idea of this approach is to measure the importance of the random variable and keep only the variables that prove more important than it.

Adding the new column, for example sampled independently from a $\mathcal{N}(0,1)$ random variable, gives us a new list of variable importances.

normal_noise_X = (X.assign(noise_column = np.random.RandomState(42).normal(size=X.shape[0])))
normal_noise_X[normal_noise_X.columns[::-1]].head()

	noise_column	column_20	column_19	...	column_3	column_2	column_1
0	0.496714	-0.491636	0.209659	...	0.912474	-1.323568	-1.050478
1	-0.138264	1.121031	-0.822420	...	1.777419	-2.747104	-1.580834
2	0.647689	-0.137372	0.850898	...	0.501004	-0.614600	-0.885704
3	1.523030	1.203884	0.615771	...	1.884777	-2.967793	-1.525438
4	-0.234153	-1.138833	0.622207	...	0.752233	-1.014619	-1.076826

5 rows × 21 columns

normal_noise_rfc = RandomForestClassifier(random_state=42).fit(normal_noise_X, y)

df_imp_normal_noise = \
(pd.DataFrame(list(zip(normal_noise_X.columns, normal_noise_rfc.feature_importances_)),
              columns=['feature_name', 'feature_importance'])
 .sort_values(by='feature_importance', ascending=False)
)

df_imp_normal_noise

	feature_name	feature_importance
1	column_2	0.266446
2	column_3	0.205667
3	column_4	0.087548
0	column_1	0.084593
...	...	...
8	column_9	0.019112
4	column_5	0.018706
18	column_19	0.018264
19	column_20	0.017692

21 rows × 2 columns

Since the last column of normal_noise_X is our known-noise column, the idea of this technique is to select only the variables whose importance exceeds the threshold defined by the importance of this unrelated variable.

normal_noise_importance = \
normal_noise_rfc.feature_importances_[-1]

np.array(
 df_imp_normal_noise
 .query(f"feature_importance > {normal_noise_importance}")
 .feature_name
)

array(['column_2', 'column_3', 'column_4', 'column_1', 'column_6',
       'column_10', 'column_14'], dtype=object)

Note that choosing $\mathcal{N}(0,1)$ for the noise variable was entirely arbitrary. This choice matters, however, and can change which variables are selected. In our controlled example, switching the noise to $\textrm{Exp}(1)$ (in a run that also uses a different random_state for the forest) makes us select a different final set of variables purely by chance.

exp_noise_X = \
(X.assign(noise_column = np.random.RandomState(42).exponential(size=X.shape[0])))
exp_noise_rfc = \
RandomForestClassifier(random_state=0).fit(exp_noise_X, y)
exp_noise_importance = \
exp_noise_rfc.feature_importances_[-1]

np.array(
 pd.DataFrame(list(zip(exp_noise_X.columns, exp_noise_rfc.feature_importances_)),
              columns=['feature_name', 'feature_importance'])
 .sort_values(by='feature_importance', ascending=False)
 .query(f"feature_importance > {exp_noise_importance}")
 .feature_name
)

array(['column_2', 'column_3', 'column_4', 'column_1', 'column_14',
       'column_6', 'column_10', 'column_9', 'column_12', 'column_13',
       'column_7', 'column_18'], dtype=object)

This reveals a problem with the method. Although it is powerful, giving us an interesting way to select variables without choosing K arbitrarily, the choice of the noise variable's distribution is a relevant source of variation.

In many cases, whether a variable is discrete or continuous can influence the importance measure (as with trees, which are more likely to pick a continuous noise variable because it offers more candidate splits), and the very scale of the added feature can distort the measurement (for example, if we are using the coefficients of a sklearn.linear_model.Lasso).

All this variability means that, sometimes, a bad feature is selected while a good one is discarded by bad luck.

Boruta tries to deal with both issues at once: it keeps the marginal distributions of the noisy features equal to the marginal distributions of the original features, while trying to be robust to variability by repeating the experiment several times.
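One way to obtain a noisy feature with exactly the same marginal distribution as a real one is to shuffle the real column across rows, which breaks its link to the target while keeping the same set of values. A minimal sketch of that step, with a hypothetical column and a helper name (make_shadow) chosen for illustration:

```python
import random


def make_shadow(column, rng):
    """Return a row-shuffled copy of a column (same values, new order)."""
    shadow = list(column)
    rng.shuffle(shadow)
    return shadow


rng = random.Random(0)
feature = [0.1, 0.4, 0.4, 1.3, 2.7, 3.0]  # hypothetical feature values
shadow = make_shadow(feature, rng)

# The multiset of values, and hence the marginal distribution, is unchanged.
print(sorted(shadow) == sorted(feature))  # True
```

This sidesteps the arbitrary choice between, say, a normal and an exponential noise column: each noisy feature inherits its distribution, scale and discreteness from a real feature.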

General ideas behind Boruta

There are already many useful texts that explain Boruta didactically and with examples. Since the goal of this post is not to repeat the literature but to compile the central ideas for practical use, we only mention the main aspects and invite you to read other references on the topic in detail, such as the post Boruta Explained Exactly How You Wished Someone Explained to You. The construction we made earlier should make Boruta's ideas even clearer, justifying why it works the way it does.

Em resumo, o Boruta [2,4]:

Cria variáveis não correlacionadas com a target ao embaralhar, entre as linhas, variáveis já presentes no dataset (essas são as variáveis que chamamos de shadow).
Lida com a variabilidade repetindo o processo várias vezes e marcando quantas vezes a nossa variável de interesse ficou atrás do percentil perc dos .feature_importances_ das shadow features (por default perc=100, portanto, comparamos com o máximo dos .feature_importances_ das shadow features, isto é, se alguma shadow for melhor, já descartamos aquela variável de interesse naquela rodada).
Por fim, um teste de hipótese é feito para avaliar se podemos afirmar com alguma significância estatística alpha que a feature de interesse é melhor que o percentil perc da importância das shadow features.
O teste de hipótese divide o conjunto de features em três categorias:
- As variáveis que estatisticamente são variáveis melhores que as shadow features (são as chamadas de .support_);
- As variáveis que estatisticamente são equivalentes às variáveis shadow (variáveis que excluímos);
- As variáveis que não são possíveis de afirmar com significância estatística como sendo melhores que as variáveis shadow (.support_weak_).
Na prática, a partir do momento que ele tem confiança de que uma determinada variável não é importante, ele já a exclui das próximas iterações.
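
The steps above can be sketched in a few lines. This is only an illustration under loud assumptions: the absolute Pearson correlation with the target stands in for .feature_importances_ so that the example needs nothing beyond the standard library, and the helper names (abs_corr, boruta_round) are hypothetical, not part of boruta.BorutaPy.

```python
import random
from statistics import mean

def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here only as a stand-in importance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return abs(cov) / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

def boruta_round(columns, y, rng, perc=100):
    """One round: build shadows, score everything, return a hit flag per column."""
    # Shadow features: each original column shuffled across rows,
    # which keeps its marginal distribution but breaks the link to y.
    shadows = []
    for col in columns:
        shadow = list(col)
        rng.shuffle(shadow)
        shadows.append(shadow)
    real_imp = [abs_corr(col, y) for col in columns]
    shadow_imp = sorted(abs_corr(s, y) for s in shadows)
    # Threshold: nearest-rank perc-th percentile of the shadow importances
    # (perc=100 is simply the maximum).
    idx = max(0, -(-perc * len(shadow_imp) // 100) - 1)
    threshold = shadow_imp[idx]
    return [imp > threshold for imp in real_imp], shadows

rng = random.Random(42)
n = 200
y = [rng.randint(0, 1) for _ in range(n)]
informative = [label + rng.gauss(0, 0.3) for label in y]  # depends on the target
noise = [rng.gauss(0, 1) for _ in range(n)]               # independent of the target
columns = [informative, noise]

n_rounds = 20
hit_counts = [0, 0]
for _ in range(n_rounds):
    hits, shadows = boruta_round(columns, y, rng)
    hit_counts = [c + int(h) for c, h in zip(hit_counts, hits)]

print(hit_counts)  # hits per column over n_rounds
```

The informative column beats the shadows in every round, while the noise column only wins occasionally by chance; the hypothesis test described above is what turns these counts into a decision.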

boruta.BorutaPy

First, we need to instantiate a base_estimator that will be used inside boruta.BorutaPy to compute the importance of the variables (through .feature_importances_ or .coef_). It is worth stressing that we can add any hyperparameters we find relevant to the problem, such as class_weight if the problem is heavily imbalanced.

When using an ensemble of trees, keep in mind that deeper trees will change the .feature_importances_ but will take longer to train. Using shallower trees is justifiable, since the most significant gains usually come from the first splits.

boruta.BorutaPy accepts any estimator that exposes the .feature_importances_ attribute after running the .fit(X, y) method [5]. You can use this to your advantage by picking the estimators best suited to your problem, including faster tree-based algorithms such as sklearn.ensemble.ExtraTreesClassifier (keep in mind that Extremely Randomized Trees will have their .feature_importances_ affected by the way they are built, and this can impact the final choice of variables).

To illustrate the practical use of the library, I will use the SHAPImportanceRandomForestClassifier that we created earlier (basically a sklearn.ensemble.RandomForestClassifier with SHAP in place of the usual .feature_importances_).

from boruta import BorutaPy

boruta_forest = SHAPImportanceRandomForestClassifier(max_depth=7, random_state=42)

One point that is not necessarily clear in the documentation is that the n_estimators parameter of boruta.BorutaPy overrides the estimator's own n_estimators, as we can see in the BorutaPy source code:

# set n_estimators
if self.n_estimators != 'auto':
    self.estimator.set_params(n_estimators=self.n_estimators)

By default, n_estimators=1000. If n_estimators='auto', a rule based on the number of features under evaluation is used to choose the number of trees in the ensemble.

Finally, alpha and perc are the other important parameters of boruta.BorutaPy that you should pay attention to:

perc (the percentile of the shadow features' .feature_importances_ used to decide whether the variables did well in a given round) is an int from 0 to 100. The closer to 100, the stricter we are when evaluating our features. Due to randomness, some shadow features can end up with a large .feature_importances_, which makes the cutoff too strict; this is bad because we exclude marginal variables that are relevant but do not have such a pronounced importance. The default is 100, but I recommend lowering it slightly (to 90, for example) if you are working on a problem with many variables, since in that setting there is a higher chance of some shadow feature having a high importance.
alpha is a float from 0 to 1 and it defines the partition of the set of variables (.support_weak_, .support_ and excluded), since it determines how certain we want to be before stating that a given feature is or is not relevant to the classification (or regression) problem. The default is 0.05, and I do not usually change it: I prefer to keep it fixed and vary perc, since the two are related.
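
To make the role of alpha more concrete, here is a simplified sketch of how a count of winning rounds could be turned into the three categories. It assumes a plain binomial test with success probability 0.5 under the null hypothesis and ignores any multiple-testing correction that a real implementation may apply; the function names are hypothetical.

```python
from math import comb

def binom_tail_ge(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def binom_tail_le(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(0, k + 1))

def decide(hits, n_rounds, alpha=0.05):
    """Classify a feature from the number of rounds in which it beat the shadows."""
    if binom_tail_ge(hits, n_rounds) < alpha:
        return "confirmed"   # wins too often to be a coin flip
    if binom_tail_le(hits, n_rounds) < alpha:
        return "rejected"    # loses too often to be a coin flip
    return "tentative"       # not enough evidence either way

for hits in (0, 10, 14, 15, 20):
    print(hits, decide(hits, 20))
```

With 20 rounds and alpha=0.05, 15 wins are enough to confirm a feature while 14 are not, which shows how a smaller alpha (or a stricter perc, which lowers the number of wins) pushes more features into the tentative group.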

boruta = \
(BorutaPy(
    estimator=boruta_forest,
    n_estimators=50,
    max_iter=100, # number of trials to perform
    random_state=42)
 .fit(np.array(X), np.array(y)) # fit accepts np.array, not pd.DataFrame
)

Finally, it is easy to retrieve the features through the .support_ and .support_weak_ attributes.

green_area = X.columns[boruta.support_].to_list()
blue_area = X.columns[boruta.support_weak_].to_list()

print('Support columns:', green_area)
print('Weak support columns:', blue_area)

Support columns: ['column_1', 'column_2', 'column_3', 'column_4', 'column_10']
Weak support columns: ['column_9']

Trade-off between “selection quality” and “time” when undersampling

When the dataset is very large, boruta.BorutaPy can take a long time to run, because it creates as many shadow variables as there are variables in the initial set. In many practical applications it is necessary to apply boruta.BorutaPy to a subset of the training set.

Here we run an experiment to see, in a synthetic make_classification case with n_samples=5000, n_features=100, n_informative=40 and n_redundant=10, how the variables chosen by boruta.BorutaPy change as we vary the frac parameter of a .sample applied to the development set.

from boruta_sample_experiment import experiment, plot_heatmap, plot_percentage_time

dic_sample, matrix, X_big, y_big = \
experiment(fracs=[0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])

100%|██████████| 11/11 [14:08<00:00, 77.18s/it]

Since the number of informative variables plus the number of redundant variables is 50, half of our features are important in this controlled example. In the plot below, for different values of frac (the fraction of the rows used to train boruta.BorutaPy), we see which variables are being chosen. Ideally, boruta.BorutaPy should identify that the first 50 variables (x axis) are the useful ones and select them (painting them green), while excluding the other 50 (painting them blue), since they are noise. As we vary frac (y axis), we see how it behaves.

plot_heatmap(dic_sample, matrix)

In the first figure below, we see a summary of the previous plot as frac varies (x axis), showing the percentage of useful (green) and useless (orange) variables that are chosen. The plot next to it shows a time analysis (training time of boruta.BorutaPy) and the performance of the model trained with the variables chosen at that value of frac.

plot_percentage_time(dic_sample, matrix, X_big, y_big)

As we can see, we do not need all the samples to train boruta.BorutaPy. In the previous example, although the dataset has 5000 rows, around 3000 examples were already enough to find all 50 useful variables for our problem.

In my experience with boruta.BorutaPy, I feel comfortable with _a sample with 15 times more examples than features (that is, n_samples>=15*n_features)_. At that threshold, I usually get good variable selection results and the algorithm runs (in a time acceptable for development) with a controlled max_depth. As a numerical example: if, at DataLab, I am working on a problem with 5 thousand variables, I feel comfortable running boruta.BorutaPy on a sample of 75 thousand rows, even if the development set has many more examples.
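
The rule of thumb is simple arithmetic, and a small helper makes it easy to turn into a frac value. The function name and the ratio argument are hypothetical; only the 15x factor comes from the text.

```python
def boruta_sample_frac(n_rows, n_features, ratio=15):
    """Fraction of rows to keep so that n_samples >= ratio * n_features."""
    target_rows = ratio * n_features
    # Never ask for more rows than the dataset has.
    return min(1.0, target_rows / n_rows)

# 5 thousand features on a 1 million row development set: keep 75 thousand rows.
print(boruta_sample_frac(1_000_000, 5_000))
# The synthetic experiment above: 100 features and 5000 rows.
print(boruta_sample_frac(5_000, 100))
```

The resulting value can be passed as the frac argument of a pandas .sample call. Note that for the synthetic experiment the rule suggests 1500 rows, fewer than the roughly 3000 that were needed there to recover every useful variable, so it is a starting point and not a guarantee.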

On the other hand, the previous example shows that this is not always best, even in terms of time. In practice, boruta.BorutaPy will not run for all max_iter iterations if it is already sure (at significance level alpha) about the variables it considers useful for the problem, since it excludes (or selects) them along the way. In the previous experiment, having more examples actually made boruta.BorutaPy become certain about the variables faster. In practice, this rarely happens.

Using Boruta in practice and some alternatives

The ideas behind boruta.BorutaPy are very interesting, but the final algorithm is costly in time. Fortunately, we can borrow the ideas from its construction to build clever variations that serve as alternatives if an initial run (with max_depth ~ 10, perc=90 and n_estimators=500) is taking too long:

Use SelectKTop with a more robust .feature_importances_ metric (such as SHAP, using something like our SHAPImportanceRandomForestClassifier), taking care with the choice of K;
Adapt the SelectKTop we built into an even more robust version that works with a distribution of .feature_importances_ instead of a single estimator (by the way, this is a great exercise for the reader interested in understanding the scikit-learn API better);
Adapt SelectKTop into a "SelectAboveNoise", as explained earlier, creating the random variables with numpy.random (another very good exercise);
Use boruta.BorutaPy with faster algorithms (such as sklearn.ensemble.ExtraTreesClassifier), remembering that their (even more randomized) training will affect the .feature_importances_ and, consequently, the final result;
Reduce the sample used to train boruta.BorutaPy, respecting the rule of thumb n_samples>=15*n_features;
Change the algorithm more structurally so that it creates fewer shadow variables in problems with many variables (to be tested).

If your problem is reasonably small, using boruta.BorutaPy with SHAP and tuning the hyperparameters of boruta.BorutaPy is a good option. For that, the Boruta I created in the boruta_selector.py file in this post's repository will be useful. It is already in the proper scikit-learn Selector format and can be used in the same way we saw SelectKTop being used (with a pipeline and any scikit-learn BaseSearchCV).

Conclusion

Variable selection is a necessary topic when we want to guarantee a robust model. In this post we saw one of the most useful techniques for this problem and, by understanding its ideas, discussed how to adapt it to a variety of specific cases. Even if you cannot use Boruta in your own problem, the ideas presented here let you do variable selection with a better understanding of the flaws and benefits of the usual approaches in industry.

References

[1] Foundations of Machine Learning. Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. MIT Press, Second Edition, 2018.

[2] Feature Selection with the Boruta Package. Miron B. Kursa, Witold R. Rudnicki. Journal of Statistical Software.

[3] Decision and Classification Trees, Clearly Explained!!!. Josh Starmer. StatQuest with Josh Starmer.

[4] Boruta Explained Exactly How You Wished Someone Explained to You. Samuele Mazzanti. Towards Data Science.

[5] boruta_py README.md documentation. Daniel Homola.

For more practical tips (and with a much better argument from authority than mine), the author of Boruta has the guide Boruta for those in a hurry which, despite being written in R, has interesting practical advice from someone who knows the implementation in great depth.

All files and the environment needed to reproduce the experiments can be found in this post's repository.

This post was originally published on the Experian DataLab Medium! Drop by the post and leave a clap if you think it makes sense! :D

Covariate Shift: Binary Classifier

2020-08-30T00:00:00+00:00

This post is part of a series that discusses the problem of covariate shift. I assume you already know the motivation of the problem and what we are interested in identifying and correcting. If you have not read the first post of this series yet, I suggest reading it.

Now we focus on identifying covariate shift in the joint distribution. The problem can be stated as follows: given random vectors $X$ and $Z$ and two independently sampled sets of observations $\{x_1, x_2, \cdots, x_N \} $ and $\{z_1, z_2, \cdots, z_M \} $, we want to understand whether the joint distribution is the same, that is, whether $X\sim Z$, by studying only the collected samples. In the context of dataset shift, which is what we are particularly interested in, the random vector $X$ represents the distribution of the covariates in the training set and the random vector $Z$ represents the distribution of the explanatory variables in the production data.

Previously, in the second post of the series, we discussed a technique for finding changes in the marginal distributions of the random vectors, the QQ-plot. We also suggested a numerical variation of the visual technique.

Now we will use supervised machine learning to identify problems in supervised machine learning.

Understanding the classification problem

The binary classification problem arises naturally in this setting. If we have two samples from possibly different distributions, we can train a model that tries to identify whether each data point comes from distribution $X$ or from distribution $Z$.

If the binary classifier can tell them apart, then the distribution has changed. If it cannot, keeping an accuracy close to chance, then we trust that the distribution has stayed similar.

Let us illustrate this technique on the data that caused the initial discomfort presented at the end of the [previous post](https://vitaliset.github.io/covariate-shift-1-qqplot/). Here it becomes clear that analyzing only the marginal distributions is not always enough.

Explicitly, we have the random vectors $X= (X_1,X_2)$ and $Z=(Z_1, Z_2)$ such that

\[\begin{equation*} \begin{pmatrix}X_{1}\\ X_{2} \end{pmatrix} \sim \mathcal{N} \begin{pmatrix} \begin{bmatrix} 0\\ 0 \end{bmatrix} , \begin{bmatrix} 1 & 0.75 \\ 0.75 & 1 \end{bmatrix} \end{pmatrix} \textrm{, and }\begin{pmatrix}Z_{1}\\ Z_{2} \end{pmatrix} \sim \mathcal{N} \begin{pmatrix} \begin{bmatrix} 0\\ 0 \end{bmatrix} , \begin{bmatrix} 1 & -0.75 \\ -0.75 & 1 \end{bmatrix} \end{pmatrix} . \end{equation*}\]

def sample(n, t = 1):
    return np.random.multivariate_normal(mean = [0,0], cov = [[1,t*0.75], [t*0.75,1]], size = n).T

X1, X2 = sample(1000)
Z1, Z2 = sample(1000, -1)

Figure 1: samples from the distributions $X$ ($s=0$) and $Z$ ($s=1$), with opposite correlations between the coordinates.

The idea is simple: we organize our data by creating a new column that tells us whether the data point comes from distribution $X$ ($s=0$) or from distribution $Z$ ($s=1$).

df = pd.DataFrame({'variavel_1':np.concatenate([X1,Z1]), 'variavel_2':np.concatenate([X2,Z2]), 's':[0]*X1.shape[0]+[1]*Z1.shape[0]})

X_miss = np.asarray(df.drop(['s'],axis=1))
S_miss = np.asarray(df['s'])

variable 1	variable 2	y	s
0.178105	0.651739	$y_1$	0
0.464192	-0.461877	$y_2$	0
1.0948	0.823703	$y_3$	0
…	…	…	…
0.393783	-0.681826	?	1
0.623834	-0.344885	?	1
-0.800357	0.444416	?	1

Here, to mirror the real setting in which we apply this model, I added a column for $y$, which would be the target variable of our original problem. We will not use it at any point when identifying covariate shift, which is expected since we do not have the targets of the new data found in production.

With this structure in place, the idea is simple. We build a classifier that uses variables 1 and 2 to predict $s$. If its result on a test set is poor, then the data from $X$ and $Z$ are indistinguishable and we conclude that $X\sim Z$. If our classifier has good metrics, it means that the distributions differ.

Building and evaluating the binary classifier

First, we split our data into 2 sets: one for training and one for testing.

X_miss_train, X_miss_test, S_miss_train, S_miss_test = train_test_split(X_miss, S_miss, test_size = 0.8)

Now we can use any binary classifier. Since I am starting to fall in love with Vapnik's work, I will use a Support Vector Machine. The "default" hyperparameters of SVMs usually do a good job, but in an ideal world we can run a small hyperparameter optimization maximizing the roc_auc_score metric.

param = {'C': np.geomspace(0.01,100,13), 'gamma': ['scale']+list(np.geomspace(0.1,100,10)), 'kernel': ['rbf']}
grid_search = GridSearchCV(SVC(probability=True), param, cv = 5, scoring= ['roc_auc','accuracy'], refit = 'roc_auc', return_train_score=True)
grid_search.fit(X_miss_train, S_miss_train)

Next, we refit the best model found on the training data and evaluate its performance on the test set.

svm = SVC(probability=True, **grid_search.best_params_)
svm.fit(X_miss_train,S_miss_train)

print('acuracia: ',accuracy_score(S_miss_test,svm.predict(X_miss_test)))
print('roc_auc: ',roc_auc_score(S_miss_test,svm.predict(X_miss_test)))
print('phi coeficiente: ',matthews_corrcoef(S_miss_test,svm.predict(X_miss_test)))

acuracia:  0.72625
roc_auc:  0.7268930344332967
phi coeficiente:  0.4827618287310226

We do not have state-of-the-art accuracy, but our model has clearly identified a pattern and can discriminate data as coming from one distribution or the other.
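
A note on the roc_auc value printed above: it is computed from svm.predict, that is, from hard 0/1 labels. ROC AUC is a ranking metric and is more commonly computed from continuous scores, such as svm.predict_proba(X)[:, 1] or svm.decision_function(X). The pure-Python sketch below (the roc_auc helper is written here only for illustration, using the pairwise definition of AUC) shows how thresholding can throw away ranking information.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.10, 0.30, 0.40, 0.45]            # positives always rank higher
hard = [int(s > 0.5) for s in scores]        # every score falls below the 0.5 cutoff

print(roc_auc(labels, scores))  # uses the full ranking
print(roc_auc(labels, hard))    # only sees the thresholded labels
```

In this toy case the scores rank every positive above every negative, but the thresholded predictions are all equal, so the AUC computed from them collapses to that of a constant predictor.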

$\oint$ A less classic but very useful metric is the Matthews correlation coefficient. Inspired by Pearson's correlation coefficient, the goal is to measure correlation between categorical attributes. This gave rise to Pearson's phi coefficient, whose idea is to generalize the correlation coefficient between our prediction and the true values of the binary target. It is a numerical way of summarizing the confusion matrix. It is computed as

\[\begin{equation*} \textrm{MCC} = \frac{T_p \, T_n - F_p \, F_n}{\sqrt{(T_p+F_p)(T_p+F_n)(T_n+F_p)(T_n+F_n)}}, \end{equation*}\]

where $T_p$ is the number of true positives, $T_n$ the number of true negatives, $F_p$ the number of false positives and $F_n$ the number of false negatives. Although it may look a bit confusing, the numerator multiplies the counts of correctly classified examples and subtracts the product of the incorrectly classified ones. The denominator acts as a normalization that keeps the result between $-1$ and $1$, where $1$ means a perfect prediction, $0$ a random prediction and $-1$ a completely inverted prediction.
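
As a quick sanity check of the formula, here is a direct implementation from the four confusion-matrix counts. Returning 0 when the denominator vanishes is a convention assumed here; in practice sklearn.metrics.matthews_corrcoef, used in the code above, does the work.

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: a degenerate confusion matrix gives 0.
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

print(mcc(tp=50, tn=50, fp=0, fn=0))     # perfect prediction
print(mcc(tp=25, tn=25, fp=25, fn=25))   # no better than chance
print(mcc(tp=0, tn=0, fp=50, fn=50))     # every label swapped
```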

In our illustrative two-dimensional case, we can draw the contour lines of the SVM's predict_proba and see that it has learned the most likely regions of each distribution.

Figure 2: contour lines of the SVM's predict_proba, showing the most likely regions of each distribution.

$\oint$ The SVM does not give us predict_proba natively; we need to pass probability=True when instantiating it. sklearn applies Platt scaling, fitting a logistic regression on the SVM scores. This technique can be used with any classifier to improve probability calibration. It is also useful for tree ensembles.

Understanding the distribution change from the classifier

Now we need to assess whether the distributions are different or not. We can look at a histogram of predict_proba applied to the two samples separately, as in Figure 3. Clearly, our SVM identifies regions where the chance of belonging to one distribution is higher than to the other. The fact that it gives us such confident predictions indicates that it can distinguish them well.

Figure 3: histogram of the SVM's predict_proba applied separately to the samples from $X$ and from $Z$.

Suppose we trust the probability estimate it gives us. A somewhat arbitrary metric is the percentage of the data that falls in the region $[0, 0.5-x) \cup (0.5+x,1]$ for $0\leq x\lt 0.5$. For example, with $x=0.25$ we look at the proportion of examples with predict_proba from $0$ to $25\%$ or from $75\%$ to $100\%$. These are the data points the classifier deems "easy to classify" because they lie in regions dominated by one of the classes.

x = 0.25
((svm.predict_proba(X_miss)[:,0]<0.5-x) | (svm.predict_proba(X_miss)[:,0]>0.5+x)).sum()/X_miss.shape[0]

0.4605

Almost half of the data lies in the "easy" regions according to this probability analysis. Of course this is not perfect because of outliers, but it clearly indicates that some regions of the feature space are favored by one distribution and others by the other distribution. With $x$ fixed, we can choose a value $\varepsilon\in(0,1]$ such that, if the proportion of data in the "easy" regions is greater than $\varepsilon$, we raise an alert that the distribution has changed.
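
The alert rule can be written as two small functions. This is a sketch with hypothetical names, and the value eps=0.3 is an arbitrary placeholder: as discussed next, a sensible threshold has to come from your own historical data.

```python
def easy_fraction(probas, x):
    """Share of predicted probabilities outside [0.5 - x, 0.5 + x]."""
    if not 0 <= x < 0.5:
        raise ValueError("x must satisfy 0 <= x < 0.5")
    easy = sum(1 for p in probas if p < 0.5 - x or p > 0.5 + x)
    return easy / len(probas)

def shift_alert(probas, x=0.25, eps=0.3):
    """True when the share of 'easy' examples exceeds the chosen eps."""
    return easy_fraction(probas, x) > eps

confident = [0.10, 0.50, 0.90, 0.60]   # half of the predictions are far from 0.5
hesitant = [0.45, 0.50, 0.55, 0.60]    # all predictions hover around 0.5

print(easy_fraction(confident, 0.25), shift_alert(confident))
print(easy_fraction(hesitant, 0.25), shift_alert(hesitant))
```

Here probas would be one column of predict_proba, for example svm.predict_proba(X_miss)[:, 0] as in the snippet above.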

We can also try to define thresholds on accuracy or on the phi coefficient to flag whether the distribution has changed. This is not necessarily clear-cut either, and we can end up monitoring too strictly or too leniently.

As discussed in the previous post, such universal thresholds do not exist. What works is to look at known cases of covariate shift in your historical data and check whether there is an $\varepsilon$ that would have caught them.

Case with no change

It is worth studying how this methodology behaves when there is no change in the distribution. For example, suppose both samples were generated by the same multivariate normal given by

\[\begin{equation*} \begin{pmatrix}X_{1}\\ X_{2} \end{pmatrix}, \begin{pmatrix}Z_{1}\\ Z_{2} \end{pmatrix} \sim \mathcal{N} \begin{pmatrix} \begin{bmatrix} 0\\ 0 \end{bmatrix} , \begin{bmatrix} 1 & 0.75 \\ 0.75 & 1 \end{bmatrix} \end{pmatrix}. \end{equation*}\]

X1, X2 = sample(1000)
Z1, Z2 = sample(1000)

Following exactly the same procedure as before, we now get much messier contour lines, as shown in Figure 4. The classifier tries to adapt a little to the particularities of the samples, but it does not dare to assign high probabilities to any region, precisely because no region is favored by either distribution in this case.

Figure 4: contour lines of predict_proba when the two samples come from the same distribution.

The classification metrics make this even clearer: the distributions are indistinguishable, as expected.

acuracia:  0.511875
roc_auc:  0.5115329746824565
phi coeficiente:  0.023459774068708163

The analysis of the predict_proba distribution also matches what we expected. Now the model is much more conservative, placing the probabilities close to $0.5$, as shown in Figure 5.

Figure 5: histogram of predict_proba in the case with no distribution change, concentrated around $0.5$.

In this case, the predict_proba values are concentrated between $0.4$ and $0.6$, as expected. The model is conservative and finds no regions that are easy to classify.

x = 0.1
((svm.predict_proba(X_miss)[:,0]<0.5-x) | (svm.predict_proba(X_miss)[:,0]>0.5+x)).sum()/X_miss.shape[0]

0.0
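
The two scenarios (with and without shift) can be reproduced end to end in a dependency-free sketch. Everything here is an assumption made for illustration: a k-nearest-neighbours vote replaces the tuned SVM, the split is 50/50, and the sample sizes and seed are arbitrary, so the numbers will not match the ones reported above.

```python
import math
import random

def sample(n, rho, rng):
    """n points from a 2D normal with unit variances and correlation rho."""
    pts = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        pts.append((z1, rho * z1 + math.sqrt(1 - rho**2) * z2))
    return pts

def knn_predict(train, point, k=15):
    """Majority vote among the k nearest training points (labels are 0 or 1)."""
    nearest = sorted(
        train,
        key=lambda row: (row[0][0] - point[0]) ** 2 + (row[0][1] - point[1]) ** 2,
    )[:k]
    return int(sum(label for _, label in nearest) > k / 2)

def domain_classifier_accuracy(x_pts, z_pts, rng, test_frac=0.5):
    """Label X as s=0 and Z as s=1, then measure how well s can be predicted."""
    data = [(p, 0) for p in x_pts] + [(p, 1) for p in z_pts]
    rng.shuffle(data)
    n_test = int(test_frac * len(data))
    test, train = data[:n_test], data[n_test:]
    correct = sum(knn_predict(train, p) == s for p, s in test)
    return correct / n_test

rng = random.Random(0)
# Opposite correlations: the classifier should do clearly better than chance.
acc_shift = domain_classifier_accuracy(sample(500, 0.75, rng), sample(500, -0.75, rng), rng)
# Same distribution: the classifier should stay close to 0.5.
acc_same = domain_classifier_accuracy(sample(500, 0.75, rng), sample(500, 0.75, rng), rng)
print(acc_shift, acc_same)
```

Even a simple classifier separates the two situations, which supports the point made below that the model does not need to be state of the art.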

Points of attention and final remarks

As with most monitoring techniques, it is not necessarily clear how to decide categorically whether covariate shift is present. Defining thresholds for alerts is a murky business. The idea is always to use several ways of evaluating, producing reports that need to be read critically.

In many cases, this whole analysis with hyperparameter optimization and costly models such as the SVM may be infeasible. We do not need a state-of-the-art binary classifier; it only needs to be good enough to learn the regions of each sample (if they exist) and give adequate probabilities. So feel free to choose whichever classifier you like best, taking care when analyzing predict_proba. As I mentioned earlier, the default parameters of SVMs are usually reasonable, and you can always take a few subsamples of the data to run these analyses.

It also makes sense to pay attention to the balance between the amount of training data ($s=0$) and production data ($s=1$), so that accuracy and other simple metrics remain meaningful. Again, since this classifier does not need to be perfect, undersampling the dominant class seems sufficient to me.
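
A minimal sketch of that undersampling step, assuming the rows of each dataset are held in plain Python lists (the function name and the seed argument are hypothetical):

```python
import random

def undersample(train_rows, prod_rows, seed=0):
    """Randomly cut both sets down to the size of the smaller one."""
    rng = random.Random(seed)
    n = min(len(train_rows), len(prod_rows))
    # Sampling without replacement keeps every retained row unique.
    return rng.sample(train_rows, n), rng.sample(prod_rows, n)

train_rows = list(range(1000))   # stand-in for the s=0 rows
prod_rows = list(range(100))     # stand-in for the s=1 rows
balanced_train, balanced_prod = undersample(train_rows, prod_rows)
print(len(balanced_train), len(balanced_prod))
```

With balanced classes, an accuracy of 0.5 corresponds to chance, which makes the metrics discussed in this post easier to read.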

Built into robust production pipelines, this technique can be a smart way of detecting changes between training and production covariates. In the next post we will use Vapnik's empirical risk minimization principle to discuss why covariate shift becomes a problem. That narrative will point to an elegant way of mitigating the problems caused by covariate shift when retraining with data closer to production is not possible.

Covariate Shift: QQ-plot

2020-08-16T00:00:00+00:00

This post is part of a series that discusses the problem of Covariate Shift. I assume you already know the motivation of the problem and what we are interested in identifying and correcting. If you have not read the first post of this series yet, I suggest reading it.

Recalling the reformulated statement of the problem, we have random variables (or vectors) $X$ and $Z$ and two independently sampled sets of observations $\{x_1, x_2, \cdots, x_N \} $ and $\{z_1, z_2, \cdots, z_M \} $. We want to understand whether the variables have the same distribution, that is, whether $X\sim Z$, by studying only the collected samples. In the context of dataset shift, which is what we are particularly interested in, the random vector $X$ represents the distribution of the covariates in the training set and the random vector $Z$ represents the distribution of the explanatory variables in the production data.

The first technique we will discuss uses the QQ-plot (quantile-quantile plot). By checking whether the $\alpha$-quantiles of the two samples are similar, we can discuss whether it is valid to assume $X\sim Z$.

$\alpha$-quantiles of a random variable

There are a few different ways of computing $\alpha$-quantiles. They are more or less equivalent for the analyses we are interested in, so we will not go into small variations. We start with a very classic $\alpha$-quantile that you already know: the median.

The median of a dataset is the real value that splits our data into two subsets of equal size: the set of values greater than the median and the set of values less than or equal to the median. For example, if we have the set $S =\{ 1, 2, 4, 6, 6, 9\}$, then the median can be $4$, since $|\{x \in S : x\leq 4 \}|$ $ = 3 =$ $ |\{x \in S : x\gt 4 \}|$.

The concept of median can be extended to random variables. In this case, we look for a real value $p$ such that the probability of the random variable being less than or equal to $p$ is 0.5. This means that the value $p$ splits the real line into two regions $\{ x\in\mathbb{R}:x\leq p \}$ and $\{ x\in\mathbb{R}:x\gt p \}$ with the same probability, that is, $\mathbb{P}(X\leq p)$ $=0.5=$ $\mathbb{P}(X\gt p)$.

Given $\alpha\in(0,1)$, the idea of an $\alpha$-quantile of a random variable $X$ generalizes what we did with the median. We want to split the real line into two regions, one with probability $\alpha$ and the other with probability $1-\alpha$. For the median we had $\alpha=0.5$; here the construction is analogous, but more general. The idea is that $q_X(\alpha)$, the $\alpha$-quantile of $X$, satisfies the equation

\[\mathbb{P}\left( X\leq q_X(\alpha) \right) = \alpha.\]

Recall that $F_X(t) = \mathbb{P}(X\leq t)$ is the cumulative distribution function of a random variable $X$. The function $q_X:(0,1)\to\mathbb{R}$, called the quantile function, would be the inverse of $F_X$. That is, $F_X(q_X(\alpha))=\alpha$. The median of a random variable $X$ is formally defined as $q_X(0.5)$.

However, we can exhibit problematic random variables for which the equation has no solution for some values of $\alpha\in(0,1)$. For example, taking $X\sim\textrm{Ber}(0.4)$, there is no $p\in\mathbb{R}$ such that $F_X(p ) = 0.5$, since

\[F_X(t) = \begin{cases} 0\textrm{, if }t\lt0, \\ 0.6\textrm{, if }0\leq t\lt 1,\\ 1\textrm{, if }t\geq1.\end{cases}\]

So we cannot define $q_X(0.5)$, the median of the Bernoulli variable with parameter $0.4$, using this form of the quantile function.

Note also that in the first example, the median of the set $S$ is not uniquely determined. We could have taken the median to be $5$, since this value would also split our data into sets of the same size.

Since we want a well-defined function, a solution to these problems is to define the quantile function as

\[\begin{equation*} q_X(\alpha) = \min \{t \in \mathbb{R} : \mathbb{P}(X\leq t) = F_X(t) \geq \alpha \}. \end{equation*}\]

In this case, the value $q_X(\alpha)$ is the smallest real value such that the cumulative probability is at least $\alpha$. In the case discussed for $X\sim\textrm{Ber}(0.4)$, we now have $q_X(0.5) = 0$, since 0 is the smallest real value that makes $F_X$ greater than or equal to $0.5$. And the median of the set $S$ becomes uniquely defined, since $4$ is the smallest value that satisfies the split into two sets of equal size.
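
This generalized inverse is easy to check numerically. The sketch below uses hypothetical helper names: one function works from a probability mass function given as a dictionary, the other from a finite dataset through its empirical distribution function.

```python
def quantile_from_pmf(pmf, alpha):
    """Smallest value t with F(t) >= alpha, for a pmf given as {value: probability}."""
    cumulative = 0.0
    for value in sorted(pmf):
        cumulative += pmf[value]
        if cumulative >= alpha:
            return value
    return max(pmf)  # guards against floating-point round-off near alpha = 1

def empirical_quantile(data, alpha):
    """Smallest data value t such that the share of points <= t is at least alpha."""
    ordered = sorted(data)
    n = len(ordered)
    for i, value in enumerate(ordered, start=1):
        if i / n >= alpha:
            return value

bernoulli = {0: 0.6, 1: 0.4}       # X ~ Ber(0.4)
S = [1, 2, 4, 6, 6, 9]

print(quantile_from_pmf(bernoulli, 0.5))   # median of Ber(0.4)
print(empirical_quantile(S, 0.5))          # median of the set S
```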

For random variables $X$ whose $F_X$ is continuous, this definition of $q_X(\alpha)$ is consistent with the first attempted definition. These are the examples we will be most interested in when analyzing the QQ-plot.

$\oint $ The generalization of the inverse that we made is particularly useful for functions that are monotonic but discontinuous and not necessarily injective, as is the case for the cumulative distribution functions of discrete random variables. The only change needed in more general cases is to use $\inf$ instead of $\min$ (by the properties of the cumulative distribution function, since it is right-continuous, these two forms are equivalent).

Computing the quantile function of a continuous random variable

When $X$ is a continuous random variable with probability density $f_X$, we have an explicit formula for $F_X$:

\[\begin{equation*} F_X(t) = \int_{-\infty}^t f_X(s) \, ds. \end{equation*}\]

Given a random variable with exponential distribution $X\sim \textrm{Exp}(\lambda)$, let us derive $q_X$ directly. To compute $F_X$, we use the probability density $f_X$ given by

\[\begin{equation*} f_X(s) = \begin{cases} \lambda e^{-\lambda s}\textrm{, if } s\geq 0\textrm{,}\\ 0 \textrm{, otherwise.} \end{cases} \end{equation*}\]

We can compute $F_X$ as

\[\begin{equation*} F_X(t) = \int_{-\infty}^{t} f_X(s) ds = \int_0^t \lambda e^{-\lambda s} ds = -\,e^{-\lambda s}\, \bigg\rvert_{0}^{t} = 1 - e^{-\lambda t}, \end{equation*}\]

for $t\geq 0$, and $F_X(t)=0$ for $t<0$.

We can find an explicit form for $q_X(\alpha)$ in this case. It suffices to solve the equation:

\[\begin{equation*} \alpha = F_X(q_X(\alpha)) = 1 - e^{-\lambda q_X(\alpha)} \therefore 1- \alpha = e^{-\lambda q_X(\alpha)}, \end{equation*}\]

concluding that

\[\begin{equation*} q_X(\alpha) = \frac{-\ln(1-\alpha)}{\lambda}. \end{equation*}\]

def dens_exp(s, lamb):
    # Density of Exp(lamb): lamb * exp(-lamb * s) for s >= 0, and 0 otherwise.
    return np.piecewise(s, [s < 0, s >= 0], [lambda s: 0, lambda s: lamb*np.exp(-lamb*s)])

def quantil_exp(t,lamb):
    return -np.log(1-t)/lamb

For example, to compute the median of $X\sim\textrm{Exp}(\lambda =1)$, we simply evaluate $q_X(0.5)=-\ln(0.5)\approx0.693$. Interpreting this result, $\mathbb{P}\left( X\leq -\ln(0.5) \right)=0.5$, so shading the area under the curve up to $-\ln(0.5)$, as in Figure 1, covers half of the area under the probability density.

Figure 1: Probability density of the exponential random variable with $\lambda=1$. The shaded region is the area under the curve from 0 to $-\ln(0.5)$, which accounts for half of the probability.
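As a quick sanity check of the closed-form median, the sketch below evaluates the cumulative distribution function at $-\ln(0.5)$ and compares it with the fraction of a simulated sample that falls below that value. The helper names `cdf_exp` and `quantile_exp`, the seed, and the sample size are illustrative choices, not part of the original experiments.

```python
import numpy as np

def cdf_exp(t, lamb):
    """CDF of Exp(lamb): 1 - exp(-lamb * t) for t >= 0, and 0 otherwise."""
    t = np.asarray(t, dtype=float)
    return np.where(t >= 0, 1.0 - np.exp(-lamb * np.clip(t, 0.0, None)), 0.0)

def quantile_exp(alpha, lamb):
    """Closed-form quantile function of Exp(lamb), for alpha in (0, 1)."""
    return -np.log(1.0 - alpha) / lamb

median = quantile_exp(0.5, lamb=1.0)
cdf_at_median = float(cdf_exp(median, lamb=1.0))

# Monte Carlo check: about half of a large Exp(1) sample should fall below the median.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=100_000)
fraction_below = float(np.mean(sample <= median))

print(f"median = {median:.3f}, F_X(median) = {cdf_at_median:.3f}, "
      f"empirical fraction = {fraction_below:.3f}")
```

The first two numbers are exact up to floating-point error, while the empirical fraction fluctuates around $0.5$ with the sample.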

Computing the quantile function of a discrete random variable

Now suppose that $X\sim \textrm{Binomial}(2,0.5)$. Then $\mathbb{P}(X=0)=\mathbb{P}(X=2)= 0.25$ and $\mathbb{P}(X=1)=0.5$. The cumulative distribution function is

\[F_X(t) = \begin{cases} 0\textrm{, if }t\lt0, \\ 0.25\textrm{, if }0\leq t \lt 1,\\ 0.75\textrm{, if }1\leq t \lt 2,\\ 1\textrm{, if }t\geq2.\end{cases}\]

To compute the quantile function, we need the version of the definition that reads

\[q_X(\alpha) = \min \{t \in \mathbb{R} : F_X(t) \geq \alpha \}.\]

With this, we have for example $q_X(0.9)=2$, since the smallest value of $F_X(t)$ greater than or equal to $0.9$ is $1$, which is first attained at $t=2$. Applying the same reasoning to every $\alpha \in (0,1)$, we arrive at the quantile function

\[q_X(\alpha) = \begin{cases} 0\textrm{, if }0\lt \alpha \leq 0.25, \\ 1\textrm{, if }0.25\lt \alpha \leq 0.75, \\ 2\textrm{, if }0.75\lt \alpha \lt 1.\end{cases}\]
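The same case analysis can be reproduced numerically. The sketch below applies the definition $q_X(\alpha) = \min \{t : F_X(t) \geq \alpha \}$ to the $\textrm{Binomial}(2,0.5)$ example; the helper name `quantile_discrete` is hypothetical, and the search relies on `np.searchsorted` returning the first index at which the cumulative sum reaches $\alpha$.

```python
import numpy as np

support = np.array([0, 1, 2])          # values taken by X ~ Binomial(2, 0.5)
pmf = np.array([0.25, 0.5, 0.25])      # P(X = 0), P(X = 1), P(X = 2)
cdf = np.cumsum(pmf)                   # F_X evaluated on the support

def quantile_discrete(alpha):
    """Smallest support point t with F_X(t) >= alpha, for alpha in (0, 1)."""
    # side="left" gives the first index i such that cdf[i] >= alpha.
    return support[np.searchsorted(cdf, alpha, side="left")]

for alpha in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(alpha, quantile_discrete(alpha))
```

Note that $\alpha=0.25$ and $\alpha=0.75$ fall on the closed side of the intervals, so they map to $0$ and $1$, respectively.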

QQ-plot

The idea of the QQ-plot (or quantile-quantile plot) rests on a clever observation: if two random variables $X$ and $Y$ have similar distributions (that is, if $F_X \approx F_Y$), then their $\alpha$-quantiles are similar as well (that is, the quantile functions are close, $q_X \approx q_Y$).

Therefore, if $X$ and $Y$ have similar distributions, when we plot the "parametrized curve"

\[\begin{equation*} \{ (q_X(\alpha), q_Y(\alpha) ) \in \mathbb{R}^2 : \alpha \in (0,1) \}, \end{equation*}\]

we expect the curve to lie close to the identity line $y=x$. The name QQ-plot comes from the fact that we plot the quantiles of our random variables on both axes.

To visualize this plot, let us look at an analytical example. Let $X \sim \textrm{Exp}(\lambda=1)$ and $Y \sim \textrm{Uniform}([0,1])$. We have already computed $q_X(\alpha)=-\ln(1-\alpha)$ explicitly, and it is easy to check that $q_Y(\alpha) = \alpha$.

def dens_uni(s):
    return np.piecewise(s, [s < 0, (s >= 0) & (s <= 1), s > 1], [0, 1, 0]) 
    
def quantil_uni(t):
    return t

As we can see in the first image of Figure 2, the two densities are close near the origin and then become qualitatively very different. Plotting the curve given by

\[\begin{equation*} \{ (-\ln(1-\alpha), \alpha ) \in \mathbb{R}^2 : \alpha \in (0,1) \}, \end{equation*}\]

we obtain the QQ-plot in the second image of Figure 2.

Figure 2: on the left, the densities of $X\sim\textrm{Exp}(1)$ and $Y\sim\textrm{Uniform}([0,1])$; on the right, the corresponding analytical QQ-plot.
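To see numerically how this curve leaves the identity line, the sketch below evaluates both quantile functions on a grid of $\alpha$ values. The grid (from $0.01$ to $0.99$) is an arbitrary choice made for illustration.

```python
import numpy as np

alphas = np.linspace(0.01, 0.99, 99)
q_x = -np.log(1.0 - alphas)   # quantile function of Exp(1)
q_y = alphas                  # quantile function of Uniform([0, 1])

# Horizontal distance between each point (q_x, q_y) and the identity line y = x.
gap = q_x - q_y

print(f"gap at alpha = 0.01: {gap[0]:.5f}")
print(f"gap at alpha = 0.50: {gap[49]:.5f}")
print(f"gap at alpha = 0.99: {gap[-1]:.5f}")
```

The gap is almost zero for small $\alpha$ and grows without bound as $\alpha \to 1$, which matches the picture: the points start on the identity line and then bend away from it. Passing `q_x` and `q_y` to a scatter plot would reproduce the shape of the analytical QQ-plot.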

$\alpha$-quantiles for samples

When we do not know $F_X$, we cannot compute $q_X(\alpha)$ analytically. But if we have an independent and identically distributed sample $\left\{x_1,\ldots,x_N \right\}$ of $X$ of size $N$, we can estimate the $\alpha$-quantiles.

First, we sort the sample $\left\{x_1,\ldots,x_N \right\}$ in increasing order, relabeling the observations as $\left\{ x_{(1)},\ldots,x_{(N)} \right\}$.
Then, given $\alpha \in (0,1)$, the estimate of the $\alpha$-quantile of the random variable that generated the sample is

\[\begin{equation*} \widehat{q}_{X}(\alpha) = x_{( \lceil N\alpha \rceil )}, \end{equation*}\]

where $\lceil N\alpha \rceil$ is the smallest integer greater than or equal to $N\alpha$. This is the generalized inverse from before applied to the empirical distribution function of the sample.

The idea behind this estimator is that a fraction $\alpha$ of the sample consists of elements less than or equal to $\widehat{q}_X(\alpha)$. Figure 3 shows some $\alpha$-quantiles of a data sample with $N=40$. Plotting the points horizontally in sorted order, we identify the $0.25$-quantile as the tenth element of the list, marked in green, since $25$ percent of the data are less than or equal to it.

Figure 3: A collection of data points placed in increasing order and some illustrative $\alpha$-quantiles.

When $N\to \infty$, we have $\widehat{q}_{X}(\alpha) \to q_{X}(\alpha)$ in probability, at least for continuous random variables. This lets us trust that, for large $N$, the estimated $\alpha$-quantile is close to the true $\alpha$-quantile. We will use this fact to compare our samples.
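A minimal sketch of an order-statistic estimator is shown below. It assumes the convention $x_{(\lceil N\alpha \rceil)}$, that is, the generalized inverse applied to the empirical distribution function; conventions based on $\lfloor N\alpha \rfloor + 1$ differ by at most one position in the sorted sample. The function name `sample_quantile` and the simulated data are hypothetical.

```python
import math
import numpy as np

def sample_quantile(sample, alpha):
    """Order-statistic estimate x_(ceil(N * alpha)) for alpha in (0, 1)."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    n = ordered.shape[0]
    position = max(math.ceil(n * alpha), 1)  # 1-based position in the sorted sample
    return ordered[position - 1]

# N = 40 distinct values: the 0.25-quantile is the tenth smallest element.
data = np.random.default_rng(1).permutation(np.arange(1, 41))
q25 = sample_quantile(data, 0.25)
print(q25, np.mean(data <= q25))

# For a large Exp(1) sample, the estimated median should approach -ln(0.5).
big_sample = np.random.default_rng(0).exponential(scale=1.0, size=100_000)
median_estimate = sample_quantile(big_sample, 0.5)
print(median_estimate, -np.log(0.5))
```

In the first example exactly $25$ percent of the observations are less than or equal to the estimate; in the second, the estimate lands close to the analytical median computed earlier.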

QQ-plot for two samples

The idea of the QQ-plot for two samples is precisely to use this fact: if the sample $\left\{x_1,\ldots,x_N \right\}$ and the sample $\left\{y_1,\ldots,y_M \right\}$ come from similar distributions $X$ and $Y$, respectively, then the estimated quantile functions will also be similar,

\[\begin{equation*} \widehat{q}_{X}(\alpha) \approx \widehat{q}_{Y}(\alpha). \end{equation*}\]

In this case, if we parametrize a curve by the value $\alpha$ and plot $\widehat{q}_{X}$ on the $x$ axis and $\widehat{q}_{Y}$ on the $y$ axis, we expect the points to lie close to the identity line $y=x$.

Varying the curve parameter in equal steps, we plot the points

\[\begin{equation*} \left\{ (\widehat{q}_X(\alpha_i), \widehat{q}_Y(\alpha_i) ) \in \mathbb{R}^2 : \alpha_i = \frac{i}{k} \textrm{, for }i\in\{1,2,\cdots,k-1\} \right\}, \end{equation*}\]

for a natural number $k \gt 2$. We are walking along the previous curve in steps of size $1/k$ in the parameter $\alpha$. For example, for $k=10$ we plot the $9$ points corresponding to the $\alpha_i$-quantiles for $\alpha_i=0.1$, $0.2$, $\cdots$, $0.8$, $0.9$. If $k=20$, we take the $19$ points identified by $\alpha_i=0.05$, $0.1$, $\cdots$, $0.9$, $0.95$.
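The construction of the $k-1$ points can be sketched as follows. The function name `qq_points` is hypothetical, and the quantiles are estimated with the order statistic at position $\lceil N i / k \rceil$ (an assumed convention; other conventions shift the position by at most one). The positions are computed with integer arithmetic to avoid rounding issues in $N\alpha_i$.

```python
import numpy as np

def qq_points(x, y, k):
    """Estimated quantile pairs at alpha_i = i / k, for i = 1, ..., k - 1."""
    i = np.arange(1, k)

    def quantiles(sample):
        ordered = np.sort(np.asarray(sample, dtype=float))
        n = ordered.shape[0]
        # Integer arithmetic for ceil(n * i / k), the 1-based position of each quantile.
        positions = -((-n * i) // k)
        return ordered[positions - 1]

    return quantiles(x), quantiles(y)

rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=10_000)
y_same = rng.normal(0, 1, size=5_000)      # same distribution, different sample size
y_shifted = rng.normal(1, 1, size=5_000)   # same shape, mean shifted by 1

qx, qy_same = qq_points(x, y_same, k=10)
_, qy_shifted = qq_points(x, y_shifted, k=10)

print(np.round(qy_same - qx, 2))     # close to 0: points near the line y = x
print(np.round(qy_shifted - qx, 2))  # close to 1: points near the line y = x + 1
```

The two samples do not need to have the same size, since each quantile is estimated separately. Plotting `qx` against `qy_same` (or `qy_shifted`) together with the identity line gives the QQ-plot itself.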

Figure 4 shows several QQ-plots for different choices of random variables $X$ and $Y$, sample sizes $N$ and $M$, and numbers of plotted points $k-1$.

In the first image of Figure 4, we have $X,Y\sim\mathcal{N}(0.5,1)$ with $N,M=200$ and $k=10$. The points approach the identity line, but with some variation: because the samples are small, the estimated $\alpha$-quantiles vary considerably.
In the second image, we have the same distributions, but now with $N,M=10000$ and $k=25$. The estimated $\alpha$-quantiles are more precise, so the points lie right on the identity line.
In the third image, we have $X\sim\textrm{Uniform}([0,1])$ and $Y\sim\mathcal{N}(0.5,1)$ with $N=2000$, $M=1000$ and $k=25$. This is a case in which the two generating distributions have the same mean (which is why the points in the middle stay close to the identity line), but we can still see that the distributions differ.
In the fourth image, we have $X\sim\mathcal{N}(0,1)$ and $Y\sim\mathcal{N}(1,1)$ with $N,M=3000$ and $k=20$. Since the distributions are equal up to the mean, the points fall on the line $y=x+1$ instead of the identity.
The fifth image is the sample version of the QQ-plot we built analytically in Figure 2, with $X\sim\textrm{Exp}(1)$ and $Y\sim\textrm{Uniform}([0,1])$. Here we use $N,M=2000$ and $k=100$.
Finally, the last image compares the binomial distribution with the normal distribution. We take $X\sim\textrm{Binomial}(400,0.5)$ and $Y\sim\mathcal{N}(200,100)$, with $N,M=4000$ and $k=20$.

$\oint$ For each $t\in\mathbb{N}^*$, defining $Z_t\sim\textrm{Binomial}(t,0.5)$, we have
\[\frac{Z_t- 0.5\, t}{0.5\, \sqrt{t}}\overset{\mathscr{D}}{\to} \mathcal{N}(0,1)\]
by the central limit theorem, noting that $Z_t\sim\sum_{i=1}^t B_i$, where the $B_i \sim \textrm{Bernoulli}(0.5)$ are independent.

Figure 4: QQ-plots for different choices of distributions $X$ and $Y$, sample sizes $N$ and $M$, and numbers of points $k-1$, as described in the text.

The QQ-plot was originally designed as a visual way to tell whether two samples come from similar distributions. In principle, this kind of analysis does not give us a numerical metric to study.

Suggestion of a quantitative metric

To obtain a numerical value for assessing whether our distributions are close, recall the motivation for the QQ-plot: we are comparing the points with the identity line. This suggests using a regression metric that measures how well the identity line $f(x)=x$ fits our data

\[\begin{equation*} \left\{ (\widehat{q}_X(\alpha_i), \widehat{q}_Y(\alpha_i) ) \in \mathbb{R}^2 : \alpha_i = \frac{i}{k} \textrm{, for }i\in\{1,2,\cdots,k-1\} \right\}. \end{equation*}\]

Using the $\textrm{MSE}$ or the $\textrm{MAE}$, for example, we obtain the expressions:

\[\textrm{MSE} = \frac{1}{k-1} \sum_{i=1}^{k-1} (f(\widehat{q}_X(\alpha_i)) - \widehat{q}_Y(\alpha_i))^2 = \frac{1}{k-1} \sum_{i=1}^{k-1} (\widehat{q}_X(\alpha_i) - \widehat{q}_Y(\alpha_i))^2,\] \[\textrm{MAE} = \frac{1}{k-1}\sum_{i=1}^{k-1} \left|\widehat{q}_X(\alpha_i) - \widehat{q}_Y(\alpha_i)\right|.\]
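Both expressions translate directly into a few lines of NumPy. In the sketch below, the helper names `estimated_quantiles` and `qq_metrics` are hypothetical, and the quantiles are estimated with the order statistic at position $\lceil N i / k \rceil$, which is one possible convention.

```python
import numpy as np

def estimated_quantiles(sample, k):
    """Order statistics at positions ceil(N * i / k), for i = 1, ..., k - 1."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    positions = -((-ordered.shape[0] * np.arange(1, k)) // k)
    return ordered[positions - 1]

def qq_metrics(x, y, k):
    """MSE and MAE of the QQ-plot points with respect to the identity line."""
    diff = estimated_quantiles(x, k) - estimated_quantiles(y, k)
    return float(np.mean(diff ** 2)), float(np.mean(np.abs(diff)))

rng = np.random.default_rng(42)
x = rng.normal(0, 1, size=3000)
y_same = rng.normal(0, 1, size=3000)
y_shifted = rng.normal(1, 1, size=3000)

mse_same, mae_same = qq_metrics(x, y_same, k=20)
mse_shifted, mae_shifted = qq_metrics(x, y_shifted, k=20)
mse_swapped, mae_swapped = qq_metrics(y_shifted, x, k=20)

print(f"same distribution: MSE = {mse_same:.4f}, MAE = {mae_same:.4f}")
print(f"shifted mean:      MSE = {mse_shifted:.4f}, MAE = {mae_shifted:.4f}")
print(f"swapped arguments: MSE = {mse_swapped:.4f}, MAE = {mae_swapped:.4f}")
```

Swapping the two samples only flips the sign of each difference, so both metrics stay the same. With a mean shift of $1$, every quantile moves by roughly $1$, and both metrics end up close to $1$.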

$\oint$ I like the idea of using metrics such as $\textrm{MSE}$ and $\textrm{MAE}$ because of their symmetry: swapping the samples of $X$ and $Y$ would not change the result.

Figure 5 shows some examples of QQ-plots and their respective metrics. We always use $N,M=3000$. In the first image we have $X, Y\sim\mathcal{N}(0,1)$, with $k=10$. In the second we have $X\sim\textrm{Uniform}([0,1])$ and $Y\sim\textrm{Uniform}([-1,2])$, with $k=25$. In the third image we have $X\sim\textrm{Uniform}([0,1])$ and $Y\sim\mathcal{N}(0.5,1)$, with $k=30$. Finally, we have $X,Y\sim\mathcal{N}(300,400)$, with $k=20$.

Figure 5: QQ-plots and their respective metrics ($\textrm{MSE}$ and $\textrm{MAE}$) for the examples described in the text.

As we can see, this way of computing the metrics does not solve the problem. Depending on the scale of the data, the metric can be inflated even when the samples come from the same distribution. This happens in the last QQ-plot of Figure 5.

One suggestion to keep the data not much larger than $1$ in absolute value is to apply a StandardScaler to the data. We compute the sample mean and sample variance of the sample $\{x_1,x_2,\cdots,x_N\}$ and transform the data so that now

\[\begin{equation*} \left\{ x_i^* = \frac{x_i - \widehat{\mu_X}}{S_X}\right\} \textrm{, and also } \left\{ y_i^* = \frac{y_i - \widehat{\mu_X}}{S_X}\right\}. \end{equation*}\]

It is important to note that we are not changing the shape of the QQ-plot, only stretching and translating the axes, since the same scaler is applied to both axes. The idea is that if $X\sim Y$, then the scaler fitted on the sample of $X$ should leave both samples with mean approximately $0$ and variance approximately $1$.
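The effect of the scaler on the metric can be checked with a short experiment. The sketch below standardizes both samples with the mean and standard deviation of the sample of $X$, computed directly with NumPy instead of `StandardScaler`; the use of `ddof=1` and the helper names are assumptions made for illustration.

```python
import numpy as np

def estimated_quantiles(sample, k):
    """Order statistics at positions ceil(N * i / k), for i = 1, ..., k - 1."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    positions = -((-ordered.shape[0] * np.arange(1, k)) // k)
    return ordered[positions - 1]

def qq_mse(x, y, k, standardize=False):
    """MSE of the QQ-plot points, optionally after scaling with the statistics of x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if standardize:
        mu, s = x.mean(), x.std(ddof=1)
        x, y = (x - mu) / s, (y - mu) / s   # same transformation on both samples
    diff = estimated_quantiles(x, k) - estimated_quantiles(y, k)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(7)
x = rng.normal(300, 20, size=3000)   # N(300, 400): variance 400, standard deviation 20
y = rng.normal(300, 20, size=3000)

raw = qq_mse(x, y, k=20)
scaled = qq_mse(x, y, k=20, standardize=True)
scaled_swapped = qq_mse(y, x, k=20, standardize=True)

print(f"raw MSE = {raw:.4f}")
print(f"standardized MSE = {scaled:.6f}, with samples swapped = {scaled_swapped:.6f}")
```

Because sorting commutes with an increasing affine map, the standardized $\textrm{MSE}$ is simply the raw $\textrm{MSE}$ divided by $S_X^2$. Swapping the roles of the samples changes the scaler, so the two standardized values are close but not identical.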

Figure 6 shows QQ-plots built with this methodology and their respective metrics. Now we fix $N,M=3000$ and $k=20$. In the first image we have $X\sim\textrm{Exp}(1)$ and $Y\sim \mathcal{N}(0,1)$. In the second we have $X\sim \mathcal{N}(10,9)$ and $Y\sim\mathcal{N}(5,1)$. In the third image we have $X\sim \mathcal{N}(11,1)$ and $Y\sim \mathcal{N}(10,1)$. Finally, in the last one we have $X,Y\sim \mathcal{N}(300,400)$.

Figure 6: QQ-plots with standardized data (StandardScaler) and their respective metrics, for the examples described in the text.

With this, we have more hope of obtaining low metric values for samples from the same distribution, regardless of scale, as in the last image of Figure 6.

$\oint$ A small detail is that the metric is no longer always symmetric, since the sample mean and variance of $Y$ may differ from those of $X$.

With $N$, $M$ and $k$ fixed, the ideal would be to define a universal $\varepsilon\in \mathbb{R}^+$ and create a criterion of the form: if $\textrm{MSE}$ (or $\textrm{MAE}$) $< \varepsilon$, we suspect that $X\sim Y$; otherwise, we believe that $X\nsim Y$. However, this task seems impossible: the value of $\varepsilon$ depends on the nature of the data and on how tolerant we are of covariate shift.

To assess whether this form of monitoring is useful, it is worth applying it to real data from the domain you are analyzing, to understand how the suggested metrics ($\textrm{MAE}$ and $\textrm{MSE}$) behave both when there is no dataset shift and when there is.

If you do not have many versions of the data from different points in time, or if you do not know whether there is covariate shift, it is worth splitting the data from a single dataset into two disjoint sets. See how the metric behaves on these two samples, and then artificially change the distribution of the second one by adding and multiplying noise into the data.
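The split-and-perturb check described above could look like the sketch below. Everything specific in it is an assumption made for illustration: the gamma-distributed data stands in for a real covariate, the perturbation (multiplicative noise centered at $1.2$ plus an offset of $1$) is arbitrary, and the helper names are hypothetical.

```python
import numpy as np

def estimated_quantiles(sample, k):
    """Order statistics at positions ceil(N * i / k), for i = 1, ..., k - 1."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    positions = -((-ordered.shape[0] * np.arange(1, k)) // k)
    return ordered[positions - 1]

def standardized_qq_mae(reference, other, k):
    """MAE between quantiles after scaling both samples with the reference statistics."""
    mu, s = np.mean(reference), np.std(reference, ddof=1)
    diff = estimated_quantiles((reference - mu) / s, k) - estimated_quantiles((other - mu) / s, k)
    return float(np.mean(np.abs(diff)))

rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=3.0, size=6000)   # stand-in for a real covariate

# Two disjoint halves of the same dataset: no shift between them by construction.
shuffled = rng.permutation(data)
first, second = shuffled[:3000], shuffled[3000:]
baseline = standardized_qq_mae(first, second, k=20)

# Artificial shift: multiplicative noise plus an additive offset on the second half.
perturbed = second * rng.normal(1.2, 0.05, size=second.shape[0]) + 1.0
shifted = standardized_qq_mae(first, perturbed, k=20)

print(f"MAE without shift = {baseline:.3f}, MAE with artificial shift = {shifted:.3f}")
```

Repeating the split with different seeds gives an idea of how much the metric fluctuates when there is no shift, which helps when choosing a tolerance for a specific dataset.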

Problems and final considerations

The QQ-plot is a very useful visual strategy for checking covariate shift. It is an interesting and efficient way to generate reports that track data quality. It is easy to explain and to implement, and it is computationally cheap, since computing the $\alpha$-quantiles only requires sorting the data. Despite these qualities, it has some important problems.

The QQ-plot works well for continuous random variables. In general, however, discrete random variables have pathological quantile functions, with discontinuities, and the estimated quantile functions are not very reliable.

$\oint$ Imagine the scenario in which $X,Y\sim\textrm{Ber}(0.5)$, so that $q_X(0.5)=0$. Now suppose that one of our samples has one extra $0$ and the other has one extra $1$. In this scenario, the estimated medians would be $0$ and $1$, respectively, and we would get a point far away from the identity line. This problem does not depend on the sample sizes and can inflate our metric. The lack of continuity is what causes these problems.
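The Bernoulli scenario is easy to reproduce with two hand-built samples of size $1001$. The sketch below uses the order statistic at position $\lceil N\alpha \rceil$ as the estimator (for this odd $N$, the convention $\lfloor N\alpha \rfloor + 1$ selects the same position); the function name `sample_quantile` is hypothetical.

```python
import math
import numpy as np

def sample_quantile(sample, alpha):
    """Order-statistic estimate x_(ceil(N * alpha)) for alpha in (0, 1)."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    position = max(math.ceil(ordered.shape[0] * alpha), 1)
    return ordered[position - 1]

# Two samples that differ only in which value appears one extra time.
sample_a = np.array([0] * 501 + [1] * 500)   # one extra 0
sample_b = np.array([0] * 500 + [1] * 501)   # one extra 1

median_a = sample_quantile(sample_a, 0.5)
median_b = sample_quantile(sample_b, 0.5)

print(median_a, median_b)
print(sample_a.mean(), sample_b.mean())
```

The sample means differ by less than $0.002$, yet the estimated medians are $0$ and $1$, so the corresponding QQ-plot point sits at $(0, 1)$, as far from the identity line as two Bernoulli quantiles can be.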

Moreover, even for continuous random variables, the QQ-plot falls short in not giving us a numerical metric that can be evaluated in automated monitoring. The choice of $\varepsilon$ is too arbitrary: in many cases we may raise unnecessary alerts by being too strict, or miss problematic cases by being too tolerant.

Finally, this type of metric evaluates each random variable separately. In many cases, covariate shift may occur in the joint distribution of the random vector, and we will not notice it by looking at the marginal distributions. An example of this problem can be seen in Figure 7.

Figure 7: an example in which covariate shift occurs in the joint distribution without being noticeable in the marginal distributions.
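One simple way to build such a situation (not necessarily the one shown in Figure 7) is to keep both marginals as standard normals and flip the sign of the correlation. The sketch below compares marginal quantiles, using NumPy's default interpolated quantiles, and the correlation between the two covariates; the correlation value $0.9$ and the sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Same N(0, 1) marginals in both datasets; only the dependence structure changes.
old = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=n)
new = rng.multivariate_normal([0.0, 0.0], [[1.0, -0.9], [-0.9, 1.0]], size=n)

alphas = np.arange(1, 20) / 20
max_gaps = []
for j in range(2):
    gap = np.quantile(old[:, j], alphas) - np.quantile(new[:, j], alphas)
    max_gaps.append(float(np.max(np.abs(gap))))

corr_old = float(np.corrcoef(old[:, 0], old[:, 1])[0, 1])
corr_new = float(np.corrcoef(new[:, 0], new[:, 1])[0, 1])

print("largest marginal quantile gap per covariate:", np.round(max_gaps, 3))
print(f"correlation in old data = {corr_old:.2f}, in new data = {corr_new:.2f}")
```

A QQ-plot of either covariate would sit on the identity line, while the joint distribution has changed completely.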

In the next posts of this series, we will see another technique that can help in these cases. In general, covariate shift monitoring techniques have their strengths and weaknesses. Ideally, one should always have several different ways of identifying possible problems and intervening.

Covariate Shift: Introduction

2020-08-02T00:00:00+00:00

This text was originally written in Portuguese and later translated. The original Portuguese version can be found in the experiments repository.

The primary goal of supervised learning is to identify patterns between independent variables (explanatory variables) and a dependent variable (target variable). In mathematical terms, within a regression context, we have a random vector $V = (X_1, X_2, \cdots, X_n, Y)$ and we suppose that there exists a relationship between the independent variables $X_i$ and the dependent variable $Y$, expressed as:

\[\left(Y \,|\, X_1=x_1, X_2=x_2,\cdots, X_n=x_n\right)\sim f(x_1, x_2,\cdots, x_n) + \varepsilon,\]

where $f:\mathbb{R}^n\to \mathbb{R}$ is any given function and $\varepsilon$ is a random variable with mean $0$, referred to as noise (which might also vary depending on the values of $X_i$). The supervised learning approach attempts to estimate the function $f$ using prior observations (a sample of the random vector $V$).

$\oint$ Note that our illustration uses regression as an example due to its straightforwardness. Nonetheless, the case of classification isn't significantly more complex. In binary classification, the aim is to estimate $f:\mathbb{R}^n\to [0,1]$ as follows:

\[\left(Y \,|\, X_1=x_1, X_2=x_2,\cdots, X_n=x_n\right)\sim \textrm{Bernoulli}(p)\textrm{, with }p=f(x_1, x_2,\cdots, x_n).\]

Generally, we expect the performance estimated through cross-validation to carry over when the model faces new data. Machine learning in non-stationary environments, however, presents a challenge: what happens if there's a dataset shift, meaning the distribution of the random vector $V$ differs in new data? Can we realistically expect the model to uphold its validated performance?

In this context, we encounter two common scenarios [1]. The first, concept shift, takes place when the function $f$ connecting the variables $X_i$ and $Y$ changes. A seemingly less noticeable but equally alarming issue arises when the relationship between the explanatory and target variables remains constant, but the distribution of variables $X_i$ in new examples deviates from the distribution in the training data. This is known as covariate shift, a situation that we'll learn to identify and offer a potential solution for in this series of posts.

But first, let's create an artificial scenario that exhibits covariate shift. This will help illuminate the concepts through a practical situation and explore the problems that may emerge if this shift isn't properly identified and addressed.

Example of dataset shift between training data and production data

Consider $X$ to be a random variable that follows a normal distribution, $X\sim \mathcal{N}(0,1)$. Let $f:\mathbb{R}\to\mathbb{R}$ be a function defined as $f(x) = \cos(2\pi x)$, and $\varepsilon$ be a noise variable modeled as $\varepsilon \sim \mathcal{N}(0,0.25)$. We will build a dataset generated by this random experiment.

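import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def f(X):
    # Noise-free relationship between X and Y.
    return np.cos(2 * np.pi * X)

def f_ruido(X, random_state):
    # Adds Gaussian noise with standard deviation 0.5 (variance 0.25) to f(X).
    return f(X) + np.random.RandomState(random_state).normal(0, 0.5, size=X.shape[0])

def sample(n, mean=0, random_state=None):
    # Derive two seeds from random_state: one for X and one for the noise.
    rs = np.random.RandomState(random_state).randint(
        0, 2**32 - 1, dtype=np.int64, size=2
    )
    X = np.random.RandomState(rs[0]).normal(mean, 1, size=n)
    Y = f_ruido(X, random_state=rs[1])
    return X.reshape(-1, 1), Y.reshape(-1, 1)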

In this example, we will conduct this experiment $100$ times, creating our data with the mean of $X$ at $0$ as previously mentioned.

Despite the noise being of the same order of magnitude as $f$, the pattern of the function that drives the generation of the data can still be discerned. Our goal is to make predictions: given new observations of $X=x$, we aim to estimate the corresponding values for $(Y \, | \, X=x)$.

X_past, Y_past = sample(100, random_state=42)

x_plot = np.linspace(np.min(X_past), np.max(X_past), 1000).reshape(-1, 1)

fig, ax = plt.subplots(figsize=(5, 3))
ax.scatter(X_past, Y_past, alpha=0.5, label="Sample")
ax.plot(x_plot, f(x_plot), c="k", label="f(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
plt.tight_layout()

We will employ a simple model for regression, namely the sklearn.tree.DecisionTreeRegressor. By using sklearn.model_selection.GridSearchCV, we can determine the optimal value for the minimum number of samples per leaf (a regularization parameter, intended to prevent overfitting). Based on cross-validation, we can estimate the potential value of sklearn.metrics.r2_score we might achieve if we applied the decision tree to unseen data.

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

dtr = DecisionTreeRegressor(random_state=42)
param = {"min_samples_leaf": np.arange(1, 10, 1)}
grid_search = GridSearchCV(
    dtr, param, cv=5, scoring="r2", return_train_score=True
).fit(X_past, Y_past)

df_cv = (
    pd.DataFrame(grid_search.cv_results_)
    .sort_values("rank_test_score")
    .filter(["param_min_samples_leaf", "mean_test_score", "std_test_score"])
)
df_cv.head(3)

	param_min_samples_leaf	mean_test_score	std_test_score
2	3	0.554561	0.094576
6	7	0.502175	0.100091
3	4	0.490702	0.131177

We attain a reasonable $R^2$ value, indicating that the model successfully captures the patterns in the data, despite its simplicity and the small size of the dataset.

fig, ax = plt.subplots(figsize=(5, 3))
ax.scatter(X_past, Y_past, alpha=0.6, label="Sample")
ax.plot(x_plot, f(x_plot), c="k", alpha=0.5, label="f(x)")
ax.plot(
    x_plot,
    grid_search.best_estimator_.predict(x_plot),
    c="r",
    label="Decision tree estimator",
)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
plt.tight_layout()

Visually, the model performs well around $x=0$, where there's a high density of $x$ values. As expected, the model's performance deteriorates at the fringes where fewer training examples are present.

Let's now imagine a scenario where circumstances have changed: the relationship between $X$ and $Y$ remains intact, but for some reason, the distribution of the variable $X$ is no longer $X\sim \mathcal{N}(0,1)$. Instead, it's given by $X\sim \mathcal{N}(2,1)$. In other words, there's a shift in the distribution.

X_new, Y_new = sample(100, mean=2, random_state=13)

min_X = np.min(np.vstack([X_past, X_new]))
max_X = np.max(np.vstack([X_past, X_new]))

fig, ax = plt.subplots(figsize=(5, 3))
ax.hist(
    X_past,
    alpha=0.6,
    bins=np.linspace(min_X, max_X, 16),
    density=True,
    label="Old distribution of X",
)
ax.hist(
    X_new,
    alpha=0.6,
    bins=np.linspace(min_X, max_X, 16),
    density=True,
    label="New distribution of X",
)
ax.set_xlabel("x")
ax.set_title("Density of X")
ax.legend()
plt.tight_layout()

It is not reasonable to expect that our model will maintain the same performance as before. The estimation of the sklearn.metrics.r2_score was made based on the original distribution of $X$, which has now shifted.

$\oint$ We will delve into this in more depth in a future post in this series, but essentially, the previous model was trained to identify a function $h$ that minimizes the expected squared error in the distribution $(X_{\textrm{old}}, Y)$. Mathematically, this can be represented as:

\[h^* = \arg\min_{h\in\mathcal{H}}\,\mathbb{E}_{(X_{\textrm{old}}, Y)} \left(\left(h(X) - Y\right)^2\right).\]

This was done approximately, using the sample, by computing the empirical mean squared error. However, now, we are dealing with new data. Ideally, we should be minimizing:

\[\mathbb{E}_{(X_{\textrm{new}}, Y)} \left(\left(h(X) - Y\right)^2\right).\]

That is, we are targeting the expected error in a different distribution.

from sklearn.metrics import r2_score

x_plot_new = np.linspace(min_X, max_X, 1000).reshape(-1, 1)

fig, ax = plt.subplots(figsize=(7, 3))
ax.scatter(X_past, Y_past, alpha=0.2, label="Old sample")
ax.scatter(X_new, Y_new, alpha=0.6, label="New sample")
ax.plot(x_plot_new, f(x_plot_new), c="k", alpha=0.2, label="f(x)")
ax.plot(
    x_plot_new,
    grid_search.best_estimator_.predict(x_plot_new),
    c="r",
    label="Decision tree estimator trained on old sample",
)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend(loc="lower left")
plt.tight_layout()

r2_score(Y_new, grid_search.best_estimator_.predict(X_new))

0.059081313039643146

As anticipated, the model's performance deteriorates when applied to the new data. It's important to remember that the relationship between $Y$ and $X$ has remained the same; only the distribution of $X$ has shifted.

Identifying covariate shift

With the initial problem established, our challenge can be summarized as follows:

Let $X$ and $Z$ be random variables (or vectors). Assume you independently sample $X$ $N\in\mathbb{N}^*$ times and $Z$ $M\in \mathbb{N}^*$ times, resulting in the samples $\{x_1, x_2, \cdots, x_N \} $ and $\{z_1, z_2, \cdots, z_M \} $. How can we determine if $X\sim Z$ using only these two samples? Specifically, in the context of covariate shift, we'll be comparing samples of covariates from the training phase with those in production.

In general, monitoring the distribution of covariates needs to be easy to implement. Simple methods are preferred over complex ones to prioritize computational efficiency. Moreover, analysis is typically performed on each covariate, identifying shifts in these marginal distributions. Among the classic univariate methods, the most prominent are:

Comparison of statistics: means, variances, select sample quantiles etc;
Comparison of frequencies for discrete distributions and categorical data;
Kolmogorov-Smirnov test;
Kullback-Leibler divergence.
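As an illustration of two of the methods in this list, the sketch below compares sample means and runs a two-sample Kolmogorov-Smirnov test with `scipy.stats.ks_2samp`. The simulated samples mimic the shift from $\mathcal{N}(0,1)$ to $\mathcal{N}(2,1)$ used earlier, but the seed and the variable names are arbitrary choices, not taken from the experiments above.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
x_train = rng.normal(0, 1, size=100)   # covariate seen during training
x_same = rng.normal(0, 1, size=100)    # new data without shift
x_prod = rng.normal(2, 1, size=100)    # new data with a shifted mean

res_same = ks_2samp(x_train, x_same)
res_shift = ks_2samp(x_train, x_prod)

for name, new, res in [("no shift", x_same, res_same), ("shifted", x_prod, res_shift)]:
    mean_gap = new.mean() - x_train.mean()
    print(f"{name}: mean gap = {mean_gap:.2f}, "
          f"KS statistic = {res.statistic:.2f}, p-value = {res.pvalue:.2g}")
```

The Kolmogorov-Smirnov statistic is the largest vertical distance between the two empirical distribution functions, so a large value together with a tiny p-value is a strong hint that the covariate has shifted.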

This monitoring is often accompanied by analysis of the model's output distribution. For instance, if our model previously suggested that 10% of the data belonged to one class, and now it indicates 20%, we have a solid hint that the input distribution has shifted.

In this series of posts, I plan to introduce some slightly more unconventional methods for identifying covariate shift. Subsequently, we'll explore the problem through Vapnik's empirical risk minimization framework. From there, we'll derive an elegant method to address it, using a technique that will serve as a diagnostic tool for identifying dataset shift.

$\oint$ Keep in mind that this is just one of the crucial elements when it comes to monitoring machine learning models. For a comprehensive guide that addresses the main potential issues, I recommend the references [2, 3].

Bibliography

[1] Dataset Shift in Machine Learning. The MIT Press. Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer and Neil D. Lawrence.

[2] Monitoring Machine Learning Models in Production. Towards Data Science. Emeli Dral.

[3] A Guide to Monitoring Machine Learning Models in Production. NVIDIA Developer Blog. Kurtis Pykes.

You can find all files and environments for reproducing the experiments in the repository of this post.
