The R² score does not vary between 0 and 1

This text has a Portuguese version, which can be found in the experiments repository.


The coefficient of determination, known as $R^2$, is a fundamental metric in regression analyses. However, its definition and interpretation are not always straightforward. Indeed, there are several ways to define the $R^2$ and, although all are equivalent, each offers a different interpretative nuance. Some of these interpretations are more intuitive, facilitating an immediate understanding of the possible values, while others can lead to misunderstandings.

The current version of scikit-learn, in its docstring for sklearn.metrics.r2_score, mentions that the $R^2$ can range from negative infinity to 1. However, it's not uncommon to find data scientists claiming that the range of possible values for $R^2$ is strictly between 0 and 1. One of the reasons for this discrepancy lies in the classical interpretation of $R^2$, which is traditionally understood as the proportion of variance explained by the model relative to the total variance of the target variable [1].

Throughout this text, I will focus on the interpretation that I consider most enlightening and relevant. With it, I hope to clarify some peculiarities of the $R^2$ and highlight its importance as a robust metric that is frequently used in regression problems.


Mean Squared Error and the choice of a constant model

The $R^2$ is a common metric in regression. However, often the first metric introduced for regression problems is the Mean Squared Error (MSE). The MSE of a model $h$ on a dataset $S = \{ (x_i, y_i) \}_{i=1}^n$ is defined by

\[\textrm{MSE}(h) = \frac{1}{n} \sum_{i=1}^n \left(y_i - h(x_i)\right)^2,\]

where we chose not to denote the dependence on $S$ in order to keep the notation more streamlined.

Given this definition, an intriguing question arises: if you had to create a model that was merely a constant, which value would you choose? Many might answer that they would choose the mean, which is indeed one of the correct answers. However, why not consider the median, mode, or some other descriptive statistic?

The answer to this question is intrinsically linked to the cost function we wish to optimize. This choice is, in fact, a problem of decision theory [2]. For instance, if the goal is to optimize the MSE, then we would need to choose an $\alpha \in \mathbb{R}$ such that $h_\alpha(x) = \alpha$ minimizes the $\textrm{MSE}(h_\alpha)$. Mathematically, this is expressed as

\[\alpha^* = \arg\min_{\alpha \in \mathbb{R}} \textrm{MSE}(h_\alpha) = \arg\min_{\alpha \in \mathbb{R}} \left( \frac{1}{n} \sum_{i=1}^n \left(y_i - \alpha\right)^2 \right).\]

This function may seem complex at first glance, but it becomes simpler when considering only $\alpha$ as the free variable, which is how we approach this optimization problem. By expanding the square and performing the summation, we have a polynomial function of degree 2 in $\alpha$ in the form

\[\frac{1}{n} \sum_{i=1}^n \left(y_i - \alpha\right)^2 = \frac{1}{n} \sum_{i=1}^n \left(y_i^2 -2\alpha y_i + \alpha^2 \right) = \alpha^2 + \left(\frac{-2}{n} \sum_{i=1}^n y_i\right) \alpha+ \left(\frac{1}{n} \sum_{i=1}^n y_i^2\right).\]

In a quadratic function of the form $a\,\alpha^2 + b\,\alpha + c$ with $a>0$, the minimum occurs at the vertex of the parabola, located at $\frac{-b}{2a}$. In our case, $a = 1$ and $b = \frac{-2}{n} \sum_{i=1}^n y_i$, so the minimum is

\[\alpha^* = \frac{\left(\frac{-2}{n} \sum_{i=1}^n y_i\right)}{-2} = \frac{1}{n} \sum_{i=1}^n y_i = \bar{y}.\]

This means that, when minimizing the MSE, the optimal constant value is the average of the target, $\bar{y}$, for this set. I encourage you to validate this result using other unconstrained optimization techniques, such as identifying critical points and then analyzing the concavity of the function.
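
As a quick numerical sanity check on this result, here is a minimal sketch (assuming only NumPy and SciPy; the synthetic target y is arbitrary) that minimizes the MSE of a constant prediction and compares the minimizer with the sample mean.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=2.0, size=1_000)  # arbitrary synthetic target

def mse_of_constant(alpha):
    # MSE of the constant model h_alpha(x) = alpha on this sample
    return np.mean((y - alpha) ** 2)

result = minimize_scalar(mse_of_constant)
print(result.x)   # numerical minimizer of the MSE
print(y.mean())   # sample mean; the two values should agree closely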

This behavior changes when considering other metrics [3]. For example, to minimize the Mean Absolute Error (MAE), the constant value that optimizes it is the median, while the value that optimizes accuracy is the mode, and for pinball loss, it's the associated quantile. It's important to emphasize that if we consider sample_weight, all these statistics should be computed in a weighted manner.
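
The same numerical check can be repeated for the MAE. In the sketch below (same assumptions as the sketch above, now with a skewed synthetic target so that mean and median differ), the minimizer should land on the median rather than the mean.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=1_001)  # skewed target, so mean != median

def mae_of_constant(alpha):
    # MAE of the constant model h_alpha(x) = alpha on this sample
    return np.mean(np.abs(y - alpha))

result = minimize_scalar(mae_of_constant)
print(result.x)      # numerical minimizer of the MAE
print(np.median(y))  # sample median; should be close to the minimizer
print(y.mean())      # sample mean; noticeably different for this skewed sample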

$\oint$ This is used in defining prediction values for the nodes of decision trees. Looking at the scikit-learn code for trees, we notice that, depending on the criterion, the node_value can vary. It's adjusted to reflect the value that minimizes the loss when the node makes a constant prediction. For example, for the MSE criterion, the leaf's prediction is the average of the target of the training samples that fall in that leaf, while for the MAE criterion, it's the median.
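
A minimal sketch of this behavior (assuming a recent scikit-learn, where the criteria are named "squared_error" and "absolute_error"): we fit depth-1 trees on synthetic data and compare each leaf's prediction with the mean or the median of the training targets that fall into it. Small discrepancies are possible for the absolute-error criterion, since scikit-learn computes its own weighted median.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.5, size=200)

for criterion, statistic in [("squared_error", np.mean), ("absolute_error", np.median)]:
    tree = DecisionTreeRegressor(criterion=criterion, max_depth=1, random_state=0)
    tree.fit(X, y)
    leaf_of_sample = tree.apply(X)  # index of the leaf each training sample falls into
    for leaf in np.unique(leaf_of_sample):
        in_leaf = leaf_of_sample == leaf
        leaf_prediction = tree.predict(X[in_leaf])[0]  # constant value predicted by this leaf
        leaf_statistic = statistic(y[in_leaf])         # mean or median of the leaf's targets
        print(criterion, leaf, leaf_prediction, leaf_statistic)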

$\oint$ In practice, a model that predicts the target's average isn't feasible because to calculate the average of the test set, you would need to know the $y_i$ values of that sample. However, this perspective is useful for comparing a basic model with your model, as we will discuss next.


R² as a comparison between your model and a constant model

Suppose I develop a model to predict a person's age based on their online behavior and obtain an MSE of 25 years squared. This number on its own might not be very informative. One way to interpret it is to calculate the Root Mean Squared Error, that is, $\textrm{RMSE} = \sqrt{\textrm{MSE}}$, resulting in an error of about 5 years. This value is more intuitive (I admit that, internally, I tend to think in terms of MAE), but it still doesn't provide a relative comparison like "is it possible to get a value significantly lower than this?". The $R^2$ might not answer this question directly, but it aids in this evaluation.

We've already discussed a simple model that can serve as a benchmark. Imagine that the mean-based model already produces an MSE of 30 years squared. Suddenly, our previous model, which might have seemed excellent, doesn't stand out as much. If a simple model already achieves an MSE just slightly higher than the current model, is it worth implementing the more complex model in a production environment?

The interpretation I have of $R^2$ is precisely this comparison. Its formula is

\[R^2(h) = 1 - \frac{\textrm{MSE}(h)}{\textrm{MSE}(\bar{y})},\]

where $\bar{y}$ represents the average of the target in the set $S$ in which we are evaluating the model.

With this, we can understand the possible values of $R^2$:

  • If $R^2 = 1$, it means that $\textrm{MSE}(h) = 0$; that is, the model is perfect.

  • If $R^2 = 0$, we have $\textrm{MSE}(h) = \textrm{MSE}(\bar{y})$, indicating that our model is as effective as a model that simply provides the target's average.

  • For an $R^2$ between 0 and 1, we have $0 < \textrm{MSE}(h) < \textrm{MSE}(\bar{y})$. This indicates that the model has an error greater than zero, but less than that of a constant model based on the average.

  • A negative $R^2$ suggests that $\textrm{MSE}(h) > \textrm{MSE}(\bar{y})$, meaning our model is less accurate than one that always provides the average.

This interpretation helps in understanding the values obtained when using the function sklearn.metrics.r2_score. In the previous example, we would have an $R^2$ of $(1 - 25/30) \approx 0.17$, indicating a model that surpasses the simple model, but not very significantly.

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *fetch_california_housing(return_X_y=True),
    test_size=0.33,
    random_state=42,
)

lr = LinearRegression().fit(X_train, y_train)

def evaluate_model(y_true, y_pred):
    print(f"MSE: {mean_squared_error(y_true, y_pred)}")
    print(f"R^2: {r2_score(y_true, y_pred)}")

y_pred_lr = lr.predict(X_test)
evaluate_model(y_test, y_pred_lr)
# MSE: 0.5369686543372444
# R^2: 0.5970494128783965

y_mean_test = y_test.mean() * np.ones_like(y_test)
evaluate_model(y_test, y_mean_test)
# MSE: 1.3325918152222385
# R^2: 0.0

y_pred_terrible_model = np.zeros_like(y_test)
evaluate_model(y_test, y_pred_terrible_model)
# MSE: 5.6276808369101445
# R^2: -3.2231092616846126
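
Continuing with the objects defined above, we can also verify the formula directly: the value reported by r2_score should coincide with one minus the ratio between the model's MSE and the MSE of the test-mean baseline.

mse_model = mean_squared_error(y_test, y_pred_lr)
mse_baseline = mean_squared_error(y_test, y_mean_test)
print(1 - mse_model / mse_baseline)  # should reproduce the r2_score value above (~0.597)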

Although a model with an $R^2$ of zero might seem like the lowest achievable threshold, in reality, this metric uses a baseline model with data leakage. In practice, we build our models using training data, and in scenarios subject to "dataset shift," there can be significant changes in fundamental statistics, such as the average.

y_mean_train = y_train.mean() * np.ones_like(y_test)
evaluate_model(y_test, y_mean_train)
# MSE: 1.3326257277946882
# R^2: -2.5448582275933163e-05

Regardless of these nuances, interpreting the $R^2$ in this way offers a valuable comparative mindset. It's always essential to compare your model with simple baselines, whether with established business rules or with more basic models, like a constant.


Generalization of R² beyond MSE

The notion of comparison with a basic or simple model can easily be generalized to other metrics, as long as we know which statistics to use as a baseline. Considering this, I propose extending this idea to the MAE using the median $\tilde{y}$ as the baseline model

\[R^2_{\textrm{MAE}}(h) = 1 - \frac{\textrm{MAE}(h)}{\textrm{MAE}(\tilde{y})},\]

where

\[\textrm{MAE}(h) = \frac{1}{n} \sum_{i=1}^n \left| y_i - h(x_i) \right|.\]

Thus, the $R^2_{\textrm{MAE}}$ provides a way to evaluate the model's performance relative to a simple baseline, using the MAE as the error metric.

from sklearn.metrics import mean_absolute_error

def r2_score_mae(y_true, y_pred, **kwargs):
    # MAE of the model being evaluated
    mae_model = mean_absolute_error(y_true=y_true, y_pred=y_pred, **kwargs)
    # MAE of the baseline that always predicts the median of y_true
    # (note: this median is unweighted; a weighted median would be needed with sample_weight)
    y_median_true = np.median(y_true) * np.ones_like(y_true)
    mae_median = mean_absolute_error(y_true=y_true, y_pred=y_median_true, **kwargs)
    return 1 - mae_model / mae_median

def evaluate_model_mae(y_true, y_pred):
    print(f"MAE: {mean_absolute_error(y_true, y_pred)}")
    print(f"R^2_MAE: {r2_score_mae(y_true, y_pred)}")

evaluate_model_mae(y_test, y_pred_lr)
# MAE: 0.5295710106684688
# R^2_MAE: 0.40256278728026484

y_median_test = np.median(y_test) * np.ones_like(y_test)
evaluate_model_mae(y_test, y_median_test)
# MAE: 0.8864044612448619
# R^2_MAE: 0.0
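
As a side note, recent scikit-learn releases (1.1 and later, if I am not mistaken about the version) ship a metric with essentially this definition, sklearn.metrics.d2_absolute_error_score, which also uses the median of y_true as the constant baseline. If it is available in your environment, it should agree with r2_score_mae.

from sklearn.metrics import d2_absolute_error_score

print(d2_absolute_error_score(y_test, y_pred_lr))  # expected to match the R^2_MAE value above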

Final considerations

The misconception that $R^2$ varies only between 0 and 1 originates from a simplified reading of its most common interpretation: the proportion of the target's variance that is explained by the independent variables, which suggests a value between 0% and 100%. In practice, $R^2$ does fall within this range in many cases. However, whenever the model is inferior to a simple horizontal model (i.e., a constant line at the average), $R^2$ takes negative values. This negative scenario is often neglected by the statistical community, as it is usually associated with overfitting: a linear regression, which tends to suffer from underfitting, will rarely be inferior to the horizontal model, since that constant model is contained in the hypothesis space of linear regression.

Throughout this post, we analyzed some of the reasons why $R^2$ is such an interesting and widely used metric in regression problems. By understanding the implicit comparison with a baseline model, we gain a valuable perspective on our model's relative performance, normalizing the MSE, which is not very informative when viewed in isolation. Moreover, the interpretation proposed here allows us to understand the resulting values in a clear and objective manner.

Bibliography

[1] Coefficient of determination. Wikipedia.

[2] Introdução à Teoria da Decisão. Fundamentos de Inferência Bayesiana. Victor Fossaluza and Luís Gustavo Esteves.

[3] Estimação Pontual. Fundamentos de Inferência Bayesiana. Victor Fossaluza and Luís Gustavo Esteves.


You can find all files and environments for reproducing the experiments in the repository of this post.