The R² score does not vary between 0 and 1

The coefficient of determination, known as $R^2$, is a fundamental metric in regression analysis. However, its definition and interpretation are not always straightforward. There are several ways to define $R^2$ and, although they are all equivalent, each offers a different interpretative nuance. Some of these interpretations are more intuitive and make the range of possible values immediately clear, while others can lead to misunderstandings.

The current version of scikit-learn, in its docstring for sklearn.metrics.r2_score, mentions that the $R^2$ can range from negative infinity to 1. However, it's not uncommon to find data scientists claiming that the range of possible values for $R^2$ is strictly between 0 and 1. One of the reasons for this discrepancy lies in the classical interpretation of $R^2$, which is traditionally understood as the proportion of variance explained by the model relative to the total variance of the target variable [1].

Throughout this text, I will address the interpretation that I consider most enlightening and relevant. With it, I hope to clarify some peculiarities of $R^2$ and highlight its importance as a robust metric that is frequently consulted in regression problems.

Mean Squared Error and the choice of a constant model

The $R^2$ is a common metric in regression. However, often the first metric introduced for regression problems is the Mean Squared Error (MSE). The MSE of a model $h$ on a dataset $S = \{ (x_i, y_i) \}_{i=1}^n$ is defined by

\[\textrm{MSE}(h) = \frac{1}{n} \sum_{i=1}^n \left(y_i - h(x_i)\right)^2,\]

where we chose not to denote the dependence on $S$ in order to keep the notation more streamlined.
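
To make the definition concrete, here is a minimal NumPy sketch of the MSE. The helper name `mse` and the toy values are assumptions chosen only for illustration; they are not part of the original analysis.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical targets and predictions, chosen only for illustration.
y = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]

# The residuals are 1, 0 and -2, so the MSE is (1 + 0 + 4) / 3.
print(mse(y, predictions))
```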

Given this definition, an intriguing question arises: if you had to create a model that was merely a constant, which value would you choose? Many might answer that they would choose the mean, which is indeed one of the correct answers. However, why not consider the median, mode, or some other descriptive statistic?

The answer to this question is intrinsically linked to the cost function we wish to optimize. This choice is, in fact, a problem of decision theory [2]. For instance, if the goal is to optimize the MSE, then we would need to choose an $\alpha \in \mathbb{R}$ such that $h_\alpha(x) = \alpha$ minimizes the $\textrm{MSE}(h_\alpha)$. Mathematically, this is expressed as

\[\alpha^* = \arg\min_{\alpha \in \mathbb{R}} \textrm{MSE}(h_\alpha) = \arg\min_{\alpha \in \mathbb{R}} \left( \frac{1}{n} \sum_{i=1}^n \left(y_i - \alpha\right)^2 \right).\]

This function may seem complex at first glance, but it becomes simpler when considering only $\alpha$ as the free variable, which is how we approach this optimization problem. By expanding the square and performing the summation, we have a polynomial function of degree 2 in $\alpha$ in the form

\[\frac{1}{n} \sum_{i=1}^n \left(y_i - \alpha\right)^2 = \frac{1}{n} \sum_{i=1}^n \left(y_i^2 -2\alpha y_i + \alpha^2 \right) = \alpha^2 + \left(\frac{-2}{n} \sum_{i=1}^n y_i\right) \alpha+ \left(\frac{1}{n} \sum_{i=1}^n y_i^2\right).\]

In a quadratic function of the form $a\,\alpha^2 + b\,\alpha + c$ with $a>0$, the minimum is attained at the vertex of the parabola, located at $\alpha = \frac{-b}{2a}$. In our case, $a = 1$ and $b = \frac{-2}{n} \sum_{i=1}^n y_i$, so the minimizer is

\[\alpha^* = \frac{\left(\frac{-2}{n} \sum_{i=1}^n y_i\right)}{-2} = \frac{1}{n} \sum_{i=1}^n y_i = \bar{y}.\]

This means that the constant that minimizes the MSE is the mean of the target, $\bar{y}$, over this set. I encourage you to validate this result with other unconstrained optimization techniques, such as finding the critical points and then analyzing the concavity of the function.
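
The result can also be checked numerically. The sketch below is only an illustration on synthetic data: it evaluates the MSE of many candidate constants on a grid and compares the best one with the sample mean. The function name `mse_of_constant`, the random seed and the grid resolution are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=10.0, scale=3.0, size=200)  # synthetic target

def mse_of_constant(alpha, y):
    """MSE of the constant model h_alpha(x) = alpha on the targets y."""
    return np.mean((y - alpha) ** 2)

# Evaluate the MSE on a grid of candidate constants covering the data range.
candidates = np.linspace(y.min(), y.max(), 2001)
losses = np.array([mse_of_constant(a, y) for a in candidates])
best_on_grid = candidates[np.argmin(losses)]

print(f"mean of y:         {y.mean():.4f}")
print(f"best grid value:   {best_on_grid:.4f}")
print(f"MSE at the mean:   {mse_of_constant(y.mean(), y):.4f}")
print(f"MSE at the median: {mse_of_constant(np.median(y), y):.4f}")
```

The best grid value should agree with the mean up to the grid spacing, and the MSE at the mean should never exceed the MSE at the median.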

This behavior changes when we consider other metrics [3]. For example, the constant that minimizes the Mean Absolute Error (MAE) is the median, the constant that maximizes accuracy is the mode, and for the pinball loss it is the associated quantile. It is important to emphasize that, if we use sample_weight, all these statistics should be computed in a weighted manner.
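
As a rough illustration of this point, the sketch below compares three constants (mean, median and the 0.9 quantile) under the MAE and under the pinball loss at level 0.9. The data are synthetic and deliberately skewed so that the three statistics differ; the helper names are assumptions of this example, and the unweighted case is the only one covered.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=1000)  # synthetic, right-skewed target

def mae_of_constant(alpha, y):
    """MAE of the constant model h_alpha(x) = alpha."""
    return np.mean(np.abs(y - alpha))

def pinball_of_constant(alpha, y, q):
    """Pinball (quantile) loss of the constant alpha at quantile level q."""
    diff = y - alpha
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

candidates = {
    "mean": y.mean(),
    "median": np.median(y),
    "0.9 quantile": np.quantile(y, 0.9),
}
for name, value in candidates.items():
    print(
        f"{name:>12}: MAE = {mae_of_constant(value, y):.4f}, "
        f"pinball(q=0.9) = {pinball_of_constant(value, y, 0.9):.4f}"
    )
```

Among these candidates, the median should give the smallest MAE and the 0.9 quantile the smallest pinball loss.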

$\oint$ This is used in defining prediction values for the nodes of decision trees. Looking at the scikit-learn code for trees, we notice that, depending on the criterion, the node_value can vary. It's adjusted to reflect the value that minimizes the loss when the node makes a constant prediction. For example, for the MSE criterion, the leaf's prediction is the average of the target of the training samples that fall in that leaf, while for the MAE criterion, it's the median.

$\oint$ In practice, a model that predicts the average of the target on the test set isn't feasible, because computing that average requires knowing the $y_i$ values of that sample. However, this perspective is useful for comparing a basic model with your model, as we will discuss next.

R² as a comparison between your model and a constant model

Suppose I develop a model to predict a person's age based on their online behavior and obtain an MSE of 25 years squared. This number on its own might not be very informative. One way to interpret it is to calculate the Root Mean Squared Error, that is, $\textrm{RMSE} = \sqrt{\textrm{MSE}}$, which gives an error of 5 years. This value is more intuitive (I admit that, internally, I tend to think in terms of MAE), but it still doesn't provide a relative comparison such as "is it possible to get a value significantly lower than this?". The $R^2$ might not answer this question directly, but it helps in this evaluation.

We've already discussed a simple model that can serve as a benchmark. Imagine that the mean-based model already produces an MSE of 30 years squared. Suddenly, our previous model, which might have seemed excellent, doesn't stand out as much. If a simple model already achieves an MSE only slightly higher than that of the current model, is it worth deploying the more complex model in a production environment?

The interpretation I have of $R^2$ is precisely this comparison. Its formula is

\[R^2(h) = 1 - \frac{\textrm{MSE}(h)}{\textrm{MSE}(\bar{y})},\]

where $\bar{y}$ represents the average of the target in the set $S$ in which we are evaluating the model, and $\textrm{MSE}(\bar{y})$ denotes the MSE of the constant model that always predicts $\bar{y}$.
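
The formula above translates almost directly into code. The following is a minimal NumPy sketch, not the scikit-learn implementation; the names `mse` and `r2_from_mse` and the toy targets are assumptions of this example.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def r2_from_mse(y_true, y_pred):
    """R^2 as one minus the ratio between the model MSE and the MSE of the mean."""
    y_true = np.asarray(y_true, dtype=float)
    baseline = np.full_like(y_true, y_true.mean())  # constant model h(x) = y_bar
    return 1.0 - mse(y_true, y_pred) / mse(y_true, baseline)

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical targets, mean 3, MSE of the mean 2

print(r2_from_mse(y, y))                     # perfect predictions -> 1.0
print(r2_from_mse(y, np.full_like(y, 3.0)))  # always predicting the mean -> 0.0
print(r2_from_mse(y, y + 1.0))               # MSE of 1 against a baseline of 2 -> 0.5
print(r2_from_mse(y, np.zeros_like(y)))      # MSE of 11 against a baseline of 2 -> -4.5
```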

With this, we can understand the possible values of $R^2$:

If $R^2 = 1$, it means that $\textrm{MSE}(h) = 0$; that is, the model is perfect.
If $R^2 = 0$, we have $\textrm{MSE}(h) = \textrm{MSE}(\bar{y})$, indicating that our model is as effective as a model that simply provides the target's average.
For an $R^2$ between 0 and 1, we have $0 < \textrm{MSE}(h) < \textrm{MSE}(\bar{y})$. This indicates that the model has an error greater than zero, but less than that of a constant model based on the average.
A negative $R^2$ suggests that $\textrm{MSE}(h) > \textrm{MSE}(\bar{y})$, meaning our model is less accurate than one that always provides the average.

This interpretation helps in understanding the values obtained when using the function sklearn.metrics.r2_score. In the previous example, we would have an $R^2$ of $(1 - 25/30) \approx 0.17$, indicating a model that surpasses the simple model, but not very significantly.
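
For completeness, the arithmetic of the age example can be reproduced in a few lines. The two MSE values come from the text; everything else is plain computation.

```python
import math

mse_model = 25.0     # MSE of the age model, in years squared (from the text)
mse_baseline = 30.0  # MSE of the mean-based model, in years squared (from the text)

rmse_model = math.sqrt(mse_model)
rmse_baseline = math.sqrt(mse_baseline)
r2 = 1.0 - mse_model / mse_baseline

print(f"RMSE of the model:    {rmse_model:.2f} years")
print(f"RMSE of the baseline: {rmse_baseline:.2f} years")
print(f"R^2: {r2:.3f}")
```

Seen through the RMSE, the model improves on the baseline by less than half a year, which is consistent with the modest $R^2$.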

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hold out a third of the California housing data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    *fetch_california_housing(return_X_y=True),
    test_size=0.33,
    random_state=42,
)

lr = LinearRegression().fit(X_train, y_train)

def evaluate_model(y_true, y_pred):
    """Print the MSE and the R^2 of the predictions."""
    print(f"MSE: {mean_squared_error(y_true, y_pred)}")
    print(f"R^2: {r2_score(y_true, y_pred)}")

y_pred_lr = lr.predict(X_test)
evaluate_model(y_test, y_pred_lr)

MSE: 0.5369686543372444
R^2: 0.5970494128783965

# Constant baseline: always predict the mean of the test target.
y_mean_test = y_test.mean() * np.ones_like(y_test)
evaluate_model(y_test, y_mean_test)

MSE: 1.3325918152222385
R^2: 0.0

# A deliberately bad model: always predict zero.
y_pred_terrible_model = np.zeros_like(y_test)
evaluate_model(y_test, y_pred_terrible_model)

MSE: 5.6276808369101445
R^2: -3.2231092616846126

Although an $R^2$ of zero might seem like a floor that any reasonable model reaches, the baseline behind this metric relies on data leakage: it uses the mean of the very set on which the model is evaluated. In practice, we build our models using training data only, and in scenarios subject to "dataset shift," fundamental statistics such as the average can change significantly between training and evaluation. A constant model that predicts the mean of the training target therefore obtains an $R^2$ at or below zero on the test set:

# Constant baseline without leakage: the mean comes from the training target.
y_mean_train = y_train.mean() * np.ones_like(y_test)
evaluate_model(y_test, y_mean_train)

MSE: 1.3326257277946882
R^2: -2.5448582275933163e-05

Regardless of these nuances, interpreting the $R^2$ in this way offers a valuable comparative mindset. It's always essential to compare your model with simple baselines, whether with established business rules or with more basic models, like a constant.

Generalization of R² beyond MSE

The notion of comparison with a basic or simple model can easily be generalized to other metrics, as long as we know which statistic to use as a baseline. With this in mind, I propose extending the idea to the MAE, using the median $\tilde{y}$ as the baseline model:

\[R^2_{\textrm{MAE}}(h) = 1 - \frac{\textrm{MAE}(h)}{\textrm{MAE}(\tilde{y})},\]

where

\[\textrm{MAE}(h) = \frac{1}{n} \sum_{i=1}^n \left| y_i - h(x_i) \right|.\]

Thus, the $R^2_{\textrm{MAE}}$ provides a way to evaluate the model's performance relative to a simple baseline, using the MAE as the error metric.

from sklearn.metrics import mean_absolute_error

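def r2_score_mae(y_true, y_pred, **kwargs):
    """R^2-like score built on the MAE, with the median as the constant baseline."""
    mae_model = mean_absolute_error(y_true, y_pred, **kwargs)
    # Baseline: always predict the median of y_true.
    # Note: if sample_weight is passed, a weighted median would be the matching baseline.
    y_median_true = np.median(y_true) * np.ones_like(y_true)
    mae_median = mean_absolute_error(y_true, y_median_true, **kwargs)
    return 1 - mae_model / mae_median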

def evaluate_model_mae(y_true, y_pred):
    print(f"MAE: {mean_absolute_error(y_true, y_pred)}")
    print(f"R^2_MAE: {r2_score_mae(y_true, y_pred)}")

evaluate_model_mae(y_test, y_pred_lr)

MAE: 0.5295710106684688
R^2_MAE: 0.40256278728026484

# Constant baseline: always predict the median of the test target.
y_median_test = np.median(y_test) * np.ones_like(y_test)
evaluate_model_mae(y_test, y_median_test)

MAE: 0.8864044612448619
R^2_MAE: 0.0

Final considerations

The misconception that $R^2$ varies only between 0 and 1 originates from a simplified reading of its most common interpretation: the proportion of the target's variance that is explained by the independent variables, which suggests a value between 0% and 100%. In practice, $R^2$ does fall within this range in many cases. However, whenever the model is worse than a simple horizontal model (i.e., a straight line at the average), $R^2$ takes negative values. This scenario is often overlooked by the statistical community because it is usually associated with overfitting. A linear regression, which tends to suffer from underfitting, will rarely be inferior to the horizontal model, since that model is included in the hypothesis space of linear regression.
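
The last point can be illustrated with a small experiment. On the training set, an ordinary least squares fit with an intercept cannot do worse than the horizontal model, so its in-sample $R^2$ is never negative; on held-out data there is no such guarantee. The sketch below uses synthetic features that carry no signal about the target, a setting chosen on purpose to provoke overfitting. All names and sizes are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def r2(y_true, y_pred):
    """R^2 relative to the mean of the set being evaluated."""
    mse_model = np.mean((y_true - y_pred) ** 2)
    mse_mean = np.mean((y_true - y_true.mean()) ** 2)
    return 1.0 - mse_model / mse_mean

def add_intercept(X):
    """Prepend a column of ones so the fit includes a constant term."""
    return np.column_stack([np.ones(len(X)), X])

# Synthetic data where the features carry no signal about the target.
X_train, y_train = rng.normal(size=(30, 10)), rng.normal(size=30)
X_test, y_test = rng.normal(size=(30, 10)), rng.normal(size=30)

# Ordinary least squares with an intercept, solved with NumPy.
coef, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

r2_train = r2(y_train, add_intercept(X_train) @ coef)
r2_test = r2(y_test, add_intercept(X_test) @ coef)
print(f"train R^2: {r2_train:.3f}")
print(f"test R^2:  {r2_test:.3f}")
```

With pure noise and few samples, the test $R^2$ will typically be negative, while the training $R^2$ stays between 0 and 1.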

Throughout this post, we analyzed some of the reasons why $R^2$ is such an interesting and widely used metric in regression problems. By understanding the implicit comparison with a baseline model, we gain a valuable perspective on the relative performance of our model, since the comparison puts MSE values, which are not very informative in isolation, on a relative scale. Moreover, the interpretation proposed here allows us to understand the resulting values in a clear and objective manner.