The coefficient of determination, also known as R-squared (R²), is a statistic that measures how well a regression model fits the data. R-squared shows the proportion of variation in the response variable that can be explained by the predictors in the model. Calculating R-squared is simple once you understand its formula and components.
In this comprehensive guide, I’ll walk you through the step-by-step process of calculating R-squared using its formula. I’ll also provide examples using regression analysis in statistical software to demonstrate how to interpret R-squared. Let’s get started!
What is the Coefficient of Determination?
The coefficient of determination (R²) measures the degree to which the variance of the response variable Y is explained by the predictors (X variables) in a regression model.
R-squared is defined as the proportion of total variation in Y that is explained by the regression model. It ranges from 0 to 1, with higher values indicating more of the response variable variation is accounted for by the predictors.
- R² near 0 means the model does not fit the data well (low explanatory power).
- R² near 1 indicates the regression model fits the data very well.
R-squared is also called the explained variation because it represents the amount of variation in Y that is explained by the X variables in the model. The remaining unexplained variation is captured by the error term.
R-Squared Formula
R-squared is calculated using the formula:
$$R^2 = \frac{SSR}{SST}$$
Where:
- SSR = Regression sum of squares (variation explained by model)
- SST = Total sum of squares (total variation in Y)
This formula is based on partitioning the total sum of squares (SST) into:
- Sum of squares regression (SSR): Variation explained by X variables
- Sum of squares error (SSE): Unexplained variation (residuals)
$$SST = SSR + SSE$$
Substituting SST in the formula gives:
$$R^2 = \frac{SSR}{SSR + SSE}$$
Which is the proportion of explained variation out of total variation.
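As a quick numeric check of this identity, here is a minimal Python sketch (the sums-of-squares values below are made up purely for illustration):

```python
# Hypothetical sums of squares, chosen only to illustrate the identity
ssr = 44.1          # variation explained by the model
sse = 1.9           # residual (unexplained) variation
sst = ssr + sse     # total variation: SST = SSR + SSE

# Both forms of the formula give the same R-squared
print(ssr / sst)           # proportion of explained variation
print(ssr / (ssr + sse))   # identical, by the partition above
```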
How to Calculate R-Squared Step-by-Step
Follow these 5 steps to calculate the coefficient of determination R-squared:
Step 1: Perform Regression Analysis
First, perform a regression analysis between the response (Y) and predictor variables (X). This gives the regression equation relating X and Y.
For example, let’s say we perform a simple linear regression of Y on X. This gives:
$$Y = b_0 + b_1 X$$
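As a sketch, assuming the same example data used later in this article, the fit can be done with `numpy.polyfit`:

```python
import numpy as np

# Example data (matches the worked example later in the article)
x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 5, 7, 8, 12])

# Ordinary least squares fit of Y = b0 + b1*X
# polyfit returns coefficients highest-degree first: [b1, b0]
b1, b0 = np.polyfit(x, y, deg=1)
print(f"Y = {b0:.1f} + {b1:.1f}X")
```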
Step 2: Calculate SST
Find the total sum of squares (SST) using the formula:
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
Where:
- $y_i$ = observed response value
- $\bar{y}$ = mean of the response
This measures the total deviation of each y value from the mean.
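For instance, a short Python sketch of this step, using the example response values that appear later in this article:

```python
# Observed response values (from the article's worked example)
y = [3, 5, 7, 8, 12]
y_bar = sum(y) / len(y)  # mean of the response = 7.0

# Total sum of squares: squared deviation of each y from the mean
sst = sum((yi - y_bar) ** 2 for yi in y)
print(sst)  # 46.0
```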
Step 3: Calculate SSR
Next, calculate the regression sum of squares (SSR) using:
$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$
Where $\hat{y}_i$ are the predicted y values from the regression equation.
SSR measures the variation explained by the model.
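A matching Python sketch, assuming the fitted line Y = 0.7 + 2.1X from the article's example data:

```python
# Predicted values from the assumed fit Y = 0.7 + 2.1X
x = [1, 2, 3, 4, 5]
y_hat = [0.7 + 2.1 * xi for xi in x]

# Mean of the observed responses
y_bar = 7.0

# Regression sum of squares: squared deviations of predictions from the mean
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
print(round(ssr, 1))  # 44.1
```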
Step 4: Calculate SSE
The sum of squares error (SSE) is found by:
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
SSE represents the residual variation not explained by the model.
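And the corresponding sketch for the residual variation, under the same assumed fit:

```python
# Observed responses and predictions from the assumed fit Y = 0.7 + 2.1X
y = [3, 5, 7, 8, 12]
y_hat = [0.7 + 2.1 * xi for xi in [1, 2, 3, 4, 5]]

# Error sum of squares: squared residuals (observed minus predicted)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
print(round(sse, 1))  # 1.9
```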
Step 5: Compute R-Squared
Finally, compute R-squared using its formula:
$$R^2 = \frac{SSR}{SST} = \frac{SSR}{SSR + SSE}$$
Substitute the values calculated for SSR and SST. This gives the coefficient of determination.
And we’re done! Those are the key steps involved in calculating R-squared manually from a regression analysis. Now let’s look at some examples.
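The five steps above can be sketched end to end in plain Python; the data here matches the worked example in the next section:

```python
# Example data
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 8, 12]
n = len(x)

# Step 1: ordinary least squares fit Y = b0 + b1*X
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

# Steps 2-4: the three sums of squares
y_hat = [b0 + b1 * xi for xi in x]
sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

# Step 5: R-squared (both forms of the formula agree)
r2 = ssr / sst
print(round(r2, 3))  # 0.959
```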
R-Squared Example in Excel
Let’s see how to find R-squared for a simple linear regression example in Excel.
Given this X and Y data:
| X | Y  |
|---|----|
| 1 | 3  |
| 2 | 5  |
| 3 | 7  |
| 4 | 8  |
| 5 | 12 |
We perform a linear regression of Y on X, which gives the regression equation:

$$Y = 0.7 + 2.1X$$
Use Excel to find the R-squared value as follows:
1. Input the X and Y data
2. Click the Data Analysis button and select Regression
3. Select the Y and X input ranges
4. Check the Residuals box
5. Click OK
This gives the following regression output with R-squared and the sums of squares:
| Regression Statistics |       |
|-----------------------|-------|
| R-squared             | 0.959 |
| Observations          | 5     |

| ANOVA      | SS   |
|------------|------|
| Regression | 44.1 |
| Residual   | 1.9  |
| Total      | 46   |
- SST (Total SS) = 46
- SSR (Regression SS) = 44.1
- SSE (Residual SS) = 1.9
Finally, calculate R-squared using the formula:
$$R^2 = \frac{SSR}{SST} = \frac{44.1}{46} \approx 0.959$$
We get the same R-squared value of 0.959 as in the Excel output. This confirms our manual R-squared calculation.
R-Squared in R Programming
Calculating R-squared in R is straightforward using the `lm()` function for linear regression.
Let’s use the same X and Y data from the Excel example:
```r
# Store data in vectors
X <- c(1, 2, 3, 4, 5)
Y <- c(3, 5, 7, 8, 12)

# Linear model
model <- lm(Y ~ X)

# Summary gives R-squared
summary(model)$r.squared
```
This prints an R-squared value of approximately 0.959. The `summary()` function applied to the linear model returns a detailed table including R-squared.
We don’t have to manually calculate the sums of squares – R computes them automatically behind the scenes!
R-Squared in Python
Here is how to find R-squared in Python using `sklearn`:
```python
from sklearn.linear_model import LinearRegression

# X and Y data
X = [[1], [2], [3], [4], [5]]
y = [3, 5, 7, 8, 12]

# Create and fit model
model = LinearRegression()
model.fit(X, y)

# R-squared score
r_squared = model.score(X, y)
print(r_squared)
```
The `LinearRegression.score()` method returns the R-squared value. We get approximately 0.959 again for this data.
Interpreting R-Squared
When interpreting R-squared, consider:
- Scale: R-squared is measured on a 0 to 1 scale. Values close to 1 indicate a better model fit.
- Context: What counts as an acceptable R-squared value depends on the problem context. Higher is better, but a high value may not be possible or expected in every field.
- Use cautiously: While R-squared indicates model fit, a higher R-squared does not necessarily mean better predictive performance, and it does not indicate causation between the X and Y variables.
- Multiple regression: With multiple X variables, adjusted R-squared is used to account for the number of predictors in the model, but the interpretation remains similar.
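For reference, adjusted R-squared is commonly computed as $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of predictors. A minimal sketch (the function name here is my own, not from any library):

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Example: R-squared of 0.959 with n = 5 observations and p = 1 predictor
print(round(adjusted_r_squared(0.959, 5, 1), 3))  # 0.945
```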
Overall, R-squared gives the proportion of variation explained by the model – a valuable statistic for evaluating and comparing regression analyses. Used properly, it can aid model selection and improvement.
Common Questions about R-Squared
Here are answers to some frequently asked questions about the coefficient of determination:
How is R-squared calculated for multiple regression?
The R-squared formula remains the same for multiple regression models with several X variables. The process for calculating SST, SSR and SSE is identical.
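A sketch with two predictors, using made-up data and NumPy's least-squares solver (rather than any particular statistics package):

```python
import numpy as np

# Made-up data with two predictor columns (illustration only)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = np.array([4.0, 5.0, 9.0, 10.0, 14.0])

# Ordinary least squares with an intercept column
A = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef

# Same sums-of-squares formula as in the simple regression case
sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
r2 = 1 - sse / sst
print(r2)
```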
What is the difference between R-squared and Adjusted R-squared?
Adjusted R-squared penalizes model complexity, so it is always less than or equal to R-squared. It is used more for comparing models than for measuring fit.
Can R-squared be negative?
For a linear regression fitted by ordinary least squares with an intercept, R-squared lies between 0 and 1. However, when R-squared is computed as $1 - SSE/SST$, it can be negative – for example, for a model fitted without an intercept, or a model evaluated on new data. A negative value simply means the model predicts worse than always predicting the mean of the response.
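A quick demonstration that the $1 - SSE/SST$ form can go negative when predictions are worse than simply guessing the mean (the toy numbers are made up):

```python
# Toy values: predictions deliberately worse than predicting the mean
y_true = [1, 2, 3]
y_pred = [3, 1, 2]

y_bar = sum(y_true) / len(y_true)
sst = sum((yt - y_bar) ** 2 for yt in y_true)                # total variation
sse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))  # residual variation

r2 = 1 - sse / sst
print(r2)  # -2.0
```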