Basics of Linear Regression

Categories: Linear Regression, R, Python

A quick introduction to Linear Regression Analysis and how to use R (and Python as well) to perform it.

Carlos Trucíos https://ctruciosm.github.io
02-25-2021

Introduction

Linear Regression Analysis (LRA) is one of the most popular and useful statistical learning techniques and is helpful when we are interested in explaining/predicting the variable \(y\) using a set of \(k\) explanatory variables \(x_1, \ldots, x_k\).

Basically, we are saying that the \(k\) explanatory variables can help us to understand the behaviour of \(y\) and, in a linear regression framework, the relation between \(y\) and the \(x\)’s is given by a linear function of the form,

\[y = \underbrace{\beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k}_{f(x_1, \ldots, x_k)} + u,\] where \(u\) is an error term.

OLS estimation

In practice we never know \(\beta = [\beta_0, \beta_1, \ldots, \beta_k]'\) and we have to estimate it using data. Several methods exist for this purpose, OLS (Ordinary Least Squares) being the most commonly used1.

The OLS estimator is given by \[\hat{\beta}_{OLS} = (X'X)^{-1}X'Y,\] with its respective covariance matrix (conditionally on \(X\)) given by \[V(\hat{\beta}_{OLS}|X) = \sigma^2(X'X)^{-1},\] where \(Y = [y_1, \ldots, y_n]'\) and \(X = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,k} \\ \vdots & \vdots & \cdots & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,k} \end{bmatrix}.\)
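To see that the formula above is really all there is to it, the sketch below computes \(\hat{\beta}_{OLS}\) and its estimated covariance matrix by hand on simulated data and compares the result with R’s lm(). This is only an illustrative sketch: the sample size, regressors and coefficient values are arbitrary choices, not part of the original post.

# Illustrative sketch: OLS "by hand" on simulated data, compared with lm()
set.seed(123)
n  <- 200
x1 <- rnorm(n)
x2 <- runif(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)            # true betas: 1, 2, -3

X <- cbind(1, x1, x2)                            # design matrix with a column of ones
beta_hat   <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'Y
sigma2_hat <- sum((y - X %*% beta_hat)^2) / (n - ncol(X))  # residual variance estimate
vcov_hat   <- sigma2_hat * solve(t(X) %*% X)     # estimate of sigma^2 (X'X)^{-1}

beta_hat                                         # OLS computed by hand
coef(lm(y ~ x1 + x2))                            # same numbers obtained with lm()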

The Gauss–Markov theorem states that, under some assumptions (known as the Gauss–Markov hypotheses), \(\hat{\beta}_{OLS}\) is the Best Linear Unbiased Estimator (BLUE), i.e. for any other linear unbiased estimator2 \(\tilde{\beta}\), \[V(\tilde{\beta}|X) \geq V(\hat{\beta}_{OLS}|X).\]

Figure 1 displays an example of the regression line \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\) obtained by OLS.

Figure 1: OLS regression line example
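A figure like Figure 1 can be reproduced in a few lines of base R. The sketch below uses simulated data (all names and values are arbitrary, chosen only for illustration) to draw the scatter plot and add the fitted OLS line.

set.seed(2021)
x <- runif(100, 0, 10)
y <- 1 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)                                 # simple regression with one regressor
plot(x, y, pch = 19, col = "grey50", main = "OLS regression line example")
abline(fit, col = "red", lwd = 2)                # adds the line y_hat = b0_hat + b1_hat * x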

R implementation

Performing a linear regression in R is straightforward. To see how to implement it, let’s use the hprice1 dataset from the wooldridge R package.

To perform the linear regression \[price = \beta_0 + \beta_1 bdrms + \beta_2 lotsize + \beta_3 sqrft + \beta_4 colonial + u,\] we use

library(wooldridge)
model = lm(price~bdrms+lotsize+sqrft+colonial, data = hprice1)
model

Call:
lm(formula = price ~ bdrms + lotsize + sqrft + colonial, data = hprice1)

Coefficients:
(Intercept)        bdrms      lotsize        sqrft     colonial  
 -24.126528    11.004292     0.002076     0.124237    13.715542  

A better output, which includes the standard errors of \(\hat{\beta}\), t-tests, the F-test, \(R^2\) and p-values, can easily be obtained with

summary(model)

Call:
lm(formula = price ~ bdrms + lotsize + sqrft + colonial, data = hprice1)

Residuals:
     Min       1Q   Median       3Q      Max 
-122.268  -38.271   -6.545   28.210  218.040 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.413e+01  2.960e+01  -0.815  0.41741    
bdrms        1.100e+01  9.515e+00   1.156  0.25080    
lotsize      2.076e-03  6.427e-04   3.230  0.00177 ** 
sqrft        1.242e-01  1.334e-02   9.314 1.53e-14 ***
colonial     1.372e+01  1.464e+01   0.937  0.35146    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 59.88 on 83 degrees of freedom
Multiple R-squared:  0.6758,    Adjusted R-squared:  0.6602 
F-statistic: 43.25 on 4 and 83 DF,  p-value: < 2.2e-16

Interpretation

Before interpreting the results, it is very important to know our dataset and fully understand the variables we are using. Thus, let’s take a glimpse at our dataset

library(dplyr)
hprice1 %>% select(price, bdrms, lotsize, sqrft, colonial) %>% glimpse()
Rows: 88
Columns: 5
$ price    <dbl> 300.000, 370.000, 191.000, 195.000, 373.000, 466.2…
$ bdrms    <int> 4, 3, 3, 3, 4, 5, 3, 3, 3, 3, 4, 5, 3, 3, 3, 4, 4,…
$ lotsize  <dbl> 6126, 9903, 5200, 4600, 6095, 8566, 9000, 6210, 60…
$ sqrft    <int> 2438, 2076, 1374, 1448, 2514, 2754, 2067, 1731, 17…
$ colonial <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0,…

The description of the variables is given below:

Variable | Description
price | house price, in $1000s
bdrms | number of bedrooms
lotsize | size of lot, in square feet
sqrft | size of house, in square feet
colonial | dummy variable (= 1 if the home is colonial style)

There are several points to be addressed when looking at the output of our regression model provided by the summary() function:

- Ceteris paribus (holding the other regressors fixed), one additional bedroom increases the predicted price by about $11,004 (recall that price is measured in $1000s), and a colonial-style house is predicted to cost about $13,716 more than a non-colonial one.
- The coefficient of lotsize is 0.002076, so an increase of about 481 square feet in the lot size raises the predicted price by roughly one thousand dollars4.
- The coefficient of sqrft is 0.1242, so roughly 8 additional square feet of house raise the predicted price by about one thousand dollars5.
- Looking at the individual t-tests, only lotsize and sqrft are statistically significant at the usual 5% level; bdrms and colonial are not.
- The \(R^2_{Adjusted}\)3 is 0.6602, i.e. the regressors explain around 66% of the sample variation in price.

Finally, the summary() output also provides useful information to jointly test \[H_0: \beta_{bdrms}=0,\beta_{lotsize}=0,\beta_{sqrft}=0,\beta_{colonial}=0\] versus \[H_1: H_0 \text{ is not true. }\] Using the F-test, we reject \(H_0\) (p-value \(\approx\) 0, F-statistic = 43.25).
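The same F-statistic can be obtained explicitly by comparing the full model with the restricted (intercept-only) model implied by \(H_0\). This is a minimal sketch assuming the model object fitted earlier is still in memory.

restricted <- lm(price ~ 1, data = hprice1)      # model under H0: only the intercept
anova(restricted, model)                         # F = 43.25 on (4, 83) df, p-value approximately 0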

Of course, this interpretation was made assuming that the classical linear model hypotheses hold. If we have evidence that some hypothesis is violated, we need to improve/correct our model and only interpret it once the classical linear model hypotheses have been verified.
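Some of these checks can be run quickly in R. The sketch below shows the usual base-R diagnostic plots, a normality test on the residuals and the Breusch–Pagan test for heteroskedasticity; the lmtest package is an extra dependency not used elsewhere in this post, and these checks are by no means exhaustive.

# Quick (non-exhaustive) diagnostic checks for the fitted model
par(mfrow = c(2, 2))
plot(model)                                      # residuals vs fitted, QQ-plot, scale-location, leverage
par(mfrow = c(1, 1))
shapiro.test(residuals(model))                   # H0: residuals are normally distributed
library(lmtest)
bptest(model)                                    # Breusch-Pagan test; H0: homoskedastic errors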

In Wooldridge’s book6 we find an interesting discussion about model interpretation depending on whether the \(\log(\cdot)\) transformation is used or not, which can be summarised as:

Dependent variable | Independent variable | Interpretation of \(\beta\)
\(y\) | \(x\) | \(\Delta y = \beta \Delta x\)
\(y\) | \(\log(x)\) | \(\Delta y = \big(\beta/100 \big) \% \Delta x\)
\(\log(y)\) | \(x\) | \(\% \Delta y = 100\beta \Delta x\)
\(\log(y)\) | \(\log(x)\) | \(\% \Delta y = \beta \% \Delta x\)
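For instance, a log–log style specification of the house-price regression can be estimated by using log() directly in the formula. The particular specification below is only an illustration (it is not necessarily the one discussed in the book).

# Coefficients on logged regressors are (approximately) elasticities
log_model <- lm(log(price) ~ log(lotsize) + log(sqrft) + bdrms + colonial,
                data = hprice1)
summary(log_model)
# e.g., the coefficient of log(sqrft) is the expected percentage change in price
# associated with a 1% increase in sqrft, holding the other regressors fixed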

Conclusions

Bonus

Python implementation

import statsmodels.api as sm
import pandas as pd
from patsy import dmatrices

url = "https://raw.githubusercontent.com/ctruciosm/statblog/master/datasets/hprice1.csv"
hprice1 = pd.read_csv(url)

y, X = dmatrices('price ~ bdrms + lotsize + sqrft + colonial', 
                  data = hprice1, return_type = 'dataframe')
# Describe model
model = sm.OLS(y, X)
# Fit model
model_fit = model.fit()
# Summarize model
print(model_fit.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.676
Model:                            OLS   Adj. R-squared:                  0.660
Method:                 Least Squares   F-statistic:                     43.25
Date:                Sat, 27 Feb 2021   Prob (F-statistic):           1.45e-19
Time:                        18:16:43   Log-Likelihood:                -482.41
No. Observations:                  88   AIC:                             974.8
Df Residuals:                      83   BIC:                             987.2
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -24.1265     29.603     -0.815      0.417     -83.007      34.754
bdrms         11.0043      9.515      1.156      0.251      -7.921      29.930
lotsize        0.0021      0.001      3.230      0.002       0.001       0.003
sqrft          0.1242      0.013      9.314      0.000       0.098       0.151
colonial      13.7155     14.637      0.937      0.351     -15.397      42.828
==============================================================================
Omnibus:                       24.904   Durbin-Watson:                   2.117
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               45.677
Skew:                           1.091   Prob(JB):                     1.21e-10
Kurtosis:                       5.774   Cond. No.                     6.43e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.43e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

  1. The basic idea of OLS is to find the values \(\hat{\beta}\) that minimize the sum of squared prediction errors.↩︎

  2. A linear estimator is an estimator of the form \(\tilde{\beta} = A'Y\), where \(A\) is an \(n \times (k+1)\) matrix that is a function of \(X\).↩︎

  3. Usually, we prefer \(R^2_{Adjusted}\) rather than \(R^2\)↩︎

  4. \(2.076e-03*481 = 0.998556 \approx 1\)↩︎

  5. \(0.1242*8 = 0.9936 \approx 1\)↩︎

  6. Wooldridge, J. M. (2016). Introductory econometrics: A modern approach. Nelson Education.↩︎

Citation

For attribution, please cite this work as

Trucíos (2021, Feb. 25). Statistical Data Science: Basics of Linear Regression. Retrieved from https://ctruciosm.github.io/statblog/posts/2021-02-25-basics-of-linear-regression/

BibTeX citation

@misc{trucíos2021basics,
  author = {Trucíos, Carlos},
  title = {Statistical Data Science: Basics of Linear Regression},
  url = {https://ctruciosm.github.io/statblog/posts/2021-02-25-basics-of-linear-regression/},
  year = {2021}
}