Basics of Linear Regression

Categories: Linear Regression, R, Python

A quick introduction to Linear Regression Analysis and how to use R (and Python as well) to perform it.

Carlos Trucíos https://ctruciosm.github.io
02-25-2021

Introduction

Linear Regression Analysis (LRA) is one of the most popular and useful statistical learning techniques and is helpful when we are interested in explaining/predicting the variable \(y\) using a set of \(k\) explanatory variables \(x_1, \ldots, x_k\).

Basically, we are saying that the \(k\) explanatory variables can help us to understand the behaviour of \(y\) and, in a linear regression framework, the relation between \(y\) and the \(x\)’s is given by a linear function of the form,

\[y = \underbrace{\beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k}_{f(x_1, \ldots, x_k)} + u,\] where \(u\) is an error term.

OLS estimation

In practice we never know \(\beta = [\beta_0, \beta_1, \ldots, \beta_k]'\) and we have to estimate it using data. Several methods exist for this purpose, OLS (Ordinary Least Squares) being the most commonly used1.

The OLS estimator is given by \[\hat{\beta}_{OLS} = (X'X)^{-1}X'Y,\] with its respective covariance matrix (conditionally on \(X\)) given by \[V(\hat{\beta}_{OLS}|X) = \sigma^2(X'X)^{-1},\] where \(Y = [y_1, \ldots, y_n]'\) and \(X = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,k} \\ \vdots & \vdots & \cdots & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,k} \end{bmatrix}.\)
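To see that the formula above is really all there is to it, the sketch below computes \(\hat{\beta}_{OLS}\) and its estimated covariance matrix by hand on simulated data and compares the result with R’s lm(). This is only an illustrative sketch: the sample size, regressors and coefficient values are arbitrary choices, not part of the original post.

# Illustrative sketch: OLS "by hand" on simulated data, compared with lm()
set.seed(123)
n  <- 200
x1 <- rnorm(n)
x2 <- runif(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)            # true betas: 1, 2, -3

X <- cbind(1, x1, x2)                            # design matrix with a column of ones
beta_hat   <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'Y
sigma2_hat <- sum((y - X %*% beta_hat)^2) / (n - ncol(X))  # residual variance estimate
vcov_hat   <- sigma2_hat * solve(t(X) %*% X)     # estimate of sigma^2 (X'X)^{-1}

beta_hat                                         # OLS computed by hand
coef(lm(y ~ x1 + x2))                            # same numbers obtained with lm()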

The Gauss–Markov theorem states that, under some assumptions (known as the Gauss–Markov hypotheses), \(\hat{\beta}_{OLS}\) is the Best Linear Unbiased Estimator (BLUE), i.e. for any other linear unbiased estimator2 \(\tilde{\beta}\), \[V(\tilde{\beta}|X) \geq V(\hat{\beta}_{OLS}|X).\]

Figure 1 displays an example of the regression line \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\) obtained by OLS.

Figure 1: OLS regression line example
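A figure like Figure 1 can be reproduced in a few lines of base R. The sketch below uses simulated data (all names and values are arbitrary, chosen only for illustration) to draw the scatter plot and add the fitted OLS line.

set.seed(2021)
x <- runif(100, 0, 10)
y <- 1 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)                                 # simple regression with one regressor
plot(x, y, pch = 19, col = "grey50", main = "OLS regression line example")
abline(fit, col = "red", lwd = 2)                # adds the line y_hat = b0_hat + b1_hat * x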

R implementation

Performing a linear regression in R is straightforward. To see how to implement it, let’s use the hprice1 dataset from the wooldridge R package.

To perform the linear regression \[price = \beta_0 + \beta_1 bdrms + \beta_2 lotsize + \beta_3 sqrft + \beta_4 colonial + u,\] we use

library(wooldridge)
model = lm(price~bdrms+lotsize+sqrft+colonial, data = hprice1)
model

Call:
lm(formula = price ~ bdrms + lotsize + sqrft + colonial, data = hprice1)

Coefficients:
(Intercept)        bdrms      lotsize        sqrft     colonial  
 -24.126528    11.004292     0.002076     0.124237    13.715542  

A better output, which includes the standard errors of \(\hat{\beta}\), t-tests, the F-test, \(R^2\) and p-values, can easily be obtained with

summary(model)

Call:
lm(formula = price ~ bdrms + lotsize + sqrft + colonial, data = hprice1)

Residuals:
     Min       1Q   Median       3Q      Max 
-122.268  -38.271   -6.545   28.210  218.040 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.413e+01  2.960e+01  -0.815  0.41741    
bdrms        1.100e+01  9.515e+00   1.156  0.25080    
lotsize      2.076e-03  6.427e-04   3.230  0.00177 ** 
sqrft        1.242e-01  1.334e-02   9.314 1.53e-14 ***
colonial     1.372e+01  1.464e+01   0.937  0.35146    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 59.88 on 83 degrees of freedom
Multiple R-squared:  0.6758,    Adjusted R-squared:  0.6602 
F-statistic: 43.25 on 4 and 83 DF,  p-value: < 2.2e-16

Interpretation

Before interpreting the results, it is very important to know our dataset and fully understand the variables we are using. Thus, let’s take a glimpse at our dataset

library(dplyr)
hprice1 %>% select(price, bdrms, lotsize, sqrft, colonial) %>% glimpse()
Rows: 88
Columns: 5
$ price    <dbl> 300.000, 370.000, 191.000, 195.000, 373.000, 466.2…
$ bdrms    <int> 4, 3, 3, 3, 4, 5, 3, 3, 3, 3, 4, 5, 3, 3, 3, 4, 4,…
$ lotsize  <dbl> 6126, 9903, 5200, 4600, 6095, 8566, 9000, 6210, 60…
$ sqrft    <int> 2438, 2076, 1374, 1448, 2514, 2754, 2067, 1731, 17…
$ colonial <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0,…

The description of the variables is given below:

Variable | Description
price | house price, in $1000s
bdrms | number of bedrooms
lotsize | size of lot, in square feet
sqrft | size of house, in square feet
colonial | dummy variable (= 1 if the home is colonial style)

There are several points to be addressed when looking at the output of our regression model provided by the summary() function:

- Ceteris paribus (holding the other regressors fixed), one additional bedroom increases the predicted price by about $11,004 (recall that price is measured in $1000s), and a colonial-style house is predicted to cost about $13,716 more than a non-colonial one.
- The coefficient of lotsize is 0.002076, so an increase of about 481 square feet in the lot size raises the predicted price by roughly one thousand dollars4.
- The coefficient of sqrft is 0.1242, so roughly 8 additional square feet of house raise the predicted price by about one thousand dollars5.
- Looking at the individual t-tests, only lotsize and sqrft are statistically significant at the usual 5% level; bdrms and colonial are not.
- The \(R^2_{Adjusted}\)3 is 0.6602, i.e. the regressors explain around 66% of the sample variation in price.

Finally, the summary() output also provides useful information to jointly test \[H_0: \beta_{bdrms}=0,\beta_{lotsize}=0,\beta_{sqrft}=0,\beta_{colonial}=0\] versus \[H_1: H_0 \text{ is not true. }\] Using the F-test, we reject \(H_0\) (p-value \(\approx\) 0, F-statistic = 43.25).
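The same F-statistic can be obtained explicitly by comparing the full model with the restricted (intercept-only) model implied by \(H_0\). This is a minimal sketch assuming the model object fitted earlier is still in memory.

restricted <- lm(price ~ 1, data = hprice1)      # model under H0: only the intercept
anova(restricted, model)                         # F = 43.25 on (4, 83) df, p-value approximately 0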

Of course, this interpretation was made assuming that the classical linear model hypotheses hold. If we have evidence that some hypothesis is violated, we need to improve/correct our model and only interpret it once the classical linear model hypotheses have been verified.
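Some of these checks can be run quickly in R. The sketch below shows the usual base-R diagnostic plots, a normality test on the residuals and the Breusch–Pagan test for heteroskedasticity; the lmtest package is an extra dependency not used elsewhere in this post, and these checks are by no means exhaustive.

# Quick (non-exhaustive) diagnostic checks for the fitted model
par(mfrow = c(2, 2))
plot(model)                                      # residuals vs fitted, QQ-plot, scale-location, leverage
par(mfrow = c(1, 1))
shapiro.test(residuals(model))                   # H0: residuals are normally distributed
library(lmtest)
bptest(model)                                    # Breusch-Pagan test; H0: homoskedastic errors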

In Wooldridge’s book6 we find an interesting discussion about model interpretation depending on whether the \(\log(\cdot)\) transformation is used or not, which can be summarised as:

Dependent variable | Independent variable | Interpretation of \(\beta\)
\(y\) | \(x\) | \(\Delta y = \beta \Delta x\)
\(y\) | \(\log(x)\) | \(\Delta y = \big(\beta/100 \big) \% \Delta x\)
\(\log(y)\) | \(x\) | \(\% \Delta y = 100\beta \Delta x\)
\(\log(y)\) | \(\log(x)\) | \(\% \Delta y = \beta \% \Delta x\)
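For instance, a log–log style specification of the house-price regression can be estimated by using log() directly in the formula. The particular specification below is only an illustration (it is not necessarily the one discussed in the book).

# Coefficients on logged regressors are (approximately) elasticities
log_model <- lm(log(price) ~ log(lotsize) + log(sqrft) + bdrms + colonial,
                data = hprice1)
summary(log_model)
# e.g., the coefficient of log(sqrft) is the expected percentage change in price
# associated with a 1% increase in sqrft, holding the other regressors fixed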

Conclusions

Bonus

Python implementation

import statsmodels.api as sm
import pandas as pd
from patsy import dmatrices

url = "https://raw.githubusercontent.com/ctruciosm/statblog/master/datasets/hprice1.csv"
hprice1 = pd.read_csv(url)

y, X = dmatrices('price ~ bdrms + lotsize + sqrft + colonial', 
                  data = hprice1, return_type = 'dataframe')
# Describe model
model = sm.OLS(y, X)
# Fit model
model_fit = model.fit()
# Summarize model
print(model_fit.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.676
Model:                            OLS   Adj. R-squared:                  0.660
Method:                 Least Squares   F-statistic:                     43.25
Date:                Sat, 27 Feb 2021   Prob (F-statistic):           1.45e-19
Time:                        18:16:43   Log-Likelihood:                -482.41
No. Observations:                  88   AIC:                             974.8
Df Residuals:                      83   BIC:                             987.2
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -24.1265     29.603     -0.815      0.417     -83.007      34.754
bdrms         11.0043      9.515      1.156      0.251      -7.921      29.930
lotsize        0.0021      0.001      3.230      0.002       0.001       0.003
sqrft          0.1242      0.013      9.314      0.000       0.098       0.151
colonial      13.7155     14.637      0.937      0.351     -15.397      42.828
==============================================================================
Omnibus:                       24.904   Durbin-Watson:                   2.117
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               45.677
Skew:                           1.091   Prob(JB):                     1.21e-10
Kurtosis:                       5.774   Cond. No.                     6.43e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.43e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

  1. The basic idea of OLS is to find the values \(\hat{\beta}\) that minimize the sum of squared prediction errors.↩︎

  2. A linear estimator is an estimator of the form \(\tilde{\beta} = A'Y\), where \(A\) is an \(n \times (k+1)\) matrix that is a function of \(X\).↩︎

  3. Usually, we prefer \(R^2_{Adjusted}\) rather than \(R^2\)↩︎

  4. \(2.076e-03*481 = 0.998556 \approx 1\)↩︎

  5. \(0.1242*8 = 0.9936 \approx 1\)↩︎

  6. Wooldridge, J. M. (2016). Introductory econometrics: A modern approach. Nelson Education.↩︎

Citation

For attribution, please cite this work as

Trucíos (2021, Feb. 25). Statistical Data Science: Basics of Linear Regression. Retrieved from https://ctruciosm.github.io/statblog/posts/2021-02-25-basics-of-linear-regression/

BibTeX citation

@misc{trucíos2021basics,
  author = {Trucíos, Carlos},
  title = {Statistical Data Science: Basics of Linear Regression},
  url = {https://ctruciosm.github.io/statblog/posts/2021-02-25-basics-of-linear-regression/},
  year = {2021}
}