class: center, middle, inverse, title-slide

# Statistical Learning:
## Linear Regression
### Prof. Carlos Trucíos <br> FACC/UFRJ <br><br> <a href="http://ctruciosm.github.io"> <i class="fa fa-desktop fa-fw"></i>  ctruciosm.github.io</a><br> <a href="mailto:carlos.trucios@facc.ufrj.br"><i class="fa fa-paper-plane fa-fw"></i>  carlos.trucios@facc.ufrj.br</a><br>
### CIA Study Group, <br> – Causal Inference and Analytics –
### 2021-06-18

---

<div>
<style type="text/css">.xaringan-extra-logo {
width: 110px;
height: 128px;
z-index: 0;
background-image: url(imagens/CIA_logo.png);
background-size: contain;
background-repeat: no-repeat;
position: absolute;
top:1em;right:1em;
}
</style>
<script>(function () {
  let tries = 0
  function addLogo () {
    if (typeof slideshow === 'undefined') {
      tries += 1
      if (tries < 10) {
        setTimeout(addLogo, 100)
      }
    } else {
      document.querySelectorAll('.remark-slide-content:not(.title-slide):not(.inverse):not(.hide_logo)')
        .forEach(function (slide) {
          const logo = document.createElement('div')
          logo.classList = 'xaringan-extra-logo'
          logo.href = null
          slide.appendChild(logo)
        })
    }
  }
  document.addEventListener('DOMContentLoaded', addLogo)
})()</script>
</div>

## Intuition

Suppose we have two variables (`$Y$` and `$X$`) and we want to understand or explain the behavior of `$Y$` as a function of `$X$`.

---

## Intuition

> Drawing a line that "follows" the relationship between X and Y seems like a good way to understand how the two variables are related.

---

## Intuition

> But which line should we use? In fact, infinitely many lines are possible!

.center[.blue[**We choose the line that minimizes the total (squared) vertical distance between the observed points and the line.**]]
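
---

## Intuition

A minimal sketch of this idea on simulated data. The variables `x` and `y` and the two candidate lines are illustrative assumptions, not part of the original example. Each line is scored by its sum of squared vertical distances to the points, and the least-squares line from `lm()` should score lowest.

```r
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)

# Sum of squared vertical distances between the points and the line a + b * x
ssr <- function(a, b) sum((y - (a + b * x))^2)

ssr_guess_1 <- ssr(0, 1)      # an arbitrary candidate line
ssr_guess_2 <- ssr(2, 0.4)    # another candidate line
ols <- coef(lm(y ~ x))        # least-squares intercept and slope
ssr_ols <- ssr(ols[[1]], ols[[2]])

# Compare the three scores side by side
c(guess_1 = ssr_guess_1, guess_2 = ssr_guess_2, ols = ssr_ols)
```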

---
class: inverse, center, middle
# Linear Regression
---

## Linear Regression

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[
**Linear regression** is one of the oldest and most widely used supervised learning methods. It relies, among other assumptions, on a linear relationship between the dependent variable `$y$` and a set of `$k$` explanatory variables `$x_1, x_2, \cdots, x_k$`. `$$y_i = \underbrace{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}}_{f(x_i)} + u_i$$`

The method's simplicity makes it accessible and easy to understand and implement. However, that same simplicity has led to the method being used inappropriately in many situations.
]

As usual, our goal is to estimate `$\hat{f}(x)$`, which in the linear model reduces to obtaining `$\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$`.

😕 **How do we obtain the `$\hat{\beta}$`s?** 😕

The [Ordinary Least Squares (OLS) method](https://ctruciosm.github.io/statblog/posts/2021-04-01-posts2021-04-01-minimos-quadrados-ordinarios/) estimates the `$\beta$`s so that they minimize the sum of squared residuals.
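
---

## Linear Regression

As a rough illustration, the OLS estimates can be computed by solving the normal equations `$(X^\top X)\hat{\beta} = X^\top y$` and compared with the output of `lm()`. The data below are simulated and the names `x1`, `x2`, `X`, and `beta_hat` are assumptions made for this sketch only.

```r
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

# Design matrix with a column of ones for the intercept
X <- cbind(1, x1, x2)

# Solve the normal equations (X'X) beta = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# lm() should return the same estimates (up to numerical error)
beta_lm <- coef(lm(y ~ x1 + x2))
cbind(normal_equations = drop(beta_hat), lm = beta_lm)
```

In practice `lm()` relies on a QR decomposition rather than on solving the normal equations directly, which is numerically more stable; the sketch is only meant to show what is being minimized.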

---
class: inverse, center, middle
# Linear Regression in R
---

## Linear Regression in R

The [Advertising](https://raw.githubusercontent.com/ctruciosm/ISLR/master/dataset/Advertising.csv) _dataset_ contains data on 200 stores and 4 variables.

```r
library(dplyr)
uri <- "https://raw.githubusercontent.com/ctruciosm/ISLR/master/dataset/Advertising.csv"
advertising <- read.csv(uri)
glimpse(advertising)
```

```
## Rows: 200
## Columns: 5
## $ X         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ TV        <dbl> 230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6, 199.…
## $ Radio     <dbl> 37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2.6, 5.…
## $ Newspaper <dbl> 69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 21.2, 2…
## $ Sales     <dbl> 22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6, 8.6…
```

We want a predictive model for `Sales` that uses the information in the variables `TV`, `Radio`, and `Newspaper`.

```r
# X is just a row index, so we drop it from the dataset
advertising <- advertising %>% select(-X)
```

---

## Linear Regression in R

We want a model of the form:

`$$\mathrm{Sales} = \beta_0 + \beta_1 \mathrm{TV} + \beta_2 \mathrm{Radio} + \beta_3 \mathrm{Newspaper} + u$$`
--

.panelset[

.panel[.panel-name[Splitting data]

```r
library(rsample)
*set.seed(1234)
data_split <- initial_split(advertising, prop = 3/4)
train_data <- training(data_split)
test_data <- testing(data_split)
```

]

.panel[.panel-name[The easy way]

```r
# ALWAYS train the model on train_data
modelo <- lm(Sales ~ TV + Radio + Newspaper, data = train_data)
yhat_test <- predict(modelo, newdata = test_data)
test_data$yhat <- yhat_test
```

]

.panel[.panel-name[Tidymodels]

```r
library(tidymodels)
model_spec <- linear_reg() %>% 
              set_engine("lm")
model_fit <- model_spec %>% 
             fit(Sales ~ TV + Radio + Newspaper, data = train_data)
yhat_test_tm <- predict(model_fit, new_data = test_data)
test_data$yhat_tm <- yhat_test_tm$.pred
```

]

.panel[.panel-name[Evaluating the model]

The [yardstick](https://yardstick.tidymodels.org) package, which is included in `tidymodels`, will help us evaluate the model.

```r
# We can use either yhat or yhat_tm; the results are the same.
test_data %>% mae(Sales, yhat)
```

```
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mae     standard        1.28
```

```r
test_data %>% rmse(Sales, yhat)
```

```
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        1.53
```

]]

---

## Linear Regression in R

#### .red[Hands-on:]

Make the necessary changes to the code to fit the following regressions:

1. Sales ~ TV
2. Sales ~ TV + Radio
3. Sales ~ TV + Radio + Newspaper
4. Sales ~ TV + TV `$^2$` + Radio + Newspaper

---

## Hands-on: Solution

```r
# Split
set.seed(1234)
data_split <- initial_split(advertising, prop = 3/4)
train_data <- training(data_split)
test_data <- testing(data_split)
# Fit
model_spec <- linear_reg() %>% set_engine("lm")
model_01_fit <- model_spec %>% fit(Sales ~ TV, data = train_data) 
model_02_fit <- model_spec %>% fit(Sales ~ TV + Radio, data = train_data) 
model_03_fit <- model_spec %>% fit(Sales ~ TV + Radio + Newspaper, data = train_data) 
model_04_fit <- model_spec %>% fit(Sales ~ TV + I(TV^2) + Radio +  Newspaper, data = train_data) 
# Predict
yhat_test_01 <- predict(model_01_fit, new_data = test_data) 
yhat_test_02 <- predict(model_02_fit, new_data = test_data) 
yhat_test_03 <- predict(model_03_fit, new_data = test_data) 
yhat_test_04 <- predict(model_04_fit, new_data = test_data)

test_data <- test_data %>% mutate(yhat_01 = yhat_test_01$.pred,
                                  yhat_02 = yhat_test_02$.pred,
                                  yhat_03 = yhat_test_03$.pred,
                                  yhat_04 = yhat_test_04$.pred)
```
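
---

## Hands-on: Solution

The fourth model wraps the squared term in `I()`. A small sketch of why, on simulated data (the variables `x` and `y` are illustrative assumptions): inside an R formula `^` is a formula operator, so `x^2` collapses to `x` unless it is protected with `I()`.

```r
set.seed(123)
x <- runif(100, 0, 10)
y <- 1 + 2 * x - 0.3 * x^2 + rnorm(100)

# Inside a formula, ^ means "interactions up to that order", so x^2 collapses to x
fit_no_I <- lm(y ~ x + x^2)
# I() protects the expression, so x^2 is computed arithmetically
fit_I <- lm(y ~ x + I(x^2))

names(coef(fit_no_I))  # "(Intercept)" "x"
names(coef(fit_I))     # "(Intercept)" "x" "I(x^2)"
```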

---

## Hands-on: Solution

```r
# Evaluate
test_data %>% mae(Sales,yhat_01)
test_data %>% mae(Sales,yhat_02)
test_data %>% mae(Sales,yhat_03)
test_data %>% mae(Sales,yhat_04)

test_data %>% rmse(Sales,yhat_01)
test_data %>% rmse(Sales,yhat_02)
test_data %>% rmse(Sales,yhat_03)
test_data %>% rmse(Sales,yhat_04)
```

---

## Data-Tips:

.left-column[ 
![](https://octodex.github.com/images/minertocat.png)
]

.right-column[

- The linear regression model (LRM) can be explored much further through statistical inference (not only prediction) - **ACA228**

- The LRM (like every other model) rests on a set of assumptions. When those assumptions do not hold, model performance can suffer considerably.

- Here we have used MAE and RMSE to evaluate predictions, but other measures can also be used.

.red[$$MAE =  \dfrac{\sum_{i=1}^n |y_i - \hat{y}_i| }{n}$$]

.red[$$RMSE = \sqrt{\dfrac{\sum_{i=1}^n (y_i - \hat{y}_i)^2 }{n}}$$]
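
A minimal sketch of the two formulas in base R, using made-up toy vectors `y` and `yhat` (illustrative values, not taken from the Advertising data):

```r
y    <- c(3, 5, 7, 9)    # toy observed values
yhat <- c(2, 5, 8, 11)   # toy predictions

mae_by_hand  <- mean(abs(y - yhat))       # (1 + 0 + 1 + 2) / 4 = 1
rmse_by_hand <- sqrt(mean((y - yhat)^2))  # sqrt(6 / 4), about 1.22
```

Because RMSE squares the errors before averaging, the single error of size 2 weighs more heavily in it than in MAE.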

]

---

## References:

- [James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. New York: Springer.](https://www.statlearning.com) Chapter 3

### .blue[Want to learn a bit more about the LRM?]

- Trucíos (2021, Feb. 25). Carlos Trucíos: Intro à Regressão Linear. Retrieved from [https://ctruciosm.github.io/posts/2021-02-25-intro-regressao-linear/](https://ctruciosm.github.io/posts/2021-02-25-intro-regressao-linear/)
- Trucíos (2021, April 1). Carlos Trucíos: Mínimos Quadrados Ordinários. Retrieved from [https://ctruciosm.github.io/posts/2021-04-01-minimos-quadrados-ordinarios/](https://ctruciosm.github.io/posts/2021-04-01-minimos-quadrados-ordinarios/)