class: center, middle, inverse, title-slide

# R Crash Course
## Summary Measures and Data Manipulation
### Prof. Carlos Trucíos
ctruciosm.github.io
carlos.trucios@facc.ufrj.br
### Faculdade de Administração e Ciências Contábeis, Universidade Federal de Rio de Janeiro

---
layout: true

<a class="footer-link" href="http://ctruciosm.github.io">ctruciosm.github.io - Carlos Trucíos (FACC/UFRJ)</a>
---
class: inverse, right, middle

# Summary measures

---
# Summary measures

```r
max(variavel)                      # maximum
min(variavel)                      # minimum
mean(variavel)                     # mean
median(variavel)                   # median
quantile(variavel, probs = k/100)  # k-th percentile
IQR(variavel)                      # interquartile range
var(variavel)                      # sample variance
sd(variavel)                       # sample standard deviation
cov(variavel_1, variavel_2)        # covariance
cor(variavel_1, variavel_2)        # Pearson correlation
summary(dataset_ou_variavel)       # a few summary statistics
table(variavel_categorica)         # absolute frequencies
prop.table(table(var_categorica))  # relative frequencies
```

--

```r
boxplot(variavel)                  # boxplot
hist(variavel)                     # histogram
barplot(table(variavel))           # bar chart
plot(variavel_1, variavel_2)       # scatter plot
```

---
# Summary measures

### Hands-on:

.pull-left[
![madagascar](https://media3.giphy.com/media/Ch31IjylFWM8M/giphy.gif)
]

.pull-right[
We will use the `penguins` _dataset_ from the `palmerpenguins` package to carry out an exploratory data analysis.
]

---
class: inverse, right, middle

# Data manipulation

---
## Data manipulation

.center[
<img src="imagens/dplyr.png" width="15%" />

[The dplyr package](https://dplyr.tidyverse.org/index.html)
]

#### Basic commands:

- `%>%`: pipe, passes the result on its left as the first argument of the function on its right.
- `mutate()`: creates new variables.
- `select()`: selects a set of variables.
- `filter()`: filters cases (rows).
- `arrange()`: sorts the data.
- `glimpse()`: shows a compact preview of every column (similar in purpose to `head()`).

---
## Data manipulation

### Importing the data

```r
uri <- "https://raw.githubusercontent.com/ctruciosm/ISLR/master/dataset/Advertising.csv"
dados_advertising <- read.csv(uri)
```

--

### Loading the package + glimpse()

```r
library(dplyr)
glimpse(dados_advertising)
```

```
## Rows: 200
## Columns: 5
## $ X         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ TV        <dbl> 230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6, 199.…
## $ Radio     <dbl> 37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2.6, 5.…
## $ Newspaper <dbl> 69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 21.2, 2…
## $ Sales     <dbl> 22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6, 8.6…
```

---
## Data manipulation

Suppose we are only interested in `TV` and `Sales`

```r
dados_advertising %>% 
* select(TV, Sales)
```

```
##        TV Sales
## 1   230.1  22.1
## 2    44.5  10.4
## 3    17.2   9.3
## 4   151.5  18.5
## 5   180.8  12.9
## 6     8.7   7.2
## 7    57.5  11.8
## 8   120.2  13.2
## 9     8.6   4.8
## 10  199.8  10.6
## 11   66.1   8.6
## 12  214.7  17.4
## 13   23.8   9.2
## 14   97.5   9.7
## 15  204.1  19.0
## 16  195.4  22.4
## 17   67.8  12.5
## 18  281.4  24.4
## 19   69.2  11.3
## 20  147.3  14.6
## 21  218.4  18.0
## 22  237.4  12.5
## 23   13.2   5.6
## 24  228.3  15.5
## 25   62.3   9.7
## 26  262.9  12.0
## 27  142.9  15.0
## 28  240.1  15.9
## 29  248.8  18.9
## 30   70.6  10.5
## 31  292.9  21.4
## 32  112.9  11.9
## 33   97.2   9.6
## 34  265.6  17.4
## 35   95.7   9.5
## 36  290.7  12.8
## 37  266.9  25.4
## 38   74.7  14.7
## 39   43.1  10.1
## 40  228.0  21.5
## 41  202.5  16.6
## 42  177.0  17.1
## 43  293.6  20.7
## 44  206.9  12.9
## 45   25.1   8.5
## 46  175.1  14.9
## 47   89.7  10.6
## 48  239.9  23.2
## 49  227.2  14.8
## 50   66.9   9.7
## 51  199.8  11.4
## 52  100.4  10.7
## 53  216.4  22.6
## 54  182.6  21.2
## 55  262.7  20.2
## 56  198.9  23.7
## 57    7.3   5.5
## 58  136.2  13.2
## 59  210.8  23.8
## 60  210.7  18.4
## 61   53.5   8.1
## 62  261.3  24.2
## 63  239.3  15.7
## 64  102.7  14.0
## 65  131.1  18.0
## 66   69.0   9.3
## 67   31.5   9.5
## 68  139.3  13.4
## 69  237.4  18.9
## 70  216.8  22.3
## 71  199.1  18.3
## 72  109.8  12.4
## 73   26.8   8.8
## 74  129.4  11.0
## 75  213.4  17.0
## 76   16.9   8.7
## 77   27.5   6.9
## 78  120.5  14.2
## 79    5.4   5.3
## 80  116.0  11.0
## 81   76.4  11.8
## 82  239.8  12.3
## 83   75.3  11.3
## 84   68.4  13.6
## 85  213.5  21.7
## 86  193.2  15.2
## 87   76.3  12.0
## 88  110.7  16.0
## 89   88.3  12.9
## 90  109.8  16.7
## 91  134.3  11.2
## 92   28.6   7.3
## 93  217.7  19.4
## 94  250.9  22.2
## 95  107.4  11.5
## 96  163.3  16.9
## 97  197.6  11.7
## 98  184.9  15.5
## 99  289.7  25.4
## 100 135.2  17.2
## 101 222.4  11.7
## 102 296.4  23.8
## 103 280.2  14.8
## 104 187.9  14.7
## 105 238.2  20.7
## 106 137.9  19.2
## 107  25.0   7.2
## 108  90.4   8.7
## 109  13.1   5.3
## 110 255.4  19.8
## 111 225.8  13.4
## 112 241.7  21.8
## 113 175.7  14.1
## 114 209.6  15.9
## 115  78.2  14.6
## 116  75.1  12.6
## 117 139.2  12.2
## 118  76.4   9.4
## 119 125.7  15.9
## 120  19.4   6.6
## 121 141.3  15.5
## 122  18.8   7.0
## 123 224.0  11.6
## 124 123.1  15.2
## 125 229.5  19.7
## 126  87.2  10.6
## 127   7.8   6.6
## 128  80.2   8.8
## 129 220.3  24.7
## 130  59.6   9.7
## 131   0.7   1.6
## 132 265.2  12.7
## 133   8.4   5.7
## 134 219.8  19.6
## 135  36.9  10.8
## 136  48.3  11.6
## 137  25.6   9.5
## 138 273.7  20.8
## 139  43.0   9.6
## 140 184.9  20.7
## 141  73.4  10.9
## 142 193.7  19.2
## 143 220.5  20.1
## 144 104.6  10.4
## 145  96.2  11.4
## 146 140.3  10.3
## 147 240.1  13.2
## 148 243.2  25.4
## 149  38.0  10.9
## 150  44.7  10.1
## 151 280.7  16.1
## 152 121.0  11.6
## 153 197.6  16.6
## 154 171.3  19.0
## 155 187.8  15.6
## 156   4.1   3.2
## 157  93.9  15.3
## 158 149.8  10.1
## 159  11.7   7.3
## 160 131.7  12.9
## 161 172.5  14.4
## 162  85.7  13.3
## 163 188.4  14.9
## 164 163.5  18.0
## 165 117.2  11.9
## 166 234.5  11.9
## 167  17.9   8.0
## 168 206.8  12.2
## 169 215.4  17.1
## 170 284.3  15.0
## 171  50.0   8.4
## 172 164.5  14.5
## 173  19.6   7.6
## 174 168.4  11.7
## 175 222.4  11.5
## 176 276.9  27.0
## 177 248.4  20.2
## 178 170.2  11.7
## 179 276.7  11.8
## 180 165.6  12.6
## 181 156.6  10.5
## 182 218.5  12.2
## 183  56.2   8.7
## 184 287.6  26.2
## 185 253.8  17.6
## 186 205.0  22.6
## 187 139.5  10.3
## 188 191.1  17.3
## 189 286.0  15.9
## 190  18.7   6.7
## 191  39.5  10.8
## 192  75.5   9.9
## 193  17.2   5.9
## 194 166.8  19.6
## 195 149.7  17.3
## 196  38.2   7.6
## 197  94.2   9.7
## 198 177.0  12.8
## 199 283.6  25.5
## 200 232.1  13.4
```

---
## Data manipulation

Suppose we are interested in `TV`, `Sales` and `Sales`$^2$

```r
dados_advertising %>% 
  select(TV, Sales) %>% 
* mutate(Sales2 = Sales^2)
```

```
##        TV Sales Sales2
## 1   230.1  22.1 488.41
## 2    44.5  10.4 108.16
## 3    17.2   9.3  86.49
## 4   151.5  18.5 342.25
## 5   180.8  12.9 166.41
## 6     8.7   7.2  51.84
## 7    57.5  11.8 139.24
## 8   120.2  13.2 174.24
## 9     8.6   4.8  23.04
## 10  199.8  10.6 112.36
## 11   66.1   8.6  73.96
## 12  214.7  17.4 302.76
## 13   23.8   9.2  84.64
## 14   97.5   9.7  94.09
## 15  204.1  19.0 361.00
## 16  195.4  22.4 501.76
## 17   67.8  12.5 156.25
## 18  281.4  24.4 595.36
## 19   69.2  11.3 127.69
## 20  147.3  14.6 213.16
## 21  218.4  18.0 324.00
## 22  237.4  12.5 156.25
## 23   13.2   5.6  31.36
## 24  228.3  15.5 240.25
## 25   62.3   9.7  94.09
## 26  262.9  12.0 144.00
## 27  142.9  15.0 225.00
## 28  240.1  15.9 252.81
## 29  248.8  18.9 357.21
## 30   70.6  10.5 110.25
## 31  292.9  21.4 457.96
## 32  112.9  11.9 141.61
## 33   97.2   9.6  92.16
## 34  265.6  17.4 302.76
## 35   95.7   9.5  90.25
## 36  290.7  12.8 163.84
## 37  266.9  25.4 645.16
## 38   74.7  14.7 216.09
## 39   43.1  10.1 102.01
## 40  228.0  21.5 462.25
## 41  202.5  16.6 275.56
## 42  177.0  17.1 292.41
## 43  293.6  20.7 428.49
## 44  206.9  12.9 166.41
## 45   25.1   8.5  72.25
## 46  175.1  14.9 222.01
## 47   89.7  10.6 112.36
## 48  239.9  23.2 538.24
## 49  227.2  14.8 219.04
## 50   66.9   9.7  94.09
## 51  199.8  11.4 129.96
## 52  100.4  10.7 114.49
## 53  216.4  22.6 510.76
## 54  182.6  21.2 449.44
## 55  262.7  20.2 408.04
## 56  198.9  23.7 561.69
## 57    7.3   5.5  30.25
## 58  136.2  13.2 174.24
## 59  210.8  23.8 566.44
## 60  210.7  18.4 338.56
## 61   53.5   8.1  65.61
## 62  261.3  24.2 585.64
## 63  239.3  15.7 246.49
## 64  102.7  14.0 196.00
## 65  131.1  18.0 324.00
## 66   69.0   9.3  86.49
## 67   31.5   9.5  90.25
## 68  139.3  13.4 179.56
## 69  237.4  18.9 357.21
## 70  216.8  22.3 497.29
## 71  199.1  18.3 334.89
## 72  109.8  12.4 153.76
## 73   26.8   8.8  77.44
## 74  129.4  11.0 121.00
## 75  213.4  17.0 289.00
## 76   16.9   8.7  75.69
## 77   27.5   6.9  47.61
## 78  120.5  14.2 201.64
## 79    5.4   5.3  28.09
## 80  116.0  11.0 121.00
## 81   76.4  11.8 139.24
## 82  239.8  12.3 151.29
## 83   75.3  11.3 127.69
## 84   68.4  13.6 184.96
## 85  213.5  21.7 470.89
## 86  193.2  15.2 231.04
## 87   76.3  12.0 144.00
## 88  110.7  16.0 256.00
## 89   88.3  12.9 166.41
## 90  109.8  16.7 278.89
## 91  134.3  11.2 125.44
## 92   28.6   7.3  53.29
## 93  217.7  19.4 376.36
## 94  250.9  22.2 492.84
## 95  107.4  11.5 132.25
## 96  163.3  16.9 285.61
## 97  197.6  11.7 136.89
## 98  184.9  15.5 240.25
## 99  289.7  25.4 645.16
## 100 135.2  17.2 295.84
## 101 222.4  11.7 136.89
## 102 296.4  23.8 566.44
## 103 280.2  14.8 219.04
## 104 187.9  14.7 216.09
## 105 238.2  20.7 428.49
## 106 137.9  19.2 368.64
## 107  25.0   7.2  51.84
## 108  90.4   8.7  75.69
## 109  13.1   5.3  28.09
## 110 255.4  19.8 392.04
## 111 225.8  13.4 179.56
## 112 241.7  21.8 475.24
## 113 175.7  14.1 198.81
## 114 209.6  15.9 252.81
## 115  78.2  14.6 213.16
## 116  75.1  12.6 158.76
## 117 139.2  12.2 148.84
## 118  76.4   9.4  88.36
## 119 125.7  15.9 252.81
## 120  19.4   6.6  43.56
## 121 141.3  15.5 240.25
## 122  18.8   7.0  49.00
## 123 224.0  11.6 134.56
## 124 123.1  15.2 231.04
## 125 229.5  19.7 388.09
## 126  87.2  10.6 112.36
## 127   7.8   6.6  43.56
## 128  80.2   8.8  77.44
## 129 220.3  24.7 610.09
## 130  59.6   9.7  94.09
## 131   0.7   1.6   2.56
## 132 265.2  12.7 161.29
## 133   8.4   5.7  32.49
## 134 219.8  19.6 384.16
## 135  36.9  10.8 116.64
## 136  48.3  11.6 134.56
## 137  25.6   9.5  90.25
## 138 273.7  20.8 432.64
## 139  43.0   9.6  92.16
## 140 184.9  20.7 428.49
## 141  73.4  10.9 118.81
## 142 193.7  19.2 368.64
## 143 220.5  20.1 404.01
## 144 104.6  10.4 108.16
## 145  96.2  11.4 129.96
## 146 140.3  10.3 106.09
## 147 240.1  13.2 174.24
## 148 243.2  25.4 645.16
## 149  38.0  10.9 118.81
## 150  44.7  10.1 102.01
## 151 280.7  16.1 259.21
## 152 121.0  11.6 134.56
## 153 197.6  16.6 275.56
## 154 171.3  19.0 361.00
## 155 187.8  15.6 243.36
## 156   4.1   3.2  10.24
## 157  93.9  15.3 234.09
## 158 149.8  10.1 102.01
## 159  11.7   7.3  53.29
## 160 131.7  12.9 166.41
## 161 172.5  14.4 207.36
## 162  85.7  13.3 176.89
## 163 188.4  14.9 222.01
## 164 163.5  18.0 324.00
## 165 117.2  11.9 141.61
## 166 234.5  11.9 141.61
## 167  17.9   8.0  64.00
## 168 206.8  12.2 148.84
## 169 215.4  17.1 292.41
## 170 284.3  15.0 225.00
## 171  50.0   8.4  70.56
## 172 164.5  14.5 210.25
## 173  19.6   7.6  57.76
## 174 168.4  11.7 136.89
## 175 222.4  11.5 132.25
## 176 276.9  27.0 729.00
## 177 248.4  20.2 408.04
## 178 170.2  11.7 136.89
## 179 276.7  11.8 139.24
## 180 165.6  12.6 158.76
## 181 156.6  10.5 110.25
## 182 218.5  12.2 148.84
## 183  56.2   8.7  75.69
## 184 287.6  26.2 686.44
## 185 253.8  17.6 309.76
## 186 205.0  22.6 510.76
## 187 139.5  10.3 106.09
## 188 191.1  17.3 299.29
## 189 286.0  15.9 252.81
## 190  18.7   6.7  44.89
## 191  39.5  10.8 116.64
## 192  75.5   9.9  98.01
## 193  17.2   5.9  34.81
## 194 166.8  19.6 384.16
## 195 149.7  17.3 299.29
## 196  38.2   7.6  57.76
## 197  94.2   9.7  94.09
## 198 177.0  12.8 163.84
## 199 283.6  25.5 650.25
## 200 232.1  13.4 179.56
```

---
## Data manipulation

What if we want `TV`, `Sales` and `Sales2` .red[only for the rows where sales were `\(>15\)`?]
```r
dados_advertising %>% 
  select(TV, Sales) %>% 
  mutate(Sales2 = Sales^2) %>%
* filter(Sales > 15)
```

```
##       TV Sales Sales2
## 1  230.1  22.1 488.41
## 2  151.5  18.5 342.25
## 3  214.7  17.4 302.76
## 4  204.1  19.0 361.00
## 5  195.4  22.4 501.76
## 6  281.4  24.4 595.36
## 7  218.4  18.0 324.00
## 8  228.3  15.5 240.25
## 9  240.1  15.9 252.81
## 10 248.8  18.9 357.21
## 11 292.9  21.4 457.96
## 12 265.6  17.4 302.76
## 13 266.9  25.4 645.16
## 14 228.0  21.5 462.25
## 15 202.5  16.6 275.56
## 16 177.0  17.1 292.41
## 17 293.6  20.7 428.49
## 18 239.9  23.2 538.24
## 19 216.4  22.6 510.76
## 20 182.6  21.2 449.44
## 21 262.7  20.2 408.04
## 22 198.9  23.7 561.69
## 23 210.8  23.8 566.44
## 24 210.7  18.4 338.56
## 25 261.3  24.2 585.64
## 26 239.3  15.7 246.49
## 27 131.1  18.0 324.00
## 28 237.4  18.9 357.21
## 29 216.8  22.3 497.29
## 30 199.1  18.3 334.89
## 31 213.4  17.0 289.00
## 32 213.5  21.7 470.89
## 33 193.2  15.2 231.04
## 34 110.7  16.0 256.00
## 35 109.8  16.7 278.89
## 36 217.7  19.4 376.36
## 37 250.9  22.2 492.84
## 38 163.3  16.9 285.61
## 39 184.9  15.5 240.25
## 40 289.7  25.4 645.16
## 41 135.2  17.2 295.84
## 42 296.4  23.8 566.44
## 43 238.2  20.7 428.49
## 44 137.9  19.2 368.64
## 45 255.4  19.8 392.04
## 46 241.7  21.8 475.24
## 47 209.6  15.9 252.81
## 48 125.7  15.9 252.81
## 49 141.3  15.5 240.25
## 50 123.1  15.2 231.04
## 51 229.5  19.7 388.09
## 52 220.3  24.7 610.09
## 53 219.8  19.6 384.16
## 54 273.7  20.8 432.64
## 55 184.9  20.7 428.49
## 56 193.7  19.2 368.64
## 57 220.5  20.1 404.01
## 58 243.2  25.4 645.16
## 59 280.7  16.1 259.21
## 60 197.6  16.6 275.56
## 61 171.3  19.0 361.00
## 62 187.8  15.6 243.36
## 63  93.9  15.3 234.09
## 64 163.5  18.0 324.00
## 65 215.4  17.1 292.41
## 66 276.9  27.0 729.00
## 67 248.4  20.2 408.04
## 68 287.6  26.2 686.44
## 69 253.8  17.6 309.76
## 70 205.0  22.6 510.76
## 71 191.1  17.3 299.29
## 72 286.0  15.9 252.81
## 73 166.8  19.6 384.16
## 74 149.7  17.3 299.29
## 75 283.6  25.5 650.25
```

---
## Data manipulation

What if we want the rows sorted by `Sales` (from lowest to highest)?

```r
dados_advertising %>% 
  select(TV, Sales) %>% 
  mutate(Sales2 = Sales^2) %>%
  filter(Sales > 15) %>% 
* arrange(Sales)
```

```
##       TV Sales Sales2
## 1  193.2  15.2 231.04
## 2  123.1  15.2 231.04
## 3   93.9  15.3 234.09
## 4  228.3  15.5 240.25
## 5  184.9  15.5 240.25
## 6  141.3  15.5 240.25
## 7  187.8  15.6 243.36
## 8  239.3  15.7 246.49
## 9  240.1  15.9 252.81
## 10 209.6  15.9 252.81
## 11 125.7  15.9 252.81
## 12 286.0  15.9 252.81
## 13 110.7  16.0 256.00
## 14 280.7  16.1 259.21
## 15 202.5  16.6 275.56
## 16 197.6  16.6 275.56
## 17 109.8  16.7 278.89
## 18 163.3  16.9 285.61
## 19 213.4  17.0 289.00
## 20 177.0  17.1 292.41
## 21 215.4  17.1 292.41
## 22 135.2  17.2 295.84
## 23 191.1  17.3 299.29
## 24 149.7  17.3 299.29
## 25 214.7  17.4 302.76
## 26 265.6  17.4 302.76
## 27 253.8  17.6 309.76
## 28 218.4  18.0 324.00
## 29 131.1  18.0 324.00
## 30 163.5  18.0 324.00
## 31 199.1  18.3 334.89
## 32 210.7  18.4 338.56
## 33 151.5  18.5 342.25
## 34 248.8  18.9 357.21
## 35 237.4  18.9 357.21
## 36 204.1  19.0 361.00
## 37 171.3  19.0 361.00
## 38 137.9  19.2 368.64
## 39 193.7  19.2 368.64
## 40 217.7  19.4 376.36
## 41 219.8  19.6 384.16
## 42 166.8  19.6 384.16
## 43 229.5  19.7 388.09
## 44 255.4  19.8 392.04
## 45 220.5  20.1 404.01
## 46 262.7  20.2 408.04
## 47 248.4  20.2 408.04
## 48 293.6  20.7 428.49
## 49 238.2  20.7 428.49
## 50 184.9  20.7 428.49
## 51 273.7  20.8 432.64
## 52 182.6  21.2 449.44
## 53 292.9  21.4 457.96
## 54 228.0  21.5 462.25
## 55 213.5  21.7 470.89
## 56 241.7  21.8 475.24
## 57 230.1  22.1 488.41
## 58 250.9  22.2 492.84
## 59 216.8  22.3 497.29
## 60 195.4  22.4 501.76
## 61 216.4  22.6 510.76
## 62 205.0  22.6 510.76
## 63 239.9  23.2 538.24
## 64 198.9  23.7 561.69
## 65 210.8  23.8 566.44
## 66 296.4  23.8 566.44
## 67 261.3  24.2 585.64
## 68 281.4  24.4 595.36
## 69 220.3  24.7 610.09
## 70 266.9  25.4 645.16
## 71 289.7  25.4 645.16
## 72 243.2  25.4 645.16
## 73 283.6  25.5 650.25
## 74 287.6  26.2 686.44
## 75 276.9  27.0 729.00
```

---
## Data manipulation

What if we want them sorted from highest to lowest?

```r
dados_advertising %>% 
  select(TV, Sales) %>% 
  mutate(Sales2 = Sales^2) %>%
  filter(Sales > 15) %>% 
* arrange(desc(Sales))
```

```
##       TV Sales Sales2
## 1  276.9  27.0 729.00
## 2  287.6  26.2 686.44
## 3  283.6  25.5 650.25
## 4  266.9  25.4 645.16
## 5  289.7  25.4 645.16
## 6  243.2  25.4 645.16
## 7  220.3  24.7 610.09
## 8  281.4  24.4 595.36
## 9  261.3  24.2 585.64
## 10 210.8  23.8 566.44
## 11 296.4  23.8 566.44
## 12 198.9  23.7 561.69
## 13 239.9  23.2 538.24
## 14 216.4  22.6 510.76
## 15 205.0  22.6 510.76
## 16 195.4  22.4 501.76
## 17 216.8  22.3 497.29
## 18 250.9  22.2 492.84
## 19 230.1  22.1 488.41
## 20 241.7  21.8 475.24
## 21 213.5  21.7 470.89
## 22 228.0  21.5 462.25
## 23 292.9  21.4 457.96
## 24 182.6  21.2 449.44
## 25 273.7  20.8 432.64
## 26 293.6  20.7 428.49
## 27 238.2  20.7 428.49
## 28 184.9  20.7 428.49
## 29 262.7  20.2 408.04
## 30 248.4  20.2 408.04
## 31 220.5  20.1 404.01
## 32 255.4  19.8 392.04
## 33 229.5  19.7 388.09
## 34 219.8  19.6 384.16
## 35 166.8  19.6 384.16
## 36 217.7  19.4 376.36
## 37 137.9  19.2 368.64
## 38 193.7  19.2 368.64
## 39 204.1  19.0 361.00
## 40 171.3  19.0 361.00
## 41 248.8  18.9 357.21
## 42 237.4  18.9 357.21
## 43 151.5  18.5 342.25
## 44 210.7  18.4 338.56
## 45 199.1  18.3 334.89
## 46 218.4  18.0 324.00
## 47 131.1  18.0 324.00
## 48 163.5  18.0 324.00
## 49 253.8  17.6 309.76
## 50 214.7  17.4 302.76
## 51 265.6  17.4 302.76
## 52 191.1  17.3 299.29
## 53 149.7  17.3 299.29
## 54 135.2  17.2 295.84
## 55 177.0  17.1 292.41
## 56 215.4  17.1 292.41
## 57 213.4  17.0 289.00
## 58 163.3  16.9 285.61
## 59 109.8  16.7 278.89
## 60 202.5  16.6 275.56
## 61 197.6  16.6 275.56
## 62 280.7  16.1 259.21
## 63 110.7  16.0 256.00
## 64 240.1  15.9 252.81
## 65 209.6  15.9 252.81
## 66 125.7  15.9 252.81
## 67 286.0  15.9 252.81
## 68 239.3  15.7 246.49
## 69 187.8  15.6 243.36
## 70 228.3  15.5 240.25
## 71 184.9  15.5 240.25
## 72 141.3  15.5 240.25
## 73  93.9  15.3 234.09
## 74 193.2  15.2 231.04
## 75 123.1  15.2 231.04
```

---
## Data manipulation

- The `%>%` helps make our code easier to read.

--

- If we break the code across several lines, the `%>%` must **always** go at the end of the line.

--

#### We can also compute some statistics

```r
dados_advertising %>% 
  select(TV, Sales) %>% 
  mutate(Sales2 = Sales^2) %>%
  filter(Sales > 15) %>% 
  arrange(desc(Sales)) %>% 
* summarise(media_TV = mean(TV), media_Sales = mean(Sales))
```

```
##   media_TV media_Sales
## 1 213.9013    19.61067
```

---
## Data manipulation

#### Hands-on

Using the dataset `dados_advertising`, keep only the cases where `Sales <= median(Sales)`, select only the variables `TV`, `Radio` and `Newspaper`, and compute the mean and standard deviation of each of them.

--

#### Answer key

```r
dados_advertising %>% 
  filter(Sales <= median(Sales)) %>%
  select(TV, Radio, Newspaper) %>% 
  summarise(media_tv = mean(TV), 
            sd_tv = sd(TV),
            media_radio = mean(Radio),
            sd_radio = sd(Radio),
            media_paper = mean(Newspaper),
            sd_paper = sd(Newspaper))
```

```
##   media_tv    sd_tv media_radio sd_radio media_paper sd_paper
## 1 94.07843 75.12462    15.53627 13.58036    27.88529 20.76878
```

--

.center[**Mind the order of the commands!**]

---
## Data manipulation

#### Answer key

```r
# This code throws an error! Why?
dados_advertising %>% 
  select(TV, Radio, Newspaper) %>% 
  filter(Sales <= median(Sales)) %>%
  summarise(media_tv = mean(TV), 
            sd_tv = sd(TV),
            media_radio = mean(Radio),
            sd_radio = sd(Radio),
            media_paper = mean(Newspaper),
            sd_paper = sd(Newspaper))
```

.center[**Why?**]

---
## Data manipulation

#### More commands:

- summarise()
- top_n()
- group_by()
- contains()
- rename()

---
## Data manipulation

We can compute statistics by group.
```r
dados_advertising %>% 
* select(TV, Sales, Radio, Newspaper) %>% 
* group_by(Sales > median(Sales)) %>%
  summarise(media_TV = mean(TV),
*           media_Radio = mean(Radio),
*           mean_Newspaper = mean(Newspaper))
```

```
## # A tibble: 2 × 4
##   `Sales > median(Sales)` media_TV media_Radio mean_Newspaper
##   <lgl>                      <dbl>       <dbl>          <dbl>
## 1 FALSE                       94.1        15.5           27.9
## 2 TRUE                       202.         31.3           33.3
```

---
## Data manipulation

What if we want the 5 stores with the highest sales?

```r
dados_advertising %>% 
  select(TV, Sales, Radio, Newspaper) %>% 
* top_n(5, Sales)
```

```
##      TV Sales Radio Newspaper
## 1 266.9  25.4  43.8       5.0
## 2 289.7  25.4  42.3      51.2
## 3 243.2  25.4  49.0      44.3
## 4 276.9  27.0  48.9      41.8
## 5 287.6  26.2  43.0      71.8
## 6 283.6  25.5  42.0      66.2
```

The output has 6 rows, not 5: `top_n()` keeps ties, and three rows share `Sales = 25.4`. In dplyr 1.0.0 and later, `top_n()` is superseded by `slice_max()` and `slice_min()`.

---
## Data manipulation

What if we want the 5 stores with the lowest sales?

```r
dados_advertising %>% 
  select(TV, Sales, Radio, Newspaper) %>% 
* top_n(-5, Sales)
```

```
##     TV Sales Radio Newspaper
## 1  8.6   4.8   2.1       1.0
## 2  5.4   5.3  29.9       9.4
## 3 13.1   5.3   0.4      25.6
## 4  0.7   1.6  39.6       8.7
## 5  4.1   3.2  11.6       5.7
```

---
## Data manipulation

We can select variables by a particular characteristic of their names, here every column whose name contains the letter "a":

```r
dados_advertising %>% 
  select(contains("a")) %>% 
  head()
```

```
##   Radio Newspaper Sales
## 1  37.8      69.2  22.1
## 2  39.3      45.1  10.4
## 3  45.9      69.3   9.3
## 4  41.3      58.5  18.5
## 5  10.8      58.4  12.9
## 6  48.9      75.0   7.2
```

---
## Data manipulation

We can rename variables

```r
dados_advertising %>% 
* rename(radio_gastos = Radio,
*        newspaper_gastos = Newspaper,
*        tv_gastos = TV) %>% 
  head()
```

```
##   X tv_gastos radio_gastos newspaper_gastos Sales
## 1 1     230.1         37.8             69.2  22.1
## 2 2      44.5         39.3             45.1  10.4
## 3 3      17.2         45.9             69.3   9.3
## 4 4     151.5         41.3             58.5  18.5
## 5 5     180.8         10.8             58.4  12.9
## 6 6       8.7         48.9             75.0   7.2
```

---
## Data manipulation

#### Hands-on

Using the _dataset_ `dados_advertising`:

1. Drop the column `\(X\)` (Hint: `select(-X)`)
2. Compute the mean (of TV, Radio and Newspaper) and the number of elements per group (`Sales > mean(Sales)`).
Hint: to count the number of elements, use `n()`.

--

#### Answer key

```r
dados_advertising %>% 
  select(-X) %>%
  group_by(Sales > mean(Sales)) %>%
  summarise(m_tv = mean(TV), 
            m_radio = mean(Radio),
            m_npaper = mean(Newspaper),
            Total = n())
```
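
---
## Data manipulation

The hands-on solution above groups by the raw expression `Sales > mean(Sales)`, so the grouping column inherits that expression as its name. As a minimal sketch, assuming a small made-up data frame (`toy`, which is not the Advertising data) and the hypothetical names `acima_media` and `resumo`, the same `group_by()` + `summarise()` + `n()` pattern can be checked by hand when the grouping column is given an explicit name:

```r
library(dplyr)

# Hypothetical toy data (NOT the Advertising dataset), small enough to verify by hand
toy <- data.frame(
  X     = 1:6,
  TV    = c(10, 20, 30, 40, 50, 60),
  Sales = c(1, 2, 3, 7, 8, 9)
)

resumo <- toy %>%
  select(-X) %>%                                   # drop the index column
  group_by(acima_media = Sales > mean(Sales)) %>%  # named logical grouping column
  summarise(m_tv  = mean(TV),                      # mean of TV within each group
            Total = n())                           # number of rows in each group

resumo
```

In this toy example `mean(Sales)` is 5, so three rows fall in each group. Naming the grouping column inside `group_by()` is optional, but it gives the result a column that is easier to refer to in later steps than `` `Sales > mean(Sales)` ``.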