8  Tidy Data Analysis

In this section we introduce the tidy-way of data analysis. This is different from the base-R way of subsetting and manipulating data, which haven’t covered thoroughly anyway. Because much of the data analysis will be performed following the tidyver-approach, we briefly introduced the base-R way of working with data frames.

8.1 What’s in tidyverse

tidyverse is a collection of R packages for data science. You can install tidyverse with

install.packages("tidyverse")

or use the Rstudio visual interface.

Once installed, you can load it like any other R packages

library(tidyverse)

The list of packages in this collection along with that they do is summarized below

List of Packages in tidyverse
Package What it does
dplyr the flagship package which provides grammar of data manipulation with a consistet set of verbs
ggplot2 for creating beautiful graphics based on Grammar of Graphics
tidyr for tyding your data set
readr reading rectangular data
forcats for working with factor data types
stringr makes working with strings/characters easier
tibble a modern data-frame evolved from the base data.frame
purr enhances functional programming in R

8.2 Data transformation

dplyr verbs

dplyr verbs for data wrangling
Verb What it does
filter() subsetting rows/observations in a data frame
select() selects subset of columns by their names
mutate() creates new columns/variables off of existing columns/variables
arrange() reorders rows by column/variable
summarize() creates summary statistics
group_by() splits data by by group for processing/summarizing/manipulating

8.3 Data

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
mpg %>% glimpse()
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

Now, if you haven’t loaded tidyverse library on the console, or interactively on the Rmd file, then ? mpg will show an error > ?mpg No documentation for ‘mpg’ in specified packages and libraries: you could try ‘??mpg’

The error message suggests you to try ?? mpg instead. Note, the double question mark. This will search the entire R ecosystem currently available in your system including the libraries that are installed but not loaded.

When running ?? mpg on the console, we get this

Search mpg in all installed libraries

Now we see from the image above that the mpg is in the ggplot2 library. Please note how do I know that it is in the ggplot2 library. Becasue it is written as ggplot2::mpg this indicates that the mpg object is in the ggplot2 library.

You can call the mpg object if you either loaded tidyverse pckage or the ggplot2 packahge using either library(tidyverse) or library(ggolot2) on the notebook or on the console.

If you do not want to load the library, yet want to use the mpg data set, you have to call it by ggplot2::mpg. Likewise, if you run ?ggplot2::mpg, the data profile will be shown.

?ggplot2::mpg

When you execute the above, you will see under the Help tab, the details about the mpg data set. This is the same data set as the mtcars but with some enhancements.

8.4 Basic Usage of dplyr verbs

First, lets start with the mpg data set available in ggplot2 package.

mpg %>% glimpse()
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

There are 234 rows and 11 columns or variables in the data set. This also confirms with the output produced when running glimpse().

8.5 Selecting columns using select()

As the name suggests, we select or subset columns using select(). To select model, year and drive train from the mpg data set, use select(model, year, drv). No need to wrap the column names in quotations.

mpg %>% 
  select(model, year, drv) %>% 
  glimpse()
Rows: 234
Columns: 3
$ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "a4 quat…
$ year  <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 2008, 2008…
$ drv   <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4", "4",…

Let me show you how you would do this using base R tools.

8.6 Filter rows with filter()

8.6.1 Filtering based on one condition

Let us filter all the rows of the data where the vehicle is a front-wheel drive. The variable is drv for drivetrain. First, let us see a simple table showing the values in the data and their frequencies

table(mpg$drv)

  4   f   r 
103 106  25 

Alternatively, we can alternatively run

mpg %>% 
  select(drv) %>% 
  table()
drv
  4   f   r 
103 106  25 

There are three types of vehicles, four-wheel drive (represented by drv = 4), front-wheel drive (drv = f) and rear-wheel drive (drv = r).

To select only the front-wheel drive vehicles, we use filter() function with the condition that the drv == 'f'. Notice that the variable name drv doesn’t need to be wrapped with quotation, but the actual value of the variable must be put within quotation. Either single or double quotation would work.

mpg %>% 
  filter(drv == 'f')
# A tibble: 106 × 11
   manufacturer model  displ  year   cyl trans     drv     cty   hwy fl    class
   <chr>        <chr>  <dbl> <int> <int> <chr>     <chr> <int> <int> <chr> <chr>
 1 audi         a4       1.8  1999     4 auto(l5)  f        18    29 p     comp…
 2 audi         a4       1.8  1999     4 manual(m… f        21    29 p     comp…
 3 audi         a4       2    2008     4 manual(m… f        20    31 p     comp…
 4 audi         a4       2    2008     4 auto(av)  f        21    30 p     comp…
 5 audi         a4       2.8  1999     6 auto(l5)  f        16    26 p     comp…
 6 audi         a4       2.8  1999     6 manual(m… f        18    26 p     comp…
 7 audi         a4       3.1  2008     6 auto(av)  f        18    27 p     comp…
 8 chevrolet    malibu   2.4  1999     4 auto(l4)  f        19    27 r     mids…
 9 chevrolet    malibu   2.4  2008     4 auto(l4)  f        22    30 r     mids…
10 chevrolet    malibu   3.1  1999     6 auto(l4)  f        18    26 r     mids…
# … with 96 more rows

8.6.2 Filtering based on multiple conditions

Suppose now that we want to filter vehicles that are front-wheel drive and has a 5 cylinders

table(mpg$cyl)

 4  5  6  8 
81  4 79 70 
mpg %>% 
  filter(drv == 'f', cyl == 5)
# A tibble: 4 × 11
  manufacturer model      displ  year   cyl trans  drv     cty   hwy fl    class
  <chr>        <chr>      <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
1 volkswagen   jetta        2.5  2008     5 auto(… f        21    29 r     comp…
2 volkswagen   jetta        2.5  2008     5 manua… f        21    29 r     comp…
3 volkswagen   new beetle   2.5  2008     5 manua… f        20    28 r     subc…
4 volkswagen   new beetle   2.5  2008     5 auto(… f        20    29 r     subc…

How about the vehicles that are either front-wheel drive or with 5 cylinders?

mpg %>% 
  filter(drv == 'f' | cyl == 5)
# A tibble: 106 × 11
   manufacturer model  displ  year   cyl trans     drv     cty   hwy fl    class
   <chr>        <chr>  <dbl> <int> <int> <chr>     <chr> <int> <int> <chr> <chr>
 1 audi         a4       1.8  1999     4 auto(l5)  f        18    29 p     comp…
 2 audi         a4       1.8  1999     4 manual(m… f        21    29 p     comp…
 3 audi         a4       2    2008     4 manual(m… f        20    31 p     comp…
 4 audi         a4       2    2008     4 auto(av)  f        21    30 p     comp…
 5 audi         a4       2.8  1999     6 auto(l5)  f        16    26 p     comp…
 6 audi         a4       2.8  1999     6 manual(m… f        18    26 p     comp…
 7 audi         a4       3.1  2008     6 auto(av)  f        18    27 p     comp…
 8 chevrolet    malibu   2.4  1999     4 auto(l4)  f        19    27 r     mids…
 9 chevrolet    malibu   2.4  2008     4 auto(l4)  f        22    30 r     mids…
10 chevrolet    malibu   3.1  1999     6 auto(l4)  f        18    26 r     mids…
# … with 96 more rows

How about either front wheel drive and city fuel efficiency of 20 or 24 miles per gallon?

First try to filter vehicles whose city mileage is 20 or 24 mpg. Then we will combine this wit the condition that the vehicle is a front-wheel drive.

mpg %>% 
  filter(cty == c(24, 20)) %>% 
  select(cty) %>% 
  table()
cty
20 24 
 6  3 

To combine these two conditions, note that they are in an OR relationship, not an AND relationship. Either is the keyword here. In mathematics and statistics, it means either or. Which also means if one condition is satisifed, the other, or both are satisfied.

Lets implement it.

mpg %>% 
  filter(drv == 'f' | cty == c(24, 20))  # OR operator '|' is used
# A tibble: 110 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
 9 chevrolet    malibu       2.4  1999     4 auto… f        19    27 r     mids…
10 chevrolet    malibu       2.4  2008     4 auto… f        22    30 r     mids…
# … with 100 more rows

To verify that we’ve got the rows we wanted, run a frequency table

mpg %>% 
  filter(drv == 'f' | cty == c(24, 20)) %>% 
  select(drv) %>% 
  table() # got some 4-wheel and mostly front-wheel drive vehicles
drv
  4   f 
  4 106 
mpg %>% 
  filter(drv == 'f' | cty == c(24, 20)) %>% 
  select(cty) %>% 
  table() # got 20 mpg, 24 mpg as well as other!! 
cty
11 15 16 17 18 19 20 21 22 23 24 25 26 28 29 33 35 
 1  2 10 10 20 14  9 22  4  3  5  2  3  2  1  1  1 

Notice that, we got vehicles that are front-wheel drive and some that are all-wheel drive. Likewise, we got vehicles whose city mpg is not only 20 mpg or 24 mpg but also other. Why is that? Can you explain?

8.7 Other useful functions

8.7.1 Which function

mpg %>% 
  slice(
    which(cyl == 6)  
  )
# A tibble: 79 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 2 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 3 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 4 audi         a4 quattro   2.8  1999     6 auto… 4        15    25 p     comp…
 5 audi         a4 quattro   2.8  1999     6 manu… 4        17    25 p     comp…
 6 audi         a4 quattro   3.1  2008     6 auto… 4        17    25 p     comp…
 7 audi         a4 quattro   3.1  2008     6 manu… 4        15    25 p     comp…
 8 audi         a6 quattro   2.8  1999     6 auto… 4        15    24 p     mids…
 9 audi         a6 quattro   3.1  2008     6 auto… 4        17    25 p     mids…
10 chevrolet    malibu       3.1  1999     6 auto… f        18    26 r     mids…
# … with 69 more rows
which(mpg$cyl == 5)
[1] 218 219 226 227
mpg %>% slice_head() # selects the first row
# A tibble: 1 × 11
  manufacturer model displ  year   cyl trans    drv     cty   hwy fl    class  
  <chr>        <chr> <dbl> <int> <int> <chr>    <chr> <int> <int> <chr> <chr>  
1 audi         a4      1.8  1999     4 auto(l5) f        18    29 p     compact
mpg %>% slice_tail() # selects the last row
# A tibble: 1 × 11
  manufacturer model  displ  year   cyl trans    drv     cty   hwy fl    class  
  <chr>        <chr>  <dbl> <int> <int> <chr>    <chr> <int> <int> <chr> <chr>  
1 volkswagen   passat   3.6  2008     6 auto(s6) f        17    26 p     midsize
mpg %>% slice_min(order_by = cty)
# A tibble: 5 × 11
  manufacturer model       displ  year   cyl trans drv     cty   hwy fl    class
  <chr>        <chr>       <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 dodge        dakota pic…   4.7  2008     8 auto… 4         9    12 e     pick…
2 dodge        durango 4wd   4.7  2008     8 auto… 4         9    12 e     suv  
3 dodge        ram 1500 p…   4.7  2008     8 auto… 4         9    12 e     pick…
4 dodge        ram 1500 p…   4.7  2008     8 manu… 4         9    12 e     pick…
5 jeep         grand cher…   4.7  2008     8 auto… 4         9    12 e     suv  
mpg %>% slice_max(order_by = cty)
# A tibble: 1 × 11
  manufacturer model      displ  year   cyl trans  drv     cty   hwy fl    class
  <chr>        <chr>      <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
1 volkswagen   new beetle   1.9  1999     4 manua… f        35    44 d     subc…
mpg %>% slice_sample(n = 10)
# A tibble: 10 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 2 toyota       land crui…   5.7  2008     8 auto… 4        13    18 r     suv  
 3 toyota       toyota ta…   4    2008     6 manu… 4        15    18 r     pick…
 4 volkswagen   gti          2    2008     4 manu… f        21    29 p     comp…
 5 dodge        dakota pi…   4.7  2008     8 auto… 4         9    12 e     pick…
 6 chevrolet    malibu       3.6  2008     6 auto… f        17    26 r     mids…
 7 lincoln      navigator…   5.4  2008     8 auto… r        12    18 r     suv  
 8 land rover   range rov…   4.2  2008     8 auto… 4        12    18 r     suv  
 9 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
10 chevrolet    corvette     5.7  1999     8 manu… r        16    26 p     2sea…

References