# Exploratory data analysis

STAT 167 Introduction to Data Science
Data transformation with dplyr
Wenxiu Ma
[email protected]
4/26/2022
© W Ma 2022 1/ 44
Exploratory data analysis – data transformation
A typical data science project workflow
I Visualization is an important tool for insight generation, but it is rare that
you get the data in exactly the right form you need.
I Often you’ll need to create some new variables or summaries, or maybe you
just want to rename the variables or reorder the observations in order to
make the data a little easier to work with.
© W Ma 2022 2/ 44
Data transformation with dplyr
Wickham and Grolemund. “R for Data Science: Import, Tidy,
Transform, Visualize, and Model Data.
” 1st Edition (2017)
I Chapter 3 “Data Transformation with dplyr” in print.
I Chapter 5 “Data transformation” @
I CRAN R Vignette: https://cran.rproject.org/web/packages/dplyr/vignettes/dplyr.html
I dplyr cheatsheet: https://raw.githubusercontent.com/rstudio/
cheatsheets/main/data-transformation.pdf
© W Ma 2022 3/ 44
What can you do with dplyr?
© W Ma 2022 4/ 44
What can you do with dplyr?
I Pick observations by their values – filter()
I Reorder the rows – arrange()
I Pick variables by their names – select()
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 5/ 44
All dplyr functions share similar grammar
I The first argument is a data frame.
I The subsequent arguments describe what to do with the data
frame, using the variable names (without quotes).
I The result is a new data frame.
Together these properties make it easy to chain together multiple
simple steps to achieve a complex result –
piping (%>%)!
© W Ma 2022 6/ 44
Example: nycflights13::flights
This data frame contains all 336,776 flights that departed from New York City in
2013. The data comes from the US Bureau of Transportation Statistics.
install.packages(“nycflights13”)
library(nycflights13)
help(
package=“nycflights13”)
?flights
# full documentation of flights
# View(flights) # see the data in RStudio Viewer
flights
## # A tibble: 336,776 x 19

 ## ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_tim 517 533 542 544 554 554 555 557 557 558 515 529 540 545 600 558 600 600 600 600 2 4 2 -1 -6 -4 -5 -3 -3 -2 830 850 923 1004 812 740 913 709 838 753 , ## # ## # carrier , flight , tailnum , origin , dest , air_time , distance , hour , minute , time_hour

© W Ma 2022 7/ 44
Type of variables
I int stands for integers.
I dbl stands for doubles, or real numbers.
I chr stands for character vectors, or strings.
I dttm stands for date-times (a date + a time).
Some other common types of variables:
I lgl stands for logical, vectors that contain only TRUE or
FALSE.
I fctr stands for factors, which R uses to represent categorical
variables with fixed possible values.
I ord stands for ordered factors
I date stands for dates.
© W Ma 2022 8/ 44
What can you do with dplyr?
I Pick observations by their values – ‘filter()’
I Reorder the rows – arrange()
I Pick variables by their names – select()
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 9/ 44
Pick observations by their values – filter()
filter() allows you to subset observations based on their values.
I The first argument is the name of the data frame.
I The second and subsequent arguments are the expressions that
filter the data frame.
filter(flights, month == 1, day == 1)
## # A tibble: 842 x 19

 ## ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_tim 517 533 542 544 554 554 555 557 557 558 515 529 540 545 600 558 600 600 600 600 2 4 2 -1 -6 -4 -5 -3 -3 -2 830 850 923 1004 812 740 913 709 838 753 , ## # ## # carrier , flight , tailnum , origin , dest , air_time , distance , hour , minute , time_hour

© W Ma 2022 10/ 44
Pick observations by their values – filter()
filter() allows you to subset observations based on their values.
I The output of filter() is a new data frame
# print out the new data frame only
filter(flights, month == 1, day == 1)
# save the new data frame to a variable, no print out
jan1 <- filter(flights, month == 1, day == 1)
# save and print out the new data frame
(jan1 <- filter(flights, month == 1, day == 1))
© W Ma 2022 11/ 44
Selection criteria
Select observations using comparisons
I >, >=, <, <=, != (not equal), and == (equal).
Multiple arguments to
filter() are combined with logical
operations
I & is “and”, | is “or”, and ! is “not”
Complete set of boolean operations
© W Ma 2022 12/ 44
Combination of selection criteria
filter(flights, month == 11 | month == 12)
filter(flights, month %in% c(
11, 12))
filter(flights, !(arr_delay >
120 | dep_delay > 120))
filter(flights, arr_delay <=
120, dep_delay <= 120)
© W Ma 2022 13/ 44
Peculiarities of floating-point numbers
I The R floating-point data type is a double, a.k.a. numeric
I The more bits in the fraction part, the more precision
I Rounding errors tend to accumulate in long calculations
I When results should be 0, errors can flip signs
0.45 == 3*0.15
## [1] FALSE
0.45 3*0.15
## [1] 5.551115e-17
sqrt(22 == 2
## [1] FALSE
sqrt(22 2
## [1] 4.440892e-16
1/49*49 == 1
## [1] FALSE
1/49*49 1
## [1] -1.110223e-16
© W Ma 2022 14/ 44
I Usually better to use all.equal() or near() instead of exact
comparison
0.45 == 3*0.15
## [1] FALSE
all.equal(0.45, 3*0.15)
## [1] TRUE
near(0.45, 3*0.15)
## [1] TRUE
© W Ma 2022 15/ 44
Missing values – NA (“not available”)
NA represents an unknown value so missing values are “contagious”
I Almost any operation involving an unknown value will also be unknown.
NA > 5
## [1] NA
10 == NA
## [1] NA
NA + 10
## [1] NA
NA / 2
## [1] NA
NA == NA
## [1] NA
I If you want to determine if a value is missing, use is.na()
x
<- NA
is.na(x)
## [1] TRUE
© W Ma 2022 16/ 44
filter() with NA values
I filter() only includes rows where the condition is TRUE
I It excludes both FALSE and NA values.
I If you want to preserve missing values, ask for them explicitly:
(df <- tibble(x = c(1, NA, 3)))
## # A tibble: 3 x 1

 ## ## ## 1 ## 2 ## 3 x 1 NA 3 filter(df, x > 1) ## # A tibble: 1 x 1 ## ## ## 1 x 3 filter(df, is.na(x) | x > 1) ## # A tibble: 2 x 1 ## ## ## 1 ## 2 x NA 3

© W Ma 2022 17/ 44
Count the number of NAs
(df <- tibble(x = c(1, NA, 3, NA)))
## # A tibble: 4 x 1

 ## ## ## 1 ## 2 ## 3 ## 4 x 1 NA 3 NA filter(df, is.na(x)) %>% count() ## # A tibble: 1 x 1 ## ## ## 1 n 2

© W Ma 2022 18/ 44
Combine multiple operations with the pipe %>%
I Piping focuses on the transformations, not what’s being
transformed, which makes the code easier to read.
I x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z)
turns into g(f(x, y), z) and so on
df <- tibble(x = c(1, NA, 3, NA))
# step-by-step transformation
df_na <- filter(df, is.na(x))
count(df_na)
## # A tibble: 1 x 1
## n
## <int>
## 1 2
# PIPING
# filter(df, is.na(x)) %>% count()
df %>% filter(is.na(x)) %>% count()
## # A tibble: 1 x 1
## n
## <int>
## 1 2
© W Ma 2022 19/ 44
What can you do with dplyr?
I Pick observations by their values – filter()
I Reorder the rows – ‘arrange()’
I Pick variables by their names – select()
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 20/ 44
Reorder the rows – arrange()
I arrange() takes a data frame and a set of column names (or
more complicated expressions) to order by.
I When more than one column name are given, each additional
column will be used to break ties in the values of preceding
columns.
arrange(flights, year, month, day)
## # A tibble: 336,776 x 19

 ## ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_tim 517 533 542 544 554 554 555 557 557 558 515 529 540 545 600 558 600 600 600 600 2 4 2 -1 -6 -4 -5 -3 -3 -2 830 850 923 1004 812 740 913 709 838 753 , ## # ## # carrier , flight , tailnum , origin , dest , air_time , distance , hour , minute , time_hour

© W Ma 2022 21/ 44
Descending order – desc()
I By default, arrange() sorts the columns in an ascending order
I To re-order by a column in descending order, use desc()
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19

 ## ## year month day dep_time sched_dep_time dep_delay arr_time sched 641 1432 1121 1139 845 1100 2321 959 2257 756 900 1935 1635 1845 1600 1900 810 1900 759 1700 1301 1137 1126 1014 1005 960 911 899 898 896 1242 1607 1239 1457 1044 1342 135 1236 121 1058 ## 1 2013 ## 2 2013 ## 3 2013 ## 4 2013 ## 5 2013 ## 6 2013 ## 7 2013 ## 8 2013 ## 9 2013 ## 10 2013 1 6 1 9 7 4 3 6 7 12 9 15 10 20 22 10 17 27 22 5 ## # … with 336,766 more rows, and 11 more variables: arr_delay , ## # ## # carrier , flight , tailnum , origin , dest , distance , hour , minute , time_ho

© W Ma 2022 22/ 44
Descending order – desc()
I desc() transforms one single vector only
I To sort multiple columns, use desc() on each column
# arrange(flights, desc(year, month, day)) # doesn’t work!
arrange(flights, desc(year), desc(month), desc(day))
## # A tibble: 336,776 x 19

 ## ## year month day dep_time sched_dep_time dep_delay arr_time sched 13 18 26 459 514 549 550 552 553 554 2359 2359 2245 500 515 551 600 600 600 550 14 19 101 -1 -1 -2 -10 -8 -7 4 439 449 129 655 814 925 725 811 741 1024 ## 1 2013 ## 2 2013 ## 3 2013 ## 4 2013 ## 5 2013 ## 6 2013 ## 7 2013 ## 8 2013 ## 9 2013 ## 10 2013 12 12 12 12 12 12 12 12 12 12 31 31 31 31 31 31 31 31 31 31 ## # … with 336,766 more rows, and 11 more variables: arr_delay , ## # ## # carrier , flight , tailnum , origin , dest , distance , hour , minute , time_ho

© W Ma 2022 23/ 44
Reorder the rows with NAs
I Missing values are always sorted at the end
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
## # A tibble: 3 x 1

 ## ## ## 1 ## 2 ## 3 x 2 5 NA arrange(df, desc(x)) ## # A tibble: 3 x 1 ## ## ## 1 ## 2 ## 3 x 5 2 NA

© W Ma 2022 24/ 44
What can you do with dplyr?
I Pick observations by their values – filter()
I Reorder the rows – arrange()
I Pick variables by their names – ‘select()’
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 25/ 44
Pick variables by their names – select()
I When you have large dataset (hundreds or more variables), the
first challenge is often narrowing in on the variables you’re
actually interested in.
I select() allows you to rapidly zoom in on a useful subset
using operations based on the names of the variables.
© W Ma 2022 26/ 44
Pick variables by their names – select()
# Select columns by name
# select(flights, year, month, day)
# Select all columns between year and day (inclusive)
select(flights, year:day)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # … with 336,766 more rows
© W Ma 2022 27/ 44
Pick variables by their names – select()
# Remove columns by names
# select(data, -year, -month, -day)
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
## # A tibble: 336,776 x 16

 ## ## ## 1 ## 2 ## 3 ## 4 ## 5 ## 6 ## 7 ## 8 ## 9 ## 10 dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_dela 517 533 542 544 554 554 555 557 557 558 515 529 540 545 600 558 600 600 600 600 2 4 2 -1 -6 -4 -5 -3 -3 -2 830 850 923 1004 812 740 913 709 838 753 819 830 850 1022 837 728 854 723 846 745 , ## # ## # tailnum , origin , dest , air_time , distance hour , minute , time_hour

© W Ma 2022 28/ 44
Pick variables by their names – select()
There are a number of functions you can use within select():
I starts_with(“abc”): matches names that begin with “abc”.
I ends_with(“xyz”): matches names that end with “xyz”.
I contains(“ijk”): matches names that contain “ijk”.
I matches(“(.)\1”): selects variables that match a regular
expression. This one matches any variables that contain
stringr @
I num_range(“x”, 1:3): matches x1, x2 and x3. It is useful
when numbers were included in column names.
I one_of(…): selects columns names that are from a group of
names. It is useful when columns are named as a vector or
character string.
I everything(): selects all columns.
See ’?select’ for more details.
© W Ma 2022 29/ 44
Rename variables – rename()
rename(), which is a variant of select() that keeps all the
variables that aren’t explicitly mentioned.
# select(flights, tail_num = tailnum) # only one variable left
rename(flights, tail_num = tailnum)
## # A tibble: 336,776 x 19

 ## ## year month day dep_time sched_dep_time dep_delay arr_time sched 517 533 542 544 554 554 555 557 557 558 515 529 540 545 600 558 600 600 600 600 2 4 2 -1 -6 -4 -5 -3 -3 -2 830 850 923 1004 812 740 913 709 838 753 ## 1 2013 ## 2 2013 ## 3 2013 ## 4 2013 ## 5 2013 ## 6 2013 ## 7 2013 ## 8 2013 ## 9 2013 ## 10 2013 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ## # … with 336,766 more rows, and 11 more variables: arr_delay , ## # ## # carrier , flight , tail_num , origin , dest < air_time , distance , hour , minute , time_ho

© W Ma 2022 30/ 44
What can you do with dplyr?
I Pick observations by their values – filter()
I Reorder the rows – arrange()
I Pick variables by their names – select()
I Create new variables with functions of existing variables –
‘mutate()’
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 31/ 44
Create new variables – mutate()
I mutate() adds new columns with functions of existing columns
I mutate() always adds new columns at the end of your dataset
flights_sml <- select(flights, month:day, ends_with(“delay”),
distance, air_time)
mutate(flights_sml,
gain = arr_delay – dep_delay,
speed = distance / air_time * 60)
## # A tibble: 336,776 x 8

 ## ## ## 1 ## 2 ## 3 ## 4 ## 5 ## 6 ## 7 ## 8 ## 9 ## 10 month day dep_delay arr_delay distance air_time gain speed 2 4 2 -1 -6 -4 -5 -3 -3 -2 11 20 33 -18 -25 12 19 -14 -8 8 1400 1416 1089 1576 762 719 1065 229 944 733 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 227 227 160 183 116 150 158 53 140 138 9 370. 16 374. 31 408. -17 517. -19 394. 16 288. 24 404. -11 259. -5 405. 10 319.

## # … with 336,766 more rows
© W Ma 2022 32/ 44
Create new variables – transmute()
If you only want to keep the new variables, use transmute()
transmute(flights,
gain = arr_delay – dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours)
## # A tibble: 336,776 x 3

 ## ## ## 1 ## 2 ## 3 ## 4 ## 5 ## 6 ## 7 ## 8 ## 9 ## 10 gain hours gain_per_hour 9 3.78 16 3.78 31 2.67 -17 3.05 -19 1.93 16 2.5 24 2.63 -11 0.883 -5 2.33 10 2.3 2.38 4.23 11.6 -5.57 -9.83 6.4 9.11 -12.5 -2.14 4.35

## # … with 336,766 more rows
© W Ma 2022 33/ 44
Useful creation functions
I Arithmetic operators: +, , *, /, ˆ
I Modular arithmetic: %/% (integer division) and %% (remainder),
where
x == y * (x %/% y) + (x %% y).
I It is a handy tool to break integers up into pieces.
transmute(flights, dep_time, hour = dep_time %/% 100,
minute = dep_time %% 100)
## # A tibble: 336,776 x 3
## dep_time hour minute
## <int> <dbl> <dbl>
## 1 517 5 17
## 2 533 5 33
## 3 542 5 42
## 4 544 5 44
## 5 554 5 54
## 6 554 5 54
## 7 555 5 55
## 8 557 5 57
## 9 557 5 57
## 10 558 5 58
## # … with 336,766 more rows
© W Ma 2022 34/ 44
Useful creation functions
I Logs: log(), log2(), log10()
I Offsets: lead() and lag() allow you to refer to leading or
lagging values
I This allows you to compute running differences (e.g. x –
lag(x)
) or find when values change (x != lag(x)).
I They are most useful in conjunction with group_by().
(x <- 1:10)
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 2 3 4 5 6 7 8 9 10 NA
## [1] 3 4 5 6 7 8 9 10 NA NA
lag(x, 1)
## [1] NA 1 2 3 4 5 6 7 8 9
lag(x, 2)
## [1] NA NA 1 2 3 4 5 6 7 8
x – lag(x)
## [1] NA 1 1 1 1 1 1 1 1 1
© W Ma 2022 35/ 44
Useful creation functions
I Cumulative and rolling aggregates
I R provides functions for running sums, products, mins and
maxes:
cumsum(), cumprod(), cummin(), cummax()
I dplyr provides cummean() for cumulative means.
x
## [1] 1 2 3 4 5 6 7 8 9 10
cumsum(x)
## [1] 1 3 6 10 15 21 28 36 45 55
cummean(x)
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
© W Ma 2022 36/ 44
Useful creation functions
I Logical comparisons, <, <=, >, >=, ==, !=.
I Ranking: there are a number of ranking functions, but you
min_rank().
I Look at the variants row_number(), dense_rank(),
percent_rank(), cume_dist(), ntile()
y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
## [1] 1 2 2 NA 4 5
desc(y) # transform a vector into a format that will be sorted in descen
## [1] -1 -2 -2 NA -3 -4
min_rank(desc(y))
## [1] 5 3 3 NA 2 1
© W Ma 2022 37/ 44
What can you do with dplyr?
I Pick observations by their values – filter()
I Reorder the rows – arrange()
I Pick variables by their names – select()
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary – ‘summarise()’
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 38/ 44
Grouped summaries with summarise()
I summarise() collapses a data frame to a single row:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
## delay
## <dbl>
## 1 12.6
© W Ma 2022 39/ 44
Grouped summaries with summarise()
I summarize() is useful when pair it with group_by()
by_day <- group_by(flights, year, month, day)
summarise(by_day,
delay = mean(dep_delay, na.rm = TRUE))
## `summarise()` has grouped output by ‘year’, ‘month’. You can override
## `.groups` argument.
## # A tibble: 365 x 4

 ## # Groups: year, month [12] ## ## year month day delay ## 1 2013 ## 2 2013 ## 3 2013 ## 4 2013 ## 5 2013 ## 6 2013 ## 7 2013 ## 8 2013 ## 9 2013 ## 10 2013 1 1 1 1 1 1 1 1 1 1 1 11.5 2 13.9 3 11.0 4 8.95 5 5.73 6 7.15 7 5.42 8 2.55 9 2.28 10 2.84

## # … with 355 more rows
© W Ma 2022 40/ 44
Useful summary functions
I Measures of location: mean(x), median(x)
I sometimes useful to combine aggregation with logical
subsetting.
flights %>% group_by(year, month, day) %>%
summarise(
avg_delay1 = mean(arr_delay, na.rm = T),
# the average positive delay
avg_delay2 = mean(arr_delay[arr_delay > 0], na.rm = T))
## `summarise()` has grouped output by ‘year’, ‘month’. You can override
## `.groups` argument.
## # A tibble: 365 x 5

 ## # Groups: year, month [12] ## ## year month day avg_delay1 avg_delay2 41/ 44 12.7 12.7 5.73 -1.93 -1.53 4.24 -4.95 -3.23 -0.264 32.5 32.0 27.7 28.3 22.6 24.4 27.8 20.8 25.6 ## 1 2013 ## 2 2013 ## 3 2013 ## 4 2013 ## 5 2013 ## 6 2013 ## 7 2013 ## 8 2013 ## 9 2013 © W Ma 2022 1 1 1 1 1 1 1 1 1 1 2 3 4 5 6 7 8 9

Useful summary functions
I Measures of spread: sd(x), IQR(x) (interquartile range),
I Measures of rank: min(x), quantile(x, 0.25), max(x)
I Measures of position: first(x), nth(x, 2), last(x)
I These work similarly to x[1], x[2], and x[length(x)] but let
you set a default value if that position does not exist
I Counts: n(), sum(!is.na(x)), n_distinct(x)
I Combine summary functions with logical values: sum(x > 10),
mean(y == 0).
© W Ma 2022 42/ 44
Summarise multiple columns
# count # of NAs per column
colSums(is.na(flights))

 ## ## ## ## ## ## ## ## year month 0 day 0 dep_time sched_dep_time 0 8255 arr_delay 9430 dest 0 carrier 0 air_time 9430 dep_delay 8255 flight arr_time sched_arr_time 8713 tailnum 2512 hour 0 0 origin 0 minute 0 0 0 distance time_hour 0 0 flights %>% summarise_all(funs(sum(is.na(.)))) %>% print(width=Inf) ## # A tibble: 1 x 19 ## ## ## 1 ## ## ## 1 ## ## ## 1 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time 8255 0 8255 8713 0 0 0 0 arr_delay carrier flight tailnum origin dest air_time distance hour minut 9430 time_hour 0 9430

© W Ma 2022 43/ 44
Summary – dplyr basic functions
I Pick observations by their values – filter()
I Reorder the rows – arrange()
I Pick variables by their names – ‘select()’
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 44/ 44