Exploratory data analysis

STAT 167 Introduction to Data Science
Data transformation with dplyr
Wenxiu Ma
[email protected]
4/26/2022
© W Ma 2022 1/ 44
Exploratory data analysis – data transformation
A typical data science project workflow
I Visualization is an important tool for insight generation, but it is rare that
you get the data in exactly the right form you need.
I Often you’ll need to create some new variables or summaries, or maybe you
just want to rename the variables or reorder the observations in order to
make the data a little easier to work with.
© W Ma 2022 2/ 44
Data transformation with dplyr
Wickham and Grolemund. “R for Data Science: Import, Tidy,
Transform, Visualize, and Model Data.
” 1st Edition (2017)
I Chapter 3 “Data Transformation with dplyr” in print.
I Chapter 5 “Data transformation” @
http://r4ds.had.co.nz/transform.html
Additional readings
I CRAN R Vignette: https://cran.rproject.org/web/packages/dplyr/vignettes/dplyr.html
I dplyr cheatsheet: https://raw.githubusercontent.com/rstudio/
cheatsheets/main/data-transformation.pdf
© W Ma 2022 3/ 44
What can you do with dplyr?
© W Ma 2022 4/ 44
What can you do with dplyr?
I Pick observations by their values – filter()
I Reorder the rows – arrange()
I Pick variables by their names – select()
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 5/ 44
All dplyr functions share similar grammar
I The first argument is a data frame.
I The subsequent arguments describe what to do with the data
frame, using the variable names (without quotes).
I The result is a new data frame.
Together these properties make it easy to chain together multiple
simple steps to achieve a complex result –
piping (%>%)!
© W Ma 2022 6/ 44
Example: nycflights13::flights
This data frame contains all 336,776 flights that departed from New York City in
2013. The data comes from the US Bureau of Transportation Statistics.
install.packages(“nycflights13”)
library(nycflights13)
help(
package=“nycflights13”)
?flights
# full documentation of flights
# View(flights) # see the data in RStudio Viewer
flights
## # A tibble: 336,776 x 19

##
##
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_tim
<int> <int> <int> <int>
517
533
542
544
554
554
555
557
557
558
<int>
515
529
540
545
600
558
600
600
600
600
<dbl>
2
4
2
-1
-6
-4
-5
-3
-3
-2
<int>
830
850
923
1004
812
740
913
709
838
753
<int
81
83
85
102
83
72
85
72
84
74
## 1 2013
## 2 2013
## 3 2013
## 4 2013
## 5 2013
## 6 2013
## 7 2013
## 8 2013
## 9 2013
## 10 2013
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #
## #
carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm

© W Ma 2022 7/ 44
Type of variables
I int stands for integers.
I dbl stands for doubles, or real numbers.
I chr stands for character vectors, or strings.
I dttm stands for date-times (a date + a time).
Some other common types of variables:
I lgl stands for logical, vectors that contain only TRUE or
FALSE.
I fctr stands for factors, which R uses to represent categorical
variables with fixed possible values.
I ord stands for ordered factors
I date stands for dates.
© W Ma 2022 8/ 44
What can you do with dplyr?
I Pick observations by their values – ‘filter()’
I Reorder the rows – arrange()
I Pick variables by their names – select()
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 9/ 44
Pick observations by their values – filter()
filter() allows you to subset observations based on their values.
I The first argument is the name of the data frame.
I The second and subsequent arguments are the expressions that
filter the data frame.
filter(flights, month == 1, day == 1)
## # A tibble: 842 x 19

##
##
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_tim
<int> <int> <int> <int>
517
533
542
544
554
554
555
557
557
558
<int>
515
529
540
545
600
558
600
600
600
600
<dbl>
2
4
2
-1
-6
-4
-5
-3
-3
-2
<int>
830
850
923
1004
812
740
913
709
838
753
<int
81
83
85
102
83
72
85
72
84
74
## 1 2013
## 2 2013
## 3 2013
## 4 2013
## 5 2013
## 6 2013
## 7 2013
## 8 2013
## 9 2013
## 10 2013
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
## # … with 832 more rows, and 11 more variables: arr_delay <dbl>,
## #
## #
carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm

© W Ma 2022 10/ 44
Pick observations by their values – filter()
filter() allows you to subset observations based on their values.
I The output of filter() is a new data frame
# print out the new data frame only
filter(flights, month == 1, day == 1)
# save the new data frame to a variable, no print out
jan1 <- filter(flights, month == 1, day == 1)
# save and print out the new data frame
(jan1 <- filter(flights, month == 1, day == 1))
© W Ma 2022 11/ 44
Selection criteria
Select observations using comparisons
I >, >=, <, <=, != (not equal), and == (equal).
Multiple arguments to
filter() are combined with logical
operations
I & is “and”, | is “or”, and ! is “not”
Complete set of boolean operations
© W Ma 2022 12/ 44
Combination of selection criteria
filter(flights, month == 11 | month == 12)
filter(flights, month %in% c(
11, 12))
filter(flights, !(arr_delay >
120 | dep_delay > 120))
filter(flights, arr_delay <=
120, dep_delay <= 120)
© W Ma 2022 13/ 44
Peculiarities of floating-point numbers
I The R floating-point data type is a double, a.k.a. numeric
I The more bits in the fraction part, the more precision
I Rounding errors tend to accumulate in long calculations
I When results should be 0, errors can flip signs
0.45 == 3*0.15
## [1] FALSE
0.45 3*0.15
## [1] 5.551115e-17
sqrt(22 == 2
## [1] FALSE
sqrt(22 2
## [1] 4.440892e-16
1/49*49 == 1
## [1] FALSE
1/49*49 1
## [1] -1.110223e-16
© W Ma 2022 14/ 44
I Usually better to use all.equal() or near() instead of exact
comparison
0.45 == 3*0.15
## [1] FALSE
all.equal(0.45, 3*0.15)
## [1] TRUE
near(0.45, 3*0.15)
## [1] TRUE
© W Ma 2022 15/ 44
Missing values – NA (“not available”)
NA represents an unknown value so missing values are “contagious”
I Almost any operation involving an unknown value will also be unknown.
NA > 5
## [1] NA
10 == NA
## [1] NA
NA + 10
## [1] NA
NA / 2
## [1] NA
NA == NA
## [1] NA
I If you want to determine if a value is missing, use is.na()
x
<- NA
is.na(x)
## [1] TRUE
© W Ma 2022 16/ 44
filter() with NA values
I filter() only includes rows where the condition is TRUE
I It excludes both FALSE and NA values.
I If you want to preserve missing values, ask for them explicitly:
(df <- tibble(x = c(1, NA, 3)))
## # A tibble: 3 x 1

##
##
## 1
## 2
## 3
x
<dbl>
1
NA
3
filter(df, x > 1)
## # A tibble: 1 x 1
##
##
## 1
x
<dbl>
3
filter(df, is.na(x) | x > 1)
## # A tibble: 2 x 1
##
##
## 1
## 2
x
<dbl>
NA
3

© W Ma 2022 17/ 44
Count the number of NAs
(df <- tibble(x = c(1, NA, 3, NA)))
## # A tibble: 4 x 1

##
##
## 1
## 2
## 3
## 4
x
<dbl>
1
NA
3
NA
filter(df, is.na(x)) %>% count()
## # A tibble: 1 x 1
##
##
## 1
n
<int>
2

© W Ma 2022 18/ 44
Combine multiple operations with the pipe %>%
I Piping focuses on the transformations, not what’s being
transformed, which makes the code easier to read.
I x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z)
turns into g(f(x, y), z) and so on
df <- tibble(x = c(1, NA, 3, NA))
# step-by-step transformation
df_na <- filter(df, is.na(x))
count(df_na)
## # A tibble: 1 x 1
## n
## <int>
## 1 2
# PIPING
# filter(df, is.na(x)) %>% count()
df %>% filter(is.na(x)) %>% count()
## # A tibble: 1 x 1
## n
## <int>
## 1 2
© W Ma 2022 19/ 44
What can you do with dplyr?
I Pick observations by their values – filter()
I Reorder the rows – ‘arrange()’
I Pick variables by their names – select()
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 20/ 44
Reorder the rows – arrange()
I arrange() takes a data frame and a set of column names (or
more complicated expressions) to order by.
I When more than one column name are given, each additional
column will be used to break ties in the values of preceding
columns.
arrange(flights, year, month, day)
## # A tibble: 336,776 x 19

##
##
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_tim
<int> <int> <int> <int>
517
533
542
544
554
554
555
557
557
558
<int>
515
529
540
545
600
558
600
600
600
600
<dbl>
2
4
2
-1
-6
-4
-5
-3
-3
-2
<int>
830
850
923
1004
812
740
913
709
838
753
<int
81
83
85
102
83
72
85
72
84
74
## 1 2013
## 2 2013
## 3 2013
## 4 2013
## 5 2013
## 6 2013
## 7 2013
## 8 2013
## 9 2013
## 10 2013
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #
## #
carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm

© W Ma 2022 21/ 44
Descending order – desc()
I By default, arrange() sorts the columns in an ascending order
I To re-order by a column in descending order, use desc()
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19

##
##
year month day dep_time sched_dep_time dep_delay arr_time sched
<int> <int> <int> <int>
641
1432
1121
1139
845
1100
2321
959
2257
756
<int>
900
1935
1635
1845
1600
1900
810
1900
759
1700
<dbl>
1301
1137
1126
1014
1005
960
911
899
898
896
<int>
1242
1607
1239
1457
1044
1342
135
1236
121
1058
## 1 2013
## 2 2013
## 3 2013
## 4 2013
## 5 2013
## 6 2013
## 7 2013
## 8 2013
## 9 2013
## 10 2013
1
6
1
9
7
4
3
6
7
12
9
15
10
20
22
10
17
27
22
5
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #
## #
carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <c
air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_ho

© W Ma 2022 22/ 44
Descending order – desc()
I desc() transforms one single vector only
I To sort multiple columns, use desc() on each column
# arrange(flights, desc(year, month, day)) # doesn’t work!
arrange(flights, desc(year), desc(month), desc(day))
## # A tibble: 336,776 x 19

##
##
year month day dep_time sched_dep_time dep_delay arr_time sched
<int> <int> <int> <int>
13
18
26
459
514
549
550
552
553
554
<int>
2359
2359
2245
500
515
551
600
600
600
550
<dbl>
14
19
101
-1
-1
-2
-10
-8
-7
4
<int>
439
449
129
655
814
925
725
811
741
1024
## 1 2013
## 2 2013
## 3 2013
## 4 2013
## 5 2013
## 6 2013
## 7 2013
## 8 2013
## 9 2013
## 10 2013
12
12
12
12
12
12
12
12
12
12
31
31
31
31
31
31
31
31
31
31
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #
## #
carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <c
air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_ho

© W Ma 2022 23/ 44
Reorder the rows with NAs
I Missing values are always sorted at the end
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
## # A tibble: 3 x 1

##
##
## 1
## 2
## 3
x
<dbl>
2
5
NA
arrange(df, desc(x))
## # A tibble: 3 x 1
##
##
## 1
## 2
## 3
x
<dbl>
5
2
NA

© W Ma 2022 24/ 44
What can you do with dplyr?
I Pick observations by their values – filter()
I Reorder the rows – arrange()
I Pick variables by their names – ‘select()’
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 25/ 44
Pick variables by their names – select()
I When you have large dataset (hundreds or more variables), the
first challenge is often narrowing in on the variables you’re
actually interested in.
I select() allows you to rapidly zoom in on a useful subset
using operations based on the names of the variables.
© W Ma 2022 26/ 44
Pick variables by their names – select()
# Select columns by name
# select(flights, year, month, day)
# Select all columns between year and day (inclusive)
select(flights, year:day)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # … with 336,766 more rows
© W Ma 2022 27/ 44
Pick variables by their names – select()
# Remove columns by names
# select(data, -year, -month, -day)
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
## # A tibble: 336,776 x 16

##
##
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_dela
<int>
517
533
542
544
554
554
555
557
557
558
<int>
515
529
540
545
600
558
600
600
600
600
<dbl>
2
4
2
-1
-6
-4
-5
-3
-3
-2
<int>
830
850
923
1004
812
740
913
709
838
753
<int>
819
830
850
1022
837
728
854
723
846
745
<dbl
1
2
3
-1
-2
1
1
-1
## # … with 336,766 more rows, and 9 more variables: flight <int>,
## #
## #
tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance
hour <dbl>, minute <dbl>, time_hour <dttm>

© W Ma 2022 28/ 44
Pick variables by their names – select()
There are a number of functions you can use within select():
I starts_with(“abc”): matches names that begin with “abc”.
I ends_with(“xyz”): matches names that end with “xyz”.
I contains(“ijk”): matches names that contain “ijk”.
I matches(“(.)\1”): selects variables that match a regular
expression. This one matches any variables that contain
repeated characters. Learn more about
stringr @
https://r4ds.had.co.nz/strings.html
I num_range(“x”, 1:3): matches x1, x2 and x3. It is useful
when numbers were included in column names.
I one_of(…): selects columns names that are from a group of
names. It is useful when columns are named as a vector or
character string.
I everything(): selects all columns.
See ’?select’ for more details.
© W Ma 2022 29/ 44
Rename variables – rename()
rename(), which is a variant of select() that keeps all the
variables that aren’t explicitly mentioned.
# select(flights, tail_num = tailnum) # only one variable left
rename(flights, tail_num = tailnum)
## # A tibble: 336,776 x 19

##
##
year month day dep_time sched_dep_time dep_delay arr_time sched
<int> <int> <int> <int>
517
533
542
544
554
554
555
557
557
558
<int>
515
529
540
545
600
558
600
600
600
600
<dbl>
2
4
2
-1
-6
-4
-5
-3
-3
-2
<int>
830
850
923
1004
812
740
913
709
838
753
## 1 2013
## 2 2013
## 3 2013
## 4 2013
## 5 2013
## 6 2013
## 7 2013
## 8 2013
## 9 2013
## 10 2013
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #
## #
carrier <chr>, flight <int>, tail_num <chr>, origin <chr>, dest <
air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_ho

© W Ma 2022 30/ 44
What can you do with dplyr?
I Pick observations by their values – filter()
I Reorder the rows – arrange()
I Pick variables by their names – select()
I Create new variables with functions of existing variables –
‘mutate()’
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 31/ 44
Create new variables – mutate()
I mutate() adds new columns with functions of existing columns
I mutate() always adds new columns at the end of your dataset
flights_sml <- select(flights, month:day, ends_with(“delay”),
distance, air_time)
mutate(flights_sml,
gain = arr_delay – dep_delay,
speed = distance / air_time * 60)
## # A tibble: 336,776 x 8

##
##
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
month day dep_delay arr_delay distance air_time gain speed
<int> <int> <dbl>
2
4
2
-1
-6
-4
-5
-3
-3
-2
<dbl>
11
20
33
-18
-25
12
19
-14
-8
8
<dbl>
1400
1416
1089
1576
762
719
1065
229
944
733
<dbl> <dbl> <dbl>
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
227
227
160
183
116
150
158
53
140
138
9 370.
16 374.
31 408.
-17 517.
-19 394.
16 288.
24 404.
-11 259.
-5 405.
10 319.

## # … with 336,766 more rows
© W Ma 2022 32/ 44
Create new variables – transmute()
If you only want to keep the new variables, use transmute()
transmute(flights,
gain = arr_delay – dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours)
## # A tibble: 336,776 x 3

##
##
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
gain hours gain_per_hour
<dbl> <dbl>
9 3.78
16 3.78
31 2.67
-17 3.05
-19 1.93
16 2.5
24 2.63
-11 0.883
-5 2.33
10 2.3
<dbl>
2.38
4.23
11.6
-5.57
-9.83
6.4
9.11
-12.5
-2.14
4.35

## # … with 336,766 more rows
© W Ma 2022 33/ 44
Useful creation functions
I Arithmetic operators: +, , *, /, ˆ
I Modular arithmetic: %/% (integer division) and %% (remainder),
where
x == y * (x %/% y) + (x %% y).
I It is a handy tool to break integers up into pieces.
transmute(flights, dep_time, hour = dep_time %/% 100,
minute = dep_time %% 100)
## # A tibble: 336,776 x 3
## dep_time hour minute
## <int> <dbl> <dbl>
## 1 517 5 17
## 2 533 5 33
## 3 542 5 42
## 4 544 5 44
## 5 554 5 54
## 6 554 5 54
## 7 555 5 55
## 8 557 5 57
## 9 557 5 57
## 10 558 5 58
## # … with 336,766 more rows
© W Ma 2022 34/ 44
Useful creation functions
I Logs: log(), log2(), log10()
I Offsets: lead() and lag() allow you to refer to leading or
lagging values
I This allows you to compute running differences (e.g. x –
lag(x)
) or find when values change (x != lag(x)).
I They are most useful in conjunction with group_by().
(x <- 1:10)
## [1] 1 2 3 4 5 6 7 8 9 10
lead(x, 1)
## [1] 2 3 4 5 6 7 8 9 10 NA
lead(x, 2)
## [1] 3 4 5 6 7 8 9 10 NA NA
lag(x, 1)
## [1] NA 1 2 3 4 5 6 7 8 9
lag(x, 2)
## [1] NA NA 1 2 3 4 5 6 7 8
x – lag(x)
## [1] NA 1 1 1 1 1 1 1 1 1
© W Ma 2022 35/ 44
Useful creation functions
I Cumulative and rolling aggregates
I R provides functions for running sums, products, mins and
maxes:
cumsum(), cumprod(), cummin(), cummax()
I dplyr provides cummean() for cumulative means.
x
## [1] 1 2 3 4 5 6 7 8 9 10
cumsum(x)
## [1] 1 3 6 10 15 21 28 36 45 55
cummean(x)
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
© W Ma 2022 36/ 44
Useful creation functions
I Logical comparisons, <, <=, >, >=, ==, !=.
I Ranking: there are a number of ranking functions, but you
should start with
min_rank().
I Look at the variants row_number(), dense_rank(),
percent_rank(), cume_dist(), ntile()
y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
## [1] 1 2 2 NA 4 5
desc(y) # transform a vector into a format that will be sorted in descen
## [1] -1 -2 -2 NA -3 -4
min_rank(desc(y))
## [1] 5 3 3 NA 2 1
© W Ma 2022 37/ 44
What can you do with dplyr?
I Pick observations by their values – filter()
I Reorder the rows – arrange()
I Pick variables by their names – select()
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary – ‘summarise()’
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 38/ 44
Grouped summaries with summarise()
I summarise() collapses a data frame to a single row:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
## delay
## <dbl>
## 1 12.6
© W Ma 2022 39/ 44
Grouped summaries with summarise()
I summarize() is useful when pair it with group_by()
by_day <- group_by(flights, year, month, day)
summarise(by_day,
delay = mean(dep_delay, na.rm = TRUE))
## `summarise()` has grouped output by ‘year’, ‘month’. You can override
## `.groups` argument.
## # A tibble: 365 x 4

## # Groups: year, month [12]
##
##
year month day delay
<int> <int> <int> <dbl>
## 1 2013
## 2 2013
## 3 2013
## 4 2013
## 5 2013
## 6 2013
## 7 2013
## 8 2013
## 9 2013
## 10 2013
1
1
1
1
1
1
1
1
1
1
1 11.5
2 13.9
3 11.0
4 8.95
5 5.73
6 7.15
7 5.42
8 2.55
9 2.28
10 2.84

## # … with 355 more rows
© W Ma 2022 40/ 44
Useful summary functions
I Measures of location: mean(x), median(x)
I sometimes useful to combine aggregation with logical
subsetting.
flights %>% group_by(year, month, day) %>%
summarise(
avg_delay1 = mean(arr_delay, na.rm = T),
# the average positive delay
avg_delay2 = mean(arr_delay[arr_delay > 0], na.rm = T))
## `summarise()` has grouped output by ‘year’, ‘month’. You can override
## `.groups` argument.
## # A tibble: 365 x 5

## # Groups: year, month [12]
##
##
year month day avg_delay1 avg_delay2 41/ 44
<int> <int> <int> <dbl>
12.7
12.7
5.73
-1.93
-1.53
4.24
-4.95
-3.23
-0.264
<dbl>
32.5
32.0
27.7
28.3
22.6
24.4
27.8
20.8
25.6
## 1 2013
## 2 2013
## 3 2013
## 4 2013
## 5 2013
## 6 2013
## 7 2013
## 8 2013
## 9 2013
© W Ma 2022
1
1
1
1
1
1
1
1
1
1
2
3
4
5
6
7
8
9

Useful summary functions
I Measures of spread: sd(x), IQR(x) (interquartile range),
mad(x) (median absolute deviation)
I Measures of rank: min(x), quantile(x, 0.25), max(x)
I Measures of position: first(x), nth(x, 2), last(x)
I These work similarly to x[1], x[2], and x[length(x)] but let
you set a default value if that position does not exist
I Counts: n(), sum(!is.na(x)), n_distinct(x)
I Combine summary functions with logical values: sum(x > 10),
mean(y == 0).
© W Ma 2022 42/ 44
Summarise multiple columns
# count # of NAs per column
colSums(is.na(flights))

##
##
##
##
##
##
##
##
year month
0
day
0
dep_time sched_dep_time
0 8255
arr_delay
9430
dest
0
carrier
0
air_time
9430
dep_delay
8255
flight
arr_time sched_arr_time
8713
tailnum
2512
hour
0
0
origin
0
minute
0
0 0
distance time_hour
0 0
flights %>% summarise_all(funs(sum(is.na(.)))) %>% print(width=Inf)
## # A tibble: 1 x 19
##
##
## 1
##
##
## 1
##
##
## 1
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int>
8255
<int>
0
<int>
8255
<int>
8713
<int>
0 0 0 0
arr_delay carrier flight tailnum origin dest air_time distance hour minut
<int>
9430
time_hour
<int>
0
<int> <int> <int> <int> <int> <int>
9430
<int> <int> <int
0 0 2512 0 0 0 0

© W Ma 2022 43/ 44
Summary – dplyr basic functions
I Pick observations by their values – filter()
I Reorder the rows – arrange()
I Pick variables by their names – ‘select()’
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 44/ 44