STAT 167 Introduction to Data Science
Data transformation with dplyr
Wenxiu Ma
[email protected]
4/26/2022
© W Ma 2022 1/ 44
Exploratory data analysis – data transformation
A typical data science project workflow
I Visualization is an important tool for insight generation, but it is rare that
you get the data in exactly the right form you need.
I Often you’ll need to create some new variables or summaries, or maybe you
just want to rename the variables or reorder the observations in order to
make the data a little easier to work with.
© W Ma 2022 2/ 44
Data transformation with dplyr
Wickham and Grolemund. “R for Data Science: Import, Tidy,
Transform, Visualize, and Model Data.” 1st Edition (2017)
I Chapter 3 “Data Transformation with dplyr” in print.
I Chapter 5 “Data transformation” @
http://r4ds.had.co.nz/transform.html
Additional readings
I CRAN R Vignette: https://cran.rproject.org/web/packages/dplyr/vignettes/dplyr.html
I dplyr cheatsheet: https://raw.githubusercontent.com/rstudio/
cheatsheets/main/data-transformation.pdf
© W Ma 2022 3/ 44
What can you do with dplyr?
© W Ma 2022 4/ 44
What can you do with dplyr?
I Pick observations by their values – filter()
I Reorder the rows – arrange()
I Pick variables by their names – select()
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of data
manipulation/wrangling.
© W Ma 2022 5/ 44
All dplyr functions share similar grammar
I The first argument is a data frame.
I The subsequent arguments describe what to do with the data
frame, using the variable names (without quotes).
I The result is a new data frame.
Together these properties make it easy to chain together multiple
simple steps to achieve a complex result – piping (%>%)!
© W Ma 2022 6/ 44
Example: nycflights13::flights
This data frame contains all 336,776 flights that departed from New York City in
2013. The data comes from the US Bureau of Transportation Statistics.
install.packages(“nycflights13”)
library(nycflights13)
help(package=“nycflights13”)
?flights # full documentation of flights
# View(flights) # see the data in RStudio Viewer
flights
## # A tibble: 336,776 x 19
## ## |
year month | day dep_time sched_dep_time dep_delay arr_time sched_arr_tim | |||
<int> <int> <int> | <int> 517 533 542 544 554 554 555 557 557 558 |
<int> 515 529 540 545 600 558 600 600 600 600 |
<dbl> 2 4 2 -1 -6 -4 -5 -3 -3 -2 |
<int> 830 850 923 1004 812 740 913 709 838 753 |
<int 81 83 85 102 83 72 85 72 84 74 |
## 1 2013 ## 2 2013 ## 3 2013 ## 4 2013 ## 5 2013 ## 6 2013 ## 7 2013 ## 8 2013 ## 9 2013 ## 10 2013 |
1 1 1 1 1 1 1 1 1 1 |
1 1 1 1 1 1 1 1 1 1 |
|||
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>, | |||||
## # ## # |
carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, | ||||
air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm |
© W Ma 2022 7/ 44
Type of variables
I int stands for integers.
I dbl stands for doubles, or real numbers.
I chr stands for character vectors, or strings.
I dttm stands for date-times (a date + a time).
Some other common types of variables:
I lgl stands for logical, vectors that contain only TRUE or
FALSE.
I fctr stands for factors, which R uses to represent categorical
variables with fixed possible values.
I ord stands for ordered factors
I date stands for dates.
© W Ma 2022 8/ 44
What can you do with dplyr?
I Pick observations by their values – ‘filter()’
I Reorder the rows – arrange()
I Pick variables by their names – select()
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of data
manipulation/wrangling.
© W Ma 2022 9/ 44
Pick observations by their values – filter()
filter() allows you to subset observations based on their values.
I The first argument is the name of the data frame.
I The second and subsequent arguments are the expressions that
filter the data frame.
filter(flights, month == 1, day == 1)
## # A tibble: 842 x 19
## ## |
year month | day dep_time sched_dep_time dep_delay arr_time sched_arr_tim | |||
<int> <int> <int> | <int> 517 533 542 544 554 554 555 557 557 558 |
<int> 515 529 540 545 600 558 600 600 600 600 |
<dbl> 2 4 2 -1 -6 -4 -5 -3 -3 -2 |
<int> 830 850 923 1004 812 740 913 709 838 753 |
<int 81 83 85 102 83 72 85 72 84 74 |
## 1 2013 ## 2 2013 ## 3 2013 ## 4 2013 ## 5 2013 ## 6 2013 ## 7 2013 ## 8 2013 ## 9 2013 ## 10 2013 |
1 1 1 1 1 1 1 1 1 1 |
1 1 1 1 1 1 1 1 1 1 |
|||
## # … with 832 more rows, and 11 more variables: arr_delay <dbl>, | |||||
## # ## # |
carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, | ||||
air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm |
© W Ma 2022 10/ 44
Pick observations by their values – filter()
filter() allows you to subset observations based on their values.
I The output of filter() is a new data frame
# print out the new data frame only
filter(flights, month == 1, day == 1)
# save the new data frame to a variable, no print out
jan1 <- filter(flights, month == 1, day == 1)
# save and print out the new data frame
(jan1 <- filter(flights, month == 1, day == 1))
© W Ma 2022 11/ 44
Selection criteria
Select observations using comparisons
I >, >=, <, <=, != (not equal), and == (equal).
Multiple arguments to filter() are combined with logical
operations
I & is “and”, | is “or”, and ! is “not”
Complete set of boolean operations
© W Ma 2022 12/ 44
Combination of selection criteria
filter(flights, month == 11 | month == 12)
filter(flights, month %in% c(11, 12))
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
© W Ma 2022 13/ 44
Peculiarities of floating-point numbers
I The R floating-point data type is a double, a.k.a. numeric
I The more bits in the fraction part, the more precision
I Rounding errors tend to accumulate in long calculations
I When results should be ≈ 0, errors can flip signs
0.45 == 3*0.15
## [1] FALSE
0.45 – 3*0.15
## [1] 5.551115e-17
sqrt(2)ˆ2 == 2
## [1] FALSE
sqrt(2)ˆ2 – 2
## [1] 4.440892e-16
1/49*49 == 1
## [1] FALSE
1/49*49 – 1
## [1] -1.110223e-16
© W Ma 2022 14/ 44
I Usually better to use all.equal() or near() instead of exact
comparison
0.45 == 3*0.15
## [1] FALSE
all.equal(0.45, 3*0.15)
## [1] TRUE
near(0.45, 3*0.15)
## [1] TRUE
© W Ma 2022 15/ 44
Missing values – NA (“not available”)
NA represents an unknown value so missing values are “contagious”
I Almost any operation involving an unknown value will also be unknown.
NA > 5
## [1] NA
10 == NA
## [1] NA
NA + 10
## [1] NA
NA / 2
## [1] NA
NA == NA
## [1] NA
I If you want to determine if a value is missing, use is.na()
x <- NA
is.na(x)
## [1] TRUE
© W Ma 2022 16/ 44
filter() with NA values
I filter() only includes rows where the condition is TRUE
I It excludes both FALSE and NA values.
I If you want to preserve missing values, ask for them explicitly:
(df <- tibble(x = c(1, NA, 3)))
## # A tibble: 3 x 1
## ## ## 1 ## 2 ## 3 |
x <dbl> 1 NA 3 |
filter(df, x > 1) ## # A tibble: 1 x 1 |
|
## ## ## 1 |
x <dbl> 3 |
filter(df, is.na(x) | x > 1) ## # A tibble: 2 x 1 |
|
## ## ## 1 ## 2 |
x <dbl> NA 3 |
© W Ma 2022 17/ 44
Count the number of NAs
(df <- tibble(x = c(1, NA, 3, NA)))
## # A tibble: 4 x 1
## ## ## 1 ## 2 ## 3 ## 4 |
x <dbl> 1 NA 3 NA |
filter(df, is.na(x)) %>% count() ## # A tibble: 1 x 1 |
|
## ## ## 1 |
n <int> 2 |
© W Ma 2022 18/ 44
Combine multiple operations with the pipe %>%
I Piping focuses on the transformations, not what’s being
transformed, which makes the code easier to read.
I x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z)
turns into g(f(x, y), z) and so on
df <- tibble(x = c(1, NA, 3, NA))
# step-by-step transformation
df_na <- filter(df, is.na(x))
count(df_na)
## # A tibble: 1 x 1
## n
## <int>
## 1 2
# PIPING
# filter(df, is.na(x)) %>% count()
df %>% filter(is.na(x)) %>% count()
## # A tibble: 1 x 1
## n
## <int>
## 1 2
© W Ma 2022 19/ 44
What can you do with dplyr?
I Pick observations by their values – filter()
I Reorder the rows – ‘arrange()’
I Pick variables by their names – select()
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of data
manipulation/wrangling.
© W Ma 2022 20/ 44
Reorder the rows – arrange()
I arrange() takes a data frame and a set of column names (or
more complicated expressions) to order by.
I When more than one column name are given, each additional
column will be used to break ties in the values of preceding
columns.
arrange(flights, year, month, day)
## # A tibble: 336,776 x 19
## ## |
year month | day dep_time sched_dep_time dep_delay arr_time sched_arr_tim | |||
<int> <int> <int> | <int> 517 533 542 544 554 554 555 557 557 558 |
<int> 515 529 540 545 600 558 600 600 600 600 |
<dbl> 2 4 2 -1 -6 -4 -5 -3 -3 -2 |
<int> 830 850 923 1004 812 740 913 709 838 753 |
<int 81 83 85 102 83 72 85 72 84 74 |
## 1 2013 ## 2 2013 ## 3 2013 ## 4 2013 ## 5 2013 ## 6 2013 ## 7 2013 ## 8 2013 ## 9 2013 ## 10 2013 |
1 1 1 1 1 1 1 1 1 1 |
1 1 1 1 1 1 1 1 1 1 |
|||
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>, | |||||
## # ## # |
carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, | ||||
air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm |
© W Ma 2022 21/ 44
Descending order – desc()
I By default, arrange() sorts the columns in an ascending order
I To re-order by a column in descending order, use desc()
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
## ## |
year month | day dep_time sched_dep_time dep_delay arr_time sched | ||
<int> <int> <int> | <int> 641 1432 1121 1139 845 1100 2321 959 2257 756 |
<int> 900 1935 1635 1845 1600 1900 810 1900 759 1700 |
<dbl> 1301 1137 1126 1014 1005 960 911 899 898 896 |
<int> 1242 1607 1239 1457 1044 1342 135 1236 121 1058 |
## 1 2013 ## 2 2013 ## 3 2013 ## 4 2013 ## 5 2013 ## 6 2013 ## 7 2013 ## 8 2013 ## 9 2013 ## 10 2013 |
1 6 1 9 7 4 3 6 7 12 |
9 15 10 20 22 10 17 27 22 5 |
||
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>, | ||||
## # ## # |
carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <c | |||
air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_ho |
© W Ma 2022 22/ 44
Descending order – desc()
I desc() transforms one single vector only
I To sort multiple columns, use desc() on each column
# arrange(flights, desc(year, month, day)) # doesn’t work!
arrange(flights, desc(year), desc(month), desc(day))
## # A tibble: 336,776 x 19
## ## |
year month | day dep_time sched_dep_time dep_delay arr_time sched | ||
<int> <int> <int> | <int> 13 18 26 459 514 549 550 552 553 554 |
<int> 2359 2359 2245 500 515 551 600 600 600 550 |
<dbl> 14 19 101 -1 -1 -2 -10 -8 -7 4 |
<int> 439 449 129 655 814 925 725 811 741 1024 |
## 1 2013 ## 2 2013 ## 3 2013 ## 4 2013 ## 5 2013 ## 6 2013 ## 7 2013 ## 8 2013 ## 9 2013 ## 10 2013 |
12 12 12 12 12 12 12 12 12 12 |
31 31 31 31 31 31 31 31 31 31 |
||
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>, | ||||
## # ## # |
carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <c | |||
air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_ho |
© W Ma 2022 23/ 44
Reorder the rows with NAs
I Missing values are always sorted at the end
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
## # A tibble: 3 x 1
## ## ## 1 ## 2 ## 3 |
x <dbl> 2 5 NA |
arrange(df, desc(x)) ## # A tibble: 3 x 1 |
|
## ## ## 1 ## 2 ## 3 |
x <dbl> 5 2 NA |
© W Ma 2022 24/ 44
What can you do with dplyr?
I Pick observations by their values – filter()
I Reorder the rows – arrange()
I Pick variables by their names – ‘select()’
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of data
manipulation/wrangling.
© W Ma 2022 25/ 44
Pick variables by their names – select()
I When you have large dataset (hundreds or more variables), the
first challenge is often narrowing in on the variables you’re
actually interested in.
I select() allows you to rapidly zoom in on a useful subset
using operations based on the names of the variables.
© W Ma 2022 26/ 44
Pick variables by their names – select()
# Select columns by name
# select(flights, year, month, day)
# Select all columns between year and day (inclusive)
select(flights, year:day)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # … with 336,766 more rows
© W Ma 2022 27/ 44
Pick variables by their names – select()
# Remove columns by names
# select(data, -year, -month, -day)
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
## # A tibble: 336,776 x 16
## ## ## 1 ## 2 ## 3 ## 4 ## 5 ## 6 ## 7 ## 8 ## 9 ## 10 |
dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_dela | ||||
<int> 517 533 542 544 554 554 555 557 557 558 |
<int> 515 529 540 545 600 558 600 600 600 600 |
<dbl> 2 4 2 -1 -6 -4 -5 -3 -3 -2 |
<int> 830 850 923 1004 812 740 913 709 838 753 |
<int> 819 830 850 1022 837 728 854 723 846 745 |
<dbl 1 2 3 -1 -2 1 1 -1 – |
## # … with 336,766 more rows, and 9 more variables: flight <int>, | |||||
## # ## # |
tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance | ||||
hour <dbl>, minute <dbl>, time_hour <dttm> |
© W Ma 2022 28/ 44
Pick variables by their names – select()
There are a number of functions you can use within select():
I starts_with(“abc”): matches names that begin with “abc”.
I ends_with(“xyz”): matches names that end with “xyz”.
I contains(“ijk”): matches names that contain “ijk”.
I matches(“(.)\1”): selects variables that match a regular
expression. This one matches any variables that contain
repeated characters. Learn more about stringr @
https://r4ds.had.co.nz/strings.html
I num_range(“x”, 1:3): matches x1, x2 and x3. It is useful
when numbers were included in column names.
I one_of(…): selects columns names that are from a group of
names. It is useful when columns are named as a vector or
character string.
I everything(): selects all columns.
See ’?select’ for more details.
© W Ma 2022 29/ 44
Rename variables – rename()
rename(), which is a variant of select() that keeps all the
variables that aren’t explicitly mentioned.
# select(flights, tail_num = tailnum) # only one variable left
rename(flights, tail_num = tailnum)
## # A tibble: 336,776 x 19
## ## |
year month | day dep_time sched_dep_time dep_delay arr_time sched | ||
<int> <int> <int> | <int> 517 533 542 544 554 554 555 557 557 558 |
<int> 515 529 540 545 600 558 600 600 600 600 |
<dbl> 2 4 2 -1 -6 -4 -5 -3 -3 -2 |
<int> 830 850 923 1004 812 740 913 709 838 753 |
## 1 2013 ## 2 2013 ## 3 2013 ## 4 2013 ## 5 2013 ## 6 2013 ## 7 2013 ## 8 2013 ## 9 2013 ## 10 2013 |
1 1 1 1 1 1 1 1 1 1 |
1 1 1 1 1 1 1 1 1 1 |
||
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>, | ||||
## # ## # |
carrier <chr>, flight <int>, tail_num <chr>, origin <chr>, dest < air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_ho |
© W Ma 2022 30/ 44
What can you do with dplyr?
I Pick observations by their values – filter()
I Reorder the rows – arrange()
I Pick variables by their names – select()
I Create new variables with functions of existing variables –
‘mutate()’
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of data
manipulation/wrangling.
© W Ma 2022 31/ 44
Create new variables – mutate()
I mutate() adds new columns with functions of existing columns
I mutate() always adds new columns at the end of your dataset
flights_sml <- select(flights, month:day, ends_with(“delay”),
distance, air_time)
mutate(flights_sml,
gain = arr_delay – dep_delay,
speed = distance / air_time * 60)
## # A tibble: 336,776 x 8
## ## ## 1 ## 2 ## 3 ## 4 ## 5 ## 6 ## 7 ## 8 ## 9 ## 10 |
month | day dep_delay arr_delay distance air_time gain speed | ||
<int> <int> | <dbl> 2 4 2 -1 -6 -4 -5 -3 -3 -2 |
<dbl> 11 20 33 -18 -25 12 19 -14 -8 8 |
<dbl> 1400 1416 1089 1576 762 719 1065 229 944 733 |
<dbl> <dbl> <dbl> |
1 1 1 1 1 1 1 1 1 1 |
1 1 1 1 1 1 1 1 1 1 |
227 227 160 183 116 150 158 53 140 138 |
9 370. 16 374. 31 408. -17 517. -19 394. 16 288. 24 404. -11 259. -5 405. 10 319. |
## # … with 336,766 more rows
© W Ma 2022 32/ 44
Create new variables – transmute()
If you only want to keep the new variables, use transmute()
transmute(flights,
gain = arr_delay – dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours)
## # A tibble: 336,776 x 3
## ## ## 1 ## 2 ## 3 ## 4 ## 5 ## 6 ## 7 ## 8 ## 9 ## 10 |
gain hours gain_per_hour |
<dbl> <dbl> 9 3.78 16 3.78 31 2.67 -17 3.05 -19 1.93 16 2.5 24 2.63 -11 0.883 -5 2.33 10 2.3 |
<dbl> 2.38 4.23 11.6 -5.57 -9.83 6.4 9.11 -12.5 -2.14 4.35 |
## # … with 336,766 more rows
© W Ma 2022 33/ 44
Useful creation functions
I Arithmetic operators: +, –, *, /, ˆ
I Modular arithmetic: %/% (integer division) and %% (remainder),
where x == y * (x %/% y) + (x %% y).
I It is a handy tool to break integers up into pieces.
transmute(flights, dep_time, hour = dep_time %/% 100,
minute = dep_time %% 100)
## # A tibble: 336,776 x 3
## dep_time hour minute
## <int> <dbl> <dbl>
## 1 517 5 17
## 2 533 5 33
## 3 542 5 42
## 4 544 5 44
## 5 554 5 54
## 6 554 5 54
## 7 555 5 55
## 8 557 5 57
## 9 557 5 57
## 10 558 5 58
## # … with 336,766 more rows
© W Ma 2022 34/ 44
Useful creation functions
I Logs: log(), log2(), log10()
I Offsets: lead() and lag() allow you to refer to leading or
lagging values
I This allows you to compute running differences (e.g. x –
lag(x)) or find when values change (x != lag(x)).
I They are most useful in conjunction with group_by().
(x <- 1:10)
## [1] 1 2 3 4 5 6 7 8 9 10
lead(x, 1)
## [1] 2 3 4 5 6 7 8 9 10 NA
lead(x, 2)
## [1] 3 4 5 6 7 8 9 10 NA NA
lag(x, 1)
## [1] NA 1 2 3 4 5 6 7 8 9
lag(x, 2)
## [1] NA NA 1 2 3 4 5 6 7 8
x – lag(x)
## [1] NA 1 1 1 1 1 1 1 1 1
© W Ma 2022 35/ 44
Useful creation functions
I Cumulative and rolling aggregates
I R provides functions for running sums, products, mins and
maxes: cumsum(), cumprod(), cummin(), cummax()
I dplyr provides cummean() for cumulative means.
x
## [1] 1 2 3 4 5 6 7 8 9 10
cumsum(x)
## [1] 1 3 6 10 15 21 28 36 45 55
cummean(x)
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
© W Ma 2022 36/ 44
Useful creation functions
I Logical comparisons, <, <=, >, >=, ==, !=.
I Ranking: there are a number of ranking functions, but you
should start with min_rank().
I Look at the variants row_number(), dense_rank(),
percent_rank(), cume_dist(), ntile()
y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
## [1] 1 2 2 NA 4 5
desc(y) # transform a vector into a format that will be sorted in descen
## [1] -1 -2 -2 NA -3 -4
min_rank(desc(y))
## [1] 5 3 3 NA 2 1
© W Ma 2022 37/ 44
What can you do with dplyr?
I Pick observations by their values – filter()
I Reorder the rows – arrange()
I Pick variables by their names – select()
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary – ‘summarise()’
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of data
manipulation/wrangling.
© W Ma 2022 38/ 44
Grouped summaries with summarise()
I summarise() collapses a data frame to a single row:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
## delay
## <dbl>
## 1 12.6
© W Ma 2022 39/ 44
Grouped summaries with summarise()
I summarize() is useful when pair it with group_by()
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
## `summarise()` has grouped output by ‘year’, ‘month’. You can override
## `.groups` argument.
## # A tibble: 365 x 4
## # Groups: | year, month [12] | |
## ## |
year month | day delay |
<int> <int> <int> <dbl> | ||
## 1 2013 ## 2 2013 ## 3 2013 ## 4 2013 ## 5 2013 ## 6 2013 ## 7 2013 ## 8 2013 ## 9 2013 ## 10 2013 |
1 1 1 1 1 1 1 1 1 1 |
1 11.5 2 13.9 3 11.0 4 8.95 5 5.73 6 7.15 7 5.42 8 2.55 9 2.28 10 2.84 |
## # … with 355 more rows
© W Ma 2022 40/ 44
Useful summary functions
I Measures of location: mean(x), median(x)
I sometimes useful to combine aggregation with logical
subsetting.
flights %>% group_by(year, month, day) %>%
summarise(avg_delay1 = mean(arr_delay, na.rm = T),
# the average positive delay
avg_delay2 = mean(arr_delay[arr_delay > 0], na.rm = T))
## `summarise()` has grouped output by ‘year’, ‘month’. You can override
## `.groups` argument.
## # A tibble: 365 x 5
## # Groups: | year, month [12] | ||
## ## |
year month | day avg_delay1 avg_delay2 | 41/ 44 |
<int> <int> <int> | <dbl> 12.7 12.7 5.73 -1.93 -1.53 4.24 -4.95 -3.23 -0.264 |
<dbl> 32.5 32.0 27.7 28.3 22.6 24.4 27.8 20.8 25.6 |
|
## 1 2013 ## 2 2013 ## 3 2013 ## 4 2013 ## 5 2013 ## 6 2013 ## 7 2013 ## 8 2013 ## 9 2013 © W Ma 2022 |
1 1 1 1 1 1 1 1 1 |
1 2 3 4 5 6 7 8 9 |
Useful summary functions
I Measures of spread: sd(x), IQR(x) (interquartile range),
mad(x) (median absolute deviation)
I Measures of rank: min(x), quantile(x, 0.25), max(x)
I Measures of position: first(x), nth(x, 2), last(x)
I These work similarly to x[1], x[2], and x[length(x)] but let
you set a default value if that position does not exist
I Counts: n(), sum(!is.na(x)), n_distinct(x)
I Combine summary functions with logical values: sum(x > 10),
mean(y == 0).
© W Ma 2022 42/ 44
Summarise multiple columns
# count # of NAs per column
colSums(is.na(flights))
## ## ## ## ## ## ## ## |
year | month 0 |
day 0 |
dep_time sched_dep_time | ||
0 | 8255 arr_delay 9430 dest |
0 carrier 0 air_time 9430 |
||||
dep_delay 8255 flight |
arr_time sched_arr_time | |||||
8713 tailnum 2512 hour 0 |
0 origin 0 minute 0 |
|||||
0 | 0 | |||||
distance | time_hour | |||||
0 | 0 | |||||
flights %>% summarise_all(funs(sum(is.na(.)))) %>% print(width=Inf) | ||||||
## # A tibble: 1 x 19 | ||||||
## ## ## 1 ## ## ## 1 ## ## ## 1 |
year month | day dep_time sched_dep_time dep_delay arr_time sched_arr_time | ||||
<int> <int> <int> | <int> 8255 |
<int> 0 |
<int> 8255 |
<int> 8713 |
<int> | |
0 | 0 | 0 | 0 | |||
arr_delay carrier flight tailnum origin dest air_time distance hour minut | ||||||
<int> 9430 time_hour <int> 0 |
<int> <int> | <int> <int> <int> | <int> 9430 |
<int> <int> <int | ||
0 | 0 | 2512 | 0 | 0 | 0 | 0 |
© W Ma 2022 43/ 44
Summary – dplyr basic functions
I Pick observations by their values – filter()
I Reorder the rows – arrange()
I Pick variables by their names – ‘select()’
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of data
manipulation/wrangling.
© W Ma 2022 44/ 44