STAT 167 Introduction to Data Science
Exploratory data analysis
Wenxiu Ma
[email protected]
4/28/2022
© W Ma 2022 1/ 51
Exploratory data analysis
Wickham and Grolemund. “R for Data Science: Import, Tidy,
Transform, Visualize, and Model Data.” 1st Edition (2017)
I Chapter 5 “Exploratory Data Analysis” in print.
I Chapter 7 “Exploratory Data Analysis” @
http://r4ds.had.co.nz/exploratory-data-analysis.html
© W Ma 2022 2/ 51
Recap – complete graphing template in ggplot2
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>, | # optional |
position = <POSITION> # optional | |
) + | |
<COORDINATE_FUNCTION> + # optional | |
<FACET_FUNCTION> + <SCALE_FUNCTION> + <THEME_FUNCTION> |
# optional # optional # optional |
© W Ma 2022 3/ 51
Recap – dplyr basic functions
I Pick observations by their values – filter()
I Reorder the rows – arrange()
I Pick variables by their names – select()
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of data
manipulation/wrangling.
© W Ma 2022 4/ 51
Recap – All dplyr functions share similar grammar
I The first argument is a data frame.
I The subsequent arguments describe what to do with the data
frame, using the variable names (without quotes).
I The result is a new data frame.
Together these properties make it easy to chain together multiple
simple steps to achieve a complex result – piping (%>%)!
© W Ma 2022 5/ 51
Combining the power of ggplot2 & dplyr
A typical data science project workflow
I Data exploration is the art of looking at your data, rapidly
generating hypotheses, quickly testing them, then repeating
again and again and again.
© W Ma 2022 6/ 51
Data set: nycflights13::flights
This data frame contains all 336,776 flights that departed from New York City in
2013. The data comes from the US Bureau of Transportation Statistics.
library(nycflights13)
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # … with 336,766 more rows, and 12 more variables:
## # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time © W Ma 2022 7/ 51
Exercise #1 – Study average delay per destination
Imagine that we want to explore the relationship between the
distance and average delay for each destination.
© W Ma 2022 8/ 51
Exercise #1 – Study average delay per destination
Imagine that we want to explore the relationship between the
distance and average delay for each destination.
-20
0
20
40
0 1000 2000 3000 4000 5000
dist
delay
count
4000
8000
12000
16000
© W Ma 2022 9/ 51
Calculate average delay per destination
by_dest <- group_by(flights, dest)
delays_by_dest <- summarise(by_dest, count = n(),
dist = mean(distance),
delay = mean(arr_delay))
delays_by_dest
## # A tibble: 105 x 4
## ## |
dest count dist delay <chr> <int> <dbl> <dbl> |
|
## 1 ABQ ## 2 ACK ## 3 ALB ## 4 ANC ## 5 ATL ## 6 AUS ## 7 AVL ## 8 BDL ## 9 BGR ## 10 BHM |
254 1826 | 4.38 |
265 199 NA 439 143 NA 8 3370 -2.5 17215 757. NA 2439 1514. NA 275 584. NA 443 116 NA 375 378 NA 297 866. NA |
## # … with 95 more rows
© W Ma 2022 10/ 51
How many missing values?
filter(delays_by_dest, is.na(delay))
## # A tibble: 100 x 4
## ## |
dest count dist delay <chr> <int> <dbl> <dbl> |
|||
## 1 ACK ## 2 ALB ## 3 ATL ## 4 AUS ## 5 AVL ## 6 BDL ## 7 BGR ## 8 BHM ## 9 BNA ## 10 BOS |
265 199 439 143 17215 757. 2439 1514. 275 584. 443 116 375 378 297 866. 6333 758. 15508 191. |
NA NA NA NA NA NA NA NA NA NA |
||
## # … with 90 more rows | ||||
colSums(is.na(delays_by_dest)) # count # of NAs per column | ||||
## dest count dist delay | ||||
## | 0 | 0 | 0 | 100 |
© W Ma 2022 11/ 51
Deal with missing values
1. use na.rm = TRUE in the summary function
delays_by_dest1 <- summarise(by_dest, count = n(),
dist = mean(distance),
delay = mean(arr_delay, na.rm = T))
delays_by_dest1
## # A tibble: 105 x 4
## ## |
dest count dist delay <chr> <int> <dbl> <dbl> |
|
## 1 ABQ ## 2 ACK ## 3 ALB ## 4 ANC ## 5 ATL ## 6 AUS ## 7 AVL ## 8 BDL ## 9 BGR ## 10 BHM |
254 1826 265 199 |
4.38 4.85 |
439 143 14.4 8 3370 -2.5 17215 757. 11.3 2439 1514. 6.02 275 584. 8.00 |
||
443 116 375 378 |
7.05 8.03 |
|
297 866. 16.9 |
## # … with 95 more rows
© W Ma 2022 12/ 51
Deal with missing values
2. filter out missing values from flights
# count # of NAs per column
colSums(is.na(flights))
## ## |
year 0 |
month 0 dep_delay 8255 carrier 0 dest 0 minute 0 |
day 0 |
dep_time 8255 |
## sched_dep_time | arr_time sched_arr_time | |||
## ## ## ## ## ## ## |
0 arr_delay 9430 origin 0 hour 0 |
8713 flight 0 air_time 9430 time_hour 0 |
0 tailnum 2512 distance 0 |
© W Ma 2022 13/ 51
Deal with missing values
2. filter out missing values from flights
filter(flights, is.na(arr_delay))
## # A tibble: 9,430 x 19
## ## |
year month | day dep_time sched_dep_time dep_delay arr_time | ||
<int> <int> <int> | <int> 1525 1528 1740 1807 1939 1952 2016 NA NA NA |
<int> 1530 1459 1745 1738 1840 1930 1930 1630 1935 1500 |
<dbl> -5 29 -5 29 59 22 46 NA NA NA |
<int> 1934 2002 2158 2251 29 2358 NA NA NA NA |
## 1 2013 ## 2 2013 ## 3 2013 ## 4 2013 ## 5 2013 ## 6 2013 ## 7 2013 ## 8 2013 ## 9 2013 ## 10 2013 |
1 1 1 1 1 1 1 1 1 1 |
1 1 1 1 1 1 1 1 1 1 |
||
## # … with 9,420 more rows, and 12 more variables: | ||||
## # ## # ## # ## # |
sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, |
|||
air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, | ||||
time_hour <dttm> |
© W Ma 2022 14/ 51
Deal with missing values
2. filter out missing values from flights
filter(flights, is.na(arr_delay)) %>% select(dep_time:arr_delay)
## # A tibble: 9,430 x 6
## ## ## 1 ## 2 ## 3 ## 4 ## 5 ## 6 ## 7 ## 8 ## 9 ## 10 |
dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay | ||||
<int> 1525 1528 1740 1807 1939 1952 2016 NA NA NA |
<int> 1530 1459 1745 1738 1840 1930 1930 1630 1935 1500 |
<dbl> -5 29 -5 29 59 22 46 NA NA NA |
<int> 1934 2002 2158 2251 29 2358 NA NA NA NA |
<int> 1805 1647 2020 2103 2151 2207 2220 1815 2240 1825 |
<dbl> NA NA NA NA NA NA NA NA NA NA |
## # … with 9,420 more rows
I NAs in dep_delay or arr_delay represent canceled flights
© W Ma 2022 15/ 51
Data cleaning – missing values
2. filter out missing values from flights and use non-canceled
flights instead
# NAs in dep_delay and arr_delay represent canceled flights
(not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay)))
## # A tibble: 327,346 x 19
## ## |
year month | day dep_time sched_dep_time dep_delay arr_time | ||
<int> <int> <int> | <int> 517 533 542 544 554 554 555 557 557 558 |
<int> 515 529 540 545 600 558 600 600 600 600 |
<dbl> 2 4 2 -1 -6 -4 -5 -3 -3 -2 |
<int> 830 850 923 1004 812 740 913 709 838 753 |
## 1 2013 ## 2 2013 ## 3 2013 ## 4 2013 ## 5 2013 ## 6 2013 ## 7 2013 ## 8 2013 ## 9 2013 ## 10 2013 |
1 1 1 1 1 1 1 1 1 1 |
1 1 1 1 1 1 1 1 1 1 |
||
## # … with 327,336 more rows, and 12 more variables: | ||||
## # ## # |
sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, |
|||
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, © W Ma 2022 16/ 51 |
||||
## # | ti |
Data cleaning – missing values
2. filter out missing values from flights and use non-canceled
flights instead
not_canceled %>% group_by(dest) %>%
summarise(count = n(), dist = mean(distance),
delay = mean(arr_delay))
## # A tibble: 104 x 4
## ## |
dest count dist delay <chr> <int> <dbl> <dbl> |
|
## 1 ABQ ## 2 ACK ## 3 ALB ## 4 ANC ## 5 ATL ## 6 AUS ## 7 AVL ## 8 BDL ## 9 BGR ## 10 BHM |
254 1826 264 199 |
4.38 4.85 |
418 143 14.4 8 3370 -2.5 16837 757. 11.3 2411 1514. 6.02 261 584. 8.00 |
||
412 116 358 378 |
7.05 8.03 |
|
269 866. 16.9 |
## # … with 94 more rows
© W Ma 2022 17/ 51
Summary – dealing with missing values
1. use na.rm = TRUE in the summary function
(delays_by_dest1 <- flights %>% group_by(dest) %>%
summarise(count = n(), dist = mean(distance),
delay = mean(arr_delay, na.rm = T)))
## # A tibble: 105 x 4
## ## |
dest count dist delay <chr> <int> <dbl> <dbl> |
|
## 1 ABQ ## 2 ACK ## 3 ALB ## 4 ANC ## 5 ATL ## 6 AUS ## 7 AVL ## 8 BDL ## 9 BGR ## 10 BHM |
254 1826 265 199 |
4.38 4.85 |
439 143 14.4 8 3370 -2.5 17215 757. 11.3 2439 1514. 6.02 275 584. 8.00 |
||
443 116 375 378 |
7.05 8.03 |
|
297 866. 16.9 |
## # … with 95 more rows
© W Ma 2022 18/ 51
Summary – dealing with missing values
2. filter out missing values from flights and use non-canceled flights
instead
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
(delays_by_dest2 <- not_canceled %>% group_by(dest) %>%
summarise(count = n(), dist = mean(distance),
delay = mean(arr_delay)))
## # A tibble: 104 x 4
## ## |
dest count dist delay <chr> <int> <dbl> <dbl> |
|
## 1 ABQ ## 2 ACK ## 3 ALB ## 4 ANC ## 5 ATL ## 6 AUS ## 7 AVL ## 8 BDL ## 9 BGR ## 10 BHM |
254 1826 264 199 |
4.38 4.85 |
418 143 14.4 8 3370 -2.5 16837 757. 11.3 2411 1514. 6.02 261 584. 8.00 |
||
412 116 358 378 |
7.05 8.03 |
|
269 866. 16.9 |
## # … with 94 more rows
© W Ma 2022 19/ 51
Are the results different?
I An inner join matches pairs of observations whenever their
keys are equal.
inner_join(delays_by_dest1, delays_by_dest2, by = “dest”)
## # A tibble: 104 x 7
## ## |
dest count.x dist.x delay.x count.y dist.y delay.y | |||
<chr> | <int> <dbl> 254 1826 |
<dbl> 4.38 4.85 14.4 -2.5 11.3 6.02 8.00 7.05 8.03 16.9 |
<int> <dbl> 254 1826 |
<dbl> 4.38 4.85 14.4 -2.5 11.3 6.02 8.00 7.05 8.03 16.9 |
## 1 ABQ ## 2 ACK ## 3 ALB ## 4 ANC ## 5 ATL ## 6 AUS ## 7 AVL ## 8 BDL ## 9 BGR ## 10 BHM |
||||
265 439 |
199 143 |
264 418 |
199 143 |
|
8 3370 | 8 3370 | |||
17215 | 757. | 16837 | 757. | |
2439 1514. | 2411 1514. | |||
275 443 375 297 |
584. 116 378 866. |
261 412 358 269 |
584. 116 378 866. |
## # … with 94 more rows
© W Ma 2022 20/ 51
Are the results different?
I anti_join(x, y) drops all observations in x that have a
match in y.
anti_join(delays_by_dest1, delays_by_dest2, by = “dest”)
## # A tibble: 1 x 4
## ## |
dest count dist delay <chr> <int> <dbl> <dbl> |
||
## 1 LGA | 1 | 17 | NaN |
© W Ma 2022 21/ 51
Why is there a NaN?
filter(flights, dest == “LGA”)
## # A tibble: 1 x 19
## ## |
year month | day dep_time sched_dep_time dep_delay arr_time | ||
<int> <int> <int> | <int> NA |
<int> 106 |
<dbl> NA |
<int> NA |
## 1 2013 | 7 | 27 | ||
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>, | ||||
## # ## # ## # |
carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, |
|||
minute <dbl>, time_hour <dttm> |
© W Ma 2022 22/ 51
Why is there a NaN?
I na.rm = TRUE will produce NaN when all values are missing
# filter(flights, dest == “LGA”) %>% str()
filter(flights, dest == “LGA”) %>% glimpse()
## Rows: 1
## Columns: 19
## $ year <int> 2013
## $ month <int> 7
## $ day <int> 27
## $ dep_time <int> NA
## $ sched_dep_time <int> 106
## $ dep_delay <dbl> NA
## $ arr_time <int> NA
## $ sched_arr_time <int> 245
## $ arr_delay <dbl> NA
## $ carrier <chr> “US”
## $ flight <int> 1632
## $ tailnum <chr> NA
## $ origin <chr> “EWR”
## $ dest <chr> “LGA”
## $ air_time <dbl> NA
## $ distance <dbl> 17
## $ hour <dbl> 1
## $ minute <dbl> 6
## $ time_hour <dttm> 2013-07-27 01:00:00
© W Ma 2022 23/ 51
Exercise #1 – Study average delay per destination
Imagine that we want to explore the relationship between the
distance and average delay for each destination.
-20
0
20
40
0 1000 2000 3000 4000 5000
dist
delay
count
4000
8000
12000
16000
© W Ma 2022 24/ 51
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
(delays_by_dest2 <- not_canceled %>% group_by(dest) %>%
summarise(count = n(), dist = mean(distance),
delay = mean(arr_delay)))
## # A tibble: 104 x 4
## ## |
dest count dist delay <chr> <int> <dbl> <dbl> |
|
## 1 ABQ ## 2 ACK ## 3 ALB ## 4 ANC ## 5 ATL ## 6 AUS ## 7 AVL ## 8 BDL ## 9 BGR ## 10 BHM |
254 1826 264 199 |
4.38 4.85 |
418 143 14.4 8 3370 -2.5 16837 757. 11.3 2411 1514. 6.02 261 584. 8.00 |
||
412 116 358 378 |
7.05 8.03 |
|
269 866. 16.9 |
## # … with 94 more rows
© W Ma 2022 25/ 51
ggplot(data = delays_by_dest2, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = ‘loess’ and formula ‘y ~ x’
-20
0
20
40
0 1000 2000 3000 4000 5000
dist
delay
count
4000
8000
12000
16000
© W Ma 2022 26/ 51
What are the two points with largest distances?
arrange(delays_by_dest2, desc(dist))
## # A tibble: 104 x 4
## ## |
dest count dist <chr> <int> <dbl> |
delay <dbl> |
## 1 HNL ## 2 ANC ## 3 SFO ## 4 OAK ## 5 SJC ## 6 SMF ## 7 LAX ## 8 BUR ## 9 LGB ## 10 PDX |
701 4973. -1.37 8 3370 -2.5 13173 2578. 2.67 |
|
309 2576 328 2569 |
3.08 3.45 |
|
282 2521 12.1 16026 2469. 0.547 |
||
370 2465 | 8.18 | |
661 2465 -0.0620 1342 2446. 5.14 |
## # … with 94 more rows
© W Ma 2022 27/ 51
What are the two points with largest distances?
top_n(delays_by_dest2, 2, dist)
## # A tibble: 2 x 4
## ## |
dest count dist delay <chr> <int> <dbl> <dbl> |
## 1 ANC ## 2 HNL |
8 3370 -2.5 701 4973. -1.37 |
© W Ma 2022 28/ 51
Detect outliers in delays_by_dest2$dist
ggplot(data = delays_by_dest2) +
geom_histogram(mapping = aes(x = dist))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
5 0
10
0 1000 2000 3000 4000 5000
dist
count
ggplot(data = delays_by_dest2) +
geom_boxplot(mapping = aes( x = “”, y = dist)) +
coord_flip()
0 1000 2000 3000 4000 5000
dist
© x
W Ma 2022 29/ 51
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
delays_by_dest_filtered <- not_canceled %>% group_by(dest) %>%
summarise(count = n(), dist = mean(distance),
delay = mean(arr_delay)) %>%
filter(dest != “HNL”, dest != “ANC”)
ggplot(data = delays_by_dest_filtered,
mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) + geom_smooth(se = F)
## `geom_smooth()` using method = ‘loess’ and formula ‘y ~ x’
-20
0
20
40
0 1000 2000
dist
delay
count
4000
8000
12000
16000
© W Ma 2022 30/ 51
Problem with small n
I Whenever you do any aggregation, it’s always a good idea to
include either a count n(), or a count of non-missing values
sum(!is.na(x)).
I That way you can check that you’re not drawing conclusions
based on very small amounts of data.
© W Ma 2022 31/ 51
Distribution of delays_by_dest_filtered$count
ggplot(delays_by_dest_filtered) +
geom_histogram(mapping=aes(x=count))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
5 0
10
15
20
0 5000 10000 15000
count
count
ggplot(data = delays_by_dest_filtered) +
geom_boxplot(mapping = aes(x = “”, y = count)) + coord_flip()
0 5000 10000 15000
count
© x
W Ma 2022 32/ 51
Distribution of delays_by_dest_filtered$count
ggplot(delays_by_dest_filtered) +
geom_histogram(mapping = aes(x = log10(count)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
0.0
2.5
5.0
7.5
0 1 2 3 4
log10(count)
count
ggplot(data = delays_by_dest_filtered) +
geom_boxplot(mapping = aes(x = “”, y = log10(count))) + coord_flip()
0 1 2 3 4
log10(count)
© x
W Ma 2022 33/ 51
Detect small n in delays_by_dest_filtered$count
arrange(delays_by_dest_filtered, count)
## # A tibble: 102 x 4
## ## |
dest count dist delay <chr> <int> <dbl> <dbl> |
## 1 LEX ## 2 SBN ## 3 HDN ## 4 MTJ ## 5 EYW ## 6 PSP ## 7 JAC ## 8 BZN ## 9 CHO ## 10 MYR |
1 604 -22 |
10 645. 14 1728 14 1795 17 1207 |
6.5 2.14 1.79 6.35 |
18 2378 -12.7 21 1876. 28.1 |
|
35 1882 46 305 58 551. |
7.6 9.5 4.60 |
## # … with 92 more rows
© W Ma 2022 34/ 51
Data transformation & cleaning – step by step
1. Filter out canceled flights;
2. Group non-canceled flights by destination;
3. Summarise grouped data to compute the number of flights,
average distance, and average arrival delay per destination;
4. Filter out outliers (longest distances – HNL and ANC);
5. Filter out noisy points (small flight counts – rare destinations).
© W Ma 2022 35/ 51
Data transformation & cleaning – step by step
1. Filter out canceled flights;
2. Group non-canceled flights by destination;
3. Summarise grouped data to compute the number of flights,
average distance, and average arrival delay per destination;
4. Filter out outliers (longest distances – HNL and ANC);
5. Filter out noisy points (small flight counts – rare destinations).
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
not_canceled_by_dest <- group_by(not_canceled, dest)
delays_by_dest <- summarise(not_canceled_by_dest, count = n(),
dist = mean(distance),
delay = mean(arr_delay))
delays_by_dest_filtered <- filter(delays_by_dest,
dest != “HNL”, dest != “ANC”)
delays_by_dest_cleaned <- filter(delays_by_dest_filtered, count >= 10)
© W Ma 2022 36/ 51
PIPING %>%
# original code – step by step
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
not_canceled_by_dest <- group_by(not_canceled, dest)
delays_by_dest <- summarise(not_canceled_by_dest, count = n(),
dist = mean(distance), delay = mean(arr_delay))
delays_by_dest_filtered <- filter(delays_by_dest,
dest != “HNL”, dest != “ANC”)
delays_by_dest_cleaned <- filter(delays_by_dest_filtered, count >= 10)
# alternative code using piping %>%
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
delays_by_dest_cleaned <- not_canceled %>%
group_by(dest) %>%
summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
filter(dest != “HNL”, dest != “ANC”, count >= 10)
I You can even pipe your dplyr output into ggplot2!
© W Ma 2022 37/ 51
Finally, visualization with ggplot2
ggplot(data = delays_by_dest_cleaned,
mapping = aes(x = dist, y = delay)) +
geom_point(mapping = aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = ‘loess’ and formula ‘y ~ x’
0
20
40
0 1000 2000
dist
delay
count
4000
8000
12000
16000
© W Ma 2022 38/ 51
Comparison – filtering out outliers and noisy points
-20
0
20
40
0 1000 2000 3000 4000 5000
dist
delay
0
20
40
0 1000 2000
dist
delay
© W Ma 2022 39/ 51
Exercise #2 – Study the average delay per plane
Imagine that we want to investigate the distribution of average
delays among different planes.
© W Ma 2022 40/ 51
Exercise #2 – Study the average delay per plane
flights %>% glimpse()
## Rows: 336,776
## Columns: 19
## $ year ## $ month ## $ day ## $ dep_time |
<int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201~ <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~ <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~ <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, ~ |
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, ~ | |
## $ dep_delay ## $ arr_time |
<dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, ~ <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838,~ |
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846,~ | |
## $ arr_delay ## $ carrier ## $ flight ## $ tailnum ## $ origin ## $ dest ## $ air_time ## $ distance ## $ hour ## $ minute ## $ time_hour |
<dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2,~ <chr> “UA”, “UA”, “AA”, “B6”, “DL”, “UA”, “B6”, “EV~ <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, ~ <chr> “N14228”, “N24211”, “N619AA”, “N804JB”, “N668~ <chr> “EWR”, “LGA”, “JFK”, “JFK”, “LGA”, “EWR”, “EW~ <chr> “IAH”, “IAH”, “MIA”, “BQN”, “ATL”, “ORD”, “FL~ <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 1~ <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, ~ <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, ~ <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0~ <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 20~ |
© W Ma 2022 41/ 51
Exercise #2 – Study the average delay per plane
Data transformation – step by step
1. Filter out canceled flights;
2. Group non-canceled flights by tailnum;
3. Summarise grouped data to compute the number of flights and
average arrival delay per tailnum.
© W Ma 2022 42/ 51
Data transformation – step by step
1. Filter out canceled flights;
2. Group non-canceled flights by tailnum;
3. Summarise grouped data to compute the number of flights and average
arrival delay per tailnum.
(delays_by_tailnum <- not_canceled %>%
group_by(tailnum) %>%
summarise(count = n(), delay = mean(arr_delay)))
## # A tibble: 4,037 x 3
## tailnum count delay
## <chr> <int> <dbl>
## 1 D942DN 4 31.5
## 2 N0EGMQ 352 9.98
## 3 N10156 145 12.7
## 4 N102UW 48 2.94
## 5 N103US 46 -6.93
## 6 N104UW 46 1.80
## 7 N10575 269 20.7
## 8 N105UW 45 -0.267
## 9 N107US 41 -5.73
## 10 N108UW 60 -1.25
## # … with 4,027 more rows © W Ma 2022 43/ 51
Distribution of delays_by_tailnum$delay
ggplot(data = delays_by_tailnum, mapping = aes(x = delay)) +
geom_histogram(mapping = aes(y=..density..), binwidth = 5,
colour = “black”, fill = “white”) +
geom_density(alpha = .2, fill = “#FF6666”)
0.00
0.01
0.02
0.03
0.04
0 100 200 300
delay
© density
W Ma 2022 44/ 51
Problem with small n
ggplot(data = delays_by_tailnum,
mapping = aes(x = count, y = delay)) +
geom_point(alpha = .1)
0
100
200
300
0 200 400
count
© delay
W Ma 2022 45/ 51
Detect small n in delays_by_tailnum$count
ggplot(delays_by_tailnum) +
geom_histogram(mapping = aes(x = log10(count)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
0
100
200
300
0 1 2
log10(count)
count
ggplot(data = delays_by_tailnum) +
geom_boxplot(mapping = aes(x = “”, y = log10(count))) + coord_flip()
0 1 2
log10(count)
© x
W Ma 2022 46/ 51
Detect small n in delays_by_tailnum$count
delays_by_tailnum %>% count(count)
## # A tibble: 356 x 2
## count n
## <int> <int>
## 1 1 168
## 2 2 99
## 3 3 67
## 4 4 50
## 5 5 78
## 6 6 46
## 7 7 49
## 8 8 32
## 9 9 27
## 10 10 24
## # … with 346 more rows
I It was not a good idea to name a variable as an existing function name!!
© W Ma 2022 47/ 51
Remove noisy points
delays_by_tailnum_cleaned <- filter(delays_by_tailnum, count >=3)
ggplot(data = delays_by_tailnum_cleaned,
mapping = aes(x = count, y = delay)) +
geom_point(alpha = .1)
0
50
100
150
0 200 400
count
© delay
W Ma 2022 48/ 51
Remove noisy points
ggplot(data = delays_by_tailnum_cleaned, mapping = aes(x = delay)) +
geom_histogram(mapping = aes(y=..density..), binwidth = 5,
colour = “black”, fill = “white”) +
geom_density(alpha = .2, fill = “#FF6666”)
0.00
0.01
0.02
0.03
0.04
0.05
0 50 100 150
delay
© density
W Ma 2022 49/ 51
Comparison – filtering out noisy points
0.00
0.01
0.02
0.03
0.04
0 100 200 300
delay
density
0.00
0.01
0.02
0.03
0.04
0.05
0 50 100 150
delay
density
© W Ma 2022 50/ 51
Exploratory Data Analysis (EDA)
EDA is an iterative cycle.
You:
I Generate questions about your data.
I Search for answers by visualizing, transforming, and modeling
your data.
I Use what you learn to refine your questions and/or generate
new questions.
© W Ma 2022 51/ 51