# Introduction to Data Science

STAT 167 Introduction to Data Science
Exploratory data analysis
Wenxiu Ma
[email protected]
4/28/2022
© W Ma 2022 1/ 51
Exploratory data analysis
Wickham and Grolemund. “R for Data Science: Import, Tidy,
Transform, Visualize, and Model Data.
” 1st Edition (2017)
I Chapter 5 “Exploratory Data Analysis” in print.
I Chapter 7 “Exploratory Data Analysis” @
© W Ma 2022 2/ 51
Recap – complete graphing template in ggplot2
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),

 stat = , # optional position = # optional ) + + # optional + + # optional # optional # optional

© W Ma 2022 3/ 51
Recap – dplyr basic functions
I Pick observations by their values – filter()
I Reorder the rows – arrange()
I Pick variables by their names – select()
I Create new variables with functions of existing variables –
mutate()
I Collapse many values down to a single summary –
summarise()
I The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 4/ 51
Recap – All dplyr functions share similar grammar
I The first argument is a data frame.
I The subsequent arguments describe what to do with the data
frame, using the variable names (without quotes).
I The result is a new data frame.
Together these properties make it easy to chain together multiple
simple steps to achieve a complex result –
piping (%>%)!
© W Ma 2022 5/ 51
Combining the power of ggplot2 & dplyr
A typical data science project workflow
I Data exploration is the art of looking at your data, rapidly
generating hypotheses, quickly testing them, then repeating
again and again and again.
© W Ma 2022 6/ 51
Data set: nycflights13::flights
This data frame contains all 336,776 flights that departed from New York City in
2013. The data comes from the US Bureau of Transportation Statistics.
library(nycflights13)
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # … with 336,766 more rows, and 12 more variables:
## # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time
© W Ma 2022 7/ 51
Exercise #1 – Study average delay per destination
Imagine that we want to explore the relationship between the
distance and average delay for each destination.
© W Ma 2022 8/ 51
Exercise #1 – Study average delay per destination
Imagine that we want to explore the relationship between the
distance and average delay for each destination.
-20
0
20
40
0 1000 2000 3000 4000 5000
dist
delay
count
4000
8000
12000
16000
© W Ma 2022 9/ 51
Calculate average delay per destination
by_dest <- group_by(flights, dest)
delays_by_dest
<- summarise(by_dest, count = n(),
dist = mean(distance),
delay = mean(arr_delay))
delays_by_dest
## # A tibble: 105 x 4

 ## ## dest count dist delay ## 1 ABQ ## 2 ACK ## 3 ALB ## 4 ANC ## 5 ATL ## 6 AUS ## 7 AVL ## 8 BDL ## 9 BGR ## 10 BHM 254 1826 4.38 265 199 NA 439 143 NA 8 3370 -2.5 17215 757. NA 2439 1514. NA 275 584. NA 443 116 NA 375 378 NA 297 866. NA

## # … with 95 more rows
© W Ma 2022 10/ 51
How many missing values?
filter(delays_by_dest, is.na(delay))
## # A tibble: 100 x 4

 ## ## dest count dist delay ## 1 ACK ## 2 ALB ## 3 ATL ## 4 AUS ## 5 AVL ## 6 BDL ## 7 BGR ## 8 BHM ## 9 BNA ## 10 BOS 265 199 439 143 17215 757. 2439 1514. 275 584. 443 116 375 378 297 866. 6333 758. 15508 191. NA NA NA NA NA NA NA NA NA NA ## # … with 90 more rows colSums(is.na(delays_by_dest)) # count # of NAs per column ## dest count dist delay ## 0 0 0 100

© W Ma 2022 11/ 51
Deal with missing values
1. use na.rm = TRUE in the summary function
delays_by_dest1 <- summarise(by_dest, count = n(),
dist = mean(distance),
delay = mean(arr_delay, na.rm = T))
delays_by_dest1
## # A tibble: 105 x 4

 ## ## dest count dist delay ## 1 ABQ ## 2 ACK ## 3 ALB ## 4 ANC ## 5 ATL ## 6 AUS ## 7 AVL ## 8 BDL ## 9 BGR ## 10 BHM 254 1826 265 199 4.38 4.85 439 143 14.4 8 3370 -2.5 17215 757. 11.3 2439 1514. 6.02 275 584. 8.00 443 116 375 378 7.05 8.03 297 866. 16.9

## # … with 95 more rows
© W Ma 2022 12/ 51
Deal with missing values
2. filter out missing values from flights
# count # of NAs per column
colSums(is.na(flights))

 ## ## year 0 month 0 dep_delay 8255 carrier 0 dest 0 minute 0 day 0 dep_time 8255 ## sched_dep_time arr_time sched_arr_time ## ## ## ## ## ## ## 0 arr_delay 9430 origin 0 hour 0 8713 flight 0 air_time 9430 time_hour 0 0 tailnum 2512 distance 0

© W Ma 2022 13/ 51
Deal with missing values
2. filter out missing values from flights
filter(flights, is.na(arr_delay))
## # A tibble: 9,430 x 19

 ## ## year month day dep_time sched_dep_time dep_delay arr_time 1525 1528 1740 1807 1939 1952 2016 NA NA NA 1530 1459 1745 1738 1840 1930 1930 1630 1935 1500 -5 29 -5 29 59 22 46 NA NA NA 1934 2002 2158 2251 29 2358 NA NA NA NA ## 1 2013 ## 2 2013 ## 3 2013 ## 4 2013 ## 5 2013 ## 6 2013 ## 7 2013 ## 8 2013 ## 9 2013 ## 10 2013 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ## # … with 9,420 more rows, and 12 more variables: ## # ## # ## # ## # sched_arr_time , arr_delay , carrier , flight , tailnum , origin , dest , air_time , distance , hour , minute , time_hour

© W Ma 2022 14/ 51
Deal with missing values
2. filter out missing values from flights
filter(flights, is.na(arr_delay)) %>% select(dep_time:arr_delay)
## # A tibble: 9,430 x 6

 ## ## ## 1 ## 2 ## 3 ## 4 ## 5 ## 6 ## 7 ## 8 ## 9 ## 10 dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay 1525 1528 1740 1807 1939 1952 2016 NA NA NA 1530 1459 1745 1738 1840 1930 1930 1630 1935 1500 -5 29 -5 29 59 22 46 NA NA NA 1934 2002 2158 2251 29 2358 NA NA NA NA 1805 1647 2020 2103 2151 2207 2220 1815 2240 1825 NA NA NA NA NA NA NA NA NA NA

## # … with 9,420 more rows
I NAs in dep_delay or arr_delay represent canceled flights
© W Ma 2022 15/ 51
Data cleaning – missing values
2. filter out missing values from flights and use non-canceled
# NAs in dep_delay and arr_delay represent canceled flights
(not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay)))
## # A tibble: 327,346 x 19

 ## ## year month day dep_time sched_dep_time dep_delay arr_time 517 533 542 544 554 554 555 557 557 558 515 529 540 545 600 558 600 600 600 600 2 4 2 -1 -6 -4 -5 -3 -3 -2 830 850 923 1004 812 740 913 709 838 753 ## 1 2013 ## 2 2013 ## 3 2013 ## 4 2013 ## 5 2013 ## 6 2013 ## 7 2013 ## 8 2013 ## 9 2013 ## 10 2013 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ## # … with 327,336 more rows, and 12 more variables: ## # ## # sched_arr_time , arr_delay , carrier , flight , tailnum , origin , dest , ## # air_time , distance , hour , minute , © W Ma 2022 16/ 51 ## # ti

Data cleaning – missing values
2. filter out missing values from flights and use non-canceled
not_canceled %>% group_by(dest) %>%
summarise(
count = n(), dist = mean(distance),
delay = mean(arr_delay))
## # A tibble: 104 x 4

 ## ## dest count dist delay ## 1 ABQ ## 2 ACK ## 3 ALB ## 4 ANC ## 5 ATL ## 6 AUS ## 7 AVL ## 8 BDL ## 9 BGR ## 10 BHM 254 1826 264 199 4.38 4.85 418 143 14.4 8 3370 -2.5 16837 757. 11.3 2411 1514. 6.02 261 584. 8.00 412 116 358 378 7.05 8.03 269 866. 16.9

## # … with 94 more rows
© W Ma 2022 17/ 51
Summary – dealing with missing values
1. use na.rm = TRUE in the summary function
(delays_by_dest1 <- flights %>% group_by(dest) %>%
summarise(
count = n(), dist = mean(distance),
delay = mean(arr_delay, na.rm = T)))
## # A tibble: 105 x 4

 ## ## dest count dist delay ## 1 ABQ ## 2 ACK ## 3 ALB ## 4 ANC ## 5 ATL ## 6 AUS ## 7 AVL ## 8 BDL ## 9 BGR ## 10 BHM 254 1826 265 199 4.38 4.85 439 143 14.4 8 3370 -2.5 17215 757. 11.3 2439 1514. 6.02 275 584. 8.00 443 116 375 378 7.05 8.03 297 866. 16.9

## # … with 95 more rows
© W Ma 2022 18/ 51
Summary – dealing with missing values
2. filter out missing values from flights and use non-canceled flights
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
(delays_by_dest2
<- not_canceled %>% group_by(dest) %>%
summarise(
count = n(), dist = mean(distance),
delay = mean(arr_delay)))
## # A tibble: 104 x 4

 ## ## dest count dist delay ## 1 ABQ ## 2 ACK ## 3 ALB ## 4 ANC ## 5 ATL ## 6 AUS ## 7 AVL ## 8 BDL ## 9 BGR ## 10 BHM 254 1826 264 199 4.38 4.85 418 143 14.4 8 3370 -2.5 16837 757. 11.3 2411 1514. 6.02 261 584. 8.00 412 116 358 378 7.05 8.03 269 866. 16.9

## # … with 94 more rows
© W Ma 2022 19/ 51
Are the results different?
I An inner join matches pairs of observations whenever their
keys are equal.
inner_join(delays_by_dest1, delays_by_dest2, by = “dest”)
## # A tibble: 104 x 7

 ## ## dest count.x dist.x delay.x count.y dist.y delay.y 254 1826 4.38 4.85 14.4 -2.5 11.3 6.02 8.00 7.05 8.03 16.9 254 1826 4.38 4.85 14.4 -2.5 11.3 6.02 8.00 7.05 8.03 16.9 ## 1 ABQ ## 2 ACK ## 3 ALB ## 4 ANC ## 5 ATL ## 6 AUS ## 7 AVL ## 8 BDL ## 9 BGR ## 10 BHM 265 439 199 143 264 418 199 143 8 3370 8 3370 17215 757. 16837 757. 2439 1514. 2411 1514. 275 443 375 297 584. 116 378 866. 261 412 358 269 584. 116 378 866.

## # … with 94 more rows
© W Ma 2022 20/ 51
Are the results different?
I anti_join(x, y) drops all observations in x that have a
match in
y.
anti_join(delays_by_dest1, delays_by_dest2, by = “dest”)
## # A tibble: 1 x 4

 ## ## dest count dist delay ## 1 LGA 1 17 NaN

© W Ma 2022 21/ 51
Why is there a NaN?
filter(flights, dest == “LGA”)
## # A tibble: 1 x 19

 ## ## year month day dep_time sched_dep_time dep_delay arr_time NA 106 NA NA ## 1 2013 7 27 ## # … with 12 more variables: sched_arr_time , arr_delay , ## # ## # ## # carrier , flight , tailnum , origin , dest , air_time , distance , hour , minute , time_hour

© W Ma 2022 22/ 51
Why is there a NaN?
I na.rm = TRUE will produce NaN when all values are missing
# filter(flights, dest == “LGA”) %>% str()
filter(flights, dest == “LGA”) %>% glimpse()
## Rows: 1
## Columns: 19
## \$ year <int> 2013
## \$ month <int> 7
## \$ day <int> 27
## \$ dep_time <int> NA
## \$ sched_dep_time <int> 106
## \$ dep_delay <dbl> NA
## \$ arr_time <int> NA
## \$ sched_arr_time <int> 245
## \$ arr_delay <dbl> NA
## \$ carrier <chr> “US”
## \$ flight <int> 1632
## \$ tailnum <chr> NA
## \$ origin <chr> “EWR”
## \$ dest <chr> “LGA”
## \$ air_time <dbl> NA
## \$ distance <dbl> 17
## \$ hour <dbl> 1
## \$ minute <dbl> 6
## \$ time_hour <dttm> 2013-07-27 01:00:00
© W Ma 2022 23/ 51
Exercise #1 – Study average delay per destination
Imagine that we want to explore the relationship between the
distance and average delay for each destination.
-20
0
20
40
0 1000 2000 3000 4000 5000
dist
delay
count
4000
8000
12000
16000
© W Ma 2022 24/ 51
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
(delays_by_dest2
<- not_canceled %>% group_by(dest) %>%
summarise(
count = n(), dist = mean(distance),
delay = mean(arr_delay)))
## # A tibble: 104 x 4

 ## ## dest count dist delay ## 1 ABQ ## 2 ACK ## 3 ALB ## 4 ANC ## 5 ATL ## 6 AUS ## 7 AVL ## 8 BDL ## 9 BGR ## 10 BHM 254 1826 264 199 4.38 4.85 418 143 14.4 8 3370 -2.5 16837 757. 11.3 2411 1514. 6.02 261 584. 8.00 412 116 358 378 7.05 8.03 269 866. 16.9

## # … with 94 more rows
© W Ma 2022 25/ 51
ggplot(data = delays_by_dest2, mapping = aes(x = dist, y = delay)) +
geom_point(aes(
size = count), alpha = 1/3) +
geom_smooth(
se = FALSE)
## `geom_smooth()` using method = ‘loess’ and formula ‘y ~ x’
-20
0
20
40
0 1000 2000 3000 4000 5000
dist
delay
count
4000
8000
12000
16000
© W Ma 2022 26/ 51
What are the two points with largest distances?
arrange(delays_by_dest2, desc(dist))
## # A tibble: 104 x 4

 ## ## dest count dist delay ## 1 HNL ## 2 ANC ## 3 SFO ## 4 OAK ## 5 SJC ## 6 SMF ## 7 LAX ## 8 BUR ## 9 LGB ## 10 PDX 701 4973. -1.37 8 3370 -2.5 13173 2578. 2.67 309 2576 328 2569 3.08 3.45 282 2521 12.1 16026 2469. 0.547 370 2465 8.18 661 2465 -0.0620 1342 2446. 5.14

## # … with 94 more rows
© W Ma 2022 27/ 51
What are the two points with largest distances?
top_n(delays_by_dest2, 2, dist)
## # A tibble: 2 x 4

 ## ## dest count dist delay ## 1 ANC ## 2 HNL 8 3370 -2.5 701 4973. -1.37

© W Ma 2022 28/ 51
Detect outliers in delays_by_dest2\$dist
ggplot(data = delays_by_dest2) +
geom_histogram(
mapping = aes(x = dist))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
5 0
10
0 1000 2000 3000 4000 5000
dist
count
ggplot(data = delays_by_dest2) +
geom_boxplot(
mapping = aes( x = “”, y = dist)) +
coord_flip()
0 1000 2000 3000 4000 5000
dist
W Ma 2022 29/ 51
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
delays_by_dest_filtered
<- not_canceled %>% group_by(dest) %>%
summarise(
count = n(), dist = mean(distance),
delay = mean(arr_delay)) %>%
filter(dest !=
“HNL”, dest != “ANC”)
ggplot(
data = delays_by_dest_filtered,
mapping = aes(x = dist, y = delay)) +
geom_point(aes(
size = count), alpha = 1/3) + geom_smooth(se = F)
## `geom_smooth()` using method = ‘loess’ and formula ‘y ~ x’
-20
0
20
40
0 1000 2000
dist
delay
count
4000
8000
12000
16000
© W Ma 2022 30/ 51
Problem with small n
I Whenever you do any aggregation, it’s always a good idea to
include either a count
n(), or a count of non-missing values
sum(!is.na(x)).
I That way you can check that you’re not drawing conclusions
based on very small amounts of data.
© W Ma 2022 31/ 51
Distribution of delays_by_dest_filtered\$count
ggplot(delays_by_dest_filtered) +
geom_histogram(
mapping=aes(x=count))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
5 0
10
15
20
0 5000 10000 15000
count
count
ggplot(data = delays_by_dest_filtered) +
geom_boxplot(
mapping = aes(x = “”, y = count)) + coord_flip()
0 5000 10000 15000
count
W Ma 2022 32/ 51
Distribution of delays_by_dest_filtered\$count
ggplot(delays_by_dest_filtered) +
geom_histogram(
mapping = aes(x = log10(count)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
0.0
2.5
5.0
7.5
0 1 2 3 4
log10(count)
count
ggplot(data = delays_by_dest_filtered) +
geom_boxplot(
mapping = aes(x = “”, y = log10(count))) + coord_flip()
0 1 2 3 4
log10(count)
W Ma 2022 33/ 51
Detect small n in delays_by_dest_filtered\$count
arrange(delays_by_dest_filtered, count)
## # A tibble: 102 x 4

 ## ## dest count dist delay ## 1 LEX ## 2 SBN ## 3 HDN ## 4 MTJ ## 5 EYW ## 6 PSP ## 7 JAC ## 8 BZN ## 9 CHO ## 10 MYR 1 604 -22 10 645. 14 1728 14 1795 17 1207 6.5 2.14 1.79 6.35 18 2378 -12.7 21 1876. 28.1 35 1882 46 305 58 551. 7.6 9.5 4.60

## # … with 92 more rows
© W Ma 2022 34/ 51
Data transformation & cleaning – step by step
1. Filter out canceled flights;
2. Group non-canceled flights by destination;
3. Summarise grouped data to compute the number of flights,
average distance, and average arrival delay per destination;
4. Filter out outliers (longest distances – HNL and ANC);
5. Filter out noisy points (small flight counts – rare destinations).
© W Ma 2022 35/ 51
Data transformation & cleaning – step by step
1. Filter out canceled flights;
2. Group non-canceled flights by destination;
3. Summarise grouped data to compute the number of flights,
average distance, and average arrival delay per destination;
4. Filter out outliers (longest distances – HNL and ANC);
5. Filter out noisy points (small flight counts – rare destinations).
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
not_canceled_by_dest
<- group_by(not_canceled, dest)
delays_by_dest
<- summarise(not_canceled_by_dest, count = n(),
dist = mean(distance),
delay = mean(arr_delay))
delays_by_dest_filtered
<- filter(delays_by_dest,
dest !=
“HNL”, dest != “ANC”)
delays_by_dest_cleaned
<- filter(delays_by_dest_filtered, count >= 10)
© W Ma 2022 36/ 51
PIPING %>%
# original code – step by step
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
not_canceled_by_dest
<- group_by(not_canceled, dest)
delays_by_dest
<- summarise(not_canceled_by_dest, count = n(),
dist = mean(distance), delay = mean(arr_delay))
delays_by_dest_filtered
<- filter(delays_by_dest,
dest !=
“HNL”, dest != “ANC”)
delays_by_dest_cleaned
<- filter(delays_by_dest_filtered, count >= 10)
# alternative code using piping %>%
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
delays_by_dest_cleaned
<- not_canceled %>%
group_by(dest) %>%
summarise(
count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
filter(dest !=
“HNL”, dest != “ANC”, count >= 10)
I You can even pipe your dplyr output into ggplot2!
© W Ma 2022 37/ 51
Finally, visualization with ggplot2
ggplot(data = delays_by_dest_cleaned,
mapping = aes(x = dist, y = delay)) +
geom_point(
mapping = aes(size = count), alpha = 1/3) +
geom_smooth(
se = FALSE)
## `geom_smooth()` using method = ‘loess’ and formula ‘y ~ x’
0
20
40
0 1000 2000
dist
delay
count
4000
8000
12000
16000
© W Ma 2022 38/ 51
Comparison – filtering out outliers and noisy points
-20
0
20
40
0 1000 2000 3000 4000 5000
dist
delay
0
20
40
0 1000 2000
dist
delay
© W Ma 2022 39/ 51
Exercise #2 – Study the average delay per plane
Imagine that we want to investigate the distribution of average
delays among different planes.
© W Ma 2022 40/ 51
Exercise #2 – Study the average delay per plane
flights %>% glimpse()
## Rows: 336,776
## Columns: 19

 ## \$ year ## \$ month ## \$ day ## \$ dep_time 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201~ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~ 517, 533, 542, 544, 554, 554, 555, 557, 557, ~ ## \$ sched_dep_time 515, 529, 540, 545, 600, 558, 600, 600, 600, ~ ## \$ dep_delay ## \$ arr_time 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, ~ 830, 850, 923, 1004, 812, 740, 913, 709, 838,~ ## \$ sched_arr_time 819, 830, 850, 1022, 837, 728, 854, 723, 846,~ ## \$ arr_delay ## \$ carrier ## \$ flight ## \$ tailnum ## \$ origin ## \$ dest ## \$ air_time ## \$ distance ## \$ hour ## \$ minute ## \$ time_hour 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2,~ “UA”, “UA”, “AA”, “B6”, “DL”, “UA”, “B6”, “EV~ 1545, 1714, 1141, 725, 461, 1696, 507, 5708, ~ “N14228”, “N24211”, “N619AA”, “N804JB”, “N668~ “EWR”, “LGA”, “JFK”, “JFK”, “LGA”, “EWR”, “EW~ “IAH”, “IAH”, “MIA”, “BQN”, “ATL”, “ORD”, “FL~ 227, 227, 160, 183, 116, 150, 158, 53, 140, 1~ 1400, 1416, 1089, 1576, 762, 719, 1065, 229, ~ 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, ~ 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0~ 2013-01-01 05:00:00, 2013-01-01 05:00:00, 20~

© W Ma 2022 41/ 51
Exercise #2 – Study the average delay per plane
Data transformation – step by step
1. Filter out canceled flights;
2. Group non-canceled flights by tailnum;
3. Summarise grouped data to compute the number of flights and
average arrival delay per tailnum.
© W Ma 2022 42/ 51
Data transformation – step by step
1. Filter out canceled flights;
2. Group non-canceled flights by tailnum;
3. Summarise grouped data to compute the number of flights and average
arrival delay per tailnum.
(delays_by_tailnum <- not_canceled %>%
group_by(tailnum) %>%
summarise(
count = n(), delay = mean(arr_delay)))
## # A tibble: 4,037 x 3
## tailnum count delay
## <chr> <int> <dbl>
## 1 D942DN 4 31.5
## 2 N0EGMQ 352 9.98
## 3 N10156 145 12.7
## 4 N102UW 48 2.94
## 5 N103US 46 -6.93
## 6 N104UW 46 1.80
## 7 N10575 269 20.7
## 8 N105UW 45 -0.267
## 9 N107US 41 -5.73
## 10 N108UW 60 -1.25
## # … with 4,027 more rows
© W Ma 2022 43/ 51
Distribution of delays_by_tailnum\$delay
ggplot(data = delays_by_tailnum, mapping = aes(x = delay)) +
geom_histogram(
mapping = aes(y=..density..), binwidth = 5,
colour = “black”, fill = “white”) +
geom_density(
alpha = .2, fill = “#FF6666”)
0.00
0.01
0.02
0.03
0.04
0 100 200 300
delay
W Ma 2022 44/ 51
Problem with small n
ggplot(data = delays_by_tailnum,
mapping = aes(x = count, y = delay)) +
geom_point(
alpha = .1)
0
100
200
300
0 200 400
count
W Ma 2022 45/ 51
Detect small n in delays_by_tailnum\$count
ggplot(delays_by_tailnum) +
geom_histogram(
mapping = aes(x = log10(count)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
0
100
200
300
0 1 2
log10(count)
count
ggplot(data = delays_by_tailnum) +
geom_boxplot(
mapping = aes(x = “”, y = log10(count))) + coord_flip()
0 1 2
log10(count)
W Ma 2022 46/ 51
Detect small n in delays_by_tailnum\$count
delays_by_tailnum %>% count(count)
## # A tibble: 356 x 2
## count n
## <int> <int>
## 1 1 168
## 2 2 99
## 3 3 67
## 4 4 50
## 5 5 78
## 6 6 46
## 7 7 49
## 8 8 32
## 9 9 27
## 10 10 24
## # … with 346 more rows
I It was not a good idea to name a variable as an existing function name!!
© W Ma 2022 47/ 51
Remove noisy points
delays_by_tailnum_cleaned <- filter(delays_by_tailnum, count >=3)
ggplot(
data = delays_by_tailnum_cleaned,
mapping = aes(x = count, y = delay)) +
geom_point(
alpha = .1)
0
50
100
150
0 200 400
count
W Ma 2022 48/ 51
Remove noisy points
ggplot(data = delays_by_tailnum_cleaned, mapping = aes(x = delay)) +
geom_histogram(
mapping = aes(y=..density..), binwidth = 5,
colour = “black”, fill = “white”) +
geom_density(
alpha = .2, fill = “#FF6666”)
0.00
0.01
0.02
0.03
0.04
0.05
0 50 100 150
delay
W Ma 2022 49/ 51
Comparison – filtering out noisy points
0.00
0.01
0.02
0.03
0.04
0 100 200 300
delay
density
0.00
0.01
0.02
0.03
0.04
0.05
0 50 100 150
delay
density
© W Ma 2022 50/ 51
Exploratory Data Analysis (EDA)
EDA is an iterative cycle.
You: