Introduction to Data Science

STAT 167 Introduction to Data Science
Exploratory data analysis
Wenxiu Ma
[email protected]
4/28/2022
© W Ma 2022 1/ 51
Exploratory data analysis
Wickham and Grolemund. “R for Data Science: Import, Tidy, Transform, Visualize, and Model Data.” 1st Edition (2017)
I Chapter 5 “Exploratory Data Analysis” in print.
I Chapter 7 “Exploratory Data Analysis” @
http://r4ds.had.co.nz/exploratory-data-analysis.html
© W Ma 2022 2/ 51
Recap – complete graphing template in ggplot2
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,            # optional
    position = <POSITION>     # optional
  ) +
  <COORDINATE_FUNCTION> +     # optional
  <FACET_FUNCTION> +          # optional
  <SCALE_FUNCTION> +          # optional
  <THEME_FUNCTION>            # optional
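For instance, a minimal sketch of the template filled in with ggplot2's built-in mpg data set (an illustrative choice; any data frame works):

library(ggplot2)

# Scatterplot of engine displacement vs. highway mileage, colored by class,
# faceted by drive train; stat and position are left at their defaults
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  facet_wrap(~ drv) +
  theme_bw()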

© W Ma 2022 3/ 51
Recap – dplyr basic functions
I Pick observations by their values – filter()
I Reorder the rows – arrange()
I Pick variables by their names – select()
I Create new variables with functions of existing variables – mutate()
I Collapse many values down to a single summary – summarise()
I The above five can be used in conjunction with group_by(), which changes the scope of each function from operating on the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of data manipulation/wrangling.
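For example, a small sketch combining the verbs (assuming the nycflights13 package is installed):

library(dplyr)
library(nycflights13)

# Average arrival delay per carrier in January, worst first
flights %>%
  filter(month == 1) %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay))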
© W Ma 2022 4/ 51
Recap – All dplyr functions share similar grammar
I The first argument is a data frame.
I The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
I The result is a new data frame.
Together these properties make it easy to chain together multiple simple steps to achieve a complex result – piping (%>%)!
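Because each verb takes a data frame as its first argument and returns a data frame, the two forms below are equivalent (a sketch using flights):

# nested form
arrange(filter(flights, dest == "LAX"), desc(arr_delay))
# piped form – reads left to right
flights %>% filter(dest == "LAX") %>% arrange(desc(arr_delay))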
© W Ma 2022 5/ 51
Combining the power of ggplot2 & dplyr
A typical data science project workflow
I Data exploration is the art of looking at your data, rapidly
generating hypotheses, quickly testing them, then repeating
again and again and again.
© W Ma 2022 6/ 51
Data set: nycflights13::flights
This data frame contains all 336,776 flights that departed from New York City in
2013. The data comes from the US Bureau of Transportation Statistics.
library(nycflights13)
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # … with 336,766 more rows, and 12 more variables:
## # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>
© W Ma 2022 7/ 51
Exercise #1 – Study average delay per destination
Imagine that we want to explore the relationship between the
distance and average delay for each destination.
© W Ma 2022 8/ 51
Exercise #1 – Study average delay per destination
Imagine that we want to explore the relationship between the
distance and average delay for each destination.
[Figure: scatterplot of average arrival delay (delay) vs. average distance (dist) per destination; point size encodes the flight count.]
© W Ma 2022 9/ 51
Calculate average delay per destination
by_dest <- group_by(flights, dest)
delays_by_dest <- summarise(by_dest, count = n(),
                            dist = mean(distance),
                            delay = mean(arr_delay))
delays_by_dest
## # A tibble: 105 x 4
##    dest  count  dist delay
##    <chr> <int> <dbl> <dbl>
##  1 ABQ     254  1826  4.38
##  2 ACK     265   199 NA
##  3 ALB     439   143 NA
##  4 ANC       8  3370 -2.5
##  5 ATL   17215  757. NA
##  6 AUS    2439 1514. NA
##  7 AVL     275  584. NA
##  8 BDL     443   116 NA
##  9 BGR     375   378 NA
## 10 BHM     297  866. NA
## # … with 95 more rows
© W Ma 2022 10/ 51
How many missing values?
filter(delays_by_dest, is.na(delay))
## # A tibble: 100 x 4
##    dest  count  dist delay
##    <chr> <int> <dbl> <dbl>
##  1 ACK     265   199    NA
##  2 ALB     439   143    NA
##  3 ATL   17215  757.    NA
##  4 AUS    2439 1514.    NA
##  5 AVL     275  584.    NA
##  6 BDL     443   116    NA
##  7 BGR     375   378    NA
##  8 BHM     297  866.    NA
##  9 BNA    6333  758.    NA
## 10 BOS   15508  191.    NA
## # … with 90 more rows
colSums(is.na(delays_by_dest)) # count # of NAs per column
## dest count dist delay
## 0 0 0 100
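A dplyr-only alternative sketch for the same per-column NA count (across() assumes dplyr >= 1.0):

# count NAs in every column of delays_by_dest
delays_by_dest %>%
  summarise(across(everything(), ~ sum(is.na(.x))))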

© W Ma 2022 11/ 51
Deal with missing values
1. use na.rm = TRUE in the summary function
delays_by_dest1 <- summarise(by_dest, count = n(),
                             dist = mean(distance),
                             delay = mean(arr_delay, na.rm = T))
delays_by_dest1
## # A tibble: 105 x 4
##    dest  count  dist delay
##    <chr> <int> <dbl> <dbl>
##  1 ABQ     254  1826  4.38
##  2 ACK     265   199  4.85
##  3 ALB     439   143 14.4
##  4 ANC       8  3370 -2.5
##  5 ATL   17215  757. 11.3
##  6 AUS    2439 1514.  6.02
##  7 AVL     275  584.  8.00
##  8 BDL     443   116  7.05
##  9 BGR     375   378  8.03
## 10 BHM     297  866. 16.9
## # … with 95 more rows
© W Ma 2022 12/ 51
Deal with missing values
2. filter out missing values from flights
# count # of NAs per column
colSums(is.na(flights))

##           year          month            day       dep_time
##              0              0              0           8255
## sched_dep_time      dep_delay       arr_time sched_arr_time
##              0           8255           8713              0
##      arr_delay        carrier         flight        tailnum
##           9430              0              0           2512
##         origin           dest       air_time       distance
##              0              0           9430              0
##           hour         minute      time_hour
##              0              0              0
© W Ma 2022 13/ 51
Deal with missing values
2. filter out missing values from flights
filter(flights, is.na(arr_delay))
## # A tibble: 9,430 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1     1525           1530        -5     1934
##  2  2013     1     1     1528           1459        29     2002
##  3  2013     1     1     1740           1745        -5     2158
##  4  2013     1     1     1807           1738        29     2251
##  5  2013     1     1     1939           1840        59       29
##  6  2013     1     1     1952           1930        22     2358
##  7  2013     1     1     2016           1930        46       NA
##  8  2013     1     1       NA           1630        NA       NA
##  9  2013     1     1       NA           1935        NA       NA
## 10  2013     1     1       NA           1500        NA       NA
## # … with 9,420 more rows, and 12 more variables:
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

© W Ma 2022 14/ 51
Deal with missing values
2. filter out missing values from flights
filter(flights, is.na(arr_delay)) %>% select(dep_time:arr_delay)
## # A tibble: 9,430 x 6
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##       <int>          <int>     <dbl>    <int>          <int>     <dbl>
##  1     1525           1530        -5     1934           1805        NA
##  2     1528           1459        29     2002           1647        NA
##  3     1740           1745        -5     2158           2020        NA
##  4     1807           1738        29     2251           2103        NA
##  5     1939           1840        59       29           2151        NA
##  6     1952           1930        22     2358           2207        NA
##  7     2016           1930        46       NA           2220        NA
##  8       NA           1630        NA       NA           1815        NA
##  9       NA           1935        NA       NA           2240        NA
## 10       NA           1500        NA       NA           1825        NA
## # … with 9,420 more rows
I NAs in dep_delay or arr_delay represent canceled flights
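For example, a sketch that counts canceled flights per month under this definition:

# canceled = missing dep_delay or arr_delay
flights %>%
  mutate(canceled = is.na(dep_delay) | is.na(arr_delay)) %>%
  group_by(month) %>%
  summarise(n_canceled = sum(canceled),
            prop_canceled = mean(canceled))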
© W Ma 2022 15/ 51
Data cleaning – missing values
2. filter out missing values from flights and use non-canceled
flights instead
# NAs in dep_delay and arr_delay represent canceled flights
(not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay)))
## # A tibble: 327,346 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 327,336 more rows, and 12 more variables:
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>
© W Ma 2022 16/ 51

Data cleaning – missing values
2. filter out missing values from flights and use non-canceled
flights instead
not_canceled %>% group_by(dest) %>%
  summarise(count = n(), dist = mean(distance),
            delay = mean(arr_delay))
## # A tibble: 104 x 4
##    dest  count  dist delay
##    <chr> <int> <dbl> <dbl>
##  1 ABQ     254  1826  4.38
##  2 ACK     264   199  4.85
##  3 ALB     418   143 14.4
##  4 ANC       8  3370 -2.5
##  5 ATL   16837  757. 11.3
##  6 AUS    2411 1514.  6.02
##  7 AVL     261  584.  8.00
##  8 BDL     412   116  7.05
##  9 BGR     358   378  8.03
## 10 BHM     269  866. 16.9
## # … with 94 more rows
© W Ma 2022 17/ 51
Summary – dealing with missing values
1. use na.rm = TRUE in the summary function
(delays_by_dest1 <- flights %>% group_by(dest) %>%
   summarise(count = n(), dist = mean(distance),
             delay = mean(arr_delay, na.rm = T)))
## # A tibble: 105 x 4
##    dest  count  dist delay
##    <chr> <int> <dbl> <dbl>
##  1 ABQ     254  1826  4.38
##  2 ACK     265   199  4.85
##  3 ALB     439   143 14.4
##  4 ANC       8  3370 -2.5
##  5 ATL   17215  757. 11.3
##  6 AUS    2439 1514.  6.02
##  7 AVL     275  584.  8.00
##  8 BDL     443   116  7.05
##  9 BGR     375   378  8.03
## 10 BHM     297  866. 16.9
## # … with 95 more rows
© W Ma 2022 18/ 51
Summary – dealing with missing values
2. filter out missing values from flights and use non-canceled flights
instead
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
(delays_by_dest2 <- not_canceled %>% group_by(dest) %>%
   summarise(count = n(), dist = mean(distance),
             delay = mean(arr_delay)))
## # A tibble: 104 x 4
##    dest  count  dist delay
##    <chr> <int> <dbl> <dbl>
##  1 ABQ     254  1826  4.38
##  2 ACK     264   199  4.85
##  3 ALB     418   143 14.4
##  4 ANC       8  3370 -2.5
##  5 ATL   16837  757. 11.3
##  6 AUS    2411 1514.  6.02
##  7 AVL     261  584.  8.00
##  8 BDL     412   116  7.05
##  9 BGR     358   378  8.03
## 10 BHM     269  866. 16.9
## # … with 94 more rows
© W Ma 2022 19/ 51
Are the results different?
I An inner join matches pairs of observations whenever their
keys are equal.
inner_join(delays_by_dest1, delays_by_dest2, by = "dest")
## # A tibble: 104 x 7
##    dest  count.x dist.x delay.x count.y dist.y delay.y
##    <chr>   <int>  <dbl>   <dbl>   <int>  <dbl>   <dbl>
##  1 ABQ       254   1826   4.38      254   1826   4.38
##  2 ACK       265    199   4.85      264    199   4.85
##  3 ALB       439    143  14.4       418    143  14.4
##  4 ANC         8   3370  -2.5         8   3370  -2.5
##  5 ATL     17215   757.  11.3     16837   757.  11.3
##  6 AUS      2439  1514.   6.02     2411  1514.   6.02
##  7 AVL       275   584.   8.00      261   584.   8.00
##  8 BDL       443    116   7.05      412    116   7.05
##  9 BGR       375    378   8.03      358    378   8.03
## 10 BHM       297   866.  16.9       269   866.  16.9
## # … with 94 more rows
© W Ma 2022 20/ 51
Are the results different?
I anti_join(x, y) drops all observations in x that have a match in y.
anti_join(delays_by_dest1, delays_by_dest2, by = "dest")
## # A tibble: 1 x 4
##   dest  count  dist delay
##   <chr> <int> <dbl> <dbl>
## 1 LGA       1    17   NaN

© W Ma 2022 21/ 51
Why is there a NaN?
filter(flights, dest == "LGA")
## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     7    27       NA            106        NA       NA
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

© W Ma 2022 22/ 51
Why is there a NaN?
I na.rm = TRUE will produce NaN when all values are missing
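A one-line illustration of why: removing the NAs from an all-NA vector leaves a length-0 vector, and mean() of that is NaN.

mean(c(NA, NA, NA), na.rm = TRUE)   # NaN – nothing left to average
mean(numeric(0))                    # NaN as well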
# filter(flights, dest == "LGA") %>% str()
filter(flights, dest == "LGA") %>% glimpse()
## Rows: 1
## Columns: 19
## $ year <int> 2013
## $ month <int> 7
## $ day <int> 27
## $ dep_time <int> NA
## $ sched_dep_time <int> 106
## $ dep_delay <dbl> NA
## $ arr_time <int> NA
## $ sched_arr_time <int> 245
## $ arr_delay <dbl> NA
## $ carrier <chr> "US"
## $ flight <int> 1632
## $ tailnum <chr> NA
## $ origin <chr> "EWR"
## $ dest <chr> "LGA"
## $ air_time <dbl> NA
## $ distance <dbl> 17
## $ hour <dbl> 1
## $ minute <dbl> 6
## $ time_hour <dttm> 2013-07-27 01:00:00
© W Ma 2022 23/ 51
Exercise #1 – Study average delay per destination
Imagine that we want to explore the relationship between the
distance and average delay for each destination.
[Figure: scatterplot of average arrival delay (delay) vs. average distance (dist) per destination; point size encodes the flight count.]
© W Ma 2022 24/ 51
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
(delays_by_dest2 <- not_canceled %>% group_by(dest) %>%
   summarise(count = n(), dist = mean(distance),
             delay = mean(arr_delay)))
## # A tibble: 104 x 4
##    dest  count  dist delay
##    <chr> <int> <dbl> <dbl>
##  1 ABQ     254  1826  4.38
##  2 ACK     264   199  4.85
##  3 ALB     418   143 14.4
##  4 ANC       8  3370 -2.5
##  5 ATL   16837  757. 11.3
##  6 AUS    2411 1514.  6.02
##  7 AVL     261  584.  8.00
##  8 BDL     412   116  7.05
##  9 BGR     358   378  8.03
## 10 BHM     269  866. 16.9
## # … with 94 more rows
© W Ma 2022 25/ 51
ggplot(data = delays_by_dest2, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
[Figure: delay vs. dist scatterplot with a loess smooth; point size encodes count.]
© W Ma 2022 26/ 51
What are the two points with largest distances?
arrange(delays_by_dest2, desc(dist))
## # A tibble: 104 x 4
##    dest  count  dist   delay
##    <chr> <int> <dbl>   <dbl>
##  1 HNL     701 4973.  -1.37
##  2 ANC       8 3370   -2.5
##  3 SFO   13173 2578.   2.67
##  4 OAK     309 2576    3.08
##  5 SJC     328 2569    3.45
##  6 SMF     282 2521   12.1
##  7 LAX   16026 2469.   0.547
##  8 BUR     370 2465    8.18
##  9 LGB     661 2465   -0.0620
## 10 PDX    1342 2446.   5.14
## # … with 94 more rows
© W Ma 2022 27/ 51
What are the two points with largest distances?
top_n(delays_by_dest2, 2, dist)
## # A tibble: 2 x 4
##   dest  count  dist delay
##   <chr> <int> <dbl> <dbl>
## 1 ANC       8 3370  -2.5
## 2 HNL     701 4973. -1.37
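In dplyr >= 1.0 the same result can be written with slice_max(), which supersedes top_n() (a sketch):

# two destinations with the largest average distance
delays_by_dest2 %>% slice_max(dist, n = 2)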

© W Ma 2022 28/ 51
Detect outliers in delays_by_dest2$dist
ggplot(data = delays_by_dest2) +
  geom_histogram(mapping = aes(x = dist))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Figure: histogram of dist]
ggplot(data = delays_by_dest2) +
  geom_boxplot(mapping = aes(x = "", y = dist)) +
  coord_flip()
[Figure: flipped boxplot of dist]
© W Ma 2022 29/ 51
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
delays_by_dest_filtered <- not_canceled %>% group_by(dest) %>%
  summarise(count = n(), dist = mean(distance),
            delay = mean(arr_delay)) %>%
  filter(dest != "HNL", dest != "ANC")
ggplot(data = delays_by_dest_filtered,
       mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
[Figure: delay vs. dist scatterplot with a loess smooth, after removing HNL and ANC; point size encodes count.]
© W Ma 2022 30/ 51
Problem with small n
I Whenever you do any aggregation, it's always a good idea to include either a count n(), or a count of non-missing values sum(!is.na(x)).
I That way you can check that you're not drawing conclusions based on very small amounts of data.
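For instance, a sketch that carries both counts alongside the group mean, so you can see how much data each mean rests on:

flights %>%
  group_by(dest) %>%
  summarise(n_flights    = n(),                     # all flights, canceled included
            n_with_delay = sum(!is.na(arr_delay)),  # flights with a recorded delay
            delay        = mean(arr_delay, na.rm = TRUE))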
© W Ma 2022 31/ 51
Distribution of delays_by_dest_filtered$count
ggplot(delays_by_dest_filtered) +
  geom_histogram(mapping = aes(x = count))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Figure: histogram of count]
ggplot(data = delays_by_dest_filtered) +
  geom_boxplot(mapping = aes(x = "", y = count)) +
  coord_flip()
[Figure: flipped boxplot of count]
© W Ma 2022 32/ 51
Distribution of delays_by_dest_filtered$count
ggplot(delays_by_dest_filtered) +
  geom_histogram(mapping = aes(x = log10(count)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Figure: histogram of log10(count)]
ggplot(data = delays_by_dest_filtered) +
  geom_boxplot(mapping = aes(x = "", y = log10(count))) +
  coord_flip()
[Figure: flipped boxplot of log10(count)]
© W Ma 2022 33/ 51
Detect small n in delays_by_dest_filtered$count
arrange(delays_by_dest_filtered, count)
## # A tibble: 102 x 4
##    dest  count  dist delay
##    <chr> <int> <dbl> <dbl>
##  1 LEX       1   604 -22
##  2 SBN      10  645.   6.5
##  3 HDN      14  1728   2.14
##  4 MTJ      14  1795   1.79
##  5 EYW      17  1207   6.35
##  6 PSP      18  2378 -12.7
##  7 JAC      21 1876.  28.1
##  8 BZN      35  1882   7.6
##  9 CHO      46   305   9.5
## 10 MYR      58  551.   4.60
## # … with 92 more rows
© W Ma 2022 34/ 51
Data transformation & cleaning – step by step
1. Filter out canceled flights;
2. Group non-canceled flights by destination;
3. Summarise grouped data to compute the number of flights,
average distance, and average arrival delay per destination;
4. Filter out outliers (longest distances – HNL and ANC);
5. Filter out noisy points (small flight counts – rare destinations).
© W Ma 2022 35/ 51
Data transformation & cleaning – step by step
1. Filter out canceled flights;
2. Group non-canceled flights by destination;
3. Summarise grouped data to compute the number of flights,
average distance, and average arrival delay per destination;
4. Filter out outliers (longest distances – HNL and ANC);
5. Filter out noisy points (small flight counts – rare destinations).
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
not_canceled_by_dest <- group_by(not_canceled, dest)
delays_by_dest <- summarise(not_canceled_by_dest, count = n(),
                            dist = mean(distance),
                            delay = mean(arr_delay))
delays_by_dest_filtered <- filter(delays_by_dest,
                                  dest != "HNL", dest != "ANC")
delays_by_dest_cleaned <- filter(delays_by_dest_filtered, count >= 10)
© W Ma 2022 36/ 51
PIPING %>%
# original code – step by step
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
not_canceled_by_dest <- group_by(not_canceled, dest)
delays_by_dest <- summarise(not_canceled_by_dest, count = n(),
                            dist = mean(distance), delay = mean(arr_delay))
delays_by_dest_filtered <- filter(delays_by_dest,
                                  dest != "HNL", dest != "ANC")
delays_by_dest_cleaned <- filter(delays_by_dest_filtered, count >= 10)
# alternative code using piping %>%
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
delays_by_dest_cleaned <- not_canceled %>%
  group_by(dest) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(dest != "HNL", dest != "ANC", count >= 10)
I You can even pipe your dplyr output into ggplot2!
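For example, a sketch piping straight into the plot (note the switch from %>% between dplyr verbs to + between ggplot2 layers):

not_canceled %>%
  group_by(dest) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(dest != "HNL", dest != "ANC", count >= 10) %>%
  ggplot(mapping = aes(x = dist, y = delay)) +
    geom_point(aes(size = count), alpha = 1/3) +
    geom_smooth(se = FALSE)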
© W Ma 2022 37/ 51
Finally, visualization with ggplot2
ggplot(data = delays_by_dest_cleaned,
       mapping = aes(x = dist, y = delay)) +
  geom_point(mapping = aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
[Figure: delay vs. dist scatterplot with a loess smooth for the cleaned data; point size encodes count.]
© W Ma 2022 38/ 51
Comparison – filtering out outliers and noisy points
[Figure: side-by-side scatterplots of delay vs. dist, before (left) and after (right) filtering out the outliers and noisy points.]
© W Ma 2022 39/ 51
Exercise #2 – Study the average delay per plane
Imagine that we want to investigate the distribution of average
delays among different planes.
© W Ma 2022 40/ 51
Exercise #2 – Study the average delay per plane
flights %>% glimpse()
## Rows: 336,776
## Columns: 19

## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201~
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, ~
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, ~
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, ~
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838,~
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846,~
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2,~
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV~
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, ~
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668~
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EW~
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FL~
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 1~
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, ~
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, ~
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0~
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 20~

© W Ma 2022 41/ 51
Exercise #2 – Study the average delay per plane
Data transformation – step by step
1. Filter out canceled flights;
2. Group non-canceled flights by tailnum;
3. Summarise grouped data to compute the number of flights and
average arrival delay per tailnum.
© W Ma 2022 42/ 51
Data transformation – step by step
1. Filter out canceled flights;
2. Group non-canceled flights by tailnum;
3. Summarise grouped data to compute the number of flights and average
arrival delay per tailnum.
(delays_by_tailnum <- not_canceled %>%
   group_by(tailnum) %>%
   summarise(count = n(), delay = mean(arr_delay)))
## # A tibble: 4,037 x 3
## tailnum count delay
## <chr> <int> <dbl>
## 1 D942DN 4 31.5
## 2 N0EGMQ 352 9.98
## 3 N10156 145 12.7
## 4 N102UW 48 2.94
## 5 N103US 46 -6.93
## 6 N104UW 46 1.80
## 7 N10575 269 20.7
## 8 N105UW 45 -0.267
## 9 N107US 41 -5.73
## 10 N108UW 60 -1.25
## # … with 4,027 more rows
© W Ma 2022 43/ 51
Distribution of delays_by_tailnum$delay
ggplot(data = delays_by_tailnum, mapping = aes(x = delay)) +
  geom_histogram(mapping = aes(y = ..density..), binwidth = 5,
                 colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6666")
[Figure: density histogram of average delay per plane with an overlaid density curve.]
© W Ma 2022 44/ 51
Problem with small n
ggplot(data = delays_by_tailnum,
       mapping = aes(x = count, y = delay)) +
  geom_point(alpha = .1)
[Figure: scatterplot of average delay vs. number of flights per plane.]
© W Ma 2022 45/ 51
Detect small n in delays_by_tailnum$count
ggplot(delays_by_tailnum) +
  geom_histogram(mapping = aes(x = log10(count)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Figure: histogram of log10(count)]
ggplot(data = delays_by_tailnum) +
  geom_boxplot(mapping = aes(x = "", y = log10(count))) +
  coord_flip()
[Figure: flipped boxplot of log10(count)]
© W Ma 2022 46/ 51
Detect small n in delays_by_tailnum$count
delays_by_tailnum %>% count(count)
## # A tibble: 356 x 2
## count n
## <int> <int>
## 1 1 168
## 2 2 99
## 3 3 67
## 4 4 50
## 5 5 78
## 6 6 46
## 7 7 49
## 8 8 32
## 9 9 27
## 10 10 24
## # … with 346 more rows
I It was not a good idea to name a variable after an existing function – count() is also a dplyr verb!
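A sketch of a less confusing naming scheme; n_flights is a hypothetical name that avoids shadowing dplyr's count():

delays_by_tailnum <- not_canceled %>%
  group_by(tailnum) %>%
  summarise(n_flights = n(), delay = mean(arr_delay))
delays_by_tailnum %>% count(n_flights)   # now reads unambiguously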
© W Ma 2022 47/ 51
Remove noisy points
delays_by_tailnum_cleaned <- filter(delays_by_tailnum, count >= 3)
ggplot(data = delays_by_tailnum_cleaned,
       mapping = aes(x = count, y = delay)) +
  geom_point(alpha = .1)
[Figure: scatterplot of average delay vs. number of flights per plane, after removing planes with fewer than 3 flights.]
© W Ma 2022 48/ 51
Remove noisy points
ggplot(data = delays_by_tailnum_cleaned, mapping = aes(x = delay)) +
  geom_histogram(mapping = aes(y = ..density..), binwidth = 5,
                 colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6666")
[Figure: density histogram of average delay per plane (cleaned data) with an overlaid density curve.]
© W Ma 2022 49/ 51
Comparison – filtering out noisy points
[Figure: density histograms of average delay per plane, before (left) and after (right) removing planes with fewer than 3 flights.]
© W Ma 2022 50/ 51
Exploratory Data Analysis (EDA)
EDA is an iterative cycle.
You:
I Generate questions about your data.
I Search for answers by visualizing, transforming, and modeling
your data.
I Use what you learn to refine your questions and/or generate
new questions.
© W Ma 2022 51/ 51