Data Driven Decision Making

Session 8
Data Driven Decision Making – DDDM
Agenda for this/next week
Recap session seven
Web scraping
Categorical data decision making
Bayesian network analysis
Module Assessment
Reminder: see Blackboard for assessment details
To start with…
Recap on session seven
Learning Outcomes
On completion of this workshop you will be able to:
perform web scraping
apply categorical data decision making
carry out Bayesian network analysis
Imagine a scenario at a credit card company that needs to predict credit card fraud. Logistic regression can be used to estimate the probability that a transaction is fraudulent, given predictors such as time of day, type of transaction, and region of purchase. A model can be built on historical data and then used to score new transactions and classify them as fraudulent or not.
In this session, we first look for associations between predictors and a binary response using hypothesis tests. We then build a logistic regression model and discuss how to characterise the relationship between the response and the predictors. Finally, we use logistic regression to build a model, or classifier, to predict unknown cases.
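As a hedged illustration (not part of the original slides), the sketch below fits a logistic regression model to a tiny, made-up transaction table and scores a new transaction; the feature names (hour, amount, region) and all values are hypothetical.

# Minimal sketch, assuming scikit-learn is available; the data are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy historical transactions: [hour of day, amount, region code]; label 1 = fraud, 0 = legitimate
X = np.array([[2, 950, 3], [14, 20, 1], [23, 480, 3], [9, 35, 2],
              [1, 700, 3], [13, 60, 1], [22, 300, 2], [10, 15, 1]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# Score a new transaction: estimated probability of fraud, then classify.
new_txn = np.array([[3, 820, 3]])
p_fraud = model.predict_proba(new_txn)[0, 1]
print(f"P(fraud) = {p_fraud:.2f} -> {'fraudulent' if p_fraud > 0.5 else 'not fraudulent'}")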
Synopsis
I. Classification and prediction
II. Clustering and similarity
Classification and clustering
What is classification? What is prediction?
Decision tree induction
Bayesian classification
Other classification methods
Classification accuracy
Summary
Overview
Classification and prediction
Aim: to predict categorical class labels for new tuples/samples
Input: a training set of tuples/samples, each with a class label
Output: a model (a classifier) based on the training set and the class labels
What is classification?
Credit approval
Target marketing
Medical diagnosis
Treatment effectiveness analysis
Many more!
Typical classification applications
Prediction is similar to classification:
>constructs a model
>uses the model to predict unknown or missing values
Major method: regression
>linear and multiple regression
>non-linear regression
What is prediction?
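For illustration only (not from the slides), a minimal regression sketch that predicts a continuous value from made-up data:

# Minimal sketch, assuming scikit-learn; the data are invented.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [3], [5], [7], [9]])   # years of experience
y = np.array([25, 32, 41, 48, 57])        # salary in £1000s

reg = LinearRegression().fit(X, y)
print(reg.predict(np.array([[6]])))       # predict the unknown value for 6 years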
Classification:
o predicts categorical class labels
o builds a model from the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
Prediction:
o models continuous-valued functions
o predicts unknown or missing values
Classification vs. prediction
Training data:
NAME   RANK            YEARS  TENURED
Mary   Assistant Prof  3      no
James  Assistant Prof  7      yes
Bill   Professor       2      no
John   Associate Prof  7      yes
Mark   Assistant Prof  6      no
Annie  Associate Prof  3      no
A classification algorithm learns a classifier (model) from the training data, for example the rule:
IF rank = 'professor' OR years > 6 THEN tenured = yes
An example: model construction
Testing data:
NAME  RANK            YEARS  TENURED
Tom   Assistant Prof  2      no
Lisa  Associate Prof  7      no
Jack  Professor       5      yes
Ann   Assistant Prof  7      yes
Applying the classifier to unseen data: (Jeff, Professor, 4) → Tenured? Yes
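A minimal sketch (illustration only, plain Python) that applies the learned rule to the testing data from the slide and then to the unseen tuple:

# Testing data from the slide: (name, rank, years, recorded label)
test_data = [
    ("Tom",  "Assistant Prof", 2, "no"),
    ("Lisa", "Associate Prof", 7, "no"),
    ("Jack", "Professor",      5, "yes"),
    ("Ann",  "Assistant Prof", 7, "yes"),
]

def predict_tenured(rank, years):
    # the learned rule: IF rank = 'professor' OR years > 6 THEN tenured = yes
    return "yes" if rank == "Professor" or years > 6 else "no"

correct = sum(predict_tenured(rank, years) == label for _, rank, years, label in test_data)
print(f"Agreement with recorded labels: {correct}/{len(test_data)}")

# Unseen tuple from the slide: (Jeff, Professor, 4)
print("Jeff tenured?", predict_tenured("Professor", 4))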
A decision tree is a tree where:
internal node = a test on an attribute
tree branch = an outcome of the test
leaf node = class label or class distribution

[Figure: a generic decision tree with internal test nodes A?, B?, C?, D?, a branch for each test outcome, and leaf nodes such as Yes]
Decision tree induction
Two phases of decision tree generation:
tree construction
o at start, all the training examples at the root
o partition examples based on selected attributes
o test attributes are selected based on a heuristic or a statistical measure
tree pruning
o identify and remove branches that reflect noise or outliers
Decision tree generation
Outlook Temperature Humidity Windy Class
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
Training set
Decision tree induction – classical example: play tennis?
outlook?
├─ sunny    → humidity?  (high → N, normal → P)
├─ overcast → P
└─ rain     → windy?     (true → N, false → P)
Decision tree
One rule is generated for each path in the tree from the root to a leaf.
Each attribute-value pair along a path forms a conjunction.
The leaf node holds the class prediction.
Rules are generally simpler to understand than trees.
IF outlook = sunny AND humidity = normal THEN play tennis

From a decision tree to classification rules
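As an illustrative sketch (not part of the original slides), the play-tennis table can be fed to a standard decision-tree learner and the fitted tree printed in rule-like form; the one-hot encoding below is one of several reasonable choices.

# Minimal sketch, assuming pandas and scikit-learn are available.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [
    ("sunny", "hot", "high", False, "N"), ("sunny", "hot", "high", True, "N"),
    ("overcast", "hot", "high", False, "P"), ("rain", "mild", "high", False, "P"),
    ("rain", "cool", "normal", False, "P"), ("rain", "cool", "normal", True, "N"),
    ("overcast", "cool", "normal", True, "P"), ("sunny", "mild", "high", False, "N"),
    ("sunny", "cool", "normal", False, "P"), ("rain", "mild", "normal", False, "P"),
    ("sunny", "mild", "normal", True, "P"), ("overcast", "mild", "high", True, "P"),
    ("overcast", "hot", "normal", False, "P"), ("rain", "mild", "high", True, "N"),
]
df = pd.DataFrame(rows, columns=["outlook", "temperature", "humidity", "windy", "class"])

# One-hot encode the categorical attributes so the tree can split on them.
X = pd.get_dummies(df.drop(columns="class"))
y = df["class"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # one rule per root-to-leaf path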
Bayesian network analysis
The use of Bayesian methods has become increasingly popular in modern statistical
analysis, with applications in a wide variety of scientific fields.
Bayesian methods incorporate existing information (based on expert knowledge, past
studies, and so on) into current data analysis.
This existing information is represented by a prior distribution, and the data likelihood is
effectively weighted by the prior distribution as the data analysis results are computed.
Synopsis
Because Bayes answers the questions we really care about.
Pr(I have disease | test +) vs Pr(test + | disease)
Pr(A better than B | data) vs Pr(extreme data | A=B)
Bayes is natural (vs interpreting a CI or a P-value)
Note: in each pair above, the first quantity is the Bayesian question and the second is the frequentist one.
Why Bayes?
You are waiting on a subway platform for a train that is known to run on a regular schedule, only you don’t know how much time is scheduled to pass between train arrivals, nor how long it’s been since the last train departed.
As more time passes, do you (a) grow more confident that the train will arrive soon, since its eventual arrival can only be getting closer, not further away, or (b) grow less confident that the train will arrive soon, since the longer you wait, the more likely it seems that either the scheduled arrival times are far apart or else that you happened to arrive just after the last train left – or both?
Example
If you choose (a), you’re thinking like a frequentist.
If you choose (b), you’re thinking like a Bayesian.
An opaque jar contains thousands of beads (but obviously a finite number!). You know that all the beads are either red or white, but you have no idea at all what fraction of them are red. You begin to draw beads out of the jar at random, without replacement. You notice that all of the first several beads have been red.
As you observe more and more red beads, is the conditional probability (i.e., conditional upon the previous draws’ colors) of the next bead being red (a) decreasing, as it must, since you’re removing red beads from a finite population, or (b) increasing, because you initially didn’t know that there would be so many reds, but now it seems that the jar must be mostly reds?
Example
If you choose (a), you’re thinking like a frequentist.
If you choose (b), you’re thinking like a Bayesian.
Limiting relative frequency:
P(a) = lim_{n→∞} (number of times a happens in n trials) / n
A nice definition mathematically, but not so great in practice.
(So we often appeal to symmetry…)
What if you can’t get all the way to infinity today?
What if there is only 1 trial? E.g., P(snow tomorrow)
What if appealing to symmetry fails? E.g., I take a penny out of my pocket and spin it. What is P(H)?
Probability
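A small simulation (illustration only) of the limiting-relative-frequency idea: the running proportion of heads in repeated fair-coin flips settles near P(H) = 0.5 as n grows.

# Minimal sketch: relative frequency of heads in n fair-coin flips.
import random

random.seed(1)
flips = [random.random() < 0.5 for _ in range(100_000)]
for n in (10, 100, 1_000, 10_000, 100_000):
    freq = sum(flips[:n]) / n
    print(f"n = {n:>6}: relative frequency of heads = {freq:.3f}")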
Odds = prob / (1 − prob)
Prob = odds / (1 + odds)
Fair die:
Event     Prob   Odds
roll a 2  1/6    1/5  [or 1/5:1 or 1:5]
even #    1/2    1    [or 1:1]
X > 2     2/3    2    [or 2:1]
Persi Diaconis: “Probability isn’t a fact about the world; probability is a fact about an observer’s knowledge.”
Subjective probability
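For illustration (not from the slides), the two conversions as small helpers, checked against the fair-die values above:

# Minimal sketch: converting between probability and odds.
def odds(p):
    return p / (1 - p)

def prob(o):
    return o / (1 + o)

for event, p in [("roll a 2", 1/6), ("even #", 1/2), ("X > 2", 2/3)]:
    print(f"{event}: prob = {p:.3f}, odds = {odds(p):.3f}")

print(prob(2))  # odds of 2 (i.e. 2:1) correspond to probability 2/3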
P(a|b) = P(a,b)/P(b)
P(b|a) = P(a,b)/P(a)
P(a,b) = P(a)P(b|a)
Thus, P(a|b) = P(a,b)/P(b) = P(a)P(b|a)/P(b)
But what is P(b)? P(b) = Σ_a P(a,b) = Σ_a P(a)P(b|a)
Thus, P(a|b) = P(a)P(b|a) / Σ_{a*} P(a*)P(b|a*)
where a* means “any value of a”
Bayes’ theorem
P(a|b) = P(a)P(b|a) / Σ_{a*} P(a*)P(b|a*)
P(a|b) ∝ P(a) P(b|a)
posterior ∝ (prior) (likelihood)
Bayes’ Theorem is used to take a prior probability, update it
with data (the likelihood), and get a posterior probability.
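As a hedged sketch (the numbers are invented, not from the slides), a conjugate Beta-Binomial update shows the prior being weighted by the likelihood: a Beta(2, 2) prior on a success probability, after observing 7 successes in 10 trials, becomes a Beta(9, 5) posterior.

# Minimal sketch, assuming SciPy; prior and data values are hypothetical.
from scipy.stats import beta

prior_a, prior_b = 2, 2          # Beta(2, 2) prior on the success probability
successes, trials = 7, 10        # observed data (the likelihood)

post_a = prior_a + successes                  # conjugate update:
post_b = prior_b + (trials - successes)       # posterior is Beta(prior_a + k, prior_b + n - k)

print(f"Posterior: Beta({post_a}, {post_b})")
print("Posterior mean:", post_a / (post_a + post_b))
print("95% credible interval:", beta.interval(0.95, post_a, post_b))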

Medical test example. Suppose a test is 95% accurate when a disease is present and 97% accurate when the disease is absent. Suppose that 1% of the population has the disease. What is P(have the disease | test +)?

P(dis | test+) = P(dis)P(test+ | dis) / [P(dis)P(test+ | dis) + P(¬dis)P(test+ | ¬dis)]
               = (0.01)(0.95) / [(0.01)(0.95) + (0.99)(0.03)]
               = 0.0095 / (0.0095 + 0.0297)
               ≈ 0.24
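A quick numeric check of the slide’s calculation (illustration only):

# Minimal sketch: Bayes’ theorem for the medical-test example above.
p_dis = 0.01                 # prevalence
p_pos_given_dis = 0.95       # test is 95% accurate when the disease is present
p_pos_given_no_dis = 0.03    # test is 97% accurate when absent, so 3% false positives

p_pos = p_dis * p_pos_given_dis + (1 - p_dis) * p_pos_given_no_dis
p_dis_given_pos = p_dis * p_pos_given_dis / p_pos
print(round(p_dis_given_pos, 2))  # ≈ 0.24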
Typical statistics problem: There is a parameter, θ, that we want to estimate, and we have some data.
Traditional (frequentist) methods: Study and describe P(data | θ). If the data are unlikely for a given θ, then state “that value of θ is not supported by the data.” (A hypothesis test asks whether a particular value of θ might be correct; a CI presents a range of plausible values.)
Bayesian methods: Describe the distribution P(θ | data). A frequentist thinks of θ as fixed (but unknown), while a Bayesian thinks of θ as a random variable that has a distribution.
Bayesian reasoning is natural and easy to think about. It is becoming much more commonly used.

If Bayes is so great, why hasn’t it always been popular?
(1) Without Markov Chain Monte Carlo, it wasn’t practical.
(2) Some people distrust prior distributions, thinking that science
should be objective (as if that were possible).
Bayes is becoming much more common, due to MCMC.
E.g., three recent years’ worth of J. Amer. Stat. Assoc. “applications and case studies” papers: 46 of 84 papers used Bayesian methods (+ 4 others merely included an application of Bayes’ Theorem).

QUESTIONS?
Food for thought…
Agenda for this/next week
Thank you
References