Session 8

Data Driven Decision Making – DDDM

Agenda for this/next week

• Recap of session seven

• Web scraping

• Categorical data decision making

• Bayesian network analysis


Module Assessment

Reminder: see Blackboard for assessment details


To start with…

• Recap of session seven


Learning Outcomes


On completion of this workshop you will be able to:

• Perform web scraping

• Apply decision-making methods to categorical data

• Carry out Bayesian network analysis

Imagine a credit card company that needs to predict credit card fraud. Logistic regression can be used to estimate the probability that a transaction is fraudulent, given predictors such as time of day, type of transaction, and region of purchase. We can build a model on historical data, then score new transactions and classify them as fraudulent or not.

In this session, we first look for associations between predictors and a binary response

using hypothesis tests. We then build a logistic regression model and discuss how to

characterize the relationship between the response and predictors. Finally, we’ll use logistic

regression to build a model, or classifier, to predict unknown cases.
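To make this concrete, here is a minimal sketch of that workflow in Python with pandas, scipy, and scikit-learn. The transactions, column names, and 0.5 threshold are invented for illustration; this is not a real fraud dataset.

import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.linear_model import LogisticRegression

# Hypothetical historical transactions (fraud: 1 = fraudulent, 0 = not)
hist = pd.DataFrame({
    "hour": [2, 14, 23, 9, 3, 11, 22, 15, 1, 13],
    "type": ["online", "instore", "online", "instore", "online",
             "instore", "online", "instore", "online", "instore"],
    "fraud": [1, 0, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Step 1: hypothesis test -- is transaction type associated with fraud?
table = pd.crosstab(hist["type"], hist["fraud"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.3f}")

# Step 2: fit a logistic regression on the historical (training) data
X = pd.get_dummies(hist[["hour", "type"]], drop_first=True)
model = LogisticRegression().fit(X, hist["fraud"])

# Step 3: score a new transaction and classify it
new = pd.DataFrame({"hour": [2], "type_online": [True]})
p_fraud = model.predict_proba(new)[0, 1]
print(f"P(fraud) = {p_fraud:.2f} ->", "fraudulent" if p_fraud > 0.5 else "legitimate")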

Synopsis

I. Classification and prediction

II. Clustering and similarity

Classification and clustering

• What is classification? What is prediction?

• Decision tree induction

• Bayesian classification

• Other classification methods

• Classification accuracy

• Summary

Overview

Classification and prediction

• Aim: to predict categorical class labels for new tuples/samples

• Input: a training set of tuples/samples, each with a class label

• Output: a model (a classifier) based on the training set and the class labels

What is classification?

Typical classification applications:

• Credit approval

• Target marketing

• Medical diagnosis

• Treatment effectiveness analysis

• Many, many more!

• Prediction is similar to classification:

> constructs a model

> uses the model to predict unknown or missing values

• Major method: regression

> linear and multiple regression

> non-linear regression

What is prediction?

• Classification:

o predicts categorical class labels

o builds a model from the training set and the values of a class label attribute, and uses that model to classify new data

• Prediction:

o models continuous-valued functions

o predicts unknown or missing values

Classification vs. prediction

Training data:

NAME   RANK            YEARS  TENURED
Mary   Assistant Prof  3      no
James  Assistant Prof  7      yes
Bill   Professor       2      yes
John   Associate Prof  7      yes
Mark   Assistant Prof  6      no
Annie  Associate Prof  3      no

A classification algorithm induces a classifier (model) from the training data:

IF rank = ‘professor’ OR years > 6 THEN tenured = yes

An example: model construction

The classifier is then evaluated on testing data:

NAME  RANK            YEARS  TENURED
Tom   Assistant Prof  2      no
Lisa  Associate Prof  7      no
Jack  Professor       5      yes
Ann   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured? Yes
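A minimal sketch, in Python, of applying the induced rule above to the testing data and the unseen tuple (the helper name `tenured` is my own):

def tenured(rank: str, years: int) -> str:
    """The model induced from the training data above."""
    return "yes" if rank == "Professor" or years > 6 else "no"

test = [  # (name, rank, years, actual label)
    ("Tom",  "Assistant Prof", 2, "no"),
    ("Lisa", "Associate Prof", 7, "no"),
    ("Jack", "Professor",      5, "yes"),
    ("Ann",  "Assistant Prof", 7, "yes"),
]

correct = sum(tenured(r, y) == actual for _, r, y, actual in test)
print(f"accuracy on test data: {correct}/{len(test)}")  # 3/4 -- Lisa is misclassified

print("Jeff:", tenured("Professor", 4))  # -> yes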

A decision tree is a tree where:

• internal node = a test on an attribute

• tree branch = an outcome of the test

• leaf node = class label or class distribution

[Figure: a small example tree — attribute tests A, B, C, D at the internal nodes, class labels such as “Yes” at the leaves]

Decision tree induction

Two phases of decision tree generation:

• tree construction

o at start, all the training examples at the root

o partition examples based on selected attributes

o test attributes are selected based on a heuristic or a statistical measure

• tree pruning

o identify and remove branches that reflect noise or outliers

Decision tree generation

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Training set

Decision tree induction – classical example: play tennis?

[Figure: the induced decision tree — the root tests outlook; sunny → test humidity (high → N, normal → P); overcast → P; rain → test windy (true → N, false → P)]

Decision tree

• One rule is generated for each path in the tree from the root to a leaf

• Each attribute-value pair along a path forms a conjunction

• The leaf node holds the class prediction

• Rules are generally simpler to understand than trees

IF outlook = sunny AND humidity = normal THEN play tennis


From a decision tree to classification rules
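As a sketch of how this induction can look in practice, the snippet below fits a tree to the play-tennis table above with scikit-learn and prints each root-to-leaf path as a rule. One-hot encoding and the entropy criterion are my choices here; the slides do not prescribe a library.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [
    ("sunny", "hot", "high", False, "N"), ("sunny", "hot", "high", True, "N"),
    ("overcast", "hot", "high", False, "P"), ("rain", "mild", "high", False, "P"),
    ("rain", "cool", "normal", False, "P"), ("rain", "cool", "normal", True, "N"),
    ("overcast", "cool", "normal", True, "P"), ("sunny", "mild", "high", False, "N"),
    ("sunny", "cool", "normal", False, "P"), ("rain", "mild", "normal", False, "P"),
    ("sunny", "mild", "normal", True, "P"), ("overcast", "mild", "high", True, "P"),
    ("overcast", "hot", "normal", False, "P"), ("rain", "mild", "high", True, "N"),
]
df = pd.DataFrame(rows, columns=["outlook", "temperature", "humidity", "windy", "class"])

# One-hot encode the categorical attributes so the tree can split on them
X = pd.get_dummies(df.drop(columns="class"))
tree = DecisionTreeClassifier(criterion="entropy").fit(X, df["class"])

# Each root-to-leaf path corresponds to one classification rule
print(export_text(tree, feature_names=list(X.columns)))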

Bayesian network analysis


The use of Bayesian methods has become increasingly popular in modern statistical

analysis, with applications in a wide variety of scientific fields.

Bayesian methods incorporate existing information (based on expert knowledge, past

studies, and so on) into current data analysis.

This existing information is represented by a prior distribution, and the data likelihood is

effectively weighted by the prior distribution as the data analysis results are computed.

Synopsis

Because Bayes answers the questions we really care about:

Pr(I have the disease | test +) vs Pr(test + | disease)

Pr(A better than B | data) vs Pr(extreme data | A = B)

In each pair, the first quantity is the Bayesian one and the second is the frequentist one.

Bayes is natural (vs interpreting a CI or a P-value)

Why Bayes?

You are waiting on a subway platform for a train that is known to run on a regular

schedule, only you don’t know how much time is scheduled to pass between train arrivals,

nor how long it’s been since the last train departed.

As more time passes, do you (a) grow more confident that the train will arrive soon,

since its eventual arrival can only be getting closer, not further away, or (b) grow less

confident that the train will arrive soon, since the longer you wait, the more likely it

seems that either the scheduled arrival times are far apart or else that you happened to

arrive just after the last train left – or both.

Example

If you choose (a), you’re thinking like a frequentist.

If you choose (b), you’re thinking like a Bayesian.

An opaque jar contains thousands of beads (but obviously a finite number!). You know

that all the beads are either red or white but you have no idea at all what fraction of

them are red. You begin to draw beads out of the jar at random without replacement.

You notice that all of the first several beads have been red.

As you observe more and more red beads, is the conditional probability (i.e., conditional

upon the previous draws’ colors) of the next bead being red (a) decreasing, as it must,

since you’re removing red beads from a finite population, or (b) increasing, because you

initially didn’t know that there would be so many reds, but now it seems that the jar

must be mostly reds?

Example

If you choose (a), you’re thinking like a frequentist.

If you choose (b), you’re thinking like a Bayesian.

Limiting relative frequency:

P(a) = lim_{n→∞} (# of times a happens in n trials) / n

A nice definition mathematically, but not so great in practice.

(So we often appeal to symmetry…)

What if you can’t get all the way to infinity today?

What if there is only 1 trial? E.g., P(snow tomorrow)

What if appealing to symmetry fails? E.g., I take a penny out of my

pocket and spin it. What is P(H)?

Probability
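To see the definition above in action, here is a small simulation of my own (not from the slides): the relative frequency of heads in simulated fair-coin flips drifts toward 1/2, but only as n grows — you never actually reach infinity.

import random

random.seed(0)
heads = 0
for n in range(1, 100_001):
    heads += random.random() < 0.5   # one simulated fair-coin flip
    if n in (10, 100, 1_000, 10_000, 100_000):
        print(f"n = {n:>7}: relative frequency of heads = {heads / n:.4f}")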

Odds = prob / (1 − prob)

Prob = odds / (1 + odds)

Fair die:

Event     Prob  Odds
roll a 2  1/6   1/5  [or 1/5:1 or 1:5]
even #    1/2   1    [or 1:1]
X > 2     2/3   2    [or 2:1]

Persi Diaconis: “Probability isn’t a fact about the world; probability is a fact about an observer’s knowledge.”

Subjective probability
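A quick sketch of the two conversions above, checked against the fair-die table (the helper names are mine):

def odds(p: float) -> float:
    """Convert a probability to odds."""
    return p / (1 - p)

def prob(o: float) -> float:
    """Convert odds back to a probability."""
    return o / (1 + o)

for event, p in [("roll a 2", 1/6), ("even #", 1/2), ("X > 2", 2/3)]:
    print(f"{event}: prob = {p:.3f}, odds = {odds(p):.3f}")

assert abs(prob(odds(2/3)) - 2/3) < 1e-12   # the conversions invert each other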

P(a|b) = P(a,b) / P(b)

P(b|a) = P(a,b) / P(a)  ⇒  P(a,b) = P(a) P(b|a)

Thus, P(a|b) = P(a,b) / P(b) = P(a) P(b|a) / P(b)

But what is P(b)?  P(b) = Σ_a P(a,b) = Σ_a P(a) P(b|a)

Thus, P(a|b) = P(a) P(b|a) / Σ_{a*} P(a*) P(b|a*)

where a* ranges over all possible values of a.

Bayes’ theorem

P(a|b) = P(a) P(b|a) / Σ_{a*} P(a*) P(b|a*)

P(a|b) ∝ P(a) P(b|a)

posterior ∝ (prior) × (likelihood)

Bayes’ Theorem is used to take a prior probability, update it with data (the likelihood), and get a posterior probability.

Medical test example. Suppose a test is 95% accurate when a disease is present and 97% accurate when the disease is absent. Suppose that 1% of the population has the disease. What is P(have the disease | test +)?

P(dis | test+) = P(dis) P(test+ | dis) / [ P(dis) P(test+ | dis) + P(¬dis) P(test+ | ¬dis) ]

= (0.01)(0.95) / [ (0.01)(0.95) + (0.99)(0.03) ]

= 0.0095 / (0.0095 + 0.0297)

≈ 0.24
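The same arithmetic as a small reusable function (a sketch; the parameter names are mine):

def posterior_positive(prior: float, sens: float, false_pos: float) -> float:
    """P(disease | test +) by Bayes' theorem.

    prior     -- P(disease), here 0.01
    sens      -- P(test + | disease), here 0.95
    false_pos -- P(test + | no disease), here 1 - 0.97 = 0.03
    """
    numerator = prior * sens
    return numerator / (numerator + (1 - prior) * false_pos)

print(round(posterior_positive(0.01, 0.95, 0.03), 2))   # -> 0.24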

Typical statistics problem: there is a parameter, θ, that we want to estimate, and we have some data.

Traditional (frequentist) methods: study and describe P(data | θ). If the data are unlikely for a given θ, then state “that value of θ is not supported by the data.” (A hypothesis test asks whether a particular value of θ might be correct; a CI presents a range of plausible values.)

Bayesian methods: describe the distribution P(θ | data).

A frequentist thinks of θ as fixed (but unknown), while a Bayesian thinks of θ as a random variable that has a distribution.

Bayesian reasoning is natural and easy to think about. It is becoming

much more commonly used.

If Bayes is so great, why hasn’t it always been popular?

(1) Without Markov Chain Monte Carlo, it wasn’t practical.

(2) Some people distrust prior distributions, thinking that science

should be objective (as if that were possible).

Bayes is becoming much more common, due to MCMC.

E.g., in three recent years’ worth of J. Amer. Stat. Assoc. “applications and case studies” papers, 46 of 84 used Bayesian methods (and 4 others merely included an application of Bayes’ Theorem).
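To give a flavour of point (1), here is a minimal random-walk Metropolis sampler — my own illustrative setup, not from the slides — drawing from P(θ | data) for a coin’s heads-probability θ with a uniform prior, after observing 7 heads in 10 flips:

import math
import random

heads, flips = 7, 10   # the observed data

def log_posterior(theta: float) -> float:
    """log of prior x likelihood (up to a constant; the prior is uniform)."""
    if not 0 < theta < 1:
        return -math.inf               # outside the prior's support
    return heads * math.log(theta) + (flips - heads) * math.log(1 - theta)

random.seed(1)
theta, samples = 0.5, []
for _ in range(50_000):
    proposal = theta + random.gauss(0, 0.1)            # random-walk step
    log_ratio = log_posterior(proposal) - log_posterior(theta)
    if log_ratio >= 0 or random.random() < math.exp(log_ratio):
        theta = proposal                               # accept the move
    samples.append(theta)

kept = samples[5_000:]                                 # discard burn-in
print("posterior mean of theta ~", round(sum(kept) / len(kept), 3))
# The exact posterior is Beta(8, 4), whose mean is 8/12 ~ 0.667.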

QUESTIONS?


Food for thought…



Thank you
