The Ohio State University College of Public Health PUBH-BIO 7220 Assignment 7

Instructions: Answer the following questions. Do not just cut and paste output as your answer.

You may include output and code as an appendix at the end of your assignment.

When asked to run a test, please include ALL of the following:

1. The null and alternative hypotheses

2. The result with p-value

3. State if you reject the null hypothesis or not

4. One sentence conclusion in non-statistical language ***

Questions

I. Using the Cleveland data set, fit a logistic regression model with HD_diag as the response and sex,

trestbps, chol, and exang as the covariates. Include your output here.

A. Evaluate the goodness of fit of the model using both the Hosmer-Lemeshow test (using

deciles of risk) and the Pearson Chi-square statistic. Assess whether the results of the 2

tests are consistent.
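
As a rough illustration of what the Hosmer-Lemeshow statistic with deciles of risk measures, the sketch below groups subjects by fitted probability and compares observed with expected event counts in each group. It is a minimal pure-Python sketch on made-up data, not the Cleveland data. The function names (hosmer_lemeshow, chi2_sf_even_df) and the equal-size grouping rule are assumptions for illustration; statistical packages may handle tied probabilities and group boundaries differently, so results need not match software output exactly.

```python
import math

def hosmer_lemeshow(y, p, groups=10):
    """Return (statistic, df) using equal-size groups ordered by fitted risk."""
    pairs = sorted(zip(p, y))  # order subjects by predicted probability
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        n_g = len(chunk)
        observed = sum(yi for _, yi in chunk)  # observed events in the group
        expected = sum(pi for pi, _ in chunk)  # sum of fitted probabilities
        p_bar = expected / n_g
        stat += (observed - expected) ** 2 / (n_g * p_bar * (1 - p_bar))
    return stat, groups - 2

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form holds only for even df."""
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

# Toy data (hypothetical): 10 risk levels, 20 subjects each.
risks = [0.05 + 0.1 * k for k in range(10)]
p = [r for r in risks for _ in range(20)]

# Well-calibrated outcomes: observed events equal expected events in every group.
y_good = []
for k in range(10):
    events = 2 * k + 1
    y_good += [1] * events + [0] * (20 - events)

# Poorly calibrated outcomes: half of every group has the event regardless of risk.
y_bad = ([1] * 10 + [0] * 10) * 10

stat_good, df = hosmer_lemeshow(y_good, p)
stat_bad, _ = hosmer_lemeshow(y_bad, p)
p_good = chi2_sf_even_df(stat_good, df)
p_bad = chi2_sf_even_df(stat_bad, df)
print(stat_good, p_good)
print(stat_bad, p_bad)
```

A small statistic (large p-value) indicates that observed and expected counts agree within the risk groups; a large statistic indicates lack of fit.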

II. On the basis of the logistic model with sex, trestbps, chol, and thalach as

covariates, estimate the sensitivity and specificity of classifying subjects as having or not

having a heart disease diagnosis using a cut-off value for the probability of heart disease of

0.5.

A. Repeat the previous exercise using the cut-off points specified below and fill in the table.

Draw the ROC curve by hand using the values from the table.

Cut-off | Sensitivity | Specificity
0       |             |
0.1     |             |
0.2     |             |
0.3     |             |
0.4     |             |
0.6     |             |
0.8     |             |
1       |             |

B. Use Stata to obtain the ROC curve. What is the discriminatory power of the model?

C. Suppose someone had fraudulently accessed your computer and altered the data of the

dependent variable in such a way that the coefficients of the model would remain the

same. However, the predicted probabilities of the outcome would be largely affected.

What would happen to the goodness of fit statistics?
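
For the hand calculations in Question II, the sketch below shows one way to compute sensitivity and specificity at a set of cut-offs and then approximate the area under the ROC curve with the trapezoidal rule. It uses a small made-up set of outcomes and predicted probabilities, not the Cleveland data. The function names (sens_spec, roc_auc) are hypothetical, and the rule "classify as positive when the predicted probability is at or above the cut-off" is an assumption that should be checked against the convention your software uses.

```python
def sens_spec(y_true, p_hat, cutoff):
    """Classify as positive when p_hat >= cutoff; return (sensitivity, specificity)."""
    tp = sum(1 for y, p in zip(y_true, p_hat) if y == 1 and p >= cutoff)
    fn = sum(1 for y, p in zip(y_true, p_hat) if y == 1 and p < cutoff)
    tn = sum(1 for y, p in zip(y_true, p_hat) if y == 0 and p < cutoff)
    fp = sum(1 for y, p in zip(y_true, p_hat) if y == 0 and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def roc_auc(points):
    """Trapezoidal area under ROC points given as (1 - specificity, sensitivity)."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Toy data (hypothetical): observed outcomes and fitted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
p_hat = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

# Cut-offs from the table, plus 0.5 from the question stem.
cutoffs = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1]
table = [(c, *sens_spec(y_true, p_hat, c)) for c in cutoffs]
roc_points = [(1 - spec, sens) for _, sens, spec in table]
auc = roc_auc(roc_points)

for c, sens, spec in table:
    print(f"{c:<4} sensitivity={sens:.2f} specificity={spec:.2f}")
print(f"trapezoidal AUC = {auc:.2f}")
```

The ROC curve drawn by hand plots sensitivity against 1 - specificity for each cut-off; with only a few cut-offs, the trapezoidal area is an approximation to the area reported by software that uses every distinct predicted probability.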

III. Fit a model with sex and fbs as covariates. Assess the overall fit of the model and its

discriminatory power by conducting the Hosmer-Lemeshow goodness of fit test and calculating

the area under the curve.

A. Estimate the predicted probability of the outcome.

The data in hyponatremia.dta derive from an epidemiological study of hyponatremia (a

life-threatening condition) among runners of the 2002 Boston Marathon. Hyponatremia is defined as

an electrolyte disturbance in which the serum sodium concentration is lower than normal (<135

mmol/l). The aim of the study was to determine whether a runner experienced hyponatremia and

to identify the principal risk factors. Participants in the 2002 Boston Marathon completed a

survey including demographic and anthropometric characteristics (BMI) one or two days before

the race. After the race, runners provided a blood sample in order to measure their serum sodium

concentration and completed a questionnaire detailing their urine output during the race. Prerace

and postrace weights were also recorded. Use the hyponatremia dataset for the following

exercises.

IV. Run a logistic regression model with nas135 as the dependent variable and female and

urinat3p as covariates and estimate the predicted probability of the outcome. Conduct

the Hosmer-Lemeshow goodness of fit test and calculate the area under the ROC curve.

Include your output here.

V. Make a frequency table with nas135, female and urinat3p.

A. Open a new Stata session in which you create a dataset with these variables only in

aggregated form. Generate a variable named freq which is the frequency of each cell in

the table. The new dataset will have 8 rows, 1 for each combination of nas135,

female and urinat3p. Run a logistic regression model with nas135 as the

dependent variable and female and urinat3p as covariates. Include your output

here.

B. What are the estimated predicted probabilities of the outcome?

C. Compare the coefficients and estimated predicted probability of outcome for this model

to those of the model using the original dataset.

D. Alter the odds of the outcome for each of the 4 female and urinat3p combinations:

create a new variable, named fakefreq, that has the value of 31 (nas135=1, female=0,

urinat3p=0), 45 (nas135=1, female=1, urinat3p=0), 6 (nas135=1, female=0,

urinat3p=1), and 6 (nas135=1, female=1, urinat3p=1). The total number of

observations in each of the 4 subgroups should not change; therefore, the frequency for

the nas135=0 cells should change accordingly.

E. Fit a model with female and urinat3p using fakefreq instead of freq as the weight.

Compute the estimated probabilities of the outcome and compare them with those

estimated from the original data.

F. Conduct the Hosmer-Lemeshow goodness of fit test and calculate the area under the

ROC curve. Compare both statistics with those obtained from the original data.
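
The idea behind the aggregated dataset in Question V can be sketched as follows: collapsing individual records into one row per cell, with a frequency weight, leaves the log-likelihood unchanged for any set of coefficients, which is why a frequency-weighted fit is expected to reproduce the coefficients from the individual-level data. The records below are made up and are not the hyponatremia data. The function name log_likelihood and the trial coefficients are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def log_likelihood(beta, rows):
    """Weighted Bernoulli log-likelihood; rows are (nas135, female, urinat3p, weight)."""
    b0, b1, b2 = beta
    total = 0.0
    for y, female, urinat3p, w in rows:
        eta = b0 + b1 * female + b2 * urinat3p
        prob = 1 / (1 + math.exp(-eta))  # logistic predicted probability
        total += w * (y * math.log(prob) + (1 - y) * math.log(1 - prob))
    return total

# Hypothetical individual-level records (nas135, female, urinat3p).
individual = ([(1, 0, 0)] * 3 + [(0, 0, 0)] * 7 + [(1, 1, 0)] * 5 + [(0, 1, 0)] * 5
              + [(1, 0, 1)] * 1 + [(0, 0, 1)] * 4 + [(1, 1, 1)] * 2 + [(0, 1, 1)] * 3)

# Collapse to one row per cell with its frequency (8 rows here).
freq = Counter(individual)
aggregated = [(y, f, u, n) for (y, f, u), n in sorted(freq.items())]

# The same data with one row per subject, each with weight 1.
expanded = [(y, f, u, 1) for y, f, u in individual]

beta = (-0.5, 0.8, -0.3)  # arbitrary trial coefficients, not estimates
ll_expanded = log_likelihood(beta, expanded)
ll_aggregated = log_likelihood(beta, aggregated)
print(len(aggregated), ll_expanded, ll_aggregated)
```

Because the two log-likelihoods agree for every choice of coefficients, they share the same maximum. Replacing the frequencies (as with fakefreq) changes the likelihood and therefore the fitted probabilities.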

VI. Fit the model with runtime, wtdiff, bmi and bmi2 as covariates using the original

dataset, where bmi2 = bmi*bmi. Compute the leverage h, the change in chi-square ΔX², the

change in deviance and the influence diagnostic ∆