MAST90102

Assessment 5

Guidelines for the length and structure of responses are given in the questions. Please

ensure your submission is as polished and professionally presented as possible; please see

journal articles as a guide.

Part 1 (28 marks)

The dataset chol_riskf.dta provides data from 239 unrelated individuals from the Victorian

Family Heart Study. Total cholesterol (totchol) is the outcome measure.

The variables in this dataset are:

Variable name | Description |

proxyid | Unique identifier |

male | 0=Female, 1=Male |

age | Age (years) |

hgt | Height (cm) |

wgt | Weight (kg) |

bmi | Body mass index (kg/m²) |

smoke | Smoking status (0=non-smoker, 1=ex-smoker, 2=current smoker (≤ 20 cigarettes per day), 3=current smoker (> 20 cigarettes per day)) |

totchol | Total cholesterol level (mmol/litre) |

If you perform some quick exploratory analysis you will see that the total cholesterol variable

is reasonably well behaved: please analyse this variable on its original scale. (Do not spend

time in your responses considering whether or not transformation of the outcome is

necessary.)

You are asked to consider the following exposure measures, which are both of potential

interest because they relate to total cholesterol:

wgt | Weight (kg) |

bmi | Body mass index (kg/m²) |

The other variables (age, male sex, and smoking status) are all to be considered as potential

covariates in multiple regression analysis, as described in the questions below.

[Note that the usual caveats apply: these data have been sampled and modified from an

original study and no substantive conclusions should be drawn from these analyses.]

The overall aim of your analysis is to examine the evidence for an association between total

cholesterol and the two exposure measures using regression methods, following the outline

below.

For parts 1a and 1b, written explanations and interpretations, tables and graphs should be

provided. Computer output or code should be provided in an Appendix.

Perform the following steps (marks indicated):

(1a) [6 marks – 1 page limit]

Use a multiple linear regression model to obtain estimates of association between total

cholesterol and the two exposure measures simultaneously. (For part (a) ignore all of the

other covariates.) Would you recommend omitting one of the exposure measures from the

model? Is there a collinearity problem?
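
[Hint: a minimal Stata sketch of one way to start part (1a) is given below. It assumes that chol_riskf.dta is in the current working directory. The diagnostics shown (a pairwise correlation and variance inflation factors) are only one possible choice and are not a prescribed method.]

```stata
* Load the Part 1 data (assumption: the file is in the working directory)
use chol_riskf.dta, clear

* How strongly are the two exposure measures related to each other?
correlate wgt bmi

* Fit both exposures together, ignoring the other covariates
regress totchol wgt bmi

* Variance inflation factors for the model just fitted
estat vif
```

Comparing the standard errors from this model with those from models containing each exposure on its own is another way to see the effect of any collinearity.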

(1b) [10 marks – 2 page limit]

Your collaborator is concerned about potential confounding effects and effect modification.

They have stated in a Statistical Analysis Plan (SAP) that (i) age, sex and smoking status are

confounders, and that (ii) a further analysis will investigate if sex modifies the association

between the exposures of interest and total cholesterol.

Perform the analyses for (i) and (ii) above and provide appropriate tables, figures and text

that interprets the findings.

For the potential confounding effect of sex, age and smoking status, comment on whether

adjustment for these variables affects the associations found in part (a). Can you explain why,

for the major effects that you observe? (This should be in general statistical terms, without

needing to be an expert in the subject matter.)

[N.B. For all of these analyses, including part (a), you should investigate if the association

between continuous covariates and the outcome is linear. Note, you do not need to go into

extensive detail with respect to investigating particular influential points, unless you identify

major issues that would affect the overall interpretations.]
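
[Hint: the sketch below illustrates Stata syntax that may be useful for part (1b). It assumes that chol_riskf.dta is in the working directory. The interaction is shown for bmi only, purely as an illustration of the syntax; which terms to include is a decision for your own analysis.]

```stata
use chol_riskf.dta, clear

* Scatterplot with a lowess smoother to eyeball linearity for one exposure
twoway (scatter totchol bmi) (lowess totchol bmi)

* Adjusted model: the i. prefix treats male and smoke as categorical
regress totchol wgt bmi age i.male i.smoke

* Residuals against fitted values, and against one continuous covariate
rvfplot, yline(0)
rvpplot age, yline(0)

* Effect modification by sex (illustrated for bmi only):
* ## adds both main effects and their interaction
regress totchol i.male##c.bmi wgt age i.smoke

* Test of the interaction term
testparm i.male#c.bmi
```

The same pattern (replace bmi with wgt) gives the corresponding interaction for body weight.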

(1c) [6 marks – 200 words limit]

Provide a statistical analysis paragraph (as commonly given in the methods sections of

medical research articles – see British Medical Journal (BMJ; www.bmj.com/theBMJ) for

examples) that describes your analysis.

(1d) [6 marks – 200 words limit]

Conclude with a general summary that describes the findings for the associations between

the two exposures (body weight and body mass index) and the outcome, total cholesterol.

This should take the form of a single paragraph that summarises the main results and

attempts to interpret them.

Part 2 (14 marks)

A sexual health researcher has asked you for some statistical help in interpreting the results

of their study. In this study, the researcher randomised 100 people into 4 different education

interventions, and measured their knowledge on sexually transmitted infections (STIs) one

month later. The knowledge score is measured on a scale from 0 to 25 and the education

groups are as follows:

Group A: Control group

Group B: A one-on-one discussion with a nurse about STIs

Group C: A fact sheet / brochure

Group D: A group presentation

The data are provided in the dataset “knowledge.dta”.

The researcher has previously completed an introductory statistics course, and analysed the

scores across groups using the Stata code below, where variables B, C, and D represent

indicator variables for education groups B, C and D respectively, and ‘score’ represents the

knowledge score.

regress score B C D

      Source |       SS           df       MS      Number of obs   =       100
-------------+----------------------------------   F(3, 96)        =      2.70
       Model |      214.16         3  71.3866667   Prob > F        =    0.0497
    Residual |      2534.4        96        26.4   R-squared       =    0.0779
-------------+----------------------------------   Adj R-squared   =    0.0491
       Total |     2748.56        99  27.7632323   Root MSE        =    5.1381

------------------------------------------------------------------------------
       score |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           B |       2.36   1.453272     1.62   0.108    -.5247225    5.244722
           C |       2.32   1.453272     1.60   0.114    -.5647225    5.204722
           D |       4.12   1.453272     2.83   0.006     1.235278    7.004722
       _cons |      14.68   1.027619    14.29   0.000     12.64019    16.71981
------------------------------------------------------------------------------

(2a) [4 marks]

The researcher interprets the results as telling him that group D (group presentation) is the

only one that produces a higher knowledge score than the control group. Why does he make

this conclusion and what is wrong with it?

(2b) [6 marks]

Following the conclusion that he reached in part 2a, the researcher decided to leave the

“non-significant” indicator variables out of the regression model and obtained the following

results:

regress score D

      Source |       SS           df       MS      Number of obs   =       100
-------------+----------------------------------   F(1, 98)        =      4.59
       Model |      122.88         1      122.88   Prob > F        =    0.0347
    Residual |     2625.68        98  26.7926531   R-squared       =    0.0447
-------------+----------------------------------   Adj R-squared   =    0.0350
       Total |     2748.56        99  27.7632323   Root MSE        =    5.1762

------------------------------------------------------------------------------
       score |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           D |       2.56   1.195383     2.14   0.035     .1878005    4.932199
       _cons |      16.24   .5976917    27.17   0.000      15.0539     17.4261
------------------------------------------------------------------------------

He suggests that this provides the simplest summary result, and asks you to explain why the

coefficient estimate has reduced (and the P-value increased) compared with the previous

model. Would you recommend that this estimate be reported? Explain why or why not and

provide a detailed explanation of what it means, using a little algebra to make this clear.

(2c) [4 marks]

Having persuaded the investigator that the initial approach above was not addressing his

specific questions of interest, you find after further discussion that he is primarily interested

in the following comparisons among the education interventions:

(i) Control group versus all other education interventions combined together

(ii) The fact sheet / brochure (C) compared with the more interactive interventions

combined together (B and D)

Express these comparisons in terms of the β’s using a regression model with binary group

indicators for interventions B, C, and D, and estimate them using the data and the

appropriate Stata command.
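
[Hint: linear combinations of regression coefficients can be estimated in Stata with lincom after fitting the model. The sketch below assumes that knowledge.dta is in the working directory and contains the variables score, B, C and D used in the output above. The two combinations shown are illustrations of the syntax only and are deliberately not the comparisons asked for in (i) and (ii).]

```stata
* Assumption: knowledge.dta is in the working directory
use knowledge.dta, clear

* Model with the control group (A) as the reference category
regress score B C D

* Illustration 1: difference between groups D and B
* (from the coefficients printed above, 4.12 - 2.36 = 1.76)
lincom D - B

* Illustration 2: estimated mean score in group D
* (from the coefficients printed above, 14.68 + 4.12 = 18.80)
lincom _cons + D
```

lincom reports the estimate together with its standard error, P-value and 95% confidence interval, so a comparison can be reported directly once it has been written as a combination of coefficients.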

Part 3 (18 marks)

For this question you will need to analyse the dataset of 500 patients who were randomised

to either a new treatment or the standard treatment.

The variables in the dataset were simulated for this question and are coded as:

treatment | new treatment (coded as 1) & standard treatment (coded as 0) |

fscore_0 | foot score (measuring pain in foot on a scale of 0 to 100 where a higher score indicates less pain) at baseline. |

fscore_12 | foot score at 12 months. |

(3a) [4 marks]

Provide a table of the distribution of foot score at baseline and 12 months by treatment

group and describe this table in a single paragraph.
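
[Hint: one possible way to obtain the summary statistics for part (3a) is sketched below. It assumes the Part 3 dataset is already loaded in Stata; the choice of statistics is only a suggestion.]

```stata
* Assumption: the Part 3 dataset is already in memory

* Summaries of both foot scores within each treatment group;
* n shows how many non-missing values contribute to each summary
tabstat fscore_0 fscore_12, by(treatment) ///
    statistics(n mean sd median min max) columns(statistics)
```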

(3b) [4 marks]

In question 3a you would have noticed that there are missing data for foot score at 12

months for some of the trial participants. Provide a table of the distribution of treatment and

foot score at baseline for those with and without foot score measurements at 12 months

and describe this table in a single paragraph.
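
[Hint: a minimal sketch for part (3b) is given below. It assumes the Part 3 dataset is already loaded in Stata; miss12 is a hypothetical name for a new indicator variable and is not part of the dataset.]

```stata
* Assumption: the Part 3 dataset is already in memory

* Indicator (hypothetical name): 1 if the 12-month foot score is missing
generate byte miss12 = missing(fscore_12)

* Treatment group by missingness, with row percentages
tabulate miss12 treatment, row

* Baseline foot score summarised by missingness
tabstat fscore_0, by(miss12) statistics(n mean sd median min max)
```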

(3c) [4 marks]

Perform a linear regression, adjusting for baseline foot score, to estimate the association

between treatment and foot score at 12 months. This analysis is known as a complete-case

analysis as only those with complete data on all variables in the regression model are

included. Comment on the potential limitations of this analysis in terms of bias and precision.

(3d) [6 marks]

Sometimes researchers use an ad hoc approach, known as the last observation carried

forward (LOCF) to handle missing data in the outcome of the trial. Here the missing values

for the foot score at 12 months are replaced by the participant’s foot score at baseline.

Perform this analysis and comment on how the estimate and standard error for treatment

have changed compared to the complete-case estimate and standard error in part 3c.

Comment on the major assumption that this approach makes and why it is therefore not

recommended.
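
[Hint: the sketch below shows one way the complete-case and LOCF analyses in parts (3c) and (3d) could be set up side by side. It assumes the Part 3 dataset is already loaded in Stata; fscore_12_locf, cc and locf are hypothetical names introduced only for this illustration.]

```stata
* Assumption: the Part 3 dataset is already in memory

* Complete-case analysis: regress drops rows with a missing fscore_12
regress fscore_12 treatment fscore_0
estimates store cc

* LOCF: copy the outcome, then fill missing values with the baseline score
generate fscore_12_locf = fscore_12
replace fscore_12_locf = fscore_0 if missing(fscore_12)

* Same model fitted to the LOCF-filled outcome
regress fscore_12_locf treatment fscore_0
estimates store locf

* Coefficients, standard errors and sample sizes side by side
estimates table cc locf, b(%9.3f) se(%9.3f) stats(N)
```

The N row of the final table shows how many participants contribute to each analysis, which is worth keeping in mind when comparing the two standard errors.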