MAST90102 Assessment 5

Guidelines for the length and structure of responses are given in the questions. Please
ensure your submission is as polished and professionally presented as possible; please see
journal articles as a guide.
Part 1 (28 marks)
The dataset chol_riskf.dta provides data from 239 unrelated individuals from the Victorian
Family Heart Study. Total cholesterol (totchol) is the outcome measure.
The variables in this dataset are:

Variable name   Description
proxyid         Unique identifier
male            0=Female, 1=Male
age             Age (years)
hgt             Height (cm)
wgt             Weight (kg)
bmi             Body mass index (kg/m²)
smoke           Smoking status (0=non-smoker, 1=ex-smoker, 2=current smoker (≤ 20 cigarettes per day), 3=current smoker (> 20 cigarettes per day))
totchol         Total cholesterol level (mmol/litre)

If you perform some quick exploratory analysis you will see that the total cholesterol variable
is reasonably well behaved: please analyse this variable on its original scale. (Do not spend
time in your responses considering whether or not transformation of the outcome is
necessary.)
You are asked to consider the following exposure measures, which are both of potential
interest because they relate to total cholesterol:

wgt             Weight (kg)
bmi             Body mass index (kg/m²)

The other variables (age, male sex, and smoking status) are all to be considered as potential
covariates in multiple regression analysis, as described in the questions below.
[Note that the usual caveats apply: these data have been sampled and modified from an
original study and no substantive conclusions should be drawn from these analyses.]
The overall aim of your analysis is to examine the evidence for an association between total
cholesterol and the two exposure measures using regression methods, following the outline
below.
For parts 1a and 1b, written explanations and interpretations, tables and graphs should be
provided. Computer output or code should be provided in an Appendix.
Perform the following steps (marks indicated):
(1a) [6 marks – 1 page limit]
Use a multiple linear regression model to obtain estimates of association between total
cholesterol and the two exposure measures simultaneously. (For part (a) ignore all of the
other covariates.) Would you recommend omitting one of the exposure measures from the
model? Is there a collinearity problem?
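As a starting point only, the commands below sketch one way to fit the two-exposure model and look at collinearity in Stata. The variable names come from the table above; the choice of `estat vif` and `correlate` as diagnostics is an assumption about what is useful here, not a prescribed method.

```stata
* Load the Part 1 dataset named in the brief
use chol_riskf.dta, clear

* Total cholesterol on both exposures, no other covariates
regress totchol wgt bmi

* Variance inflation factors for the model just fitted
estat vif

* Pairwise correlation between the two exposures
correlate wgt bmi

* For comparison: each exposure on its own
regress totchol wgt
regress totchol bmi
```

Comparing the standard errors of wgt and bmi in the joint model with those from the single-exposure models is one simple way to see how much precision is lost when both are included.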
(1b) [10 marks – 2 page limit]
Your collaborator is concerned about potential confounding effects and effect modification.
They have stated in a Statistical Analysis Plan (SAP) that (i) age, sex and smoking status are
confounders, and that (ii) a further analysis will investigate if sex modifies the association
between the exposures of interest and total cholesterol.
Perform the analyses for (i) and (ii) above and provide appropriate tables, figures and text
that interprets the findings.
For the potential confounding effect of sex, age and smoking status, comment on whether
adjustment for these variables affects the associations found in part (a). Can you explain why,
for the major effects that you observe? (This should be in general statistical terms, without
needing to be an expert in the subject matter.)
[N.B. For all of these analyses, including part (a), you should investigate if the association
between continuous covariates and the outcome is linear. Note, you do not need to go into
extensive detail with respect to investigating particular influential points, unless you identify
major issues that would affect the overall interpretations.]
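The analyses in (i) and (ii) and the linearity checks could be set up along the following lines. This is a minimal sketch using Stata factor-variable notation; treating smoke as categorical, and using residual plots and lowess smooths to assess linearity, are assumptions rather than requirements of the brief.

```stata
use chol_riskf.dta, clear

* (i) Adjust for the confounders named in the SAP
*     (i. marks male and smoke as categorical)
regress totchol wgt bmi age i.male i.smoke

* (ii) Allow the wgt and bmi slopes to differ by sex
regress totchol c.wgt##i.male c.bmi##i.male age i.smoke

* Joint test of the two sex-by-exposure interaction terms
testparm i.male#c.wgt i.male#c.bmi

* Linearity checks: residuals against fitted values,
* and a smoothed plot of the outcome against a continuous covariate
rvfplot, yline(0)
lowess totchol age
```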
(1c) [6 marks – 200 words limit]
Provide a statistical analysis paragraph (as commonly given in the methods sections of
medical research articles; see the British Medical Journal (BMJ) for
examples) that describes your analysis.
(1d) [6 marks – 200 words limit]
Conclude with a general summary that describes the findings for the associations between
the two exposures (body weight and body mass index) and the outcome, total cholesterol.
This should take the form of a single paragraph that summarises the main results and
attempts to interpret them.
Part 2 (14 marks)
A sexual health researcher has asked you for some statistical help in interpreting the results
of their study. In this study, the researcher randomised 100 people into 4 different education
interventions, and measured their knowledge on sexually transmitted infections (STIs) one
month later. The knowledge score is measured on a scale from 0 to 25 and the education
groups are as follows:
Group A: Control group
Group B: A one on one discussion with a nurse about STIs
Group C: A fact sheet / brochure
Group D: A group presentation
The data are provided in the dataset “knowledge.dta”.
The researcher has previously completed an introductory statistics course, and analysed the
scores across groups using the Stata code below, where variables B, C, and D represent
indicator variables for education groups B, C and D respectively, and ‘score’ represents the
knowledge score.
regress score B C D

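      Source |       SS           df       MS      Number of obs   =       100
-------------+----------------------------------   F(3, 96)        =      2.70
       Model |      214.16         3  71.3866667   Prob > F        =    0.0497
    Residual |      2534.4        96        26.4   R-squared       =    0.0779
-------------+----------------------------------   Adj R-squared   =    0.0491
       Total |     2748.56        99  27.7632323   Root MSE        =    5.1381

------------------------------------------------------------------------------
       score |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           B |       2.36   1.453272     1.62   0.108    -.5247225    5.244722
           C |       2.32   1.453272     1.60   0.114    -.5647225    5.204722
           D |       4.12   1.453272     2.83   0.006     1.235278    7.004722
       _cons |      14.68   1.027619    14.29   0.000     12.64019    16.71981
------------------------------------------------------------------------------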

(2a) [4 marks]
The researcher interprets the results as telling him that group D (group presentation) is the
only one that produces a higher knowledge score than the control group. Why does he draw
this conclusion and what is wrong with it?
(2b) [6 marks]
Following the conclusion that he reached in part 2a, the researcher decided to leave the
“non-significant” indicator variables out of the regression model and obtained the following
results:
regress score D

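      Source |       SS           df       MS      Number of obs   =       100
-------------+----------------------------------   F(1, 98)        =      4.59
       Model |      122.88         1      122.88   Prob > F        =    0.0347
    Residual |     2625.68        98  26.7926531   R-squared       =    0.0447
-------------+----------------------------------   Adj R-squared   =    0.0350
       Total |     2748.56        99  27.7632323   Root MSE        =    5.1762

------------------------------------------------------------------------------
       score |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           D |       2.56   1.195383     2.14   0.035     .1878005    4.932199
       _cons |      16.24   .5976917    27.17   0.000      15.0539     17.4261
------------------------------------------------------------------------------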

He suggests that this provides the simplest summary result, and asks you to explain why the
coefficient estimate has reduced (and the P-value increased) compared with the previous
model. Would you recommend that this estimate be reported? Explain why or why not and
provide a detailed explanation of what it means, using a little algebra to make this clear.
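The kind of algebra asked for here might be laid out as follows. This is a sketch, not a model answer. The symbols are introduced for illustration: $\mu_A,\dots,\mu_D$ are the group mean scores, $\beta$ denotes coefficients in the four-group model and $\gamma$ those in the model with D only. It assumes equal group sizes of 25, which the brief does not state but which is consistent with the printed standard errors.

```latex
\begin{align*}
\text{Full model:}\quad
  & E[\text{score}] = \beta_0 + \beta_B B + \beta_C C + \beta_D D,
  \qquad \beta_0 = \mu_A,\quad \beta_k = \mu_k - \mu_A \ \ (k = B, C, D) \\
\text{Reduced model:}\quad
  & E[\text{score}] = \gamma_0 + \gamma_D D,
  \qquad \gamma_0 = \tfrac{1}{3}(\mu_A + \mu_B + \mu_C)
                  = \beta_0 + \tfrac{1}{3}(\beta_B + \beta_C) \\
  & \gamma_D = \mu_D - \gamma_0 = \beta_D - \tfrac{1}{3}(\beta_B + \beta_C) \\
\text{Check against the output:}\quad
  & \hat\gamma_0 = 14.68 + \tfrac{1}{3}(2.36 + 2.32) = 16.24,
  \qquad \hat\gamma_D = 4.12 - 1.56 = 2.56
\end{align*}
```

Under these assumptions the reference category in the reduced model is the pooled set of groups A, B and C rather than the control group alone, which is why the coefficient for D changes.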
(2c) [4 marks]
Having persuaded the investigator that the initial approach above was not addressing his
specific questions of interest, you find after further discussion that he is primarily interested
in the following comparisons among the education interventions:
(i) Control group versus all other education interventions combined together
(ii) The fact sheet / brochure (C) compared with the more interactive interventions
combined together (B and D)
Express these comparisons in terms of the β's using a regression model with binary group
indicators for interventions B, C, and D, and estimate them using the data and the
appropriate Stata command.
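One possible way to estimate such comparisons is Stata's `lincom`, run after the regression with the three indicators. The sketch below reads "combined together" as an equally weighted average of group means, and the direction of each contrast (interventions minus control; C minus the average of B and D) is a choice made for illustration.

```stata
use knowledge.dta, clear
regress score B C D

* (i) Average of the three interventions minus control:
*     (beta_B + beta_C + beta_D)/3
lincom (B + C + D)/3

* (ii) Fact sheet (C) minus the average of B and D:
*     beta_C - (beta_B + beta_D)/2
lincom C - (B + D)/2
```

`lincom` reports the estimate with its standard error, test statistic and 95% confidence interval, so each comparison can be reported in the same form as a regression coefficient.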
Part 3 (18 marks)
For this question you will need to analyse the dataset of 500 patients who were randomised
to either a new treatment or the standard treatment.
The variables in the dataset were simulated for this question and are coded as:

new treatment (coded as 1) & standard treatment (coded as 0)
foot score (measuring pain in the foot on a scale of 0 to 100, where a higher score indicates less pain) at baseline
fscore_12 foot score at 12 months
(3a) [4 marks]

Provide a table of the distribution of foot score at baseline and 12 months by treatment
group and describe this table in a single paragraph.
(3b) [4 marks]
In question 3a you would have noticed that there are missing data for foot score at 12
months for some of the trial participants. Provide a table of the distribution of treatment and
foot score at baseline for those with and without foot score measurements at 12 months
and describe this table in a single paragraph.
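The summaries for 3a and 3b could be produced roughly as below. Only `fscore_12` is named in the variable list above, so `treat` (treatment indicator) and `fscore_0` (baseline foot score) are hypothetical names that must be replaced with the real ones. The indicator `miss12` and the choice of summary statistics are also illustrative.

```stata
* HYPOTHETICAL names: treat (0/1 treatment), fscore_0 (baseline score).
* Replace them with the names used in the dataset.

* 3a: foot score at baseline and 12 months, by treatment group
tabstat fscore_0 fscore_12, by(treat) statistics(n mean sd median min max)

* 3b: flag participants with a missing 12-month score (miss12 is illustrative)
generate byte miss12 = missing(fscore_12)

* Treatment group by missingness, with column percentages
tabulate treat miss12, column

* Baseline foot score by missingness
tabstat fscore_0, by(miss12) statistics(n mean sd median min max)
```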
(3c) [4 marks]
Perform a linear regression, adjusting for baseline foot score, to estimate the association
between treatment and foot score at 12 months. This analysis is known as a complete-case
analysis as only those with complete data on all variables in the regression model are
included. Comment on the potential limitations of this analysis in terms of bias and precision.
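A complete-case fit might look like the sketch below. As before, `treat` and `fscore_0` are hypothetical variable names, since the variable list names only `fscore_12`.

```stata
* HYPOTHETICAL names: treat (0/1 treatment), fscore_0 (baseline score).
* regress drops every participant with a missing value in any model
* variable, so this is a complete-case analysis by default.
regress fscore_12 i.treat fscore_0

* Number of participants actually used in the fit (out of 500)
display e(N)
```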
(3d) [6 marks]
Sometimes researchers use an ad hoc approach, known as last observation carried
forward (LOCF), to handle missing data in the outcome of the trial. Here the missing values
for the foot score at 12 months are replaced by the participant’s foot score at baseline.
Perform this analysis and comment on how the estimate and standard error for treatment
have changed compared to the complete-case estimate and standard error in part 3c.
Comment on the major assumption that this approach makes and why it is therefore not
recommended.
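The LOCF replacement described above could be coded as follows. This is a sketch with hypothetical variable names `treat` and `fscore_0`, and `fscore_12_locf` is a new variable introduced here so that the original outcome is left untouched.

```stata
* HYPOTHETICAL names: treat (0/1 treatment), fscore_0 (baseline score).
* Copy the outcome, then fill missing 12-month scores with baseline scores.
generate fscore_12_locf = fscore_12
replace fscore_12_locf = fscore_0 if missing(fscore_12)

* Same model as the complete-case analysis, now with all participants
regress fscore_12_locf i.treat fscore_0
```

Setting the two regression tables side by side makes the change in the treatment coefficient and its standard error easy to see.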
