CS2035 (Jan-May 2020)
You have been hired as a data analysis consultant by a large poultry farm. This farm has
recently run an experiment to determine whether the protein diet currently given to its chicks
can be improved. The experiment consisted of weighing the chicks at birth and every
second day thereafter until day 20. A final measurement was also taken on day 21. The
chicks were randomly divided into four distinct groups and each group received a different
protein diet.
The company sent you the file chicks.csv that contains the results of this experiment.
The variable weight is the chick weight in grams, Time is the day of the weighing, Chick
is the identification number of the weighed chick and Diet is the diet this chick was
fed. Diet 1 is the current protein diet and diets 2, 3 and 4 are the prospective ones.
The farm sells chicks for consumption when they are either 12 or 21 days old. The
heavier the chick, the higher its price, hence the farm seeks to maximize chick weight
at 12 and 21 days of age.
The company would like you to determine if one (or more) of the three prospective diets
could potentially replace the existing one in order to increase the chicks' weight. Write a
short paragraph with your final recommendations to the poultry farm. Write a Matlab
script that generates a rigorous statistical analysis, figures (no more than three) and a
summary table to support your recommendations. The figure(s) and summary table
should convey as clearly as possible your conclusion(s) to the client.
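As a starting point, the comparison of diets at the two sale ages could be sketched as below. This is only one possible approach, not a prescribed solution: it assumes a one-way ANOVA per sale day followed by Tukey-Kramer pairwise comparisons, that Diet is read as the numbers 1 to 4, and that the Statistics and Machine Learning Toolbox is available. The synthetic table is a stand-in so that the sketch runs without chicks.csv; it is not real data.

```matlab
% Sketch: compare diets at days 12 and 21 (one-way ANOVA + Tukey-Kramer).
% Column names (weight, Time, Chick, Diet) follow the handout.
if isfile('chicks.csv')
    T = readtable('chicks.csv');
else
    % Synthetic stand-in so the sketch runs without the file (NOT real data).
    rng(1);
    Diet   = repelem((1:4)', 20);          % 4 diets, 20 rows each
    Time   = repmat([12; 21], 40, 1);      % each chick measured on days 12 and 21
    Chick  = ceil((1:80)' / 2);            % 2 rows per chick
    weight = 100 + 5*Time + 10*Diet + 15*randn(80, 1);
    T = table(weight, Time, Chick, Diet);
end

saleDays = [12 21];
pAnova   = nan(size(saleDays));
for k = 1:numel(saleDays)
    S = T(T.Time == saleDays(k), :);       % one row per chick on that day
    [pAnova(k), ~, stats] = anova1(S.weight, S.Diet, 'off');
    pairs = multcompare(stats, 'Display', 'off');
    % Assumes group index 1 is diet 1 (numeric groups are sorted ascending).
    vsCurrent = pairs(pairs(:, 1) == 1, :);
    fprintf('Day %d: ANOVA p = %.3g\n', saleDays(k), pAnova(k));
    % MeanDiff = mean(diet 1) - mean(other diet); negative favours the other diet.
    disp(array2table(vsCurrent(:, [2 4 6]), ...
        'VariableNames', {'Diet', 'MeanDiff', 'pValue'}));
end

% Summary table: group size, mean and standard deviation per sale day and diet.
summaryTbl = groupsummary(T(ismember(T.Time, saleDays), :), ...
    {'Time', 'Diet'}, {'mean', 'std'}, 'weight');
disp(summaryTbl);
```

Because each chick contributes a single weight per day, analysing days 12 and 21 separately avoids the repeated-measures structure of the full dataset. The ANOVA assumptions (approximately normal residuals, similar variances across diets) would still need to be checked on the real data.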
University of Western Ontario CS2035 (Jan-May 2020)
Exercise 2 – Diabetes in Pima Women
You are the data analyst of a scientific team in the US National Institute of Diabetes and
Digestive and Kidney Diseases. Your team investigates diabetes risk among Native
Americans. The epidemiologists of your group have just finished collecting data on a
population of adult women of Pima Indian heritage living near Phoenix, Arizona. In this
dataset, the women were tested for diabetes according to World Health Organization
criteria and the following six biological variables were also recorded:
• the number of pregnancies (npreg)
• plasma glucose concentration in an oral glucose tolerance test in mg/dL (glu)
• diastolic blood pressure in mm Hg (bp)
• triceps skin fold thickness in mm (skin)
• body mass index in kg m−2
• age in years (age)
The file diabetes-pima.csv contains the collected data, and the variable diabetes
codes whether a woman is diabetic (Yes) or not (No).
Your clinical collaborators would like to infer from this dataset the probability that a
woman of Pima Indian heritage who is not already present in this dataset is diabetic, when
considering each of the six variables above independently.
a) Provide your collaborators a Matlab function that calculates the probability that
a woman of Pima Indian heritage is diabetic given any one of the six biological
variables (and independently of the other five).
b) Generate one figure that compares how that probability varies across a relevant
range of values for all six biological variables (again, independently of one
another).
c) One of your collaborators wants to know if the patients represented in this dataset
are homogeneous when we consider the six biological variables. To address her
concern visually, produce a classical multi-dimensional scaling map in two
dimensions where each data point is coloured according to the patient’s diabetic
status. What is your conclusion?
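For part a), one common way to obtain such a probability is a univariate logistic regression, sketched below. The handout does not prescribe a method, so this choice is an assumption, as is the helper name pDiabetic. The sketch relies on fitglm and predict from the Statistics and Machine Learning Toolbox. The synthetic table (with a single glu column) is a stand-in so that the sketch runs without diabetes-pima.csv; it is not real data.

```matlab
% Sketch: P(diabetic | one variable) via univariate logistic regression.
if isfile('diabetes-pima.csv')
    D = readtable('diabetes-pima.csv');
else
    % Synthetic stand-in so the sketch runs without the file (NOT real data).
    rng(2);
    n      = 200;
    glu    = 80 + 100*rand(n, 1);                  % glucose in mg/dL
    pTrue  = 1 ./ (1 + exp(-(glu - 130) / 15));    % made-up relationship
    isDiab = rand(n, 1) < pTrue;
    diabetes = repmat({'No'}, n, 1);
    diabetes(isDiab) = {'Yes'};
    D = table(glu, diabetes);
end

y = double(strcmp(D.diabetes, 'Yes'));             % 1 = diabetic, 0 = not

% pDiabetic (hypothetical name): fit diabetes ~ varName with a logit link,
% then return the fitted probability at each requested value.
pDiabetic = @(varName, value) predict( ...
    fitglm(D.(varName), y, 'Distribution', 'binomial'), value(:));

% Example: probability of diabetes at glucose levels of 90 and 170 mg/dL.
p = pDiabetic('glu', [90; 170]);
disp(p);
```

For part b), the same handle can be evaluated over a grid such as linspace(min(D.glu), max(D.glu), 100) for each variable and the resulting curves plotted in one figure. The model is refitted on every call, which keeps the sketch short but is wasteful if many values are requested one at a time.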
The file cars.csv contains some features of consumer cars produced in the 1970s / early
1980s. We call the “engineering” variables all the variables of this dataset except the first
three variables: manufacturer (Mfg), model and year of the model (Model year).
a) Perform a principal component analysis (PCA) on the engineering variables of this
dataset and provide a visual representation of the PCA. Justify why we should, or should
not, use inverse-variance weights for this PCA. Quantify how much of the variance of the
data is explained by the first two principal components. Do you
think this is enough?
b) Using a biplot, interpret how the first two principal components of the PCA
attempt to segregate the data.
c) Perform a classical multi-dimensional scaling (MDS) on the engineering variables of
this data set.
d) Show, by simply creating two figures, that the MDS data cluster neither by
manufacturer nor by the year of the model.
e) Perform and plot a clustering analysis on the MDS data using the k-means method
with three clusters. For each cluster on the plot, annotate 5 data points chosen
randomly with their model names.
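The pipeline in parts a), c) and e) could be sketched as follows. This is a minimal outline under stated assumptions, not a model answer: it assumes that columns 4 onward of cars.csv are numeric, that column 2 holds the model name, that rows with missing values can be dropped, and that the MDS is run on Euclidean distances between standardized variables. It uses pca, pdist, cmdscale, kmeans and gscatter from the Statistics and Machine Learning Toolbox. The synthetic matrix is a stand-in so that the sketch runs without the file; it is not real data.

```matlab
% Sketch: weighted PCA, classical MDS and k-means on the engineering variables.
if isfile('cars.csv')
    T = readtable('cars.csv');
    E = T{:, 4:end};          % assumes columns 4..end are numeric
    names = T{:, 2};          % assumes column 2 holds the model name
else
    % Synthetic stand-in so the sketch runs without the file (NOT real data).
    rng(3);
    E = randn(60, 4) .* [50 800 3 8] + [150 3000 15 25];
    names = arrayfun(@(i) sprintf('model%02d', i), (1:60)', ...
        'UniformOutput', false);
end
keep  = all(~isnan(E), 2);    % drop rows with missing values
E     = E(keep, :);
names = names(keep);

% a) PCA with inverse-variance weights, so no variable dominates by scale.
[~, score, ~, ~, explained] = pca(E, 'VariableWeights', 'variance');
fprintf('PC1 + PC2 explain %.1f%% of the variance\n', sum(explained(1:2)));

% c) Classical MDS on Euclidean distances between standardized variables.
Y  = cmdscale(pdist(zscore(E)));
Y2 = Y(:, 1:2);

% e) k-means with three clusters on the 2-D MDS coordinates.
rng(0);                       % reproducible starting points
idx = kmeans(Y2, 3, 'Replicates', 10);

figure;
gscatter(Y2(:, 1), Y2(:, 2), idx);
hold on
for k = 1:3
    members = find(idx == k);
    pick = members(randperm(numel(members), min(5, numel(members))));
    text(Y2(pick, 1), Y2(pick, 2), names(pick), 'FontSize', 8);
end
hold off
xlabel('MDS dimension 1'); ylabel('MDS dimension 2');
```

For part d), the same gscatter call can be reused with the manufacturer or the model year as the grouping variable instead of idx.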