This assignment uses Matlab to analyse poultry farming data, diabetes data and other datasets.

CS2035 (Jan-May 2020)

Assignment 3

You have been hired as a data analysis consultant by a large poultry farm. This farm has recently run an experiment to determine if the protein diet currently given to their chicks can be improved. The experiment consisted of weighing the chicks at birth and every second day thereafter until day 20. A last measurement was also taken on day 21. The chicks were randomly divided into four distinct groups and each group received a different diet.

The company sent you the file chicks.csv that contains the results of this experiment. The variable weight is the chick weight in grams, Time is the day of the weighing, Chick is the identification number of the weighed chick and Diet is the diet this chick was fed. Diet 1 is the current protein diet and diets 2, 3 and 4 are the prospective ones.

The farm sells chicks for consumption when they are either 12 or 21 days old. The heavier the chick, the higher its price, hence the farm seeks to maximize the chick weight at 12 and 21 days of age.

The company would like you to determine if one (or more) of the three prospective diets could potentially replace the existing one in order to increase the chicks' weight. Write a short paragraph with your final recommendations to the poultry farm. Write a Matlab script that generates a rigorous statistical analysis, figures (no more than three) and a summary table to support your recommendations. The figure(s) and summary table should convey as clearly as possible your conclusion(s) to the client.
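
As a starting point only, and not a model answer, the comparison at the two sale ages could be sketched as follows. The column names weight, Time and Diet come from the description above. The fallback data are placeholders invented so that the sketch runs without chicks.csv. The choice of a one-way ANOVA per sale day is an assumption, and the calls to anova1 and boxplot assume the Statistics and Machine Learning Toolbox is installed.

```matlab
% Sketch: compare chick weight across diets on the two sale days.
% Column names (weight, Time, Diet) are taken from the assignment text.
if isfile('chicks.csv')
    T = readtable('chicks.csv');
else
    % Placeholder data, NOT the real experiment: 10 chicks per diet and day.
    rng(1);
    [dietGrid, dayGrid] = ndgrid(1:4, [12 21]);
    Diet = repmat(dietGrid(:), 10, 1);
    Time = repmat(dayGrid(:), 10, 1);
    weight = 100 + 5*Time + 10*Diet + 15*randn(numel(Diet), 1);
    T = table(weight, Time, Diet);
end

saleDays = [12 21];
sale = T(ismember(T.Time, saleDays), :);

% Summary table: group size, mean and standard deviation per day and diet.
summaryTable = groupsummary(sale, {'Time', 'Diet'}, {'mean', 'std'}, 'weight');
disp(summaryTable)

% One-way ANOVA across the four diets, run separately for each sale day.
pValues = zeros(size(saleDays));
for k = 1:numel(saleDays)
    rows = sale.Time == saleDays(k);
    pValues(k) = anova1(sale.weight(rows), sale.Diet(rows), 'off');
end
disp(table(saleDays(:), pValues(:), 'VariableNames', {'Day', 'pValue'}))

% One figure: weight distribution per diet on each sale day.
figure
for k = 1:numel(saleDays)
    rows = sale.Time == saleDays(k);
    subplot(1, numel(saleDays), k)
    boxplot(sale.weight(rows), sale.Diet(rows))
    xlabel('Diet'); ylabel('Weight (g)'); title(sprintf('Day %d', saleDays(k)))
end
```

A small p-value only says that at least one diet differs. To see which prospective diets differ from diet 1, the third output of anova1 can be passed to multcompare. Whether the ANOVA assumptions hold for this data should be checked before relying on the result.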

University of Western Ontario CS2035 (Jan-May 2020)

Exercise 2 – Diabetes in Pima Women

You are the data analyst of a scientific team in the US National Institute of Diabetes and Digestive and Kidney Diseases. Your team investigates diabetes risk among Native Americans. The epidemiologists of your group have just finished collecting data on a population of adult women of Pima Indian heritage living near Phoenix, Arizona. In this dataset, the women were tested for diabetes according to World Health Organization criteria and the following six biological variables were also recorded:

• the number of pregnancies (npreg)
• plasma glucose concentration in an oral glucose tolerance test in mg/dL (glu)
• diastolic blood pressure in mm Hg (bp)
• triceps skin fold thickness in mm (skin)
• body mass index in kg/m² (bmi)
• age in years (age)

The file diabetes-pima.csv contains the collected data, and the variable diabetes codes whether a woman is diabetic (Yes) or not (No).

Your clinical collaborators would like to infer from this dataset the probability that a woman of Pima Indian heritage who is not already present in this dataset is diabetic, considering each of the six variables above independently.

a) Provide your collaborators with a Matlab function that calculates the probability that a woman of Pima Indian heritage is diabetic given any one of the six biological variables (and independently of the other five).
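
One possible way to obtain such a probability is a single-variable logistic regression. The sketch below is an illustration under that assumption, not the required method. The data are placeholders invented to stand in for glu and diabetes, the name pDiabetic is hypothetical, and glmfit and glmval assume the Statistics and Machine Learning Toolbox.

```matlab
% Sketch: probability of diabetes given ONE variable, via logistic regression.
% Placeholder data, NOT the real Pima data. With the real file one might use
%   T = readtable('diabetes-pima.csv'); x = T.glu; y = strcmp(T.diabetes, 'Yes');
rng(2);
x = 80 + 100*rand(200, 1);                          % stand-in for glu (mg/dL)
y = rand(200, 1) < 1 ./ (1 + exp(-(x - 130)/15));   % stand-in for diabetes status

% Fit P(diabetic | x) = 1 / (1 + exp(-(b(1) + b(2)*x))).
b = glmfit(x, double(y), 'binomial');               % logit link is the binomial default

% Function handle returning the fitted probability for new values of x.
pDiabetic = @(xNew) glmval(b, xNew(:), 'logit');

disp(pDiabetic([90 130 170]))
```

Repeating the fit with each of the six variables in turn gives one such function per variable, each ignoring the other five.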

b) Generate one figure that compares how that probability varies across a relevant range of values for all six biological variables (again, independently from one another).

c) One of your collaborators wants to know if the patients represented in this dataset are homogeneous when we consider the six biological variables. To address her concern visually, produce a classical multi-dimensional scaling map in two dimensions where each data point is coloured according to the patient's diabetic status. What is your conclusion?
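
A minimal sketch of a classical MDS map coloured by status is given below. The data are placeholders invented for illustration, standardising the variables with zscore before computing Euclidean distances is a choice rather than a requirement, and pdist, zscore and cmdscale assume the Statistics and Machine Learning Toolbox.

```matlab
% Sketch: classical MDS in two dimensions, points marked by diabetic status.
% Placeholder data, NOT the real Pima data: 6 variables, two groups of 60.
rng(3);
n = 60;
X = [randn(n, 6); randn(n, 6) + 1];
isDiabetic = [false(n, 1); true(n, 1)];

Z = zscore(X);            % put the six variables on a common scale
D = pdist(Z);             % pairwise Euclidean distances (vector form)
[Y, e] = cmdscale(D);     % classical MDS; e holds the eigenvalues

figure; hold on
plot(Y(~isDiabetic, 1), Y(~isDiabetic, 2), 'o')
plot(Y(isDiabetic, 1),  Y(isDiabetic, 2),  'x')
legend('No', 'Yes')
xlabel('MDS dimension 1'); ylabel('MDS dimension 2')
```

How strongly the two marker types overlap on the map is what informs the conclusion about homogeneity.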

The file cars.csv contains some features of consumer cars produced in the 1970s and early 1980s. We call "engineering" variables all the variables of this dataset except the first three: manufacturer (Mfg), model and year of the model (Model year).

a) Perform a principal component analysis (PCA) on the engineering variables of this dataset and provide a visual representation of the PCA. Justify why we should, or should not, use inverse-variance weights for this PCA. Quantify how much of the variance in the data is explained by the first two principal components. Do you think this is enough?
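
To illustrate what inverse-variance weighting changes, the sketch below runs pca with and without the 'VariableWeights', 'variance' option and reports the variance explained by the first two components. The four columns are placeholders invented for illustration and are not the cars.csv variables. pca assumes the Statistics and Machine Learning Toolbox.

```matlab
% Sketch: PCA with and without inverse-variance weights.
% Placeholder data, NOT cars.csv: four columns on very different scales,
% three of them driven by a shared factor.
rng(4);
n = 100;
f = randn(n, 1);
X = [3000 + 800*f + 100*randn(n, 1), ...
     25 - 6*f + 2*randn(n, 1), ...
     15 + 2*randn(n, 1), ...
     6 + 2*f + 0.5*randn(n, 1)];

% Unweighted PCA: the column with the largest variance dominates.
[~, ~, ~, ~, explainedRaw] = pca(X);

% Weighted PCA: each variable is weighted by the inverse of its variance.
[coeff, score, ~, ~, explained] = pca(X, 'VariableWeights', 'variance');

firstTwo = sum(explained(1:2));
fprintf('Unweighted, PC1 alone: %.1f%%\n', explainedRaw(1));
fprintf('Weighted, PC1 + PC2:   %.1f%%\n', firstTwo);

% Visual representation: observations on the first two components.
figure
scatter(score(:, 1), score(:, 2))
xlabel('PC 1'); ylabel('PC 2')
```

When variables are measured in different units, the unweighted result mostly reflects the variable with the largest numeric spread, which is the usual argument for weighting.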

b) Using a biplot, interpret how the first two principal components of the PCA attempt to segregate the data.

c) Perform a classical multi-dimensional scaling (MDS) on the engineering variables of this dataset.

d) Show, simply by creating two figures, that the MDS data cluster neither by manufacturer nor by the year of the model.

e) Perform and plot a clustering analysis on the MDS data using the k-means method with three clusters. For each cluster on the plot, annotate five randomly chosen data points with their model names.
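
The clustering and annotation step could be sketched as below. The data and the model names are placeholders invented for illustration, the use of five k-means replicates is an arbitrary choice, and cmdscale, pdist, zscore, kmeans and gscatter assume the Statistics and Machine Learning Toolbox.

```matlab
% Sketch: k-means (k = 3) on 2-D MDS coordinates, 5 random labels per cluster.
% Placeholder data and hypothetical model names, NOT cars.csv.
rng(5);
n = 90;
X = [randn(30, 5); randn(30, 5) + 4; randn(30, 5) - 4];
model = arrayfun(@(i) sprintf('model%02d', i), (1:n)', 'UniformOutput', false);

% Classical MDS on standardised variables, keeping two dimensions.
Y = cmdscale(pdist(zscore(X)));
Y = Y(:, 1:2);

% k-means with three clusters; replicates reduce sensitivity to the start.
k = 3;
idx = kmeans(Y, k, 'Replicates', 5);

figure
gscatter(Y(:, 1), Y(:, 2), idx)
hold on
labelled = [];
for c = 1:k
    members = find(idx == c);
    pick = members(randperm(numel(members), min(5, numel(members))));
    text(Y(pick, 1), Y(pick, 2), model(pick), 'VerticalAlignment', 'bottom')
    labelled = [labelled; pick]; %#ok<AGROW>
end
xlabel('MDS dimension 1'); ylabel('MDS dimension 2')
```

With the real data, model would be the model-name column of the table read from cars.csv and X the engineering variables.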