COMPSCI-2035B Final Examination – Take Home Page 2 of 7
What is the total expected revenue for the next 5 days (assuming the owner works
every day)? What is the 90% confidence interval for your estimate?
(22 points) 6. Spam Emails. A charity has contacted you to ask if you could help them filter
the hundreds of spam emails they receive every day. Spam directly affects the quality
of the service offered by the charity because it crowds out important and urgent
emails received from people in need. One of your friends thinks spam can be detected
by calculating the frequency of certain words and characters in the text of an email.
Your friend developed a program that calculates the frequencies of 48 selected keywords
and 9 other variables that capture various other metrics of the email’s content.
Hence, in total, there are 57 metrics extracted from a given email. The charity gave
you a random sample of 1,000 emails they received last month, and your friend ran the
program on those emails. Then, your friend read through all 1,000 emails and
annotated each one to indicate whether it was indeed spam. The result of this hard
work is saved in the file spam-train.csv, where the first 57 columns represent the
various metrics calculated by your friend’s program, and the last (58th) column is the
spam annotation: 1 if the email was spam, 0 otherwise.
a) Based on this “training” dataset in spam-train.csv, develop a predictive model, based on a logistic regression coupled with a ROC analysis, that
classifies emails as spam or not. The charity is willing to accept that at most
2 out of 100 non-spam emails are wrongly classified as “spam”.
b) Now that your predictive model is developed, your friend has retrieved new emails
received yesterday, computed the 57 metrics on them and saved the results in the file
spam-test.csv. Run your predictive model on this new data set and classify each
email as spam or not. What is the proportion of spam emails identified by
your model in this new data set?
c) A software company has approached the charity, claiming they have new state-of-the-art software that can detect spam like never before. The director of the
charity is tempted to buy this software but hesitates because of its hefty price.
The software company has run its state-of-the-art program on the same training
set, spam-train.csv, as you did. The software gives a numerical score to an email:
the higher the score, the more likely the email is spam. The scores of the 1,000
emails of the training dataset are saved in spam comp.txt. The director asks you
whether the software company does better than the method you and your friend provided
through your volunteer (free) work. Answer the director with a short paragraph that
explains your comparative analysis, along with a single figure of your choice.
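The ROC step behind parts a) to c) can be sketched in plain Python. This is only an illustration under stated assumptions, not a model solution. Fitting the logistic regression is not shown: the sketch assumes you already have one numerical score per email (for example, predicted probabilities from your fitted model, or the software company's scores) and the 0/1 annotations. All function names are hypothetical. A cutoff chosen this way only guarantees the 2% limit on the sample it was computed from.

```python
from typing import Sequence, Tuple


def cutoff_for_max_fpr(scores: Sequence[float], labels: Sequence[int],
                       max_fpr: float = 0.02) -> float:
    """Lowest cutoff t such that the rule `score > t` flags at most
    a fraction max_fpr of the non-spam emails (label 0) in this sample."""
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    if not negatives:
        raise ValueError("need at least one non-spam email")
    allowed = int(max_fpr * len(negatives) + 1e-9)  # false positives we may accept
    if allowed >= len(negatives):
        return float("-inf")  # every email may be flagged
    # Only scores strictly above the (allowed+1)-th highest non-spam score are flagged.
    return negatives[allowed]


def rates(scores: Sequence[float], labels: Sequence[int],
          cutoff: float) -> Tuple[float, float]:
    """(true positive rate, false positive rate) of the rule `score > cutoff`.
    Assumes both classes are present in labels."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > cutoff)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve: the probability that a random spam email
    scores higher than a random non-spam email (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def flagged_share(scores: Sequence[float], cutoff: float) -> float:
    """Share of emails in a new, unlabelled batch that the rule flags as spam."""
    return sum(1 for s in scores if s > cutoff) / len(scores)
```

For a comparison like the one in part c), one option is to compute a cutoff for each scoring method separately on the training annotations, then compare the true positive rates returned by `rates` at those cutoffs, alongside the two `auc` values. On one reading of part b), `flagged_share` applied to the test scores gives the requested proportion.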
(22 points) 7. Bike Sharing. A city put a bike sharing system in place a year ago. Through this
system, users can easily rent a bike from a particular station and return it to
another station (possibly the same one). There are complaints that some stations often have
no bikes available to rent. The logistics of making sure there are enough bikes available at
all rental stations are complex and partly based on the duration of the bike ride for
each user. Municipal employees have noticed that the ride duration tends to be longer
when the weather is nice. If this is true, the municipal staff who move bikes to empty
stations could plan their activity in advance, based on weather forecasts, to improve bike
availability. The manager of the bike sharing program wants to be sure that weather
influences the bike ride duration and hires you as a data analyst consultant to study this question.
The manager provides a dataset of 2,000 bike rides randomly selected over the last year
in the file bike trips.csv, as well as the file bike stations.csv, which translates the
numerical station codes into station names in English. You also have access to weather
data for the city in the file bike weather.csv. The file bike INFO.txt contains
important additional information about those three files.
a) Merge the information from all three datasets (about bike rides, station names
and weather) into a single table that must have the following format (note that the
codes for weather and stations are replaced by their “names”):
day  station start     station end       duration  weather
1    BotanicalGardens  MainStreetSouth   56.78     Sunny
1    MontagueStreet    MontagueStreet    12.34     Storm
1    BakerStreet       AdelaideStreet    8.76      Storm
1    TrainStation      BotanicalGardens  6.31      Light Rain
2    ShoppingMall      CravenAvenue      69.31     Sunny
···  ···               ···               ···       ···
The table above is illustrative and does not represent the values of the actual data
contained in the file. The variable duration is the duration of the bike ride in