C++ assignment help | COM SCI M226 -1 Research on machine learning algorithms

This assignment uses C++ to study machine learning algorithms.

Research on machine learning algorithms
COM SCI M226 -1
Assignment 5 – Tutorial Explanation
40% of final score
Due date: 31st December 2019
Question: Create a new objective function that can model multiple parameters
In this question, you are asked to create a new objective function that allows multiple parameters to be estimated by boosting. The objective function is as follows:
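$$
\phi(y, F(x), G(x), H(x)) = -\lambda_F \left( I_{y>0} F - \log\left(1 + I_{\sum\kappa>0}\, e^{F}\right) \right) - \lambda_G \left( I_{y>0} G - y \log\left(1 + I_{\max\omega>0}\, e^{G}\right) \right) - \lambda_H \left( yH - \alpha \exp(H) \right) \tag{1}
$$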
where
• F, G, H are the parameters to be estimated
• λ_F, λ_G, λ_H are the penalty multipliers
• I_{y>0} is an indicator function for whether y is positive.
• κ is a subset of the feature variables. E.g., if the set of feature variables is {x1, x2, ..., x100} and κ = {x2, x6, x77}, then Σκ = x2 + x6 + x77.
• ω is another subset of the feature variables. E.g., if ω = {x1, x14, x37, x99}, then max ω = max{x1, x14, x37, x99}.
• α is an external constant (it can differ between observations but is fixed for each observation)
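As a minimal sketch of how equation (1) could be evaluated for a single observation, assuming the raw feature values are available as a plain vector and that κ and ω are given as column indices. The names `Penalty`, `SumIndicator`, `MaxIndicator` and `Objective` are hypothetical and are not part of LightGBM.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical holder for the penalty multipliers (lambda_F, lambda_G, lambda_H).
struct Penalty {
  double f, g, h;
};

// Indicator I_{sum kappa > 0}: 1 if the selected features sum to a positive value.
inline double SumIndicator(const std::vector<double>& x, const std::vector<int>& kappa) {
  double s = 0.0;
  for (int j : kappa) s += x[j];
  return s > 0.0 ? 1.0 : 0.0;
}

// Indicator I_{max omega > 0}: 1 if the largest selected feature is positive.
inline double MaxIndicator(const std::vector<double>& x, const std::vector<int>& omega) {
  assert(!omega.empty());
  double m = x[omega[0]];
  for (int j : omega) m = std::max(m, x[j]);
  return m > 0.0 ? 1.0 : 0.0;
}

// Equation (1) for one observation, term by term.
inline double Objective(double y, double F, double G, double H, double alpha,
                        double ind_kappa, double ind_omega, const Penalty& lam) {
  const double iy = y > 0.0 ? 1.0 : 0.0;
  const double term_f = iy * F - std::log(1.0 + ind_kappa * std::exp(F));
  const double term_g = iy * G - y * std::log(1.0 + ind_omega * std::exp(G));
  const double term_h = y * H - alpha * std::exp(H);
  return -lam.f * term_f - lam.g * term_g - lam.h * term_h;
}
```

With x = {1.0, -2.0, 0.5}, κ = {0, 1} and ω = {1, 2}, the sum over κ is negative while the maximum over ω is positive, so the two indicators differ for the same observation.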
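$$
\text{gradient}_F = I_{y>0} - \frac{e^{F}}{1 + I_{\sum\kappa>0}\, e^{F}}, \qquad
\text{Hessian}_F = -\frac{e^{F}}{\left(1 + I_{\sum\kappa>0}\, e^{F}\right)^{2}}
$$

$$
\text{gradient}_G = I_{y>0} - y\,\frac{e^{G}}{1 + I_{\max\omega>0}\, e^{G}}, \qquad
\text{Hessian}_G = -y\,\frac{e^{G}}{\left(1 + I_{\max\omega>0}\, e^{G}\right)^{2}}
$$

$$
\text{gradient}_H = y, \qquad \text{Hessian}_H = \alpha \exp(H)
$$

$$
\text{output}_F = -\frac{\text{gradient}_F}{\text{Hessian}_F}, \qquad
\text{output}_G = -\frac{\text{gradient}_G}{\text{Hessian}_G}, \qquad
\text{output}_H = \log\left(\frac{\text{gradient}_H}{\text{Hessian}_H}\right)
$$

$$
\text{gain} = -\lambda_F \frac{(\text{gradient}_F)^{2}}{\text{Hessian}_F}
- \lambda_G \frac{(\text{gradient}_G)^{2}}{\text{Hessian}_G}
- \lambda_H \frac{(\text{gradient}_H - \text{Hessian}_H)^{2}}{\text{Hessian}_H}
$$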
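The gradient, Hessian, output and gain formulas can be sketched in C++ as follows. This is an illustration under stated assumptions rather than the expected solution: the struct and function names are hypothetical, and the asserts encode an assumption the handout leaves implicit, namely that the Hessians are nonzero and that gradient_H / Hessian_H is positive so the logarithm is defined.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical bundle of the six derivative terms for F, G and H.
struct Derivatives {
  double grad_f, hess_f;
  double grad_g, hess_g;
  double grad_h, hess_h;
};

// Per-observation gradients and Hessians, copied term by term from the formulas.
inline Derivatives ComputeDerivatives(double y, double F, double G, double H,
                                      double alpha, double ind_kappa, double ind_omega) {
  const double iy = y > 0.0 ? 1.0 : 0.0;
  const double ef = std::exp(F);
  const double eg = std::exp(G);
  const double denom_f = 1.0 + ind_kappa * ef;
  const double denom_g = 1.0 + ind_omega * eg;
  return {iy - ef / denom_f, -ef / (denom_f * denom_f),
          iy - y * eg / denom_g, -y * eg / (denom_g * denom_g),
          y, alpha * std::exp(H)};
}

struct LeafResult {
  double out_f, out_g, out_h, gain;
};

// Outputs and gain from (summed) gradients and Hessians.
inline LeafResult ComputeLeafResult(const Derivatives& s, double lam_f,
                                    double lam_g, double lam_h) {
  // Assumption: the caller guarantees these conditions; the handout does not
  // say how degenerate leaves (e.g. y = 0 everywhere) should be handled.
  assert(s.hess_f != 0.0 && s.hess_g != 0.0 && s.hess_h > 0.0);
  assert(s.grad_h > 0.0);
  LeafResult r;
  r.out_f = -s.grad_f / s.hess_f;
  r.out_g = -s.grad_g / s.hess_g;
  r.out_h = std::log(s.grad_h / s.hess_h);
  const double diff_h = s.grad_h - s.hess_h;
  r.gain = -lam_f * s.grad_f * s.grad_f / s.hess_f
           - lam_g * s.grad_g * s.grad_g / s.hess_g
           - lam_h * diff_h * diff_h / s.hess_h;
  return r;
}
```

For y = 1, F = G = H = 0, α = 1 and both indicators equal to 1, the gradients for F and G are 0.5, their Hessians are -0.25, and with all multipliers set to 1 the gain evaluates to 2.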
Tutorial Explanation: Step-by-step walkthrough to finish this assignment
To modify the package so that the above objective function can be optimized over the parameters F, G, H, the new package needs to let users provide more parameters, calculate split predictions differently, and extend how the boosted trees are grown. The steps that follow guide you through the problem in detail.
Throughout the question, you must observe the following requirements:
• The library used is lightgbm; you can find the source code at
https://github.com/microsoft/LightGBM.
• Keep all existing API and function calls from R and Python.
• For part A, you will need to modify the R, C++ and Python APIs, together with the source code in the src folder.
• For parts B and C, you will modify only the C++ code.
• For part D, you will need to create a report by running the example in R or Python. You can earn a bonus if you compile the report in both R and Python. The report can be written as a Jupyter notebook.
Part A: Include more variables
In this part, you are asked to let users provide more inputs to meet the requirements of the new objective.
Subset of feature set, κ, ω
The first addition is two vectors, each a subset of the feature variables. E.g., if the set of feature variables is {x1, x2, ..., x100}, then κ = {x2, x6, x77} and ω = {x1, x14, x37, x99}.
• Enable the R/Python/C++ API to accept the two new variables. They can be given either as location indices (0 to 99 in the above example) or as the exact names of the features (e.g., x1).
• Because the current library automatically creates bins and a dataset for the features, you may create a matrix or vectors holding the subset data for the objective function to point to.
• If stored as matrices, each of the 2 variables is a matrix with dimension num_data (the number of rows in the train data) * K (an integer). There should be a step that checks that this holds.
• The new data should be a member object of the bin data, the feature histogram and the objective function. You can add it to further classes if necessary.
• The main source files you need to look at include:
– files in src/boosting
– files in src/objective, plus 1 additional objective file to be created, called newobjective.cpp
– files in src/io
– tree.h tree.cpp
– meta.h meta.cpp
– R package folder to allow input of additional parameters
– Python package folder to allow input of additional parameters
– C++ API to allow input of additional parameters
– include/LightGBM/config.h
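The name-or-index input and the num_data * K dimension check described in the list above could look roughly like this. It is a standalone sketch: `ResolveFeatureSubset` and `CheckSubsetMatrix` are hypothetical helpers, not existing LightGBM functions, and the row-major layout of the subset matrix is an assumption.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Map feature names (e.g. "x2") to their 0-based column indices.
// Throws if a requested name is not among the dataset's feature names.
inline std::vector<int> ResolveFeatureSubset(const std::vector<std::string>& feature_names,
                                             const std::vector<std::string>& subset_names) {
  std::unordered_map<std::string, int> position;
  for (std::size_t i = 0; i < feature_names.size(); ++i) {
    position[feature_names[i]] = static_cast<int>(i);
  }
  std::vector<int> indices;
  indices.reserve(subset_names.size());
  for (const auto& name : subset_names) {
    const auto it = position.find(name);
    if (it == position.end()) {
      throw std::invalid_argument("unknown feature name: " + name);
    }
    indices.push_back(it->second);
  }
  return indices;
}

// Check that a row-major subset matrix really has num_data * k entries.
inline void CheckSubsetMatrix(const std::vector<double>& values, int num_data, int k) {
  if (num_data <= 0 || k <= 0 ||
      values.size() != static_cast<std::size_t>(num_data) * static_cast<std::size_t>(k)) {
    throw std::invalid_argument("subset matrix must have num_data * K entries");
  }
}
```

Resolving names to indices once, at the API boundary, means the C++ internals only ever deal with integer column positions regardless of which form the R or Python caller used.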
Constant α
In lightgbm, users can specify a weights vector. Treat α as an object similar to weights: all the operations involving weights should also involve α. The set of files modified to create κ and ω should be similar to the set modified for α.
Objective-related parameters
Create a new objective called "newobjective" that takes an additional set of hyperparameters to feed the new objective.
penalty parameter: a double vector (λ_F, λ_G, λ_H), with one entry assigned to each component of the loss/penalty function
confidence for κ: a double vector with values in the range [0, 1] whose length matches the number of columns of κ in part A.
confidence for ω: a double vector with values in the range [0, 1] whose length matches the number of columns of ω in part A.
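One possible way to hold and validate these hyperparameters is sketched here. `NewObjectiveParams` and `ValidateParams` are hypothetical names chosen for the illustration; in the real task the values would be read through LightGBM's configuration mechanism, which this sketch does not model.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical hyperparameter bundle for the "newobjective" objective.
struct NewObjectiveParams {
  std::array<double, 3> penalty;         // (lambda_F, lambda_G, lambda_H)
  std::vector<double> confidence_kappa;  // one value in [0, 1] per column of kappa
  std::vector<double> confidence_omega;  // one value in [0, 1] per column of omega
};

// Throws std::invalid_argument if a confidence vector has the wrong length
// or contains a value outside [0, 1] (NaN is rejected as well).
inline void ValidateParams(const NewObjectiveParams& p, std::size_t kappa_cols,
                           std::size_t omega_cols) {
  const auto check = [](const std::vector<double>& c, std::size_t cols, const char* name) {
    if (c.size() != cols) {
      throw std::invalid_argument(std::string(name) + ": length must match column count");
    }
    for (double v : c) {
      if (!(v >= 0.0 && v <= 1.0)) {
        throw std::invalid_argument(std::string(name) + ": values must lie in [0, 1]");
      }
    }
  };
  check(p.confidence_kappa, kappa_cols, "confidence for kappa");
  check(p.confidence_omega, omega_cols, "confidence for omega");
}
```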
Part B : Minor modification of library logic
Now that the right set of parameters can be included for modeling, you need to modify the logic of the library.
ConstructHistogram and bin: Modify the ConstructHistogram function in all bin files and convert the member items of the bin data (sum gradients, sum hessian, cnt, weight, etc.) from scalars to vectors. The dependent functions that take sum gradients, sum hessian, etc. should be modified accordingly.
GetLeafSplitGain and CalculateSplittedLeafOutput: In feature_histogram.hpp, the GetLeafSplitGain and CalculateSplittedLeafOutput functions are defined to calculate the gain of each split. Instead of calculating directly inside them, create the same functions in the objective function and call the objective function's member functions from there.
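The two changes in this part can be illustrated with a standalone sketch: a histogram entry whose sums are vectors with one slot per parameter (F, G, H), and an objective-side interface that owns the gain and output calculation. Only the two function names come from the assignment text; the classes `VectorBinEntry`, `SplitScorer` and `NewObjectiveScorer`, and their signatures, are hypothetical and do not mirror LightGBM's actual declarations.

```cpp
#include <array>
#include <cmath>

constexpr int kNumParams = 3;  // F, G, H
using Vec3 = std::array<double, kNumParams>;

// Histogram entry with vector-valued sums instead of scalars.
struct VectorBinEntry {
  Vec3 sum_gradients{};
  Vec3 sum_hessians{};
  int cnt = 0;

  void Add(const Vec3& g, const Vec3& h) {
    for (int k = 0; k < kNumParams; ++k) {
      sum_gradients[k] += g[k];
      sum_hessians[k] += h[k];
    }
    ++cnt;
  }
};

// Hypothetical interface: the histogram code would call these instead of
// computing gain and output itself.
class SplitScorer {
 public:
  virtual ~SplitScorer() = default;
  virtual double GetLeafSplitGain(const VectorBinEntry& s) const = 0;
  virtual Vec3 CalculateSplittedLeafOutput(const VectorBinEntry& s) const = 0;
};

// Gain and outputs as given in the question's formulas.
// Assumption: Hessian sums are nonzero and g[2] / h[2] is positive.
class NewObjectiveScorer : public SplitScorer {
 public:
  explicit NewObjectiveScorer(const Vec3& lambda) : lambda_(lambda) {}

  double GetLeafSplitGain(const VectorBinEntry& s) const override {
    const Vec3& g = s.sum_gradients;
    const Vec3& h = s.sum_hessians;
    const double diff_h = g[2] - h[2];
    return -lambda_[0] * g[0] * g[0] / h[0]
           - lambda_[1] * g[1] * g[1] / h[1]
           - lambda_[2] * diff_h * diff_h / h[2];
  }

  Vec3 CalculateSplittedLeafOutput(const VectorBinEntry& s) const override {
    const Vec3& g = s.sum_gradients;
    const Vec3& h = s.sum_hessians;
    return Vec3{{-g[0] / h[0], -g[1] / h[1], std::log(g[2] / h[2])}};
  }

 private:
  Vec3 lambda_;
};
```

Routing both calculations through a virtual interface keeps the histogram code independent of the objective, at the cost of a virtual call per candidate split; whether that overhead is acceptable is a design choice the assignment leaves open.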
Part C : Create new training logic
Traditional boosting is achieved through iterative tree-based prediction to minimize the residual loss. This new objective has 3 steps.
First identify the split/variable with the biggest prediction gap: Within the κ variable set for odd trees, or the ω variable set for even trees, identify the split/variable that has the biggest average score difference abs((Σ_L score / n_L) − (Σ_R score / n_R)), where L and R are the left and right partitions and n_L, n_R are their numbers of observations. Then create an indicator vector (observations in the left partition are 0, those in the right partition are 1). No actual tree split is made; this step only finds the partition variable for the next step.
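A minimal sketch of this first step, assuming the candidate features (the κ or ω columns) are passed as raw value columns and that candidate thresholds lie between consecutive distinct sorted values. LightGBM itself works on binned features, so the threshold enumeration here is a simplification, and `GapSplit` and `FindBiggestGap` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

struct GapSplit {
  int feature = -1;            // index into the candidate columns, -1 if none found
  double threshold = 0.0;      // rows with value <= threshold go left
  double gap = -1.0;           // |mean score left - mean score right|
  std::vector<int> indicator;  // 0 = left partition, 1 = right partition
};

// columns[j][i] is the value of candidate feature j for observation i.
inline GapSplit FindBiggestGap(const std::vector<std::vector<double>>& columns,
                               const std::vector<double>& score) {
  GapSplit best;
  const std::size_t n = score.size();
  const double total = std::accumulate(score.begin(), score.end(), 0.0);
  for (std::size_t j = 0; j < columns.size(); ++j) {
    const std::vector<double>& col = columns[j];
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&col](std::size_t a, std::size_t b) { return col[a] < col[b]; });
    double left_sum = 0.0;
    for (std::size_t k = 0; k + 1 < n; ++k) {
      left_sum += score[order[k]];
      if (col[order[k]] == col[order[k + 1]]) continue;  // no split between equal values
      const double n_left = static_cast<double>(k + 1);
      const double n_right = static_cast<double>(n) - n_left;
      const double gap = std::fabs(left_sum / n_left - (total - left_sum) / n_right);
      if (gap > best.gap) {
        best.feature = static_cast<int>(j);
        best.threshold = col[order[k]];
        best.gap = gap;
      }
    }
  }
  if (best.feature >= 0) {
    best.indicator.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
      best.indicator[i] = columns[best.feature][i] > best.threshold ? 1 : 0;
    }
  }
  return best;
}
```

For scores {0, 0, 10, 10} and a candidate column {1, 2, 3, 4}, the split after the value 2 separates the means 0 and 10, so the returned indicator is {0, 0, 1, 1}.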
Grow the actual tree as in traditional tree growing: As in traditional tree growing, calculate the output and gain using the sum of gradients, sum of Hessians, etc. This time, however, the gradient and Hessian are only accumulated where the partition value is 1, i.e., the new sum of gradients in node l is Σ_{i∈l} gradient_i I_i.
Further grow the tree completed in the previous step: For each leaf, split one more time, i.e., if there are 10 leaves in the tree created in the above step, this step will expand the tree to 20 leaves. For each leaf, operate as in the first step: within the κ variable set (assuming we are at the 1st tree, 3rd tree and so on), identify the split/variable that has the biggest average score difference abs((Σ_L score / n_L) − (Σ_R score / n_R)). This time, actually expand the tree with the split. The output is calculated from the sum of gradients / sum of Hessians.
Part D : Implementation of this new objective on real data
Bonus (20%): Model this with the Hepatitis C Virus (HCV) for Egyptian patients Data Set with κ = {Age, red blood cells}, ω = {ALT 24, ALT 36}, α = 0.1 if Gender = Male and α = 0.7 if Gender = Female. λ_F = 0.3, λ_G = 0.7, λ_H = 0.9.