This assignment uses C++ to study machine learning algorithms

Research on machine learning algorithms

COM SCI M226 -1

Assignment 5 – Tutorial Explanation

40% of final score

Due date: 31st December 2019

Question: Create a new objective function that can model multiple parameters

In this question, you are asked to create a new objective function that allows multiple parameters to be estimated by boosting. The objective function is as follows:

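$$
\phi\bigl(y, F(x), G(x), H(x)\bigr) = -\lambda_F\left(I_{y>0}\,F - \log\left(1 + I_{\sum\kappa>0}\,e^{F}\right)\right) - \lambda_G\left(I_{y>0}\,G - y\log\left(1 + I_{\max\omega>0}\,e^{G}\right)\right) - \lambda_H\bigl(yH - \alpha\exp(H)\bigr) \tag{1}
$$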

where

• F, G, H are the parameters to be estimated

• λF , λG, λH are the penalty multipliers

• Iy>0 is an indicator function that equals 1 when y is positive and 0 otherwise.

• κ is a subset of the feature variables. E.g., if the set of feature variables is {x1, x2, . . . , x100} and κ = {x2, x6, x77}, then Σκ = x2 + x6 + x77.

• ω is another subset of the feature variables. E.g., if ω = {x1, x14, x37, x99}, then max ω = max{x1, x14, x37, x99}.

• α is an external constant (it differs between observations but is fixed for each observation).
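As a way to check the reading of equation (1), the following is a minimal C++ sketch that evaluates φ for a single observation. The `Observation` struct and all function names are hypothetical and are not part of LightGBM; features are assumed to be stored at 0-based positions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical container for one observation (illustrative names only).
struct Observation {
  double y;               // response
  double alpha;           // external per-observation constant
  std::vector<double> x;  // feature values, 0-based
};

// Indicator that the sum of the kappa features is positive.
inline double SumIndicator(const std::vector<double>& x,
                           const std::vector<int>& kappa) {
  double s = 0.0;
  for (int j : kappa) s += x[j];
  return s > 0.0 ? 1.0 : 0.0;
}

// Indicator that the maximum of the omega features is positive.
inline double MaxIndicator(const std::vector<double>& x,
                           const std::vector<int>& omega) {
  assert(!omega.empty());
  double m = x[omega[0]];
  for (int j : omega) m = std::max(m, x[j]);
  return m > 0.0 ? 1.0 : 0.0;
}

// Evaluates equation (1) term by term for one observation.
inline double Phi(const Observation& o, double F, double G, double H,
                  const std::vector<int>& kappa,
                  const std::vector<int>& omega,
                  double lambda_f, double lambda_g, double lambda_h) {
  const double i_y = o.y > 0.0 ? 1.0 : 0.0;
  const double i_k = SumIndicator(o.x, kappa);
  const double i_w = MaxIndicator(o.x, omega);
  const double term_f = i_y * F - std::log(1.0 + i_k * std::exp(F));
  const double term_g = i_y * G - o.y * std::log(1.0 + i_w * std::exp(G));
  const double term_h = o.y * H - o.alpha * std::exp(H);
  return -lambda_f * term_f - lambda_g * term_g - lambda_h * term_h;
}
```

For example, calling `Phi` with κ given as the indices {1, 5, 76} corresponds to κ = {x2, x6, x77} under 0-based indexing.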

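$$
\begin{aligned}
\text{gradient}_F &= I_{y>0} - \frac{e^{F}}{1 + I_{\sum\kappa>0}\,e^{F}}, &
\text{Hessian}_F &= -\frac{e^{F}}{\left(1 + I_{\sum\kappa>0}\,e^{F}\right)^{2}} \\
\text{gradient}_G &= I_{y>0} - y\,\frac{e^{G}}{1 + I_{\max\omega>0}\,e^{G}}, &
\text{Hessian}_G &= -y\,\frac{e^{G}}{\left(1 + I_{\max\omega>0}\,e^{G}\right)^{2}} \\
\text{gradient}_H &= y, &
\text{Hessian}_H &= \alpha\exp(H)
\end{aligned}
$$

$$
\text{output}_F = -\frac{\text{gradient}_F}{\text{Hessian}_F}, \qquad
\text{output}_G = -\frac{\text{gradient}_G}{\text{Hessian}_G}, \qquad
\text{output}_H = \log\left(\frac{\text{gradient}_H}{\text{Hessian}_H}\right)
$$

$$
\text{gain} = -\lambda_F\,\frac{(\text{gradient}_F)^{2}}{\text{Hessian}_F}
 - \lambda_G\,\frac{(\text{gradient}_G)^{2}}{\text{Hessian}_G}
 - \lambda_H\,\frac{(\text{gradient}_H - \text{Hessian}_H)^{2}}{\text{Hessian}_H}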
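The gradient, Hessian, output and gain formulas above can be transcribed directly into code. The sketch below does this for a single observation; in the tree learner the same output and gain expressions would be applied to per-leaf sums. The struct and function names are hypothetical, and the indicators are passed in as 0/1 values.

```cpp
#include <cassert>
#include <cmath>

// Gradient/Hessian pair for one parameter (hypothetical helper type).
struct Derivs {
  double grad;
  double hess;
};

// i_y = I_{y>0}, i_k = I_{sum kappa > 0}
inline Derivs DerivF(double i_y, double i_k, double F) {
  const double e = std::exp(F);
  const double d = 1.0 + i_k * e;
  return {i_y - e / d, -e / (d * d)};
}

// i_w = I_{max omega > 0}
inline Derivs DerivG(double y, double i_y, double i_w, double G) {
  const double e = std::exp(G);
  const double d = 1.0 + i_w * e;
  return {i_y - y * e / d, -y * e / (d * d)};
}

inline Derivs DerivH(double y, double alpha, double H) {
  return {y, alpha * std::exp(H)};
}

// output_F and output_G: -gradient / Hessian
inline double NewtonOutput(const Derivs& d) {
  assert(d.hess != 0.0);
  return -d.grad / d.hess;
}

// output_H: log(gradient / Hessian), defined only for a positive ratio
inline double LogRatioOutput(const Derivs& d) {
  assert(d.hess != 0.0 && d.grad / d.hess > 0.0);
  return std::log(d.grad / d.hess);
}

// gain as the weighted sum of the three terms
inline double Gain(const Derivs& f, const Derivs& g, const Derivs& h,
                   double lambda_f, double lambda_g, double lambda_h) {
  const double dh = h.grad - h.hess;
  return -lambda_f * f.grad * f.grad / f.hess
         - lambda_g * g.grad * g.grad / g.hess
         - lambda_h * dh * dh / h.hess;
}
```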


Tutorial Explanation: Step by Step walkthrough to finish this assignment

To modify the package so that the above objective function can be optimized over the parameters F, G, H, the new package needs to let users provide more parameters, calculate split predictions differently, and extend how the boosting tree is grown. The steps below guide you through the problem in detail.

Throughout the question, you must observe the following requirements:

• The library used is LightGBM; you can find the source code at https://github.com/microsoft/LightGBM.

• Keep all the existing API and function calls from R and Python.

• For Part A, you will need to modify the R, C++ and Python APIs, together with the source code in the src folder.

• For Parts B and C, you will modify only the C++ code.

• For Part D, you will need to create a report by running the example in R or Python. You can earn a bonus if you compile the report in both R and Python. The report can be written as a Jupyter notebook.

Part A: include more variables

In this part, you are asked to let users provide the additional inputs that the new objective requires.

Subset of feature set, κ, ω

The first addition is two vectors, each a subset of the feature variables. E.g., if the set of feature variables is {x1, x2, . . . , x100}, then κ = {x2, x6, x77} and another set ω = {x1, x14, x37, x99}.

• Enable the R/Python/C++ APIs to accept the two new variables. They can be given either as location indices (0 to 99 in the above example) or as the exact feature names (e.g., x1).


• Because the current library automatically creates bins and a dataset for the features, you may create a separate matrix or set of vectors holding these values so that the objective function can point to them.

• If stored as matrices, each of the 2 variables has dimension num_data (the number of rows in the training data) × K (integers). Add a step that checks that this dimension holds.

• It should be a member object of the bin data, the feature histogram and the objective function. You can add it to further classes if necessary.

• The main source files you need to look at include:

– files in src/boosting

– files in src/objective, plus 1 additional objective file that you create, called newobjective.cpp

– files in src/io

– tree.h tree.cpp

– meta.h meta.cpp

– the R package folder, to allow input of the additional parameters

– the Python package folder, to allow input of the additional parameters

– the C++ API, to allow input of the additional parameters

– include/LightGBM/config.h
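The name-or-index requirement and the dimension check from the list above could look roughly like the following. This is only a sketch under stated assumptions: the helper names are hypothetical, the raw data is assumed to be a dense row-major array, and none of this reflects how LightGBM stores features internally.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: map feature names (e.g. "x2") to 0-based column indices.
inline std::vector<int> ResolveFeatureSubset(
    const std::vector<std::string>& feature_names,
    const std::vector<std::string>& subset_names) {
  std::vector<int> indices;
  for (const std::string& name : subset_names) {
    int found = -1;
    for (std::size_t j = 0; j < feature_names.size(); ++j) {
      if (feature_names[j] == name) {
        found = static_cast<int>(j);
        break;
      }
    }
    if (found < 0) throw std::invalid_argument("unknown feature name: " + name);
    indices.push_back(found);
  }
  return indices;
}

// Hypothetical helper: copy the selected columns of a row-major
// num_data x num_features array into a row-major num_data x K matrix.
inline std::vector<double> ExtractSubsetMatrix(
    const std::vector<double>& data, std::size_t num_data,
    std::size_t num_features, const std::vector<int>& columns) {
  if (data.size() != num_data * num_features) {
    throw std::invalid_argument("data size does not match num_data * num_features");
  }
  std::vector<double> out;
  out.reserve(num_data * columns.size());
  for (std::size_t i = 0; i < num_data; ++i) {
    for (int c : columns) {
      if (c < 0 || static_cast<std::size_t>(c) >= num_features) {
        throw std::out_of_range("column index out of range");
      }
      out.push_back(data[i * num_features + static_cast<std::size_t>(c)]);
    }
  }
  // Dimension check: the result must be num_data x K.
  assert(out.size() == num_data * columns.size());
  return out;
}
```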

Constant α

In LightGBM, users can specify a weights vector. Treat α as an object similar to weights: all the operations involving weights should also involve α. The set of files modified above to create κ and ω should be similar to the set modified for α.

Objective-related parameters

Create a new objective called "newobjective" that takes an additional set of hyperparameters to feed the new objective.

penalty parameter: a double vector (λF, λG, λH), with one multiplier assigned to each component of the loss/penalty function.


confidence for κ: a double vector with entries in the range [0, 1] whose length matches the number of columns of κ in Part A.

confidence for ω: a double vector with entries in the range [0, 1] whose length matches the number of columns of ω in Part A.
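One way to hold and validate these hyperparameters is sketched below. The struct and field names are hypothetical; in the actual assignment they would presumably be wired into the library's configuration, which is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical parameter bundle for the "newobjective" objective.
struct NewObjectiveParams {
  std::vector<double> penalty;           // (lambda_F, lambda_G, lambda_H)
  std::vector<double> kappa_confidence;  // one entry per column of kappa
  std::vector<double> omega_confidence;  // one entry per column of omega
};

// True when every entry lies in the closed interval [0, 1].
inline bool AllInUnitInterval(const std::vector<double>& v) {
  for (double c : v) {
    if (c < 0.0 || c > 1.0) return false;
  }
  return true;
}

// Checks the vector lengths and ranges described in the text.
inline bool IsValid(const NewObjectiveParams& p, std::size_t kappa_cols,
                    std::size_t omega_cols) {
  return p.penalty.size() == 3 &&
         p.kappa_confidence.size() == kappa_cols &&
         p.omega_confidence.size() == omega_cols &&
         AllInUnitInterval(p.kappa_confidence) &&
         AllInUnitInterval(p.omega_confidence);
}
```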

Part B: Minor modification of library logic

Now that the right set of parameters can be supplied for modeling, you need to modify the logic of the library.

ConstructHistogram and bin: Modify the ConstructHistogram function in all bin files and convert the member items of the bin data (sum_gradients, sum_hessians, cnt, weight, etc.) from scalars to vectors. The dependent functions that take sum_gradients, sum_hessians, etc. should be modified accordingly.
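The scalar-to-vector change can be pictured with a simplified histogram entry that keeps one gradient sum and one Hessian sum per parameter (F, G, H). This is a sketch only: `HistogramEntry`, `Vec3` and `ConstructHistogramSketch` are hypothetical names, and the real ConstructHistogram implementations in the bin files have different signatures.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kNumParams = 3;  // F, G, H
using Vec3 = std::array<double, kNumParams>;

// Hypothetical bin entry with vector-valued sums instead of scalars.
struct HistogramEntry {
  Vec3 sum_gradients{};  // zero-initialized
  Vec3 sum_hessians{};   // zero-initialized
  int cnt = 0;
};

// Accumulates per-row gradient/Hessian vectors into the bin of each row.
inline std::vector<HistogramEntry> ConstructHistogramSketch(
    const std::vector<int>& bin_of_row, int num_bins,
    const std::vector<Vec3>& gradients, const std::vector<Vec3>& hessians) {
  assert(num_bins > 0);
  assert(bin_of_row.size() == gradients.size());
  assert(bin_of_row.size() == hessians.size());
  std::vector<HistogramEntry> hist(static_cast<std::size_t>(num_bins));
  for (std::size_t i = 0; i < bin_of_row.size(); ++i) {
    assert(bin_of_row[i] >= 0 && bin_of_row[i] < num_bins);
    HistogramEntry& e = hist[static_cast<std::size_t>(bin_of_row[i])];
    for (std::size_t k = 0; k < kNumParams; ++k) {
      e.sum_gradients[k] += gradients[i][k];
      e.sum_hessians[k] += hessians[i][k];
    }
    ++e.cnt;
  }
  return hist;
}
```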

GetLeafSplitGain and CalculateSplittedLeafOutput: In feature_histogram.hpp, the GetLeafSplitGain and CalculateSplittedLeafOutput functions are defined to calculate each split gain. Instead of calculating directly inside them, create the same functions in the objective function and call the objective function's member functions from there.
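The delegation described above might be structured as follows. The interface is deliberately simplified and hypothetical: the real functions in feature_histogram.hpp take additional regularization arguments that are omitted here, and summing the left and right leaf gains is an assumption about how the caller combines them.

```cpp
#include <cassert>

// Hypothetical objective-side interface that owns the gain/output formulas.
class SplitObjective {
 public:
  virtual ~SplitObjective() {}
  virtual double GetLeafSplitGain(double sum_gradients,
                                  double sum_hessians) const = 0;
  virtual double CalculateSplittedLeafOutput(double sum_gradients,
                                             double sum_hessians) const = 0;
};

// One Newton-style term, matching the F and G terms of the assignment:
// gain = -lambda * g^2 / h, output = -g / h.
class NewtonTermObjective : public SplitObjective {
 public:
  explicit NewtonTermObjective(double lambda) : lambda_(lambda) {}

  double GetLeafSplitGain(double g, double h) const override {
    assert(h != 0.0);
    return -lambda_ * g * g / h;
  }

  double CalculateSplittedLeafOutput(double g, double h) const override {
    assert(h != 0.0);
    return -g / h;
  }

 private:
  double lambda_;
};

// Histogram-side caller: no formula lives here, it only delegates.
inline double SplitGainViaObjective(const SplitObjective& objective,
                                    double left_g, double left_h,
                                    double right_g, double right_h) {
  return objective.GetLeafSplitGain(left_g, left_h) +
         objective.GetLeafSplitGain(right_g, right_h);
}
```

Keeping the formulas behind a virtual interface means the histogram code does not change when a different objective supplies its own gain and output.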

Part C: Create new training logic

Traditional boosting is achieved through iterative tree-based prediction to minimize the residual loss. This new objective has 3 steps.

First identify the split/variable with the biggest prediction gap: Within the κ variable set (for odd-numbered trees) or the ω variable set (for even-numbered trees), identify the split/variable that has the biggest average score difference abs((Σ_L score / n_L) − (Σ_R score / n_R)), where L and R are the left and right partitions and n_L, n_R are their observation counts. Then create an indicator vector (0 for observations in the left partition, 1 for observations in the right partition). No actual tree split is made here; this step only finds the partition variable for the next step.
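A brute-force version of this first step is sketched below. It assumes feature-major storage, uses every observed value as a candidate threshold, and sends rows with value <= threshold to the left; the library itself would work on binned features. All names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Result of the gap search (hypothetical type).
struct GapSplit {
  int feature;       // chosen feature index, -1 if no valid split exists
  double threshold;  // rows with value <= threshold go left
  double gap;        // abs(mean left score - mean right score)
};

// columns[f][i] is the value of feature f for row i.
inline GapSplit FindBiggestGap(const std::vector<std::vector<double>>& columns,
                               const std::vector<int>& candidates,
                               const std::vector<double>& score) {
  GapSplit best = {-1, 0.0, -1.0};
  for (int f : candidates) {
    const std::vector<double>& col = columns[f];
    assert(col.size() == score.size());
    for (double t : col) {
      double sum_l = 0.0, sum_r = 0.0;
      std::size_t n_l = 0, n_r = 0;
      for (std::size_t i = 0; i < col.size(); ++i) {
        if (col[i] <= t) {
          sum_l += score[i];
          ++n_l;
        } else {
          sum_r += score[i];
          ++n_r;
        }
      }
      if (n_l == 0 || n_r == 0) continue;  // not a real partition
      const double gap = std::fabs(sum_l / n_l - sum_r / n_r);
      if (gap > best.gap) best = {f, t, gap};
    }
  }
  return best;
}

// Indicator vector: 0 for the left partition, 1 for the right partition.
inline std::vector<int> PartitionIndicator(const std::vector<double>& column,
                                           double threshold) {
  std::vector<int> indicator(column.size());
  for (std::size_t i = 0; i < column.size(); ++i) {
    indicator[i] = column[i] > threshold ? 1 : 0;
  }
  return indicator;
}
```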

Grow the actual tree similarly to the traditional tree: As in traditional tree growing, calculate the output and gain using the sum of gradients, the sum of Hessians, etc. This time, however, the gradient and Hessian are only counted when the partition value is 1, i.e., the new sum of gradients in node l is Σ_{i ∈ l} gradient_i · I_i, where I_i is the indicator value of observation i from the previous step.
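The masked accumulation in this second step amounts to multiplying each gradient and Hessian by the partition indicator before summing. A minimal sketch, with hypothetical names and a flat row-to-leaf assignment standing in for the tree structure:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-leaf sums (hypothetical type).
struct LeafSums {
  double sum_gradient;
  double sum_hessian;
};

// Sums gradient_i * I_i and hessian_i * I_i over the rows of each leaf,
// where indicator[i] is the 0/1 partition value from the first step.
inline std::vector<LeafSums> MaskedLeafSums(
    const std::vector<int>& leaf_of_row, int num_leaves,
    const std::vector<double>& gradients,
    const std::vector<double>& hessians,
    const std::vector<int>& indicator) {
  assert(num_leaves > 0);
  assert(leaf_of_row.size() == gradients.size());
  assert(leaf_of_row.size() == hessians.size());
  assert(leaf_of_row.size() == indicator.size());
  std::vector<LeafSums> sums(static_cast<std::size_t>(num_leaves),
                             LeafSums{0.0, 0.0});
  for (std::size_t i = 0; i < leaf_of_row.size(); ++i) {
    assert(leaf_of_row[i] >= 0 && leaf_of_row[i] < num_leaves);
    if (indicator[i] != 1) continue;  // rows with indicator 0 contribute nothing
    LeafSums& s = sums[static_cast<std::size_t>(leaf_of_row[i])];
    s.sum_gradient += gradients[i];
    s.sum_hessian += hessians[i];
  }
  return sums;
}
```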


Further grow the tree completed in the previous step: Split each leaf one more time, i.e., if the tree created in the above step has 10 leaves, this step expands it to 20 leaves. For each leaf, proceed as in the first step: within the κ variable set (assuming we are at the 1st tree, 3rd tree and so on), identify the split/variable that has the biggest average score difference abs((Σ_L score / n_L) − (Σ_R score / n_R)). This time, actually expand the tree with the split. The output is calculated based on the sum of gradients / sum of Hessians.
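The bookkeeping for this third step can be sketched as below: choose the candidate set by tree parity, then give every leaf two children. The child numbering (2 * leaf for the left side, 2 * leaf + 1 for the right) and all names are assumptions made for illustration; the per-leaf split is taken as already chosen by the gap criterion.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split chosen for one leaf (hypothetical type).
struct LeafSplit {
  int feature;       // index into columns
  double threshold;  // rows with value <= threshold go left
};

// Trees are counted from 1: odd trees use kappa, even trees use omega.
inline const std::vector<int>& CandidateSet(int tree_number,
                                            const std::vector<int>& kappa,
                                            const std::vector<int>& omega) {
  assert(tree_number >= 1);
  return (tree_number % 2 == 1) ? kappa : omega;
}

// Splits every leaf once, so a tree with n leaves ends up with 2n leaves.
// columns[f][i] is the value of feature f for row i.
inline std::vector<int> ExpandLeaves(
    const std::vector<std::vector<double>>& columns,
    const std::vector<int>& leaf_of_row,
    const std::vector<LeafSplit>& split_of_leaf) {
  std::vector<int> new_leaf(leaf_of_row.size());
  for (std::size_t i = 0; i < leaf_of_row.size(); ++i) {
    const int leaf = leaf_of_row[i];
    assert(leaf >= 0 && static_cast<std::size_t>(leaf) < split_of_leaf.size());
    const LeafSplit& s = split_of_leaf[static_cast<std::size_t>(leaf)];
    const int side = columns[s.feature][i] > s.threshold ? 1 : 0;
    new_leaf[i] = 2 * leaf + side;
  }
  return new_leaf;
}
```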

Part D: Implementation of this new objective in real data

Bonus (20%): Model the "Hepatitis C Virus (HCV) for Egyptian patients" Data Set with this objective, using κ = {Age, red blood cells}, ω = {ALT 24, ALT 36}, α = 0.1 if Gender = Male and α = 0.7 if Gender = Female, and λF = 0.3, λG = 0.7, λH = 0.9.
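The per-observation constant α and the penalty multipliers for this experiment could be prepared as in the sketch below. The `Gender` enum is a hypothetical stand-in for however the dataset encodes the column, and the third multiplier listed is assumed to be λH.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Hypothetical encoding of the Gender column.
enum class Gender { Male, Female };

// Penalty multipliers (lambda_F, lambda_G, lambda_H); the last is assumed to be lambda_H.
const std::array<double, 3> kPenalty = {{0.3, 0.7, 0.9}};

// alpha = 0.1 for male patients and 0.7 for female patients.
inline std::vector<double> BuildAlpha(const std::vector<Gender>& gender) {
  std::vector<double> alpha;
  alpha.reserve(gender.size());
  for (Gender g : gender) {
    alpha.push_back(g == Gender::Male ? 0.1 : 0.7);
  }
  return alpha;
}
```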
