
**Instructions**

- Submit the A4-sized / letter-sized report (PDF file) by **23:59 on 30th November 2022** via **T2SCHOLA**.
- You can use either **Japanese** or English.
- You can submit a hand-written manuscript as well as one prepared with a word processor such as MS Word or TeX.
- You can earn 100 pts by solving all the problems in this document, or 90 pts by solving all the problems except those marked (Option). The score earned in this assignment is used as part of the final score of this lecture. The final score is based on the sum of this report and the scores earned in Prof. Okazaki's part.

**Problem 1**

Solve the following problems on linear algebra and probability theory. Here, we assume a vector is a column vector instead of a row vector. Let $\top$ be the transpose operation; i.e., let $x^\top \in \mathbb{R}^{1 \times d}$ be the transpose of $x \in \mathbb{R}^d \equiv \mathbb{R}^{d \times 1}$.

In problems **P1-A** and **P1-B**, let

$$f(A, b, c, d, x) = \frac{1}{2}\|Ax - b\|_2^2 + c^\top x + d$$

be a function of $x \in \mathbb{R}^d$, $A \in \mathbb{R}^{m \times d}$, $b \in \mathbb{R}^m$, $c \in \mathbb{R}^d$, and $d \in \mathbb{R}$.

**P1-A** Derive the following derivatives: $\frac{\partial f}{\partial A}$ and $\frac{\partial f}{\partial b}$, respectively.

**P1-B** Here $f$ is rewritten as a function of $x$ as $\tilde{f}$, and we denote the optimal variable of $x$ that minimizes the function $\tilde{f}$ as $\hat{x} = \operatorname{argmin}_{x \in \mathbb{R}^d} \tilde{f}(x)$. Derive the analytical solution of $\hat{x}$. Here we assume $A^\top A$ is a positive definite matrix. (It is OK to use the fact that a positive definite matrix is also a symmetric matrix.)

**P1-C** Here let $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times n}$ be a matrix and a square matrix, respectively. Derive the following derivative, $\frac{\partial}{\partial A}\operatorname{Tr}(ABA^\top)$, where $\operatorname{Tr}$ represents the trace operation.

**P1-D** Here let $x$ and $y$ be real-valued random variables, respectively. Show that the variance of a sum is $\mathrm{V}[x + y] = \mathrm{V}[x] + \mathrm{V}[y] + 2\,\mathrm{COV}[x, y]$, where $\mathrm{COV}[x, y]$ is the covariance between $x$ and $y$.
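This identity also holds exactly for empirical (sample) moments when the same normalization is used on both sides, so it can be sanity-checked numerically. The sketch below uses synthetic correlated samples; the data and sample size are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated samples (illustrative data, not from the assignment).
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)

# V[x + y] versus V[x] + V[y] + 2 COV[x, y], both with ddof=0 normalization.
lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, bias=True)[0, 1]

print(np.isclose(lhs, rhs))  # True: the two sides agree up to rounding
```

Note that `bias=True` matches the `ddof=0` default of `np.var`; mixing normalizations would break the exact equality.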

**Problem 2**

Solve the following problems on linear regression and the effect of regularization. As shown in the lecture, the optimization of linear regression, also known as the least-squares problem, is defined as the following optimization problem:

$$\hat{w}_{\mathrm{LS}} = \operatorname*{argmin}_{w} \frac{1}{2}\|y - Xw\|_2^2,$$

where the design matrix, the response vector, and the parameter are represented by $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$, and $w \in \mathbb{R}^d$, respectively (review the lecture slides and check the video if necessary).

Though regression through this optimization may work properly under some conditions, it is also known that this model is prone to overfitting. As a simple approach to tackle the overfitting issue, Ridge regularization is frequently employed in the machine learning community. The resultant optimization problem is defined as follows:

$$\hat{w}_{\mathrm{ridge}} = \operatorname*{argmin}_{w} \frac{1}{2}\|y - Xw\|_2^2 + \frac{\lambda}{2}\|w\|_2^2.$$

**P2-A** Obtain the analytical solutions of $\hat{w}_{\mathrm{LS}}$ and $\hat{w}_{\mathrm{ridge}}$, respectively. Here we assume $X^\top X$ is regular, i.e., $(X^\top X)^{-1}$ exists.

**P2-B** Explain the procedure of *cross validation* and the reasons why we need it in machine learning.
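As a concrete, purely illustrative sketch of the procedure, the code below runs K-fold cross validation for ridge regression with plain NumPy. The synthetic data, the number of folds, and the candidate values of $\lambda$ are assumptions, not part of the assignment:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def ridge_fit(X, y, lam):
    # Regularized normal equations: (X^T X + lam I) w = X^T y.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def kfold_cv_error(X, y, lam, k=5):
    # Split the indices into k folds; each fold serves once as validation set,
    # while the remaining folds are used for training.
    folds = np.array_split(np.arange(len(y)), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((y[val] - X[val] @ w) ** 2))
    return np.mean(errors)  # average validation error over the k folds

# Select the hyperparameter with the lowest average validation error.
lams = [0.01, 0.1, 1.0, 10.0]
best = min(lams, key=lambda lam: kfold_cv_error(X, y, lam))
print(best)
```

The key point the sketch illustrates is that every sample is used for validation exactly once, so the averaged error estimates generalization performance without touching a separate test set.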

**(Option) P2-C** Even if $X^\top X$ is not regular, prove that $X^\top X + \lambda I$ is regular. This means that the optimal parameter $\hat{w}_{\mathrm{ridge}}$ is available whether $X^\top X$ is regular or not. Here $I \in \mathbb{R}^{d \times d}$ represents an identity matrix and $\lambda > 0$ denotes the hyperparameter of the regularization.

**(Option) P2-D** Assuming that $X^\top X$ is regular, prove that $\|X\hat{w}_{\mathrm{LS}}\|_2^2 \geq \|X\hat{w}_{\mathrm{ridge}}\|_2^2$. This result is also known as shrinkage in machine learning. Explain situation(s) where shrinkage works effectively in machine learning.
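The inequality can be sanity-checked numerically using the standard closed-form solutions of the two problems. This is only an illustrative sketch on synthetic data, not a proof:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 3, 1.0
X = rng.normal(size=(n, d))   # tall random X, so X^T X is regular a.s.
y = rng.normal(size=n)

# Standard closed-form least-squares and ridge solutions.
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Shrinkage: the ridge fitted values have no larger norm than the LS ones.
print(np.sum((X @ w_ls) ** 2) >= np.sum((X @ w_ridge) ** 2))  # True
```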

**Problem 3**

Solve the following problems on linear classification and numerical optimization. We consider linear binary classification where an input, an output, and the parameter of the model are represented by $x \in \mathbb{R}^d$, $y \in \mathbb{R}$, and $w \in \mathbb{R}^d$, respectively. Here we pursue the optimal parameter $w$ with the given training dataset $\{x_i, y_i\}_{i=1}^{n}$ by employing *logistic regression*. Specifically, the optimal parameter is obtained by solving the following optimization problem:

$$\hat{w} = \operatorname*{argmin}_{w} J(w), \qquad J(w) := \sum_{i=1}^{n} \ln\left(1 + \exp\left(-y_i w^\top x_i\right)\right) + \frac{\lambda}{2}\|w\|_2^2,$$

where $\lambda$ denotes a hyperparameter of the ridge regularization.

**P3-A** Describe the mechanism of gradient descent methods frequently used in machine learning, and also explain the reason(s) why we need such a numerical method to obtain $\hat{w}$ in logistic regression.
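A minimal gradient-descent sketch for the objective $J(w)$ above is given below; the synthetic data, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 3, 0.1
X = rng.normal(size=(n, d))
# Labels in {-1, +1} from a hypothetical ground-truth direction plus noise.
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n))

def J(w):
    # Regularized logistic loss: sum_i ln(1 + exp(-y_i w^T x_i)) + (lam/2)||w||^2.
    return np.sum(np.log1p(np.exp(-y * (X @ w)))) + 0.5 * lam * w @ w

def grad_J(w):
    # Gradient: sum_i -y_i x_i * sigmoid(-y_i w^T x_i) + lam w.
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))   # sigmoid(-y_i w^T x_i)
    return -X.T @ (y * s) + lam * w

w = np.zeros(d)
eta = 0.005                    # fixed step size (an assumption)
for _ in range(2000):
    w = w - eta * grad_J(w)    # step in the direction opposite to the gradient

print(J(w) < J(np.zeros(d)))  # True: the objective decreased from its start
```

The iteration repeats one simple rule, $w \leftarrow w - \eta\,\partial J/\partial w$, which is exactly why such a method is needed here: unlike least squares, setting the gradient of $J$ to zero yields no closed-form solution.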

**P3-B** Derive $\frac{\partial J(w)}{\partial w} \in \mathbb{R}^d$ and $\frac{\partial}{\partial w^\top}\left(\frac{\partial J(w)}{\partial w}\right) \in \mathbb{R}^{d \times d}$. They are known as the gradient and the Hessian of $J$ w.r.t. $w$, respectively.

**(Option) P3-C** As explained in the lecture, the necessary condition of optimality is $\frac{\partial J(w)}{\partial w} = \mathbf{0}$. It should be noted that $\frac{\partial J(w)}{\partial w} = \mathbf{0}$ is also the sufficient condition of optimality for logistic regression with a certain strength of regularization $\lambda > 0$. This indicates that the parameter $\tilde{w}$ such that $\left.\frac{\partial J(w)}{\partial w}\right|_{w = \tilde{w}} = \mathbf{0}$ globally minimizes the objective function $J$. Explain the reasons why $\frac{\partial J(w)}{\partial w} = \mathbf{0}$ is the necessary and sufficient condition of optimality in logistic regression.