
Final Alternative Assessment

COMP0085 2019–20

Always provide justification and show any intermediate work for your answers.

A correct but unsupported answer may not receive any marks.

Page 1 of 10 COMP0085 2019–20

In the following, bold symbols represent vectors. When asked to provide algorithms, you

may provide equations or use numpy/MATLAB/Octave notation for matrix algebra. You

may assume the availability of standard matrix functions such as inv, pinv, eig or svd.

You may not assume statistical functions, including mean and cov. If you choose to

use programming notation, you will not lose marks for syntactic errors, as long as your

intention is clear and correct.

You might find it useful to consult the following table of standard distributions:

Name                 Domain           Symbol                Density or Probability fn
Gaussian (Normal)    R                x ∼ N(µ, σ²)          (2πσ²)^(−1/2) exp(−(x − µ)²/(2σ²))
Multivariate Normal  R^D              x ∼ N(µ, Σ)           |2πΣ|^(−1/2) exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ))
Bernoulli            {0, 1}           x ∼ Bernoulli(p)      p^x (1 − p)^(1−x)
Binomial             {0, 1, …, N}     x ∼ Binom(p)          (N choose x) p^x (1 − p)^(N−x)
Multinomial          {0, 1, …, N}^D   x ∼ Multinom(p)       N!/(x₁! x₂! ⋯ x_D!) · ∏_{d=1}^D p_d^(x_d)
Poisson              {0, 1, 2, …}     x ∼ Poisson(µ)        µ^x e^(−µ)/x!
Beta                 [0, 1]           x ∼ Beta(α, β)        x^(α−1) (1 − x)^(β−1) / B(α, β)
Exponential          R₊               x ∼ Exp(λ)            λ e^(−λx)
Gamma                R₊               x ∼ Gamma(α, β)       (β^α/Γ(α)) x^(α−1) e^(−βx)
Student-t            R                x ∼ Student-t(α)      Γ((α+1)/2) / (√(πα) Γ(α/2)) · (1 + x²/α)^(−(α+1)/2)
Inverse-Wishart      S^p₊             X ∼ InvWish(Ψ, ν)     |Ψ|^(ν/2) / (2^(νp/2) Γ_p(ν/2)) · |X|^(−(ν+p+1)/2) exp(−(1/2) Tr[Ψ X⁻¹])
Dirichlet            [0, 1]^D         x ∼ Dirichlet(α)      Γ(Σ_{d=1}^D α_d) / ∏_{d=1}^D Γ(α_d) · ∏_{d=1}^D x_d^(α_d−1)


1. Mean field approximation

In this problem, we will compute a mean field approximation for an Ising model
with cost function φ. Given a set of random variables X₁, …, X_n with values X_i ∈
{−1, +1}, the Ising model cost function is defined as

φ(x) = − Σ_{i,j=1}^n α_ij x_i x_j − Σ_{i=1}^n β_i x_i   for x = (x₁, …, x_n) ∈ {−1, 1}^n,   (1)

where αij and βi for i, j ∈ {1, . . . , n} are constant weights. (In the example we

discussed in class, αij was non-zero if i and j were neighbors on a grid.) The

distribution of the Ising model is given by

P(x) = (1/Z(α, β)) e^(−φ(x))   where   Z(α, β) = Σ_x e^(−φ(x)).   (2)

We approach the problem by approximating P by a factorial distribution

Q(x) = ∏_{i=1}^n Q_i(x_i)   where   Q_i(x_i) = (1 + m_i)/2  if x_i = +1,
                                               (1 − m_i)/2  if x_i = −1,   (3)

each with some parameter m_i ∈ [−1, 1].

Hints: (i) Recall that tanh(x) = (e^x − e^(−x))/(e^x + e^(−x)), with inverse
arctanh(x) = (1/2) log((1 + x)/(1 − x)).

(ii) Wherever possible, use the fact that E[Y Z] = E[Y] E[Z] if Y and Z are independent.

(a) Recall that the mean field approximation of a distribution P by Q minimizes the
cost function KL[Q‖P]. Show that

argmin_{m ∈ [−1,1]^n} KL[Q‖P] = argmin_{m ∈ [−1,1]^n} F(m),   where   F(m) := −H(Q) + E_Q[φ(X)],   (4)

H(Q) is the entropy of Q and E_Q denotes expectation under Q.

[3 marks]

(b) Show that

X

X∈{−1,1}n

Q (x) f(xi) = X

xi∈{−1,+1}

Qi (xi) f(xi) (5)

holds for any factorial distribution Q and any function f : {−1, 1} → R.

[3 marks]

(c) Show that the expected value of φ under Q is

E_Q[φ(X)] = − Σ_{i,j=1}^n α_ij m_i m_j − Σ_{i=1}^n β_i m_i.   (6)

[3 marks]


(d) Show that the entropy of Q is

H[Q] = − Σ_{i=1}^n [ ((1 + m_i)/2) log((1 + m_i)/2) + ((1 − m_i)/2) log((1 − m_i)/2) ].   (7)

[4 marks]
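As an illustrative aside (not part of the question), the per-site entropy in (7) can be checked numerically against the direct definition −Σ_x Q_i(x) log Q_i(x); the values of m used below are hypothetical.

```python
import math

def entropy_formula(m):
    """Per-site entropy of Q_i from equation (7), for parameter m_i = m."""
    p, q = (1 + m) / 2.0, (1 - m) / 2.0
    return -(p * math.log(p) + q * math.log(q))

def entropy_direct(m):
    """Direct computation: -sum over x in {-1, +1} of Q_i(x) log Q_i(x)."""
    probs = [(1 + m) / 2.0, (1 - m) / 2.0]  # Q_i(+1), Q_i(-1)
    return -sum(p * math.log(p) for p in probs)

# Hypothetical parameter values, chosen only to illustrate the identity.
for m in (0.3, -0.7, 0.0):
    assert abs(entropy_formula(m) - entropy_direct(m)) < 1e-12
```

At m = 0 the factor is uniform and the entropy attains its maximum log 2, as expected.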

(e) Show that the optimal distribution is given by the parameter values

m_i = tanh( Σ_{j=1}^n α_ij m_j + β_i ).   (8)

[4 marks]
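The fixed-point conditions in (8) suggest a simple iterative scheme. A minimal sketch in plain Python follows; the 3-node weights α and β are hypothetical, chosen only for illustration.

```python
import math

def mean_field_ising(alpha, beta, iters=200):
    """Synchronously iterate the mean-field fixed point
    m_i = tanh(sum_j alpha[i][j] * m_j + beta[i]), as in equation (8)."""
    n = len(beta)
    m = [0.0] * n  # initialise at the uniform distribution (all m_i = 0)
    for _ in range(iters):
        m = [math.tanh(sum(alpha[i][j] * m[j] for j in range(n)) + beta[i])
             for i in range(n)]
    return m

# Illustrative 3-node example (weights are hypothetical, not from the question).
alpha = [[0.0, 0.2, 0.0],
         [0.2, 0.0, 0.2],
         [0.0, 0.2, 0.0]]
beta = [0.5, 0.0, -0.5]
m = mean_field_ising(alpha, beta)
```

By the symmetry of this example, the iteration settles at m = (tanh 0.5, 0, −tanh 0.5); in general the iteration may only find a local optimum of the free energy.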


2. Sampling algorithms

(a) Consider a distribution with a very simple “staircase” density p (the thick line

in the left figure). We divide the area under p into three “slices”:

[Figure A: the staircase density p (thick line) on [a, b], divided horizontally into Slices 1–3, with the point (x_{i−1}, t_i) marked. Figure B: slice k viewed as a box, with a uniformly sampled point (x_i, y_i).]

We define a Markov chain that generates samples x₁, x₂, … as follows. Start
with any x₁ ∈ [a, b]. For each i > 1:

(i) Choose one of the slices that overlap the location of the previous sample
x_{i−1} at random, as follows:

• t_i ∼ Uniform[0, p(x_{i−1})]

• Select the slice k which contains (x_{i−1}, t_i).

(In the right figure, this would be slice 2.)

(ii) Regard slice k as a box and sample a point (x_i, y_i) uniformly from this box.
Discard the vertical coordinate y_i and keep x_i as the ith sample.

Is this a valid sampler for p? More precisely: If we assume that the previous

sample xi−1 is already distributed according to p (on the grounds that the

Markov chain has converged), is xi marginally distributed according to p?

[6 marks]
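As an empirical companion to this question (it does not settle the validity argument, which must be made analytically), the chain can be simulated on a hypothetical three-step staircase density on [0, 3] with unit-width steps of heights 3, 2, 1:

```python
import random

# Hypothetical staircase density on [a, b] = [0, 3] (unnormalised; the
# sampler only ever uses relative heights).
def p(x):
    return 3.0 if x < 1.0 else (2.0 if x < 2.0 else 1.0)

# Slice k is the horizontal band of heights (k-1, k]; for this staircase
# its bounding box spans [0, RIGHT_EDGE[k]].
RIGHT_EDGE = {1: 3.0, 2: 2.0, 3: 1.0}

def step(x_prev):
    """One transition of the Markov chain described in part (a)."""
    t = random.uniform(0.0, p(x_prev))      # (i) vertical coordinate under p
    k = min(int(t) + 1, 3)                  #     slice containing (x_prev, t)
    x = random.uniform(0.0, RIGHT_EDGE[k])  # (ii) uniform x in slice k's box
    return x                                #      (y coordinate discarded)

random.seed(0)
x = 0.5
samples = []
for _ in range(20000):
    x = step(x)
    samples.append(x)
```

For this particular density the normalised mass of [0, 1] is 3/6 = 1/2, so the long-run fraction of samples landing there gives a quick sanity check of the chain's behaviour.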

(b) Consider a 2-dimensional Gaussian N(µ, Σ) with

µ = (3, 3)ᵀ   and   Σ = [ 1  0
                          0  2 ].

Write down a Gibbs sampler for N(µ, Σ); that is, given x_i = (x_{i,1}, x_{i,2}) ∈ R²,
specify how to sample x_{i+1} ∈ R². [4 marks]
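One possible sketch of such a sampler, in plain Python (the paper permits numpy/MATLAB-style notation; this is illustrative, not a model answer): because Σ here is diagonal, each conditional reduces to the corresponding marginal.

```python
import random

MU = (3.0, 3.0)
SD = (1.0, 2.0 ** 0.5)  # standard deviations from the diagonal of Sigma

def gibbs_step(x):
    """One Gibbs sweep for N(mu, diag(1, 2)).

    With a diagonal covariance the conditionals do not depend on the other
    coordinate: x1 | x2 ~ N(3, 1) and x2 | x1 ~ N(3, 2)."""
    x1 = random.gauss(MU[0], SD[0])  # sample x_{i+1,1} | x_{i,2}
    x2 = random.gauss(MU[1], SD[1])  # sample x_{i+1,2} | x_{i+1,1}
    return (x1, x2)

random.seed(0)
x = (0.0, 0.0)
xs = []
for _ in range(20000):
    x = gibbs_step(x)
    xs.append(x)
```

Since the two coordinates are independent under this Σ, Gibbs sampling here coincides with exact independent sampling; with correlated coordinates each conditional would also shift its mean based on the other coordinate's current value.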

(c) Consider a directed graphical model with graph:

[Figure: a directed graph on the six variables X₁, X₂, X₃, X₄, X₅, X₆.]


For a set x₁, …, x_n, we write x_{−i} for the set with the ith element removed,

x_{−i} = {x₁, …, x_{i−1}, x_{i+1}, …, x_n}.

If i = 1, we read X_{i−1} as X₆. Suppose each variable X_i takes values in {0, 1},
with conditional

P(X_i = 1 | X_{i−1}) = σ(θ_i X_{i−1})   for some θ_i ∈ [0, 1],

where σ(x) = e^x/(e^x + 1) is the sigmoid function. What are the full conditionals
P(X_i | X_{−i})? Please express the unnormalized full conditional P̃(X_i | X_{−i}) in
terms of σ, and explain how to determine the normalization constant for a
given setting of X_{−i}. [7 marks]
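The figure's edge structure is not recoverable from this copy; assuming the cyclic chain that the conditionals suggest (X₁ → X₂ → … → X₆ → X₁, consistent with reading X_{i−1} as X₆ when i = 1), the computation can be sketched as follows. X_i appears in exactly two factors, and the normaliser is a sum over its two values:

```python
import math

def sigmoid(x):
    return math.exp(x) / (math.exp(x) + 1.0)

def cond(xi, x_prev, theta_i):
    """P(X_i = xi | X_{i-1} = x_prev): sigma(theta_i * x_prev) when xi = 1."""
    p1 = sigmoid(theta_i * x_prev)
    return p1 if xi == 1 else 1.0 - p1

def full_conditional(i, x, theta):
    """P(X_i | X_{-i}) on the assumed cycle: proportional to the product of
    the two conditionals in which X_i appears (indices wrap around)."""
    n = len(x)
    prev_i, next_i = (i - 1) % n, (i + 1) % n
    unnorm = {xi: cond(xi, x[prev_i], theta[i]) * cond(x[next_i], xi, theta[next_i])
              for xi in (0, 1)}
    z = unnorm[0] + unnorm[1]  # normaliser: sum over the two values of X_i
    return {xi: u / z for xi, u in unnorm.items()}
```

With all θ_i = 0 every factor equals 1/2, so the full conditional is uniform, which is a convenient check.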


3. Stochastic (natural) gradients for conjugate-exponential variational Bayes

Consider fitting a conjugate-exponential (CE) latent variable model on n observations X = {x_i}₁ⁿ, with corresponding latents Z = {z_i}₁ⁿ and natural parameters θ,
by variational Bayes (VB):

P(θ) = h(ν, τ) g(θ)^ν e^(θᵀτ)

P(Z, X | θ) = ∏_{i=1}^n f(z_i, x_i) g(θ) e^(θᵀ T(z_i, x_i))

where ν and τ are hyperparameters and T is the joint sufficient statistic on latent
and observation.

(a) Making the VB approximation Q(θ, Z) = Q(θ)Q(Z), and assuming the general
form of the VB iterative updates, derive the specific updates for the CE model.
In particular, show that the update for Q(θ) depends on ⟨Σ_i T(z_i, x_i)⟩_{Q(Z)}.

[5 marks]

If the number of observations is very large, then storing and computing ⟨Σ_i T(z_i, x_i)⟩_{Q(Z)}
on every iteration may be computationally expensive. An alternative is to use a
'minibatch' stochastic gradient approach, even when a closed-form update is possible in principle. Let

Q(θ; ν̃, τ̃) = h(ν̃, τ̃) g(θ)^ν̃ e^(θᵀτ̃).

(b) Show that this conjugate form can itself be written as an exponential family distribution on θ. Identify the natural parameters η̃_θ and the corresponding sufficient
statistics. [2 marks]

(c) Show that the gradient of the VB free energy with respect to η̃_θ has the form

∇_{η̃_θ} F = (∇_{η̃_θ} µ̃_θ)ᵀ (η_θ^0 + n η_θ^Z − η̃_θ)

where η_θ^0 are the natural parameters of the prior p(θ), η_θ^Z is a term that depends
on the observations and Q(Z), and µ̃_θ are the mean parameters of Q(θ; ν̃, τ̃).

[4 marks]

(d) Thus, show that the stochastic gradient computed using a random ‘minibatch’

(i.e. subset of the data) is an unbiased estimate of the true gradient, and so

appropriate for stochastic gradient ascent. [2 marks]
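The unbiasedness claimed in (d) rests on the fact that a uniformly chosen minibatch sum, rescaled by n/|batch|, has expectation equal to the full-data sum. A small numerical check (the scalar T values below are hypothetical stand-ins for the sufficient statistics T(z_i, x_i)):

```python
import random

random.seed(0)
n = 100
T = [random.random() for _ in range(n)]  # hypothetical per-datum statistics
full_sum = sum(T)

batch_size = 10
n_trials = 50000
total = 0.0
for _ in range(n_trials):
    batch = random.sample(range(n), batch_size)           # uniform random minibatch
    total += (n / batch_size) * sum(T[i] for i in batch)  # rescaled minibatch sum
avg = total / n_trials
# The average of the rescaled minibatch sums approaches the full-data sum.
```

Each index appears in the batch with probability batch_size/n, so linearity of expectation gives E[(n/batch_size) Σ_{i∈batch} T_i] = Σ_i T_i.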

(e) Recall (e.g., from our discussion of independent components analysis) that the
natural gradient of a function of a distribution F(P_η) with respect to the parameters η takes the form

∇̌F(P_η) = ⟨−∇∇ log P_η⟩⁻¹ ∇F(P_η)

where all gradients are with respect to η and ∇∇ is the Hessian matrix. Show
that the natural gradient of F with respect to η̃_θ has a particularly simple
form.

[4 marks]


4. Modelling earthquakes

The US Geological Survey has collected a dataset of minor earthquake events from

across the country, and would like to model how those rates vary across different

faults and with the ambient temperature. Index the faults by i = 1, . . . , N and

different (discretised) temperature bands by t = 1, . . . , T. Let Qit be the number of

earthquakes on fault i when the temperature was in band t.

As a first pass, we might model the data by assuming that each fault i has an intrinsic

earthquake frequency ri that is independent of the earthquake rates of other faults,

and that in temperature band t the earthquake rates are amplified by a factor s_t.
This suggests the model

r_i ∼ Gamma(a, b)   for each fault i
s_t ∼ Gamma(c, d)   for each band t
Q_it ∼ Poisson(r_i s_t)   for each i, t

(a) Draw out the model as a directed graphical model. Use plates to make your

representation as compact as possible. [3 marks]

(b) What conditional independence relationships are preserved in the posterior

when all the counts Qit are observed? Draw the corresponding undirected

graph induced on r₁, …, r_N, s₁, …, s_T. [3 marks]

(c) Derive a Gibbs sampler to infer all the ri and st when the Qit are observed.

Specify the form and parameters of each of the conditional distributions involved. How would the samples be used to obtain a Monte Carlo estimate of

the mean and variance of ri? [6 marks]

(d) Derive an expectation propagation (EP) algorithm to infer the r_i and s_t. Approximate the likelihood term for Q_it ∼ Poisson(r_i s_t) by

L_it ≈ r_i^(a_it) e^(−b_it r_i) s_t^(c_it) e^(−d_it s_t)

where a_it, b_it, c_it and d_it are variational parameters to be fit using EP. You may
assume that the KL projection step is tractable. [5 marks]


5. Loopy graphs, BP and EP

Consider an undirected graphical model over discrete-valued variables X = {X_i}
with singleton and pairwise factors:

P(X) = (1/Z) ∏_{nodes i} f_i(X_i) ∏_{edges (ij)} f_ij(X_i, X_j)

We wish to compute the marginal distributions P(X_i) for all nodes i and pairwise
marginals P(X_i, X_j) for all edges (ij).

Recall that any distribution on discrete variables can be written in exponential family

form.

(a) Show that the marginal probabilities we seek are exactly the mean parameters of

the distribution – that is, the expected values of the sufficient statistic functions.

[3 marks]

This means that, in principle, we could find their values using the standard result
for exponential families:

µ = argmax_µ [ ηᵀµ − Ψ(µ) ]

where η are the natural parameters and Ψ is the negative entropy function, dual to
the log normalising constant.

(b) Give two reasons why this approach is intractable for a general (loopy) graph.

[3 marks]

Consider an approximation given by

P̃(X) = (1/Z) ∏_{nodes i} f_i(X_i) ∏_{edges (ij)} f_ij^i(X_i) f_ij^j(X_j)

where we have introduced approximate factors f̃_ij(X_i, X_j) = f_ij^i(X_i) f_ij^j(X_j).

(c) Derive the expectation propagation (EP) updates for f̃_ij. [8 marks]

(d) Show that these are equivalent to the updates under belief propagation (BP),

so that EP in this approximation corresponds to loopy BP. [2 marks]


6. A mixture of VAEs

In lecture we discussed the “structured variational autoencoder” (SVAE), which

provides one way to learn a structured latent model combined with flexible output

distributions. Here we consider another, simpler, approach.

Consider the following model for observations x, with continuous latent z ∈ R^D and
discrete latent s ∈ {1, …, K}:

s ∼ Discrete(π)
z ∼ N(0, I)
x | z, s ∼ N(g(z; γ_s), σ²_s I)

The parameters are Θ = {π, γ₁, …, γ_K, σ²₁, …, σ²_K}, with π a standard probability
vector; γ_s, for each value of s, a parameter vector for a function g (these could be
the weights of a neural network); and σ²_s the variance of the observation noise for
the sth component.

(a) Write out an explicit expression for the variational free energy for this model in

terms of the variational distribution q(z, s) and the parameters defined above.

Simplify it to identify the sufficient statistics needed for exact learning.

[5 marks]

Now assume a variational inferential approximation based on an s-dependent recognition model:

q(z, s) = q(z|s) q(s) = q(z; f(x; φ_s)) q(s)

with q(z; f(x; φ_s)) a normal distribution parametrised by the output of f(x; φ_s) –
perhaps the means and diagonal elements of the covariances.

(b) Show that, under this assumption, the free energy has the form

F(Θ, q) = Σ_m r_m F_m(Θ_m, φ_m) − KL[q(s) ‖ p(s)]

where r_m = q(s = m) are the mixture responsibilities and Θ_m = {γ_m, σ²_m}.

[5 marks]

(c) Given data X = {x_i}, introduce reparametrised samples z_i^(t,m) ∼ q(z_i; f(x_i, φ_m)),
t = 1, …, T for each i and m, to derive a learning algorithm that estimates:

• q(s), π

• φ_m, γ_m and σ²_m.

Provide explicit equations for updates in terms of current parameters and samples. [6 marks]
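The reparametrised samples in (c) can be drawn with the standard trick z = µ + σ ⊙ ε, ε ∼ N(0, I), for a normal recognition distribution with diagonal covariance. A minimal sketch (the mean and log-variance values stand in for the outputs of f(x; φ_m) and are hypothetical):

```python
import math
import random

def reparam_samples(mu, log_var, T):
    """Draw T samples z^t = mu + sigma * eps with eps ~ N(0, I), so that each
    sample is a differentiable function of (mu, log_var) -- the outputs of the
    recognition network f(x; phi_m) (hypothetical stand-ins here)."""
    D = len(mu)
    out = []
    for _ in range(T):
        eps = [random.gauss(0.0, 1.0) for _ in range(D)]
        out.append([mu[d] + math.exp(0.5 * log_var[d]) * eps[d] for d in range(D)])
    return out

random.seed(0)
zs = reparam_samples(mu=[1.0, -2.0], log_var=[0.0, 0.0], T=20000)
```

Because all randomness lives in ε, gradients of a Monte Carlo objective with respect to µ and log σ² (and hence φ_m) pass through the samples, which is what the learning algorithm in (c) requires.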
