这个作业是用Matlab完成矩阵代数的题目
Final Alernative Assessment
COMP0085 2019–20
Always provide justification and show any intermediate work for your answers.
A correct but unsupported answer may not receive any marks.
Page 1 of 10 COMP0085 2019–20
In the following, bold symbols represent vectors. When asked to provide algorithms, you
may provide equations or use numpy/MATLAB/Octave notation for matrix algebra. You
may assume the availability of standard matrix functions such as inv, pinv, eig or svd.
You may not assume statistical functions, including mean and cov. If you choose to
use programming notation, you will not lose marks for syntactic errors, as long as your
intention is clear and correct.
You might find it useful to consult the following table of standard distributions:
Name Domain Symbol Density or Probability fn
Gaussian (Normal) R x ∼ N
µ, σ2
1
√
2πσ2
e
− 1
2
(x−µ)
2
σ2
Multivariate Normal R
D x ∼ N (µ, Σ) |2πΣ|
−1/2
e
− 1
2
(x−µ)
TΣ−1
(x−µ)
Bernoulli {0, 1} x ∼ Bernoulli(p) p
x
(1 − p)
1−x
Binomial Z0−N x ∼ Binom(p)
N
x
p
x
(1 − p)
N−x
Multinomial [Z0−N ]
D x ∼ Multinom(p)
N!
x1! x2! . . . xD!
Y
D
d=1
p
xd
d
Poisson Z0+ x ∼ Poisson(µ) µ
x
e
−µ
/x!
Beta [0, 1] x ∼ Beta(α, β)
1
B(α, β)
x
α−1
(1 − x)
β−1
Exponential R+ x ∼ Exp(λ) λe−λx
Gamma R+ x ∼ Gamma(α, β)
β
α
Γ(α)
x
α−1
e
−βx
Student-t R x ∼ Student-t(α)
Γ
α+1
2
√
παΓ
α
2
1 +
x
2
α
−
α+1
2
Inverse-Wishart S
p
+ X ∼ InvWish(Ψ, ν)
|Ψ|
ν
2
2
νp
2 Γp(
ν
2
)
|X|
−
ν+p+1
2 e
− 1
2
Tr[ΨX−1
]
Dirichlet [0, 1]D x ∼ Dirichlet(α)
Γ
PD
d=1 αd
QD
d=1 Γ(αd)
Y
D
d=1
x
αd−1
d
Page 2 of 10 COMP0085 201
1. Mean field approximation
In this problem, we will compute a mean field approximation for an Ising model
with cost function H. Given a set of random variables X1, …, Xn with values Xi ∈
{−1, +1}, the Ising model cost function is defined as
φ(x) = −
Xn
i,j=1
αijxixj −
Xn
i=1
βixi for x = (x1, . . . , xn) ∈ {−1, 1}
n
, (1)
where αij and βi for i, j ∈ {1, . . . , n} are constant weights. (In the example we
discussed in class, αij was non-zero if i and j were neighbors on a grid.) The
distribution of the Ising model is given by
P(x) = 1
Z(α, β)
e
−φ(x) where Z(α, β) = X
x
e
−φ(x)
. (2)
We approach the problem by approximating P by a factorial distribution
Q(x) = Yn
i=1
Qi (xi) where Qi (xi) = Qi (xi) = (1+mi
2
xi = 1
1−mi
2
xi = −1
, (3)
each with some parameter mi ∈ [−1, 1].
Hints: (i) Recall that tanh(x) = e
x−e−x
e
x+e−x , with inverse arctanh (x) = 1
2
log
1+x
1−x
.
(ii) Whereever possible, use the fact that E[Y Z] = E[Y ]E[Z] if Y and Z are independent.
(a) Recall that mean field approximation of a distribution P by Q minimizes the
cost function KL[QkP]. Show that
F(m) := argmin
m∈[−1,1]n
KL[QkP] = argmin
m∈[−1,1]n
(−H(Q) + EQ[φ(X)]) (4)
where H(Q) is the entropy of Q and EQ denotes expectation under Q.
[3 marks]
(b) Show that
X
X∈{−1,1}n
Q (x) f(xi) = X
xi∈{−1,+1}
Qi (xi) f(xi) (5)
holds for any factorial distribution Q and any function f : {−1, 1} → R.
[3 marks]
(c) Show that the expected value of φ under Q is
EQ [φ(X)] = −
Xn
i,j=1
αijmimj −
Xn
i=1
βimi
. (6)
[3 marks]
Page 3 of 10 COMP0085 2019–20
(d) Show that the entropy of Q is
H[Q] = −
Xn
i=1
1 + mi
2
log1 + mi
2
+
1 − mi
2
log1 − mi
2
. (7)
[4 marks]
(e) Show that the optimal distribution is given by the parameter values
mi = tanhXn
j=1
αijmj + βi
(8)
[4 marks]
Page 4 of 10 COMP0085 2019–20
2. Sampling algorithms
(a) Consider a distribution with a very simple “staircase” density p (the thick line
in the left figure). We divide the area under p into three “slices”:
x
p(x)
a b
Slice 1
Slice 2
Slice 3
p
xi−1
ti
Figure A
x
p(x)
a b
xi
(xi, yi)
Figure B
We define a Markov chain that generate samples x1, x2, . . . as follows. Start
with any x1 ∈ [a, b]. For each i > 1:
(i) Choose one of the slices that overlap the location of the previous sample
xi−1 at random as follows: ‘
• ti ∼ Uniform[0, p(xi−1)]
• Select the slice k which contains (xi−1, ti)
(In the right figure, this would be slice 2.)
(ii) Regard slice k as a box and sample a point (xi
, yi) uniformly from this box.
Discard the vertical coordinate yi and keep xi as the ith sample.
Is this a valid sampler for p? More precisely: If we assume that the previous
sample xi−1 is already distributed according to p (on the grounds that the
Markov chain has converged), is xi marginally distributed according to p?
[6 marks]
(b) Consider a 2-dimensional Gaussian N (µ, Σ) with
µ =
3
3
and Σ =
1 0
0 2
.
Write down a Gibbs sampler for N (µ, Σ); that is, given xi = (xi,1, xi,2) ∈ R
2
,
specify how to sample xi+1 ∈ R
2
. [4 marks]
(c) Consider a directed graphical model with graph:
X1 X2
X3
X5 X4
X6
Page 5 of 10 COMP0085 2019–20
For a set x1, . . . , xn, we write x−i for the set with the ith element removed,
x−i = {x1, . . . , xi−1, xi+1, . . . , xn} .
If i = 1, we read Xi−1 as X6. Suppose each variable Xi takes values in {0, 1},
with conditional
P(Xi = 1|Xi−1) = σ(θiXi−1) for some θi ∈ [0, 1] ,
where σ(x) = e
x
e
x+1 is the sigmoid function. What are the full conditionals
P(Xi
|X−i)? Please express the unnormalized full conditional P˜(Xi
|X−i) in
terms of σ, and explain how to determine the normalization constant for a
given setting of X−i
. [7 marks]
Page 6 of 10 COMP0085 2019–20
3. Stochastic (natural) gradients for conjugate-exponential variational Bayes
Consider fitting a conjugate-exponential (CE) latent variable model on n observations X = {xi}
n
1
, with corresponding latents Z = {zi}
n
i
and natural parameters θ,
by variational Bayes (VB):
P(θ) = h(ν, τ )g(θ)
ν
e
θ
Tτ
P(Z, X ) = Yn
i=1
f(zi
, xi)g(θ)e
θ
TT(zi,xi)
where ν and τ are hyperparameters and T is the joint sufficient statistic on latent
and observation.
(a) Making the VB approximation Q(θ, Z) = Q(θ)Q(Z), and assuming the general
form of the VB interative updates, derive the specific updates for the CE model.
In particular, show that the update for Q(θ) depends on h
P
i
T(zi
, xi)iQ(Z)
.
[5 marks]
If the number of observations is very large, then storing and computing h
P
i
T(zi
, xi)iQ(Z)
on every iteration may be computationally expensive. An alternative is to use a
’minibatch’ stochastic gradient approach, even when a closed form update is possible in principle. Let
Q(θ; ˜ν, τ˜) = h(˜ν, τ˜)g(θ)
ν˜
e
θ
Tτ˜
.
(b) Show that this conjugate form can itself be written as an exponential family distribution on θ. Identify the natural parameters η˜θ and corresponding sufficient
statistics. [2 marks]
(c) Show that the gradient of the VB free energy with respect to η˜θ has the form
∇η˜θF = (∇η˜θµ˜θ
)
T
(η
0
θ + nη
Z
θ − η˜θ
)
where η
0
θ
are the natural parameters of the prior p(θ), η
Z
θ
is a term that depends
on the observations and Q(Z) and µ˜θ are the mean parameters of Q(θ; ˜ν, τ˜).
[4 marks]
(d) Thus, show that the stochastic gradient computed using a random ‘minibatch’
(i.e. subset of the data) is an unbiased estimate of the true gradient, and so
appropriate for stochastic gradient ascent. [2 marks]
(e) Recall (e.g., from our discussion of independent components analysis) that the
natural gradient of a function of a distribution F(Pη) with respect to the parameters η takes the form
∇ˇ F(Pη) = h−∇∇ log Pηi
−1∇F(Pη)
where all gradients are with respect to η and ∇∇ is the Hessian matrix. Show
that the natural gradient of F with respect to η˜θ has a particularly simple
form.
[4 marks]
Page 7 of 10 COMP0085 2019–20
4. Modelling earthquakes
The US geological service has collected a dataset of minor earthquake events from
across the country, and would like to model how those rates vary across different
faults and with the ambient temperature. Index the faults by i = 1, . . . , N and
different (discretised) temperature bands by t = 1, . . . , T. Let Qit be the number of
earthquakes on fault i when the temperature was in band t.
As a first pass, we might model the data by assuming that each fault i has an intrinsic
earthquake frequency ri that is independent of the earthquake rates of other faults,
and that in temperature band t the earthquake rates are amplified by a factor st
.
This suggests the model
ri ∼ Gamma(a, b) for each fault i
st ∼ Gamma(c, d) for each band t
Qit ∼ Poisson(rist) for each i, t
(a) Draw out the model as a directed graphical model. Use plates to make your
representation as compact as possible. [3 marks]
(b) What conditional independence relationships are preserved in the posterior
when all the counts Qit are observed? Draw the corresponding undirected
graph induced on r1 . . . rN , s1, . . . sT . [3 marks]
(c) Derive a Gibbs sampler to infer all the ri and st when the Qit are observed.
Specify the form and parameters of each of the conditional distributions involved. How would the samples be used to obtain a Monte Carlo estimate of
the mean and variance of ri? [6 marks]
(d) Derive an expectation propagation (EP) algorithm to infer the ri and st
. Approximate the likelihood term for Qit ∼ Poisson(rist) by
Lit ≈ r
ait
i
e
−bitris
cit
t
e
−ditst
where ait, bit, cit and dit are variational parameters to be fit using EP. You may
assume that the KL projection step is tractable. [5 marks]
Page 8 of 10 COMP0085 2019–20
5. Loopy graphs, BP and EP
Consider an undirected graphical model over discrete-valued variables X = {Xi}
with singleton and pairwise factors:
P(X ) = 1
Z
Y
nodes i
fi(Xi)
Y
edges (ij)
fij (Xi
, Xj )
We wish to compute the marginal distributions P(Xi) for all nodes i and pairwise
marginals P(Xi
, Xj ) for all edges (ij).
Recall that any distribution on discrete variables can be written in exponential family
form.
(a) Show that the marginal probabilities we seek are exactly the mean parameters of
the distribution – that is, the expected values of the sufficient statistic functions.
[3 marks]
This means that, in principle, we could find their values using the standard result
for exponential families:
µ = argmax η
Tµ − Ψ(µ)
where η are the natural parameters and Ψ is the negative entropy function, dual to
the log normalising constant.
(b) Give two reasons why this approach is intractable for a general (loopy) graph.
[3 marks]
Consider an approximation given by
P˜(X ) = 1
Z
Y
nodes i
fi(Xi)
Y
edges (ij)
f
i
ij (Xi)f
j
ij (Xj )
where we have introduced approximate factors ˜fij (Xi
, Xj ) = f
i
ij (Xi)f
j
ij (Xj ).
(c) Derive the expectation propagation (EP) updates for ˜fij . [8 marks]
(d) Show that these are equivalent to the updates under belief propagation (BP),
so that EP in this approximation corresponds to loopy BP. [2 marks]
Page 9 of 10 COMP0085 2019–20
6. A mixture of VAEs
In lecture we discussed the “structured variational autoencoder” (SVAE), which
provides one way to learn a structured latent model combined with flexible output
distributions. Here we consider another, simpler, approach.
Consider the following model for observations x, with continuous latent z ∈ R
D and
discrete latent s ∈ {1 . . . K}:
s ∼ Discrete(π)
z ∼ N (0, I)
x|z, s ∼ N
g(z; γs), σ2
s
I
The parameters are: Θ = {π, γ1 . . . , γK, σ2
1
. . . σ2
K} with π a standard probability
vector; γs, for each value of s, a parameter vector for a function g (these could be
the weights of a neural network); and σ
2
s
the variance of the observation noise for
the sth component.
(a) Write out an explicit expression for the variational free energy for this model in
terms of the variational distribution q(z, s) and the parameters defined above.
Simplify it to identify the sufficient statistics needed for exact learning.
[5 marks]
Now assume a variational inferential approximation based on an s-dependent recognition model.
q(z, s) = q(z|s)q(s) = q(z;f(x; φs))q(s)
with q(z;f(x; φs)) a normal distribution parametrised by the output of f(x; φs) –
perhaps the means and diagonal elements of the covariances.
(b) Show that, under this assumption, the free energy has the form
F(Θ, q) = X
m
rmFm(Θm, φm) − KL[q(s)kp(s)]
where rm = q(s = m) are the mixture responsibilities and Θm = {γm, σ2
m}.
[5 marks]
(c) Given data X = {xi}, introduce reparametrised samples z
t,m
i ∼ q(zi
; f(xi
, φm)), t =
1 . . . T for each i and m, to derive a learning algorithm that estimates:
• q(s), π
• φm, γm and σ
2
m.
Provide explicit equations for updates in terms of current parameters and samples. [6 marks]
Page 10 of 10 COMP0085 2019–2