Week 5 Homework

Your task is to fit the LDA topic model to a dataset using Gibbs sampling. First you'll use LDA as a generative model to create a synthetic corpus, and then you'll fit the model to that corpus to try to recover the latent topic structure. This exercise brings together all the tools we've covered for Bayesian models.

I've designed the exercise to be adaptive, so that you can compare your answers to mine or simply use my answers for parts you can't figure out. This applies at a conceptual level and at the level of coding. In the instructions below, each step expands into conceptual components, and then shows my code for each component. Before expanding each step, and before revealing the final code, try working out the conceptual structure and specific code on your own. Again, the idea is for you to do as much as you can on your own, with my answers available for scaffolding as needed.

Before starting, download this bit of code for sampling probability vectors from Dirichlet distributions, which you'll need at several points along the way.
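
In case that download isn't at hand, here is a sketch of one common way such a function is written (this is an assumption about the downloaded code, not a copy of it): each Dirichlet sample is built by normalizing independent Gamma draws, which requires the Statistics Toolbox's gamrnd. Save it as drchrnd.m.

function r = drchrnd(a,n)
%DRCHRND sketch of a Dirichlet sampler: n samples from Dirichlet(a)
%a is a row vector of concentration parameters; each row of r is one sample
g = gamrnd(repmat(a,n,1),1); %independent Gamma(a_i,1) draws, one row per sample
r = g./repmat(sum(g,2),1,numel(a)); %normalize each row to sum to 1, giving a probability vector
end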

1. Create a set of documents from the LDA generative model.

1a. Choose:
A number of documents (D)
A number of words per document (N) — it's easier if all documents are the same length
A number of topics (K)
A number of word types (L) — we'll just indicate the words by numbers 1 through L, rather than using real words
K = 5; %number of topics
D = 10; %number of documents
N = 100; %document length (same for all docs, for simplicity)
L = 50; %lexicon size

1b. Define eta and alpha, the parameters for the priors on beta and theta (respectively). eta should be a row vector of length L, because it's a prior for distributions over L words; alpha should be a row vector of length K, because it's a prior for distributions over K topics. To encourage each topic to contain a small set of words, and each document to cover a small number of topics, the components of eta and alpha should be close to zero; this concentrates most of the Dirichlet distribution near the boundary of the simplex.
eta = ones(1,L)/5; %parameter for Dirichlet prior on beta; encourages each topic to be concentrated on a small number of words
alpha = ones(1,K)/5; %parameter for Dirichlet prior on theta; encourages each document to be concentrated on a small number of topics
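
If you want to see directly what these small parameter values do, here is an optional check (not part of the assignment; the 0.01 cutoff for "non-negligible" is an arbitrary choice) comparing a sparse draw against a dense one:

sparseDraw = drchrnd(ones(1,L)/5,1); %components well below 1: most entries end up near zero
denseDraw = drchrnd(ones(1,L)*5,1); %components well above 1: entries end up near 1/L
disp([sum(sparseDraw>0.01) sum(denseDraw>0.01)]) %number of non-negligible entries in each draw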

1c. Sample beta and theta from their priors. For each topic k, beta(k,:) is a sample from Dirichlet(eta). For each document d, theta(d,:) is a sample from Dirichlet(alpha). You can get samples using the drchrnd function: drchrnd(A,n) returns a matrix of n samples from Dirichlet(A).
beta = drchrnd(eta,K); %topic by word-type; each row is the word distribution for one topic, as drawn from Dirichlet(eta)
theta = drchrnd(alpha,D); %document by topic; each row is the topic distribution for one document, as drawn from Dirichlet(alpha)

1d. Sample z based on theta. Each z(d,n) is sampled from {1,...,K} according to the probabilities in theta(d,:).
z = zeros(D,N); %document by word-token
for d=1:D, for n=1:N
z(d,n) = find(rand < cumsum(theta(d,:)),1); %each z(d,n) is sampled according to theta(d,:)
end,end

1e. Sample w based on z and beta. Each w(d,n) is sampled from {1,...,L} according to the probabilities in beta(k,:), where k = z(d,n).
w = zeros(D,N); %document by word-token
for d=1:D, for n=1:N
w(d,n) = find(rand < cumsum(beta(z(d,n),:)),1); %each w(d,n) is sampled according to beta(z(d,n),:)
end,end
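
As an optional sanity check on the synthetic corpus (not one of the assignment's steps), you can confirm that the empirical topic proportions in each document roughly track theta; with only N = 100 tokens per document the match won't be perfect, but the correlation should be high:

empiricalTheta = zeros(D,K); %fraction of each document's tokens assigned to each topic
for d=1:D
empiricalTheta(d,:) = histc(z(d,:),1:K)/N; %count topic assignments in document d and normalize
end
disp(['Empirical vs. true topic proportions, correlation: ' num2str(corr(empiricalTheta(:),theta(:)))])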


2. Fit LDA to the data by Gibbs sampling.

2a. Setup:
Choose a number (M) of cycles to run the Gibbs sampler.
Create arrays betaHat, thetaHat, and zHat to hold the sequences of samples for beta, theta, and z (e.g., zHat will need to hold M copies of z).
M = 100; %number of times to cycle through parameters (beta,theta,z)
betaHat = zeros(K,L,M); %series of beta samples
thetaHat = zeros(D,K,M); %series of theta samples
zHat = zeros(D,N,M); %series of z samples

2b. Choose starting values for all three variables (on step m=1).
It doesn't really matter where you start. You could draw each variable from its prior. For z, the prior is uniform once you marginalize over theta.
betaHat(:,:,1) = drchrnd(eta,K); %random starting point for beta (drawn from prior)
thetaHat(:,:,1) = drchrnd(alpha,D); %random starting point for theta (drawn from prior)
zHat(:,:,1) = randi(K,D,N); %random starting point for z (uniform across topics)

2c. Loop through the Gibbs cycle M-1 times (starting at step 2, since step 1 holds the initial values chosen above).
for m=2:M
Resample beta conditioned on the other variables. For each beta(k,:), the relevant dependencies are the Dirichlet(eta) prior and the likelihoods of all words currently assigned to topic k (i.e., w(d,n) for z(d,n)==k). These are conjugate, so you just need to count up how many of those words are of each word-type. MATLAB's histc function is useful here.
for k=1:K %resample beta for each topic k
counts = histc(w(zHat(:,:,m-1)==k)',1:L); %look at all words assigned to topic k, and count the number of each word-type in 1:L
betaHat(k,:,m) = drchrnd(eta+counts,1); %add the counts to the prior to get the posterior, and draw one Dirichlet sample for the new betaHat(k,:)
end
Resample theta conditioned on the other variables. For each theta(d,:), the relevant dependencies are the Dirichlet(alpha) prior and the likelihoods of z(d,n) for all n. These are conjugate, so you just need to count up how many of z(d,:) are assignments to each topic.
for d=1:D %resample theta for each document d
counts = histc(zHat(d,:,m-1),1:K); %count topic assignments for this document
thetaHat(d,:,m) = drchrnd(alpha+counts,1); %add the counts to the prior to get the posterior, and draw one Dirichlet sample to get the new thetaHat(d,:)
end
Resample z conditioned on the other variables. The possible values for each z(d,n) are 1:K. The relevant dependencies are the prior from theta(d,:) and the likelihood from w(d,n). These can be directly combined using Bayes' rule.
for d=1:D, for n=1:N %resample z for each word n in each document d
zPost = thetaHat(d,:,m).*betaHat(:,w(d,n),m)'/(thetaHat(d,:,m)*betaHat(:,w(d,n),m)); %plain Bayes' rule: prior on z(d,n) is theta(d,:); likelihood is beta(:,w(d,n))
zHat(d,n,m) = find(rand < cumsum(zPost),1); %sample new zHat(d,n) from its posterior
end,end
end
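
One optional way to check that the sampler has settled (not required by the assignment, and only a rough sketch since it ignores the Dirichlet prior terms) is to track the joint log likelihood of z and w under each sample; the curve should climb quickly and then fluctuate around a plateau:

logLik = zeros(M,1); %log likelihood of z and w under each Gibbs sample (prior terms omitted)
for m=1:M
ll = 0;
for d=1:D, for n=1:N
k = zHat(d,n,m); %current topic assignment for this token
ll = ll + log(thetaHat(d,k,m)) + log(betaHat(k,w(d,n),m)); %topic probability plus word probability
end,end
logLik(m) = ll;
end
plot(logLik) %should rise rapidly over the first cycles and then level off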


2d. Check how well the model recovers the true latent structure. It turns out that the true posterior is very narrow, so you can just use the final sample of betaHat, thetaHat, and zHat as point estimates. Think of some ways to quantify how well these samples match the true values beta, theta, and z. (You may find it less interesting to write your own code for this, but at the least running my code should be informative.)

Check how well the estimated word distributions in betaHat match the true distributions in beta. Note that the topic numbering is arbitrary, so you have to compare betaHat(j,:,M) to beta(k,:) for all j and k, and find the best match for each k.
betaCorr = corr(betaHat(:,:,M)',beta'); %matrix of pairwise correlations between estimated and true topic-word distributions
disp(['Topic-word distributions, mean correlation: ' num2str(mean(max(betaCorr)))]) %average, over generating topics, of the best correlation with any recovered topic

Use the correlation matrix just computed to determine the mapping between generating and recovered topics. This mapping will be needed to measure the recovery of theta and z.
map = zeros(K,1); %map from numbering of generating topics to numbering of recovered topics
for k=1:K
[~,map(k)] = max(betaCorr(:,k)); %number of the recovered topic that corresponds to the kth generating topic
end
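
Nothing forces this mapping to be one-to-one, so as an optional extra check (not part of the assignment), you can warn yourself if two generating topics were matched to the same recovered topic, which would mean the recovery partially failed:

if numel(unique(map)) < K
disp('Warning: topic mapping is not one-to-one; some topics were not cleanly recovered')
end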

Calculate the correlation between the generating and recovered topic distributions for each document.
thetaCorr = diag(corr(thetaHat(:,map,M)',theta')); %reorder topics in recovered document-topic distributions and correlate them with generating distributions
disp(['Document-topic distributions, mean correlation: ' num2str(mean(thetaCorr))]) %average correlation for every document

Calculate the proportion of correctly recovered word-topic assignments (z).
disp(['Word-topic assignments, proportion correct: ' num2str(mean(mean(zHat(:,:,M)==map(z))))])