Your task is to fit the LDA topic model to a dataset using Gibbs sampling. First you'll use LDA as a generative model to create a synthetic corpus, and then you'll fit the model to that corpus to try to recover the latent topic structure. This exercise brings together all the tools we've covered for Bayesian models.
I've designed the exercise to be adaptive, so that you can compare your answers to mine or simply use my answers for any part you can't figure out, at both the conceptual level and the coding level. In the instructions below, each step expands into conceptual components, and each component is followed by my code. Before expanding each step, and before revealing the final code, try working out the conceptual structure and the specific code on your own. The idea is to do as much as you can yourself, with my answers available as scaffolding where needed.
Before starting, download this bit of code for sampling probability vectors from Dirichlet distributions (called drchrnd in the code below), which you'll need at several points along the way.
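If you want to see what that function does before downloading it, here is a minimal sketch of my own (an assumption about its interface, based on how it's called below; the downloadable version may differ in details). It uses the standard construction that independent Gamma(a_i,1) draws, normalized to sum to 1, are Dirichlet(a) distributed; gamrnd requires the Statistics Toolbox.
function r = drchrnd(a,n)
%DRCHRND Draw n samples from a Dirichlet distribution with parameter row-vector a.
%Each row of r is one probability vector of length length(a).
r = gamrnd(repmat(a,n,1),1); %independent Gamma(a_i,1) draws, one row per sample
r = r./repmat(sum(r,2),1,length(a)); %normalize each row to sum to 1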
1. Create a set of documents from the LDA generative model.
K = 5; %number of topics
D = 10; %number of documents
N = 100; %document length (same for all docs, for simplicity)
L = 50; %lexicon size
eta = ones(1,L)/5; %parameter for Dirichlet prior on beta; encourages each topic to be concentrated on a small number of words
alpha = ones(1,K)/5; %parameter for Dirichlet prior on theta; encourages each document to be on a small number of topics
beta = drchrnd(eta,K); %topic by word-type; each row is the word distribution for one topic, as drawn from Dirichlet(eta)
theta = drchrnd(alpha,D); %document by topic; each row is the topic distribution for one document, as drawn from Dirichlet(alpha)
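As a quick sanity check (my addition, optional), you can verify that the draws behave as intended: every row should sum to 1, and because the concentration parameters are below 1, most of each row's mass should land on a few entries.
assert(all(abs(sum(beta,2)-1) < 1e-12)); %each topic's word distribution is a proper probability vector
assert(all(abs(sum(theta,2)-1) < 1e-12)); %likewise for each document's topic distribution
disp(max(beta,[],2)'); %largest word probability per topic; with eta = 0.2 per entry these should typically be substantial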
z = zeros(D,N); %document by word-token; z(d,n) is the topic assignment for token n of document d
for d=1:D, for n=1:N
z(d,n) = find(rand < cumsum(theta(d,:)),1); %each z(d,n) is sampled according to theta(d,:)
end,end
w = zeros(D,N); %document by word-token; w(d,n) is the observed word-type for token n of document d
for d=1:D, for n=1:N
w(d,n) = find(rand < cumsum(beta(z(d,n),:)),1); %each w(d,n) is sampled according to beta(z(d,n),:)
end,end
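Aside: the expression find(rand < cumsum(p),1) in both loops is inverse-CDF sampling from a discrete distribution, and the same trick reappears when resampling z below. If you prefer, you could factor it into a tiny helper (catrnd is a name I made up, not a built-in):
function i = catrnd(p)
%CATRND Draw one index from the categorical distribution with probability vector p.
i = find(rand < cumsum(p),1); %first index where a uniform draw falls inside the running CDF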
2. Fit the model to the synthetic corpus, using Gibbs sampling.
First allocate storage for the sampling history and draw a random starting point for each variable.
M = 100; %number of times to cycle through parameters (beta,theta,z)
betaHat = zeros(K,L,M); %series of beta samples
thetaHat = zeros(D,K,M); %series of theta samples
zHat = zeros(D,N,M); %series of z samples
betaHat(:,:,1) = drchrnd(eta,K); %random starting point for beta (drawn from prior)
thetaHat(:,:,1) = drchrnd(alpha,D); %random starting point for theta (drawn from prior)
zHat(:,:,1) = randi(K,D,N); %random starting point for z (uniform across topics)
for m=2:M %each sweep resamples beta, then theta, then z, each conditioned on the current values of the others
Resample beta conditioned on the other variables.
For each beta(k,:), the relevant dependencies are the Dirichlet(eta) prior and the likelihoods of all words currently assigned to topic k (i.e., all w(d,n) with z(d,n)==k). The Dirichlet prior is conjugate to this multinomial likelihood, so the conditional posterior is Dirichlet(eta+counts), where counts tallies how many of those words are of each word-type in 1:L. Matlab's histc function is useful here.
for k=1:K %resample beta for each topic k
counts = histc(w(zHat(:,:,m-1)==k)',1:L); %look at all words assigned to topic k, and count the number of each word-type in 1:L
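%Note (my aside): histc is deprecated in newer Matlab releases; counts = histcounts(w(zHat(:,:,m-1)==k),1:L+1) gives the same integer counts.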
betaHat(k,:,m) = drchrnd(eta+counts,1); %add the counts to the prior to get the posterior, and draw one Dirichlet sample for the new betaHat(k,:)
end
Resample theta conditioned on the other variables.
For each theta(d,:), the relevant dependencies are the Dirichlet(alpha) prior and the likelihoods of z(d,n) for all n. Again these are conjugate, so the conditional posterior is Dirichlet(alpha+counts), where counts tallies how many of z(d,:) are assignments to each topic in 1:K.
for d=1:D %resample theta for each document d
counts = histc(zHat(d,:,m-1),1:K); %count topic assignments for this document
thetaHat(d,:,m) = drchrnd(alpha+counts,1); %add the counts to the prior to get the posterior, and draw one Dirichlet sample to get the new thetaHat(d,:)
end
Resample z conditioned on the other variables.
The possible values for each z(d,n) are 1:K. The relevant dependencies are the prior from theta(d,:) and the likelihood of w(d,n) under each topic. Bayes' rule combines them directly: the conditional posterior over k is proportional to theta(d,k)*beta(k,w(d,n)), normalized over the K topics.
for d=1:D, for n=1:N %resample z for each word n in each document d
zPost = thetaHat(d,:,m).*betaHat(:,w(d,n),m)'; %unnormalized posterior: prior theta(d,:) times likelihood beta(:,w(d,n))
zPost = zPost/sum(zPost); %normalize over the K topics (plain Bayes' rule)
zHat(d,n,m) = find(rand < cumsum(zPost),1); %sample new zHat(d,n) from its posterior
end,end
end
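Before scoring the results, it's worth eyeballing convergence (my suggestion, not part of the recipe above). For example, plotting the sampling history of one document's topic weights shows whether the chain has settled down:
plot(squeeze(thetaHat(1,:,:))'); %M-by-K trace of document 1's topic weights, one line per topic
xlabel('Gibbs sweep'); ylabel('thetaHat(1,k,m)');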
3. Evaluate how well the recovered structure matches the generating structure.
The topics are recoverable only up to relabeling, so first match each generating topic to its best-correlated recovered topic.
betaCorr = corr(betaHat(:,:,M)',beta'); %K-by-K matrix of pairwise correlations between estimated and true topic-word distributions
disp(['Topic-word distributions, mean correlation: ' num2str(mean(max(betaCorr)))]) %for each generating topic, take its best correlation with any recovered topic, then average
map = zeros(K,1); %map from numbering of generating topics to numbering of recovered topics
for k=1:K
[~,map(k)] = max(betaCorr(:,k)); %number of the recovered topic that corresponds to the kth generating topic
end
thetaCorr = diag(corr(thetaHat(:,map,M)',theta')); %reorder topics in recovered document-topic distributions and correlate them with generating distributions
disp(['Document-topic distributions, mean correlation: ' num2str(mean(thetaCorr))]) %average correlation for every document
disp(['Word-topic assignments, proportion correct: ' num2str(mean(mean(zHat(:,:,M)==map(z))))]) %translate generating assignments into recovered numbering and score agreement
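One refinement worth trying (a sketch of my own, not part of the original recipe): rather than scoring only the final sample, average the topic-word samples over the second half of the chain. This approximates the posterior mean and is usually a steadier estimate; within a single chain the topic labels rarely switch, so averaging is generally safe.
betaBar = mean(betaHat(:,:,ceil(M/2):M),3); %posterior-mean estimate of beta from the last half of the chain
betaCorrBar = corr(betaBar',beta'); %same pairwise-correlation check as above, using the averaged estimate
disp(['Topic-word distributions (chain average), mean correlation: ' num2str(mean(max(betaCorrBar)))])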