Your task is to fit the LDA model (topic model) to a dataset, using Gibbs sampling. First you'll use the LDA as a generative model to create a synthetic corpus, and then you'll fit the model to try to recover the latent topic structure. This exercise will bring together all the tools we've covered for Bayesian models.

I've designed the exercise to be adaptive, so that you can compare your answers to mine or simply use my answers for parts you can't figure out. This applies at a conceptual level and at the level of coding. In the instructions below, each step expands into conceptual components, and then shows my code for each component. Before expanding each step, and before revealing the final code, try working out the conceptual structure and specific code on your own. Again, the idea is for you to do as much as you can on your own, with my answers available for scaffolding as needed.

Before starting, download this bit of code for sampling probability vectors from Dirichlet distributions, which you'll need at several points along the way.
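If you want to see how such a sampler works (or can't grab the download), the standard construction draws independent gamma variates and normalizes them. Below is a minimal Python/NumPy sketch of that idea, mirroring the way drchrnd is called in the code that follows (one probability vector per row); the Python names are mine:

```python
import numpy as np

def drchrnd(a, n, rng=None):
    """Draw n probability vectors from Dirichlet(a), one per row,
    by normalizing independent gamma draws."""
    rng = np.random.default_rng(0) if rng is None else rng
    a = np.asarray(a, dtype=float).ravel()
    g = rng.gamma(shape=a, scale=1.0, size=(n, a.size))  # independent Gamma(a_i, 1) draws
    return g / g.sum(axis=1, keepdims=True)              # each row sums to 1

samples = drchrnd(np.ones(5) / 5, 3)
print(samples.shape)  # (3, 5)
```

The gamma trick works because a vector of independent Gamma(a_i, 1) draws, divided by its sum, is exactly Dirichlet(a)-distributed.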

1. Create a set of documents from the LDA generative model.

1a. Choose the sizes of everything:

A number of documents (D)

A number of words per document (N) — it's easier if all documents are the same length

A number of topics (K)

A number of word types (L) — we'll just indicate the words by numbers 1 through L, rather than using real words

```
K = 5; %number of topics
D = 10; %number of documents
N = 100; %document length (same for all docs, for simplicity)
L = 50; %lexicon size
```

1b. Choose the Dirichlet hyperparameters.

```
eta = ones(1,L)/5; %parameter for Dirichlet prior on beta; encourages each topic to be concentrated on a small number of words
alpha = ones(1,K)/5; %parameter for Dirichlet prior on theta; encourages each document to be on a small number of topics
```

1c. Sample beta and theta from their priors.

```
beta = drchrnd(eta,K); %topic by word-type; each row is the word distribution for one topic, as drawn from Dirichlet(eta)
theta = drchrnd(alpha,D); %document by topic; each row is the topic distribution for one document, as drawn from Dirichlet(alpha)
```

1d. Sample z based on theta.

```
z = zeros(D,N); %document by word-token
for d=1:D, for n=1:N
    z(d,n) = find(rand < cumsum(theta(d,:)),1); %each z(d,n) is sampled according to theta(d,:)
end, end
```

1e. Sample w based on z and beta.

```
w = zeros(D,N); %document by word-token
for d=1:D, for n=1:N
    w(d,n) = find(rand < cumsum(beta(z(d,n),:)),1); %each w(d,n) is sampled according to beta(z(d,n),:)
end, end
```
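Both sampling loops above use the idiom find(rand < cumsum(p),1), which is inverse-CDF sampling from a discrete distribution: draw a uniform number and return the first category whose cumulative probability exceeds it. A Python/NumPy sketch of the same trick (the function name is mine):

```python
import numpy as np

def sample_categorical(p, rng):
    """Inverse-CDF sampling: return the first index where a uniform draw
    falls below the running sum of p (the find(rand < cumsum(p),1) idiom)."""
    u = rng.random()
    return int(np.searchsorted(np.cumsum(p), u, side='right'))

rng = np.random.default_rng(1)
p = np.array([0.1, 0.2, 0.7])
draws = [sample_categorical(p, rng) for _ in range(10000)]
freqs = np.bincount(draws, minlength=3) / len(draws)
print(freqs)  # close to [0.1, 0.2, 0.7]
```

Over many draws the empirical frequencies approach p, which is an easy sanity check for any sampler of this kind.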

2. Fit LDA to the data by Gibbs sampling.

2a. Create empty arrays betaHat, thetaHat, and zHat to hold the sequences of samples for beta, theta, and z (e.g., zHat will need to hold M copies of z).

```
M = 100; %number of times to cycle through parameters (beta,theta,z)
betaHat = zeros(K,L,M); %series of beta samples
thetaHat = zeros(D,K,M); %series of theta samples
zHat = zeros(D,N,M); %series of z samples
```

2b. Choose starting values for all three variables (on step m=1).

```
betaHat(:,:,1) = drchrnd(eta,K); %random starting point for beta (drawn from prior)
thetaHat(:,:,1) = drchrnd(alpha,D); %random starting point for theta (drawn from prior)
zHat(:,:,1) = randi(K,D,N); %random starting point for z (uniform across topics)
```

2c. Loop through the Gibbs cycle M-1 times (starting at step 2, since step 1 holds the initial values chosen above).

```
for m=2:M
```

Resample beta conditioned on the other variables.
For each beta(k,:), the relevant dependencies are the Dirichlet(eta) prior and the likelihoods of all words currently assigned to topic k (i.e., w(d,n) for z(d,n)==k). These are conjugate, so you just need to count up how many of those words are of each word-type. Matlab's histc function is useful here.
```
for k=1:K %resample beta for each topic k
    counts = histc(w(zHat(:,:,m-1)==k)',1:L); %look at all words assigned to topic k, and count the number of each word-type in 1:L
    betaHat(k,:,m) = drchrnd(eta+counts,1); %add the counts to the prior to get the posterior, and draw one Dirichlet sample for the new betaHat(k,:)
end
```
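The heart of this step is Dirichlet–multinomial conjugacy: the posterior over a topic's word distribution is just Dirichlet(prior + counts). A toy Python/NumPy illustration of that count-then-draw update (the variable names and fake data are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
L_lex = 50                    # lexicon size, matching L above
eta = np.full(L_lex, 0.2)     # Dirichlet prior parameter, as in ones(1,L)/5

# pretend these are the word tokens currently assigned to topic k
words_in_topic = rng.integers(0, L_lex, size=200)

counts = np.bincount(words_in_topic, minlength=L_lex)  # the histc(...,1:L) analogue
post = eta + counts           # conjugacy: posterior is Dirichlet(eta + counts)

g = rng.gamma(post)           # one Dirichlet draw via normalized gammas
beta_k = g / g.sum()
print(beta_k.sum())           # a probability vector over word types
```

Word types with higher counts get proportionally more posterior mass, which is why heavily used topics are pinned down tightly while rarely used ones stay close to the prior.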

Resample theta conditioned on the other variables.
For each theta(d,:), the relevant dependencies are the Dirichlet(alpha) prior and the likelihoods of z(d,n) for all n. These are conjugate, so you just need to count up how many of z(d,:) are assignments to each topic.
```
for d=1:D %resample theta for each document d
    counts = histc(zHat(d,:,m-1),1:K); %count topic assignments for this document
    thetaHat(d,:,m) = drchrnd(alpha+counts,1); %add the counts to the prior to get the posterior, and draw one Dirichlet sample to get the new thetaHat(d,:)
end
```

Resample z conditioned on the other variables.
The possible values for each z(d,n) are 1:K. The relevant dependencies are the prior from theta(d,:) and the likelihood from w(d,n). These can be directly combined using Bayes' rule.
```
for d=1:D, for n=1:N %resample z for each word n in each document d
    zPost = thetaHat(d,:,m).*betaHat(:,w(d,n),m)'/(thetaHat(d,:,m)*betaHat(:,w(d,n),m)); %plain Bayes' rule: prior on z(d,n) is theta(d,:); likelihood is beta(:,w(d,n))
    zHat(d,n,m) = find(rand < cumsum(zPost),1); %sample new zHat(d,n) from its posterior
end, end

end %close the loop over m (the Gibbs cycle)
```
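The z update is ordinary finite Bayes' rule: the prior over topics is theta(d,:), the likelihood of the observed word is beta(:,w(d,n)), and normalizing their product gives the posterior. A toy Python/NumPy version for a single token (the numbers are made up, and indices are 0-based in Python):

```python
import numpy as np

theta_d = np.array([0.5, 0.3, 0.2])     # prior over topics for document d
beta = np.array([[0.7, 0.2, 0.1],       # word distributions, one row per topic
                 [0.1, 0.8, 0.1],
                 [0.2, 0.2, 0.6]])
w_dn = 1                                # observed word type for this token

likelihood = beta[:, w_dn]              # P(w | z = k) for each topic k
z_post = theta_d * likelihood / (theta_d @ likelihood)  # Bayes' rule, normalized
print(z_post)                           # posterior over topics; sums to 1
```

Here the posterior favors topic 1, because it puts far more probability on the observed word than the other topics, outweighing its smaller prior.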

2d. Check how well the model recovers the true latent structure. It turns out that the true posterior is very narrow, so you can just use the final sample of betaHat, thetaHat, and zHat as point estimates. Think of some ways to quantify how well these samples match the true values beta, theta, and z. (You may find it less interesting to write your own code for this, but at the least running my code should be informative.)

```
betaCorr = corr(betaHat(:,:,M)',beta'); %matrix of pairwise correlations between estimated and true topic-word distributions
disp(['Topic-word distributions, mean correlation: ' num2str(mean(max(betaCorr)))]) %average of best correlation for every topic
```

Use the correlation matrix just computed to determine the mapping between generating and recovered topics. This mapping will be needed to measure the recovery of theta and z.

```
map = zeros(K,1); %map from numbering of generating topics to numbering of recovered topics
for k=1:K
    map(k) = find(betaCorr(:,k)==max(betaCorr(:,k))); %number of the recovered topic that corresponds to the kth generating topic
end
```
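The reason this mapping step is needed is that topic labels are only identified up to permutation: the sampler can recover the right topics in a different order. A toy Python/NumPy illustration of the correlation-based matching (the data and names are mine):

```python
import numpy as np

K = 3
beta_true = np.array([[0.60, 0.20, 0.10, 0.10],   # generating topic-word distributions
                      [0.10, 0.70, 0.10, 0.10],
                      [0.10, 0.10, 0.15, 0.65]])
perm = np.array([2, 0, 1])     # pretend the sampler found the topics in this order
beta_hat = beta_true[perm]     # "recovered" topics, identical up to relabeling

beta_corr = np.corrcoef(beta_hat, beta_true)[:K, K:]  # corr(recovered i, generating j)
mapping = beta_corr.argmax(axis=0)   # recovered topic best matching each generating topic
print(mapping)                       # [1 2 0], the inverse permutation of perm
```

With real samples the matched correlations won't be exactly 1, but as long as each recovered topic correlates most strongly with its true counterpart, the argmax recovers the permutation.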

Calculate the correlation between the generating and recovered topic distributions for each topic.

```
thetaCorr = diag(corr(thetaHat(:,map,M)',theta')); %reorder topics in recovered document-topic distributions and correlate them with generating distributions
disp(['Document-topic distributions, mean correlation: ' num2str(mean(thetaCorr))]) %average correlation for every document
```

Calculate the proportion of correctly recovered word-topic assignments (z).

```
disp(['Word-topic assignments, proportion correct: ' num2str(mean(mean(zHat(:,:,M)==map(z))))])
```