Week 8 Homework

Template code is here.

1. Run backpropagation on the XOR problem (xorbackprop.m), using one hidden layer with 2 units. You'll see that it learns the task correctly sometimes and fails other times. Look at the two hidden features it learns, meaning the output activations a{1} for the four possible test stimuli. Try to find a pattern for what these features look like when the model succeeds versus when it fails. Keep in mind that the model can solve the task only if the correct output can be expressed as a linear combination of these features (via the final layer of weights).
Option 1: Simulate the model on all four test stimuli, using the final weights W (i.e. at the end of the simulation) and feedforward.m
Create a matrix for holding the feature values for all four stimuli
features = zeros(2,4); %feature values; feature x stimulus

Loop through the test stimuli
for i=1:4
stim = testStim(:,i);

Calculate network activation using the current stimulus
[v,a] = feedforward(stim,W); %get network activations; W is weights resulting from xorbackprop.m simulation
Get the output activations of the hidden units
features(:,i) = a{1}; %a{1} holds the output activations of the hidden layer for this stimulus
end
Option 2: Calculate the feature activations directly, using the final weights.
Augment the testStim matrix with a constant pseudoinput for the bias terms
inputAct = [testStim; 1,1,1,1]; %matrix of test stimuli, with a third pseudo-cue for the bias terms
Use the augmented stimulus matrix and the first layer of weights to get the input activations of the hidden layer
hiddenInput = W{1}*inputAct; %input activation at layer 1 for all test stimuli; unit x stimulus
Use the tanh transfer function to get output activations for the hidden layer (these are the feature values)
features = tanh(hiddenInput);
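To connect this to the linear-combination point above: once you have the features, you can check directly whether the correct XOR outputs can be written as a weighted sum of the two features plus a bias. A minimal sketch, assuming the XOR targets for the four test stimuli are 0 1 1 0 in the column order of testStim (adjust to match how xorbackprop.m defines them):
testT = [0 1 1 0]; %assumed XOR targets, one per test stimulus
featAug = [features; ones(1,4)]; %features augmented with a bias pseudo-feature
wOut = testT/featAug; %least-squares weights solving wOut*featAug ~ testT
disp(wOut*featAug) %close to testT only when the targets are (nearly) a linear combination of the features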

2. Simulate the backprop model on some other interesting task. Define a set of stimuli (each a vector of cue values) and corresponding target outputs (either a single number for each stimulus, or a vector). Explore the number of hidden units, learning rate, and number of training trials to see what values lead to successful learning.

If you can't think of an interesting task, train a network to read the numerals on a standard digital clock. Each numeral is composed of seven binary features (the seven segments), which are the inputs. The output could be a single variable taking integer values 0 through 9, or it could be a vector of ten elements with the correct element set to 1 and the others set to 0. I used the latter output option, with 20 hidden units and a learning rate of .01, and the network learned the task in around 50,000 trials.
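In code, those settings would look something like the following (a sketch: nhidden and n are the names used in the snippets below, but the learning-rate variable depends on the template, so alpha here is only a placeholder):
nhidden = 20; %number of hidden units
alpha = .01; %learning rate (placeholder name; use whatever the template calls it)
n = 50000; %number of training trials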

Plan out how you'd modify the model code to model your new task. Create a matrix of all possible stimuli, with a column for each stimulus and a row for each feature, and a corresponding feedback matrix of correct outputs. (Notice that the XOR code didn't work this way: there I simply created a random matrix of cue values, because all cue combinations constituted legitimate stimuli. That's not generally the case.)
stimVals = [0 1 1 0 1 1 1 1 1 1; ... %top horizontal line (values for numerals 1,2,3,4,5,6,7,8,9,0)
            1 1 1 1 0 0 1 1 1 1; ... %top-right vertical line
            1 0 1 1 1 1 1 1 1 1; ... %bottom-right vertical line
            0 1 1 0 1 1 0 1 1 1; ... %bottom horizontal line
            0 1 0 0 0 1 0 1 0 1; ... %bottom-left vertical line
            0 0 0 1 1 1 0 1 1 1; ... %top-left vertical line
            0 1 1 1 1 1 0 1 1 0]; %middle horizontal line
teachingVals = eye(10); %for each stimulus, correct response vector has 1 corresponding to entry for that stimulus and zero elsewhere
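If you instead wanted the single-number output option mentioned above, the teaching matrix would be a row vector (a sketch; the columns must follow the same 1,2,...,9,0 order as stimVals, and the output layer would then have a single unit):
teachingVals = [1 2 3 4 5 6 7 8 9 0]; %alternative: one numeric target per stimulus, same column order as stimVals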

Create a sequence of stimulus IDs, indicating which stimulus will be shown on each trial
nstim = size(stimVals,2); %number of stimulus types
stimSeq = randi(nstim,n,1); %random sequence of n stimuli

You'll also need to update the line that defines the numbers of units in the input and output layers, to base these numbers on the stimulus and feedback matrices you defined
units = [size(stimVals,1) nhidden size(teachingVals,1)]; %number of units for layers 0 through M; layers 0 and M determined by sizes of stimVals and teachingVals, respectively
On every trial, use the stimulus ID for this trial, the stimulus matrix, and the feedback matrix to determine the input to the network and the correct output (teaching signal)
stim = stimVals(:,stimSeq(i)); %input to network for this trial
T = teachingVals(:,stimSeq(i)); %teaching signal for this trial

Use the input and teaching signal to calculate network activation and to learn by back-propagation
[v,a] = feedforward(stim,W); %get network activation by forward computation
Delta = backprop(stim,v,a,W,T); %get weight updates by backpropagation

Change the code that tests the model's learning, to use a metric appropriate for your new task
perf = zeros(n,nstim); %will track model's test performance on all stimulus types after every learning trial (this goes outside the loop over trials)
...
for j=1:nstim %loop through test trials; will test model's performance on every stimulus type
[v,a] = feedforward(stimVals(:,j),W); %get network activation for stimulus j by forward computation
perf(i,j) = sum((v{M}-teachingVals(:,j)).^2); %sum squared error: activation at output layer vs. correct response
end
...
plot(perf) %shows one learning curve (error as a function of trials) for each stimulus; perfect learning is indicated by all curves reaching zero
axis([0 n 0 2]) %fix vertical scaling so that convergence to zero is visible
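Summed squared error is one reasonable metric, but for a classification task you could also track whether the most active output unit is the correct one. A minimal sketch under the same assumptions as above (v{M} is the output-layer activation and teachingVals is the one-hot coding), placed in the same test loop over j:
acc = zeros(n,nstim); %proportion-correct tracker; like perf, this goes outside the loop over trials
...
[~,guess] = max(v{M}); %index of the most active output unit
acc(i,j) = (guess == find(teachingVals(:,j),1)); %1 if the winning unit matches the correct digit, 0 otherwise
...
plot(mean(acc,2)) %average accuracy across stimulus types, as a function of trials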

3. Change the model to use rectified linear units, meaning f_act(v) = max(v,0). Test the model on either XOR or your new task, and see whether it performs better or worse than with the tanh activation function.
Change feedforward.m to use the rectified linear function to translate each node's input to its output
a{m} = zeros(length(v{m}),1); %initialize output vector for this layer
for i=1:length(v{m})
a{m}(i) = max(v{m}(i),0); %output for node i at layer m, from transfer function
end
Alternatively, you can do it in one line:
a{m} = max([v{m},zeros(length(v{m}),1)],[],2); %output at layer m, from transfer function
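An even shorter equivalent, if you prefer it: MATLAB's max works elementwise when one argument is a scalar, so the whole layer can be rectified at once.
a{m} = max(v{m},0); %elementwise rectification of the layer's input vector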

Change backprop.m to use the derivative of the rectified-linear function in place of the derivative of tanh. The derivative of f_act is 1 when v > 0, and 0 when v < 0 (hint: this means the derivative can be written using a Boolean expression).
d{m} = (v{m}>0).*(W{m+1}(:,1:end-1)'*d{m+1}); %error derivative for this layer's input: zero if v<0, otherwise equal to error derivative for node's output