Week 8 Homework

Template code is here.

1. Run backpropagation on the XOR problem (xorbackprop.m), using one hidden layer with 2 units. You'll see that it learns the task correctly sometimes and fails other times. Look at the two hidden features it learns, meaning the output activations a{1} for the four possible test stimuli. Try to find a pattern for what these features look like when the model succeeds versus when it fails. Keep in mind that the model can solve the task only if the correct output can be expressed as a linear combination of these features (via the final layer of weights).
Option 1: Simulate the model on all four test stimuli, using the final weights W (i.e., at the end of the simulation) and feedforward.m.
Create a matrix for holding the feature values for all four stimuli
```
features = zeros(2,4); %feature values; feature x stimulus
```
Loop through the test stimuli
```
for i=1:4
    stim = testStim(:,i);
```
Calculate network activation using the current stimulus
```
    [v,a] = feedforward(stim,W); %get network activations; W is weights resulting from xorbackprop.m simulation
```
Get the output activations of the hidden units
```
end
```
Option 2: Calculate the feature activations directly, using the final weights.
Augment the testStim matrix with a constant pseudoinput for the bias terms
```
inputAct = [testStim; 1,1,1,1]; %matrix of test stimuli, with a third pseudo-cue for the bias terms
```
Use the augmented stimulus matrix and the first layer of weights to get the input activations of the hidden layer
```
hiddenInput = W{1}*inputAct; %input activation at layer 1 for all test stimuli; unit x stimulus
```
Use the tanh transfer function to get output activations for the hidden layer (these are the feature values)
```
features = tanh(hiddenInput);
```
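One way to test the linear-combination condition from the problem statement is a least-squares fit from the features to the XOR targets. The sketch below is illustrative only. It assumes 0/1 coding for the cues and targets and a linear output unit with a bias term, and the two first-layer weight matrices (W1good and W1bad) are made-up values, not results from xorbackprop.m. A residual near zero means the targets can be written as a linear combination of the two features plus a bias.

```matlab
% Assumed 0/1 coding of the four XOR test stimuli (cue x stimulus)
testStim = [0 0 1 1; 0 1 0 1];
target = [0 1 1 0];                 % XOR of the two cues (assumed coding)
inputAct = [testStim; ones(1,4)];   % append the bias pseudo-cue

% Two hypothetical first-layer weight matrices: hidden unit x [cue1 cue2 bias]
W1good = [2 2 -1; 2 2 -3];  % the two hidden units differ
W1bad  = [2 2 -1; 2 2 -1];  % both hidden units compute the same feature
Wlist = {W1good, W1bad};

resid = zeros(1,2);         % fit error for each weight matrix
for k = 1:2
    features = tanh(Wlist{k}*inputAct);  % feature x stimulus
    X = [features; ones(1,4)]';          % stimulus x [feature1 feature2 bias]
    w = pinv(X)*target';                 % least-squares output weights
    resid(k) = norm(X*w - target');      % near zero if targets are linearly expressible
end
disp(resid)
```

To apply the same check to your own simulation, replace the hypothetical weight matrices with W{1} from a successful run and from a failed run, and compare the two residuals.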

2. Simulate the backprop model on some other interesting task. Define a set of stimuli (each a vector of cue values) and corresponding target outputs (either a single number for each stimulus, or a vector). Explore the number of hidden units, learning rate, and number of training trials to see what values lead to successful learning.

If you can't think of an interesting task, train a network to read the numerals on a standard digital clock. Each numeral is composed of seven binary features, which are the inputs. The output could be a single variable taking integer values 0 through 9, or it could be a vector of ten elements with the correct element set to 1 and the others set to 0. I used the latter output option, with 20 hidden units and a learning rate of 0.01, and the network learned the task in around 50,000 trials.

Plan out how you'd modify the model code to model your new task. Create a matrix of all possible stimuli, with a column for each stimulus and a row for each feature, and a corresponding feedback matrix of correct outputs. (Notice that the XOR code didn't work this way. There I simply created a random matrix of cue values, because all cue combinations constituted legitimate stimuli. That's not generally the case.)
```
stimVals = [0 1 1 0 1 1 1 1 1 1; ... %top horizontal line (values for numerals 1,2,3,4,5,6,7,8,9,0)
            1 1 1 1 0 0 1 1 1 1; ... %top-right vertical line
            1 0 1 1 1 1 1 1 1 1; ... %bottom-right vertical line
            0 1 1 0 1 1 0 1 1 1; ... %bottom horizontal line
            0 1 0 0 0 1 0 1 0 1; ... %bottom-left vertical line
            0 0 0 1 1 1 0 1 1 1; ... %top-left vertical line
            0 1 1 1 1 1 0 1 1 0];    %middle horizontal line
teachingVals = eye(10); %for each stimulus, correct response vector has 1 corresponding to entry for that stimulus and zero elsewhere
```
Create a sequence of stimulus IDs, indicating which stimulus will be shown on each trial
```
nstim = size(stimVals,2); %number of stimulus types
stimSeq = randi(nstim,n,1); %random sequence of n stimuli
```
You'll also need to update the line that defines the numbers of units in the input and output layers, to base these numbers on the stimulus and feedback matrices you defined
```
units = [size(stimVals,1) nhidden size(teachingVals,1)]; %number of units for layers 0 through M; layers 0 and M determined by sizes of stimVals and teachingVals, respectively
```
On every trial, use the stimulus ID for this trial, the stimulus matrix, and the feedback matrix to determine the input to the network and the correct output (teaching signal)
```
stim = stimVals(:,stimSeq(i)); %input to network for this trial
T = teachingVals(:,stimSeq(i)); %teaching signal for this trial
```
Use the input and teaching signal to calculate network activation and to learn by back-propagation
```
[v,a] = feedforward(stim,W); %get network activation by forward computation
Delta = backprop(stim,v,a,W,T); %get weight updates by backpropagation
```
Change the code that tests the model's learning, to use a metric appropriate for your new task
```
perf = zeros(n,nstim); %will track model's test performance on all stimulus types after every learning trial (this goes outside the loop over trials)
...
for j=1:nstim %loop through test trials; will test model's performance on every stimulus type
    [v,a] = feedforward(stimVals(:,j),W); %get network activation for stimulus j by forward computation
    perf(i,j) = sum((v{M}-teachingVals(:,j)).^2); %sum squared error: activation at output layer vs. correct response
end
...
plot(perf) %shows one learning curve (error as a function of trials) for each stimulus; perfect learning is indicated by all curves reaching zero
axis([0 n 0 2]) %fix vertical scaling so that convergence to zero is visible
```
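For a task with one-hot targets, such as the numeral task, proportion correct is another possible metric alongside squared error: treat the output element with the largest activation as the network's response. The sketch below is only an illustration of that idea. The outputs matrix is made up (it is not produced by feedforward.m), and one column is deliberately corrupted so that the accuracy is not trivially perfect.

```matlab
teachingVals = eye(10);            % one-hot targets; output unit x stimulus
outputs = 0.8*eye(10) + 0.1;       % hypothetical output activations; output unit x stimulus
outputs(:,3) = outputs(:,8);       % pretend stimulus 3 is misread as stimulus 8

[~,pred] = max(outputs,[],1);          % network's response: index of the largest output for each stimulus
[~,correct] = max(teachingVals,[],1);  % index of the 1 in each teaching vector
accuracy = mean(pred == correct);      % proportion of stimuli classified correctly
disp(accuracy)
```

In a real simulation, each column of outputs would instead hold the output-layer activation returned by the forward computation for the corresponding stimulus.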

3. Change the model to use rectified linear units, meaning f_act(v) = max{v,0}. Test the model on either XOR or your new task, and see whether it performs better or worse than with the tanh activation function.
Change feedforward.m to use the rectified linear function to translate each node's input to its output
```
a{m} = zeros(length(v{m}),1); %initialize output vector for this layer
for i=1:length(v{m})
    a{m}(i) = max(v{m}(i),0); %output for node i at layer m, from transfer function
end
```
Alternatively, you can do it in one line:
```
a{m} = max([v{m},zeros(length(v{m}),1)],[],2); %output at layer m, from transfer function
```
Change backprop.m to use the derivative of the rectified-linear function in place of the derivative of tanh. The derivative of f_act is 1 when v > 0, and 0 when v < 0 (hint: this means the derivative can be written using a Boolean expression)
```
d{m} = (v{m}>0).*(W{m+1}(:,1:end-1)'*d{m+1}); %error derivative for this layer's input: zero if v<0, otherwise equal to error derivative for node's output
```
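The rectified-linear function and its Boolean derivative can be checked on a small example before editing feedforward.m and backprop.m. The sketch below uses made-up values throughout: v, Wnext, and dNext are hypothetical stand-ins for v{m}, W{m+1}, and d{m+1}, and it assumes, as the snippet above does, that the last column of the weight matrix holds the bias weights.

```matlab
v = [-1; 0.5; 2];                  % hypothetical input activations for 3 hidden units

% Forward pass: loop version and vectorized version give the same output
aLoop = zeros(length(v),1);
for i = 1:length(v)
    aLoop(i) = max(v(i),0);
end
aVec = max(v,0);

% Derivative as a Boolean expression, compared with a central finite difference
deriv = double(v>0);
h = 1e-6;
numDeriv = (max(v+h,0) - max(v-h,0))/(2*h);

% Backward pass for one layer, with hypothetical next-layer weights and deltas
Wnext = [1 -2 0.5 0.1; 0.3 0.4 -1 0.2];   % 2 output units x [3 hidden units, bias]
dNext = [0.5; -1];                         % error derivatives at the next layer
d = (v>0).*(Wnext(:,1:end-1)'*dNext);      % zero wherever the unit's input is not positive
disp([aVec deriv numDeriv d])
```

The finite-difference comparison only holds away from v = 0, where the rectified-linear function has no derivative; the Boolean expression returns 0 there.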