Outline of ridge regression with cross-validation optimization of the penalty parameter
load('ridgeData') %contains matrix of predictors (X) and vector of outcomes (Y)
X(:,end+1) = 1; %append a column of 1s for intercept term (constant predictor)
[n,m] = size(X); %number of cases, number of predictors
ntrain = 50; %training set size (placeholder value; you will vary this in the last step)
L = 1; %ridge penalty parameter (placeholder value; you will vary this below)
ranks = randperm(n); %random ordering of the n cases
Xtrain = X(ranks(1:ntrain),:); %take ntrain cases for the training set
Ytrain = Y(ranks(1:ntrain)); %the corresponding training outcomes
Xtest = X(ranks(ntrain+1:n),:); %use the remaining data as the test set
Ytest = Y(ranks(ntrain+1:n)); %the corresponding test outcomes
bhat = (Xtrain'*Xtrain+L*eye(m))\(Xtrain'*Ytrain); %estimated regression coefficients (backslash avoids forming an explicit inverse)
Yhat = Xtest*bhat; %predictions for test cases
rmse = sqrt(mean((Yhat-Ytest).^2)); %root mean squared error
Now put the last four steps (the random split, the fit, the predictions, and the RMSE) into a loop, to get CV performance over multiple train-test splits ("folds"). Calculate the mean RMSE and the standard error of the mean across folds. I found that 100 folds was enough to get adequately small standard errors.
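One way to structure that fold loop, continuing from the variables defined above (the names nfolds, rmse, meanRMSE, and semRMSE are illustrative, not required):

```matlab
nfolds = 100; %number of random train-test splits
rmse = zeros(nfolds,1); %one RMSE per fold
for f = 1:nfolds
    ranks = randperm(n); %draw a fresh random split each fold
    Xtrain = X(ranks(1:ntrain),:);
    Ytrain = Y(ranks(1:ntrain));
    Xtest = X(ranks(ntrain+1:n),:);
    Ytest = Y(ranks(ntrain+1:n));
    bhat = (Xtrain'*Xtrain+L*eye(m))\(Xtrain'*Ytrain);
    Yhat = Xtest*bhat;
    rmse(f) = sqrt(mean((Yhat-Ytest).^2));
end
meanRMSE = mean(rmse); %CV estimate of prediction error
semRMSE = std(rmse)/sqrt(nfolds); %standard error of the mean across folds
```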
Then make an outer loop to test different values of the penalty parameter (L). Plot the mean and standard error as a function of L (see Matlab's errorbar function), and visually find the value of L giving the best performance.
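A sketch of that outer loop over the penalty (the grid Lvals is illustrative; a logarithmic grid is a reasonable starting point since good penalties can span orders of magnitude):

```matlab
Lvals = logspace(-2,3,20); %candidate penalty values (illustrative range)
meanRMSE = zeros(size(Lvals));
semRMSE = zeros(size(Lvals));
for i = 1:numel(Lvals)
    L = Lvals(i);
    %...inner fold loop from the previous step, filling the rmse vector...
    meanRMSE(i) = mean(rmse);
    semRMSE(i) = std(rmse)/sqrt(nfolds);
end
errorbar(log10(Lvals), meanRMSE, semRMSE); %mean +/- SEM vs. log penalty
xlabel('log_{10} L'); ylabel('CV RMSE');
```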
Finally, make an outer loop to vary the size of the training set (ntrain). Try training set sizes both smaller and larger than the number of predictors. How does the profile of RMSE as a function of L compare across different training set sizes?
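The outermost loop over training set size might look like this (the grid ntrainVals is illustrative; note each size must be less than n so a test set remains, and sizes below m make the unpenalized problem underdetermined, which is where the penalty matters most):

```matlab
ntrainVals = round([m/2, m, 2*m, 4*m]); %sizes below and above the number of predictors (illustrative)
hold on
for j = 1:numel(ntrainVals)
    ntrain = ntrainVals(j);
    %...penalty sweep from the previous step, producing meanRMSE and semRMSE...
    errorbar(log10(Lvals), meanRMSE, semRMSE); %one curve per training set size
end
hold off
legend("ntrain = " + ntrainVals);
xlabel('log_{10} L'); ylabel('CV RMSE');
```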