Department of Biomedical Engineering and Computational Science

# Input variable selection with RJMCMC for MLP network, 'demo_rjmcmc'

This program demonstrates the benefits of RJMCMC sampling in input variable selection. It extends the demonstration program 'demo_3class', in which the ARD prior was used to handle irrelevant inputs. Here, in addition to ARD, we also sample between models with different sets of input variables.

The RJMCMC method is an extension of the Metropolis-Hastings algorithm that allows jumps between models whose parameter spaces have different dimensions. In input selection, the dimension of the network's parameter space changes with the number of inputs included in the model. RJMCMC visits the models according to their posterior probability, so after sampling a chain of sufficient length the marginal posterior probability of an input can be approximated from the relative number of visits to the models containing it. The effect of RJMCMC on the given problem can be seen below.
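The visit-counting idea can be illustrated with a toy Metropolis sampler over input subsets. This is a Python sketch, not the MLP sampler from the demo: here the model "posterior" is given in closed form by the made-up `log_post` function, so no within-model parameter sampling or dimension matching is needed, and all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy unnormalized log-posterior over subsets (gamma is a 0/1 inclusion
# vector for 4 inputs): inputs 1 and 2 are rewarded as "relevant", and a
# small complexity penalty discourages including any input.  In the real
# demo this weight is implied by the MLP's marginal likelihood, which
# RJMCMC never needs to evaluate explicitly.
def log_post(gamma):
    relevant = np.array([1.0, 1.0, 0.0, 0.0])
    return 4.0 * gamma @ relevant - 1.0 * gamma.sum()

gamma = np.ones(4)           # start with all inputs included
visits = np.zeros(4)
n_samples = 20000
for _ in range(n_samples):
    prop = gamma.copy()
    j = rng.integers(4)
    prop[j] = 1.0 - prop[j]  # propose adding or removing one input
    if np.log(rng.uniform()) < log_post(prop) - log_post(gamma):
        gamma = prop
    visits += gamma

# Marginal inclusion probability of each input = fraction of visits
# to models that contain it.
print(visits / n_samples)
```

With these toy weights the chain includes each "relevant" input with probability near 0.95 and each irrelevant one near 0.27, mirroring how the demo's importances are read off the chain.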

The data used in the demonstration program is the same as that used by Radford M. Neal in his three-way classification example in Software for Flexible Bayesian Modeling (http://www.cs.toronto.edu/~radford/fbm.software.html). The data consists of 1000 4-dimensional vectors which are classified into three classes. The data is generated by drawing the components of the vector, x1, x2, x3 and x4, uniformly from (0,1). The class of each vector is selected according to the first two components, x1 and x2. After this, Gaussian noise with a standard deviation of 0.1 is added to every component of the vector.
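The generation recipe can be sketched in Python. The exact three-class decision regions are defined in Neal's FBM documentation and are not reproduced here; the rule below is an illustrative placeholder with the same structure (three regions in the (x1, x2) plane), and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Draw the four components uniformly from (0, 1).
x = rng.uniform(0.0, 1.0, size=(n, 4))

# The class depends only on x1 and x2.  Placeholder rule, standing in
# for Neal's actual decision regions.
cls = np.where(x[:, 0] + x[:, 1] < 0.8, 0,
               np.where(x[:, 0] - x[:, 1] > 0.2, 1, 2))

# Add Gaussian noise with standard deviation 0.1 to every component,
# after the class has been selected.
x = x + rng.normal(0.0, 0.1, size=x.shape)

# 1-of-3 coding of the targets, as the MATLAB code builds y below.
y = np.zeros((n, 3))
y[np.arange(n), cls] = 1.0
```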

The data is divided into training and test parts. After the posterior parameters of the network have been sampled, only ~13% of the test cases are misclassified. The posterior importances of the inputs are 1.00 (input 1), 1.00 (input 2), 0.15 (input 3) and 0.17 (input 4).

The code of the demonstration program is shown below.

```
% Load the data; column 5 holds the class label, coded as 1-of-3 targets.
x=load('demos/cdata');
y=repmat(0,size(x,1),3);
y(x(:,5)==0,1) = 1;
y(x(:,5)==1,2) = 1;
y(x(:,5)==2,3) = 1;
x(:,end)=[];

% Divide the data into training and test parts.
xt = x(401:end,:);
x=x(1:400,:);
yt=y(401:end,:);
y=y(1:400,:);

nin=size(x,2);
nhid=8;
nout=size(y,2);
% create MLP with softmax output function ('mlp2c')
net = mlp2('mlp2c', nin, nhid, nout);

% Create a Gaussian multivariate hierarchical prior with ARD.
net=mlp2normp(net, {{repmat(1,1,net.nin) 0.5 0.05 -0.05 1}... % input-hidden weight
                    {1 0.5 0.05} ...                          % bias-hidden
                    {1 -0.5 0.05} ...                         % hidden-output weight
                    {1}})                                     % bias-output

% To create prior for the weights without ARD, comment out the above
% lines and remove the comment marks below.
%net=mlp2normp(net, {{10 0.05 1}...                             % input-hidden weight
%                    {10 0.05 0.5} ...                          % bias-hidden
%                    {10 -0.05 0.5} ...                         % hidden-output
%                    {1}});                                     % bias-output

% Initialize the weights to small random values and set the
% optimization parameters...
w=randn(size(mlp2pak(net)))*0.01;

fe=str2fun('mlp2c_e');
fg=str2fun('mlp2c_g');
n=length(y);
itr=1:floor(0.5*n);     % training set of data for early stop
its=floor(0.5*n)+1:n;   % test set of data for early stop
optes=scges_opt;
optes.display=1;
optes.tolfun=1e-1;
optes.tolx=1e-1;

% ... Start scaled conjugate gradient optimization with early stopping.
[w,fs,vs]=scges(fe, w, optes, fg, net, x(itr,:),y(itr,:), net,x(its,:),y(its,:));
net=mlp2unpak(net,w);

% Set the starting values of the weight hyperparameters to the
% variance of the early-stopped weights.
shypw1 = std(net.w1,0,2)';
shypb1 = std(net.b1);
shypw2 = std(net.w2(:));

net=mlp2normp(net, {{shypw1.^2  0.5 0.05 -0.05 1}... % input-hidden weight
                    {shypb1.^2  0.5 0.05} ...        % bias-hidden
                    {shypw2.^2 -0.5 0.05} ...        % hidden-output weight
                    {1}})                            % bias-output

% First we initialize random seed for Monte
% Carlo sampling and set the sampling options to default.
hmc2('state', sum(100*clock));
opt=mlp2c_mcopt;

%  Here we do the sampling without persistence.
opt.repeat=70;
opt.hmc_opt.steps=10;
opt.gibbs=1;
opt.nsamples=1;
[r,net,rstate]=mlp2c_mc(opt, net, x, y);

% Now that the starting values are found we start the main sampling.
opt.hmc_opt.persistence=1;
opt.hmc_opt.steps=40;
opt.hmc_opt.decay=0.95;
opt.repeat=50;
opt.nsamples=600;
opt.hmc_opt.window=5;

% Here we use RJMCMC for input variable selection. The options for
% RJMCMC are set below.
opt.sample_inputs=1;
opt.rj_opt.pswitch = 1/3;
opt.rj_opt.lpk = log(ones(1,net.nin)/net.nin) % log of uniform prior for k:s
net.inputii = logical(ones(1,net.nin))

[r,net,rstate]=mlp2c_mc(opt, net, x, y, [], [], r, rstate);

% Thin the sample chain.
r=thin(r,50,5);

% Let's test how well the network works for the test data.
forw = mlp2fwds(r,xt);
for i=1:size(forw,3)
  tga(:,:,i) = softmax(forw(:,:,i)')';
end
tga = mean(tga,3);
tt = tga==repmat(max(tga,[],2),1,size(tga,2));

% Let's calculate the percentage of misclassified points.
missed = (sum(sum(abs(tt-yt)))/2)/size(yt,1)*100

% The importance of the input variables
importance = sum(r.inputii)/size(r.inputii,1)
```
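For readers following along outside MATLAB, the final prediction-averaging step can be sketched in Python. The random `forw` array here is only a stand-in for the output of `mlp2fwds` (its shape, 600 test points by 3 classes by 11 thinned samples, is an assumption), and `yt` is a random stand-in for the one-hot test targets.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-ins for the MATLAB variables: latent network outputs per
# posterior sample, and one-hot test targets.
forw = rng.normal(size=(600, 3, 11))
yt = np.zeros((600, 3))
yt[np.arange(600), rng.integers(3, size=600)] = 1.0

# Softmax each posterior sample's outputs, then average the class
# probabilities over the samples (Monte Carlo model averaging).
e = np.exp(forw - forw.max(axis=1, keepdims=True))
tga = (e / e.sum(axis=1, keepdims=True)).mean(axis=2)

# Hard classification: the class with the largest averaged probability.
tt = (tga == tga.max(axis=1, keepdims=True)).astype(float)

# Percentage of misclassified test points, as in the MATLAB code:
# each misclassified point contributes 2 to the absolute difference.
missed = np.abs(tt - yt).sum() / 2 / yt.shape[0] * 100
print(missed)
```

Averaging the softmax probabilities over posterior samples, rather than averaging the latent outputs first, is what makes this a proper posterior predictive estimate.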