[pymvpa] clf generalization over a different set of classes
Matthias Ekman
Matthias.Ekman at nf.mpg.de
Tue Aug 4 11:21:02 UTC 2009
Hi,
I just want to check whether a classifier trained on two classes A & B could
also discriminate between two different classes C & D,
e.g. "is a classifier trained on face vs. house samples also able to make
correct predictions for cat vs. scissors?"
Here is my code:
from mvpa.suite import *

attr = SampleAttributes(os.path.join(pymvpa_dataroot, 'attributes.txt'))
ds = NiftiDataset(samples=os.path.join(pymvpa_dataroot, 'bold.nii.gz'),
                  labels=attr.labels,
                  chunks=attr.chunks,
                  mask=os.path.join(pymvpa_dataroot, 'mask.nii.gz'))
detrend(ds, perchunk=True, model='regress', polyord=2)
zscore(ds, perchunk=True, targetdtype='float64', baselinelabels=[0])

# train ds: face vs house
ds1 = ds.select(labels=[1, 2])
# predict ds: cat vs scissors
ds2 = ds.select(labels=[4, 5])

# no. of samples in each ds
s_ds1 = len(ds1.samples)
s_ds2 = len(ds2.samples)

# check for equal no. of samples
if s_ds1 < s_ds2:
    ds2 = ds2.getRandomSamples(s_ds1 / 2)
elif s_ds1 > s_ds2:
    ds1 = ds1.getRandomSamples(s_ds2 / 2)
elif s_ds1 == s_ds2:
    print ' > equal no. of samples'

# just check the mean of the two classes for each ds
for i in ds1.uniquelabels:
    ds_tmp = ds1.select(labels=[i])
    print ' > mean of ds1 with label', i, N.mean(ds_tmp.samples)
for i in ds2.uniquelabels:
    ds_tmp = ds2.select(labels=[i])
    print ' > mean of ds2 with label', i, N.mean(ds_tmp.samples)

tmp_labels = ds2.labels.copy()
# fake labels: map cat/scissors onto the house/face label values
new_labels = tmp_labels.copy()
for l in xrange(tmp_labels.shape[0]):
    if tmp_labels[l] == 4.0:
        new_labels[l] = 2.0
    elif tmp_labels[l] == 5.0:
        new_labels[l] = 1.0
# assign new labels to predict ds
ds2.labels = new_labels

# setup clf
clf = LinearNuSVMC(nu=0.5, probability=0)
# setup validation procedure
terr = TransferError(clf)
terr.states.enable('confusion')
terr(ds1, ds1)
error = terr(ds1, ds2)
print terr.confusion
This shows an accuracy of about 40% ... however, changing the assignment
of the faked labels to:

for l in xrange(tmp_labels.shape[0]):
    if tmp_labels[l] == 4.0:
        new_labels[l] = 1.0  # instead of = 2.0
    elif tmp_labels[l] == 5.0:
        new_labels[l] = 2.0  # instead of = 1.0

leads to an accuracy of about 60%. I am not sure if this makes sense,
or if I missed something here. To make a statement like "a clf trained
on labels A/B is also able to discriminate between classes C/D", would you
suggest running both analyses above and calculating the mean of both
prediction errors... or how is this "usually" done?
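
To make that "mean of both prediction errors" idea concrete, here is a minimal
sketch of what I have in mind. It reuses ds1, ds2, tmp_labels and terr from the
code above, and assumes that TransferError returns the misclassification rate
(so accuracy = 1 - error):

# evaluate both possible assignments of cat/scissors onto the
# house/face label values and average the resulting accuracies
mappings = [{4.0: 2.0, 5.0: 1.0},   # cat -> house, scissors -> face
            {4.0: 1.0, 5.0: 2.0}]   # cat -> face,  scissors -> house
accuracies = []
for mapping in mappings:
    relabeled = tmp_labels.copy()
    for old, new in mapping.iteritems():
        relabeled[tmp_labels == old] = new
    ds2.labels = relabeled
    # same call as above, just repeated for each label assignment;
    # assumes terr() returns the error rate, i.e. 1 - accuracy
    err = terr(ds1, ds2)
    accuracies.append(1.0 - err)
print ' > accuracy per mapping:', accuracies
print ' > mean accuracy:', N.mean(accuracies)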
Best regards,
Matthias