[pymvpa] What is the value of using errorfx when using Cross validation?

gal star gal.star3051 at gmail.com
Wed Feb 25 14:49:16 UTC 2015


Hello,
I am doing k-fold cross-validation on my data as follows:
1. I partition the data myself, setting a training chunk ('0') and a test
chunk ('1').
2. I call clf.train() and then clf.predict().
3. I print the accuracy result and the confusion matrix.
I repeat this k times (by running the attached script k times); a condensed
sketch of one run is below.
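In essence, each run does the following (fds and clf are built exactly as in
the attached script; the boolean chunk indexing is just shorthand for the
list comprehensions used there):

    train = fds[fds.sa.chunks == 0]    # chunk '0' -> training samples
    test = fds[fds.sa.chunks == 1]     # chunk '1' -> test samples

    clf.train(train)
    predictions = clf.predict(test.samples)

    confusion = ConfusionMatrix()
    confusion.add(test.targets, predictions)
    print confusion.percent_correct    # one accuracy value per run/fold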

The reason I'm asking is that the standard deviation of the accuracy
results produced by the CrossValidation class and the standard deviation
of the accuracies obtained the way I described *are different*.

With the CrossValidation class, the standard deviation is smaller.

Why is that? Does it have to do with errorfx?
What is the correct way to build such a scheme?
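For reference, the CrossValidation call I compare against looks roughly like
this (same clf and fds as in the attached script; mean_mismatch_error is, as
far as I understand, the default errorfx, and the 'stats' conditional
attribute holds the pooled confusion matrix):

    # Rough sketch of the CrossValidation-based variant; each fold returns
    # one error value, which I convert back into an accuracy.
    cv = CrossValidation(clf,
                         NFoldPartitioner(attr='chunks'),
                         errorfx=mean_mismatch_error,
                         enable_ca=['stats'])
    cv_results = cv(fds)                            # one error per fold
    accuracies = 1.0 - cv_results.samples.ravel()   # fold-wise accuracies
    print "mean: %f  std: %f" % (numpy.mean(accuracies), numpy.std(accuracies))
    print cv.ca.stats.as_string(summary=True)       # pooled confusion matrix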

I've attached the script I'm using to produce a single-fold accuracy result.

I would appreciate your help,
Gal Star
-------------- next part --------------
from __future__ import division
import sys
import os
from mvpa2.suite import *
import numpy


type = sys.argv[1]
sub = sys.argv[2]
fold = sys.argv[3]
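
# The three command-line arguments above (experiment type, subject id, fold
# index) are only used to build the input paths and in the printouts below;
# note that `type` shadows the Python builtin of the same name.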

#img_name = '4D_scans_conc_remote.nii.gz'
img_name = '4D_scans_conc.nii.gz'
#map_name = 'map.txt.2label_balance.txt'
map_name = 'map.txt'
#map_name = 'map.txt.2label'
source 	 = '/home/gals/converted_data/sub_brik_data/' + type + '/' + sub + '/selfmade_cv/' + fold
#source 	 = '/home/gals/converted_data/sub_brik_data/' + type + '/' + sub + '/rec_rem_test/selfmade_cv/' + fold
#map_source 	 = '/home/gals/converted_data/sub_brik_data/' + type + '/' + sub + '/couples_test/' + fold
#img_source 	 = '/home/gals/converted_data/sub_brik_data/' + type + '/' + sub
#source 	 = '/home/gals/converted_data/sub_brik_data/' + type + '/' + sub + '/2label_output/' + fold
#source  = '/home/gals/converted_data/sub_brik_data/ester/' + type
#source  = '/home/gals/converted_data/sub_brik_data/gal/'+ sub + '/'
#source  = '/home/gals/converted_data/sub_brik_data/gal/Dov/'

print "type: %s" % type
print "sub: %s" % sub
print "fold number: %s" % fold

#########################################################
# Read mvpa sample attributes definition from text file #
#########################################################
attr = SampleAttributes(os.path.join(source, map_name))
print "after sampleAttributes"

fds = fmri_dataset(samples=os.path.join(source, img_name),
                   targets=attr.targets,
                   chunks=attr.chunks,
                   mask='/home/gals/masks/brain_mask.nii.gz')
print "passed fmri dataset"

# remote:
#interesting = numpy.array([l in ['2221','2222','2121','2122','23'] for l in fds.sa.targets])
#interesting = numpy.array([l in ['221','211','23'] for l in fds.sa.targets])
#interesting = numpy.array([l in ['22','21','23'] for l in fds.sa.targets])

# recent:
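# Keep the two conditions of interest plus the '23' baseline (used by the
# commented z-scoring variants below); the baseline is dropped again after
# z-scoring.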
interesting = numpy.array([l in ['2121','2111','23'] for l in fds.sa.targets])
#interesting = numpy.array([l in ['2221','2211','23'] for l in fds.sa.targets])
fds = fds[interesting]

#zscore(fds)
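# Z-score all samples together (chunks_attr=None); the commented alternatives
# normalize per chunk or estimate the parameters from the '23' baseline.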
zscore(fds,chunks_attr=None)
#zscore(fds, param_est=('targets', ['23']))#, chunks_attr='chunks')
#zscore(fds, param_est=('targets', ['3']), chunks_attr='chunks')

# remote:
#interesting = numpy.array([l in ['2221','2222','2121','2122'] for l in fds.sa.targets])

# recent:
#interesting = numpy.array([l in ['2221','2211'] for l in fds.sa.targets])
interesting = numpy.array([l in ['2121','2111'] for l in fds.sa.targets])
#interesting = numpy.array([l in ['22','21'] for l in fds.sa.targets])
#interesting = numpy.array([l in ['221','211'] for l in fds.sa.targets])
fds = fds[interesting]

#fds = fds[:, numpy.all(numpy.isfinite(fds.samples),axis=0)]

sens = SensitivityBasedFeatureSelection(OneWayAnova(), FixedNElementTailSelector(1000, tail='upper', mode='select'), enable_ca=['sensitivity'])
#, postproc=FxMapper('features', lambda x: x /x.max(), attrfx=None)
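# The selection above keeps the 1000 voxels with the highest ANOVA F-scores;
# the FeatureSelectionClassifier below re-estimates it on the training data
# at training time.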

#clf = LinearCSVMC()
#clf = FeatureSelectionClassifier(LinearCSVMC(),sens, enable_ca=['training_stats'])
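# RBF-kernel C-SVM wrapped with the feature selection above; a negative C
# asks PyMVPA to scale the default C derived from the data.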
clf = FeatureSelectionClassifier(RbfCSVMC(C=-1.0), sens, enable_ca=['training_stats'])
#clf = FeatureSelectionClassifier(SVM(kernel=PolyLSKernel()),sens, enable_ca=['training_stats'])

nfold = NFoldPartitioner(attr='chunks')
confusion = ConfusionMatrix()
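# Note: the partitioner above is not used further down; the train/test split
# is done by hand instead (chunk 0 -> training set, chunk 1 -> test set).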

int_train = numpy.array([l in [0] for l in fds.sa.chunks])
int_test = numpy.array([l in [1] for l in fds.sa.chunks])
train = fds[int_train]
test = fds[int_test]

print "training samples: %d" % len(train)
print "test samples: %d" % len(test)
clf.train(train)
predictions = clf.predict(test.samples)
confusion.add(test.targets, predictions)

print confusion.as_string(summary=True)
print "Total accuracy result: %.2f%%" % confusion.percent_correct

print clf.ca.training_stats.as_string()

