[pymvpa] cross-validation
J.A. Etzel
J.A.Etzel at med.umcg.nl
Tue Mar 16 13:01:33 UTC 2010
Interesting discussion, and I think this is one of those topics that seems
simple until you start working with real imaging data!
A few comments:
- "balancing" the number of examples in the training data (i.e. so equal
numbers of examples in each of the classes you are classifying) is
critical, as Yaroslav already mentioned. My usual practice is to
randomly subset the examples in the bigger class to achieve balance,
then do cross-validation on the balanced data set. This of course means
that some examples were not included, so I usually repeat the entire
process 10 times to get an idea of the stability of the performance. I
generally hope to see that the (averaged-cross-validation-folds)
accuracies of the "repeats" are fairly close, such as between 5% or so
of each other. I use the average of these averages as the actual
performance. This increases the number of tests that are performed
rather drastically, though, and requires careful planning for
significance testing (permutation tests should be performed using
precisely the same structure as the true labeling).
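Something along these lines, written in generic numpy/scikit-learn terms
rather than with PyMVPA's own dataset/splitter machinery (the toy data,
the fold count, and the linear SVM are just placeholders to show the logic):

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# toy data standing in for one subject: 40 examples of class 0,
# 24 examples of class 1, 50 features each
X = rng.normal(size=(64, 50))
y = np.array([0] * 40 + [1] * 24)

n_repeats = 10
repeat_means = []
for _ in range(n_repeats):
    # randomly subset the bigger class down to the size of the smaller one
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    n_keep = min(len(idx0), len(idx1))
    keep = np.concatenate([rng.choice(idx0, n_keep, replace=False),
                           rng.choice(idx1, n_keep, replace=False)])
    # cross-validate on the balanced subset, average across the folds
    scores = cross_val_score(LinearSVC(), X[keep], y[keep], cv=4)
    repeat_means.append(scores.mean())

repeat_means = np.array(repeat_means)
# the spread across repeats shows how stable the estimate is; the mean
# of these per-repeat averages is what I report as the performance
print("per-repeat accuracies:", np.round(repeat_means, 3))
print("overall accuracy: %.3f, spread: %.3f"
      % (repeat_means.mean(), repeat_means.max() - repeat_means.min()))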
- The consensus seems to be that partitioning on the runs is "safest"
for fMRI data, since it avoids potentially inflating the performance.
The inflation can occur because volumes from within the same run will
probably be more similar to each other than to volumes from other runs,
because of scanner drift, subject fatigue, subject movement,
physiological changes, etc.
But sometimes it's not very practical. For example, I have a data set
with 16 examples in each of two classes, but divided into only two runs.
Classification accuracy is above chance when partitioning on the runs
(i.e. 8 examples of each class in the training data), but not great. If
I partition by leaving out two examples of each class instead (at
random; I repeated the leave-two-out process as well), performance
improves by 20-30%. But I want to be very sure that the increase in
performance is due to the increase in the amount of training data, not
because of mixing examples from the runs.
Elements of experiment design come into play here: you need to be sure
that there were sufficient breaks between the experimental tasks, proper
preprocessing was done, etc. But in some experiments it does seem
reasonable to think that partitioning on something other than the runs
is valid.
What I do in this case is compare the performance of partitioning on the
runs with the performance on random "runs". In other words, I permute
the run labels within each class and perform partitioning on the runs
again. If this procedure changes the classification accuracy, I know
that there is a problem, and I look at the design, preprocessing,
temporal compression, etc. But if performance with these random "runs"
is about the same as with partitioning on the actual runs, I feel safer
moving to a non-run-partitioning scheme. A rough sketch of this check
follows.
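The check looks something like this, again in generic numpy/scikit-learn
terms rather than PyMVPA's chunk/splitter machinery (the toy data, the
two-run layout, and the linear SVM are only placeholders for the logic):

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(1)

# toy data: 16 examples per class, 50 features, split across two runs
X = rng.normal(size=(32, 50))
y = np.tile([0, 1], 16)        # class labels (alternating within runs)
runs = np.repeat([1, 2], 16)   # true run ("chunk") labels: 8 per class per run

def run_partitioned_accuracy(X, y, run_labels):
    # leave-one-run-out cross-validation for a given run labeling
    scores = cross_val_score(LinearSVC(), X, y, groups=run_labels,
                             cv=LeaveOneGroupOut())
    return scores.mean()

true_acc = run_partitioned_accuracy(X, y, runs)

# permute the run labels *within each class*, so class balance per
# "run" is preserved, then apply exactly the same partitioning scheme
fake_accs = []
for _ in range(10):
    fake_runs = runs.copy()
    for cls in np.unique(y):
        mask = (y == cls)
        fake_runs[mask] = rng.permutation(runs[mask])
    fake_accs.append(run_partitioned_accuracy(X, y, fake_runs))

print("accuracy, real runs:   %.3f" % true_acc)
print("accuracy, random runs: %.3f" % np.mean(fake_accs))
# if the random-"runs" accuracy is clearly higher, run structure is
# leaking into the folds and non-run partitioning is not safe here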
hope this helps,
Jo