[pymvpa] cross-validation
J.A. Etzel
J.A.Etzel at med.umcg.nl
Tue Mar 16 13:01:33 UTC 2010
Interesting discussion, and I think this is one of those topics that seems
simple until you start working with real imaging data!
A few comments:
- "balancing" the number of examples in the training data (i.e. so equal
numbers of examples in each of the classes you are classifying) is
critical, as Yaroslav already mentioned. My usual practice is to
randomly subset the examples in the bigger class to achieve balance,
then do cross-validation on the balanced data set. This of course means
that some examples were not included, so I usually repeat the entire
process 10 times to get an idea of the stability of the performance. I
generally hope to see that the (averaged-cross-validation-folds)
accuracies of the "repeats" are fairly close, such as between 5% or so
of each other. I use the average of these averages as the actual
performance. This increases the number of tests that are performed
rather drastically, though, and requires careful planning for
significance testing (permutation tests should be performed using
precisely the same structure as the true labeling).
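Something along these lines, written in generic numpy/scikit-learn terms
rather than with PyMVPA's own dataset/splitter machinery (the toy data,
the fold count, and the linear SVM are just placeholders to show the logic):

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# toy data standing in for one subject: 40 examples of class 0,
# 24 examples of class 1, 50 features each
X = rng.normal(size=(64, 50))
y = np.array([0] * 40 + [1] * 24)

n_repeats = 10
repeat_means = []
for _ in range(n_repeats):
    # randomly subset the bigger class down to the size of the smaller one
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    n_keep = min(len(idx0), len(idx1))
    keep = np.concatenate([rng.choice(idx0, n_keep, replace=False),
                           rng.choice(idx1, n_keep, replace=False)])
    # cross-validate on the balanced subset, average across the folds
    scores = cross_val_score(LinearSVC(), X[keep], y[keep], cv=4)
    repeat_means.append(scores.mean())

repeat_means = np.array(repeat_means)
# the spread across repeats shows how stable the estimate is; the mean
# of these per-repeat averages is what I report as the performance
print("per-repeat accuracies:", np.round(repeat_means, 3))
print("overall accuracy: %.3f, spread: %.3f"
      % (repeat_means.mean(), repeat_means.max() - repeat_means.min()))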
- The consensus seems to be that partitioning on the runs is "safest"
for fMRI data, since it avoids potentially inflating the performance.
The inflation can occur because volumes from within the same run will
probably be more similar to each other than to volumes from other runs,
because of scanner drift, subject fatigue, subject movement,
physiological changes, etc.
But sometimes it's not very practical. For example, I have a data set
with 16 examples in each of two classes, but divided into only two runs.
Classification accuracy is above chance when partitioning on the runs
(i.e. 8 examples of each class in the training data), but not great. If
I partition by leaving out two examples of each class instead (at
random; I repeated the leave-two-out process as well), performance
improves by 20-30%. But I want to be very sure that the increase in
performance is due to the increase in the amount of training data, not
because of mixing examples from the runs.
Elements of experiment design come into play here: you need to be sure
that there were sufficient breaks between the experimental tasks, proper
preprocessing was done, etc. But in some experiments it does seem
reasonable to think that partitioning on something other than the runs
is valid.
What I do in this case is compare the performance of partitioning on the
runs with the performance on random "runs". In other words, I permute
the run labels within each class and perform partitioning on the runs
again. If this procedure changes the classification accuracy, I know
that there is a problem, and I look at the design, preprocessing,
temporal compression, etc. But if performance with these random "runs"
is about the same as with partitioning on the actual runs, I feel safer
moving to a non-run-partitioning scheme. A rough sketch of this check
follows.
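The check looks something like this, again in generic numpy/scikit-learn
terms rather than PyMVPA's chunk/splitter machinery (the toy data, the
two-run layout, and the linear SVM are only placeholders for the logic):

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(1)

# toy data: 16 examples per class, 50 features, split across two runs
X = rng.normal(size=(32, 50))
y = np.tile([0, 1], 16)        # class labels (alternating within runs)
runs = np.repeat([1, 2], 16)   # true run ("chunk") labels: 8 per class per run

def run_partitioned_accuracy(X, y, run_labels):
    # leave-one-run-out cross-validation for a given run labeling
    scores = cross_val_score(LinearSVC(), X, y, groups=run_labels,
                             cv=LeaveOneGroupOut())
    return scores.mean()

true_acc = run_partitioned_accuracy(X, y, runs)

# permute the run labels *within each class*, so class balance per
# "run" is preserved, then apply exactly the same partitioning scheme
fake_accs = []
for _ in range(10):
    fake_runs = runs.copy()
    for cls in np.unique(y):
        mask = (y == cls)
        fake_runs[mask] = rng.permutation(runs[mask])
    fake_accs.append(run_partitioned_accuracy(X, y, fake_runs))

print("accuracy, real runs:   %.3f" % true_acc)
print("accuracy, random runs: %.3f" % np.mean(fake_accs))
# if the random-"runs" accuracy is clearly higher, run structure is
# leaking into the folds and non-run partitioning is not safe here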
hope this helps,
Jo