[pymvpa] Train and test on different classes from a dataset

J.A. Etzel jetzel at artsci.wustl.edu
Tue Feb 5 14:27:12 UTC 2013


If you are doing some sort of cross-modal or cross-day analysis (e.g. 
training on data collected under one set of conditions and testing on 
data collected under another), I agree that permuting only the training 
data can be quite sensible.
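To make that concrete, here's a rough sketch of the sort of thing I mean -- 
plain numpy/scikit-learn rather than PyMVPA, with hypothetical placeholder 
names (X_train, y_train, X_test, y_test): only the training labels get 
shuffled before each refit, and the test labels stay pristine.

import numpy as np
from sklearn.svm import LinearSVC

# Placeholder data: X_train/y_train from one condition (or day),
# X_test/y_test from the other.
rng = np.random.default_rng(0)
null_acc = []
for _ in range(1000):
    shuffled = rng.permutation(y_train)        # shuffle the training labels only
    clf = LinearSVC().fit(X_train, shuffled)
    null_acc.append(np.mean(clf.predict(X_test) == y_test))  # pristine test labels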

But when we have only a single dataset (e.g. asking whether conditions a 
and b differ in this set of trials, partitioning on the runs), I don't 
fully agree with this characterization:

 > For all these classifiers trained on permuted data we want to know how
 > well they can discriminate our empirical data (aka testing data) -- more
 > precisely the pristine testing data. Because, in general, we do _not_
 > want to know how well a classifier trained on no signal can discriminate
 > any other dataset with no signal (aka permuted testing data).

Perhaps the difference is that I tend to think of each full 
cross-validation (all of its folds together) as the unit that should be 
permuted. For example, suppose I have four runs and I'm partitioning on 
the runs. The true-labeled accuracy is then computed by averaging the 
four fold accuracies (test on the first run, test on the second run, ...).
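Roughly like this -- again a plain numpy/scikit-learn sketch rather than 
PyMVPA code, with X, labels, and runs standing in for the data, condition 
labels, and run assignments:

import numpy as np
from sklearn.svm import LinearSVC

def cv_accuracy(X, labels, runs):
    """Leave-one-run-out cross-validation, averaged over the folds."""
    fold_acc = []
    for test_run in np.unique(runs):
        train, test = runs != test_run, runs == test_run
        clf = LinearSVC().fit(X[train], labels[train])
        fold_acc.append(np.mean(clf.predict(X[test]) == labels[test]))
    return np.mean(fold_acc)

true_acc = cv_accuracy(X, labels, runs)   # average of the four fold accuracies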

My instinct is that the permutations should follow that same pattern: ONE 
value in the permutation distribution should come from permuting the 
labels, doing the full cross-validation, and averaging over the folds 
(the four runs, in this case). In other words, to keep the linking 
between the cross-validation folds in the permutation test (e.g. a 
classifier trained on runs 1-3 will likely be somewhat similar to one 
trained on runs 1, 2, and 4), we need to permute the *entire* dataset at 
once (all four runs), not just the training set (three runs at a time).
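Continuing the sketch above (reusing the hypothetical cv_accuracy), each 
value in the permutation distribution would then come from one relabeling 
of the whole dataset, pushed through the same four-fold cross-validation; 
whether the shuffle should be constrained within runs is a separate 
question I'm leaving aside here.

rng = np.random.default_rng(0)
null_acc = []
for _ in range(1000):
    permuted = rng.permutation(labels)     # ONE relabeling, shared by all four folds
    null_acc.append(cv_accuracy(X, permuted, runs))

# one-sided p-value for the true-labeled accuracy against this null distribution
p = (1 + np.sum(np.array(null_acc) >= true_acc)) / (1 + len(null_acc))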

I hope this argument is somewhat clear; I'd like to make a picture and a 
fuller worked example, but I have to get something else done first. 
Unfortunately I haven't been able to dig into the code you've sent yet 
either; hopefully tomorrow.

Jo


On 2/5/2013 7:33 AM, Francisco Pereira wrote:
> I'm catching up with this long thread and all I can say is I fully
> concur with Michael, in particular:
>
> On Tue, Feb 5, 2013 at 3:11 AM, Michael Hanke <mih at debian.org> wrote:
>>
>> Why are we doing permutation analysis? Because we want to know how
>> likely it is to observe a specific prediction performance on a
>> particular dataset under the H0 hypothesis, i.e. how good can a
>> classifier get at predicting our empirical data when the training
>> did not contain the signal of interest -- aka chance performance.
>
> Permuting the test set might make sense, perhaps, if you wanted to
> make a statement about the result variability over all possible test
> sets of that size if H0 was true.
>
> Francisco
>

-- 
Joset A. Etzel, Ph.D.
Research Analyst
Cognitive Control & Psychopathology Lab
Washington University in St. Louis
http://mvpa.blogspot.com/


