[pymvpa] Train and test on different classes from a dataset

Fri Feb 1 11:41:14 UTC 2013

> I don't see that permuting the test set labels hurts the inter-sample dependencies

Since the testing set in cross-validation is usually much smaller than the training set, I think that permuting only the testing set is very problematic (even more so if the whole dataset is small). The expected permutation distribution in such a case is rather sparse, with lots of gaps and spikes. An imperfect randomization algorithm adds further problems.
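
To illustrate the sparseness, here is a minimal sketch in plain Python (not PyMVPA code). It assumes a test set of 5 A and 5 B samples, as in the question, and a made-up vector of classifier predictions that stays fixed while only the test labels are shuffled. The function name and the predictions are hypothetical. With balanced labels and an equal number of A and B predictions, the number of correct predictions can only be even, so at most six distinct accuracy values are possible no matter how many permutations are run.

```python
import random
from collections import Counter


def permute_test_labels_only(predictions, labels, n_perm=200, seed=0):
    """Return the number of correct predictions for each shuffle of the test labels.

    The predictions are kept fixed; only the test labels are permuted.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    correct_counts = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # in-place permutation of the test labels
        correct_counts.append(sum(p == y for p, y in zip(predictions, shuffled)))
    return correct_counts


# Hypothetical test set: 5 samples of each class.
labels = ["A"] * 5 + ["B"] * 5
# Made-up classifier output (8 of 10 correct on the unpermuted labels).
predictions = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "A"]

correct = permute_test_labels_only(predictions, labels)
histogram = Counter(correct)

# Print how often each possible accuracy occurred; odd counts never occur.
for n_correct in range(len(labels) + 1):
    accuracy = n_correct / len(labels)
    print(f"{accuracy:.1f}: {histogram.get(n_correct, 0)}")
```

Every second accuracy bin stays empty, and the gaps only close as the test set grows.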

Regards,
-Rawi

----- Original Message -----
> From: Michael Hanke <mih at debian.org>
> To: Development and support of PyMVPA <pkg-exppsy-pymvpa at lists.alioth.debian.org>
> Cc: 
> Sent: Friday, February 1, 2013 10:21 AM
> Subject: Re: [pymvpa] Train and test on different classes from a dataset
> 
> On Thu, Jan 31, 2013 at 02:13:14PM -0600, J.A. Etzel wrote:
>>  Why do you say in the tutorial that "Doing a whole-dataset
>>  permutation is a common mistake ..." ? I don't see that permuting
>>  the test set labels hurts the inter-sample dependencies ... won't I
>>  still have (say) 5 A and 5 B in my test set?
> 
> I am attaching some code and a figure. This is a modified version of
> 
> http://pymvpa.org/examples/permutation_test.html
> 
> I ran 24 permutation analyses for 12 combinations of number of
> chunks/runs/... and SNR. In the figure you can see MC sample histograms
> for all these combinations (always using 200 permutations). The greenish
> bars represent the permutation results from permuting both training and
> testing portion of the data (note that only within chunk permutation was
> done -- although this should have no effect on this data). The blueish
> histogram is the same analysis but only the training set has been
> permuted (I can't think of any good reason why one would only permute the
> testing set -- except for speed ;-).
> 
> The input data is pure noise, plus a bit of univariate signal (according
> to SNR) added to two of three features. In all simulations there are 200
> samples in the dataset, but either grouped in 2, 3 or 5 chunks.
> 
> I am using the SNR parameter in this simulation as a way to increase
> within category similarity. In a real dataset inter-sample similarity
> could have many reasons, of course.
> 
> The dashed line shows the theoretical chance performance at 0.5, the red
> line the empirical performance for the unpermuted dataset.
> 
> Now tell me that it doesn't make a difference what portion of the data
> you permute ;-) Depending on the actual number of chunks and data
> consistency the "permutability" of the dataset varies quite a bit -- but
> this is only reflected in the distributions when the testing portion is
> not permuted as well. For example, look at the upper right (high sample
> similarity, smallish training portion): in a significant portion of all
> permutations the training dataset isn't "properly" permuted at all
> (within-category label swapping); in the other extreme case the labels
> are swapped entirely between categories. This can happen with small
> datasets and large chunks -- however, the green histogram doesn't tell me
> about it, at all.
> 
> [BTW sorry for the poor quality of the figure, but I was hoping to be
> gentle to the listserver. If you run the attached code, it will generate
> a more beautiful one]
> 
> Please point me to any conceptual or technical mistake you can think of
> -- this topic comes up frequently, the more critical feedback the better...
> 
> Cheers,
> 
> Michael
> 
> -- 
> Michael Hanke
> http://mih.voxindeserto.de