[pymvpa] Fwd: 10 fold stratified crossvalidation
J.A. Etzel
jetzel at artsci.wustl.edu
Tue Nov 23 22:58:32 UTC 2010
If I understand what you're trying to do (10-fold cross-validation
repeated 10 times), I would not try to do it in one big step, but rather
as 10-fold cross-validation run ten separate times. In other words, set
up a script with a splitter along the lines of asplitter =
NGroupSplitter(ngroups=10, nperlabel='equal') to do one repeat of
10-fold cross-validation.
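A minimal, untested sketch of what one such repeat could look like
(assuming the usual TransferError/CrossValidatedTransferError setup you
use below, and a preloaded 'dataset'):

from mvpa.suite import NGroupSplitter, LinearCSVMC, \
     TransferError, CrossValidatedTransferError

# one repeat of stratified 10-fold cross-validation
asplitter = NGroupSplitter(ngroups=10, nperlabel='equal')
cv = CrossValidatedTransferError(TransferError(LinearCSVMC()),
                                 asplitter,
                                 enable_states=['confusion'])
cv(dataset)   # 'dataset' is your preloaded Dataset
print cv.confusion.stats['ACC']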
Each time you run the script you should get a different answer, if the
randomization works properly, since the partitions (which samples go
into each of the ten training and testing sets) and the omitted samples
(dropped to keep nperlabel equal when the classes aren't balanced to
start with) will vary. So the simplest solution would be to run the
script ten times, recording the values each time.
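Equivalently, you could loop over the repeats within one script,
something like this (again just a sketch, assuming the splitter's
randomization changes on each call):

import numpy as N
from mvpa.suite import NGroupSplitter, LinearCSVMC, \
     TransferError, CrossValidatedTransferError

accuracies = []
for repeat in range(10):
    asplitter = NGroupSplitter(ngroups=10, nperlabel='equal')
    cv = CrossValidatedTransferError(TransferError(LinearCSVMC()),
                                     asplitter,
                                     enable_states=['confusion'])
    cv(dataset)
    accuracies.append(cv.confusion.stats['ACC'])
print N.mean(accuracies)   # overall accuracy estimate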
I generally prefer to set up the partitions, left-out samples (to keep
nperlabel equal), and random seeds before starting the analysis, instead
of the simple run-the-script-ten-times strategy I just described. The
advantage is that the analysis is then reproducible (for example, you
could run the same analysis with two different types of classifiers),
and you can confirm that all the replications and cross-validation
partitionings are unique. Unfortunately, I've never run such an analysis
in pyMVPA; perhaps someone else has and can explain whether it's
straightforward to precalculate the partitions and left-out samples.
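If someone wants to try, one way to at least pin down the randomization
might be to fix NumPy's global seed before each repeat (untested, and it
assumes the splitter draws from NumPy's random number generator):

import numpy as N

seeds = [11, 22, 33, 44, 55, 66, 77, 88, 99, 110]   # arbitrary but fixed
for seed in seeds:
    N.random.seed(seed)
    # ... run one repeat of the 10-fold cross-validation here ...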
good luck,
Jo
On 11/23/2010 3:13 PM, Jakob Scherer wrote:
> Hi all,
> I am working with an EEG dataset containing 674 trials distributed
> across 10 labels/classes:
>
> <Dataset / float32 674 x 63426 uniq: 674 chunks 10 labels labels_mapped>
>
> I read it in from EEGLAB. Since EEGLAB stores one file per condition
> (i.e. label), I read the files in one after the other and concatenate
> them on the fly, so the samples in the final dataset are in fact
> ordered by label (see [ref_1]) and not in the order in which they were
> originally recorded.
>
> I am then running a linear SVM on the dataset. To get a good estimate
> of the prediction power, the cross-validation scheme I would like to
> apply is:
>
> "Ten times ten-fold stratified cross-validation"
>
> i.e.
> 1. split the dataset into ten splits (stratified, i.e. nperlabel='equal');
> 2. train on nine of them and test on the remaining one;
> 3. do this for every subsplit (i.e. leave-one-out);
> 4. repeat the whole thing 10 times (i.e. redo the initial splitting).
>
> My first guess was:
>
> asplitter = NFoldSplitter(cvtype=N.min(dataset.samplesperlabel.values()),
>                           nperlabel='equal',
>                           nrunspersplit=1,
>                           count=10,
>                           strategy='random')
>
> Now one issue is the huge number of chunks (similar to the problem
> Thorsten Kranz mentioned a few days ago in "RFE memory issue").
>
> I then "discovered" the dataset.coarsenChunks() function. I could use
> it to realize step 1 above; however, there should then be some kind of
> shuffling of the chunks_new [ref_2]. With this shuffling of chunks
> [ref_3], I would then run:
>
> ########################
> import numpy as N
> from mvpa.suite import LinearCSVMC, NFoldSplitter, \
>      TransferError, CrossValidatedTransferError
>
> ACC_collector = {}
>
> for repetition_nr in range(10):
>
>     # regroup the 674 trial chunks into 10 chunks, shuffled
>     # (uses the shuffle option proposed in the attached patch)
>     dataset.coarsenChunks(10, shuffle=True)
>
>     aclassifier = LinearCSVMC(C=C)   # C is defined elsewhere
>     asplitter = NFoldSplitter(cvtype=1, nperlabel='equal',
>                               nrunspersplit=1, count=None)
>
>     cv = CrossValidatedTransferError(
>         TransferError(aclassifier),
>         asplitter,
>         harvest_attribs=[
>             'transerror.clf.getSensitivityAnalyzer(force_training=False)()'],
>         enable_states=['confusion', 'harvested', 'transerrors'])
>
>     cv(dataset)
>
>     ACC_collector[repetition_nr] = cv.confusion.stats['ACC']
>
> ACC_estimate = N.mean(ACC_collector.values())
> ########################
>
>
> To make a long story short:
> - for the shuffling, I attached a proposed change to
> mvpa/datasets/miscfx.py (lines marked with "# proposed shuffle"); a
> rough sketch of the idea is below.
> - do you think this is the way to go, or do you see an easier one?
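>
> In rough, hypothetical code (not the actual attached patch, and
> assuming chunks can be reassigned directly on the dataset), the idea
> is something like:
>
> # Hypothetical sketch of the proposed shuffle: assign the original
> # chunks to the new coarse chunks in random order, not in sequence.
> import numpy as N
>
> def coarsenChunksShuffled(dataset, nchunks):
>     chunks = dataset.chunks
>     # randomly permute the original chunk ids before binning them
>     perm = N.random.permutation(N.unique(chunks))
>     chunks_new = N.zeros(len(chunks), dtype='int')
>     for i, c in enumerate(perm):
>         chunks_new[chunks == c] = i % nchunks
>     dataset.chunks = chunks_new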
>
>
> Many thanks in advance,
> jakob
>
>