[pymvpa] Fwd: 10 fold stratified crossvalidation

Jakob Scherer jakob.scherer at gmail.com
Wed Nov 24 02:46:50 UTC 2010


thank you Jo,
i ran it with the NGroupSplitter you proposed, and it went fine (after
having realized that i just have to permute the chunks randomly, i.e.
chunks = N.random.permutation(dataset.nsamples)).
i'm thus still using the <Dataset / float32 674 x 63426 uniq: 674
chunks 10 labels labels_mapped> dataset (1 sample per chunk), so the
proposed change to miscfx.py is probably obsolete.
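
for the record, putting the pieces together, the whole thing boils down
to something like the sketch below (against the 0.4-series API used in
this thread; samples and labels stand in for the arrays i read from the
eeglab files, and i pass the permuted chunks in when constructing the
dataset):

import numpy as N
from mvpa.suite import Dataset, NGroupSplitter, LinearCSVMC, \
     TransferError, CrossValidatedTransferError

# one chunk per sample, assigned in random order, so that NGroupSplitter
# does not produce groups that are ordered by label
chunks = N.random.permutation(samples.shape[0])
dataset = Dataset(samples=samples, labels=labels, chunks=chunks)

# one repeat of stratified 10-fold cross-validation, as Jo suggested
asplitter = NGroupSplitter(ngroups=10, nperlabel='equal')
cv = CrossValidatedTransferError(TransferError(LinearCSVMC()),
                                 asplitter, enable_states=['confusion'])
cv(dataset)
print cv.confusion.stats['ACC']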

thanks for the hint about systematically running comparisons too. i
found that the pymvpa dev branches (0.5-0.6) have an example file
/doc/examples/nested_cv.py, which introduces an NFoldPartitioner.
i'll have a look at that, maybe it simplifies the procedure.
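
judging from the docs in that branch, the repeated, balanced 10-fold
scheme would look roughly like the sketch below. note this is untested:
the names (NFoldPartitioner, Balancer, ChainNode, CrossValidation, the
'targets' attribute) are taken from the newer partitioner-based API, the
package may still import as plain mvpa rather than mvpa2 in those dev
branches, and ds stands for the dataset converted to the new format:

import numpy as np
from mvpa2.suite import (NFoldPartitioner, Balancer, ChainNode,
                         CrossValidation, LinearCSVMC, mean_mismatch_error)

accs = []
for rep in range(10):
    # assign 10 random, roughly equal-sized folds for this repetition
    ds.sa['folds'] = np.random.permutation(ds.nsamples) % 10

    # partition on the fold attribute, then equalize the number of samples
    # per target within each partition (the nperlabel='equal' equivalent)
    partitioner = ChainNode([NFoldPartitioner(attr='folds'),
                             Balancer(attr='targets', count=1,
                                      limit='partitions',
                                      apply_selection=True)],
                            space='partitions')
    cv = CrossValidation(LinearCSVMC(), partitioner,
                         errorfx=mean_mismatch_error)
    err = cv(ds)                           # one error value per fold
    accs.append(1.0 - np.mean(err.samples))

print np.mean(accs)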

well, thanks again for that valuable hint,
jakob



On Tue, Nov 23, 2010 at 11:58 PM, J.A. Etzel <jetzel at artsci.wustl.edu> wrote:
> If I understand what you're trying to do (10-fold cross-validation repeated
> 10 times) I would not try to do it in one big step but rather as 10-fold
> cross-validation ten different times. In other words, set up a script with a
> splitter something like asplitter = NGroupSplitter(ngroups=10,
> nperlabel='equal') to do one repeat of 10-fold cross-validation.
>
> Each time you run the script you should get a different answer, if the
> randomization works properly, since the partitions (which samples to put
> into each of the ten training and testing sets) and omitted samples (to keep
> nperlabel equal if the classes aren't balanced to start with) will vary.
> So the simplest solution could be to run the script ten times, recording
> the values each time.
>
> I generally prefer to set up the partitions, left-out samples (to have
> nperlabel equal), and random seeds before starting the analysis, instead of
> the simple strategy of running the script 10 times that I just described.
> The advantage is that the analysis is then reproducible (for example you
> could run the same analysis with two different types of classifiers) and you
> can confirm that all the replications and cross-validation partitionings are
> unique. Unfortunately, I've never run such an analysis in PyMVPA; perhaps
> someone else has and can explain whether it's straightforward to precalculate
> the partitions and left-out samples.
>
> good luck,
> Jo
>
>
> On 11/23/2010 3:13 PM, Jakob Scherer wrote:
>>
>> Hi all,
>> I am working with an EEG dataset containing 674 trials distributed
>> across 10 labels/classes.
>>
>> <Dataset / float32 674 x 63426 uniq: 674 chunks 10 labels labels_mapped>
>>
>> that i read in from EEGLAB. since EEGLAB stores one file per
>> condition (i.e. label), i read them in one after the other and
>> concatenate them on the fly.
>> so the samples in the final dataset are in fact ordered by label (see
>> [ref_1]), and not in the order they were originally recorded.
>>
>> I am then running a linear SVM on the dataset. In order to get a good
>> estimate of the prediction power, the cross-validation that i would
>> like to apply is:
>>
>> "Ten times ten-fold stratified cross-validation"
>>
>> i.e.
>> 1. split up the dataset into ten splits (which are stratified, i.e.
>> nperlabel='equal').
>> 2. train on nine of them and test on the remaining one.
>> 3. do this for every subsplit (i.e. leave-one-out).
>> 4. repeat the whole thing 10 times (i.e. redo the initial splitting).
>>
>> my first guess was
>>
>> asplitter=NFoldSplitter(cvtype=N.min(dataset.samplesperlabel.values()), \
>>        nperlabel='equal', \
>>        nrunspersplit=1, \
>>        count=10, \
>>        strategy='random')
>>
>> now one issue is the huge number of chunks (similar to the problem
>> Thorsten Kranz mentioned a few days ago in "RFE memory issue").
>>
>> i then "discovered" the dataset.coarsenChunks() function. I could use
>> this function to realize step 1 above. however, there should then be
>> some kind of shuffling of the chunks_new [ref_2]. with this shuffling
>> of chunks [ref_3], i would then run:
>>
>> ########################
>> ACC_collector = {}
>>
>> for repetition_nr in range(10):
>>
>>     # re-group the 674 single-sample chunks into 10 chunks; shuffle=True
>>     # relies on the change to miscfx.py proposed below
>>     dataset.coarsenChunks(10, shuffle=True)
>>
>>     aclassifier = LinearCSVMC(C=C)
>>     asplitter = NFoldSplitter(cvtype=1, nperlabel='equal',
>>                               nrunspersplit=1, count=None)
>>
>>     cv = CrossValidatedTransferError(
>>         TransferError(aclassifier),
>>         asplitter,
>>         harvest_attribs=[
>>             'transerror.clf.getSensitivityAnalyzer(force_training=False)()'],
>>         enable_states=['confusion', 'harvested', 'transerrors'])
>>
>>     cv(dataset)
>>
>>     ACC_collector[repetition_nr] = cv.confusion.stats['ACC']
>>
>> ACC_estimate = N.mean(ACC_collector.values())
>> ########################
>>
>>
>> To make a long story short:
>> - for the shuffling, i attached a proposed change to
>> mvpa/datasets/miscfx.py (lines marked by "# proposed shuffle")
>> - do you think this is the way to go, or do you see an easier one?
>>
>>
>> Many thanks in advance,
>> jakob
>>
>>
>
> _______________________________________________
> Pkg-ExpPsy-PyMVPA mailing list
> Pkg-ExpPsy-PyMVPA at lists.alioth.debian.org
> http://lists.alioth.debian.org/mailman/listinfo/pkg-exppsy-pymvpa
>


