[pymvpa] Fwd: 10 fold stratified crossvalidation

Jakob Scherer jakob.scherer at gmail.com
Tue Nov 23 21:13:57 UTC 2010


Hi all,
I am working with an EEG dataset containing 674 trials distributed
across 10 labels/classes.

<Dataset / float32 674 x 63426 uniq: 674 chunks 10 labels labels_mapped>

I read the data in from EEGLAB. Since EEGLAB stores one file per
condition (i.e. label), I read the files in one after the other and
concatenate them on the fly.
As a result, the samples in the final dataset are ordered by label (see
[ref_1]) rather than in the order in which they were originally recorded.

I am then running a linear SVM on the dataset. In order to get a good
estimate of the predictive power, the crossvalidation that I would
like to apply is:

"Ten times ten-fold stratified crossvalidation"

i.e.
1. Split the dataset into ten splits (which are stratified, i.e.
nperlabel='equal').
2. Train on nine of them and test on the remaining one.
3. Do this for every split, so that each one serves as the test set
once (i.e. leave-one-split-out).
4. Repeat the whole thing 10 times, starting from a new initial split.
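
As a library-independent illustration of the four steps above, here is a
minimal sketch in plain Python. The function names are made up for this
example and are not PyMVPA API. Note one assumption: the sketch keeps each
label's share roughly equal across folds instead of subsampling to an
equal number of samples per label.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, rng):
    """Return a fold id per sample; each label is spread evenly over folds."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    fold_of = [None] * len(labels)
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Random starting fold so leftover samples do not always land in fold 0.
        offset = rng.randrange(n_folds)
        for position, idx in enumerate(indices):
            fold_of[idx] = (position + offset) % n_folds
    return fold_of


def repeated_stratified_cv(labels, n_folds=10, n_repeats=10, seed=0):
    """Yield (repeat, test_fold, train_indices, test_indices) tuples."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        # Step 1, redone with a fresh shuffle on every repetition (step 4).
        fold_of = stratified_folds(labels, n_folds, rng)
        # Step 3: every fold is the test set exactly once.
        for test_fold in range(n_folds):
            # Step 2: train on the other folds, test on the held-out one.
            train = [i for i, f in enumerate(fold_of) if f != test_fold]
            test = [i for i, f in enumerate(fold_of) if f == test_fold]
            yield repeat, test_fold, train, test
```

Each yielded pair of index lists could then be used to select the training
and test samples for one classifier run; averaging the 100 resulting
accuracies gives the overall estimate.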

My first guess was:

asplitter = NFoldSplitter(cvtype=N.min(dataset.samplesperlabel.values()),
                          nperlabel='equal',
                          nrunspersplit=1,
                          count=10,
                          strategy='random')

One issue now is the huge number of chunks (similar to the problem
Thorsten Kranz mentioned a few days ago in "RFE memory issue").
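
One way to get a feel for the scale: if every trial is its own chunk, the
number of distinct leave-k-chunks-out splits is the binomial coefficient
C(674, k). The short sketch below just evaluates that number; the helper
name is made up for illustration, and k = 60 is an arbitrary example
value, not the dataset's actual minimum number of samples per label.

```python
import math


def n_possible_splits(n_chunks, k):
    """Number of distinct ways to hold out k of n_chunks chunks."""
    return math.comb(n_chunks, k)


if __name__ == "__main__":
    n_chunks = 674  # one chunk per trial, as in the dataset summary above
    for k in (1, 2, 60):  # 60 is purely illustrative
        print(k, n_possible_splits(n_chunks, k))
```

With ten coarse chunks and k = 1 there are only ten possible splits.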

I then "discovered" the dataset.coarsenChunks() function, which I could
use to realize step 1 above. However, the new chunks would then need
some kind of shuffling, because without it they simply follow the
sample order [ref_2]. With shuffled chunks [ref_3], I would then run:

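########################
import numpy as N
from mvpa.suite import *

# `dataset` and the SVM regularization parameter `C` are assumed to be
# defined beforehand.
ACC_collector = {}

# ten repetitions, each with a fresh random assignment of samples to chunks
for repetition_nr in range(10):

    # shuffle=True is the proposed extension (see the attached miscfx.py),
    # not part of the stock coarsenChunks()
    dataset.coarsenChunks(10, shuffle=True)

    aclassifier = LinearCSVMC(C=C)
    # leave-one-chunk-out over the ten coarse chunks
    asplitter = NFoldSplitter(cvtype=1, nperlabel='equal',
                              nrunspersplit=1, count=None)

    cv = CrossValidatedTransferError(
        TransferError(aclassifier),
        asplitter,
        harvest_attribs=[
            'transerror.clf.getSensitivityAnalyzer(force_training=False)()'],
        enable_states=['confusion', 'harvested', 'transerrors'])

    cv(dataset)

    # accuracy of this repetition, taken from the confusion matrix
    ACC_collector[repetition_nr] = cv.confusion.stats['ACC']

# average accuracy over the ten repetitions
ACC_estimate = N.mean(list(ACC_collector.values()))
########################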
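
To make the idea of a shuffled chunk assignment concrete without PyMVPA,
here is a minimal sketch in plain Python. It is not the attached patch;
the function name and its parameters are assumptions for illustration
only. It produces near-equal chunk sizes in random order, similar in
spirit to the output shown in [ref_3].

```python
import random


def shuffled_chunks(n_samples, n_chunks, seed=None):
    """Return one chunk id per sample: near-equal chunk sizes, random order."""
    rng = random.Random(seed)
    # Deal chunk ids round-robin so chunk sizes differ by at most one...
    chunks = [i % n_chunks for i in range(n_samples)]
    # ...then shuffle so chunk membership no longer follows the sample order.
    rng.shuffle(chunks)
    return chunks
```

Because this assignment ignores the labels, the label balance within each
chunk is only approximate; any exact balancing would have to happen
afterwards, for example in the splitter.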


To make a long story short:
- For the shuffling, I have attached a proposal for how to change
mvpa/datasets/miscfx.py (the relevant lines are marked with
"# proposed shuffle").
- Do you think this is the way to go, or do you see an easier one?


Many thanks in advance,
jakob



references:


[ref_1]
(Pdb) dataset.labels
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,
      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3,
      3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
      3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
      3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
      4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
      4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5,
      5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
      5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
      5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
      6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
      6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
      6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8,
      8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
      8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
      8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
      8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
      9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
      9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
      9, 9, 9, 9, 9, 9, 9])


[ref_2] no shuffling:
In [30]: dataset.coarsenChunks(10)

In [31]: dataset.chunks

Out[31]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3,
      3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
      3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
      3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4,
      4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
      4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
      4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5,
      5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
      5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
      5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6,
      6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
      6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
      6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
      8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
      8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
      8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
      9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
      9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
      9, 9, 9, 9, 9, 9, 9])


[ref_3] chunks shuffling:
In [32]: dataset.coarsenChunks(10, shuffle=True)
In [33]: dataset.chunks

Out[33]:
array([4, 8, 5, 5, 5, 1, 8, 8, 8, 2, 2, 4, 2, 5, 4, 8, 2, 7, 5, 8, 7, 9, 7,
      6, 4, 0, 3, 7, 7, 0, 8, 5, 0, 8, 6, 9, 9, 6, 8, 3, 3, 5, 2, 9, 4, 6,
      5, 5, 9, 3, 9, 2, 1, 6, 9, 1, 6, 6, 8, 5, 8, 0, 0, 3, 0, 4, 9, 8, 3,
      9, 3, 7, 9, 8, 7, 8, 9, 3, 3, 1, 6, 8, 1, 7, 2, 0, 0, 2, 4, 2, 5, 7,
      4, 5, 3, 3, 7, 3, 4, 1, 5, 8, 6, 9, 1, 2, 2, 8, 8, 1, 3, 8, 9, 6, 2,
      2, 9, 7, 0, 6, 8, 0, 6, 4, 5, 9, 9, 5, 5, 9, 1, 7, 9, 0, 4, 6, 0, 8,
      5, 4, 4, 1, 2, 4, 5, 8, 5, 2, 6, 3, 5, 4, 7, 1, 2, 5, 4, 0, 8, 0, 1,
      3, 9, 8, 3, 3, 8, 7, 6, 9, 6, 5, 7, 3, 0, 7, 0, 1, 5, 0, 8, 5, 8, 4,
      6, 0, 0, 4, 5, 1, 3, 3, 2, 0, 6, 5, 7, 4, 3, 5, 2, 6, 1, 9, 3, 1, 2,
      3, 2, 1, 5, 9, 6, 6, 4, 3, 4, 7, 6, 2, 1, 1, 7, 9, 1, 7, 4, 3, 1, 6,
      4, 4, 1, 4, 0, 5, 1, 6, 1, 7, 1, 6, 8, 3, 2, 9, 0, 3, 1, 3, 4, 8, 5,
      9, 7, 1, 0, 8, 4, 4, 7, 3, 9, 2, 8, 1, 7, 1, 3, 1, 8, 5, 1, 6, 9, 0,
      7, 1, 3, 9, 9, 4, 8, 3, 5, 8, 7, 4, 8, 0, 9, 6, 7, 4, 8, 7, 2, 8, 3,
      0, 0, 4, 7, 5, 7, 6, 9, 5, 5, 0, 2, 9, 9, 7, 3, 2, 1, 3, 6, 0, 2, 3,
      5, 2, 9, 6, 9, 2, 6, 5, 3, 6, 9, 2, 4, 1, 2, 4, 3, 5, 2, 6, 8, 0, 2,
      7, 1, 2, 7, 7, 3, 9, 4, 9, 9, 3, 1, 6, 5, 4, 9, 7, 0, 9, 6, 7, 4, 4,
      4, 2, 2, 6, 6, 7, 2, 3, 0, 9, 7, 6, 8, 1, 4, 5, 9, 3, 2, 3, 4, 6, 9,
      3, 2, 9, 3, 6, 7, 4, 2, 6, 0, 1, 9, 6, 5, 8, 5, 5, 5, 9, 3, 8, 8, 0,
      7, 2, 1, 5, 7, 2, 3, 1, 3, 8, 8, 2, 7, 5, 8, 7, 4, 4, 0, 5, 7, 6, 4,
      0, 2, 5, 2, 0, 3, 7, 5, 5, 5, 7, 1, 6, 1, 1, 0, 4, 8, 3, 1, 4, 8, 3,
      5, 0, 2, 4, 5, 9, 2, 6, 1, 1, 1, 3, 9, 4, 4, 6, 6, 0, 3, 6, 4, 2, 7,
      8, 2, 0, 9, 0, 1, 7, 4, 6, 9, 2, 2, 3, 8, 0, 1, 7, 0, 9, 3, 8, 0, 1,
      7, 7, 9, 7, 3, 5, 2, 0, 8, 2, 7, 7, 0, 1, 5, 3, 5, 9, 2, 0, 0, 2, 0,
      8, 4, 3, 5, 4, 6, 4, 2, 6, 1, 0, 8, 7, 5, 4, 0, 2, 8, 9, 7, 8, 1, 4,
      2, 1, 3, 7, 7, 0, 9, 2, 4, 6, 3, 9, 3, 9, 7, 0, 7, 0, 5, 4, 4, 7, 0,
      8, 9, 7, 5, 8, 1, 1, 6, 8, 9, 4, 1, 5, 6, 0, 4, 3, 5, 4, 1, 8, 8, 3,
      0, 1, 6, 0, 3, 1, 1, 1, 6, 8, 9, 0, 2, 1, 2, 7, 0, 0, 7, 0, 7, 9, 0,
      0, 7, 6, 5, 2, 2, 9, 6, 6, 1, 6, 7, 9, 9, 9, 3, 6, 6, 5, 8, 2, 3, 5,
      8, 7, 1, 9, 4, 4, 4, 4, 5, 2, 1, 0, 3, 6, 2, 1, 6, 8, 8, 6, 2, 8, 5,
      2, 0, 5, 0, 6, 8, 6])
-------------- next part --------------
A non-text attachment was scrubbed...
Name: miscfx.py
Type: application/octet-stream
Size: 12875 bytes
Desc: not available
URL: <http://lists.alioth.debian.org/pipermail/pkg-exppsy-pymvpa/attachments/20101123/30a5dfa4/attachment-0001.obj>

