[pymvpa] Fwd: 10 fold stratified crossvalidation
Jakob Scherer
jakob.scherer at gmail.com
Tue Nov 23 21:13:57 UTC 2010
Hi all,
I am working with an EEG dataset containing 674 trials distributed
across 10 labels/classes.
<Dataset / float32 674 x 63426 uniq: 674 chunks 10 labels labels_mapped>
I read the data in from EEGLAB. Since EEGLAB stores one file per
condition (i.e. label), I read the files in one after the other and
concatenate them on the fly.
So the samples in the final dataset are ordered by label (see
[ref_1]), not in the order in which they were originally recorded.
I am then running a linear SVM on the dataset. To get a good
estimate of the predictive power, the crossvalidation that I would
like to apply is:
"Ten times ten-fold stratified crossvalidation"
i.e.
1. split the dataset into ten splits (which are stratified, i.e.
nperlabel='equal').
2. train on nine of them and test on the remaining one.
3. do this for every split (i.e. leave-one-split-out).
4. repeat the whole thing 10 times (i.e. redo the initial splitting).
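The four steps above can be sketched independently of PyMVPA. What follows is a minimal illustration in plain Python 3 (standard library only), not the NFoldSplitter implementation; the names stratified_folds and repeated_stratified_cv are made up for this example. Note that it stratifies proportionally (every fold gets roughly the same share of each label), which, as far as I understand it, differs from nperlabel='equal', where samples are dropped until all labels have equal counts.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, rng=None):
    """Return one fold id (0..n_folds-1) per sample, balanced within each label."""
    rng = rng or random.Random()
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)
    folds = [None] * len(labels)
    for indices in by_label.values():
        rng.shuffle(indices)
        # random starting fold so that no fold is systematically larger
        offset = rng.randrange(n_folds)
        # deal the samples of this label round-robin into the folds
        for pos, idx in enumerate(indices):
            folds[idx] = (pos + offset) % n_folds
    return folds


def repeated_stratified_cv(labels, n_folds=10, n_repeats=10, seed=0):
    """Yield (repetition, fold, train_idx, test_idx) for repeated stratified CV."""
    rng = random.Random(seed)
    for rep in range(n_repeats):
        folds = stratified_folds(labels, n_folds, rng)  # step 1 (redone in step 4)
        for k in range(n_folds):                        # step 3
            test = [i for i, f in enumerate(folds) if f == k]
            train = [i for i, f in enumerate(folds) if f != k]
            yield rep, k, train, test                   # step 2 happens in the caller


# synthetic labels, sorted by label like the dataset described above
labels = [lab for lab in range(10) for _ in range(67)]
splits = list(repeated_stratified_cv(labels))
print(len(splits))  # 10 repetitions x 10 folds
```

Each yielded pair of index lists would be used to train on `train` and test on `test`; averaging the resulting accuracies over all repetitions and folds gives the final estimate.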
My first guess was:
asplitter = NFoldSplitter(cvtype=N.min(dataset.samplesperlabel.values()),
                          nperlabel='equal',
                          nrunspersplit=1,
                          count=10,
                          strategy='random')
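For a sense of scale: assuming cvtype=k means that k chunks are left out for testing in each split, the number of possible splits is the binomial coefficient C(n_chunks, k). With one chunk per trial (674 chunks, see the dataset summary above), this grows very quickly. The helper n_possible_splits below is hypothetical and only does the arithmetic; it says nothing about how NFoldSplitter enumerates or samples the splits internally.

```python
from math import comb  # requires Python 3.8+


def n_possible_splits(n_chunks, cvtype):
    """Number of distinct ways to leave `cvtype` out of `n_chunks` chunks."""
    return comb(n_chunks, cvtype)


n_chunks = 674  # one chunk per trial
for cvtype in (1, 2, 3):
    print(cvtype, n_possible_splits(n_chunks, cvtype))

# after coarsening to 10 chunks, leave-one-chunk-out has only 10 splits
print(n_possible_splits(10, 1))
```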
One issue now is the huge number of chunks (similar to the problem
Thorsten Kranz mentioned a few days ago in "RFE memory issue").
I then "discovered" the dataset.coarsenChunks() function, which I could
use to realize step 1 above. However, chunks_new would then need some
kind of shuffling (see [ref_2] for the result without it). With this
shuffling of chunks [ref_3], I would then run:
########################
import numpy as N
from mvpa.suite import *

ACC_collector = {}
for repetition_nr in range(10):
    # shuffle=True is the proposed extension (see attached miscfx.py),
    # not an argument of the stock coarsenChunks()
    dataset.coarsenChunks(10, shuffle=True)
    aclassifier = LinearCSVMC(C=C)  # C has to be defined beforehand
    asplitter = NFoldSplitter(cvtype=1, nperlabel='equal',
                              nrunspersplit=1, count=None)
    cv = CrossValidatedTransferError(
        TransferError(aclassifier),
        asplitter,
        harvest_attribs=[
            'transerror.clf.getSensitivityAnalyzer(force_training=False)()'],
        enable_states=['confusion', 'harvested', 'transerrors'])
    cv(dataset)
    ACC_collector[repetition_nr] = cv.confusion.stats['ACC']
# average the accuracy over the ten repetitions
ACC_estimate = N.mean(ACC_collector.values())
########################
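To illustrate why the shuffle matters for label-sorted data, here is a small standalone sketch in plain Python 3. It is neither the coarsenChunks() code nor the attached patch: contiguous_chunks, shuffled_chunks and labels_per_chunk are hypothetical helpers, and the labels are synthetic (10 labels with 20 samples each).

```python
import random


def contiguous_chunks(n_samples, n_chunks):
    """Assign chunk ids in order: first block of samples -> 0, next -> 1, ..."""
    return [i * n_chunks // n_samples for i in range(n_samples)]


def shuffled_chunks(n_samples, n_chunks, seed=None):
    """Same chunk sizes as above, but chunk ids randomly permuted over samples."""
    ids = contiguous_chunks(n_samples, n_chunks)
    random.Random(seed).shuffle(ids)
    return ids


def labels_per_chunk(labels, chunks):
    """Map each chunk id to the number of distinct labels it contains."""
    seen = {}
    for lab, ch in zip(labels, chunks):
        seen.setdefault(ch, set()).add(lab)
    return {ch: len(labs) for ch, labs in seen.items()}


# synthetic data, sorted by label as in [ref_1]
labels = [lab for lab in range(10) for _ in range(20)]

plain = contiguous_chunks(len(labels), 10)
mixed = shuffled_chunks(len(labels), 10, seed=0)

print(labels_per_chunk(labels, plain))  # one label per chunk
print(labels_per_chunk(labels, mixed))  # several labels per chunk
```

With contiguous chunks on sorted data, every test chunk holds a label that is absent from the training data, so leave-one-chunk-out cannot work. After shuffling, each chunk mixes labels, though not necessarily in equal proportions, which is what nperlabel='equal' is meant to compensate for.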
To make a long story short:
- for the shuffling, I attached a proposal for how to change
mvpa/datasets/miscfx.py (lines marked by "# proposed shuffle")
- do you think this is the way to go, or do you see an easier one?
Many thanks in advance,
jakob
references:
[ref_1]
(Pdb) dataset.labels
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9])
[ref_2] no shuffling:
In [30]: dataset.coarsenChunks(10)
In [31]: dataset.chunks
Out[31]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9])
[ref_3] chunks shuffling:
In [32]: dataset.coarsenChunks(10, shuffle=True)
In [33]: dataset.chunks
Out[33]:
array([4, 8, 5, 5, 5, 1, 8, 8, 8, 2, 2, 4, 2, 5, 4, 8, 2, 7, 5, 8, 7, 9, 7,
6, 4, 0, 3, 7, 7, 0, 8, 5, 0, 8, 6, 9, 9, 6, 8, 3, 3, 5, 2, 9, 4, 6,
5, 5, 9, 3, 9, 2, 1, 6, 9, 1, 6, 6, 8, 5, 8, 0, 0, 3, 0, 4, 9, 8, 3,
9, 3, 7, 9, 8, 7, 8, 9, 3, 3, 1, 6, 8, 1, 7, 2, 0, 0, 2, 4, 2, 5, 7,
4, 5, 3, 3, 7, 3, 4, 1, 5, 8, 6, 9, 1, 2, 2, 8, 8, 1, 3, 8, 9, 6, 2,
2, 9, 7, 0, 6, 8, 0, 6, 4, 5, 9, 9, 5, 5, 9, 1, 7, 9, 0, 4, 6, 0, 8,
5, 4, 4, 1, 2, 4, 5, 8, 5, 2, 6, 3, 5, 4, 7, 1, 2, 5, 4, 0, 8, 0, 1,
3, 9, 8, 3, 3, 8, 7, 6, 9, 6, 5, 7, 3, 0, 7, 0, 1, 5, 0, 8, 5, 8, 4,
6, 0, 0, 4, 5, 1, 3, 3, 2, 0, 6, 5, 7, 4, 3, 5, 2, 6, 1, 9, 3, 1, 2,
3, 2, 1, 5, 9, 6, 6, 4, 3, 4, 7, 6, 2, 1, 1, 7, 9, 1, 7, 4, 3, 1, 6,
4, 4, 1, 4, 0, 5, 1, 6, 1, 7, 1, 6, 8, 3, 2, 9, 0, 3, 1, 3, 4, 8, 5,
9, 7, 1, 0, 8, 4, 4, 7, 3, 9, 2, 8, 1, 7, 1, 3, 1, 8, 5, 1, 6, 9, 0,
7, 1, 3, 9, 9, 4, 8, 3, 5, 8, 7, 4, 8, 0, 9, 6, 7, 4, 8, 7, 2, 8, 3,
0, 0, 4, 7, 5, 7, 6, 9, 5, 5, 0, 2, 9, 9, 7, 3, 2, 1, 3, 6, 0, 2, 3,
5, 2, 9, 6, 9, 2, 6, 5, 3, 6, 9, 2, 4, 1, 2, 4, 3, 5, 2, 6, 8, 0, 2,
7, 1, 2, 7, 7, 3, 9, 4, 9, 9, 3, 1, 6, 5, 4, 9, 7, 0, 9, 6, 7, 4, 4,
4, 2, 2, 6, 6, 7, 2, 3, 0, 9, 7, 6, 8, 1, 4, 5, 9, 3, 2, 3, 4, 6, 9,
3, 2, 9, 3, 6, 7, 4, 2, 6, 0, 1, 9, 6, 5, 8, 5, 5, 5, 9, 3, 8, 8, 0,
7, 2, 1, 5, 7, 2, 3, 1, 3, 8, 8, 2, 7, 5, 8, 7, 4, 4, 0, 5, 7, 6, 4,
0, 2, 5, 2, 0, 3, 7, 5, 5, 5, 7, 1, 6, 1, 1, 0, 4, 8, 3, 1, 4, 8, 3,
5, 0, 2, 4, 5, 9, 2, 6, 1, 1, 1, 3, 9, 4, 4, 6, 6, 0, 3, 6, 4, 2, 7,
8, 2, 0, 9, 0, 1, 7, 4, 6, 9, 2, 2, 3, 8, 0, 1, 7, 0, 9, 3, 8, 0, 1,
7, 7, 9, 7, 3, 5, 2, 0, 8, 2, 7, 7, 0, 1, 5, 3, 5, 9, 2, 0, 0, 2, 0,
8, 4, 3, 5, 4, 6, 4, 2, 6, 1, 0, 8, 7, 5, 4, 0, 2, 8, 9, 7, 8, 1, 4,
2, 1, 3, 7, 7, 0, 9, 2, 4, 6, 3, 9, 3, 9, 7, 0, 7, 0, 5, 4, 4, 7, 0,
8, 9, 7, 5, 8, 1, 1, 6, 8, 9, 4, 1, 5, 6, 0, 4, 3, 5, 4, 1, 8, 8, 3,
0, 1, 6, 0, 3, 1, 1, 1, 6, 8, 9, 0, 2, 1, 2, 7, 0, 0, 7, 0, 7, 9, 0,
0, 7, 6, 5, 2, 2, 9, 6, 6, 1, 6, 7, 9, 9, 9, 3, 6, 6, 5, 8, 2, 3, 5,
8, 7, 1, 9, 4, 4, 4, 4, 5, 2, 1, 0, 3, 6, 2, 1, 6, 8, 8, 6, 2, 8, 5,
2, 0, 5, 0, 6, 8, 6])
-------------- next part --------------
A non-text attachment was scrubbed...
Name: miscfx.py
Type: application/octet-stream
Size: 12875 bytes
Desc: not available
URL: <http://lists.alioth.debian.org/pipermail/pkg-exppsy-pymvpa/attachments/20101123/30a5dfa4/attachment-0001.obj>