[pymvpa] Fwd: 10 fold stratified crossvalidation
J.A. Etzel
jetzel at artsci.wustl.edu
Tue Nov 23 22:58:32 UTC 2010
If I understand what you're trying to do (10-fold cross-validation
repeated 10 times), I would not try to do it in one big step, but rather
as 10-fold cross-validation run ten separate times. In other words, set
up a script with a splitter along the lines of asplitter =
NGroupSplitter(ngroups=10, nperlabel='equal') to do one repeat of
10-fold cross-validation.
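A minimal, untested sketch of what one such repeat could look like
(assuming the usual TransferError/CrossValidatedTransferError setup you
use below, and a preloaded 'dataset'):

from mvpa.suite import NGroupSplitter, LinearCSVMC, \
     TransferError, CrossValidatedTransferError

# one repeat of stratified 10-fold cross-validation
asplitter = NGroupSplitter(ngroups=10, nperlabel='equal')
cv = CrossValidatedTransferError(TransferError(LinearCSVMC()),
                                 asplitter,
                                 enable_states=['confusion'])
cv(dataset)   # 'dataset' is your preloaded Dataset
print cv.confusion.stats['ACC']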
Each time you run the script you should get a different answer, if the
randomization works properly, since the partitions (which samples go
into each of the ten training and testing sets) and the omitted samples
(dropped to keep nperlabel equal when the classes aren't balanced to
start with) will vary. So the simplest solution would be to run the
script ten times, recording the values each time.
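Equivalently, you could loop over the repeats within one script,
something like this (again just a sketch, assuming the splitter's
randomization changes on each call):

import numpy as N
from mvpa.suite import NGroupSplitter, LinearCSVMC, \
     TransferError, CrossValidatedTransferError

accuracies = []
for repeat in range(10):
    asplitter = NGroupSplitter(ngroups=10, nperlabel='equal')
    cv = CrossValidatedTransferError(TransferError(LinearCSVMC()),
                                     asplitter,
                                     enable_states=['confusion'])
    cv(dataset)
    accuracies.append(cv.confusion.stats['ACC'])
print N.mean(accuracies)   # overall accuracy estimate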
I generally prefer to set up the partitions, left-out samples (to keep
nperlabel equal), and random seeds before starting the analysis, instead
of the simple run-the-script-ten-times strategy I just described. The
advantage is that the analysis is then reproducible (for example, you
could run the same analysis with two different types of classifiers),
and you can confirm that all the replications and cross-validation
partitionings are unique. Unfortunately, I've never run such an analysis
in pyMVPA; perhaps someone else has and can explain whether it's
straightforward to precalculate the partitions and left-out samples.
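If someone wants to try, one way to at least pin down the randomization
might be to fix NumPy's global seed before each repeat (untested, and it
assumes the splitter draws from NumPy's random number generator):

import numpy as N

seeds = [11, 22, 33, 44, 55, 66, 77, 88, 99, 110]   # arbitrary but fixed
for seed in seeds:
    N.random.seed(seed)
    # ... run one repeat of the 10-fold cross-validation here ...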
good luck,
Jo
On 11/23/2010 3:13 PM, Jakob Scherer wrote:
> Hi all,
> I am working with an EEG dataset containing 674 trials distributed
> across 10 labels/classes:
>
> <Dataset / float32 674 x 63426 uniq: 674 chunks 10 labels labels_mapped>
>
> I read it in from EEGLAB. Since EEGLAB stores one file per condition
> (i.e. label), I read the files in one after the other and concatenate
> them on the fly, so the samples in the final dataset are in fact
> ordered by label (see [ref_1]) and not in the order in which they were
> originally recorded.
>
> I am then running a linear SVM on the dataset. To get a good estimate
> of the prediction power, the cross-validation scheme I would like to
> apply is:
>
> "Ten times ten-fold stratified cross-validation"
>
> i.e.
> 1. split the dataset into ten splits (stratified, i.e. nperlabel='equal');
> 2. train on nine of them and test on the remaining one;
> 3. do this for every subsplit (i.e. leave-one-out);
> 4. repeat the whole thing 10 times (i.e. redo the initial splitting).
>
> My first guess was:
>
> asplitter = NFoldSplitter(cvtype=N.min(dataset.samplesperlabel.values()),
>                           nperlabel='equal',
>                           nrunspersplit=1,
>                           count=10,
>                           strategy='random')
>
> Now one issue is the huge number of chunks (similar to the problem
> Thorsten Kranz mentioned a few days ago in "RFE memory issue").
>
> I then "discovered" the dataset.coarsenChunks() function. I could use
> it to realize step 1 above; however, there should then be some kind of
> shuffling of the chunks_new [ref_2]. With this shuffling of chunks
> [ref_3], I would then run:
>
> ########################
> import numpy as N
> from mvpa.suite import LinearCSVMC, NFoldSplitter, \
>      TransferError, CrossValidatedTransferError
>
> ACC_collector = {}
>
> for repetition_nr in range(10):
>
>     # regroup the 674 trial chunks into 10 chunks, shuffled
>     # (uses the shuffle option proposed in the attached patch)
>     dataset.coarsenChunks(10, shuffle=True)
>
>     aclassifier = LinearCSVMC(C=C)   # C is defined elsewhere
>     asplitter = NFoldSplitter(cvtype=1, nperlabel='equal',
>                               nrunspersplit=1, count=None)
>
>     cv = CrossValidatedTransferError(
>         TransferError(aclassifier),
>         asplitter,
>         harvest_attribs=[
>             'transerror.clf.getSensitivityAnalyzer(force_training=False)()'],
>         enable_states=['confusion', 'harvested', 'transerrors'])
>
>     cv(dataset)
>
>     ACC_collector[repetition_nr] = cv.confusion.stats['ACC']
>
> ACC_estimate = N.mean(ACC_collector.values())
> ########################
>
>
> To make a long story short:
> - for the shuffling, I attached a proposed change to
> mvpa/datasets/miscfx.py (lines marked with "# proposed shuffle"); a
> rough sketch of the idea is below.
> - do you think this is the way to go, or do you see an easier one?
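>
> In rough, hypothetical code (not the actual attached patch, and
> assuming chunks can be reassigned directly on the dataset), the idea
> is something like:
>
> # Hypothetical sketch of the proposed shuffle: assign the original
> # chunks to the new coarse chunks in random order, not in sequence.
> import numpy as N
>
> def coarsenChunksShuffled(dataset, nchunks):
>     chunks = dataset.chunks
>     # randomly permute the original chunk ids before binning them
>     perm = N.random.permutation(N.unique(chunks))
>     chunks_new = N.zeros(len(chunks), dtype='int')
>     for i, c in enumerate(perm):
>         chunks_new[chunks == c] = i % nchunks
>     dataset.chunks = chunks_new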
>
>
> Many thanks in advance,
> jakob
>
>