[pymvpa] EEG dataset and chunks

Yaroslav Halchenko debian at onerussian.com
Thu Apr 7 19:29:05 UTC 2011


>    This should run 5 evaluation, using 1/5 of the available data each time
>    to test the classifier. Correct?

Correct, in that it should generate 5 partitions for you, where in the
first one you would get the first nsamples/5 samples (and the
corresponding "chunks", which are unique per sample in your case).

>    Now, for this to work properly, it requires that targets are properly
>    randomly distributed in the dataset... 

Well... theoretically speaking, if you have lots of samples, you might
get away with doing classical leave-one-out cross-validation.  That would
be implemented by using NFoldPartitioner on your dataset (i.e. without
NGroupPartitioner).  But such a cross-validation would take a while --
probably not desirable unless coded explicitly for it (e.g. for SVMs,
either using CachedKernel to avoid recomputation of kernels, or even
more trickery...).
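
A rough sketch of that leave-one-out variant (same assumptions as above,
with clf and ds already defined):

    from mvpa2.suite import CrossValidation, NFoldPartitioner

    # one fold per unique chunk, i.e. per sample here -- nsamples folds, so slow
    cv_loo = CrossValidation(clf, NFoldPartitioner())
    res = cv_loo(ds)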

> for instance if the last 1/5 of
>    the samples only contain target 2, then it won't work...

yeap -- that is the catch ;)

You could use NFoldPartitioner(cvtype=2), which generates all possible
combinations of 2 chunks, chained with a Sifter (recently introduced) to
keep only those partitions whose testing set carries labels from both
classes.  But, once again, it would be A LOT to cross-validate (roughly
(nsamples/2)^2 partitions), so I guess that is not a solution for you either.
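
To sketch how those two would be chained (the target values 'cond1'/'cond2'
are just placeholders for your actual labels, clf and ds as above):

    from mvpa2.suite import ChainNode, NFoldPartitioner, Sifter, CrossValidation

    partitioner = ChainNode(
        [NFoldPartitioner(cvtype=2),                  # every pair of chunks taken out
         Sifter([('partitions', 2),                   # keep only testing partitions...
                 ('targets', ['cond1', 'cond2'])])],  # ...which contain both targets
        space='partitions')
    cv = CrossValidation(clf, partitioner)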

> What do you suggest to solve this problem?

If you have some certainty that the samples are independent, then to get
a reasonable generalization estimate, just assign np.arange(nsamples/2)
(assuming initially balanced classes) as chunks to the samples of each
condition.  Then each chunk is guaranteed to contain a pair of
conditions ;)  And then you are welcome to use NGroupPartitioner to
bring the number of partitions down to some more cost-effective number, e.g. 10.
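
In code that would be something like (a sketch, assuming two initially
balanced conditions, so that chunk i holds the i-th sample of each condition):

    import numpy as np
    from mvpa2.suite import CrossValidation, NGroupPartitioner

    chunks = np.zeros(len(ds), dtype=int)
    for t in np.unique(ds.targets):
        # number this condition's samples 0, 1, 2, ..., nsamples/2 - 1
        chunks[ds.targets == t] = np.arange(np.sum(ds.targets == t))
    ds.sa['chunks'] = chunks

    # then bring the nsamples/2 chunks down to, e.g., 10 partitions
    cv = CrossValidation(clf, NGroupPartitioner(10))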

> I have tried to use a ChainNode,
>    chaining the NGroupPartitioner and a Balancer but it didn't work,

If I see it right -- it should have worked, unless you had a really
degenerate case, e.g. one of the partitions contained samples of only
1 category.
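
For reference, the chained variant I would have expected (a sketch in the
spirit of the tutorial examples, not necessarily your exact code):

    from mvpa2.suite import ChainNode, NGroupPartitioner, Balancer, CrossValidation

    partitioner = ChainNode(
        [NGroupPartitioner(5),
         # within each partitioning, subsample to equal numbers of samples per target
         Balancer(attr='targets', count=1, limit='partitions', apply_selection=True)],
        space='partitions')
    cv = CrossValidation(clf, partitioner)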

>    apparently due to a bug in Balancer (see another mail on that one).

oops -- need to check emails then...

>    My main question though is: it seems weird to add chunks attribute like
>    this. Is it the correct way?

Well... if you consider your samples independent of each other, then
yes -- it is reasonable to assign each sample to a separate chunk.
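
e.g. simply (a one-line sketch):

    import numpy as np
    ds.sa['chunks'] = np.arange(len(ds))   # one chunk per sample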


>    Btw, is there a way to pick at random 80% of the data (with equal
>    number of samples for each target) for training and the remaining 20%
>    for testing, and repeat this as many times as I want to obtain a
>    consistent result?

Although I don't think we have tried it, this should do:

CrossValidation(clf,
                Balancer(amount=0.8, limit=None, attr='targets', count=3), 
                splitter=Splitter('balanced_set', [True, False]))

That should do the cross-validation using 3 such random balanced selections
(raise count=3 to whatever number you like).

What we do here: we tell the Balancer to balance the targets, take 80% of
the samples and mark them True, and mark the other 20% False.  Then we
proceed to the cross-validation.  CrossValidation uses an actual Splitter
which splits the dataset into training/testing parts.  Usually such a
splitter is not specified and gets constructed by CrossValidation itself,
assuming it operates on partitions labeled 0, 1 (and possibly 2) as usually
provided by the Partitioners.  But here we want to split based on the
balanced_set attribute -- and we can do that, instructing it to take the
80% marked True for training, and the rest (False) for testing.

limit=None is there to say not to limit the subsampling within any attribute
(commonly chunks), so in this case you don't even need to have chunks at all.

is that what you needed?

-- 
=------------------------------------------------------------------=
Keep in touch                                     www.onerussian.com
Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic


