[pymvpa] unbalanced datasets

Edmund Chong edmund.w.chong at gmail.com
Fri Aug 10 18:53:27 UTC 2012


Thanks Jo!

For me, I don't have that many runs, so partitioning on groups of runs is
not really a good option. So I'd rather try doing "leave-n-samples-out"
instead of "leave-n-runs-out" -- I looked at your paper, and indeed it
seems that breaking the run structure is OK for classification. However,
do you know of any way to do that with pre-written PyMVPA functions, or
do I have to write the partitioning myself?
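
In case it's useful, here is roughly what I have in mind (just a sketch,
assuming a PyMVPA 2.x dataset ds with the usual 'targets' and 'chunks'
sample attributes; n_pseudo and the classifier are placeholders):

import numpy as np
from mvpa2.suite import NFoldPartitioner, CrossValidation, LinearCSVMC

# discard the real run structure: reassign every sample to one of
# n_pseudo random pseudo-runs of roughly equal size, then run an
# ordinary leave-one-pseudo-run-out cross-validation on those
n_pseudo = 5
ds.sa['chunks'] = np.random.permutation(np.arange(len(ds)) % n_pseudo)

cv = CrossValidation(LinearCSVMC(), NFoldPartitioner(cvtype=1))
res = cv(ds)

(For unbalanced conditions one would presumably want to stratify the
pseudo-run assignment per condition, or chain in a Balancer as well.)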

Thanks!
-Edmund

On Wed, Aug 8, 2012 at 3:37 PM, J.A. Etzel <jetzel at artsci.wustl.edu> wrote:

> What you describe is one option. I talked about those types of schemes
> and when they can be ok (in my opinion!) in
> http://dx.doi.org/10.1016/j.neuroimage.2010.08.050
>
> As general advice, it seems best to try to partition so that the number of
> examples of each case in each cross-validation fold is roughly equal.
> Sometimes that's just plain not possible. For example, I have a dataset
> with a large number of runs, but only the trials the person gets correct
> are analyzed, so the number of examples in some runs varies drastically
> across people. What we did in this case was to partition on groups of
> runs, so that one fold leaves runs 1, 2, 3, and 4 out. This scheme
> equalized the number of examples somewhat (though I still subsetted
> examples to make them exactly equal), and seemed to reduce the amount of
> variation.
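>
> Concretely, in PyMVPA that could look something like this (just a
> sketch; 'rungroup' is a made-up attribute name, and it assumes the
> chunks are integer run numbers starting at 1):
>
> from mvpa2.suite import NFoldPartitioner
>
> # map runs 1-4 to group 0, runs 5-8 to group 1, and so on, then
> # leave one whole group of runs out per cross-validation fold
> ds.sa['rungroup'] = (ds.sa.chunks - 1) // 4
> partitioner = NFoldPartitioner(cvtype=1, attr='rungroup')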
>
> Jo
>
>
>
> On 8/7/2012 10:52 AM, Edmund Chong wrote:
>
>> Hi all,
>>
>> I recently asked a question about dealing with unbalanced datasets, and
>> here's a follow-up question.
>> So let's say I have empty runs, or runs with zero samples for one of
>> the conditions. This causes problems if such a run happens to be the
>> test run in a leave-one-run-out cross-validation procedure.
>>
>> My workaround was this: if I had one such run with an empty condition,
>> I would set NFoldPartitioner(cvtype=2) together with Balancer(), so
>> that any combination of two runs would have at least one sample per
>> condition. If I had two such runs, I would set cvtype=3, and so on.
>> However, this means I have less data for the training set on each
>> classification fold.
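>>
>> Concretely, the workaround looks something like this (a sketch,
>> assuming a dataset ds with the usual 'targets' and 'chunks'
>> attributes and a generic classifier):
>>
>> from mvpa2.suite import (NFoldPartitioner, Balancer, ChainNode,
>>                          CrossValidation, LinearCSVMC)
>>
>> # leave every combination of two runs out, and balance the
>> # conditions within each partition so training sets stay equalized
>> partitioner = ChainNode([NFoldPartitioner(cvtype=2),
>>                          Balancer(attr='targets', count=1,
>>                                   limit='partitions',
>>                                   apply_selection=True)],
>>                         space='partitions')
>> cv = CrossValidation(LinearCSVMC(), partitioner)
>> res = cv(ds)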
>>
>> Is there any other possible solution for this? In fact, is it possible
>> to do leave-n-samples-out classification: on each fold, randomly select
>> n samples per condition to test on, and use the remaining samples
>> (after balancing) for training, disregarding the chunk structure?
>>
>> Thanks!
>> -Edmund
>>
> --
> Joset A. Etzel, Ph.D.
> Research Analyst
> Cognitive Control & Psychopathology Lab
> Washington University in St. Louis
> http://mvpa.blogspot.com/
>