[pymvpa] unbalanced datasets

J.A. Etzel jetzel at artsci.wustl.edu
Fri Aug 10 19:50:55 UTC 2012


I'd phrase that as "breaking the run structure is **SOMETIMES** ok for 
classification" ...

You should definitely check the partitioning schemes for order effects 
and be as conservative as possible. For example, you could ensure that 
samples collected 2 TR apart are not split between the training and 
testing sets, etc. If you have signs that your results are very 
sensitive to the partitioning scheme (e.g. vastly different accuracies 
when shifting the partitioning by one sample), you'll need to look more 
closely at the dependencies. Basically, use extreme caution.
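
One quick way to run that sort of check (a minimal sketch in plain 
numpy; `sample_times` and the fold masks are whatever your own pipeline 
produces, and the 2-TR threshold is just the example above):

    import numpy as np

    def min_train_test_gap(sample_times, train_mask, test_mask):
        # smallest temporal distance between any training sample and
        # any testing sample in this fold
        gaps = np.abs(sample_times[train_mask][:, None] -
                      sample_times[test_mask][None, :])
        return gaps.min()

    # with TR = 2.0 s, flag folds that split samples < 2 TRs apart:
    # if min_train_test_gap(times, train, test) < 2 * 2.0: be suspicious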

I'm not current enough on pymvpa to give any advice there ... I do 
strongly suggest pre-planning the partitioning schemes (i.e. which 
examples in which sets on which folds and replications) prior to 
starting to ensure balance and to enable sensitivity testing.
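
A sketch of what I mean (plain numpy, nothing pymvpa-specific; the 
function name and fold counts here are made up): draw the test indices 
for every fold up front with a fixed seed, save them, and check balance 
before any classifier ever runs.

    import numpy as np

    def preplan_folds(targets, n_folds=10, n_test_per_cond=2, seed=0):
        # pre-generate balanced test-set indices for each fold so the
        # exact scheme can be saved, inspected, and reused
        rng = np.random.RandomState(seed)
        folds = []
        for _ in range(n_folds):
            test_idx = np.concatenate(
                [rng.choice(np.where(targets == t)[0],
                            n_test_per_cond, replace=False)
                 for t in np.unique(targets)])
            folds.append(np.sort(test_idx))
        return folds   # balance per condition holds by construction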

Jo



On 8/10/2012 1:53 PM, Edmund Chong wrote:
> Thanks Jo!
>
> For me, I don't have that many runs, so partitioning on groups of runs
> is not really a good option, so I'd rather try doing "leave-n-samples
> out" instead of "leave-n-runs out" -- I looked at your paper and indeed
> it seems that breaking the run structure is ok for classification.
> However, do you know of any way to do that with pre-written pymvpa
> functions, or do I have to manually partition the runs myself?
>
> Thanks!
> -Edmund
>
> On Wed, Aug 8, 2012 at 3:37 PM, J.A. Etzel <jetzel at artsci.wustl.edu> wrote:
>
>     What you describe is one option. I talked about those types of
>     schemes and when they can be ok (in my opinion!) in
>     http://dx.doi.org/10.1016/j.neuroimage.2010.08.050
>
>     As general advice, it seems best to try to partition so that the
>     number of examples of each case in each cross-validation fold is
>     roughly equal. Sometimes that's just plain not possible. For
>     example, I have a dataset with a large number of runs, but only
>     trials the person gets correct are analyzed, so the number of
>     examples in some runs for some people varies drastically. What we
>     did in this case was to partition on groups of runs, so one fold
>     leaves runs 1, 2, 3, and 4 out. This scheme equalized the number of
>     examples somewhat (though I still subsetted examples to make them
>     exactly equal), and seemed to reduce the amount of variation.
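>
>     (Sketching the grouping in plain numpy -- a rough illustration,
>     with the run number per sample assumed to be in `chunks` and runs
>     numbered from 1:)
>
>         import numpy as np
>
>         chunks = np.asarray(chunks)       # run number for each sample
>         group = (chunks - 1) // 4         # runs 1-4 -> group 0, etc.
>         for g in np.unique(group):
>             test = group == g             # leave this run-group out
>             train = ~test                 # train on the other groups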
>
>     Jo
>
>
>
>     On 8/7/2012 10:52 AM, Edmund Chong wrote:
>
>         Hi all,
>
>         I recently asked a question on dealing with unbalanced datasets and
>         here's a follow-up question.
>         So let's say I have empty runs, or runs where there are zero
>         samples for one of the conditions. This leads to problems if
>         that run happens to be the test run in a leave-one-run-out
>         cross-validation procedure.
>
>         My workaround for that was this: if I had one such run with an
>         empty condition, then I would set NFoldPartitioner(cvtype=2),
>         together with Balancer(), so that any combination of two runs
>         would have at least one sample per condition. But if I had two
>         such runs with empty conditions, then I would set cvtype=3, and
>         so on. However, this means I have less data for the training
>         set on each classification fold.
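>
>         (Concretely, that setup would look something like this -- a
>         rough sketch following the mvpa2 ChainNode pattern, with `clf`
>         and `ds` standing in for whatever classifier and dataset are
>         used:)
>
>             from mvpa2.base.node import ChainNode
>             from mvpa2.generators.partition import NFoldPartitioner
>             from mvpa2.generators.resampling import Balancer
>             from mvpa2.measures.base import CrossValidation
>
>             # leave-2-runs-out, then equalize the number of samples
>             # per condition within each train/test split
>             partitioner = ChainNode(
>                 [NFoldPartitioner(cvtype=2),
>                  Balancer(attr='targets', count=1,
>                           limit='partitions', apply_selection=True)],
>                 space='partitions')
>             cv = CrossValidation(clf, partitioner)
>             results = cv(ds)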
>
>         Is there any other possible solution for this? In fact, is it
>         possible to do leave-n-samples-out classification, where on
>         each fold I randomly select n samples per condition to test on,
>         and use the remaining samples (after balancing) for training,
>         disregarding the chunks structure?
>
>         Thanks!
>         -Edmund
>
>     --
>     Joset A. Etzel, Ph.D.
>     Research Analyst
>     Cognitive Control & Psychopathology Lab
>     Washington University in St. Louis
>     http://mvpa.blogspot.com/

-- 
Joset A. Etzel, Ph.D.
Research Analyst
Cognitive Control & Psychopathology Lab
Washington University in St. Louis
http://mvpa.blogspot.com/


