[pymvpa] Balancing with searchlight and statistical issues.

Jo Etzel jetzel at wustl.edu
Tue Mar 1 15:00:43 UTC 2016

Here's a response to the second part of your question:

On 2/29/2016 11:30 AM, Roberto Guidotti wrote:
>     Also, you say the dataset is unbalanced, but has 12 runs, each with
>     10 trials, half A and half B. That sounds balanced to me
> I classified in few subject the motor response with good accuracies, but
> now I would like to decode decision, since is a decision task, which is
> the main reason why my dataset is unbalanced. Stimuli are balanced,
> since the subject views half A and half B, but he has to respond if the
> stimulus is either A or B, thus I could have runs with unbalanced
> condition (e.g. 8 A vs 2 B, etc.).

I see; you're classifying decisions, not stimuli, and the participants'
decisions were unbalanced. (As far as the classifier is concerned, the
balanced stimuli are irrelevant; it's the labels (decisions, here) that
matter.)

Classifying with an imbalanced training set is not at all a good idea in 
most cases; you'll need to balance it so that you have equal numbers of 
each class. I'll try to get a demo up with more explanation, but the 
short version is that linear SVMs (and many other common MVPA 
algorithms) are exquisitely sensitive to imbalance: a training set with 
21 of one class and 20 of the other can produce seriously skewed results.

While there are ways to adjust example weighting, etc., with fMRI
datasets I generally recommend subsetting examples for balance instead.
Since you have 12 runs, you might find that the balance is a bit closer 
if you do leave-two-runs-out (or even three or four) instead of 
leave-one-run-out cross-validation.
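
One way to check this before committing to a scheme is to count the
training-set imbalance for every fold under different numbers of left-out
runs. The sketch below is plain Python, not PyMVPA code; the function name
`training_imbalance` and the `labels`/`runs` lists (one entry per example)
are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def training_imbalance(labels, runs, k):
    """Return, for each leave-k-runs-out fold, the difference between the
    largest and smallest class count in that fold's training set."""
    unique_runs = sorted(set(runs))
    classes = sorted(set(labels))
    diffs = []
    for test_runs in combinations(unique_runs, k):
        held_out = set(test_runs)
        # Count labels of the examples that remain for training.
        counts = Counter(
            lab for lab, run in zip(labels, runs) if run not in held_out
        )
        per_class = [counts[c] for c in classes]
        diffs.append(max(per_class) - min(per_class))
    return diffs
```

Comparing, say, `max(training_imbalance(labels, runs, 1))` with
`max(training_imbalance(labels, runs, 2))` for each person shows whether
leaving out more runs actually helps for that person's decisions.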

Say you have 21 of one class and 20 of the other in a training set.
You'll then want to remove one example of the larger class (at random),
so that there are 20 examples of each class. To make sure you didn't
happen to remove a "weird" example (so that your results depend entirely
on which example was removed), the balancing process should be repeated
several times (e.g., 5 or 10, depending on how serious the imbalance is)
and the results averaged over those replications.
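
A minimal sketch of that subsample-and-average procedure, again in plain
Python rather than PyMVPA. The names `balanced_subset` and
`mean_over_replications` are made up for this example, and `score_fn` is a
hypothetical callback that would train and test a classifier on the kept
examples and return an accuracy.

```python
import random
from collections import defaultdict


def balanced_subset(labels, rng):
    """Return sorted indices keeping an equal number of examples per class,
    dropping randomly chosen examples from the larger class(es)."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    n_keep = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, n_keep))
    return sorted(kept)


def mean_over_replications(labels, score_fn, n_reps=10, seed=0):
    """Average score_fn(kept_indices) over n_reps random balanced subsets.
    score_fn is a placeholder for whatever classification is being run."""
    rng = random.Random(seed)
    scores = [score_fn(balanced_subset(labels, rng)) for _ in range(n_reps)]
    return sum(scores) / len(scores)
```

Fixing the seed keeps the omitted examples reproducible, so the same
replications can be rerun later.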

I don't know how to set it up in pymvpa, but when dealing with
imbalanced datasets my usual practice is to look at how many examples
are present for each person, and figure out a cross-validation scheme
that will minimize the imbalance as much as possible. Then I
precalculate which examples will be omitted in each person for each
replication (e.g., in the first replication, leave out the 3rd "A" in
run 2; in the second replication, omit the 5th). Ideally, I omit
examples before classifying, so that all cross-validation folds will be
fully balanced, then do the classification with that balanced dataset.
(This parallels the idea of dataset-wise permutation testing - first
balance the dataset, then do the cross-validation.)
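
To illustrate the precalculation step, here is a sketch that builds the
omission list for each replication up front. It assumes the balancing is
done within each run, which is only one simple way to guarantee that every
fold is balanced (and it can discard a lot of examples when a run is, say,
8 vs 2). The function name `omission_schedule` and its arguments are
hypothetical; this is plain Python, not PyMVPA.

```python
import random
from collections import defaultdict


def omission_schedule(labels, runs, n_reps=10, seed=0):
    """For each replication, return the sorted example indices to omit so
    that every run has equal numbers of each class (assumed scheme)."""
    rng = random.Random(seed)
    # groups[run][label] -> indices of that label's examples in that run
    groups = defaultdict(lambda: defaultdict(list))
    for idx, (lab, run) in enumerate(zip(labels, runs)):
        groups[run][lab].append(idx)
    classes = sorted(set(labels))
    schedule = []
    for _ in range(n_reps):
        omit = []
        for run in sorted(groups):
            # A run missing a class keeps nothing (n_keep is 0).
            n_keep = min(len(groups[run][c]) for c in classes)
            for c in classes:
                idxs = groups[run][c]
                omit.extend(rng.sample(idxs, len(idxs) - n_keep))
        schedule.append(sorted(omit))
    return schedule
```

Each entry of the returned list can then be dropped from the dataset before
running the full cross-validation for that replication.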

hope this makes sense,

Joset A. Etzel, Ph.D.
Research Analyst
Cognitive Control & Psychopathology Lab
Washington University in St. Louis
