[pymvpa] Balancing with searchlight and statistical issues.
Jo Etzel
jetzel at wustl.edu
Tue Mar 1 15:00:43 UTC 2016
Here's a response to the second part of your question:
On 2/29/2016 11:30 AM, Roberto Guidotti wrote:
> Also, you say the dataset is unbalanced, but has 12 runs, each with
> 10 trials, half A and half B. That sounds balanced to me
>
> I classified the motor response in a few subjects with good accuracies, but
> now I would like to decode the decision, since it is a decision task, which is
> the main reason why my dataset is unbalanced. The stimuli are balanced,
> since the subject views half A and half B, but they have to respond whether the
> stimulus is A or B, so I could have runs with unbalanced
> conditions (e.g. 8 A vs 2 B, etc.).
I see; you're classifying decisions, not stimuli, and the people's
decisions were unbalanced. (As far as the classifier is concerned, the
balanced stimuli are totally irrelevant; it's the labels (decisions,
here) that matter.)
Classifying with an imbalanced training set is not at all a good idea in
most cases; you'll need to balance it so that you have equal numbers of
each class. I'll try to get a demo up with more explanation, but the
short version is that linear SVMs (and many other common MVPA
algorithms) are exquisitely sensitive to imbalance: a training set with
21 examples of one class and 20 of the other can produce seriously skewed
results.
While there are ways to adjust example weighting and the like, with fMRI
datasets I generally recommend subsetting examples for balance instead.
Since you have 12 runs, you might find that the balance is a bit closer
if you do leave-two-runs-out (or even three or four) instead of
leave-one-run-out cross-validation.
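Before committing to a scheme, it can help to tabulate the training-set class
counts that each candidate partitioning would produce. Here's a minimal sketch
in plain Python/numpy (not PyMVPA); the per-run counts and the particular run
pairing are made up purely for illustration:

import numpy as np

# hypothetical counts of "decision A" trials in each of 12 runs of 10 trials
# (made-up numbers, purely for illustration)
n_A = np.array([8, 4, 6, 7, 5, 3, 6, 8, 4, 5, 7, 6])
n_trials = 10  # trials per run

def report_training_balance(folds):
    """For each fold (a tuple of test-run indices), print how many A and B
    decisions end up in the corresponding training set."""
    all_runs = np.arange(len(n_A))
    for test_runs in folds:
        train_runs = np.setdiff1d(all_runs, test_runs)
        a = n_A[train_runs].sum()
        b = n_trials * len(train_runs) - a
        print("test runs %s: training set has %d A vs %d B"
              % (list(test_runs), a, b))

# leave-one-run-out: each run is its own test fold
report_training_balance([(r,) for r in range(12)])

# one possible leave-two-runs-out pairing; compare its training-set counts
# against the leave-one-run-out folds above before picking a scheme
report_training_balance([(0, 5), (1, 7), (3, 8), (4, 10), (2, 9), (6, 11)])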
Say you have 21 of one class and 20 of the other in a training set.
You'll then want to remove one of the larger class (at random), so that
there are 20 examples of each class. To make sure you didn't happen to
remove a "weird" example (so that your results don't depend entirely on
which example was removed), the balancing process should be repeated
several times (e.g. 5 or 10, depending on how serious the imbalance is)
and the results averaged over those replications.
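Here's a rough sketch of that repeated random subsetting, again in plain
numpy; run_classifier is just a placeholder for whatever classification call
you actually use:

import numpy as np

rng = np.random.RandomState(42)  # fixed seed so the omissions are reproducible

def balanced_subset_indices(labels, n_replications=10):
    """Return one index array per replication; each array keeps an equal
    number of examples per class by randomly dropping extras from the
    larger class."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()
    replications = []
    for _ in range(n_replications):
        keep = []
        for c in classes:
            idx = np.where(labels == c)[0]
            keep.append(rng.choice(idx, size=n_keep, replace=False))
        replications.append(np.sort(np.concatenate(keep)))
    return replications

# e.g. 21 examples of class "a" and 20 of class "b" in one training set
labels = np.array(["a"] * 21 + ["b"] * 20)
subsets = balanced_subset_indices(labels, n_replications=5)

# classify each balanced subset (the classifier call is schematic here),
# then average accuracy over the replications:
# accuracies = [run_classifier(data[idx], labels[idx]) for idx in subsets]
# mean_accuracy = np.mean(accuracies)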
I don't know how to set it up in pymvpa, but when dealing with
imbalanced datasets my usual practice is to look at how many examples
are present for each person, and figure out a cross-validation scheme
that will minimize the imbalance as much as possible. Then I
precalculate which examples will be omitted for each person in each
replication (e.g., in the first replication leave out the 3rd "A" in run 2;
in the second, omit the 5th). Ideally, I omit examples before
classifying, so that all cross-validation folds will be fully balanced,
then do the classification with that balanced dataset. (This parallels
the idea of dataset-wise permutation testing - first balance the
dataset, then do the cross-validation.)
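I can't give the exact PyMVPA incantation (it ships a Balancer generator that
may do something along these lines; check its docs), but the idea in plain
numpy looks roughly like this. Balancing within each run is one way to
guarantee that every fold of a runs-based cross-validation comes out balanced:

import numpy as np

rng = np.random.RandomState(0)

def balance_within_runs(labels, runs):
    """Precompute which examples to keep so that every run has equal numbers
    of each class; apply the mask before cross-validation so every fold
    built from whole runs is automatically balanced. (Assumes each run
    contains at least one trial of each class.)"""
    labels, runs = np.asarray(labels), np.asarray(runs)
    keep = np.zeros(len(labels), dtype=bool)
    for run in np.unique(runs):
        in_run = np.where(runs == run)[0]
        run_labels = labels[in_run]
        classes, counts = np.unique(run_labels, return_counts=True)
        n_keep = counts.min()
        for c in classes:
            idx = in_run[run_labels == c]
            keep[rng.choice(idx, size=n_keep, replace=False)] = True
    return keep

# hypothetical example: 12 runs x 10 trials, decisions unbalanced within runs
runs = np.repeat(np.arange(12), 10)
labels = rng.choice(["A", "B"], size=120, p=[0.6, 0.4])

mask = balance_within_runs(labels, runs)
# dataset[mask] now has equal A/B counts in every run; hand that to your
# usual leave-one-run-out (or leave-two-runs-out) cross-validation, and
# rerun with different random seeds for the balancing replications.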
hope this makes sense,
Jo
--
Joset A. Etzel, Ph.D.
Research Analyst
Cognitive Control & Psychopathology Lab
Washington University in St. Louis
http://mvpa.blogspot.com/