[pymvpa] Balancing with searchlight and statistical issues.

Roberto Guidotti robbenson18 at gmail.com
Wed Mar 2 10:27:14 UTC 2016

Thank you for the response,

> I don't see any benefit in balancing the data within each run. You will just
> lose lots of valuable data. Depending on your approach, you might want to
> balance the data within the whole training set.
> Since you are using less data for training, your accuracy will be biased
> towards the chance level. However, it will be more stable, but I am not sure
> how much of a difference it can make on a subject level or a group level.
> If computational time is an issue and you don't want to use LORCV, you can
> just use 5-fold CV without testing on every combination of runs.

Yes, that's true. I also thought about balancing the whole dataset and using
a different CV scheme. My concern was about the independence of the
examples, but since I am using betas I don't think that is a problem. I only
need to implement it in pymvpa!

> You want to have one mean accuracy map per subject and use those maps to
> create an average group map.

I usually do that as an exploratory analysis: I look at the group mean to
understand whether there are some interesting areas or the group map is completely

> As Jo said, which statistical method to use is the last thing you should worry
> about now. Just check whether your accuracy maps and group accuracy maps are
> reasonable. In general, you don't want to use a t-test because virtually none
> of its assumptions are met. It also has much lower power than Stelzer's
> method, which you don't have to use, but at least stay away from the t-test.

Ok I will stay away from it!!! :)

> > The group mean histogram looks like a Gaussian peaked at 0.57 accuracy, I
> > think it is reasonable.
> No, that's not good. It should look like a Gaussian peaked at the chance
> level/majority class level with a heavy tail. So peaked at 0.5 if you have
> a balanced dataset, and higher if it's unbalanced. Also, you should not test
> against the chance level but against the proportion of the majority class, and
> you should consider using a different metric than accuracy (ROC, balanced
> accuracy, F-score, something else).

I would like to dig into this. I have a balanced dataset, so do you think
it is because of the analysis, e.g. the cross-validation scheme (as I suspect),
because of the metric, or because of the data?
I will explore it using your valuable suggestions, but it would be good
to discuss it! ;)
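
Just to fix the idea about the metric, here is a small sketch in plain Python
(again, not PyMVPA code) comparing plain accuracy with balanced accuracy, i.e.
the mean of the per-class recalls. The labels are a made-up example of a
classifier that always predicts the majority class on an 8 A vs 2 B test set;
the function names are my own.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical unbalanced test set and a classifier that only predicts "A".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 10

print(accuracy(y_true, y_pred))           # 0.8, equal to the majority proportion
print(balanced_accuracy(y_true, y_pred))  # 0.5, i.e. no real discrimination
```

Here the plain accuracy only reflects the proportion of the majority class,
while the balanced accuracy stays at 0.5.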

> > OT: I always thought that SVM was not so sensitive to unbalancing,
> > because it uses only a few samples as support vectors!
> With unbalanced data, the optimal hyperplane will by chance misclassify more
> samples from the majority class than from the minority class; therefore, to
> minimize the error, it will be moved towards the minority group to include
> more of the majority samples. Predictions of an SVM are therefore often biased
> towards the majority class, and it's not uncommon to have an SVM that
> predicts only the majority class. The number of support vectors has nothing
> to do with it. Also, with high-dimensional data or a complicated decision
> boundary, it's possible that the majority of your samples, or even all of
> them, are support vectors.

Definitely true. Moreover, I often forget that we have to deal with the
curse of dimensionality and with high-dimensional data where all (or the
majority of) samples are support vectors. Thank you (and Jo) for the

> BW,
> Richard

Thank you,

> On Tue, Mar 1, 2016 at 6:43 PM, Roberto Guidotti <robbenson18 at gmail.com>
> wrote:
>> So, thank you Jo for your response, and sorry that I didn't explain
>> my strategy clearly.
>> I balanced the dataset within runs, so if I have 8A and 2B, after
>> balancing I will have 2A and 2B chosen randomly (by pymvpa). Since I could
>> have some highly unbalanced runs (2A vs 2B), I decided to use a two-run-out
>> cross-validation in order to have more samples in the testing set, and thus a
>> less biased accuracy (with 2 samples per class, I can have 0, 0.5, 1
>> accuracies). However, I did not replicate the balancing process, because that
>> would definitely increase the computational time (even with a two-run-out
>> cross-validation).
>> So do you suggest using more balanced dataset replications and a
>> leave-one-run-out cross-validation?
>> Do you think that a data-oriented balancing (e.g. removing beta
>> images that are not similar to the average image) would work, or would I be
>> introducing some other bias?
>> OT: I always thought that SVM was not so sensitive to unbalancing, because
>> it uses only a few samples as support vectors!
>> Thank you,
>> Roberto
>> On 1 March 2016 at 16:00, Jo Etzel <jetzel at wustl.edu> wrote:
>>> Here's a response to the second part of your question:
>>> On 2/29/2016 11:30 AM, Roberto Guidotti wrote:
>>>>     Also, you say the dataset is unbalanced, but has 12 runs, each with
>>>>     10 trials, half A and half B. That sounds balanced to me
>>>> In a few subjects I classified the motor response with good accuracies, but
>>>> now I would like to decode the decision, since it is a decision task, which is
>>>> the main reason why my dataset is unbalanced. Stimuli are balanced,
>>>> since the subject views half A and half B, but he has to respond whether the
>>>> stimulus is A or B, thus I could have runs with unbalanced
>>>> conditions (e.g. 8 A vs 2 B, etc.).
>>> I see; you're classifying decisions, not stimuli, and the people's
>>> decisions were unbalanced. (As far as the classifier is concerned, the
>>> balanced stimuli are totally irrelevant; it's the labels (decisions, here)
>>> that matter.)
>>> Classifying with an imbalanced training set is not at all a good idea in
>>> most cases; you'll need to balance it so that you have equal numbers of
>>> each class. I'll try to get a demo up with more explanation, but the short
>>> version is that linear SVMs (and many other common MVPA algorithms) are
>>> exquisitely sensitive to imbalance: a training set with 21 of one class and
>>> 20 of the other can produce seriously skewed results.
>>> While there are ways to adjust example weighting, etc, with fMRI
>>> datasets I generally recommend subsetting examples for balance instead.
>>> Since you have 12 runs, you might find that the balance is a bit closer if
>>> you do leave-two-runs-out (or even three or four) instead of
>>> leave-one-run-out cross-validation.
>>> Say you have 21 of one class and 20 of the other in a training set.
>>> You'll then want to remove one of the larger class (at random), so that
>>> there are 20 examples of both classes. To make sure you didn't happen to
>>> remove a "weird" example (and so your results were totally dependent on
>>> which example was removed), the balancing process should be repeated
>>> several times (e.g. 5, 10, depending on how serious the imbalance is) and
>>> results averaged over those replications.
>>> I don't know how to set it up in pymvpa, but when dealing with
>>> imbalanced datasets my usual practice is to look at how many examples are
>>> present for each person, and figure out a cross-validation scheme that will
>>> minimize the imbalance as much as possible. Then I precalculate which
>>> examples will be omitted in each person for each replication (e.g., in the
>>> first replication, leave out the 3rd "A" in run 2; in the second replication,
>>> omit the 5th). Ideally, I omit examples before classifying, so that all
>>> cross-validation folds will be fully balanced, then do the classification
>>> with that balanced dataset. (This parallels the idea of dataset-wise
>>> permutation testing - first balance the dataset, then do the
>>> cross-validation.)
>>> hope this makes sense,
>>> Jo
>>> ____________________________________________
>>>> Pkg-ExpPsy-PyMVPA mailing list
>>>> Pkg-ExpPsy-PyMVPA at lists.alioth.debian.org
>>>> http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-exppsy-pymvpa
>>> --
>>> Joset A. Etzel, Ph.D.
>>> Research Analyst
>>> Cognitive Control & Psychopathology Lab
>>> Washington University in St. Louis
>>> http://mvpa.blogspot.com/