[pymvpa] Balancing with searchlight and statistical issues.

Richard Dinga dinga92 at gmail.com
Tue Mar 1 22:54:12 UTC 2016


> I balanced the dataset within runs, so if I have 8A and 2B, after
balancing I will have 2A and 2B chosen randomly (by pymvpa), since I could
have some highly unbalanced runs (2A vs 2B)

I don't see a benefit in balancing the data within each run; you will just
lose a lot of valuable data. Depending on your approach, you might want to
balance the data within the whole training set instead.
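Something like the sketch below is, I believe, the usual PyMVPA pattern for
that (untested, attribute names and counts are just examples): partition into
folds first, then let a Balancer draw a few random balanced subsets within
each partition, instead of throwing samples away per run.

    import numpy as np
    from mvpa2.suite import (NFoldPartitioner, Balancer, ChainNode,
                             CrossValidation, LinearCSVMC)

    # partition into folds first, then balance the classes within each
    # partition of every fold, not within each run; count=5 draws 5 random
    # balanced subsets per fold, whose results get averaged later
    partitioner = ChainNode([NFoldPartitioner(cvtype=2),
                             Balancer(attr='targets',
                                      count=5,
                                      limit='partitions',
                                      apply_selection=True)],
                            space='partitions')

    cv = CrossValidation(LinearCSVMC(), partitioner,
                         errorfx=lambda p, t: np.mean(p == t))
    # res = cv(ds)   # ds being your usual dataset with targets/chunks set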

> I decided to use a two run out cross-validation, in order to have more
samples in the testing set, thus a less biased accuracy (with 2 samples per
class, I can have 0, 0.5, 1 accuracies)
Since you are using less data for training, your accuracy will be biased
towards the chance level. It will be more stable, though, and I am not sure
how much of a difference that makes at the subject or the group level.
If computational time is an issue and you don't want to use leave-one-run-out
CV, you can just use 5-fold CV without testing every combination of runs.
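For example (untested sketch, 'runfold' is a made-up attribute name; with 12
runs, 6 folds of two runs divide more evenly than 5 folds), you can group the
runs into folds and partition on that grouping rather than enumerating every
pair of runs:

    import numpy as np
    from mvpa2.suite import NFoldPartitioner

    # map the 12 runs onto 6 folds of 2 runs each (or 4 folds of 3, etc.)
    runs = np.unique(ds.sa.chunks)
    fold_of_run = dict(zip(runs, np.arange(len(runs)) % 6))
    ds.sa['runfold'] = [fold_of_run[c] for c in ds.sa.chunks]

    # leave-one-fold-out CV over the grouped runs
    partitioner = NFoldPartitioner(attr='runfold')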

> I did an average map of the 66 cross-validation maps I obtained for each
subject;
You want to have one mean accuracy map per subject and use those maps to
create the average group map.
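A plain numpy/nibabel sketch of that averaging (the file lists
subject_fold_files and subject_mean_files are hypothetical placeholders for
your own filenames):

    import numpy as np
    import nibabel as nib

    # per subject: collapse the per-fold searchlight maps into one mean map
    fold_imgs = [nib.load(f) for f in subject_fold_files]   # 66 per-fold NIfTIs
    subj_mean = np.mean([img.get_fdata() for img in fold_imgs], axis=0)
    nib.save(nib.Nifti1Image(subj_mean, fold_imgs[0].affine),
             'sub01_mean_accuracy.nii.gz')

    # group level: average the per-subject mean maps
    group_mean = np.mean([nib.load(f).get_fdata()
                          for f in subject_mean_files], axis=0)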

> to do a first exploratory analysis I did a simple t-test versus chance
level (I didn't do Stelzer's method because of the computational time)
and I had almost all voxels significant (uncorrected), because of the
L2ROCV, I think. So do you think I can do other more robust statistical
tests using these maps? Or do I have to use Stelzer's method?

As Jo said, which statistical method to use is the last thing you should worry
about now. Just check whether your accuracy maps and group accuracy maps are
reasonable. In general, you don't want to use a t-test because virtually none
of its assumptions are met, and it also has much lower power than Stelzer's
method. You don't have to use Stelzer's method, but at least stay away from
the t-test.

> Or throw away the searchlight maps?
If you can focus on some a priori selected ROI, then you can totally do it,
you will have more power and save yourself lots of trouble.

> The group mean histogram looks like a Gaussian peaked at 0.57 accuracy, I
think it is reasonable.
No, that's not good. It should look like a Gaussian peaked at the chance
level (or the majority-class proportion) with a heavy tail: peaked at 0.5 if
your dataset is balanced, and higher if it is unbalanced. Also, you should not
test against the nominal chance level but against the proportion of the
majority class, and you should consider using a different metric than accuracy
(ROC AUC, balanced accuracy, F-score, or something else).
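For instance (a rough sketch; y_true, y_pred and decision_values stand in for
your true labels, predicted labels, and graded classifier outputs), the
baseline and some imbalance-robust metrics could be computed like this:

    import numpy as np
    from sklearn.metrics import roc_auc_score, f1_score

    y_true = np.asarray(y_true)                  # true decision labels (0/1)
    y_pred = np.asarray(y_pred)                  # predicted labels

    # baseline to beat: proportion of the majority class, not 0.5
    majority_baseline = max(np.mean(y_true), 1 - np.mean(y_true))

    # balanced accuracy = mean of the per-class recalls,
    # insensitive to the class proportions
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    balanced_acc = np.mean(recalls)

    auc = roc_auc_score(y_true, decision_values) # needs graded outputs,
                                                 # e.g. SVM decision values
    f1 = f1_score(y_true, y_pred)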

> OT: I always thought that SVM was not so sensitive to imbalance, because
it uses only a few samples as support vectors!
With unbalanced data, the optimal hyperplane would by chance misclassify more
samples from the majority class than from the minority class, so to minimize
the error it gets moved towards the minority class to include more of the
majority samples. The predictions of an SVM are therefore often biased towards
the majority class, and it's not uncommon to end up with an SVM that predicts
only the majority class. The number of support vectors has nothing to do with
it; with high-dimensional data or a complicated decision boundary, it's
possible that most of your samples, or even all of them, are support vectors.
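A toy scikit-learn illustration of that shift (nothing PyMVPA-specific, the
numbers are made up): train a linear SVM on overlapping classes with a 90/10
imbalance and it will call most of a balanced test set the majority class.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)

    # heavily overlapping classes: 90 majority vs 10 minority training samples
    X_train = np.vstack([rng.randn(90, 2), rng.randn(10, 2) + 0.5])
    y_train = np.r_[np.zeros(90), np.ones(10)]

    clf = SVC(kernel='linear', C=1.0).fit(X_train, y_train)

    # a *balanced* test set drawn from the same two distributions
    X_test = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 0.5])
    y_test = np.r_[np.zeros(100), np.ones(100)]
    pred = clf.predict(X_test)

    print('fraction predicted as minority class: %.2f' % np.mean(pred == 1))
    # usually far below the true 0.5; class_weight='balanced' or subsetting
    # the training data are common remedies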

BW,
Richard

On Tue, Mar 1, 2016 at 6:43 PM, Roberto Guidotti <robbenson18 at gmail.com>
wrote:

> So, thank you Jo for your response, and sorry that I didn't explain
> my strategy clearly.
>
> I balanced the dataset within runs, so if I have 8A and 2B, after
> balancing I will have 2A and 2B chosen randomly (by pymvpa), since I could
> have some highly unbalanced runs (2A vs 2B). I decided to use a two-run-out
> cross-validation, in order to have more samples in the testing set, thus a
> less biased accuracy (with 2 samples per class, I can have 0, 0.5, 1
> accuracies), but I did not replicate the balancing process, because it would
> definitely increase the computational time (especially with a two-run-out
> cross-validation).
>
> So do you suggest using more balanced dataset replications and a
> leave-one-run-out cross-validation?
> Do you think that a data-driven balancing (e.g. removing the beta images
> that are least similar to the average image) would work, or would I be
> introducing some other bias?
>
> OT: I always thought that SVM was not so sensitive to imbalance, because
> it uses only a few samples as support vectors!
>
> Thank you,
> Roberto
>
> On 1 March 2016 at 16:00, Jo Etzel <jetzel at wustl.edu> wrote:
>
>> Here's a response to the second part of your question:
>>
>> On 2/29/2016 11:30 AM, Roberto Guidotti wrote:
>>
>>>     Also, you say the dataset is unbalanced, but has 12 runs, each with
>>>     10 trials, half A and half B. That sounds balanced to me
>>>
>>> In a few subjects I classified the motor response with good accuracies, but
>>> now I would like to decode the decision, since it is a decision task, which
>>> is the main reason why my dataset is unbalanced. Stimuli are balanced,
>>> since the subject views half A and half B, but he has to respond whether the
>>> stimulus is A or B, thus I could have runs with unbalanced
>>> conditions (e.g. 8 A vs 2 B, etc.).
>>>
>>
>> I see; you're classifying decisions, not stimuli, and the people's
>> decisions were unbalanced. (As far as the classifier is concerned, the
>> balanced stimuli are totally irrelevant; it's the labels (decisions, here)
>> that matter.)
>>
>> Classifying with an imbalanced training set is not at all a good idea in
>> most cases; you'll need to balance it so that you have equal numbers of
>> each class. I'll try to get a demo up with more explanation, but the short
>> version is that linear SVMs (and many other common MVPA algorithms) are
>> exquisitely sensitive to imbalance: a training set with 21 of one class and
>> 20 of the other can produce seriously skewed results.
>>
>> While there are ways to adjust example weighting, etc, with fMRI datasets
>> I generally recommend subsetting examples for balance instead. Since you
>> have 12 runs, you might find that the balance is a bit closer if you do
>> leave-two-runs-out (or even three or four) instead of leave-one-run-out
>> cross-validation.
>>
>> Say you have 21 of one class and 20 of the other in a training set.
>> You'll then want to remove one of the larger class (at random), so that
>> there are 20 examples of both classes. To make sure you didn't happen to
>> remove a "weird" example (and so your results were totally dependent on
>> which example was removed), the balancing process should be repeated
>> several times (e.g. 5, 10, depending on how serious the imbalance is) and
>> results averaged over those replications.
>>
>> I don't know how to set it up in pymvpa, but when dealing with imbalanced
>> datasets my usual practice is to look at how many examples are present for
>> each person, and figure out a cross-validation scheme that will minimize
>> the imbalance as much as possible. Then I precalculate which examples will
>> be omitted for each person in each replication (e.g., the first replication
>> leaves out the 3rd "A" in run 2, the second replication omits the 5th).
>> Ideally, I omit examples before classifying, so that all cross-validation
>> folds will be fully balanced, then do the classification with that balanced
>> dataset. (This parallels the idea of dataset-wise permutation testing -
>> first balance the dataset, then do the cross-validation.)
>>
>> hope this makes sense,
>> Jo
>>
>>
>>
>>
>>
>> --
>> Joset A. Etzel, Ph.D.
>> Research Analyst
>> Cognitive Control & Psychopathology Lab
>> Washington University in St. Louis
>> http://mvpa.blogspot.com/
>>

