[pymvpa] Unbalanced Datasets

Taku Ito taku.ito1 at gmail.com
Wed Apr 22 21:10:54 UTC 2015


I have tried weighting unbalanced datasets before, using another language
or package, but it was so long ago that, unfortunately, I can't find the
code for it.

From what I recall, you essentially assign each class label a weight
inversely proportional to its sample size. In other words, the class with
more samples is weighted proportionally less (I believe I weighted relative
to 1), and the class with fewer samples is weighted proportionally more.
From what I remember, though, this method did no better than simply
sub-sampling the larger class or oversampling the smaller one, although
this may depend on other parameters (such as overall class sizes).
(Altering weights also did quite poorly for extremely unbalanced datasets.)
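
For illustration, here is a minimal sketch of that inverse-frequency scheme
with PyMVPA's LinearCSVMC. The weight and weight_label parameters are the
ones mentioned below (they get passed through to libsvm), but note that
weight_label expects the numeric labels libsvm sees internally, so the
0/1 indices below are an assumption; check how your targets get mapped
before trusting them. ds is a placeholder for your dataset.

    import numpy as np
    from mvpa2.suite import LinearCSVMC

    # Count samples per class; ds.targets is assumed to hold the labels.
    labels, counts = np.unique(ds.targets, return_counts=True)

    # Weight each class inversely to its size, relative to 1 for the
    # largest class (e.g. a 3:1 imbalance gives weights [1.0, 3.0]).
    weights = counts.max() / counts.astype(float)

    # Assumption: internal numeric labels are 0, 1, ... in np.unique order.
    clf = LinearCSVMC(weight_label=list(range(len(labels))),
                      weight=list(weights))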

While sub-sampling the larger class may be unstable (as Jo mentioned), I've
gotten decent results by bootstrapping (sampling with replacement) samples
from both classes up to n samples each, where n is larger than the number
of samples in the largest class (oftentimes I've made n quite large for
robust results). As long as your smallest class has enough samples, I
think this method could prove useful. At the very least this method
won't bias your classifier.
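
Here's a quick sketch of that bootstrap in plain numpy; X and y are
placeholders for your samples and labels:

    import numpy as np

    rng = np.random.RandomState(0)
    n = 500  # samples per class; pick n larger than your biggest class

    boot_idx = []
    for label in np.unique(y):
        class_idx = np.where(y == label)[0]
        # draw n samples from this class, with replacement
        boot_idx.append(rng.choice(class_idx, size=n, replace=True))
    boot_idx = np.concatenate(boot_idx)

    X_boot, y_boot = X[boot_idx], y[boot_idx]  # balanced bootstrap set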

Alternatively, if you simply want your training set to include all samples,
you could run a permutation test instead. While the imbalance will bias
your classifier and most likely inflate your accuracy rates, if you build
a null distribution with enough iterations, you can still run significance
testing against it.
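
If memory serves, PyMVPA can estimate that null distribution with
AttributePermutator and MCNullDist, roughly as in the tutorial; this is an
untested sketch, and the classifier, partitioning, and iteration count are
whatever suits your data (ds is again a placeholder):

    from mvpa2.suite import (AttributePermutator, MCNullDist,
                             CrossValidation, NFoldPartitioner,
                             LinearCSVMC, mean_mismatch_error, mean_sample)

    clf = LinearCSVMC()
    # Shuffle the target labels (within chunks) to build the null.
    permutator = AttributePermutator('targets', count=1000, limit='chunks')
    null_dist = MCNullDist(permutator, tail='left',
                           enable_ca=['dist_samples'])

    cv = CrossValidation(clf, NFoldPartitioner(),
                         errorfx=mean_mismatch_error,
                         postproc=mean_sample(),
                         null_dist=null_dist)
    err = cv(ds)
    p = cv.ca.null_prob  # probability of the observed error under the null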

Taku

On Wed, Apr 22, 2015 at 3:20 PM, J.A. Etzel <jetzel at wustl.edu> wrote:

> I wonder if the lack of responses is because people (myself included)
> don't use weighting for fMRI datasets, but rather balance through
> subsetting and experimental design ... has anyone used (or ever tried)
> weighting unbalanced datasets?
>
> I've never tried analyzing a dataset as badly balanced (3 to 1) as your
> example; subsetting is certainly very unstable in this case. Perhaps you
> can reduce the imbalance by changing the cross-validation partitioning
> (e.g. leave 2 runs out instead of 1, or partition on subjects)?
>
> Jo
>
>
>
> On 4/22/2015 12:43 PM, Bill Broderick wrote:
>
>> Hi all,
>>
>> I think my first question was broader than it needed to be, so hopefully
>> this is more to the point.
>>
>> I'm trying to run MVPA on a classification problem with unbalanced
>> classes, using a linear SVM, and would like to weight the error signals
>> to correct for the imbalance. With PyMVPA's LinearCSVMC
>> (http://www.pymvpa.org/generated/mvpa2.clfs.svm.LinearCSVMC.html), it
>> looks like there are weight and weight_label parameters that would do
>> what I'd like, but I cannot find any usage examples. Can someone
>> provide one?
>>
>> For example, if I have a dataset with three times as many examples in
>> class A as in class B, how would I set up the LinearCSVMC to weight the
>> error in class B as three times larger?
>>
>> Thanks,
>> William
>>
>>