[pymvpa] Class weighting in unbalanced EEG data

Yaroslav Halchenko debian at onerussian.com
Thu Apr 28 01:44:19 UTC 2016


On Wed, 27 Apr 2016, Ben McCartney wrote:

>    Hello,



>    I am running an EEG experiment which involves presenting one of two images
>    on the screen and recording the participant's neural activation. Right now
>    I'm writing a script to take these recordings and attempt to decode from
>    the EEG data which image was displayed. I have had success with similar
>    scripts using PyMVPA's PLR; however, in those cases the classes were
>    balanced, whereas in the experiment I'm analysing now the ratio is closer
>    to 1:4. The initial approach I had in mind was to use asymmetrical error
>    weights to penalize misclassification of the less numerous class more
>    heavily than the dominant class. I understand that there are various other
>    approaches to handling unbalanced classes, such as over/undersampling or
>    using a classifier more robust in this regard such as random forests, but
>    this seemed to me the cleanest way to start.



>    I tried the class_weight parameter in scikit-learn as per this tutorial
>    http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html,
>    and it worked well for the data in that example, but it failed to make any
>    difference when I used my EEG data. Upon increasing the number of
>    dimensions in the sample data (changing the 2 in (n_samples_1, 2)), I
>    noticed the difference between the predictions of the classifier with
>    weights and the classifier without weights diminish. By the time I had
>    reached 7 features per training sample the predictions were identical.
>    Does this parameter just not work well with many dimensions? My EEG data
>    has around 200 trials (samples), and each sample has 180 datapoints in
>    each of 20 electrode channels, so my training data is shaped 200 * 3600.
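
(For reference, the class_weight setup you describe would look roughly
like the sketch below -- the random data, integer labels and the
{0: 4, 1: 1} weights are just illustrative stand-ins for your 200 * 3600
matrix and 1:4 ratio, nothing tuned.)

    import numpy as np
    from sklearn.svm import SVC

    # toy stand-in for a ~200 x 3600 EEG feature matrix with a 1:4 ratio
    rng = np.random.RandomState(0)
    X = rng.randn(200, 3600)
    y = np.r_[np.zeros(40, dtype=int), np.ones(160, dtype=int)]

    # penalize errors on the rare class (label 0) four times more heavily
    clf = SVC(kernel='linear', class_weight={0: 4, 1: 1})
    clf.fit(X, y)

    # check how often each class actually gets predicted -- with weak or
    # pure-noise features the dominant class can still swamp the output
    print(np.bincount(clf.predict(X)))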

Quick one:  GNB surprisingly often provides reasonable generalization
performance, is fast, and by default uses a Laplacian-smoothed prior
scaled by the per-class ratio of data points in the training data.  In
general it doesn't really care about the imbalance, since it relies
solely on per-class mean/std statistics, but with this default prior
(thanks, Bayes) it might lead to better overall performance than with a
uniform prior.

It quite often provides reasonable results even in cases of super-heavy
imbalance (I have had 1:100 for binary classification), so it might be
worth at least giving it a shot.
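
A minimal sketch of trying it, assuming you already have a PyMVPA
dataset ds built from your 200 * 3600 matrix, with .targets (which
image) and .chunks (e.g. runs) set up for cross-validation:

    from mvpa2.clfs.gnb import GNB
    from mvpa2.generators.partition import NFoldPartitioner
    from mvpa2.measures.base import CrossValidation

    # GNB's default prior applies Laplacian smoothing to the class
    # frequencies in the training data; prior='uniform' would ignore
    # the 1:4 ratio entirely
    clf = GNB()
    # clf = GNB(prior='uniform')   # for comparison

    cv = CrossValidation(clf, NFoldPartitioner(), enable_ca=['stats'])
    results = cv(ds)

    # with 1:4 classes look at the full confusion matrix, not just the
    # overall error
    print(cv.ca.stats.as_string(description=True))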

-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834                       Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik        


