[pymvpa] Dataset with multidimensional feature vector per voxel

Ulrike Kuhl kuhl at cbs.mpg.de
Thu Nov 19 09:40:31 UTC 2015

Dear Yaroslav, dear all,

hooray for simulations! :-)

I was not aware of the profound effect on classification performance if the groups are not perfectly balanced.
My tests using the 'npart' partitioner on clean and noisy test data showed the expected result (accuracy of 0.5 for non-signal voxels, 1 for the others). Cool!

Still, two questions remain:
a) Can I inspect what the individual partitions look like (i.e. which subject is additionally removed to make the groups balanced)?

b) How do I deal with groups that have a larger imbalance? I've already tried this with my dummy data: if I feed a dataset with imbalanced group sizes into the classification with the 'npart' partitioner, the result is random classification at all voxels.
In my original data I have more participants in the second group than in the first, so for each partition I would need to restrict the size of the second group to the size of the first. My idea was to take everyone from group 1 and randomly pick the same number of participants from group 2 - what's the best way to implement this?
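
The subsampling idea described above could be sketched roughly as follows. This is plain NumPy rather than a PyMVPA mechanism; the function name `balanced_subsample` and the group sizes (10 vs 14) are made up for illustration only.

```python
import numpy as np

def balanced_subsample(targets, rng):
    """Return sorted sample indices with equally many samples per target."""
    targets = np.asarray(targets)
    labels, counts = np.unique(targets, return_counts=True)
    n_keep = counts.min()  # size of the smallest group
    keep = []
    for label in labels:
        idx = np.flatnonzero(targets == label)
        # the smallest group is kept whole, larger groups are randomly thinned
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))

# hypothetical example: 10 participants in group 0, 14 in group 1
targets = np.array([0] * 10 + [1] * 14)
rng = np.random.RandomState(0)
selected = balanced_subsample(targets, rng)
print(len(selected))
```

Repeating the draw with different seeds and averaging the results would reduce the dependence on one particular random selection from the larger group.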

Thanks a lot!

----- Original Message -----
From: "Yaroslav Halchenko" <debian at onerussian.com>
To: "pkg-exppsy-pymvpa" <pkg-exppsy-pymvpa at lists.alioth.debian.org>
Sent: Tuesday, 17 November, 2015 16:18:20
Subject: Re: [pymvpa] Dataset with multidimensional feature vector per voxel

On Tue, 17 Nov 2015, Ulrike Kuhl wrote:

> Here you go:

> print DS_clean.summary()

> Dataset: 20x3375 at float32, <sa: chunks,subject,targets>, <fa: modality,modality_index,voxel_indices>, <a: mapper>
> stats: mean=0.006 std=0.0704271 var=0.00495998 min=0 max=1

> Counts of targets in each chunk:
>   chunks\targets  0   1
>                  --- ---
>         0         1   0
>         1         1   0
>         2         1   0
>         3         1   0
>         4         1   0
>         5         1   0
>         6         1   0
>         7         1   0
>         8         1   0
>         9         1   0
>        10         0   1
>        11         0   1
>        12         0   1
>        13         0   1
>        14         0   1
>        15         0   1
>        16         0   1
>        17         0   1
>        18         0   1
>        19         0   1

so the problem is that in each chunk you have only one sample and
overall you have only 20 samples to train on.  Whenever you NFold
partition it, you end up with 10 samples of one target and 9 of
another for training.  If there is a clear signal, error is minimized by
correct labeling.  If there is no signal, error is minimized by just always
saying "class with majority of samples (10 vs 9)", which then always leads to
misclassification of the held-out sample since it is of the opposite class.
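
A toy illustration of this effect, assuming a classifier that has found no signal and therefore simply predicts the majority class of its training set (a stand-in for illustration, not the classifier used in the thread):

```python
from collections import Counter

def majority_vote_loo_accuracy(targets):
    """Leave-one-out accuracy of a 'predict the training majority' rule."""
    correct = 0
    for i, held_out in enumerate(targets):
        train = targets[:i] + targets[i + 1:]
        # with 10 vs 10 samples, removing one always leaves a 10 vs 9 split
        predicted = Counter(train).most_common(1)[0][0]
        correct += (predicted == held_out)
    return correct / float(len(targets))

targets = [0] * 10 + [1] * 10  # 20 samples, one per chunk
accuracy = majority_vote_loo_accuracy(targets)
print(accuracy)  # 0.0: every held-out sample belongs to the training minority
```

So with no signal the accuracy drops to 0 rather than to the chance level of 0.5.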

The fun part: if you had just run it on real data, most probably you would
have got some strong negative bias but possibly still some reasonable,
around-chance performances... and then would have scratched your head a lot.
So -- simulations rule! ;)

The simplest way to handle it: guarantee balanced number of samples from
both categories in training (and thus testing) splits.

There are two ways then to do it:

1. simplest but more ad-hoc.  Regroup the samples so that each chunk brings
together one sample from each class (two samples per chunk), so you end up
with 10 chunks and thus 10 splits if using NFold(1)

2. create a partitioner which would select all possible combinations
from the two classes, i.e. have 10*10=100 splits.

  Two ways to do it 

  a. with the existing codebase something like this should work

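    from mvpa2.node import ChainNode
    from mvpa2.generators.base import Sifter
    from mvpa2.generators.partition import NFoldPartitioner

    npart = ChainNode([
        ## take every combination of two chunks for testing
        NFoldPartitioner(cvtype=2, attr='chunks'),
        ## so it should select only those splits where we took 1 from
        ## each of the targets categories leaving things in balance
        Sifter([('partitions', 2),
                ('targets', {'uvalues': ds.sa['targets'].unique,
                             'balanced': True})])
    ], space='partitions')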

 In your case this does NFold(2) across chunks, i.e. selects every
combination of two chunks, but then uses only those splits which have
balanced targets (Sifter removes the others).

 b. WiP
  to simplify this so it would look like

   factpart = FactorialPartitioner(

N.B. Matteo -- one more testcase to test! ;)
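
The split counts for the two options above can be checked with a small sketch in plain Python (no PyMVPA involved). The chunk and target layout follows the dataset summary; pairing chunk i with chunk i + 10 for option 1 is just one arbitrary assumed choice of pairing.

```python
from itertools import combinations

targets = [0] * 10 + [1] * 10  # one sample per chunk, as in the summary
chunks = list(range(20))

# Option 1: merge chunk i (class 0) with chunk i + 10 (class 1)
paired_chunks = [c % 10 for c in chunks]
n_splits_option1 = len(set(paired_chunks))  # 10 chunks, so 10 NFold(1) splits

# Option 2: hold out every pair of chunks, keep only the balanced pairs
all_pairs = list(combinations(chunks, 2))
balanced_pairs = [(a, b) for a, b in all_pairs if targets[a] != targets[b]]

print(n_splits_option1)     # 10
print(len(all_pairs))       # 190 pairs of chunks in total
print(len(balanced_pairs))  # 100 pairs with one sample from each class
```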

Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834                       Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik        

Max Planck Institute for Human Cognitive and Brain Sciences 
Department of Neuropsychology (A219) 
Stephanstraße 1a 
04103 Leipzig 

Phone: +49 (0) 341 9940 2625 
Mail: kuhl at cbs.mpg.de 
Internet: http://www.cbs.mpg.de/staff/kuhl-12160
