[pymvpa] Dataset with multidimensional feature vector per voxel

Bill Broderick billbrod at gmail.com
Thu Nov 19 14:43:22 UTC 2015


I ran into a similar issue with unbalanced classification and wanted
to look at the individual partitions as well. I couldn't figure out
how to do so just in PyMVPA, so I ended up using a separate Python
module, UnbalancedDataset: https://github.com/fmfn/UnbalancedDataset.
With that, I sub-sampled the more common group to balance the two
groups, which created a new dataset. I was then able to investigate
what was going on in that dataset and what each of the partitions
looked like, just as with a regular dataset.
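
In case it helps, this is roughly what that sub-sampling boils down to in
plain numpy (just a sketch, not the UnbalancedDataset API; 'ds' stands for
a PyMVPA dataset):

    import numpy as np

    rng = np.random.RandomState(0)   # fixed seed so the draw is reproducible
    targets = ds.sa.targets
    # size of the smallest group
    n_min = min(np.sum(targets == t) for t in np.unique(targets))
    keep = []
    for t in np.unique(targets):
        idx = np.where(targets == t)[0]
        # randomly pick n_min samples from each group
        keep.extend(rng.choice(idx, n_min, replace=False))
    ds_balanced = ds[sorted(keep)]   # new, balanced dataset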

Is that what you're looking for?
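
And for actually looking at the individual partitions, something along
these lines might also work (an untested sketch; it assumes the 'npart'
chain and 'DS_noisy' dataset from your mail below):

    from collections import Counter

    for i, pds in enumerate(npart.generate(DS_noisy)):
        # sa.partitions marks 1 = training and 2 = testing for this split
        train = pds[pds.sa.partitions == 1]
        test = pds[pds.sa.partitions == 2]
        print('split %i: train %s / test %s'
              % (i, Counter(train.sa.targets), Counter(test.sa.targets)))
        # train.sa.chunks / test.sa.chunks show which subjects ended up where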

On Thu, Nov 19, 2015 at 9:27 AM, Ulrike Kuhl <kuhl at cbs.mpg.de> wrote:
> Dear Yaroslav, dear all,
>
> I might have solved the balancing problem using PyMVPA's 'Balancer' (duh!).
> I extended the code of the partitioner like this:
>
> npart = ChainNode([
>     NFoldPartitioner(len(DS_noisy.sa['targets'].unique),
>                      attr='chunks'),
>     ## so it should select only those splits where we took 1 from
>     ## each of the targets categories leaving things in balance
>     Sifter([('partitions', 2),
>             ('targets',
>              {'uvalues': DS_noisy.sa['targets'].unique,
>               'balanced': True})]),
>     Balancer(attr='targets', count=1, limit='partitions',
>              apply_selection=True)
>     ], space='partitions')
>
>
> The classification result on noisy data looks perfect even with imbalanced group sizes - is it correct to do it like this?
>
> Also, I would still like to know how I can see what the individual partitions look like.
>
> Thanks!
> Ulrike
>
>
> ----- Original Message -----
> From: "kuhl" <kuhl at cbs.mpg.de>
> To: "pkg-exppsy-pymvpa" <pkg-exppsy-pymvpa at lists.alioth.debian.org>
> Sent: Thursday, 19 November, 2015 10:40:31
> Subject: Re: [pymvpa] Dataset with multidimensional feature vector per voxel
>
> Dear Yaroslav, dear all,
>
> hooray for simulations! :-)
>
> I was not aware of the profound effect that imperfectly balanced groups have on classification performance.
> My tests using the 'npart' partitioner on clean and noisy test data showed the expected result (accuracy of 0.5 for non-signal voxels, 1 for the others). Cool!
>
> Still, two questions remain:
> a) Can I see what the individual partitions look like (i.e. which subject is additionally removed to make the groups balanced)?
>
> b) How do I deal with groups that have a larger imbalance? I've tried this with my dummy data already: if I feed a dataset with imbalanced group sizes into the classification with the 'npart' partitioner, the result is random classification at all voxels.
> In my original data I have more participants in the second group than in the first, so I would need to restrict the size of the second group to that of the first for each partition. My idea was to take everyone from group 1 and randomly pick the same number of participants from group 2 - what's the best way to implement this?
>
> Thanks a lot!
> Ulrike
>
> ----- Original Message -----
> From: "Yaroslav Halchenko" <debian at onerussian.com>
> To: "pkg-exppsy-pymvpa" <pkg-exppsy-pymvpa at lists.alioth.debian.org>
> Sent: Tuesday, 17 November, 2015 16:18:20
> Subject: Re: [pymvpa] Dataset with multidimensional feature vector per voxel
>
> On Tue, 17 Nov 2015, Ulrike Kuhl wrote:
>
>> Here you go:
>
>> print DS_clean.summary()
>
>> Dataset: 20x3375@float32, <sa: chunks,subject,targets>, <fa: modality,modality_index,voxel_indices>, <a: mapper>
>> stats: mean=0.006 std=0.0704271 var=0.00495998 min=0 max=1
>
>> Counts of targets in each chunk:
>>   chunks\targets  0   1
>>                  --- ---
>>         0         1   0
>>         1         1   0
>>         2         1   0
>>         3         1   0
>>         4         1   0
>>         5         1   0
>>         6         1   0
>>         7         1   0
>>         8         1   0
>>         9         1   0
>>        10         0   1
>>        11         0   1
>>        12         0   1
>>        13         0   1
>>        14         0   1
>>        15         0   1
>>        16         0   1
>>        17         0   1
>>        18         0   1
>>        19         0   1
>
>
> so the problem is that in each chunk you have only one sample and
> overall you have only 20 samples to train on.  Whenever you NFold
> partition it, you end up with 10 samples of one target and 9 of the
> other.  If there is a clear signal, error is minimized by correct
> labeling.  If there is no signal, error is minimized by always saying
> "class with the majority of samples (10 vs 9)", which then always leads
> to misclassification of the held-out sample, since it is of the
> opposite class.
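>
> A tiny standalone simulation of that effect could look like this (just a
> sketch in plain numpy, not PyMVPA; a majority-class guess stands in for a
> classifier that finds no signal):
>
>     import numpy as np
>
>     targets = np.repeat([0, 1], 10)   # 20 samples, 10 per class, no signal
>     correct = 0
>     for test in range(len(targets)):
>         train_targets = np.delete(targets, test)
>         # with no signal, training error is minimized by always predicting
>         # the majority class of the training set (10 vs 9) -- which is
>         # never the class of the held-out sample
>         prediction = np.bincount(train_targets).argmax()
>         correct += int(prediction == targets[test])
>     print('accuracy: %.2f' % (correct / float(len(targets))))   # prints 0.00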
>
> the fun would have been if you had just run it on real data -- most
> probably you would have gotten a strong negative bias, but possibly still
> some reasonable around-chance performances... and then you would have
> scratched your head a lot.  So -- simulations rule! ;)
>
> The simplest way to handle it: guarantee a balanced number of samples from
> both categories in the training (and thus testing) splits.
>
> There are two ways then to do it:
>
> 1. simplest but more ad-hoc: regroup the samples into chunks so that each
> chunk brings together one sample from each class; you end up with 10 chunks
> and thus 10 splits if using NFold(1) (see the short sketch near the end of
> this message)
>
> 2. create a partitioner which would select all possible combinations
> from the two classes, i.e. have 10*10=100 splits.
>
>   Two ways to do it
>
>   a. with the existing codebase, something like this should work
>
>     npart = ChainNode([
>         NFoldPartitioner(len(ds.sa['targets'].unique),
>                          attr='chunks'),
>         ## so it should select only those splits where we took 1 from
>         ## each of the targets categories leaving things in balance
>         Sifter([('partitions', 2),
>                 ('targets',
>                  { 'uvalues': ds.sa['targets'].unique,
>                    'balanced': True})
>                 ]),
>     ], space='partitions')
>
>  which will, in your case, do NFold(2) across chunks, thus selecting every
> combination of two chunks, but then keep only those (the Sifter removes the
> others) which have balanced targets.
>
>  b. WiP
>   https://github.com/PyMVPA/PyMVPA/pull/386
>   to simplify this so it would look like
>
>    factpart = FactorialPartitioner(
>        NFoldPartitioner(attr='chunks'),
>        attr='targets'
>    )
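>
> For option 1, the chunk regrouping could be as simple as this sketch
> (assuming, as in your summary above, that the samples come ordered by
> target, 10 per class):
>
>     import numpy as np
>     # give sample i of class 0 and sample i of class 1 the same chunk
>     # label, so every chunk holds one sample from each class
>     ds.sa['chunks'] = np.tile(np.arange(10), 2)
>
> after which a plain NFoldPartitioner(attr='chunks') yields 10 balanced
> splits.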
>
> N.B. Matteo -- one more testcase to test! ;)
>
> --
> Yaroslav O. Halchenko
> Center for Open Neuroscience     http://centerforopenneuroscience.org
> Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
> Phone: +1 (603) 646-9834                       Fax: +1 (603) 646-1419
> WWW:   http://www.linkedin.com/in/yarik
>
> _______________________________________________
> Pkg-ExpPsy-PyMVPA mailing list
> Pkg-ExpPsy-PyMVPA at lists.alioth.debian.org
> http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-exppsy-pymvpa
> --
> Max Planck Institute for Human Cognitive and Brain Sciences
> Department of Neuropsychology (A219)
> Stephanstraße 1a
> 04103 Leipzig
>
> Phone: +49 (0) 341 9940 2625
> Mail: kuhl at cbs.mpg.de
> Internet: http://www.cbs.mpg.de/staff/kuhl-12160


