[pymvpa] Dataset with multidimensional feature vector per voxel

Tue Nov 17 15:18:20 UTC 2015

On Tue, 17 Nov 2015, Ulrike Kuhl wrote:

> Here you go:

> print DS_clean.summary()

> Dataset: 20x3375 at float32, <sa: chunks,subject,targets>, <fa: modality,modality_index,voxel_indices>, <a: mapper>
> stats: mean=0.006 std=0.0704271 var=0.00495998 min=0 max=1

> Counts of targets in each chunk:
>   chunks\targets  0   1
>                  --- ---
>         0         1   0
>         1         1   0
>         2         1   0
>         3         1   0
>         4         1   0
>         5         1   0
>         6         1   0
>         7         1   0
>         8         1   0
>         9         1   0
>        10         0   1
>        11         0   1
>        12         0   1
>        13         0   1
>        14         0   1
>        15         0   1
>        16         0   1
>        17         0   1
>        18         0   1
>        19         0   1

So the problem is that each chunk holds only one sample, and overall
you have only 20 samples to train on.  Whenever you NFold partition it
(leaving one chunk out), the training set ends up with 10 samples of one
target and 9 of the other.  If there is a clear signal, minimizing the
error leads to correct labeling.  If there is no signal, the error is
minimized by always predicting the class with the majority of training
samples (10 vs 9), which then always misclassifies the held-out sample,
since it belongs to the opposite class.
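
To see the effect in isolation, here is a minimal sketch in plain Python
(no PyMVPA involved).  It assumes the same layout as in your summary (20
samples, one per chunk, 10 per class) and replaces the classifier with a
made-up stand-in, majority_label, which ignores the features and just
predicts the most frequent training label, i.e. what a classifier falls
back to when there is no signal:

```python
from collections import Counter

# Same layout as in the summary: chunks 0-9 are target 0, 10-19 target 1
targets = [0] * 10 + [1] * 10


def majority_label(labels):
    """Hypothetical 'no signal' classifier: most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


correct = 0
for held_out in range(len(targets)):
    # Leave one chunk (= one sample) out; train on the remaining 19
    train = targets[:held_out] + targets[held_out + 1:]
    prediction = majority_label(train)
    correct += int(prediction == targets[held_out])

accuracy = correct / float(len(targets))
print(accuracy)
```

Every fold is misclassified, so the accuracy comes out at 0 rather than
at the 0.5 you would naively expect for chance.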

The fun part: if you had just run it on real data, most probably you
would have got some strong negative bias but possibly still some
reasonable around-chance performances... and then would have scratched
your head a lot.  So -- simulations rule! ;)

The simplest way to handle it: guarantee a balanced number of samples
from both categories in the training (and thus testing) splits.

There are two ways then to do it:

1. simplest but more ad-hoc.  Regroup the samples into chunks so that
each chunk holds two samples, one from each class; you end up with 10
chunks and thus 10 splits if using NFold(1)

2. create a partitioner which would select all possible combinations
from the two classes, i.e. have 10*10=100 splits.

  Two ways to do it 

  a. with the existing codebase something like this should work:

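    from mvpa2.base.node import ChainNode
    from mvpa2.generators.partition import NFoldPartitioner
    from mvpa2.generators.base import Sifter

    # ds is your dataset (DS_clean above)
    npart = ChainNode([
        # leave out as many chunks as there are unique targets (2 here)
        NFoldPartitioner(len(ds.sa['targets'].unique),
                         attr='chunks'),
        ## so it should select only those splits where we took 1 from
        ## each of the targets categories leaving things in balance
        Sifter([('partitions', 2),
                ('targets',
                 { 'uvalues': ds.sa['targets'].unique,
                   'balanced': True})
                ]),
    ], space='partitions')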

 In your case this does NFold(2) across chunks, i.e. takes every
combination of two chunks as the testing set (190 of them), and then
keeps only those splits whose testing set has balanced targets (Sifter
removes the others), which leaves the 100 splits mentioned above.

 b. Work in progress:
  https://github.com/PyMVPA/PyMVPA/pull/386
  to simplify this so it would look like

   factpart = FactorialPartitioner(
       NFoldPartitioner(attr='chunks'),
       attr='targets'
   )
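
To make the split counts for both options concrete, here is a rough
sketch in plain Python (again no PyMVPA calls, it only does the
bookkeeping).  The names new_chunks, per_chunk and balanced_pairs are
made up for illustration, and pairing chunk i with chunk i + 10 in
option 1 is an arbitrary choice; any one-to-one pairing across the two
classes would do:

```python
from itertools import combinations

# Same layout as in the summary: chunk i holds sample i
targets = [0] * 10 + [1] * 10
chunks = list(range(20))

# Option 1: merge chunk i with chunk i + 10, so that every new chunk
# holds exactly one sample of each class
new_chunks = [c % 10 for c in chunks]
per_chunk = {}
for c, t in zip(new_chunks, targets):
    per_chunk.setdefault(c, []).append(t)
n_splits_option1 = len(per_chunk)

# Option 2: take every pair of the original chunks as a testing set and
# keep only the pairs that contain one sample of each class
# (targets[a] works here only because chunk number == sample index)
all_pairs = list(combinations(chunks, 2))
balanced_pairs = [(a, b) for a, b in all_pairs
                  if targets[a] != targets[b]]

print(n_splits_option1)
print(len(all_pairs))
print(len(balanced_pairs))
```

Option 1 gives 10 leave-one-chunk-out splits; option 2 keeps 100 of the
190 possible two-chunk testing sets.  In both cases training is left
with 9 samples per class.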

N.B. Matteo -- one more testcase to test! ;)

-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834                       Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik