[pymvpa] Balancer - strange behaviour?
Ulrike Kuhl
kuhl at cbs.mpg.de
Mon Jan 4 16:07:50 UTC 2016
Hi!
I'm currently re-checking some code I wrote a while ago, and I stumbled upon something strange: my Balancer does not seem to produce balanced groups for training and testing.
I have a toy dataset DS_noisy (you can find it here: https://bigmail.cbs.mpg.de/i/d218f77e0f8ce3cbd5c301484ece951a.hdf5)
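For reference, a minimal sketch of how I load it (the local filename is just whatever the download was saved as):

from mvpa2.suite import h5load

# load the toy dataset from the downloaded HDF5 file
# (filename is an assumption -- adjust to your local copy)
DS_noisy = h5load('DS_noisy.hdf5')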
DS_noisy.summary() gives:
Dataset: 11x3472 at float32, <sa: chunks,subject,targets>, <fa: modality,modality_index,voxel_indices>, <a: imghdr,mapper>
stats: mean=-7.70839e-07 std=0.999997 var=0.999994 min=-1.75513 max=3.98463
Counts of targets in each chunk:
chunks\targets   0   1
                --- ---
      0          1   0
      1          1   0
      2          1   0
      3          1   0
      4          1   0
      5          1   0
      6          1   0
      7          0   1
      8          0   1
      9          0   1
     10          0   1
Summary for targets across chunks
targets  mean   std   min  max  #chunks
   0     0.636  0.481   0    1     7
   1     0.364  0.481   0    1     4
Summary for chunks across targets
chunks  mean  std  min  max  #targets
   0     0.5  0.5    0    1      1
   1     0.5  0.5    0    1      1
   2     0.5  0.5    0    1      1
   3     0.5  0.5    0    1      1
   4     0.5  0.5    0    1      1
   5     0.5  0.5    0    1      1
   6     0.5  0.5    0    1      1
   7     0.5  0.5    0    1      1
   8     0.5  0.5    0    1      1
   9     0.5  0.5    0    1      1
  10     0.5  0.5    0    1      1
Sequence statistics for 11 entries from set [0, 1]
Counter-balance table for orders up to 2:
Targets/Order   O1   |   O2   |
     0:        6 1   |  5 2   |
     1:        0 3   |  0 2   |
Correlations: min=-0.57 max=0.61 mean=-0.1 sum(abs)=4.3
As you can see, the number of participants per group is imbalanced: seven subjects with target 0 versus four with target 1. For a searchlight classification with an SVM, I want to partition the data into balanced training and testing sets. For this, I use the following partitioner:
from mvpa2.suite import *  # provides ChainNode, NFoldPartitioner, Sifter, Balancer

npart = ChainNode([
    # leave-2-chunks-out partitioning (cvtype = number of unique targets = 2)
    NFoldPartitioner(len(DS_noisy.sa['targets'].unique), attr='chunks'),
    # keep only those partitionings whose testing set (partitions == 2)
    # contains all target values in balanced proportions
    Sifter([('partitions', 2),
            ('targets', {'uvalues': DS_noisy.sa['targets'].unique,
                         'balanced': True})]),
    # within each partition, subsample to an equal number of samples
    # per target, and apply the selection to the output dataset
    Balancer(attr='targets', count=1, limit='partitions', apply_selection=True)
], space='partitions')
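As a quick sanity check on the chain itself, a minimal sketch that just counts how many partitionings survive the Sifter:

# count the partitionings the chain actually yields,
# i.e. those that pass the Sifter's balance requirement
n_splits = sum(1 for _ in npart.generate(DS_noisy))
print 'generated splits:', n_splits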
Finally, I look at the generated partitions to check whether everything is fine:
for ds_ in npart.generate(DS_noisy):
    print 'A new split:'
    print 'Testing:'
    # samples assigned to the testing partition (partitions == 2)
    testing = DS_noisy[ds_.sa.partitions == 2]
    print list(zip(testing.sa.chunks, testing.sa.targets))
    print 'Training:'
    # samples assigned to the training partition (partitions == 1)
    training = DS_noisy[ds_.sa.partitions == 1]
    print list(zip(training.sa.chunks, training.sa.targets))
The output baffles me:
A new split:
Testing:
[(0, 0), (4, 0)]
Training:
[(1, 0), (2, 0), (3, 0), (5, 0), (6, 0), (7, 1)]
A new split:
Testing:
[(0, 0), (5, 0)]
Training:
[(1, 0), (2, 0), (3, 0), (4, 0), (6, 0), (7, 1)]
A new split:
Testing:
[(0, 0), (6, 0)]
Training:
[(1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (7, 1)]
...
Problem: the targets in the training data are far from balanced. In fact, the training set only ever contains the (7, 1) combination as its single target-1 item, while all remaining training items have target 0. Also, shouldn't the two test items always be one target-0 and one target-1 item?
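To quantify this, a minimal tally over the same selections as above (using collections.Counter; the example count in the comment matches the first split printed above):

from collections import Counter

for ds_ in npart.generate(DS_noisy):
    training = DS_noisy[ds_.sa.partitions == 1]
    testing = DS_noisy[ds_.sa.partitions == 2]
    # e.g. first split: training Counter({0: 5, 1: 1}) -- clearly not balanced
    print 'training:', Counter(training.sa.targets)
    print 'testing: ', Counter(testing.sa.targets)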
Am I doing something wrong or am I looking at it the wrong way?
Any help is appreciated!
Best,
Ulrike
--
Max Planck Institute for Human Cognitive and Brain Sciences
Department of Neuropsychology (A219)
Stephanstraße 1a
04103 Leipzig
Phone: +49 (0) 341 9940 2625
Mail: kuhl at cbs.mpg.de
Internet: http://www.cbs.mpg.de/staff/kuhl-12160