[pymvpa] Balancer - strange behaviour?
Ulrike Kuhl
kuhl at cbs.mpg.de
Mon Jan 4 16:07:50 UTC 2016
Hi!
I'm currently re-checking some code I wrote a while ago, and I stumbled upon something strange: my Balancer does not seem to produce balanced groups for training and testing.
I have a toy dataset DS_noisy (you can find it here: https://bigmail.cbs.mpg.de/i/d218f77e0f8ce3cbd5c301484ece951a.hdf5)
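For reference, a minimal sketch of how I load it (the local filename is just whatever the download was saved as):

from mvpa2.suite import h5load

# load the toy dataset from the downloaded HDF5 file
# (filename is an assumption -- adjust to your local copy)
DS_noisy = h5load('DS_noisy.hdf5')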
DS_noisy.summary() gives:
Dataset: 11x3472 at float32, <sa: chunks,subject,targets>, <fa: modality,modality_index,voxel_indices>, <a: imghdr,mapper>
stats: mean=-7.70839e-07 std=0.999997 var=0.999994 min=-1.75513 max=3.98463
Counts of targets in each chunk:
chunks\targets   0   1
                --- ---
      0          1   0
      1          1   0
      2          1   0
      3          1   0
      4          1   0
      5          1   0
      6          1   0
      7          0   1
      8          0   1
      9          0   1
     10          0   1
Summary for targets across chunks
targets  mean   std   min  max  #chunks
   0     0.636  0.481   0    1     7
   1     0.364  0.481   0    1     4
Summary for chunks across targets
chunks  mean  std  min  max  #targets
   0     0.5  0.5    0    1      1
   1     0.5  0.5    0    1      1
   2     0.5  0.5    0    1      1
   3     0.5  0.5    0    1      1
   4     0.5  0.5    0    1      1
   5     0.5  0.5    0    1      1
   6     0.5  0.5    0    1      1
   7     0.5  0.5    0    1      1
   8     0.5  0.5    0    1      1
   9     0.5  0.5    0    1      1
  10     0.5  0.5    0    1      1
Sequence statistics for 11 entries from set [0, 1]
Counter-balance table for orders up to 2:
Targets/Order   O1   |   O2   |
     0:        6 1   |  5 2   |
     1:        0 3   |  0 2   |
Correlations: min=-0.57 max=0.61 mean=-0.1 sum(abs)=4.3
As you can see, the number of participants per group is imbalanced: seven subjects with target 0 versus four with target 1. For a searchlight classification with an SVM, I want to partition the data into balanced training and testing sets. For this, I use the following partitioner:
from mvpa2.suite import *  # provides ChainNode, NFoldPartitioner, Sifter, Balancer

npart = ChainNode([
    # leave-2-chunks-out partitioning (cvtype = number of unique targets = 2)
    NFoldPartitioner(len(DS_noisy.sa['targets'].unique), attr='chunks'),
    # keep only those partitionings whose testing set (partitions == 2)
    # contains all target values in balanced proportions
    Sifter([('partitions', 2),
            ('targets', {'uvalues': DS_noisy.sa['targets'].unique,
                         'balanced': True})]),
    # within each partition, subsample to an equal number of samples
    # per target, and apply the selection to the output dataset
    Balancer(attr='targets', count=1, limit='partitions', apply_selection=True)
], space='partitions')
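As a quick sanity check on the chain itself, a minimal sketch that just counts how many partitionings survive the Sifter:

# count the partitionings the chain actually yields,
# i.e. those that pass the Sifter's balance requirement
n_splits = sum(1 for _ in npart.generate(DS_noisy))
print 'generated splits:', n_splits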
Finally, I look at the generated partitions to check whether everything is fine:
for ds_ in npart.generate(DS_noisy):
    print 'A new split:'
    print 'Testing:'
    # samples assigned to the testing partition (partitions == 2)
    testing = DS_noisy[ds_.sa.partitions == 2]
    print list(zip(testing.sa.chunks, testing.sa.targets))
    print 'Training:'
    # samples assigned to the training partition (partitions == 1)
    training = DS_noisy[ds_.sa.partitions == 1]
    print list(zip(training.sa.chunks, training.sa.targets))
The output baffles me:
A new split:
Testing:
[(0, 0), (4, 0)]
Training:
[(1, 0), (2, 0), (3, 0), (5, 0), (6, 0), (7, 1)]
A new split:
Testing:
[(0, 0), (5, 0)]
Training:
[(1, 0), (2, 0), (3, 0), (4, 0), (6, 0), (7, 1)]
A new split:
Testing:
[(0, 0), (6, 0)]
Training:
[(1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (7, 1)]
...
Problem: the targets in the training data are far from balanced. In fact, the training set only ever contains the (7, 1) combination as its single target-1 item, while all remaining training items have target 0. Also, shouldn't the two test items always be one target-0 and one target-1 item?
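To quantify this, a minimal tally over the same selections as above (using collections.Counter; the example count in the comment matches the first split printed above):

from collections import Counter

for ds_ in npart.generate(DS_noisy):
    training = DS_noisy[ds_.sa.partitions == 1]
    testing = DS_noisy[ds_.sa.partitions == 2]
    # e.g. first split: training Counter({0: 5, 1: 1}) -- clearly not balanced
    print 'training:', Counter(training.sa.targets)
    print 'testing: ', Counter(testing.sa.targets)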
Am I doing something wrong or am I looking at it the wrong way?
Any help is appreciated!
Best,
Ulrike
--
Max Planck Institute for Human Cognitive and Brain Sciences
Department of Neuropsychology (A219)
Stephanstraße 1a
04103 Leipzig
Phone: +49 (0) 341 9940 2625
Mail: kuhl at cbs.mpg.de
Internet: http://www.cbs.mpg.de/staff/kuhl-12160