[pymvpa] permutation test for unbalanced data

Yaroslav Halchenko debian at onerussian.com
Tue Oct 17 20:59:43 UTC 2017


On Wed, 11 Oct 2017, Anna Manelis wrote:

> Dear Experts,

> I have an unbalanced dataset (group1=16 subjects ('better'), group2=12
> subjects ('worse')).

is there any grouping of subjects, or does each have a dedicated "chunk"?

> I would like to perform Monte-Carlo testing of the SMLR
> between-subject classification analysis (performed on cope images).  I used
> the permutation_test.py script as an example and tried to adjust it for my
> unbalanced dataset.

> Any feedback on whether the script below makes sense is greatly appreciated.

well -- all the unbalanced cases are "tricky", to say the least ;)

I think that, for peace of mind, you should also try your pipeline on
totally random data (ds.samples = np.random.normal(size=ds.shape)) to make
sure that "garbage in, garbage out" holds and nothing comes out "significant"
(well, it should in only about a "p" portion of the cases ;)).
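
Something along these lines (just a sketch; ds_mni_bw and cv refer to the
dataset and the cross-validation with null_dist from your script below):

    # sanity check: same pipeline, but on pure noise
    ds_rand = ds_mni_bw.copy(deep=True)
    ds_rand.samples = np.random.normal(size=ds_rand.shape)
    err_rand = cv(ds_rand)                    # the cv with null_dist from below
    print(cv.ca.stats)                        # should hover around chance
    print(np.asscalar(cv.ca.null_prob))       # should come out < 0.05 only ~5% of the time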

> lm = 0.2 # The value was determined through the nested cross-validation
> procedure

if it was nested, you most likely ended up with a different lm for each
split.  Or in other words: how was it done, and among which lm values did
you search?  It feels like some double dipping might be going on.

> clf = SMLR(lm=lm, fit_all_weights=False, descr='SMLR(lm=%g)' % lm)

> ### how often do we want to shuffle the data
> repeater = Repeater(count=200)

> ### permute the training part of a dataset exactly ONCE
> permutator = AttributePermutator('targets', limit={'partitions': 1},
> count=1)

> ### Make sure that each group ("better" and "worse") has the same number
> ### of samples
> par = ChainNode([NFoldPartitioner(cvtype=2, attr='chunks'),
>                  Sifter([('partitions', 2),
>                          ('targets', dict(uvalues=['better', 'worse'],
>                                           balanced=True))])])

so, here you are saying to select only those partitionings where you have
a balanced # of samples in the testing set... so the training set will remain
unbalanced.  That would lead SMLR to tend to classify into "better".
How does your cv.ca.stats look -- is my fear confirmed?

here is an example:

In [51]: import mvpa2.suite as mv

In [52]: cv = mv.CrossValidation(mv.SMLR(), mv.NFoldPartitioner(), errorfx=mv.mean_mismatch_error, enable_ca=['stats'])

In [53]: ds = mv.normal_feature_dataset(perlabel=10, nlabels=2, nfeatures=4, nonbogus_features=[0,1],snr=0)

In [54]: res = cv(ds); print cv.ca.stats  # on random data -- near chance result
----------.
predictions\targets  L0   L1
            `------  ---  --- P' N'  FP  FN  PPV  NPV  TPR  SPC  FDR  MCC  F1  AUC
         L0           5    6  11  9  6   5  0.45 0.44  0.5  0.4 0.55 -0.1 0.48 0.3
         L1           5    4   9 11  5   6  0.44 0.45  0.4  0.5 0.56 -0.1 0.42 0.3
Per target:          ---  ---
         P           10   10
         N           10   10
         TP           5    4
         TN           4    5
Summary \ Means:     ---  --- 10 10 5.5 5.5 0.45 0.45 0.45 0.45 0.55 -0.1 0.45 0.3
       CHI^2         0.4 p=0.94
        ACC         0.45
        ACC%         45
     # of sets        5   ACC(i) = 0.6-0.075*i p=0.32 r=-0.57 r^2=0.32


In [55]: ds_ = ds[[0, 2, 4, 6, 8, 10] + range(11, 20)]

In [56]: print ds_.summary()
Dataset: 15x4 at float64, <sa: chunks,targets>, <fa: nonbogus_targets>, <a: bogus_features,nonbogus_features>
stats: mean=0.00169964 std=0.315902 var=0.0997943 min=-0.600959 max=0.737558

Counts of targets in each chunk:
  chunks\targets  L0  L1
                 --- ---
        0         1   2
        1         1   2
        2         1   2
        3         1   2
        4         1   2

Summary for targets across chunks
  targets mean std min max #chunks
    L0      1   0   1   1     5
    L1      2   0   2   2     5

Summary for chunks across targets
  chunks mean std min max #targets
    0     1.5 0.5  1   2      2
    1     1.5 0.5  1   2      2
    2     1.5 0.5  1   2      2
    3     1.5 0.5  1   2      2
    4     1.5 0.5  1   2      2
Sequence statistics for 15 entries from set ['L0', 'L1']
Counter-balance table for orders up to 2:
Targets/Order O1    |  O2    |
     L0:       4 1  |   3 2  |
     L1:       0 9  |   0 8  |
Correlations: min=-0.5 max=0.7 mean=-0.071 sum(abs)=5.8

In [57]: res = cv(ds_); print cv.ca.stats  # on random and disbalanced data
----------.
predictions\targets   L0   L1
            `------   ---  ---  P'  N'  FP  FN  PPV  NPV  TPR  SPC  FDR  MCC   F1  AUC
         L0           1     3   4   11  3   4  0.25 0.64  0.2  0.7 0.75 -0.11 0.22 0.6
         L1           4     7   11  4   4   3  0.64 0.25  0.7  0.2 0.36 -0.11 0.67 0.6
Per target:           ---  ---
         P            5    10
         N            10    5
         TP           1     7
         TN           7     1
Summary \ Means:      ---  --- 7.5 7.5 3.5 3.5 0.44 0.44 0.45 0.45 0.56 -0.11 0.44 0.6
       CHI^2         3.4  p=0.33
        ACC          0.53
        ACC%        53.33
     # of sets        5    ACC(i) = 0.67-0.067*i p=0.65 r=-0.28 r^2=0.08


# although overall accuracy might be low, you can see that the classifier tends to misclassify into L1


and overall mean accuracy is a "suboptimal" measure in this case.

What we could estimate instead is the mean of the per-class accuracies --
that measure behaves better in the case of unbalanced datasets.

So in this case, with mean_mismatch_error (the complementary
mean_match_accuracy is what gets reported as ACC in the ca.stats) we are getting

In [63]: (1+7)/(5.+10)
Out[63]: 0.5333333333333333

whereas for such cases we should instead use (and maybe in general, while also
taking care of chunks, since splits could be unbalanced differently; in this
example I just assume a single "measurement"):

In [62]: np.mean([1/5., 7/10.])
Out[62]: 0.44999999999999996

although the numbers here are not really demonstrative, let's consider a more
critical case, where all the samples get classified into the "majority"
class:

In [70]: cm = mv.ConfusionMatrix()

In [71]: cm.add(['L1']*5 + ['L2']*10, ['L2'] * 15)

In [72]: print cm
----------.
predictions\targets   L1   L2
            `------   ---  ---  P'  N'  FP  FN  PPV  NPV TPR SPC  FDR MCC  F1
         L1           0     0   0   15  0   5   nan 0.67  0   1   nan  0   0
         L2           5    10   15  0   5   0  0.67  nan  1   0  0.33  0  0.8
Per target:           ---  ---
         P            5    10
         N            10    5
         TP           0    10
         TN           10    0
Summary \ Means:      ---  --- 7.5 7.5 2.5 2.5  nan  nan 0.5 0.5  nan  0  0.4
       CHI^2          15  p=0.0018
        ACC          0.67
        ACC%        66.67
     # of sets        1


woohoo -- 66% correct!
but if we compute the mean of per-class accuracies:

In [73]: np.mean([0/5., 10/10.])
Out[73]: 0.5

1.  Unfortunately "lazy" us haven't implemented such a dedicated error
    function yet, BUT you already have that value as part of the confusion
    matrix stats:

    In [84]: print cm.stats['mean(TPR)']
    0.5

    which is the mean of the TPRs per each label, which in turn is what
    we are interested in here.

2.  That said, your above approach (though see the comment below about
    postproc) is not necessarily without ground.  Since you ensure that the
    test set always has one sample of each class, the training set is always
    unbalanced in the same consistent way, and your null distribution should
    account for that "tend to classify into the majority class" bias, so I
    would expect the p value for the true classification performance to be
    "legit".

What does your chance distribution look like? (distr_est.ca.dist_samples)
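
E.g. something like this to eyeball it (just a sketch using matplotlib;
err here is the result of cv(ds_mni_bw) from your script):

    import pylab as pl
    null_errs = np.asarray(distr_est.ca.dist_samples).ravel()  # one error per permutation
    pl.hist(null_errs, bins=20)
    pl.axvline(np.mean(err.samples), color='red')  # the actual (unpermuted) error
    pl.xlabel('mean_mismatch_error under permutation')
    pl.show()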

Meanwhile I will prep a "mean_tpr" function (which should match
mean_match_accuracy in the balanced case) and a "mean_fnr" (which should match
mean_mismatch_error in the balanced case), if I didn't mix anything up, to use
in such cases.
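
In the meantime, something along these lines should work as a drop-in
errorfx (just a sketch, not a polished PyMVPA implementation; it assumes the
usual (predicted, targets) call convention of the errorfx'es):

    def mean_tpr(predicted, targets):
        """Mean of per-class accuracies (recalls), aka balanced accuracy."""
        predicted, targets = np.asarray(predicted), np.asarray(targets)
        # recall for each class present in targets, then their plain mean
        tprs = [np.mean(predicted[targets == t] == t)
                for t in np.unique(targets)]
        return np.mean(tprs)

    def mean_fnr(predicted, targets):
        """1 - mean_tpr, usable wherever mean_mismatch_error is used now."""
        return 1.0 - mean_tpr(predicted, targets)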


> ### I believe this will apply permutator on each fold created by par.
> par_perm = ChainNode([par, permutator], space=par.get_space())

> # CV with null-distribution estimation that permutes the training data for
> # each fold independently
> null_cv = CrossValidation(
>             clf,
>             par_perm,
>             errorfx=mean_mismatch_error)

Such null_cv, and the cv below, will provide an err per each split (do print
err to see that).  You should add postproc=mean_sample() to both of them --
see the sketch after the quoted script below.

> # Monte Carlo distribution estimator
> distr_est = MCNullDist(repeater, tail='left', measure=null_cv,
>                        enable_ca=['dist_samples'])

> # actual CV with H0 distribution estimation
> cv = CrossValidation(clf, par, errorfx=mean_mismatch_error,
>                      null_dist=distr_est, enable_ca=['stats'])


> err = cv(ds_mni_bw)

> p = err.ca.null_prob
> np.asscalar(p)
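
I.e. the relevant pieces would become something like (a sketch of only the
adjusted bits, keeping the rest of your script as is; errorfx could stay
mean_mismatch_error or be swapped for the mean_fnr sketched above; note that
I read null_prob off cv.ca here):

    null_cv = CrossValidation(clf, par_perm,
                              errorfx=mean_mismatch_error,  # or the mean_fnr above
                              postproc=mean_sample())

    distr_est = MCNullDist(repeater, tail='left', measure=null_cv,
                           enable_ca=['dist_samples'])

    cv = CrossValidation(clf, par, errorfx=mean_mismatch_error,  # or mean_fnr
                         postproc=mean_sample(),
                         null_dist=distr_est, enable_ca=['stats'])

    err = cv(ds_mni_bw)
    p = np.asscalar(cv.ca.null_prob)   # null_prob is a conditional attribute of the measure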

-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834                       Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik        


