[pymvpa] permutation test for unbalanced data
Anna Manelis
anna.manelis at gmail.com
Tue Oct 24 12:37:29 UTC 2017
Thank you Richard. This is very helpful.
Anna.
On Tue, Oct 24, 2017 at 8:32 AM, Richard Dinga <dinga92 at gmail.com> wrote:
> > well -- all the imbalanced cases are "tricky", to say the least ;)
>
> Imbalanced cases are fine -- just use AUC or some other measure that does not
> threshold the predictions, and all your problems will go away.
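>
> A minimal sketch of that point, using scikit-learn's roc_auc_score rather
> than anything from PyMVPA (group sizes 16/12 as in your sample): a trivial
> majority-class rule already scores well on plain accuracy, while the AUC of
> uninformative scores stays near chance.
>
> import numpy as np
> from sklearn.metrics import roc_auc_score
>
> rng = np.random.RandomState(0)
> y_true = np.array([1] * 16 + [0] * 12)      # 16 "better" vs 12 "worse"
> y_score = rng.normal(size=y_true.shape)     # uninformative classifier scores
>
> y_pred_majority = np.ones_like(y_true)      # always predict the majority class
> print np.mean(y_pred_majority == y_true)    # ~0.57 "accuracy" for free
> print roc_auc_score(y_true, y_score)        # ~0.5, i.e. chance, regardless of imbalance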
>
> On Wed, Oct 18, 2017 at 5:21 AM, Anna Manelis <anna.manelis at gmail.com>
> wrote:
>
>> Thanks for your feedback!
>>
>> 1. Nested cross-validation is done across subjects. For the set of 28
>> subjects (16 in gr1 + 12 in gr2), 27 subjects are selected and SMLR
>> accuracies are computed for several lm values (e.g., 0.2, 0.5, 1.0) using a
>> leave-one-pair-of-subjects-from-different-groups cross-validation (so
>> train on 25 subjects, test on 2). This procedure is repeated 27 times (a
>> different subject is left out in each round of NCV). The mean accuracy is
>> computed across all splits within each round and across all 27 rounds. The
>> lm corresponding to the highest mean accuracy is selected as the best
>> parameter and used for the full-sample (n=28) SMLR.
>>
>> 2. It would be great to balance the training sets as well (to have an equal
>> number of subjects per group), but I was not sure whether a single set of
>> randomly selected subjects from the larger group (equal in size to the
>> smaller group) would do, or whether the selection should be repeated
>> multiple times (to make sure that all subjects get included across training
>> sets). I am also not sure how to incorporate the generation of such equal
>> random sets into the scripts; a rough sketch of what I imagine is below.
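>>
>> The kind of thing I imagine (an untested sketch -- assuming PyMVPA's
>> Balancer generator can be chained after the partitioner this way, which I
>> am not certain about) would draw several balanced subsets of the training
>> partition, instead of (or in addition to) the Sifter-based par from my
>> script, so that across repetitions most subjects of the larger group get
>> used:
>>
>> from mvpa2.suite import NFoldPartitioner, Balancer, ChainNode
>>
>> par_bal = ChainNode([NFoldPartitioner(cvtype=2, attr='chunks'),
>>                      Balancer(attr='targets',
>>                               count=10,             # 10 balanced draws per partitioning
>>                               limit='partitions',   # balance within each partition
>>                               apply_selection=True)],
>>                     space='partitions')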
>>
>> Processing time may become an issue. It already takes slightly more than 2
>> minutes to run for 28 subjects (given the 192 splits produced by the
>> ChainNode). With more training sets (and potentially more subjects or even
>> more unbalanced datasets) the processing time could grow prohibitively...
>>
>> The script with Monte Carlo testing (1000 repetitions) described in my
>> first e-mail never finished running (well, I stopped it after a day or
>> so). There should be a more efficient way of creating a null
>> distribution for unbalanced datasets...
>>
>> 3. Below is the output of print cv.ca.stats.matrix (note there are 192
>> splits):
>>
>> [[141  68]
>>  [ 51 124]]
>>
>> ACC% = 69.01%
>> TPR "better" = 0.73
>> TPR "worse" = 0.65
>>
>> Confidence intervals (computed using
>> BinomialProportionCI(width=.95, meth='jeffreys')) are
>> 64.3% < ACC% < 73.5%.
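>>
>> As a cross-check on the matrix above (columns taken as targets), the
>> per-class TPRs and their mean can be recomputed directly; because the
>> Sifter keeps the test sets balanced, both columns sum to 192, so the mean
>> of the TPRs coincides with the overall accuracy:
>>
>> import numpy as np
>>
>> cm = np.array([[141,  68],
>>                [ 51, 124]])                   # rows: predictions, columns: targets
>> tpr = np.diag(cm) / cm.sum(axis=0).astype(float)
>> print tpr                                     # ~[0.73  0.65]
>> print tpr.mean()                              # ~0.69
>> print np.diag(cm).sum() / float(cm.sum())     # ~0.69, same as ACC%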
>>
>>
>> 4. So that new error function in
>> https://github.com/PyMVPA/PyMVPA/pull/541 computes a mean of TPRs?
>>
>> Thank you,
>> Anna.
>>
>>
>>
>>
>> On Tue, Oct 17, 2017 at 4:59 PM, Yaroslav Halchenko <debian at onerussian.com> wrote:
>>
>>>
>>> On Wed, 11 Oct 2017, Anna Manelis wrote:
>>>
>>> > Dear Experts,
>>>
>>> > I have an unbalanced dataset (group1=16 subjects ('better'), group2=12
>>> > subjects ('worse').
>>>
>>> is there any grouping of subjects, or does each have a dedicated "chunk"?
>>>
>>> > I would like to perform Monte-Carlo testing of the SMLR
>>> > between-subject classification analysis (performed on cope images). I
>>> > used the permutation_test.py script as an example and tried to adjust
>>> > it for my unbalanced dataset.
>>>
>>> > Any feedback on whether the script below makes sense is greatly
>>> > appreciated.
>>>
>>> well -- all the imbalanced cases are "tricky", to say the least ;)
>>>
>>> I think that, for peace of mind, you should also try your pipeline on
>>> totally random data (ds.samples = np.random.normal(size=ds.shape)) to make
>>> sure that "garbage in, garbage out" holds and nothing comes out
>>> "significant" (well, it should in only a "p" fraction of the cases ;)).
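>>>
>>> Something along these lines (just a sketch, using the dataset and cv from
>>> your script below, and assuming the conditional attributes are enabled as
>>> you have them):
>>>
>>> import numpy as np
>>>
>>> # copy the dataset and replace the samples with pure noise, keeping
>>> # targets/chunks (and thus the imbalance) exactly as they are
>>> ds_null = ds_mni_bw.copy(deep=True)
>>> ds_null.samples = np.random.normal(size=ds_null.shape)
>>>
>>> # run the very same cross-validation (with the MCNullDist attached)
>>> err_null = cv(ds_null)
>>> print err_null.samples        # per-split errors on noise: should hover around chance
>>> print cv.ca.null_prob         # should be "non-significant" in most such runs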
>>>
>>> > lm = 0.2  # The value was determined through the nested cross-validation procedure
>>>
>>> if it was nested, you most likely ended up with a different lm per
>>> split. Or, in other words: how was it done, and among which lm values did
>>> you check? It feels like some double dipping might be going on.
>>>
>>> > clf = SMLR(lm=lm, fit_all_weights=False, descr='SMLR(lm=%g)' % lm)
>>>
>>> > ### how often do we want to shuffle the data
>>> > repeater = Repeater(count=200)
>>>
>>> > ### permute the training part of a dataset exactly ONCE
>>> > permutator = AttributePermutator('targets', limit={'partitions': 1},
>>> >                                  count=1)
>>>
>>> > ### Make sure that each group ("better" and "worse") has the same
>>> > ### number of samples
>>> > par = ChainNode([NFoldPartitioner(cvtype=2, attr='chunks'),
>>> >                  Sifter([('partitions', 2),
>>> >                          ('targets', dict(uvalues=['better', 'worse'],
>>> >                                           balanced=True))])
>>> >                  ])
>>>
>>> so here you are saying to select the partitionings where you have a
>>> balanced # of samples in the testing set... so the training set will remain
>>> imbalanced. That would lead SMLR to tend to classify as "better".
>>> What does your cv.ca.stats look like -- is my fear fulfilled?
>>>
>>> here is an example:
>>>
>>> In [51]: import mvpa2.suite as mv
>>>
>>> In [52]: cv = mv.CrossValidation(mv.SMLR(), mv.NFoldPartitioner(),
>>>    ....:                         errorfx=mv.mean_mismatch_error, enable_ca=['stats'])
>>>
>>> In [53]: ds = mv.normal_feature_dataset(perlabel=10, nlabels=2, nfeatures=4,
>>>    ....:                                nonbogus_features=[0, 1], snr=0)
>>>
>>> In [54]: res = cv(ds); print cv.ca.stats   # on random data -- near-chance result
>>> ----------.
>>> predictions\targets   L0   L1
>>>          `------     ---  ---   P'  N'  FP  FN  PPV  NPV  TPR  SPC  FDR   MCC    F1  AUC
>>>         L0             5    6   11   9   6   5 0.45 0.44  0.5  0.4 0.55  -0.1  0.48  0.3
>>>         L1             5    4    9  11   5   6 0.44 0.45  0.4  0.5 0.56  -0.1  0.42  0.3
>>> Per target:          ---  ---
>>>         P             10   10
>>>         N             10   10
>>>         TP             5    4
>>>         TN             4    5
>>> Summary \ Means:     ---  ---   10  10 5.5 5.5 0.45 0.45 0.45 0.45 0.55  -0.1  0.45  0.3
>>>        CHI^2         0.4  p=0.94
>>>         ACC         0.45
>>>        ACC%           45
>>>   # of sets            5  ACC(i) = 0.6-0.075*i  p=0.32 r=-0.57 r^2=0.32
>>>
>>>
>>> In [55]: ds_ = ds[[0, 2, 4, 6, 8, 10] + range(11, 20)]
>>>
>>> In [56]: print ds_.summary()
>>> Dataset: 15x4@float64, <sa: chunks,targets>, <fa: nonbogus_targets>,
>>>          <a: bogus_features,nonbogus_features>
>>> stats: mean=0.00169964 std=0.315902 var=0.0997943 min=-0.600959 max=0.737558
>>>
>>> Counts of targets in each chunk:
>>> chunks\targets L0 L1
>>> --- ---
>>> 0 1 2
>>> 1 1 2
>>> 2 1 2
>>> 3 1 2
>>> 4 1 2
>>>
>>> Summary for targets across chunks
>>> targets mean std min max #chunks
>>> L0 1 0 1 1 5
>>> L1 2 0 2 2 5
>>>
>>> Summary for chunks across targets
>>> chunks mean std min max #targets
>>> 0 1.5 0.5 1 2 2
>>> 1 1.5 0.5 1 2 2
>>> 2 1.5 0.5 1 2 2
>>> 3 1.5 0.5 1 2 2
>>> 4 1.5 0.5 1 2 2
>>> Sequence statistics for 15 entries from set ['L0', 'L1']
>>> Counter-balance table for orders up to 2:
>>> Targets/Order O1 | O2 |
>>> L0: 4 1 | 3 2 |
>>> L1: 0 9 | 0 8 |
>>> Correlations: min=-0.5 max=0.7 mean=-0.071 sum(abs)=5.8
>>>
>>> In [57]: res = cv(ds_); print cv.ca.stats   # on random and imbalanced data
>>> ----------.
>>> predictions\targets   L0   L1
>>>          `------     ---  ---   P'  N'  FP  FN  PPV  NPV  TPR  SPC  FDR    MCC    F1  AUC
>>>         L0             1    3    4  11   3   4 0.25 0.64  0.2  0.7 0.75  -0.11  0.22  0.6
>>>         L1             4    7   11   4   4   3 0.64 0.25  0.7  0.2 0.36  -0.11  0.67  0.6
>>> Per target:          ---  ---
>>>         P              5   10
>>>         N             10    5
>>>         TP             1    7
>>>         TN             7    1
>>> Summary \ Means:     ---  ---  7.5 7.5 3.5 3.5 0.44 0.44 0.45 0.45 0.56  -0.11  0.44  0.6
>>>        CHI^2         3.4  p=0.33
>>>         ACC         0.53
>>>        ACC%        53.33
>>>   # of sets            5  ACC(i) = 0.67-0.067*i  p=0.65 r=-0.28 r^2=0.08
>>>
>>>
>>> # although the overall accuracy might be low, you can see that the
>>> # classifier tends to misclassify into L1
>>>
>>> And the overall mean accuracy is a "suboptimal" measure in this case.
>>>
>>> What we could have estimated instead is the mean of the per-class
>>> accuracies -- that one behaves better in the case of imbalanced datasets.
>>>
>>> So in this case, with mean_mismatch_error (mean_match_accuracy is what is
>>> reported in the ca.stats) we are getting
>>>
>>> In [63]: (1+7)/(5.+10)
>>> Out[63]: 0.5333333333333333
>>>
>>> whereas what we should use for such cases (and maybe in general, while also
>>> taking care of chunks, since splits could be imbalanced differently;
>>> in this example I just assume a single "measurement") is:
>>>
>>> In [62]: np.mean([1/5., 7/10.])
>>> Out[62]: 0.44999999999999996
>>>
>>> although the numbers here are not really demonstrative, so let's consider a
>>> more critical case, where all the samples get classified into the "majority"
>>> class:
>>>
>>> In [70]: cm = mv.ConfusionMatrix()
>>>
>>> In [71]: cm.add(['L1']*5 + ['L2']*10, ['L2'] * 15)
>>>
>>> In [72]: print cm
>>> ----------.
>>> predictions\targets   L1   L2
>>>          `------     ---  ---   P'  N'  FP  FN  PPV  NPV  TPR  SPC  FDR  MCC   F1
>>>         L1             0    0    0  15   0   5  nan 0.67    0    1  nan    0    0
>>>         L2             5   10   15   0   5   0 0.67  nan    1    0 0.33    0  0.8
>>> Per target:          ---  ---
>>>         P              5   10
>>>         N             10    5
>>>         TP             0   10
>>>         TN            10    0
>>> Summary \ Means:     ---  ---  7.5 7.5 2.5 2.5  nan  nan  0.5  0.5  nan    0  0.4
>>>        CHI^2          15  p=0.0018
>>>         ACC         0.67
>>>        ACC%        66.67
>>>   # of sets            1
>>>
>>>
>>> woohoo -- 66% correct!
>>> But if we compute the mean of the per-class accuracies:
>>>
>>> In [73]: np.mean([0/5., 10/10.])
>>> Out[73]: 0.5
>>>
>>> 1. Unfortunately, "lazy" us haven't implemented such a dedicated error
>>> function yet, BUT you already have that value as part of the confusion
>>> matrix stats:
>>>
>>> In [84]: print cm.stats['mean(TPR)']
>>> 0.5
>>>
>>> which is the mean of the TPRs per label -- which in turn is what
>>> we are interested in here.
>>>
>>> 2. Your above approach (though see the comment below about postproc) may
>>> actually not be without grounds. Since you ensure that the test set always
>>> has one sample of each class, the training set is always consistently
>>> imbalanced, and your null distribution should account for that "tends to
>>> classify as the majority class" bias, so I expect the p-value for the
>>> true classification performance to be "legit".
>>>
>>> What does your chance distribution look like? (distr_est.ca.dist_samples)
>>>
>>> Meanwhile I will prep a "mean_tpr" function (which should match
>>> mean_match_accuracy in the balanced case) and a "mean_fnr" function (which
>>> should match mean_mismatch_error in the balanced case), if I didn't mix
>>> anything up, to use in such cases.
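>>>
>>> Conceptually it would be something along these lines (an untested sketch,
>>> not necessarily what will land in the pull request):
>>>
>>> import numpy as np
>>>
>>> def mean_tpr_sketch(predicted, targets):
>>>     """Mean of the per-class true positive rates (aka balanced accuracy).
>>>
>>>     Untested sketch, for illustration only.
>>>     """
>>>     predicted = np.asarray(predicted)
>>>     targets = np.asarray(targets)
>>>     tprs = [np.mean(predicted[targets == t] == t)
>>>             for t in np.unique(targets)]
>>>     return np.mean(tprs)
>>>
>>> # the degenerate example above: everything classified as the majority class
>>> print mean_tpr_sketch(['L2'] * 15, ['L1'] * 5 + ['L2'] * 10)   # -> 0.5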
>>>
>>>
>>> > ### I believe this will apply permutator on each fold created by par.
>>> > par_perm = ChainNode([par, permutator], space=par.get_space())
>>>
>>> > # CV with null-distribution estimation that permutes the training data
>>> > # for each fold independently
>>> > null_cv = CrossValidation(clf,
>>> >                           par_perm,
>>> >                           errorfx=mean_mismatch_error)
>>>
>>> such a null_cv, and the cv below, will provide an err per each split (do
>>> "print err" to see that). You should add postproc=mean_sample() to both of
>>> them; see the sketch after the quoted script.
>>>
>>> > # Monte Carlo distribution estimator
>>> > distr_est = MCNullDist(repeater, tail='left', measure=null_cv,
>>> >                        enable_ca=['dist_samples'])
>>>
>>> > # actual CV with H0 distribution estimation
>>> > cv = CrossValidation(clf, par, errorfx=mean_mismatch_error,
>>> >                      null_dist=distr_est, enable_ca=['stats'])
>>>
>>>
>>> > err = cv(ds_mni_bw)
>>>
>>> > p = err.ca.null_prob
>>> > np.asscalar(p)
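>>>
>>> i.e. with the postproc added, those two definitions would become something
>>> along these lines (just a sketch, keeping your clf, par, par_perm and
>>> repeater as defined above, and still with mean_mismatch_error until the
>>> mean_tpr/mean_fnr helpers are available):
>>>
>>> from mvpa2.suite import (CrossValidation, MCNullDist, mean_sample,
>>>                          mean_mismatch_error)
>>>
>>> null_cv = CrossValidation(clf, par_perm,
>>>                           errorfx=mean_mismatch_error,
>>>                           postproc=mean_sample())   # collapse per-split errors into one mean
>>>
>>> distr_est = MCNullDist(repeater, tail='left', measure=null_cv,
>>>                        enable_ca=['dist_samples'])
>>>
>>> cv = CrossValidation(clf, par,
>>>                      errorfx=mean_mismatch_error,
>>>                      postproc=mean_sample(),        # so err is a single value per CV run
>>>                      null_dist=distr_est, enable_ca=['stats'])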
>>>
>>> --
>>> Yaroslav O. Halchenko
>>> Center for Open Neuroscience http://centerforopenneuroscience.org
>>> Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
>>> Phone: +1 (603) 646-9834 Fax: +1 (603) 646-1419
>>> WWW: http://www.linkedin.com/in/yarik
>>>