[pymvpa] permutation test for unbalanced data
Richard Dinga
dinga92 at gmail.com
Tue Oct 24 12:32:55 UTC 2017
> well -- all the disbalanced cases are "tricky" to say the least ;)
Imbalanced cases are fine -- just use AUC or some other measure that does not
threshold the classifier's outputs, and all your problems will go away.
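To make that concrete, here is a minimal sketch (plain numpy, outside of PyMVPA,
with made-up scores) of why a ranking measure like AUC does not care about group
sizes: it is the probability that a randomly chosen sample of one class gets a
higher decision value than a randomly chosen sample of the other class, so the
16/12 imbalance does not shift its chance level away from 0.5.

import numpy as np

def auc(scores_pos, scores_neg):
    # Mann-Whitney formulation of AUC: P(score_pos > score_neg), ties count 1/2
    scores_pos = np.asarray(scores_pos, dtype=float)
    scores_neg = np.asarray(scores_neg, dtype=float)
    greater = (scores_pos[:, None] > scores_neg[None, :]).sum()
    ties = (scores_pos[:, None] == scores_neg[None, :]).sum()
    return (greater + 0.5 * ties) / float(len(scores_pos) * len(scores_neg))

# 16 "better" vs 12 "worse" subjects with random decision values:
# AUC hovers around 0.5 no matter how disbalanced the groups are
rng = np.random.RandomState(0)
print auc(rng.normal(size=16), rng.normal(size=12))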
On Wed, Oct 18, 2017 at 5:21 AM, Anna Manelis <anna.manelis at gmail.com>
wrote:
> Thanks for your feedback!
>
> 1. Nested cross-validation is done across subjects. From the set of 28
> subjects (16 (gr1) + 12 (gr2)), 27 subjects are selected and the SMLR
> accuracies are computed for several lms (e.g., 0.2, 0.5, 1.0) using a
> leave-one-pair-of-subjects-from-different-groups cross-validation (so
> train on 25 subjects, test on 2). This procedure is repeated 27 times (so
> a different subject is left out for each round of NCV). For each lm, the
> accuracies are averaged across all splits within a round and then across
> all 27 rounds. The lm with the highest mean accuracy is selected as the
> best parameter and used in the full-sample (n=28) SMLR.
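> As a sketch of that selection step -- with hypothetical helper names, e.g.
> inner_cv_accuracy() standing in for the leave-one-pair cross-validation of
> one outer round, and assuming numpy is imported as np -- the best lm is just
> the one with the highest mean inner accuracy:
>
> lms = [0.2, 0.5, 1.0]
> mean_acc = dict((lm, np.mean([inner_cv_accuracy(outer_round, lm)
>                               for outer_round in range(27)]))
>                 for lm in lms)
> best_lm = max(mean_acc, key=mean_acc.get)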
>
> 2. It would be great to balance the training sets as well (to have an equal
> number per group), but I was not sure whether one set of randomly selected
> subjects from the larger group (matched in size to the smaller group) would
> do, or whether the procedure should be repeated multiple times (to make sure
> that all subjects are included across training sets). I am also not sure
> how to incorporate generation of such balanced random sets into the scripts.
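> One possible way -- a rough numpy sketch with made-up index arrays, not tied
> to the PyMVPA generators -- is to draw several balanced training sets so
> that, across repetitions, every subject of the larger group gets used:
>
> rng = np.random.RandomState(42)
> gr1 = np.arange(16)        # indices of 'better' subjects (larger group)
> gr2 = np.arange(16, 28)    # indices of 'worse' subjects
> balanced_sets = [np.concatenate([rng.choice(gr1, size=len(gr2), replace=False), gr2])
>                  for _ in range(20)]   # e.g., 20 subsampling repetitions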
>
> Processing time may become an issue. It takes slightly more than 2 min to
> run it for 28 subjects (given the 192 splits produced by ChainNode). With more
> training sets (and potentially more subjects or even more unbalanced
> datasets) the processing time could grow prohibitively...
>
> The script with Monte Carlo testing (with 1000 repetitions) described in
> my first e-mail never finished running (well I stopped it after a day or
> so). There should be some more efficient way of creating a null
> distribution for unbalanced datasets...
>
> 3. Below is the output of
> print cv.ca.stats.matrix
> [[141 68]
> [ 51 124]]
>
> Note there are 192 splits.
>
> ACC%=69.01%
> TPR "better" = 0.73
> TPR "worse" = 0.65
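> These numbers follow directly from the matrix above, reading columns as
> targets; each class contributes 192 test predictions, matching the 192
> balanced splits with one test sample per class:
>
> acc        = (141 + 124) / 384.   # 0.6901 -> ACC% = 69.01
> tpr_better = 141 / 192.           # 0.73
> tpr_worse  = 124 / 192.           # 0.65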
>
> The confidence interval (computed using BinomialProportionCI(width=.95,
> meth='jeffreys')) is
> 64.3% < ACC% < 73.5%.
>
>
> 4. So the new error function in https://github.com/PyMVPA/PyMVPA/pull/541
> computes a mean of TPRs?
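> (If so -- assuming that errorfx is indeed the mean of per-class TPRs -- for
> the matrix above it would give np.mean([141/192., 124/192.]) ~= 0.69, i.e.
> the same as ACC here, since the Sifter keeps the test sets balanced.)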
>
> Thank you,
> Anna.
>
>
>
>
> On Tue, Oct 17, 2017 at 4:59 PM, Yaroslav Halchenko <debian at onerussian.com>
> wrote:
>
>>
>> On Wed, 11 Oct 2017, Anna Manelis wrote:
>>
>> > Dear Experts,
>>
>> > I have an unbalanced dataset (group1=16 subjects ('better'), group2=12
>> > subjects ('worse').
>>
>> is there any grouping of subjects, or does each have a dedicated "chunk"?
>>
>> > I would like to perform Monte-Carlo testing of the SMLR
>> > between-subject classification analysis (performed on cope images). I used
>> > the permutation_test.py script as an example and tried to adjust it for my
>> > unbalanced dataset.
>>
>> > Any feedback on whether the script below makes sense is greatly
>> > appreciated.
>>
>> well -- all the disbalanced cases are "tricky" to say the least ;)
>>
>> I think that, for peace of mind, you should also try your pipeline on
>> totally random data (ds.samples = np.random.normal(size=ds.shape)) to make
>> sure that "garbage in, garbage out" holds and nothing comes out "significant"
>> (well, it should only be so in "p" portion of the cases ;))
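>>
>> e.g., something along these lines (a sketch reusing ds_mni_bw and the cv
>> defined in your script, names as in your e-mail):
>>
>> ds_rand = ds_mni_bw.copy()
>> ds_rand.samples = np.random.normal(size=ds_rand.shape)
>> err_rand = cv(ds_rand)                     # full pipeline on pure noise
>> print cv.ca.stats                          # should hover around chance
>> print np.asscalar(err_rand.ca.null_prob)   # should rarely look "significant"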
>>
>> > lm = 0.2  # The value was determined through the nested cross-validation procedure
>>
>> if it was nested, you would most likely end up with a different lm per
>> split. Or in other words: how was it done, and among which lms did you
>> check? Feels like some double dipping might be going on
>>
>> > clf = SMLR(lm=lm, fit_all_weights=False, descr='SMLR(lm=%g)' % lm)
>>
>> > ### how often do we want to shuffle the data
>> > repeater = Repeater(count=200)
>>
>> > ### permute the training part of a dataset exactly ONCE
>> > permutator = AttributePermutator('targets', limit={'partitions': 1}, count=1)
>>
>> > ### Make sure that each group ("better" and "worse") has the same number of samples
>> > par = ChainNode([NFoldPartitioner(cvtype=2, attr='chunks'),
>> >                  Sifter([('partitions', 2),
>> >                          ('targets', dict(uvalues=['better', 'worse'],
>> >                                           balanced=True))])])
>>
>> so, here you are saying to select only the partitionings where you have a
>> balanced # of samples in the testing set... so the training set will remain
>> disbalanced. That would lead SMLR to tend to classify samples as "better".
>> What does your cv.ca.stats look like -- is my fear fulfilled?
>>
>> here is an example:
>>
>> In [51]: import mvpa2.suite as mv
>>
>> In [52]: cv = mv.CrossValidation(mv.SMLR(), mv.NFoldPartitioner(), errorfx=mv.mean_mismatch_error, enable_ca=['stats'])
>>
>> In [53]: ds = mv.normal_feature_dataset(perlabel=10, nlabels=2, nfeatures=4, nonbogus_features=[0,1], snr=0)
>>
>> In [54]: res = cv(ds); print cv.ca.stats  # on random data -- near chance result
>> ----------.
>> predictions\targets   L0    L1
>>        `------       ---   ---   P'   N'   FP   FN   PPV   NPV   TPR   SPC   FDR   MCC    F1   AUC
>>           L0           5     6   11    9    6    5  0.45  0.44   0.5   0.4  0.55  -0.1  0.48   0.3
>>           L1           5     4    9   11    5    6  0.44  0.45   0.4   0.5  0.56  -0.1  0.42   0.3
>> Per target:          ---   ---
>>            P          10    10
>>            N          10    10
>>           TP           5     4
>>           TN           4     5
>> Summary \ Means:     ---   ---   10   10  5.5  5.5  0.45  0.45  0.45  0.45  0.55  -0.1  0.45   0.3
>>            CHI^2     0.4        p=0.94
>>              ACC    0.45
>>             ACC%      45
>>        # of sets       5        ACC(i) = 0.6-0.075*i p=0.32 r=-0.57 r^2=0.32
>>
>>
>> In [55]: ds_ = ds[[0, 2, 4, 6, 8, 10] + range(11, 20)]
>>
>> In [56]: print ds_.summary()
>> Dataset: 15x4@float64, <sa: chunks,targets>, <fa: nonbogus_targets>, <a: bogus_features,nonbogus_features>
>> stats: mean=0.00169964 std=0.315902 var=0.0997943 min=-0.600959 max=0.737558
>>
>> Counts of targets in each chunk:
>> chunks\targets L0 L1
>> --- ---
>> 0 1 2
>> 1 1 2
>> 2 1 2
>> 3 1 2
>> 4 1 2
>>
>> Summary for targets across chunks
>> targets mean std min max #chunks
>> L0 1 0 1 1 5
>> L1 2 0 2 2 5
>>
>> Summary for chunks across targets
>> chunks mean std min max #targets
>> 0 1.5 0.5 1 2 2
>> 1 1.5 0.5 1 2 2
>> 2 1.5 0.5 1 2 2
>> 3 1.5 0.5 1 2 2
>> 4 1.5 0.5 1 2 2
>> Sequence statistics for 15 entries from set ['L0', 'L1']
>> Counter-balance table for orders up to 2:
>> Targets/Order O1 | O2 |
>> L0: 4 1 | 3 2 |
>> L1: 0 9 | 0 8 |
>> Correlations: min=-0.5 max=0.7 mean=-0.071 sum(abs)=5.8
>>
>> In [57]: res = cv(ds_); print cv.ca.stats  # on random and disbalanced data
>> ----------.
>> predictions\targets   L0    L1
>>        `------       ---   ---   P'   N'   FP   FN   PPV   NPV   TPR   SPC   FDR    MCC    F1   AUC
>>           L0           1     3    4   11    3    4  0.25  0.64   0.2   0.7  0.75  -0.11  0.22   0.6
>>           L1           4     7   11    4    4    3  0.64  0.25   0.7   0.2  0.36  -0.11  0.67   0.6
>> Per target:          ---   ---
>>            P           5    10
>>            N          10     5
>>           TP           1     7
>>           TN           7     1
>> Summary \ Means:     ---   ---  7.5  7.5  3.5  3.5  0.44  0.44  0.45  0.45  0.56  -0.11  0.44   0.6
>>            CHI^2     3.4        p=0.33
>>              ACC    0.53
>>             ACC%   53.33
>>        # of sets       5        ACC(i) = 0.67-0.067*i p=0.65 r=-0.28 r^2=0.08
>>
>>
>> # although the overall accuracy might be low, you can see that the
>> # classifier tends to misclassify into L1
>>
>>
>> and the overall mean accuracy is a "suboptimal" measure in this case.
>>
>> What we could have estimated instead is the mean of the per-class
>> accuracies -- that one behaves better in the case of disbalanced datasets.
>>
>> So in this case, with mean_mismatch_error (mean_match_accuracy reported
>> in the ca.stats) we are getting
>>
>> In [63]: (1+7)/(5.+10)
>> Out[63]: 0.5333333333333333
>>
>> whereas for such cases we should instead use (and maybe in general, while
>> also taking care of chunks, since splits could be disbalanced differently;
>> in this example I just assume a single "measurement"):
>>
>> In [62]: np.mean([1/5., 7/10.])
>> Out[62]: 0.44999999999999996
>>
>> although the numbers here are not really demonstrative, so let's consider a
>> more critical case, where all the samples get classified into the "majority"
>> class:
>>
>> In [70]: cm = mv.ConfusionMatrix()
>>
>> In [71]: cm.add(['L1']*5 + ['L2']*10, ['L2'] * 15)
>>
>> In [72]: print cm
>> ----------.
>> predictions\targets   L1    L2
>>        `------       ---   ---   P'   N'   FP   FN   PPV   NPV   TPR   SPC   FDR   MCC    F1
>>           L1           0     0    0   15    0    5   nan  0.67     0     1   nan     0     0
>>           L2           5    10   15    0    5    0  0.67   nan     1     0  0.33     0   0.8
>> Per target:          ---   ---
>>            P           5    10
>>            N          10     5
>>           TP           0    10
>>           TN          10     0
>> Summary \ Means:     ---   ---  7.5  7.5  2.5  2.5   nan   nan   0.5   0.5   nan     0   0.4
>>            CHI^2      15        p=0.0018
>>              ACC    0.67
>>             ACC%   66.67
>>        # of sets       1
>>
>>
>> whohoo -- 66% correct!
>> but if we compute the mean of per-class accuracies:
>>
>> In [73]: np.mean([0/5., 10/10.])
>> Out[73]: 0.5
>>
>> 1. Unfortunately "lazy" us didn't implement such a dedicated error
>> function yet, BUT you already have that value as part of the stats:
>>
>> In [84]: print cm.stats['mean(TPR)']
>> 0.5
>>
>> which is the mean of the TPRs per label, which in turn is what
>> we are interested in here
>>
>> 2. I am not quite certain, but your approach above (though see the comment
>> about postproc below) may not be without merit. Since you ensure
>> that the test set always has one sample of each class, the training
>> set is always consistently disbalanced, and your null distribution
>> should account for all of that "tend to classify as the majority class"
>> bias, so I expect the p value for the true classification performance to be
>> "legit".
>>
>> what does your chance distribution look like? (distr_est.ca.dist_samples)
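>>
>> e.g., something like this to eyeball it (a sketch, assuming the distr_est
>> and cv from your script have already been run; dist_samples holds the
>> permuted error values, so 1 - error gives the chance-level accuracies):
>>
>> import pylab as pl
>> null_accs = 1 - np.asarray(distr_est.ca.dist_samples).squeeze()
>> pl.hist(null_accs, bins=20)
>> pl.axvline(0.69, color='r')   # the observed accuracy from the real labels
>> pl.show()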
>>
>> meanwhile I will prep the "mean_tpr" function (which should match
>> mean_match_accuracy in the balanced case) and "mean_fnr" (which should match
>> mean_mismatch_error in the balanced case) (if I didn't mix anything
>> up) to use in such cases.
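>>
>> in the meantime, a rough stand-in (plain numpy, not the eventual PyMVPA
>> errorfx; arguments ordered targets-then-predictions, as in the cm.add()
>> example above) could look like:
>>
>> def mean_tpr_standin(targets, predictions):
>>     targets, predictions = np.asarray(targets), np.asarray(predictions)
>>     return np.mean([np.mean(predictions[targets == t] == t)
>>                     for t in np.unique(targets)])
>>
>> print mean_tpr_standin(['L1']*5 + ['L2']*10, ['L2']*15)   # 0.5, as above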
>>
>>
>> > ### I believe this will apply permutator on each fold created by par.
>> > par_perm = ChainNode([par, permutator], space=par.get_space())
>>
>> > # CV with null-distribution estimation that permutes the training data for
>> > # each fold independently
>> > null_cv = CrossValidation(
>> >     clf,
>> >     par_perm,
>> >     errorfx=mean_mismatch_error)
>>
>> such null_cv and cv below will provide one err per split (do print err to
>> see that). You should add postproc=mean_sample() to both of them.
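>>
>> e.g., for the first one (a sketch of just that change; the cv below needs
>> the same postproc=mean_sample() argument added):
>>
>> null_cv = CrossValidation(clf, par_perm,
>>                           errorfx=mean_mismatch_error,
>>                           postproc=mean_sample())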
>>
>> > # Monte Carlo distribution estimator
>> > distr_est = MCNullDist(repeater, tail='left', measure=null_cv,
>> > enable_ca=['dist_samples'])
>>
>> > # actual CV with H0 distribution estimation
>> > cv = CrossValidation(clf, par, errorfx=mean_mismatch_error,
>> > null_dist=distr_est, enable_ca=['stats'])
>>
>>
>> > err = cv(ds_mni_bw)
>>
>> > p = err.ca.null_prob
>> > np.asscalar(p)
>>
>> --
>> Yaroslav O. Halchenko
>> Center for Open Neuroscience http://centerforopenneuroscience.org
>> Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
>> Phone: +1 (603) 646-9834 Fax: +1 (603) 646-1419
>> WWW: http://www.linkedin.com/in/yarik
>>
>
>