[pymvpa] permutation test for unbalanced data
Richard Dinga
dinga92 at gmail.com
Tue Oct 24 12:32:55 UTC 2017
> well -- all the disbalanced cases are "tricky" to say the least ;)
Imbalanced cases are fine -- just use AUC or some other measure that does not
threshold the classifier's outputs, and all your problems will go away.
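To make that concrete, here is a minimal sketch (plain numpy, outside of PyMVPA,
with made-up scores) of why a ranking measure like AUC does not care about group
sizes: it is the probability that a randomly chosen sample of one class gets a
higher decision value than a randomly chosen sample of the other class, so the
16/12 imbalance does not shift its chance level away from 0.5.

import numpy as np

def auc(scores_pos, scores_neg):
    # Mann-Whitney formulation of AUC: P(score_pos > score_neg), ties count 1/2
    scores_pos = np.asarray(scores_pos, dtype=float)
    scores_neg = np.asarray(scores_neg, dtype=float)
    greater = (scores_pos[:, None] > scores_neg[None, :]).sum()
    ties = (scores_pos[:, None] == scores_neg[None, :]).sum()
    return (greater + 0.5 * ties) / float(len(scores_pos) * len(scores_neg))

# 16 "better" vs 12 "worse" subjects with random decision values:
# AUC hovers around 0.5 no matter how disbalanced the groups are
rng = np.random.RandomState(0)
print auc(rng.normal(size=16), rng.normal(size=12))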
On Wed, Oct 18, 2017 at 5:21 AM, Anna Manelis <anna.manelis at gmail.com>
wrote:
> Thanks for your feedback!
>
> 1. Nested cross-validation is done across subjects. From the set of 28
> subjects (16 (gr1) + 12 (gr2)), 27 subjects are selected and the SMLR
> accuracies are computed for several lms (e.g., 0.2, 0.5, 1.0) using a
> leave-one-pair-of-subjects-from-different-groups cross-validation (so
> train on 25 subjects, test on 2). This procedure is repeated 27 times (so
> a different subject is left out for each round of NCV). For each lm, the
> accuracies are averaged across all splits within a round and then across
> all 27 rounds. The lm with the highest mean accuracy is selected as the
> best parameter and used in the full-sample (n=28) SMLR.
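> As a sketch of that selection step -- with hypothetical helper names, e.g.
> inner_cv_accuracy() standing in for the leave-one-pair cross-validation of
> one outer round, and assuming numpy is imported as np -- the best lm is just
> the one with the highest mean inner accuracy:
>
> lms = [0.2, 0.5, 1.0]
> mean_acc = dict((lm, np.mean([inner_cv_accuracy(outer_round, lm)
>                               for outer_round in range(27)]))
>                 for lm in lms)
> best_lm = max(mean_acc, key=mean_acc.get)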
>
> 2. It would be great to balance the training sets as well (to have an equal
> number per group), but I was not sure whether one set of randomly selected
> subjects from the larger group (matched in size to the smaller group) would
> do, or whether the procedure should be repeated multiple times (to make sure
> that all subjects are included across training sets). I am also not sure
> how to incorporate generation of such balanced random sets into the scripts.
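> One possible way -- a rough numpy sketch with made-up index arrays, not tied
> to the PyMVPA generators -- is to draw several balanced training sets so
> that, across repetitions, every subject of the larger group gets used:
>
> rng = np.random.RandomState(42)
> gr1 = np.arange(16)        # indices of 'better' subjects (larger group)
> gr2 = np.arange(16, 28)    # indices of 'worse' subjects
> balanced_sets = [np.concatenate([rng.choice(gr1, size=len(gr2), replace=False), gr2])
>                  for _ in range(20)]   # e.g., 20 subsampling repetitions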
>
> Processing time may become an issue. It takes slightly more than 2 min to
> run it for 28 subjects (given the 192 splits produced by ChainNode). With more
> training sets (and potentially more subjects or even more unbalanced
> datasets) the processing time could grow prohibitively...
>
> The script with Monte Carlo testing (with 1000 repetitions) described in
> my first e-mail never finished running (well I stopped it after a day or
> so). There should be some more efficient way of creating a null
> distribution for unbalanced datasets...
>
> 3. Below is the output of
> print cv.ca.stats.matrix
> [[141 68]
> [ 51 124]]
>
> Note there are 192 splits.
>
> ACC%=69.01%
> TPR "better" = 0.73
> TPR "worse" = 0.65
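> These numbers follow directly from the matrix above, reading columns as
> targets; each class contributes 192 test predictions, matching the 192
> balanced splits with one test sample per class:
>
> acc        = (141 + 124) / 384.   # 0.6901 -> ACC% = 69.01
> tpr_better = 141 / 192.           # 0.73
> tpr_worse  = 124 / 192.           # 0.65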
>
> The confidence interval (computed using BinomialProportionCI(width=.95,
> meth='jeffreys')) is
> 64.3% < ACC% < 73.5%.
>
>
> 4. So the new error function in https://github.com/PyMVPA/PyMVPA/pull/541
> computes a mean of TPRs?
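> (If so -- assuming that errorfx is indeed the mean of per-class TPRs -- for
> the matrix above it would give np.mean([141/192., 124/192.]) ~= 0.69, i.e.
> the same as ACC here, since the Sifter keeps the test sets balanced.)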
>
> Thank you,
> Anna.
>
>
>
>
> On Tue, Oct 17, 2017 at 4:59 PM, Yaroslav Halchenko <debian at onerussian.com>
> wrote:
>
>>
>> On Wed, 11 Oct 2017, Anna Manelis wrote:
>>
>> > Dear Experts,
>>
>> > I have an unbalanced dataset (group1=16 subjects ('better'), group2=12
>> > subjects ('worse').
>>
>> is there any grouping of subjects, or does each have a dedicated "chunk"?
>>
>> > I would like to perform Monte-Carlo testing of the SMLR
>> > between-subject classification analysis (performed on cope images). I used
>> > the permutation_test.py script as an example and tried to adjust it for my
>> > unbalanced dataset.
>>
>> > Any feedback on whether the script below makes sense is greatly
>> > appreciated.
>>
>> well -- all the disbalanced cases are "tricky" to say the least ;)
>>
>> I think that, for peace of mind, you should also try your pipeline on
>> totally random data (ds.samples = np.random.normal(size=ds.shape)) to make
>> sure that "garbage in, garbage out" holds and nothing comes out "significant"
>> (well, it should only be so in "p" portion of the cases ;))
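>>
>> e.g., something along these lines (a sketch reusing ds_mni_bw and the cv
>> defined in your script, names as in your e-mail):
>>
>> ds_rand = ds_mni_bw.copy()
>> ds_rand.samples = np.random.normal(size=ds_rand.shape)
>> err_rand = cv(ds_rand)                     # full pipeline on pure noise
>> print cv.ca.stats                          # should hover around chance
>> print np.asscalar(err_rand.ca.null_prob)   # should rarely look "significant"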
>>
>> > lm = 0.2  # The value was determined through the nested cross-validation procedure
>>
>> if it was nested, you would most likely end up with a different lm per
>> split. Or in other words: how was it done, and among which lms did you
>> check? Feels like some double dipping might be going on
>>
>> > clf = SMLR(lm=lm, fit_all_weights=False, descr='SMLR(lm=%g)' % lm)
>>
>> > ### how often do we want to shuffle the data
>> > repeater = Repeater(count=200)
>>
>> > ### permute the training part of a dataset exactly ONCE
>> > permutator = AttributePermutator('targets', limit={'partitions': 1}, count=1)
>>
>> > ### Make sure that each group ("better" and "worse") has the same number of samples
>> > par = ChainNode([NFoldPartitioner(cvtype=2, attr='chunks'),
>> >                  Sifter([('partitions', 2),
>> >                          ('targets', dict(uvalues=['better', 'worse'],
>> >                                           balanced=True))])])
>>
>> so, here you are saying to select only the partitionings where you have a
>> balanced # of samples in the testing set... so the training set will remain
>> disbalanced. That would lead SMLR to tend to classify samples as "better".
>> What does your cv.ca.stats look like -- is my fear fulfilled?
>>
>> here is an example:
>>
>> In [51]: import mvpa2.suite as mv
>>
>> In [52]: cv = mv.CrossValidation(mv.SMLR(), mv.NFoldPartitioner(), errorfx=mv.mean_mismatch_error, enable_ca=['stats'])
>>
>> In [53]: ds = mv.normal_feature_dataset(perlabel=10, nlabels=2, nfeatures=4, nonbogus_features=[0,1], snr=0)
>>
>> In [54]: res = cv(ds); print cv.ca.stats  # on random data -- near chance result
>> ----------.
>> predictions\targets   L0    L1
>>        `------       ---   ---   P'   N'   FP   FN   PPV   NPV   TPR   SPC   FDR   MCC    F1   AUC
>>           L0           5     6   11    9    6    5  0.45  0.44   0.5   0.4  0.55  -0.1  0.48   0.3
>>           L1           5     4    9   11    5    6  0.44  0.45   0.4   0.5  0.56  -0.1  0.42   0.3
>> Per target:          ---   ---
>>            P          10    10
>>            N          10    10
>>           TP           5     4
>>           TN           4     5
>> Summary \ Means:     ---   ---   10   10  5.5  5.5  0.45  0.45  0.45  0.45  0.55  -0.1  0.45   0.3
>>            CHI^2     0.4        p=0.94
>>              ACC    0.45
>>             ACC%      45
>>        # of sets       5        ACC(i) = 0.6-0.075*i p=0.32 r=-0.57 r^2=0.32
>>
>>
>> In [55]: ds_ = ds[[0, 2, 4, 6, 8, 10] + range(11, 20)]
>>
>> In [56]: print ds_.summary()
>> Dataset: 15x4@float64, <sa: chunks,targets>, <fa: nonbogus_targets>, <a: bogus_features,nonbogus_features>
>> stats: mean=0.00169964 std=0.315902 var=0.0997943 min=-0.600959 max=0.737558
>>
>> Counts of targets in each chunk:
>> chunks\targets L0 L1
>> --- ---
>> 0 1 2
>> 1 1 2
>> 2 1 2
>> 3 1 2
>> 4 1 2
>>
>> Summary for targets across chunks
>> targets mean std min max #chunks
>> L0 1 0 1 1 5
>> L1 2 0 2 2 5
>>
>> Summary for chunks across targets
>> chunks mean std min max #targets
>> 0 1.5 0.5 1 2 2
>> 1 1.5 0.5 1 2 2
>> 2 1.5 0.5 1 2 2
>> 3 1.5 0.5 1 2 2
>> 4 1.5 0.5 1 2 2
>> Sequence statistics for 15 entries from set ['L0', 'L1']
>> Counter-balance table for orders up to 2:
>> Targets/Order O1 | O2 |
>> L0: 4 1 | 3 2 |
>> L1: 0 9 | 0 8 |
>> Correlations: min=-0.5 max=0.7 mean=-0.071 sum(abs)=5.8
>>
>> In [57]: res = cv(ds_); print cv.ca.stats  # on random and disbalanced data
>> ----------.
>> predictions\targets   L0    L1
>>        `------       ---   ---   P'   N'   FP   FN   PPV   NPV   TPR   SPC   FDR    MCC    F1   AUC
>>           L0           1     3    4   11    3    4  0.25  0.64   0.2   0.7  0.75  -0.11  0.22   0.6
>>           L1           4     7   11    4    4    3  0.64  0.25   0.7   0.2  0.36  -0.11  0.67   0.6
>> Per target:          ---   ---
>>            P           5    10
>>            N          10     5
>>           TP           1     7
>>           TN           7     1
>> Summary \ Means:     ---   ---  7.5  7.5  3.5  3.5  0.44  0.44  0.45  0.45  0.56  -0.11  0.44   0.6
>>            CHI^2     3.4        p=0.33
>>              ACC    0.53
>>             ACC%   53.33
>>        # of sets       5        ACC(i) = 0.67-0.067*i p=0.65 r=-0.28 r^2=0.08
>>
>>
>> # although the overall accuracy might be low, you can see that the
>> # classifier tends to misclassify into L1
>>
>>
>> and the overall mean accuracy is a "suboptimal" measure in this case.
>>
>> What we could have estimated instead is the mean of the per-class
>> accuracies -- that one behaves better in the case of disbalanced datasets.
>>
>> So in this case, with mean_mismatch_error (mean_match_accuracy reported
>> in the ca.stats) we are getting
>>
>> In [63]: (1+7)/(5.+10)
>> Out[63]: 0.5333333333333333
>>
>> whereas for such cases we should instead use (and maybe in general, while
>> also taking care of chunks, since splits could be disbalanced differently;
>> in this example I just assume a single "measurement"):
>>
>> In [62]: np.mean([1/5., 7/10.])
>> Out[62]: 0.44999999999999996
>>
>> although the numbers here are not really demonstrative, so let's consider a
>> more critical case, where all the samples get classified into the "majority"
>> class:
>>
>> In [70]: cm = mv.ConfusionMatrix()
>>
>> In [71]: cm.add(['L1']*5 + ['L2']*10, ['L2'] * 15)
>>
>> In [72]: print cm
>> ----------.
>> predictions\targets   L1    L2
>>        `------       ---   ---   P'   N'   FP   FN   PPV   NPV   TPR   SPC   FDR   MCC    F1
>>           L1           0     0    0   15    0    5   nan  0.67     0     1   nan     0     0
>>           L2           5    10   15    0    5    0  0.67   nan     1     0  0.33     0   0.8
>> Per target:          ---   ---
>>            P           5    10
>>            N          10     5
>>           TP           0    10
>>           TN          10     0
>> Summary \ Means:     ---   ---  7.5  7.5  2.5  2.5   nan   nan   0.5   0.5   nan     0   0.4
>>            CHI^2      15        p=0.0018
>>              ACC    0.67
>>             ACC%   66.67
>>        # of sets       1
>>
>>
>> whohoo -- 66% correct!
>> but if we compute the mean of per-class accuracies:
>>
>> In [73]: np.mean([0/5., 10/10.])
>> Out[73]: 0.5
>>
>> 1. Unfortunately "lazy" us didn't implement such a dedicated error
>> function yet, BUT you already have that value as part of the stats:
>>
>> In [84]: print cm.stats['mean(TPR)']
>> 0.5
>>
>> which is the mean of the TPRs per label, which in turn is what
>> we are interested in here
>>
>> 2. I am not quite certain, but your approach above (though see the comment
>> about postproc below) may not be without merit. Since you ensure
>> that the test set always has one sample of each class, the training
>> set is always consistently disbalanced, and your null distribution
>> should account for all of that "tend to classify as the majority class"
>> bias, so I expect the p value for the true classification performance to be
>> "legit".
>>
>> what does your chance distribution look like? (distr_est.ca.dist_samples)
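>>
>> e.g., something like this to eyeball it (a sketch, assuming the distr_est
>> and cv from your script have already been run; dist_samples holds the
>> permuted error values, so 1 - error gives the chance-level accuracies):
>>
>> import pylab as pl
>> null_accs = 1 - np.asarray(distr_est.ca.dist_samples).squeeze()
>> pl.hist(null_accs, bins=20)
>> pl.axvline(0.69, color='r')   # the observed accuracy from the real labels
>> pl.show()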
>>
>> meanwhile I will prep the "mean_tpr" function (which should match
>> mean_match_accuracy in the balanced case) and "mean_fnr" (which should match
>> mean_mismatch_error in the balanced case) (if I didn't mix anything
>> up) to use in such cases.
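>>
>> in the meantime, a rough stand-in (plain numpy, not the eventual PyMVPA
>> errorfx; arguments ordered targets-then-predictions, as in the cm.add()
>> example above) could look like:
>>
>> def mean_tpr_standin(targets, predictions):
>>     targets, predictions = np.asarray(targets), np.asarray(predictions)
>>     return np.mean([np.mean(predictions[targets == t] == t)
>>                     for t in np.unique(targets)])
>>
>> print mean_tpr_standin(['L1']*5 + ['L2']*10, ['L2']*15)   # 0.5, as above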
>>
>>
>> > ### I believe this will apply permutator on each fold created by par.
>> > par_perm = ChainNode([par, permutator], space=par.get_space())
>>
>> > # CV with null-distribution estimation that permutes the training data for
>> > # each fold independently
>> > null_cv = CrossValidation(
>> >     clf,
>> >     par_perm,
>> >     errorfx=mean_mismatch_error)
>>
>> such null_cv and cv below will provide one err per split (do print err to
>> see that). You should add postproc=mean_sample() to both of them.
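>>
>> e.g., for the first one (a sketch of just that change; the cv below needs
>> the same postproc=mean_sample() argument added):
>>
>> null_cv = CrossValidation(clf, par_perm,
>>                           errorfx=mean_mismatch_error,
>>                           postproc=mean_sample())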
>>
>> > # Monte Carlo distribution estimator
>> > distr_est = MCNullDist(repeater, tail='left', measure=null_cv,
>> > enable_ca=['dist_samples'])
>>
>> > # actual CV with H0 distribution estimation
>> > cv = CrossValidation(clf, par, errorfx=mean_mismatch_error,
>> > null_dist=distr_est, enable_ca=['stats'])
>>
>>
>> > err = cv(ds_mni_bw)
>>
>> > p = err.ca.null_prob
>> > np.asscalar(p)
>>
>> --
>> Yaroslav O. Halchenko
>> Center for Open Neuroscience http://centerforopenneuroscience.org
>> Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
>> Phone: +1 (603) 646-9834 Fax: +1 (603) 646-1419
>> WWW: http://www.linkedin.com/in/yarik
>>
>
>