[pymvpa] permutation test for unbalanced data
Anna Manelis
anna.manelis at gmail.com
Tue Oct 24 12:37:29 UTC 2017
Thank you Richard. This is very helpful.
Anna.
On Tue, Oct 24, 2017 at 8:32 AM, Richard Dinga <dinga92 at gmail.com> wrote:
> > well -- all the imbalanced cases are "tricky", to say the least ;)
>
> Imbalanced cases are fine -- just use AUC or some other measure that does not
> threshold the predictions, and all your problems will go away.
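>
> A minimal sketch of that point, using scikit-learn's roc_auc_score rather
> than anything from PyMVPA (group sizes 16/12 as in your sample): a trivial
> majority-class rule already scores well on plain accuracy, while the AUC of
> uninformative scores stays near chance.
>
> import numpy as np
> from sklearn.metrics import roc_auc_score
>
> rng = np.random.RandomState(0)
> y_true = np.array([1] * 16 + [0] * 12)      # 16 "better" vs 12 "worse"
> y_score = rng.normal(size=y_true.shape)     # uninformative classifier scores
>
> y_pred_majority = np.ones_like(y_true)      # always predict the majority class
> print np.mean(y_pred_majority == y_true)    # ~0.57 "accuracy" for free
> print roc_auc_score(y_true, y_score)        # ~0.5, i.e. chance, regardless of imbalance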
>
> On Wed, Oct 18, 2017 at 5:21 AM, Anna Manelis <anna.manelis at gmail.com>
> wrote:
>
>> Thanks for your feedback!
>>
>> 1. Nested cross-validation is done across subjects. For the set of 28
>> subjects (16 in gr1 + 12 in gr2), 27 subjects are selected and SMLR
>> accuracies are computed for several lm values (e.g., 0.2, 0.5, 1.0) using a
>> leave-one-pair-of-subjects-from-different-groups cross-validation (so
>> train on 25 subjects, test on 2). This procedure is repeated 27 times (a
>> different subject is left out in each round of NCV). The mean accuracy is
>> computed across all splits within each round and across all 27 rounds. The
>> lm corresponding to the highest mean accuracy is selected as the best
>> parameter and used for the full-sample (n=28) SMLR.
>>
>> 2. It would be great to balance the training sets as well (to have an equal
>> number of subjects per group), but I was not sure whether a single set of
>> randomly selected subjects from the larger group (equal in size to the
>> smaller group) would do, or whether the selection should be repeated
>> multiple times (to make sure that all subjects get included across training
>> sets). I am also not sure how to incorporate the generation of such equal
>> random sets into the scripts; a rough sketch of what I imagine is below.
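>>
>> The kind of thing I imagine (an untested sketch -- assuming PyMVPA's
>> Balancer generator can be chained after the partitioner this way, which I
>> am not certain about) would draw several balanced subsets of the training
>> partition, instead of (or in addition to) the Sifter-based par from my
>> script, so that across repetitions most subjects of the larger group get
>> used:
>>
>> from mvpa2.suite import NFoldPartitioner, Balancer, ChainNode
>>
>> par_bal = ChainNode([NFoldPartitioner(cvtype=2, attr='chunks'),
>>                      Balancer(attr='targets',
>>                               count=10,             # 10 balanced draws per partitioning
>>                               limit='partitions',   # balance within each partition
>>                               apply_selection=True)],
>>                     space='partitions')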
>>
>> Processing time may become an issue. It already takes slightly more than 2
>> minutes to run for 28 subjects (given the 192 splits produced by the
>> ChainNode). With more training sets (and potentially more subjects or even
>> more unbalanced datasets) the processing time could grow prohibitively...
>>
>> The script with Monte Carlo testing (1000 repetitions) described in my
>> first e-mail never finished running (well, I stopped it after a day or
>> so). There should be a more efficient way of creating a null
>> distribution for unbalanced datasets...
>>
>> 3. Below is the output of print cv.ca.stats.matrix (note there are 192
>> splits):
>>
>> [[141  68]
>>  [ 51 124]]
>>
>> ACC% = 69.01%
>> TPR "better" = 0.73
>> TPR "worse" = 0.65
>>
>> Confidence intervals (computed using
>> BinomialProportionCI(width=.95, meth='jeffreys')) are
>> 64.3% < ACC% < 73.5%.
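>>
>> As a cross-check on the matrix above (columns taken as targets), the
>> per-class TPRs and their mean can be recomputed directly; because the
>> Sifter keeps the test sets balanced, both columns sum to 192, so the mean
>> of the TPRs coincides with the overall accuracy:
>>
>> import numpy as np
>>
>> cm = np.array([[141,  68],
>>                [ 51, 124]])                   # rows: predictions, columns: targets
>> tpr = np.diag(cm) / cm.sum(axis=0).astype(float)
>> print tpr                                     # ~[0.73  0.65]
>> print tpr.mean()                              # ~0.69
>> print np.diag(cm).sum() / float(cm.sum())     # ~0.69, same as ACC%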
>>
>>
>> 4. So that new error function in
>> https://github.com/PyMVPA/PyMVPA/pull/541 computes a mean of TPRs?
>>
>> Thank you,
>> Anna.
>>
>>
>>
>>
>> On Tue, Oct 17, 2017 at 4:59 PM, Yaroslav Halchenko <debian at onerussian.com> wrote:
>>
>>>
>>> On Wed, 11 Oct 2017, Anna Manelis wrote:
>>>
>>> > Dear Experts,
>>>
>>> > I have an unbalanced dataset (group1=16 subjects ('better'), group2=12
>>> > subjects ('worse').
>>>
>>> is there any grouping of subjects, or does each have a dedicated "chunk"?
>>>
>>> > I would like to perform Monte-Carlo testing of the SMLR
>>> > between-subject classification analysis (performed on cope images). I
>>> > used the permutation_test.py script as an example and tried to adjust
>>> > it for my unbalanced dataset.
>>>
>>> > Any feedback on whether the script below makes sense is greatly
>>> > appreciated.
>>>
>>> well -- all the imbalanced cases are "tricky", to say the least ;)
>>>
>>> I think that, for peace of mind, you should also try your pipeline on
>>> totally random data (ds.samples = np.random.normal(size=ds.shape)) to make
>>> sure that "garbage in, garbage out" holds and nothing comes out
>>> "significant" (well, it should in only a "p" fraction of the cases ;)).
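>>>
>>> Something along these lines (just a sketch, using the dataset and cv from
>>> your script below, and assuming the conditional attributes are enabled as
>>> you have them):
>>>
>>> import numpy as np
>>>
>>> # copy the dataset and replace the samples with pure noise, keeping
>>> # targets/chunks (and thus the imbalance) exactly as they are
>>> ds_null = ds_mni_bw.copy(deep=True)
>>> ds_null.samples = np.random.normal(size=ds_null.shape)
>>>
>>> # run the very same cross-validation (with the MCNullDist attached)
>>> err_null = cv(ds_null)
>>> print err_null.samples        # per-split errors on noise: should hover around chance
>>> print cv.ca.null_prob         # should be "non-significant" in most such runs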
>>>
>>> > lm = 0.2  # The value was determined through the nested cross-validation procedure
>>>
>>> if it was nested, you most likely ended up with a different lm per
>>> split. Or, in other words: how was it done, and among which lm values did
>>> you check? It feels like some double dipping might be going on.
>>>
>>> > clf = SMLR(lm=lm, fit_all_weights=False, descr='SMLR(lm=%g)' % lm)
>>>
>>> > ### how often do we want to shuffle the data
>>> > repeater = Repeater(count=200)
>>>
>>> > ### permute the training part of a dataset exactly ONCE
>>> > permutator = AttributePermutator('targets', limit={'partitions': 1},
>>> >                                  count=1)
>>>
>>> > ### Make sure that each group ("better" and "worse") has the same
>>> > ### number of samples
>>> > par = ChainNode([NFoldPartitioner(cvtype=2, attr='chunks'),
>>> >                  Sifter([('partitions', 2),
>>> >                          ('targets', dict(uvalues=['better', 'worse'],
>>> >                                           balanced=True))])
>>> >                  ])
>>>
>>> so here you are saying to select the partitionings where you have a
>>> balanced # of samples in the testing set... so the training set will remain
>>> imbalanced. That would lead SMLR to tend to classify as "better".
>>> What does your cv.ca.stats look like -- is my fear fulfilled?
>>>
>>> here is an example:
>>>
>>> In [51]: import mvpa2.suite as mv
>>>
>>> In [52]: cv = mv.CrossValidation(mv.SMLR(), mv.NFoldPartitioner(),
>>>    ....:                         errorfx=mv.mean_mismatch_error, enable_ca=['stats'])
>>>
>>> In [53]: ds = mv.normal_feature_dataset(perlabel=10, nlabels=2, nfeatures=4,
>>>    ....:                                nonbogus_features=[0, 1], snr=0)
>>>
>>> In [54]: res = cv(ds); print cv.ca.stats   # on random data -- near-chance result
>>> ----------.
>>> predictions\targets   L0   L1
>>>          `------     ---  ---   P'  N'  FP  FN  PPV  NPV  TPR  SPC  FDR   MCC    F1  AUC
>>>         L0             5    6   11   9   6   5 0.45 0.44  0.5  0.4 0.55  -0.1  0.48  0.3
>>>         L1             5    4    9  11   5   6 0.44 0.45  0.4  0.5 0.56  -0.1  0.42  0.3
>>> Per target:          ---  ---
>>>         P             10   10
>>>         N             10   10
>>>         TP             5    4
>>>         TN             4    5
>>> Summary \ Means:     ---  ---   10  10 5.5 5.5 0.45 0.45 0.45 0.45 0.55  -0.1  0.45  0.3
>>>        CHI^2         0.4  p=0.94
>>>         ACC         0.45
>>>        ACC%           45
>>>   # of sets            5  ACC(i) = 0.6-0.075*i  p=0.32 r=-0.57 r^2=0.32
>>>
>>>
>>> In [55]: ds_ = ds[[0, 2, 4, 6, 8, 10] + range(11, 20)]
>>>
>>> In [56]: print ds_.summary()
>>> Dataset: 15x4@float64, <sa: chunks,targets>, <fa: nonbogus_targets>,
>>>          <a: bogus_features,nonbogus_features>
>>> stats: mean=0.00169964 std=0.315902 var=0.0997943 min=-0.600959 max=0.737558
>>>
>>> Counts of targets in each chunk:
>>> chunks\targets L0 L1
>>> --- ---
>>> 0 1 2
>>> 1 1 2
>>> 2 1 2
>>> 3 1 2
>>> 4 1 2
>>>
>>> Summary for targets across chunks
>>> targets mean std min max #chunks
>>> L0 1 0 1 1 5
>>> L1 2 0 2 2 5
>>>
>>> Summary for chunks across targets
>>> chunks mean std min max #targets
>>> 0 1.5 0.5 1 2 2
>>> 1 1.5 0.5 1 2 2
>>> 2 1.5 0.5 1 2 2
>>> 3 1.5 0.5 1 2 2
>>> 4 1.5 0.5 1 2 2
>>> Sequence statistics for 15 entries from set ['L0', 'L1']
>>> Counter-balance table for orders up to 2:
>>> Targets/Order O1 | O2 |
>>> L0: 4 1 | 3 2 |
>>> L1: 0 9 | 0 8 |
>>> Correlations: min=-0.5 max=0.7 mean=-0.071 sum(abs)=5.8
>>>
>>> In [57]: res = cv(ds_); print cv.ca.stats   # on random and imbalanced data
>>> ----------.
>>> predictions\targets   L0   L1
>>>          `------     ---  ---   P'  N'  FP  FN  PPV  NPV  TPR  SPC  FDR    MCC    F1  AUC
>>>         L0             1    3    4  11   3   4 0.25 0.64  0.2  0.7 0.75  -0.11  0.22  0.6
>>>         L1             4    7   11   4   4   3 0.64 0.25  0.7  0.2 0.36  -0.11  0.67  0.6
>>> Per target:          ---  ---
>>>         P              5   10
>>>         N             10    5
>>>         TP             1    7
>>>         TN             7    1
>>> Summary \ Means:     ---  ---  7.5 7.5 3.5 3.5 0.44 0.44 0.45 0.45 0.56  -0.11  0.44  0.6
>>>        CHI^2         3.4  p=0.33
>>>         ACC         0.53
>>>        ACC%        53.33
>>>   # of sets            5  ACC(i) = 0.67-0.067*i  p=0.65 r=-0.28 r^2=0.08
>>>
>>>
>>> # although the overall accuracy might be low, you can see that the
>>> # classifier tends to misclassify into L1
>>>
>>> And the overall mean accuracy is a "suboptimal" measure in this case.
>>>
>>> What we could have estimated instead is the mean of the per-class
>>> accuracies -- that one behaves better in the case of imbalanced datasets.
>>>
>>> So in this case, with mean_mismatch_error (mean_match_accuracy is what is
>>> reported in the ca.stats) we are getting
>>>
>>> In [63]: (1+7)/(5.+10)
>>> Out[63]: 0.5333333333333333
>>>
>>> whereas what we should use for such cases (and maybe in general, while also
>>> taking care of chunks, since splits could be imbalanced differently;
>>> in this example I just assume a single "measurement") is:
>>>
>>> In [62]: np.mean([1/5., 7/10.])
>>> Out[62]: 0.44999999999999996
>>>
>>> although the numbers here are not really demonstrative, so let's consider a
>>> more critical case, where all the samples get classified into the "majority"
>>> class:
>>>
>>> In [70]: cm = mv.ConfusionMatrix()
>>>
>>> In [71]: cm.add(['L1']*5 + ['L2']*10, ['L2'] * 15)
>>>
>>> In [72]: print cm
>>> ----------.
>>> predictions\targets   L1   L2
>>>          `------     ---  ---   P'  N'  FP  FN  PPV  NPV  TPR  SPC  FDR  MCC   F1
>>>         L1             0    0    0  15   0   5  nan 0.67    0    1  nan    0    0
>>>         L2             5   10   15   0   5   0 0.67  nan    1    0 0.33    0  0.8
>>> Per target:          ---  ---
>>>         P              5   10
>>>         N             10    5
>>>         TP             0   10
>>>         TN            10    0
>>> Summary \ Means:     ---  ---  7.5 7.5 2.5 2.5  nan  nan  0.5  0.5  nan    0  0.4
>>>        CHI^2          15  p=0.0018
>>>         ACC         0.67
>>>        ACC%        66.67
>>>   # of sets            1
>>>
>>>
>>> woohoo -- 66% correct!
>>> But if we compute the mean of the per-class accuracies:
>>>
>>> In [73]: np.mean([0/5., 10/10.])
>>> Out[73]: 0.5
>>>
>>> 1. Unfortunately, "lazy" us haven't implemented such a dedicated error
>>> function yet, BUT you already have that value as part of the confusion
>>> matrix stats:
>>>
>>> In [84]: print cm.stats['mean(TPR)']
>>> 0.5
>>>
>>> which is the mean of the TPRs per label -- which in turn is what
>>> we are interested in here.
>>>
>>> 2. Your above approach (though see the comment below about postproc) may
>>> actually not be without grounds. Since you ensure that the test set always
>>> has one sample of each class, the training set is always consistently
>>> imbalanced, and your null distribution should account for that "tends to
>>> classify as the majority class" bias, so I expect the p-value for the
>>> true classification performance to be "legit".
>>>
>>> What does your chance distribution look like? (distr_est.ca.dist_samples)
>>>
>>> Meanwhile I will prep a "mean_tpr" function (which should match
>>> mean_match_accuracy in the balanced case) and a "mean_fnr" function (which
>>> should match mean_mismatch_error in the balanced case), if I didn't mix
>>> anything up, to use in such cases.
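>>>
>>> Conceptually it would be something along these lines (an untested sketch,
>>> not necessarily what will land in the pull request):
>>>
>>> import numpy as np
>>>
>>> def mean_tpr_sketch(predicted, targets):
>>>     """Mean of the per-class true positive rates (aka balanced accuracy).
>>>
>>>     Untested sketch, for illustration only.
>>>     """
>>>     predicted = np.asarray(predicted)
>>>     targets = np.asarray(targets)
>>>     tprs = [np.mean(predicted[targets == t] == t)
>>>             for t in np.unique(targets)]
>>>     return np.mean(tprs)
>>>
>>> # the degenerate example above: everything classified as the majority class
>>> print mean_tpr_sketch(['L2'] * 15, ['L1'] * 5 + ['L2'] * 10)   # -> 0.5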
>>>
>>>
>>> > ### I believe this will apply permutator on each fold created by par.
>>> > par_perm = ChainNode([par, permutator], space=par.get_space())
>>>
>>> > # CV with null-distribution estimation that permutes the training data
>>> > # for each fold independently
>>> > null_cv = CrossValidation(clf,
>>> >                           par_perm,
>>> >                           errorfx=mean_mismatch_error)
>>>
>>> such a null_cv, and the cv below, will provide an err per each split (do
>>> "print err" to see that). You should add postproc=mean_sample() to both of
>>> them; see the sketch after the quoted script.
>>>
>>> > # Monte Carlo distribution estimator
>>> > distr_est = MCNullDist(repeater, tail='left', measure=null_cv,
>>> >                        enable_ca=['dist_samples'])
>>>
>>> > # actual CV with H0 distribution estimation
>>> > cv = CrossValidation(clf, par, errorfx=mean_mismatch_error,
>>> >                      null_dist=distr_est, enable_ca=['stats'])
>>>
>>>
>>> > err = cv(ds_mni_bw)
>>>
>>> > p = err.ca.null_prob
>>> > np.asscalar(p)
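>>>
>>> i.e. with the postproc added, those two definitions would become something
>>> along these lines (just a sketch, keeping your clf, par, par_perm and
>>> repeater as defined above, and still with mean_mismatch_error until the
>>> mean_tpr/mean_fnr helpers are available):
>>>
>>> from mvpa2.suite import (CrossValidation, MCNullDist, mean_sample,
>>>                          mean_mismatch_error)
>>>
>>> null_cv = CrossValidation(clf, par_perm,
>>>                           errorfx=mean_mismatch_error,
>>>                           postproc=mean_sample())   # collapse per-split errors into one mean
>>>
>>> distr_est = MCNullDist(repeater, tail='left', measure=null_cv,
>>>                        enable_ca=['dist_samples'])
>>>
>>> cv = CrossValidation(clf, par,
>>>                      errorfx=mean_mismatch_error,
>>>                      postproc=mean_sample(),        # so err is a single value per CV run
>>>                      null_dist=distr_est, enable_ca=['stats'])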
>>>
>>> --
>>> Yaroslav O. Halchenko
>>> Center for Open Neuroscience http://centerforopenneuroscience.org
>>> Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
>>> Phone: +1 (603) 646-9834 Fax: +1 (603) 646-1419
>>> WWW: http://www.linkedin.com/in/yarik
>>>