[pymvpa] multiple comparison in classification: theoretical question
Yaroslav Halchenko
debian at onerussian.com
Thu May 1 15:26:46 UTC 2014
On Tue, 29 Apr 2014, Francisco Pereira wrote:
> If you try one analysis -- let's say classification -- and use
> cross-validation you get one result and an associated p-value under a
> particular null hypothesis. Your result might have occurred by chance, and
> your p-value tells you what the probability of getting the result you got
> was if the null were true. This is exactly like the situation you would
> have if not using cross-validation; whether you are using it only changes
> how your p-value is computed.
> If you were to do the analysis multiple times or with multiple options you
> would have a set of results instead of a single one, and your concern
> would be entirely valid. This is where you would need to use a multiple
> comparison correction and, possibly, report all the results (your
> correction can take into account the fact that they are likely
> correlated).
+1 and AFAIK no one does it this 'correct' way, which is why I am often
quite skeptical of papers demonstrating some weak but "significant"
effects.
> A separate issue is how to use cross-validation to get around the issue of
> trying multiple analysis options (e.g. regularization parameter settings).
> The idea there is running nested cross-validations within the training set
> to test those analysis options, pick one setting and then use that setting
> to analyse the test set (and repeat this for every cross-validation fold).
+1, but nested cross-validation with model selection might become too
'flexible' and thus again overfit your data. So I guess it all depends
on how big the space of explored models is.
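Concretely, the nested scheme is along these lines (a scikit-learn
sketch rather than PyMVPA, with stand-in data; the parameter grid is the
'space of models' the inner loop is allowed to explore):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV, cross_val_score

    # stand-in data; substitute the real feature matrix and labels
    rng = np.random.RandomState(0)
    X, y = rng.randn(100, 50), rng.randint(0, 2, 100)

    # inner loop: pick the regularization parameter on training folds only
    inner = GridSearchCV(SVC(kernel='linear'),
                         param_grid={'C': [0.01, 0.1, 1, 10]},
                         cv=5)
    # outer loop: estimate generalization of the whole selection procedure
    outer_scores = cross_val_score(inner, X, y, cv=5)
    print(outer_scores.mean(), outer_scores.std())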
Another idea, which I have exercised once but have not yet pushed
forward (so I would not mind collaborating with someone interested):
derive statistics on the 'mean' (could be a weighted mean or a sum)
generalization performance across different estimators. In an ongoing
study I got literally sick of trying to make sense of different models
(varying classification schemes/feature extractions), which is why I
did statistics across subjects on the 'mean' of those per-subject
metrics; here is the poster presenting those results:
http://haxbylab.dartmouth.edu/publications/HGG+12_sfn12_famfaces.png
There were obvious 'cons' to how I did it, but I think it can serve as
the base for doing it 'right'.
By relying on the average performance across different estimators, I
hoped to reduce the noise (false positives) of any single estimation
while bringing out effects which might be weak but are consistently
detected by various models.
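For reference, the mechanics of that 'mean across estimators' summary
would be something like the following sketch (array names and numbers
are made up; 0.5 assumes a balanced two-class problem):

    import numpy as np
    from scipy import stats

    # hypothetical accuracies, shape (n_subjects, n_models): each model's
    # cross-validated accuracy for each subject
    rng = np.random.RandomState(0)
    accuracies = 0.5 + 0.05 * rng.randn(12, 6)

    per_subject = accuracies.mean(axis=1)        # 'mean' across estimators
    t, p = stats.ttest_1samp(per_subject, popmean=0.5)  # test against chance

A plain t-test on accuracies is itself debatable (they are bounded and
not really Gaussian), so a sign or permutation test on the per-subject
means would probably be more defensible, but the summarizing step is
the same.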
--
Yaroslav O. Halchenko, Ph.D.
http://neuro.debian.net http://www.pymvpa.org http://www.fail2ban.org
Senior Research Associate, Psychological and Brain Sciences Dept.
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834 Fax: +1 (603) 646-1419
WWW: http://www.linkedin.com/in/yarik