[pymvpa] multiple comparison in classification: theoretical question
Yaroslav Halchenko
debian at onerussian.com
Thu May 1 15:26:46 UTC 2014
On Tue, 29 Apr 2014, Francisco Pereira wrote:
> If you try one analysis -- let's say classification -- and use
> cross-validation you get one result and an associated p-value under a
> particular null hypothesis. Your result might have occurred by chance, and
> your p-value tells you what the probability of getting the result you got
> was if the null were true. This is exactly like the situation you would
> have if not using cross-validation; whether you are using it only changes
> how your p-value is computed.
> If you were to do the analysis multiple times or with multiple options you
> would have a set of results instead of a single one, and your concern
> would be entirely valid. This is where you would need to use a multiple
> comparison correction and, possibly, report all the results (your
> correction can take into account the fact that they are likely
> correlated).
+1 and AFAIK no one does it this 'correct' way, which is why I am often
quite skeptical of papers demonstrating some weak but "significant"
effects.
> A separate issue is how to use cross-validation to get around the issue of
> trying multiple analysis options (e.g. regularization parameter settings).
> The idea there is running nested cross-validations within the training set
> to test those analysis options, pick one setting and then use that setting
> to analyse the test set (and repeat this for every cross-validation fold).
+1, but nested cross-validation with model selection might become too
'flexible' and thus again overfit your data. So I guess it all depends
on how big the space of explored models is.
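Concretely, the nested scheme is along these lines (a scikit-learn
sketch rather than PyMVPA, with stand-in data; the parameter grid is the
'space of models' the inner loop is allowed to explore):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV, cross_val_score

    # stand-in data; substitute the real feature matrix and labels
    rng = np.random.RandomState(0)
    X, y = rng.randn(100, 50), rng.randint(0, 2, 100)

    # inner loop: pick the regularization parameter on training folds only
    inner = GridSearchCV(SVC(kernel='linear'),
                         param_grid={'C': [0.01, 0.1, 1, 10]},
                         cv=5)
    # outer loop: estimate generalization of the whole selection procedure
    outer_scores = cross_val_score(inner, X, y, cv=5)
    print(outer_scores.mean(), outer_scores.std())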
Another idea, which I have exercised once but have not yet pushed
forward (so I would not mind collaborating with someone interested):
derive statistics on the 'mean' (could be a weighted mean or a sum)
generalization performance across different estimators. In an ongoing
study I got literally sick of trying to make sense of different models
(varying classification schemes/feature extractions), which is why I
did statistics across subjects on the 'mean' of those per-subject
metrics; here is the poster presenting those results:
http://haxbylab.dartmouth.edu/publications/HGG+12_sfn12_famfaces.png
There were obvious 'cons' to how I did it, but I think it can serve as
the base for doing it 'right'.
By relying on the average performance across different estimators, I
hoped to reduce the noise (false positives) of any single estimation
while bringing out effects which might be weak but are consistently
detected by various models.
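For reference, the mechanics of that 'mean across estimators' summary
would be something like the following sketch (array names and numbers
are made up; 0.5 assumes a balanced two-class problem):

    import numpy as np
    from scipy import stats

    # hypothetical accuracies, shape (n_subjects, n_models): each model's
    # cross-validated accuracy for each subject
    rng = np.random.RandomState(0)
    accuracies = 0.5 + 0.05 * rng.randn(12, 6)

    per_subject = accuracies.mean(axis=1)        # 'mean' across estimators
    t, p = stats.ttest_1samp(per_subject, popmean=0.5)  # test against chance

A plain t-test on accuracies is itself debatable (they are bounded and
not really Gaussian), so a sign or permutation test on the per-subject
means would probably be more defensible, but the summarizing step is
the same.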
--
Yaroslav O. Halchenko, Ph.D.
http://neuro.debian.net http://www.pymvpa.org http://www.fail2ban.org
Senior Research Associate, Psychological and Brain Sciences Dept.
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834 Fax: +1 (603) 646-1419
WWW: http://www.linkedin.com/in/yarik