[pymvpa] multiple comparison in classification: theoretical question

Emanuele Olivetti emanuele at relativita.com
Mon May 5 10:08:42 UTC 2014


On 05/01/2014 05:26 PM, Yaroslav Halchenko wrote:
> On Tue, 29 Apr 2014, Francisco Pereira wrote:
>>     If you try one analysis -- let's say classification -- and use
>>     cross-validation you get one result and an associated p-value under a
>>     particular null hypothesis. Your result might have occurred by chance, and
>>     your p-value tells you what the probability of getting the result you got
>>     was if the null were true. This is exactly like the situation you would
>>     have if not using cross-validation; whether you are using it only changes
>>     how your p-value is computed.
>>     If you were to do the analysis multiple times or with multiple options you
>>     would have a set of results instead of a single one, and your concern
>>     would be entirely valid. This is where you would need to use a multiple
>>     comparison correction and, possibly, report all the results (your
>>     correction can take into account the fact that they are likely
>>     correlated).
> +1 and AFAIK no one does it this 'correct' way, thus I am often very
> skeptical about many papers demonstrating some weak but "significant"
> effects.
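
To make that 'correct' way concrete, here is a toy sketch of a max-statistic
permutation test, which corrects for several, possibly correlated, analyses
at once: permute the labels, recompute every analysis, and compare each
observed score against the null distribution of the per-permutation maximum.
It uses scikit-learn rather than PyMVPA, and the dataset and the pair of
"analyses" are made up purely for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=20, random_state=0)
analyses = [SVC(), LogisticRegression(max_iter=1000)]  # the "multiple analyses"

def cv_scores(labels):
    # Mean cross-validated accuracy of every analysis for the given labels.
    return np.array([cross_val_score(clf, X, labels, cv=5).mean()
                     for clf in analyses])

observed = cv_scores(y)
rng = np.random.default_rng(0)
# Null distribution of the *maximum* score across analyses; comparing each
# observed score to it gives family-wise corrected p-values that respect
# the correlation between the analyses.
null_max = np.array([cv_scores(rng.permutation(y)).max() for _ in range(200)])
p_corrected = [(1 + (null_max >= obs).sum()) / (1.0 + len(null_max))
               for obs in observed]
print(dict(zip(["SVC", "LogReg"], p_corrected)))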

Having multiple results from different choices in the analysis pipeline is
a common situation, and inferring something from them is a common problem.
If you try 100 different methods (or model parameters) and pick the one
with the highest score, e.g. accuracy, then you are choosing the one that is
best given the data you provided. (Note: you might want to account for the
variance of that score, i.e. there may be ties in the ranking.)
But if you then want to report an "unbiased" score for that best method, you
need a new test set to compute it: the score you used for the ranking is
optimistically biased, because the selection overfits the initial dataset.
See for example: http://dx.doi.org/10.1109/WBD.2010.9
(disclaimer: I am the author of the paper)
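
A minimal sketch of this point, using scikit-learn and a synthetic dataset
purely for illustration: the score used to rank many candidate methods is
optimistically biased for the winner, so the winner needs a held-out test
set for an unbiased estimate.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
# Hold out a test set that plays no role in the ranking.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.3,
                                                random_state=0)

# "100 different methods": here simply 100 values of C for an SVM.
candidates = [SVC(C=c) for c in np.logspace(-3, 3, 100)]
ranking_scores = [cross_val_score(clf, X_dev, y_dev, cv=5).mean()
                  for clf in candidates]
best = candidates[int(np.argmax(ranking_scores))]

# Score used for the ranking: optimistically biased for the winner ...
print("ranking score of the best method:", max(ranking_scores))
# ... versus an estimate on data never used for the ranking.
best.fit(X_dev, y_dev)
print("held-out test score:", best.score(X_test, y_test))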

>>     A separate issue is how to use cross-validation to get around the issue of
>>     trying multiple analyses options (e.g. regularization parameter settings).
>>     The idea there is running nested cross-validations within the training set
>>     to test those analysis options, pick one setting and then use that setting
>>     to analyse the test set (and repeat this for every cross-validation fold).
> +1  but nested cross-validation with model selection might become too
> 'flexible', thus again overfitting your data.  So I guess it all would
> depend on how big the space of models explored is.

I mildly disagree. If your aim is to find the best model, then - within the
limits of the ranking you can get from nested cross-validation, and taking
into account the variance of that estimate, thus possible ties - I would not
expect overfitting, whatever the size of the space of models.
But if you need an unbiased score, e.g. accuracy, for that best model, after
having used the data to decide that it is the best, then you need extra test
data, not used before, to compute it. Computed on such held-out data, I would
not expect that score to overfit.
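
For completeness, a minimal nested cross-validation sketch (again
scikit-learn, with an illustrative parameter grid): the inner loop selects
the model, the outer loop scores that selection procedure on data it never
saw.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Inner loop: model selection over C within each training split.
inner = GridSearchCV(SVC(), param_grid={"C": np.logspace(-3, 3, 7)}, cv=5)
# Outer loop: each fold refits the whole selection procedure on its training
# split, so the outer score is not biased by the model choice.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy: %.2f +/- %.2f"
      % (outer_scores.mean(), outer_scores.std()))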

>
> another idea which I have exercised once but didn't push yet forward, so
> would not mind collaborating with someone interested:  derive statistics
> on the 'mean' (could be weighted/sum) generalization performance(s)
> across different estimators.  In an ongoing study I got literally sick
> of trying to make sense of different models (varying classification
> schemes/feature extractions), that is why I did statistics across
> subjects on the 'mean' among those per-subject metrics: here is the
> poster presenting those results
> http://haxbylab.dartmouth.edu/publications/HGG+12_sfn12_famfaces.png
> there were obvious 'cons' on how I have done it, but I think this can
> serve as the base for doing it 'right'.
>
> By relying on average performance across different metrics I hoped to
> reduce (false positives) noise of a single estimation, while bringing
> out consistent results which might be weak but consistently
> detected by various models.
>

I guess that in practice this is acceptable. But in principle you are
averaging over rather different things. What qualifies an algorithm, e.g. a
feature extraction/selection algorithm, a classifier, or something else, to
enter the set you are considering and then averaging over? Would a random
feature selection algorithm or a random classifier enter your set of
different models? If not, why not? :)
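
Just to fix ideas on the averaging scheme above, a toy sketch: the accuracy
matrix below is random stand-in data, but in practice each row would hold
one subject's cross-validated accuracies for the different models, averaged
within subject and then tested across subjects against chance.

import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
n_subjects, n_models = 20, 6
chance = 0.5
# accuracies[s, m]: accuracy of model m for subject s (stand-in values here).
accuracies = chance + 0.05 * rng.standard_normal((n_subjects, n_models))

# Average over models within each subject, then test across subjects whether
# the mean performance differs from chance.
per_subject_mean = accuracies.mean(axis=1)
t, p = ttest_1samp(per_subject_mean, popmean=chance)
print("t = %.2f, p = %.3f" % (t, p))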

I like this topic, and I hope to keep it rolling ;)

Best,

Emanuele



