[pymvpa] cross-validation

Emanuele Olivetti emanuele at relativita.com
Tue Mar 16 10:20:22 UTC 2010


Hi Jonas, Yarik and Francisco,

I am partly working on this topic of bias in error estimation so this
discussion is very interesting for me, as well as the references.

I would like to stress that when you estimate the error rate of a
learning algorithm (e.g., LinearCSVMC(C=1.0)) you are not getting a
single classifier (e.g., LinearCSVMC(C=1.0) with given support
vectors) out of it, unless a single train/test split is used (the
"simplest" scenario Francisco mentioned). In a resampling scheme
(e.g., cross-validation) you fit a classifier on the training set of
each fold and estimate the error on the test set of that fold.
Averaging the error rate over the folds gives you an estimate of the
expected error of the learning algorithm, not of any single
classifier. The idea is that applying the same learning algorithm to
the whole dataset would give a "similar" error rate when tested on a
brand new dataset drawn from the same distribution. For SVMs (and some
other learners) the leave-one-out estimator is provably almost
unbiased [0]. Yarik, you mentioned issues with bias in this case; can
you send me some references on that? I am eagerly collecting
information on the topic.
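To make the distinction concrete, here is a toy numpy sketch of
k-fold cross-validation (not PyMVPA code; a nearest-class-mean rule
stands in for the learning algorithm). Each fold refits the algorithm
from scratch, so the fold errors describe the algorithm, not any one
fitted classifier:

```python
import numpy as np

def nearest_mean_cv_errors(X, y, n_folds):
    """Toy k-fold cross-validation: refit a nearest-class-mean
    classifier on each training fold and record the error rate
    on the corresponding test fold."""
    rng = np.random.RandomState(0)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    fold_errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # "Learning algorithm": estimate one mean per class on the train fold
        means = {c: X[train][y[train] == c].mean(axis=0)
                 for c in np.unique(y[train])}
        # Fold-specific classifier: predict the class with the closest mean
        pred = [min(means, key=lambda c: np.linalg.norm(X[i] - means[c]))
                for i in test]
        fold_errors.append(np.mean(np.asarray(pred) != y[test]))
    # averaging these estimates the algorithm's expected error
    return fold_errors
```

Note that each iteration produces a different classifier (different
class means); only their average error rate is reported.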

In the leave-one-out case you have just one example on which to
compute the error rate of each fold, so that rate can only be (as
Yarik said) 0% or 100%. This is a poor estimate of the error rate for
that fold, but it is only used to compute the final average, which is
then OK. Note that the variance of this average will be large, but I
think this is not a problem: in cross-validation the variance depends
on the number/size of the folds, which is quite arbitrary (and of
little use in most practical cases), since you are free to decide
that number/size.
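The leave-one-out scheme above can be sketched in the same toy setup
(again not PyMVPA code; the learning algorithm is passed in as a
fit-and-predict function). Each fold's "error rate" is computed on a
single example, so it is exactly 0.0 or 1.0, and only the average is
meaningful:

```python
import numpy as np

def loo_errors(X, y, fit_predict):
    """Leave-one-out: each fold tests on a single held-out example,
    so each fold error is 0.0 or 1.0; their mean is the LOO estimate."""
    n = len(y)
    errs = []
    for i in range(n):
        train = np.r_[0:i, i + 1:n]   # all indices except i
        pred = fit_predict(X[train], y[train], X[i:i + 1])
        errs.append(float(pred[0] != y[i]))
    return errs

def nearest_mean(Xtr, ytr, Xte):
    # toy stand-in for the learning algorithm: nearest class mean
    means = {c: Xtr[ytr == c].mean(axis=0) for c in np.unique(ytr)}
    return [min(means, key=lambda c: np.linalg.norm(x - means[c]))
            for x in Xte]
```

Running this, every entry of the returned list is 0.0 or 1.0; the
mean of the list is the (almost unbiased, for some learners) LOO
estimate of the expected error.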

As for the small-sample case, I am reading about it and testing some
code. It seems to lead to high bias under some regimes, which sounds
plausible to me, since small samples cannot represent the problem
properly except in very simple cases. Again, if you have literature
on this please let me know; I can share mine if you are interested.
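As an illustration of the small-sample regime (a toy simulation of my
own, not results from the thread), the sketch below cross-validates a
nearest-class-mean rule on tiny datasets whose labels are pure noise,
so the true error of any classifier is 50%; with n=10 the individual
CV estimates scatter widely around that value:

```python
import numpy as np

rng = np.random.RandomState(42)

def cv_estimate(n, n_folds=5):
    """One cross-validation error estimate on a tiny dataset with
    random labels (true error of any rule is 50%)."""
    X = rng.randn(n, 2)
    y = rng.randint(0, 2, n)
    while len(np.unique(y)) < 2:      # make sure both classes occur
        y = rng.randint(0, 2, n)
    idx = rng.permutation(n)
    folds = np.array_split(idx, n_folds)
    errs = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        if len(np.unique(y[train])) < 2:
            continue                  # degenerate single-class training fold
        means = {c: X[train][y[train] == c].mean(axis=0)
                 for c in np.unique(y[train])}
        pred = [min(means, key=lambda c: np.linalg.norm(X[i] - means[c]))
                for i in test]
        errs.append(np.mean(np.asarray(pred) != y[test]))
    return np.mean(errs)

# Repeat on many tiny datasets: the estimates spread widely around 50%
small_sample_estimates = [cv_estimate(10) for _ in range(200)]
```

The spread (and any systematic offset from 50%) shrinks as n grows,
which is the kind of small-sample behaviour I am trying to
characterize.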

Best,

Emanuele

P.S.: Another reference I would add to the ones suggested by Francisco:
Sudhir Varma, Richard Simon, Bias in error estimation when using
cross-validation for model selection, BMC Bioinformatics, Vol. 7,
No. 1. (23 February 2006), 91.
http://www.biomedcentral.com/1471-2105/7/91

[0]: theorem 12.9 and proposition 7.4 in "Learning with Kernels" by
Scholkopf and Smola, also available in other books and papers.
