[pymvpa] cross-validation

Yaroslav Halchenko debian at onerussian.com
Mon Mar 15 18:58:07 UTC 2010


On Mon, 15 Mar 2010, Jonas Kaplan wrote:


> All trials (recall events) are acquired in one continuous scan,
> although they do report some control analyses to test for independence
> of the trials.  I believe each 'trial' that goes into the classifier
> is either a single bold volume or a combination of volumes from one
> recall event, although I can't quite figure this out from the paper.
> Aside from the independence issue though, do you see a problem with
> testing each classifier with only one trial?
will have a look at the paper later on... as for testing in each split
on a single left-out sample: it is the traditional leave-one-out
cross-validation used in statistical learning. I recall there were
papers showing that it can be biased (I don't remember the references
atm). it works fine most of the time for people who have lots of
samples and are not trying to prove that 0.56 is significant... now, if
I take only 1 sample (with 1 label) out and test a dummy classifier
which during training just counts the labels of the samples it has seen
and goes for the majority (or the minority, although that is less
probable ;-)), my cross-validation would be heavily biased toward
either perfect mis-classification or 100%-correct classification (the
latter if it goes for the minority and the labels were originally
balanced) ;-)
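
to make it concrete, here is a minimal sketch (plain NumPy, not PyMVPA;
the sample counts are made up) of leave-one-out with such a
majority-vote dummy classifier on balanced binary labels:

  import numpy as np

  # 10 samples per class, perfectly balanced overall
  labels = np.array([0] * 10 + [1] * 10)

  correct = 0
  for i in range(len(labels)):
      train = np.delete(labels, i)              # leave sample i out
      # majority vote: the left-out sample's class is now the
      # training minority, so the *other* class always wins
      prediction = np.bincount(train).argmax()
      correct += prediction == labels[i]

  print(correct / len(labels))   # 0.0 -- perfect mis-classification

flip argmax to argmin (go for the minority) and the same loop prints
1.0 instead.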

in reality, when using e.g. an SVM with a small number of samples, a
large number of features, and weak (or entirely absent) signal, even a
slight class imbalance might lead to a classifier which simply goes
with the label that had the largest number of samples during training,
because that is what minimizes the error function being optimized.
that in turn leads to below-chance generalization performance.
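
as a hedged illustration of that effect (scikit-learn rather than
PyMVPA, and purely simulated noise data): leave-one-out makes every
training set slightly imbalanced against the left-out sample's class,
so a linear SVM on signal-free data often lands below chance:

  import numpy as np
  from sklearn.svm import SVC
  from sklearn.model_selection import LeaveOneOut

  rng = np.random.default_rng(0)
  X = rng.standard_normal((20, 1000))   # 20 samples, 1000 noise features
  y = np.array([0] * 10 + [1] * 10)     # balanced labels, no real signal

  correct = 0
  for train_idx, test_idx in LeaveOneOut().split(X):
      # each training set is 9 vs 10: the left-out class is the minority
      clf = SVC(kernel='linear').fit(X[train_idx], y[train_idx])
      correct += clf.predict(X[test_idx])[0] == y[test_idx][0]

  print(correct / len(y))   # on noise this tends to fall below 0.5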

> >>   I'd like to know
> >>   if this is considered a reasonable approach since I have a dataset with
> >>   a small number of trials that might benefit from maximizing the number
> >>   of training trials.
> > how small is small? number of trials/chunks(sessions)/labels?

> About 15 trials in each of 5 categories (labels), spread across
> 4 scanning sessions.  This experiment was not designed with mvpa in
> mind.  The trials are slow and decently spaced from each other, though,
> so the question would be could we treat each trial as a separate chunk
> as I believe the Current Biology paper has done.  

well... I would start with leaving each one of the 4 scanning sessions
out in turn and seeing how it goes ;)
(since 15 % 4 = 3, I guess the conditions are unevenly represented
across sessions, right?)
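
in case it helps to get started, a minimal leave-one-session-out
sketch (scikit-learn for brevity; in PyMVPA terms each session would
be one chunk, and the random session assignment below is just a
placeholder for your real one):

  import numpy as np
  from sklearn.svm import SVC
  from sklearn.model_selection import LeaveOneGroupOut

  rng = np.random.default_rng(0)
  X = rng.standard_normal((75, 500))     # 75 trials x 500 voxels (made up)
  y = np.tile(np.arange(5), 15)          # 5 labels, 15 trials each
  sessions = rng.integers(0, 4, size=75) # 4 sessions, uneven per condition

  scores = []
  for train_idx, test_idx in LeaveOneGroupOut().split(X, y, sessions):
      clf = SVC(kernel='linear').fit(X[train_idx], y[train_idx])
      scores.append(clf.score(X[test_idx], y[test_idx]))

  print(np.mean(scores))   # chance here is 0.2; noise should hover near it

testing a whole session at a time also sidesteps the
single-left-out-sample bias discussed above.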

-- 
                                  .-.
=------------------------------   /v\  ----------------------------=
Keep in touch                    // \\     (yoh@|www.)onerussian.com
Yaroslav Halchenko              /(   )\               ICQ#: 60653192
                   Linux User    ^^-^^    [175555]
