[pymvpa] Biased estimates by leave-one-out cross-validations in PyMVPA 2

Ping-Hui Chiu chiupinghui at gmail.com
Sat Apr 21 23:13:50 UTC 2012


Thanks Yaroslav! I tried the Balancer generator but it didn't help in the
following case of binary classification on random samples:

from mvpa2.suite import *

clf = LinearCSVMC()
# chain a Balancer after the partitioner so each training partition
# gets balanced targets
cv = CrossValidation(clf, ChainNode([NFoldPartitioner(), Balancer()],
                                    space='partitions'))
acc = []
for i in range(200):
    print i
    ds = Dataset(np.random.rand(200))
    ds.sa['targets'] = np.remainder(range(200), 2)  # alternating 0/1 targets
    ds.sa['chunks'] = range(200)  # one chunk per sample -> leave-one-out
    results = cv(ds)
    acc.append(1 - np.mean(results))

>>> print np.mean(acc), np.std(acc)
0.4106 0.212417960634

Or is there a proper way to set up Balancer() that I'm missing? This
generator is not described in the PyMVPA 2.0.1 manual... :(
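
For reference, here is a minimal sketch of one possible setup (an
assumption based on Balancer's defaults in PyMVPA 2, not something
confirmed in this thread): by default Balancer(apply_selection=False)
only marks a balanced subset in the 'balanced_set' sample attribute and
leaves every sample in place, so the classifier still trains on the
imbalanced set; apply_selection=True should make it output the selected
subset instead.

from mvpa2.suite import *

clf = LinearCSVMC()
# Assumption: apply_selection=True tells Balancer to actually subsample
# the majority class; the default (False) merely flags a balanced subset
# via the 'balanced_set' sample attribute without dropping anything.
balancer = Balancer(attr='targets', limit='partitions',
                    apply_selection=True)
cv = CrossValidation(clf, ChainNode([NFoldPartitioner(), balancer],
                                    space='partitions'))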

Thanks!
Dale

On Fri, Apr 20, 2012 at 12:34 PM, Yaroslav Halchenko <debian at onerussian.com>
wrote:
if we were to talk about bias we would talk about classification of true
effects ;)

you are trying to learn/classify noise on imbalanced sets -- since you
have 'events' == range(200), each sample/event is taken out separately,
so every training set contains 100 samples of one target (say 1) and 99
of the other (say 0).  Since it is pure noise, the classifier might
simply declare every sample to be the target with the majority of
training samples (regardless of the actual data, which is noise), since
that minimizes its objective function during training.  But the left-out
sample is always the minority target of its own training set, so this
strategy is wrong on every single fold -- hence the "anti-learning"
effect.  That is why we usually suggest ensuring an equal number of
samples per category in the training set (or chaining with the Balancer
generator if the data are imbalanced).
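
To make the mechanism concrete, here is a minimal sketch in plain numpy
(the majority-vote rule below is a stand-in for what an SVM can
degenerate to on pure noise, not PyMVPA's actual code path): under
leave-one-out, the held-out sample is always the minority class of the
remaining training set, so a majority-vote predictor is wrong on every
fold.

import numpy as np

targets = np.remainder(np.arange(200), 2)   # 100 zeros, 100 ones

correct = []
for i in range(len(targets)):               # one leave-one-out fold per sample
    train = np.delete(targets, i)           # 100 of one class, 99 of the other
    # majority-vote "classifier": predict the most frequent training label
    prediction = int(np.mean(train) > 0.5)
    correct.append(prediction == targets[i])

print np.mean(correct)                      # 0.0 -- wrong on every fold

A real SVM on noise only drifts toward this degenerate rule rather than
adopting it outright, which is consistent with the ~40% (rather than 0%)
accuracies reported in this thread.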

On Fri, 20 Apr 2012, Ping-Hui Chiu wrote:

>    Dear PyMVPA experts,
>    Isn't a leave-one-out cross-validation supposed to produce a smaller
>    bias yet a larger variance than an N-fold cross-validation when N is
>    smaller than the number of samples?

>    I ran a sanity check on binary classification of 200 random samples.
>    4-fold cross-validation produced unbiased estimates (~50% correct),
>    whereas leave-one-out cross-validation consistently produced
>    below-chance classification performance (~40% correct). Why?

>    Any insight on this will be highly appreciated!

>    My code is listed below:

>    from mvpa2.suite import *
>    clf = LinearCSVMC()
>    # 'chunks' -> 4-fold CV; 'events' -> leave-one-out CV
>    cv_chunks = CrossValidation(clf, NFoldPartitioner(attr='chunks'))
>    cv_events = CrossValidation(clf, NFoldPartitioner(attr='events'))
>    acc_chunks = []
>    acc_events = []
>    for i in range(200):
>        print i
>        ds = Dataset(np.random.rand(200))
>        ds.sa['targets'] = np.remainder(range(200), 2)
>        ds.sa['events'] = range(200)
>        ds.sa['chunks'] = np.repeat([1, 2, 3, 4], 50)
>        ds_chunks = cv_chunks(ds)
>        acc_chunks.append(1 - np.mean(ds_chunks))
>        ds_events = cv_events(ds)
>        acc_events.append(1 - np.mean(ds_events))

>    >>> print np.mean(acc_chunks), np.std(acc_chunks)
>    0.50025 0.0442542370853
>    >>> print np.mean(acc_events), np.std(acc_events)
>    0.40674 0.189247516232

>    Thanks!
>    Dale




--
=------------------------------------------------------------------=
Keep in touch                                     www.onerussian.com
Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic

_______________________________________________
Pkg-ExpPsy-PyMVPA mailing list
Pkg-ExpPsy-PyMVPA at lists.alioth.debian.org
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-exppsy-pymvpa

