[Pkg-exppsy-pymvpa] model selection and other questions

Emanuele Olivetti emanuele at relativita.com
Thu Apr 3 08:13:10 UTC 2008


Yaroslav Halchenko wrote:
>> There are some GPR implementations available [0] but none of
>>     
> oops... either I am too tired or the URL was omitted from the trailer.
>
>   
My fault.

http://www.gaussianprocess.org/#code


>> Mine is not yet public because it was
>> implemented to solve a specific problem over the last 2 years [1].
>>     
> oops... again... I guess it had to do with PBAIC? ;)
>
>   

My fault again. Yes, PBAIC:
http://www.ebc.pitt.edu/PBAIC.html

>> It is quite usable by me, but hard for anyone else due to the quick-and-dirty
>> coding. Much of my time during the last 2 years was spent on a different
>>     
>>> ...<
>>>       
>> By the way GPR seems to perform nicely on fMRI data :)
>>     
> better it does. I guess you've done a comparison to SVM regressions, since
> you've chosen this one specifically over them... was it much superior?
>
>   

I did not compare them in depth. I'm very interested in probabilistic
and Bayesian learning methods. As for which one is superior,
I don't know that there is a clear winner. Both are kernel methods, so you project
your data into a higher-dimensional space (in exactly the same way) and
then draw a line (to separate classes for classification, or to interpolate for
regression).
GPR performs Bayesian linear regression in that space, taking a prior into
account and predicting both means and variances for the test data. SVM draws
the line according to the support vectors (I use SVMs but I'm not an expert). But
if you need a sound probabilistic tool I think GPR (or GPC, for classification)
should definitely be considered.
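
To make the "means and variances" part concrete, here is a minimal NumPy
sketch of GP regression with a squared-exponential kernel. The hyperparameter
values (sigma_f, length_scale, sigma_n) are placeholders for illustration
only, not the ones I actually use:

import numpy as np

def rbf_kernel(A, B, sigma_f=1.0, length_scale=1.0):
    # squared-exponential covariance between the rows of A and B
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return sigma_f ** 2 * np.exp(-0.5 * sq_dists / length_scale ** 2)

def gpr_predict(X_train, y_train, X_test, sigma_n=0.1):
    # posterior mean and variance of the GP at the test points
    K = rbf_kernel(X_train, X_train) + sigma_n ** 2 * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    K_ss = rbf_kernel(X_test, X_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = np.dot(K_s.T, alpha)
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - (v ** 2).sum(axis=0)
    return mean, var

The predicted variance is what plain SVM regression does not give you
directly, which is why the probabilistic formulation appeals to me.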


>> I have no problem in delivering everything I can. The problem is
>> to make it usable.
>>     
> just let others judge the usability and help you along.
> Probably you already have a clone of our repository; just create
> eo/master (or whatever abbreviation you like for your name) and start by
> pulling at least something in and composing a tiny testcase for it. If
> there is really nothing specific to it -- i.e. it is just a
> classifier -- just add it to the corresponding group in
> tests/tests_warehouse_clfs.py and it might already get tested by some
> ...
>   

Thanks for the detailed instructions. I'm looking forward to
getting started. This is not the only activity in which I'm involved, but
I expect to be able to spend 30% of my time on it.

I usually use Subversion but I'm going to move to a DVCS in
the coming weeks because of the many troubles I've experienced with
svn. I know some basics of bzr and have never tried git, but I guess
it should not be too painful.


>> Thank you for pointing me to the dev guide. I
>> will read it for sure. I really appreciate that you have started
>> explicitly defining coding guidelines etc. This seems quite unusual
>> compared to other machine learning packages.
>>     
> well... since we work in philosophically different environments (read:
> emacs vs vim vs ...), reaching a firm agreement was needed ;-) our naming
> agreements (camelcase mixed with non-camelcase) are somewhat evil but at
> least consistent (thanks to pylint).
> unfortunately the devguide is far from complete but it should be a
> good helper to stay "in style" ;)
>
>   

I use emacs and have just a rough idea of pylint. I'll try it.


> Also there is some discussion on the scipy mailing list and a Google
> Summer of Code proposal from Anton
> http://slesarev.anton.googlepages.com/proposal
> which might be worth looking at/discussing
>
>   

I'm reading the SciPy-dev threads on scikits.learn and read Anton's
proposal when he posted it. I think there is a strong need to coordinate
efforts, since part of Anton's proposal seems to duplicate part of PyMVPA.

>> There are many ways to define the "best" parameters of a learning
>> model: maximizing accuracy, maximizing the marginal likelihood, Bayesian
>> methods, a priori knowledge etc. Each of them can be pursued with
>> different techniques (standard optimization, grid search, MCMC etc.).
>> I would prefer to have several algorithms for that task, each with
>> its own strengths and weaknesses. Or at least an architecture to plug
>> them in easily when needed.
>>     
> right in the pot. Defining a clear interface should help everyone out.
> And I am not sure what the specification of those should be at the moment.
> Since you seem to have better exposure to what is needed, it would be
> great indeed if you provided some input/ideas.
>
>   

That's a hard task. Anyway, I'll try to sketch something. As
I already said, I'm really interested in this topic.
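
As a first rough idea, the plug-in architecture could be as simple as a
common select() interface; this is a purely hypothetical sketch (none of
these names exist in PyMVPA), with grid search as the first concrete strategy:

class GridSearchSelector(object):
    # Pick the best hyperparameters among an explicit list of candidates.
    # 'candidates' is a list of dicts, e.g. [{'C': 1.0}, {'C': 10.0}];
    # 'evaluate' is any callable mapping a parameter dict to a score
    # (higher is better), e.g. cross-validated accuracy or the log
    # marginal likelihood.
    def __init__(self, candidates, evaluate):
        self.candidates = candidates
        self.evaluate = evaluate

    def select(self):
        best_params, best_score = None, None
        for params in self.candidates:
            score = self.evaluate(params)
            if best_score is None or score > best_score:
                best_params, best_score = params, score
        return best_params, best_score

A marginal-likelihood or MCMC-based selector would then simply be another
class exposing the same select() method, so the caller does not need to
care which strategy is used.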

>   
>> using SVMs one year ago (from SciPy sandbox, now scikits.learn)
>> maximizing the accuracy on training data this time, but in the same
>> way. It was not efficient, but it worked.
>>     
> accuracy on training data -- is that learning error or some other
> internal assessment (e.g. leave-1-out cross-validation on the
> training data)?
>
>   

It was cross-validation, as provided by libsvm. Once you have
a function that evaluates the quality of your hyperparameters,
it is just a matter of optimizing it. At that time I used scipy.optimize,
but it seemed somewhat unstable on my datasets. Dmitrey (OpenOpt's
author) suggested Shor's r-algorithm ("ralg") for my setting.
I tried it on GPR (nice results!) and hope to test it on SVR when possible.
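
Just to illustrate the "optimize it" step: the sketch below assumes a
cv_error() function that runs the cross-validation for a given set of
hyperparameters (the body here is a dummy quadratic so the snippet runs
stand-alone; in practice it would call libsvm or GPR on your data), and it
uses scipy.optimize.fmin only as a stand-in for whatever optimizer you
prefer (OpenOpt's ralg would be plugged in the same way):

import numpy as np
from scipy.optimize import fmin

def cv_error(log_params):
    # Hypothetical objective: cross-validation error for hyperparameters
    # optimized in log-space (so they stay positive). Replace the dummy
    # quadratic with a real call to libsvm/GPR cross-validation.
    C, gamma = np.exp(log_params)
    return (np.log(C) - 1.0) ** 2 + (np.log(gamma) + 2.0) ** 2

x0 = np.log([1.0, 0.1])              # start from C=1, gamma=0.1
best_log = fmin(cv_error, x0, disp=False)
best_C, best_gamma = np.exp(best_log)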


Bye,

Emanuele



