[Pkg-exppsy-pymvpa] model selection and other questions

Emanuele Olivetti emanuele at relativita.com
Wed Apr 2 21:41:32 UTC 2008

Hi Yaroslav,

Yaroslav Halchenko wrote:
>> I work with probabilistic models and I have some code on Gaussian
>> process regression (GPR) [0] (classification will be available in the
>> near future). Since GPR is a kernel method, it shares many underlying
>> ideas with other kernel methods (e.g., SVM), so it can be part of a
>> common architecture.
> yet one more thing to refactor a bit in PyMVPA is to adopt ML regressors.
> They are already there, but the base class for all classifiers is currently
> Classifier and it makes use of ConfusionMatrix for some state variables. That
> should be different for regressors since there is no confusion matrix
> per se.

GPR is about regression (like SVR). I plan to work quite a bit on
regression (the humble sister of classification :) ) so it would be
nice to have the right place for it in PyMVPA.
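
For readers unfamiliar with GPR, here is a minimal sketch of what a GP regressor computes (my own illustration in numpy, not code from any package; the RBF kernel and all function names are my choices): the posterior mean at test points is k_* (K + sigma^2 I)^{-1} y.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential (RBF) kernel matrix between two 1-D point sets.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale ** 2)

def gpr_predict(x_train, y_train, x_test, length_scale=1.0, noise=0.1):
    # Posterior mean of a GP regressor: k_* (K + sigma^2 I)^{-1} y.
    K = rbf_kernel(x_train, x_train, length_scale)
    K += noise ** 2 * np.eye(len(x_train))
    k_star = rbf_kernel(x_test, x_train, length_scale)
    return np.dot(k_star, np.linalg.solve(K, y_train))

# Toy use: regress a noisy sine and predict at pi/2 (true value 1).
rng = np.random.RandomState(0)
x = np.linspace(0.0, 2.0 * np.pi, 50)
y = np.sin(x) + 0.05 * rng.randn(50)
mean = gpr_predict(x, y, np.array([np.pi / 2.0]))
```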

I'm still at a very early stage in understanding your package. For
whatever help it may be, I'll try to give you feedback on regression
since it is one of my main interests.

> And of course having new ML tools would be simply GREAT!
> Did you find GPR performing nicely for fMRI data? is there python
> implementation under some OSS license available? ;)

There are some GPR implementations available [0] but none of
them seems to be in Python. Mine is not yet public because it was
implemented to solve a specific problem over the last 2 years [1].
It is quite usable by me, but hard for anyone else due to the
quick-and-dirty coding. Much of my time during the last 2 years was
spent on a different topic, so I did my best to be effective on my
own problem but little for a more general audience.
Now things have changed: as of a few days ago I have time to work on
the code for the medium/long term, and I have an explicit commitment
to deliver it to a wider audience.

By the way, GPR seems to perform nicely on fMRI data :)

>> About what you call "sensitivity analyzers" (which we name "feature
>> raters"),
> Naming in PyMVPA is still somewhat unstable, and Sensitivity Analyzer is
> one of the candidates for name refactoring ;-) Actually in our case
> SensitivityAnalyzer is just a FeaturewiseDatasetMeasure class...
> But it does not perform rating by default (actually we are yet to add a
> simple Rater transformer) but rather computes some raw measure per feature.
> And I guess (Michael?) if we stop at FeaturewiseMeasure name as the
> basic class, we could refactor ClassifierBasedSensitivityAnalyzer simply
> into ClassifierSensitivity because that is what it is (that is btw what
> confused Dr. Hanson in naming scheme  -- ANOVA is not a sensitivity per
> se)...

I'd prefer to use standard naming for machine learning terms. You
could object, asking "what is the standard naming?"... :) Just take
into account that regression has its feature raters too; I have one.

>> I have implemented the I-Relief algorithm [1] for multi-class
>> ...
> Everything sounds really cool and that is great that you are thinking
> about exposing your implementation to the world ;-) If you decide to
> contribute to PyMVPA -- let us know, we will provide you with full
> access to the repository. Also we have developer guidelines
> http://pkg-exppsy.alioth.debian.org/pymvpa/devguide.html
> to describe at least some of the internals of PyMVPA, thus you can start
> off adopting your algorithms into PyMVPA within your own clone of git
> repository and feed us only with desired pieces whenever you feel so ;-)

I have no problem with delivering everything I can; the problem is
making it usable. Thank you for pointing me to the dev guide, I
will definitely read it. I really appreciate that you have started
explicitly defining coding guidelines etc. This seems quite unusual
among machine learning packages.

> I am not sure if you are aware of it but let me bring your attention to
> http://mloss.org/about/ which was created primarily to advocate
> open-sourcing of scientific software for various reasons. There is a
> paper which accompanies it
> http://jmlr.csail.mit.edu/papers/v8/sonnenburg07a.html
> which is worth glancing over.

I know of MLOSS and that article. I really think it is a worthwhile
initiative and I want my effort to reach that level.

>> These days I'm trying/reading the PyMVPA and scikits.learn code. If you
>> have suggestions on how/where to start please let me know. I'm somewhat
>> proficient in Python, but not that much.
> well.. pymvpa links you must know by now:
> http://pkg-exppsy.alioth.debian.org/pymvpa/
> which leads to
> http://pkg-exppsy.alioth.debian.org/pymvpa/devguide.html
> http://pkg-exppsy.alioth.debian.org/pymvpa/api/index.html

Thanks again for the pointers.

>> I completely agree on
>> having a generic tool to deal with all algorithms and different ways of
>> doing it (grid search, different optimization algorithms etc.). If you
>> are discussing on what to do, please let me know: I'm really interested.
> we are open to discussion any time (though sometimes we might get
> busy... like now I should be doing something else... not PyMVPA
> related ;-)). But for a quick and interesting discussion we should really
> do voice + VNC sessions. Are you on Skype by any chance?

Interesting. I'll mail you my Skype ID privately. But give me a bit of
time to get up to speed on PyMVPA: I don't want to waste your time.

>> Currently I'm using a really nice optimization tool (OpenOpt, from
>> scikits) and optimizing the marginal likelihood of the training data
>> (approximating Bayesian model selection) to select the model instance
>> of GPR. I want to extend this part (e.g., optimizing accuracy with
>> cross-validation) and enforce the use of OpenOpt, which is definitely
>> better than scipy.optimize.
> We are still blurbing on what should be the best decision on how to
> handle classifier parameters to provide an efficient and easy way to
> optimize things. If you could share some code pieces which make use of
> OpenOpt -- that would be terrific -- it would allow us to see the
> situation from a wider perspective.

There are many ways to define the "best" parameters for a learning
model: maximizing accuracy, maximizing marginal likelihood, Bayesian
methods, a priori knowledge, etc. Each of them can be pursued with
different techniques (standard optimization, grid search, MCMC, etc.).
I would prefer to have several algorithms for that task, each with
its own strengths and weaknesses, or at least an architecture to plug
them in easily when needed.
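
As one concrete instance of the grid-search option above, here is a minimal, self-contained sketch (the helper names and the GP-style posterior-mean predictor are my own, for illustration only): pick the kernel hyperparameters that minimize k-fold cross-validation error.

```python
import numpy as np

def rbf_kernel(a, b, ell):
    # Squared-exponential kernel matrix for 1-D inputs.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def cv_error(x, y, ell, noise, n_folds=5):
    # Mean squared error of GP posterior-mean predictions under k-fold CV.
    idx = np.arange(len(x))
    errors = []
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        K = rbf_kernel(x[train], x[train], ell) + noise ** 2 * np.eye(len(train))
        k_star = rbf_kernel(x[fold], x[train], ell)
        pred = np.dot(k_star, np.linalg.solve(K, y[train]))
        errors.append(np.mean((pred - y[fold]) ** 2))
    return np.mean(errors)

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0.0, 2.0 * np.pi, 60))
y = np.sin(x) + 0.1 * rng.randn(60)

# Exhaustive grid search over (length-scale, noise) pairs: keep the
# combination with the lowest cross-validation error.
grid = [(ell, s) for ell in (0.3, 1.0, 3.0) for s in (0.05, 0.1, 0.5)]
best = min(grid, key=lambda p: cv_error(x, y, *p))
```

The same loop structure would accept any scoring function in place of `cv_error` (marginal likelihood, held-out accuracy, ...), which is exactly the kind of pluggability argued for above.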

My current code is quite simple: GPR can express the marginal
likelihood of the labels given the (training) data analytically as a
function of the hyperparameters. From a maximum-likelihood point of
view, I maximize it using the OpenOpt algorithms (in my case the "ralg"
algorithm works pretty well), starting from many random guesses. I did
something similar with SVMs a year ago (from the SciPy sandbox, now
scikits.learn), that time maximizing accuracy on the training data, but
in the same way. It was not efficient, but it worked.
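
That scheme can be sketched in a few lines; this is my own illustration, substituting scipy.optimize's Nelder-Mead for OpenOpt's "ralg" purely to stay self-contained, with all function names my own. For an RBF kernel the GP log evidence is -(1/2) y^T K^{-1} y - (1/2) log|K| - (n/2) log 2*pi, maximized over (log length-scale, log noise) from several random starts:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(theta, x, y):
    # Negative GP log evidence; theta holds log(length-scale) and
    # log(noise std) so both stay positive under unconstrained search.
    ell, noise = np.exp(theta)
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ell ** 2)
    K += (noise ** 2 + 1e-8) * np.eye(len(x))
    try:
        L = np.linalg.cholesky(K)
    except np.linalg.LinAlgError:
        return np.inf  # reject numerically unstable hyperparameters
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * np.dot(y, alpha) + np.sum(np.log(np.diag(L)))
            + 0.5 * len(x) * np.log(2.0 * np.pi))

rng = np.random.RandomState(1)
x = rng.uniform(0.0, 2.0 * np.pi, 40)
y = np.sin(x) + 0.1 * rng.randn(40)

# Multi-start optimization from random guesses; keep the best local optimum.
starts = [rng.randn(2) for _ in range(5)]
fits = [minimize(neg_log_marginal_likelihood, s, args=(x, y),
                 method='Nelder-Mead') for s in starts]
best = min(fits, key=lambda r: r.fun)
ell_hat, noise_hat = np.exp(best.x)
```

A dedicated nonsmooth solver like "ralg" plays the same role as `minimize` here; the multi-start loop is what guards against the local optima of the evidence surface.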


