[Pkg-exppsy-pymvpa] model selection and other questions

Emanuele Olivetti emanuele at relativita.com
Tue Apr 1 10:14:02 UTC 2008


Hi Michael,


Michael Hanke wrote:
> Hi Emanuele,
>
> Yes, this is definitely the place to discuss PyMVPA.

Excellent.

>
>> - After the Paris' sprint, is there evidence that PyMVPA
>> will actually substitute scikits.learn in near future?
>
> The current situation is that it will _not_ replace scikits.learn. We
> intend to keep PyMVPA with its workflow separate from scikits.learn. In
> Paris we agreed that scikits should rather be a collection of generic
> algorithms. So in the future PyMVPA will be based on functionality in
> scikits.learn, but the actual user interface with its workflow will not
> necessarily be part of it.
> One reason is that PyMVPA is somewhat focused on neuroimaging data and
> scikits.learn has (or should have) a much wider focus.
>

I have been a NumPy and SciPy user for some years and have implemented
some machine learning algorithms for one of the projects I'm working
on. Since I'd like to share them, I'm wondering how to be most
effective: as far as I can see, there are independent efforts to set up
"self-contained" Python packages addressing specific needs (pyml,
learn, pymvpa, etc.). I would prefer to add to a pre-existing package
to avoid useless duplication. So I'm looking for a clean architecture
that is useful for me and has a chance of gaining popularity. My
current candidates are scikits.learn and, more recently, PyMVPA. The
learn module seems to lack a global architecture somewhat (most
probably an explicit choice, so as not to encumber new contributions
and to stay simple). PyMVPA is definitely more workflow oriented, and
(good news!) people are actively working on it.

Since you plan to base PyMVPA on scikits.learn in the (near?) future,
here is a new question: what do you suggest that I and other potential
contributors do? Deliver something to scikits.learn first, or target
PyMVPA directly?

I work with probabilistic models and have some code for Gaussian
process regression (GPR) [0] (classification will be available in the
near future). Since GPR is a kernel method, it shares many underlying
ideas with other kernel methods (e.g., SVM), so it can be part of a
common architecture.

Regarding what you call "sensitivity analyzers" (which we name "feature
raters"), I have implemented the I-Relief algorithm [1] for multi-class
problems, together with some simpler raters based on mutual information
or linear regression. More recently I implemented the GPR-ARD [2]
algorithm and applied it to neuroimaging data.
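As a sketch of the simplest kind of feature rater mentioned above, the
following scores each feature by a histogram estimate of its mutual
information with the class labels. The names and the binning choice are
mine, purely illustrative:

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of I(x; y) in nats, for one feature x and labels y."""
    joint, _, _ = np.histogram2d(x, y, bins=(bins, len(np.unique(y))))
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal over labels
    py = pxy.sum(axis=0, keepdims=True)   # marginal over feature bins
    nz = pxy > 0                          # avoid log(0) on empty cells
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px * py)[nz])))

def rate_features(X, y, bins=8):
    """Rate each column of X by its mutual information with labels y."""
    return np.array([mutual_information(X[:, j], y, bins=bins)
                     for j in range(X.shape[1])])
```

An informative feature then scores well above a pure-noise one, which
is all such a simple rater is meant to provide.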

These days I'm trying out and reading the PyMVPA and scikits.learn
code. If you have suggestions on how/where to start, please let me
know. I'm somewhat proficient in Python, but not that much.

>> - Is there some model selection solution currently available
>> in PyMVPA? I mean something to infer hyperparameters of the
>> classifiers (like the sigma of the RBF kernel in SVMs) from
>> data? Libsvm provides some extra tools for grid search; I
>> work with optimization techniques that can be generalized to
>> many classifiers etc. What about PyMVPA?
> Having done the first release, this is now one of our main development targets.
> We have plans for a generic optimization interface, i.e. an OptimizedClassifier
> that can perform a number of model selection algorithms. However, to be
> really flexible, we have to do some work on unifying the parameter
> interface of our classifiers. This will be done during the final phase of
> the integration of the shogun toolbox (http://www.shogun-toolbox.org/),
> which is already somewhat usable.
>
> So, there probably won't be a separate algorithm sitting on top of the
> classifiers (like libsvm's grid search script), but another meta-classifier
> that enables parameter optimization for every other type of classifier
> and additionally can be used as a classifier on its own.

The model selection problem is quite crucial, IMHO. I completely agree
on having a generic tool that covers all algorithms and the different
ways of doing model selection (grid search, different optimization
algorithms, etc.). If you are discussing what to do, please let me
know: I'm really interested. Currently I'm using a really nice
optimization tool (OpenOpt, from scikits) and optimizing the marginal
likelihood of the training data (approximating Bayesian model
selection) to select the model instance for GPR. I want to extend this
part (e.g., optimizing cross-validated accuracy) and promote the use
of OpenOpt, which is definitely better than scipy.optimize.
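To illustrate the grid-search flavour of model selection that a generic
tool would have to cover, here is a self-contained numpy-only sketch:
selecting the RBF width of a kernel ridge regressor by k-fold
cross-validation. All names, and the ridge/fold defaults, are my own
illustrative choices:

```python
import numpy as np

def rbf(X1, X2, sigma):
    """Squared exponential kernel matrix."""
    d = (np.sum(X1 ** 2, axis=1)[:, None]
         + np.sum(X2 ** 2, axis=1)[None, :]
         - 2.0 * np.dot(X1, X2.T))
    return np.exp(-d / (2.0 * sigma ** 2))

def cv_error(X, y, sigma, ridge=1e-3, folds=5):
    """Mean squared error of kernel ridge regression under k-fold CV."""
    idx = np.arange(len(X))
    errors = []
    for f in range(folds):
        test = idx[f::folds]               # every folds-th sample held out
        train = np.setdiff1d(idx, test)
        K = rbf(X[train], X[train], sigma) + ridge * np.eye(len(train))
        alpha = np.linalg.solve(K, y[train])
        pred = np.dot(rbf(X[test], X[train], sigma), alpha)
        errors.append(np.mean((pred - y[test]) ** 2))
    return float(np.mean(errors))

def grid_select(X, y, sigmas):
    """Return the sigma with the lowest cross-validated error."""
    return min(sigmas, key=lambda s: cv_error(X, y, s))
```

Replacing the grid in grid_select with a proper optimizer (OpenOpt,
say), or replacing cv_error with a marginal likelihood, gives the other
flavours of model selection discussed above, which is why a single
generic interface over all of them seems feasible.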



A last question, about licenses:
- NumPy/SciPy enforce the (modified) BSD license
- scikits addresses a wider range, but mainly (modified) BSD and GPL (v2? v3?)
- PyMVPA is distributed under the MIT (X11) license
- shogun is GPL (v3)

What is the reason for the MIT/X11 license instead of (modified) BSD
or GPL?


Kind Regards,

Emanuele

[0]: http://www.gaussianprocess.org/
[1]: http://plaza.ufl.edu/sunyijun/Paper/PAMI_1.pdf
[2]: it is very similar to "Linear SVM Weights" but uses a
squared exponential kernel instead
http://books.nips.cc/papers/files/nips08/0514.pdf






