debian at onerussian.com
Fri Dec 3 21:34:11 UTC 2010
Sorry for the confusion caused by imperfections in the interface. In the
development version those were (once again) eliminated:
the "regression" argument made sense only for regressions, to say that
they should behave as classifiers (regression=False) via simple
quantization. Starting with the 0.5.* versions there will be no
"regression" argument, and to use a regression for classification an
explicit wrapper will be used: RegressionAsClassifier
> five levels of workload (1 to 5). So I was thinking to use a
> regression: SMLR(regression=True).
SMLR is a classifier and cannot serve as a regression; well,
technically it could, under some coarse quantization of the continuous
data into categorical bin labels, but it would not be very effective
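The coarse quantization mentioned above can be sketched in plain Python (the helper name and bin scheme are mine, purely for illustration, not a PyMVPA API):

```python
def quantize(values, nbins):
    """Map continuous values onto nbins equal-width categorical bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / float(nbins)
    # Clamp so that the maximum value falls into the last bin, not bin nbins.
    return [min(int((v - lo) / width), nbins - 1) for v in values]

labels = quantize([0.1, 0.4, 0.5, 0.9, 1.0], nbins=2)  # [0, 0, 0, 1, 1]
```

The information lost in the binning is exactly why this route tends to be ineffective compared to a real regression.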
> Does it make sense to do a regression when labels are discrete? i.e.
> the example on regression uses a sinusoidal function, i.e. label
> values are continuous. Would it help the regression to add small noise
> to the labels to give a larger set of values?
> Is it better to ensure that all labels have the same number of samples
> in the training dataset?
yes, unless you are using an SVM, where you could compensate by
providing per-class C values to balance the penalty between the two
classes, e.g. as we did in the Frontiers paper... we pushed all that
code today as a separate git repository, so here is the part I am talking about:
> I manually codded a function to split my
> dataset in training and testing set, randomly picking 70% of the
> samples for the training set, but making sure that all labels have the
> same number of samples in the training set. Is there a way to achieve
> that with the existing splitters in PyMVPA?
yeap, see nperlabel and nrunpersplit,
which would do that for you
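Outside of PyMVPA, such a balanced 70/30 split can be sketched as follows (the function name and signature are mine, not PyMVPA API; each label contributes the same number of training samples):

```python
import random

def balanced_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx) index lists, drawing the same number
    of training samples from every label: the minimum per-label count
    times train_fraction."""
    rng = random.Random(seed)
    by_label = {}
    for i, label in enumerate(labels):
        by_label.setdefault(label, []).append(i)
    n_train = int(min(len(v) for v in by_label.values()) * train_fraction)
    train, test = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)          # random pick within each label
        train += idxs[:n_train]
        test += idxs[n_train:]
    return sorted(train), sorted(test)

train, test = balanced_split([1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                             train_fraction=0.5)
```

With the toy labels above, each of the two classes contributes exactly two training samples, and the leftovers go to the test set.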
> How to measure the performances? I am currently computing the RMSE.
I am a bit lost here -- are we doing regression again? Then what
was that question about labels (which are for categorical data... that
is when it makes sense to equalize their counts)?
although it could be "applied" to categorical data with purely numeric
category labels, RMSE is primarily for regression. For
classification, the mean mismatch error is the one we are using
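For the record, the two error measures differ as follows; this is a plain-Python sketch of the definitions, not the PyMVPA implementations:

```python
import math

def mean_mismatch_error(predicted, targets):
    """Fraction of predicted labels that do not match the targets."""
    return sum(p != t for p, t in zip(predicted, targets)) / float(len(targets))

def rmse(predicted, targets):
    """Root mean squared error -- meaningful for continuous targets."""
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(predicted, targets))
        / float(len(targets)))

pred, targ = [1, 2, 3, 5], [1, 2, 4, 5]
mismatch = mean_mismatch_error(pred, targ)  # 0.25: one of four labels is wrong
err = rmse(pred, targ)                      # 0.5 on the same data
```

Note that RMSE rewards being *close* to the target, which only makes sense if the labels carry metric meaning.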
> For that I do a 5 folds pseudo cross validation: for each fold I pick
> at random 70% of the samples as part of the training set, with the
unfortunately in 0.4 we didn't have support for random splitting into
training/testing based on some %. We would have grouped samples into,
let's say, 10 chunks (randomly, if samples are independent) and done a 10-fold CV.
In 0.6 it should be possible
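Such chunking could be done by hand along these lines (the helper name is mine, not a PyMVPA API; it assumes the samples are independent so a random assignment is valid):

```python
import random

def assign_chunks(n_samples, n_chunks, seed=0):
    """Randomly assign each sample to one of n_chunks equally sized chunks,
    suitable for a subsequent leave-one-chunk-out N-fold CV."""
    chunks = [i % n_chunks for i in range(n_samples)]  # balanced counts
    random.Random(seed).shuffle(chunks)                # random assignment
    return chunks

chunks = assign_chunks(20, 10)  # each chunk id 0..9 appears exactly twice
```

Each fold of the CV then holds out one chunk id for testing and trains on the rest.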
> same number of samples per labels
oops -- again...
> the testing set. I then train and predict, and measure the RMSE. Can I
> do that using PyMVPA cross validation facility? I know how to do it
> with a classifier, but I am not sure what's the result with a
> regressor. Besides, I could also use the correlation coefficient
> rather than the RMSE, any comment?
yes - you could use CorrErrorFx (1 - correlation coefficient, so a higher
value is worse, since we want to minimize error)
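CorrErrorFx boils down to one minus Pearson's r; a plain-Python equivalent, for illustration only (the function name here is mine):

```python
import math

def corr_error(predicted, targets):
    """1 - Pearson correlation between predictions and targets,
    so 0 is a perfect fit and larger values are worse."""
    n = len(targets)
    mp = sum(predicted) / float(n)
    mt = sum(targets) / float(n)
    cov = sum((p - mp) * (t - mt) for p, t in zip(predicted, targets))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    st = math.sqrt(sum((t - mt) ** 2 for t in targets))
    return 1.0 - cov / (sp * st)

err = corr_error([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # 0.0 for a perfect fit
```

A perfectly anticorrelated prediction yields 2.0, the worst possible value.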
yes, you would need to provide a corresponding error function for
regressions, e.g. for some regression regr (roughly, in the 0.4-series API):

  cve = CrossValidatedTransferError(
            TransferError(regr, CorrErrorFx()),
            NFoldSplitter())
> As for SMLR, I use the standards parameters, lm=0.1 which yields a
> fairly small sparsification (a couple of the weights are null, out of
> 80). Is there a way to automatically tune lm (and other parameters) to
> achieve the best performances?
not yet... you would need to resort to a manual implementation of nested
cross-validation if you want to choose parameters based on the cross-validation error
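The manual nested cross-validation could be sketched like this; `fit_and_score` stands in for the actual train-and-evaluate call you would make with PyMVPA, and all names here are mine:

```python
def nested_cv_select(folds, candidates, fit_and_score):
    """For each outer fold, pick the candidate parameter (e.g. lm) with
    the lowest mean error across the remaining inner folds, then report
    the error of that choice on the held-out outer fold."""
    outer_errors = []
    for i, outer in enumerate(folds):
        inner = [f for j, f in enumerate(folds) if j != i]
        best = min(candidates,
                   key=lambda c: sum(fit_and_score(c, f)
                                     for f in inner) / len(inner))
        outer_errors.append(fit_and_score(best, outer))
    return sum(outer_errors) / len(outer_errors)

# Toy scoring function: pretend lm=0.1 generalizes best on every fold.
score = nested_cv_select(folds=[0, 1, 2],
                         candidates=[0.01, 0.1, 1.0],
                         fit_and_score=lambda lm, fold: abs(lm - 0.1))
```

The point of the nesting is that the parameter is chosen without ever touching the outer test fold, so the reported error stays unbiased.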
> That's a lot of questions....
that is ok... it just might take us longer to answer more of them rather than fewer ;)
Keep in touch www.onerussian.com
Yaroslav Halchenko www.ohloh.net/accounts/yarikoptic