[pymvpa] retraining
Scott Gorlin
gorlins at MIT.EDU
Wed Apr 8 06:07:18 UTC 2009
Okay, this email is admittedly way too long... but hopefully some of you
will find it as exciting as I do!
I looked through some of the retraining stuff, and (for my purposes at
least) it doesn't seem to address the issue - the main reason I need to
accelerate retraining is cross-validation, specifically model selection,
which repeatedly changes the training data set etc. (and basically
forces full retraining in each implementation I looked at).
To this end, I have an implementation of a cached-kernel SVM that
greatly accelerates n-fold cv since it only computes the kernel once for
an entire cv session. This can theoretically be interfaced with model
selection to change model params quickly and kernel params slowly, and
speed up the entire process further. I'm having some odd errors with
alioth right now and can't push (I don't know whether it's server-side
or something's actually wrong with my repo - I'm getting weird sha1 file
permission errors, so I'll try again tomorrow).
This seems like something which would be of great general interest,
especially w.r.t. model selection, so let me spell out the architecture
if anyone cares to comment (or wait for my push):
CachedSVM inherits from sg.SVM. It doesn't override __init__, so you can
create any shogun SVM implementation you want.
It implements cache(self, dataset), which takes in *all* data the clf
is meant to be used with. This:
1) creates an instance of the kernel defined in __init__ as normal, i.e.
still respecting the normal kernel params, but stores the kernel matrix
using Kernel.get_kernel_matrix()
2) strips the dataset with dout = dset.selectFeatures([0]); dout.samples
= N.arange(dset.nsamples); return dout
Usage: dcached = clf.cache(dset); CrossValidatedTransferError(clf,
...)(dcached)
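For concreteness, cache() boils down to something like this (just a
sketch: the _setup_kernel helper is a made-up placeholder for the kernel
construction code borrowed from sg.SVM._train, and I reshape the indices
into a column so they look like ordinary 2-D samples):

    import numpy as N
    from mvpa.clfs.sg import SVM

    class CachedSVM(SVM):
        def cache(self, dset):
            # build the kernel once on the *complete* data, using whatever
            # kernel type/params were handed to __init__ (_setup_kernel is
            # a placeholder for the code lifted from sg.SVM._train)
            kernel = self._setup_kernel(dset.samples, dset.samples)
            self._cached_matrix = kernel.get_kernel_matrix()
            # strip the dataset down to a single 'feature' holding each
            # sample's row index in the cached matrix
            dout = dset.selectFeatures([0])
            dout.samples = N.arange(dset.nsamples).reshape((-1, 1))
            return dout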
_train(self, dset) and _predict(self, samples):
1) Accept only integer samples, i.e. those returned by cache(), which
indicate the stored sample index
2) Build a new kernel using shogun.Kernel.CustomKernel(), filling it
with the rows/columns of the cached kernel matrix indexed by the samples
array
3) Other training/predicting steps are basically copied from sg.SVM
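In rough pseudo-code, _train then looks like this (assuming
CustomKernel.set_full_kernel_matrix_from_full() is the right shogun
call; _predict does the same, but slices rows by the stored training
indices and columns by the incoming test indices):

    import numpy as N
    from shogun.Kernel import CustomKernel

    def _train(self, dset):
        # dset.samples are integer indices into the cached kernel matrix
        idx = dset.samples.ravel().astype(int)
        kernel = CustomKernel()
        kernel.set_full_kernel_matrix_from_full(
            self._cached_matrix[N.ix_(idx, idx)])
        # ...the rest (labels, instantiating the shogun SVM, svm.train())
        # is the same as in sg.SVM._train, just with this kernel swapped in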
On an NFoldSplitter of normalFeatureDataset(perlabel=500,
nfeatures=2000, nchunks=4):
Kernel cache: 2.152509 s
Cached CV: 0.492900 s
Normal CV: 12.817234 s
using CrossValidatedTransferError, sg.SVM(), and the corresponding
default params in my CachedSVM.
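(The timing loop is nothing fancier than the sketch below - import paths
are from memory and CachedSVM is of course my class from above, so
adjust to taste.)

    import time
    from mvpa.suite import *      # normalFeatureDataset, NFoldSplitter, ...
    from mvpa.clfs import sg      # shogun-backed SVM

    dset = normalFeatureDataset(perlabel=500, nfeatures=2000, nchunks=4)

    cached = CachedSVM()          # same default params as sg.SVM()
    t0 = time.time(); dcached = cached.cache(dset)
    print 'Kernel cache: %f s' % (time.time() - t0)

    cv = CrossValidatedTransferError(TransferError(cached), NFoldSplitter())
    t0 = time.time(); cv(dcached)
    print 'Cached CV: %f s' % (time.time() - t0)

    cv = CrossValidatedTransferError(TransferError(sg.SVM()), NFoldSplitter())
    t0 = time.time(); cv(dset)
    print 'Normal CV: %f s' % (time.time() - t0)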
From my perspective, the benefits of this architecture are:
1) Inheriting from sg.SVM allows for, in theory, any known SVM
implementation, based on any shogun kernel
2) Otherwise acts as a standard clf
3) Easy to take parameters from this (i.e. after cv model selection) and
plop them into a normal shogun clf of the same type
4) Bootstrapping should be awesomely fast
Drawbacks:
1) Much of the _train and _predict is copy/pasted from the sg.SVM class
since it couldn't be directly inherited. It's a bit messy, and will
require some attention, but this isn't necessarily critical. The main
issue is that I don't know all the details of why each bit in those
routines is important, so I may inadvertently snip out a critical bit.
2) Requires explicitly caching a dataset (or raw samples). Would be
better to handle this automatically, but I have the intuition that this
may slow things down. Since I'm also building a parameter selection
utility, it would be nice to have it act identically to a normal clf -
or maybe I'll just have a subclass of the psel since it is based on CV
error and could be directly designed to work with this.
3) Shogun-dependent. I don't know yet about writing a libsvm version; I
see there is a precomputed-kernel option, but I started with shogun due
to the notes below. Either way, it looks like we'd need a version for
each backend.
4) Shogun-SVM dependent. Would be great to make it general for any
kernel-based clf, but for now it looks like things will be
implementation-dependent
Alternate strategies:
1) Add 'custom' to sg.SVM._KNOWN_KERNELS. However, the interface isn't
the same (CustomKernel can't be called with lhs, rhs, so I'd have to
modify sg.SVM._train and _predict) and this doesn't prevent requiring a
cache() method, since _train is never called with the full data set
2) Inherit from another target which provides a better kernel
abstraction. To my knowledge, none exist, and I can't imagine it being
worth refactoring sg.SVM just for this class. Libsvm backend looks like
it would have the same issues.
3) Write a kernel abstraction class, or even just a class that exposes
the shogun.Kernel interface, to handle the caching directly. Add this
to known kernel types in sg.SVM. Not sure if this has a general use
case, since CustomKernel is the only class whose API differs.
4) One possibility for automatic caching is a runtime hash of each new
input sample. I expect this to degrade performance, but I don't have an
intuition as to how much. This may also cause problems with the cache
growing inside a loop, depending on implementation. But it would allow
for complete automation without a cache() method - see the sketch below.
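Roughly, the hashing I'm picturing is just (a sketch - the cache dict
and the growth policy are hand-waved):

    import hashlib
    import numpy as N

    def _sample_key(sample):
        # hash the raw bytes of one sample so previously-seen rows can be
        # recognized on later train()/predict() calls
        return hashlib.sha1(N.ascontiguousarray(sample).tostring()).hexdigest()

    # hypothetical bookkeeping inside the clf:
    #   self._row_of_key = {}  # sample hash -> row/col of cached kernel matrix
    # on each _train/_predict: hash the incoming samples, compute kernel
    # entries only for keys not seen before, and grow the cached matrix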
Use cases (where it's better than a standard clf):
1) CV
2) Bootstrapping/jack-knifing for clfs which don't implement retrain(),
or where the training data changes (sketched after this list)
3) Others?? We could also do something similar for any predefined custom
kernel, but I can't think of a real use for that at the moment.
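E.g. for bootstrapping, once the kernel is cached, resampling is just
index juggling (a sketch; I'm assuming Dataset.selectSamples() here):

    import numpy as N

    dcached = clf.cache(dset)               # kernel computed once, up front
    for i in xrange(1000):
        idx = N.random.randint(0, dcached.nsamples, dcached.nsamples)
        boot = dcached.selectSamples(idx)   # resample with replacement
        clf.train(boot)                     # only looks up cached kernel entries
        # ...evaluate on the out-of-bag indices here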
Fortunately this all turns out to be minimal work - not including stuff
copied from sg.SVM, I'm only looking at a handful of lines of code. I
love python :)
Thoughts? Sorry for the wind...
Btw yarik -- congratulations!!
-s
Yaroslav Halchenko wrote:
> heh heh... indeed... phd defense tomorrow... computers/software does not
> cooperate etc...
>
> so... retraining -- it is only partially implemented now in pymvpa
>
> sg.SVM
> SVR
>
> have some retraining capabilities and in sg.SVM it is somewhat messy...
> in SVR it is somewhat not working (as I discovered recently while
> changing kernel parameters).
>
>
> but those are the codes which you can check out to start crafting your
> way through 'retraining' -- it should help in many cases (lots of
> features, not that many samples -- so whenever lots of time is taken to
> precompute the kernels). It should be really nice to have in SMLR... but
> it is not there. Meta classifiers (split,featureselection) also do not
> have it all now (iirc).
>
> I will come back to it after this Wed ;) but you are welcome to inspect
> what is already done and suggest what should be done
>
> On Mon, 30 Mar 2009, Michael Hanke wrote:
>
>
>> Hi Scott,
>>
>
>
>> On Mon, Mar 30, 2009 at 02:01:06PM -0400, Scott Gorlin wrote:
>>
>>> Hi guys,
>>>
>
>
>>> What's the latest thoughts on precalculated/retrainable classifiers?
>>> I.e., it could greatly speed up cross-validated model selection... it
>>> looks like libsvm's precomputed kernel isn't implemented though, is this
>>> due to a wrapper limitation? Or are people generally using other
>>> classifiers which have clf.retrain() implemented?
>>>
>
>
>>> If I were to implement this, any tips on where to begin?
>>>
>
>
>> Yarik is the expert on classifier retraining, but he is currently quite
>> busy and it might take a while until he can reply...
>>
>
>
>
>> Michael
>>