[pymvpa] retraining
Scott Gorlin
gorlins at MIT.EDU
Wed Apr 8 06:07:18 UTC 2009
Okay, this email is admittedly way too long... but hopefully some of you
will find it as exciting as I do!
I looked through some of the retraining stuff, and (for my purposes at
least) it doesn't seem to address the issue - the main reason I need to
accelerate retraining is cross-validation, specifically model selection,
which repeatedly changes the training data set etc. (and basically
forces full retraining in each implementation I looked at).
To this end, I have an implementation of a cached-kernel SVM that
greatly accelerates n-fold cv since it only computes the kernel once for
an entire cv session. This can theoretically be interfaced with model
selection to change model params quickly and kernel params slowly, and
speed up the entire process further. I'm having some odd errors with
alioth right now and can't push (I don't know whether it's server-side
or something's actually wrong with my repo - I'm getting weird sha1 file
permission errors, so I'll try again tomorrow).
This seems like something which would be of great general interest,
especially w.r.t. model selection, so let me spell out the architecture
if anyone cares to comment (or wait for my push):
CachedSVM inherits from sg.SVM. It doesn't override __init__, so you can
create any shogun SVM implementation you want.
It implements cache(self, dataset), which takes in *all* data the clf
is meant to be used with. This:
1) creates an instance of the kernel defined in __init__ as normal, i.e.
still respecting the normal kernel params, but stores the kernel matrix
using Kernel.get_kernel_matrix()
2) strips the dataset with dout = dset.selectFeatures([0]); dout.samples
= N.arange(dset.nsamples); return dout
Usage: dcached = clf.cache(dset); CrossValidatedTransferError(clf,
...)(dcached)
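For concreteness, cache() boils down to something like this (just a
sketch: the _setup_kernel helper is a made-up placeholder for the kernel
construction code borrowed from sg.SVM._train, and I reshape the indices
into a column so they look like ordinary 2-D samples):

    import numpy as N
    from mvpa.clfs.sg import SVM

    class CachedSVM(SVM):
        def cache(self, dset):
            # build the kernel once on the *complete* data, using whatever
            # kernel type/params were handed to __init__ (_setup_kernel is
            # a placeholder for the code lifted from sg.SVM._train)
            kernel = self._setup_kernel(dset.samples, dset.samples)
            self._cached_matrix = kernel.get_kernel_matrix()
            # strip the dataset down to a single 'feature' holding each
            # sample's row index in the cached matrix
            dout = dset.selectFeatures([0])
            dout.samples = N.arange(dset.nsamples).reshape((-1, 1))
            return dout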
_train(self, dset) and _predict(self, samples):
1) Accept only integer samples, i.e. those returned by cache(), which
indicate the stored sample index
2) Build a new kernel using shogun.Kernel.CustomKernel(), filling it
with the rows/columns of the cached kernel matrix indexed by the samples
array
3) Other training/predicting steps are basically copied from sg.SVM
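In rough pseudo-code, _train then looks like this (assuming
CustomKernel.set_full_kernel_matrix_from_full() is the right shogun
call; _predict does the same, but slices rows by the stored training
indices and columns by the incoming test indices):

    import numpy as N
    from shogun.Kernel import CustomKernel

    def _train(self, dset):
        # dset.samples are integer indices into the cached kernel matrix
        idx = dset.samples.ravel().astype(int)
        kernel = CustomKernel()
        kernel.set_full_kernel_matrix_from_full(
            self._cached_matrix[N.ix_(idx, idx)])
        # ...the rest (labels, instantiating the shogun SVM, svm.train())
        # is the same as in sg.SVM._train, just with this kernel swapped in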
On an NFoldSplitter of normalFeatureDataset(perlabel=500,
nfeatures=2000, nchunks=4):
Kernel cache: 2.152509 s
Cached CV: 0.492900 s
Normal CV: 12.817234 s
using CrossValidatedTransferError, sg.SVM(), and the corresponding
default params in my CachedSVM.
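(The timing loop is nothing fancier than the sketch below - import paths
are from memory and CachedSVM is of course my class from above, so
adjust to taste.)

    import time
    from mvpa.suite import *      # normalFeatureDataset, NFoldSplitter, ...
    from mvpa.clfs import sg      # shogun-backed SVM

    dset = normalFeatureDataset(perlabel=500, nfeatures=2000, nchunks=4)

    cached = CachedSVM()          # same default params as sg.SVM()
    t0 = time.time(); dcached = cached.cache(dset)
    print 'Kernel cache: %f s' % (time.time() - t0)

    cv = CrossValidatedTransferError(TransferError(cached), NFoldSplitter())
    t0 = time.time(); cv(dcached)
    print 'Cached CV: %f s' % (time.time() - t0)

    cv = CrossValidatedTransferError(TransferError(sg.SVM()), NFoldSplitter())
    t0 = time.time(); cv(dset)
    print 'Normal CV: %f s' % (time.time() - t0)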
From my perspective, the benefits of this architecture are:
1) Inheriting from sg.SVM allows for, in theory, any known SVM
implementation, based on any shogun kernel
2) Otherwise acts as a standard clf
3) Easy to take parameters from this (i.e. after cv model selection) and
plop them into a normal shogun clf of the same type
4) Bootstrapping should be awesomely fast
Drawbacks:
1) Much of the _train and _predict is copy/pasted from the sg.SVM class
since it couldn't be directly inherited. It's a bit messy, and will
require some attention, but this isn't necessarily critical. The main
issue is that I don't know all the details of why each bit in those
routines is important, so I may inadvertently snip out a critical bit.
2) Requires explicitly caching a dataset (or raw samples). Would be
better to handle this automatically, but I have the intuition that this
may slow things down. Since I'm also building a parameter selection
utility, it would be nice to have it act identically to a normal clf -
or maybe I'll just have a subclass of the psel since it is based on CV
error and could be directly designed to work with this.
3) Shogun-dependent. I don't know yet about writing a libsvm version; I
see there is a precomputed-kernel option, but I started with shogun due
to the notes below. Either way, it looks like we'd need a version for
each backend.
4) Shogun-SVM dependent. Would be great to make it general for any
kernel-based clf, but for now it looks like things will be
implementation-dependent
Alternate strategies:
1) Add 'custom' to sg.SVM._KNOWN_KERNELS. However, the interface isn't
the same (CustomKernel can't be called with lhs, rhs, so I'd have to
modify sg.SVM._train and _predict) and this doesn't prevent requiring a
cache() method, since _train is never called with the full data set
2) Inherit from another target which provides a better kernel
abstraction. To my knowledge, none exist, and I can't imagine it being
worth refactoring sg.SVM just for this class. Libsvm backend looks like
it would have the same issues.
3) Write a kernel abstraction class, or even just a class that exposes
the shogun.Kernel interface, to handle the caching directly. Add this
to known kernel types in sg.SVM. Not sure if this has a general use
case, since CustomKernel is the only class whose API differs.
4) One possibility for automatic caching is a runtime hash of each new
input sample. I expect this to degrade performance, but I don't have an
intuition as to how much. This may also cause problems with the cache
growing inside a loop, depending on implementation. But it would allow
for complete automation without a cache() method - see the sketch below.
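Roughly, the hashing I'm picturing is just (a sketch - the cache dict
and the growth policy are hand-waved):

    import hashlib
    import numpy as N

    def _sample_key(sample):
        # hash the raw bytes of one sample so previously-seen rows can be
        # recognized on later train()/predict() calls
        return hashlib.sha1(N.ascontiguousarray(sample).tostring()).hexdigest()

    # hypothetical bookkeeping inside the clf:
    #   self._row_of_key = {}  # sample hash -> row/col of cached kernel matrix
    # on each _train/_predict: hash the incoming samples, compute kernel
    # entries only for keys not seen before, and grow the cached matrix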
Use cases (where it's better than a standard clf):
1) CV
2) Bootstrapping/jack-knifing for clfs which don't implement retrain(),
or where the training data changes (sketched after this list)
3) Others?? We could also do something similar for any predefined custom
kernel, but I can't think of a real use for that at the moment.
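E.g. for bootstrapping, once the kernel is cached, resampling is just
index juggling (a sketch; I'm assuming Dataset.selectSamples() here):

    import numpy as N

    dcached = clf.cache(dset)               # kernel computed once, up front
    for i in xrange(1000):
        idx = N.random.randint(0, dcached.nsamples, dcached.nsamples)
        boot = dcached.selectSamples(idx)   # resample with replacement
        clf.train(boot)                     # only looks up cached kernel entries
        # ...evaluate on the out-of-bag indices here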
Fortunately this all turns out to be minimal work - not including stuff
copied from sg.SVM, I'm only looking at a handful of lines of code. I
love python :)
Thoughts? Sorry for the wind...
Btw yarik -- congratulations!!
-s
Yaroslav Halchenko wrote:
> heh heh... indeed... phd defense tomorrow... computers/software does not
> cooperate etc...
>
> so... retraining -- it is only partially implemented now in pymvpa
>
> sg.SVM
> SVR
>
> have some retraining capabilities and in sg.SVM it is somewhat messy...
> in SVR it is somewhat not working (as I discovered recently while
> changing kernel parameters).
>
>
> but those are the codes which you can check out to start crafting your
> way through 'retraining' -- it should help in many cases (lots of
> features, not that many samples -- so whenever lots of time is taken to
> precompute the kernels). It should be really nice to have in SMLR... but
> it is not there. Meta classifiers (split,featureselection) also do not
> have it all now (iirc).
>
> I will come back to it after this Wed ;) but you are welcome to inspect
> what is already done and suggest what should be done
>
> On Mon, 30 Mar 2009, Michael Hanke wrote:
>
>
>> Hi Scott,
>>
>
>
>> On Mon, Mar 30, 2009 at 02:01:06PM -0400, Scott Gorlin wrote:
>>
>>> Hi guys,
>>>
>
>
>>> What's the latest thoughts on precalculated/retrainable classifiers?
>>> I.e., it could greatly speed up cross-validated model selection... it
>>> looks like libsvm's precomputed kernel isn't implemented though, is this
>>> due to a wrapper limitation? Or are people generally using other
>>> classifiers which have clf.retrain() implemented?
>>>
>
>
>>> If I were to implement this, any tips on where to begin?
>>>
>
>
>> Yarik is the expert on classifier retraining, but he is currently quite
>> busy and it might take a while until he can reply...
>>
>
>
>
>> Michael
>>