[pymvpa] Using FeatureSelectionClassifier for feature elimination

Yaroslav Halchenko debian at onerussian.com
Wed Jul 16 20:36:05 UTC 2008


I placed some comments below, but let me summarize on top here -- it might make
things clearer (I hope). I must warn you though -- I am not that great a writer,
nor a native English speaker, and Emanuele might correct/extend me on the
terminology ML people use. Also, my writing came out long. If you feel that it
is descriptive enough, I would appreciate your 'rewrite' or 'fix' of it so it
could be added to the manual or elsewhere in the documentation.

Terminology:

 dataset = data + labels + chunks, so it is sufficient for supervised
           training (and for splitting)
 data = just the samples, ie no labels/chunks provided

A few dogmas:

 No classifier knows anything about splits (besides SplitClassifier). It
 just trains somehow on the dataset it obtains in train(), and
 predicts on the data (thus without labels) it obtains in predict().

 To validate how well a classifier performs, the common practice is to
 use cross-validation, ie split the data into training and testing
 datasets, give one part to the classifier for training, and
 then see how well it performs on the data it hasn't seen (ie the second part).
 (Thus we have CrossValidatedTransferError.)

That is pretty much all the dogmas ;-) and the role of having two splits
should be obvious, right?
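
For reference, here is roughly what such a cross-validation looks like (a
minimal, untested sketch; dataset is assumed to be an existing dataset with
labels and chunks):

    from mvpa.suite import *

    cv = CrossValidatedTransferError(
            TransferError(LinearCSVMC()),   # error of a linear SVM on the held-out part
            NFoldSplitter(cvtype=1))        # leave-one-chunk-out splitting
    error = cv(dataset)                     # mean error across the splits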


Some classifiers have hyperparameters which need to be adjusted while the
classifier gets trained, ie while it sees only the training data and not the
testing data (which we held back for cross-validation). Those hyperparameters
can be various things, including the set of features which the classifier
decides to use internally. And so we come to FeatureSelectionClassifier, which
is a meta-classifier: it is not a classifier by itself, but it can take any
'slave' classifier, perform some feature selection in advance, select those
features, and then train the 'slave' classifier on them. Externally, though, it
looks (or should look) like a classifier, ie it fulfills the interface of the
Classifier base class. Here is a list of the relevant arguments to the
constructor of such a classifier:

  clf : Classifier
	classifier based on which mask classifiers is created
  feature_selection : FeatureSelection
	whatever `FeatureSelection` comes handy
  testdataset : Dataset
	optional dataset which would be given on call to feature_selection

So testdataset is actually optional; let's forget about it for now.
Also, the description of clf seems too vague... we will fix it.

How does such a classifier know which features to select?  It uses some
FeatureSelection. A simple example of such a feature selection (as can be
found, let's say, in mvpa/clfs/warehouse.py) is something like

       SensitivityBasedFeatureSelection(
           OneWayAnova(),
           FractionTailSelector(0.05, mode='select', tail='upper')),

so, it simply does an ANOVA on the training dataset and selects the 5% most
important features. For such a feature selection it doesn't even need a
classifier ;-) And apparently no additional splitting is needed anywhere within
such a classifier.
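
Just to make it concrete, here is a minimal sketch of wrapping that very
feature selection around a 'slave' classifier (untested, but it mirrors the
constructs in warehouse.py; the name fsclf is mine):

    from mvpa.suite import *

    fsclf = FeatureSelectionClassifier(
        LinearCSVMC(),                    # 'slave' classifier trained on the survivors
        SensitivityBasedFeatureSelection(
            OneWayAnova(),                # measure which needs no classifier
            FractionTailSelector(0.05, mode='select', tail='upper')),
        descr="LinSVM on 5%(ANOVA)")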

	It is important, imho, to emphasize once again that such feature selection
	within train() doesn't see the testing dataset. Some people play an unfair
	game -- they select features, or tune some other hyperparameters, on the
	whole dataset, and then claim that they got an unbiased cross-validation
	estimate. That is cheating.

More interesting would be to select the features for training a linear SVM
based on which ones are most significant according to the coefficients of the
separating hyperplane's normal. Then as a sensitivity analyzer we can use the
same classifier that we will later train on just that selected subset of
features (pasting from warehouse.py):

    sample_linear_svm = clfs['linear', 'svm'][0]

    clfs += \
        FeatureSelectionClassifier(
            sample_linear_svm,
            SensitivityBasedFeatureSelection(
               sample_linear_svm.getSensitivityAnalyzer(transformer=Absolute),
               FractionTailSelector(0.05, mode='select', tail='upper')),
            descr="LinSVM on 5%(SVM)")

so here we first define an instance of a linear SVM, which is used first to
assess sensitivity (in SensitivityBasedFeatureSelection) and is then trained
on those 5% of the features. In other words it looks like a 1-step RFE ;-)

More interestingly, imho, we don't have to use the same classifier:

    clfs += \
         FeatureSelectionClassifier(
             LinearCSVMC(),
             SensitivityBasedFeatureSelection(
                SMLRWeights(SMLR(lm=0.1, implementation="C")),
                RangeElementSelector(mode='select')),
             descr="LinSVM on SMLR(lm=0.1) non-0")

so, we are selecting features based on the sensitivities provided by the SMLR
classifier (actually just selecting features with non-zero weights, but with
another Selector we could get some other subset of features), and then we are
training a linear SVM on them.

So, all those FeatureSelectionClassifiers behave just like regular classifiers;
they simply do the feature selection on the training dataset prior to
training the 'slave' classifier.
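
For instance, taking the ANOVA-based fsclf sketched above, and assuming
training_dataset and testing_dataset are some existing datasets (both names are
mine), usage is simply:

    fsclf.train(training_dataset)     # selects the features, then trains the SVM on them
    predictions = fsclf.predict(testing_dataset.samples)
    # the features selected during train() are reused automatically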

Let me now try to discuss RFE...


What do we all know? In RFE we recursively prune features of the classifier
until we reach a minimum or maximum of some criterion. Naturally, such a
criterion could simply be a cross-validation error... but the classifier obtains
just the training dataset in train() (let's forget about testdataset for now),
thus to get a cross-validation assessment for RFE we would need to split the
dataset again, or use some criterion which doesn't require 2 separate datasets
(ie 1 for training, 1 for testing).

If we split to do cross-validation within the training dataset, then an
additional aspect gets in the way: if we split the data into, let's say, 10
splits, we will get 10 sensitivity assessments. Thus we can run 10 independent
RFE processes (let's name it scenario #1), selecting 10 (hopefully mostly
overlapping) sets of features.

Or, at each step of RFE, we can take the average of those 10 sensitivities
(scenario #2) and decide which features to prune based on that single
sensitivity map.

Another possibility is to take the sensitivity of the classifier without doing
any splitting of the training dataset, and only temporarily split the dataset to
do cross-validation (scenario #3) to get the needed assessment of generalization
performance.

Any of the above scenarios can easily be crafted with a custom loop and calls
to that '1-step RFE' I described above; see the rough sketch below.
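
Here is a rough, untested sketch of such a loop, closest to scenario #3: the
sensitivity is taken from the full training dataset and cross-validation is
used only for the assessment. The names dataset/working are mine, and I use a
plain ANOVA as the sensitivity just to keep it short:

    import numpy as N
    from mvpa.suite import *

    working = dataset                        # the training dataset we are allowed to see
    cv = CrossValidatedTransferError(TransferError(LinearCSVMC()),
                                     NFoldSplitter(cvtype=1))
    while working.nfeatures > 10:
        sens = OneWayAnova()(working)        # per-feature sensitivities
        nkeep = max(10, int(0.8 * working.nfeatures))
        keep_ids = N.argsort(sens)[-nkeep:]  # keep the strongest 80%
        working = working.selectFeatures(keep_ids)
        print working.nfeatures, cv(working) # generalization estimate within training data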

In my and Michael's attempts to do RFE, scenario #2 proved to be the most
stable, ie not chasing bogus features, so we decided to make such RFE easy to
set up, so that no custom loops etc are needed.

	For scenario #1, look in warehouse.py for "This classifier will do RFE
	while taking transfer error to testing" -- I think that one implements it.

	Scenario #3 needs a little adjustment in SplitClassifier, which would then
	use the splits only for the generalization estimate, but would predict()
	using a classifier trained on the full dataset, so its sensitivity would be
	based on the full dataset as well. Or maybe it could even fit somehow into
	the current implementation via some crafting within a
	FeatureSelectionClassifier specialization... but let's leave that for later
	discussions.

Let's now look at the scenario #2 definition ;-)

(taken from a commented-out section of warehouse.py, which means at least that I
didn't test this construct recently, but it might work out of the box ;))


  rfesvm_split = SplitClassifier(LinearCSVMC())

  clfs += \
    FeatureSelectionClassifier(
      clf = LinearCSVMC(),
      feature_selection = RFE(             # on features selected via RFE
          # based on sensitivity of a clf which does splitting internally
          sensitivity_analyzer=rfesvm_split.getSensitivityAnalyzer(),
          transfer_error=ConfusionBasedError(
             rfesvm_split,
             confusion_state="confusion"),
             # and whose internal error we use
          feature_selector=FractionTailSelector(
                             0.2, mode='discard', tail='lower'),
                             # remove 20% of features at each step
          update_sensitivity=True),
          # update sensitivity at each step
      descr='LinSVM+RFE(splits_avg)' )

so, the new party in this game is SplitClassifier, which is yet another
meta-classifier, and which in this case takes a linear SVM as its 'slave'. It
is a twin brother of CrossValidatedTransferError, and supposedly can do
everything that error does, but besides that it is a valid classifier (ie it can
be trained/tested).

Thus SplitClassifier is parametrized with a splitter:

              splitter : Splitter
                `Splitter` to use to split the dataset prior training

So, prior to training, SplitClassifier splits the training dataset (like into
those 10 pieces I described above), dedicates a separate classifier to each
split, trains it on the training part of the split, and then computes the
transfer error on the 'testing' part of that split. If the SplitClassifier
instance is later asked to predict() some new data, it uses (by default) a
MaximalVote strategy to derive the answer. In the ML world such a strategy is
close to boosting, I believe (or maybe even bagging). A summary of how the split
classifier performed internally on those splits of the training dataset is
available in its 'confusion' state variable.
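
For example (a small, untested sketch; the keyword names here -- the splitter as
the second argument and enable_states -- follow my reading of the current API,
so double-check against the docstrings):

    sclf = SplitClassifier(LinearCSVMC(), NFoldSplitter(),
                           enable_states=['confusion'])
    sclf.train(training_dataset)   # one SVM per split, each tested on its held-out part
    print sclf.confusion           # summary confusion matrix across the internal splits
    print sclf.confusion.error     # ... and the corresponding error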

Now let's get back to the code snippet. With RFE we just want to do some feature
selection (hence the name), thus naturally we need a FeatureSelectionClassifier.
RFE is just one more method of selecting the features, so RFE is a subclass of
FeatureSelection. To perform RFE we need a few parameters (for the full list see
?RFE.__init__), and in our example we instruct it to use

 the sensitivity of the split classifier (by default it just takes the mean
 across the sensitivities of the separate splits, thus it is scenario #2);

 also, the criterion is just rfesvm_split.confusion, thus we have to use
ConfusionBasedError, which just accesses the value of that confusion -- or it
could be something as simple as

 lambda *args: rfesvm_split.confusion.error


Then the rest is easy: feature_selector is a helper which selects the features
for us (discarding 20% per step in the current example), and update_sensitivity
says that we would like to reassess the sensitivity of rfesvm_split at each step
of RFE. Otherwise we could simply compute the sensitivity once, with all the
features, and start pruning without explicitly retraining the classifier for
each selected subset of features to get a new sensitivity (implicitly we still
need to retrain it anyway, because we use ConfusionBasedError, which changes
while we are changing the subset of features). update_sensitivity=False makes
sense mainly if we use a sensitivity_analyzer which is not classifier-based, ie
something like ANOVA.

Ho ho -- so that is it ;-) Hopefully it made some sense, and you are welcome to
see my comments on your questions below. Please let us know whether what I just
described makes any sense. Also, you can track the progress of RFE if you enable
the debug id RFEC (it should become just RFE sooner or later), or, if you really
want to see what is happening inside the RFE call, enable even RFEC_. And if you
don't trust anyone, then also enable CLF, and you will see most of the things
which happen under the hood ;-)
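
(This is how I believe those debug targets get enabled -- check the manual's
debugging section if it doesn't match your version:

    from mvpa.base import debug
    debug.active += ['RFEC', 'CLF']

or set MVPA_DEBUG=RFEC,CLF in the environment before starting Python.)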

> I'm aware that if I have a dataset I need to split it into 3 parts;

It depends, actually... You can decide when to stop feature selection
based on some performance criterion other than cross-validation. For
instance, with GPR it might be the log_marginal_likelihood of the classifier.
This way you don't actually need to split into 3 parts, but rather into 2.
I am not sure if that is making things clearer... it might be just more
confusing ;-) SMLR does feature selection internally too, and thus doesn't
rely on any other criterion such as generalization, so once again a
3rd split is not required.

> however, for me there still exists some confusion about the role each  
> of those parts is supposed to play.  I'd like to try to tackle this  
> question first, then move on to specifics for PyMVPA's implementation.

> In my original understanding, the three-way dataset split was as  
> follows:

> (1) Split for RFE (half of which is used to train the classifier, the  
> other half to determine accuracy to pick features)
> (2) Split for training the classifier (with a subset of the features)
> (3) Split for testing the trained classifier

> However, it would seem (by Yaroslav's email) that in fact we need  
> these three splits:
> (1) Split for RFE training
> (2) Split for RFE validation
> (3) Split for testing (i.e. generalization)

> Partly my confusion comes from the fact that we would seem to be using  
> the same split (1) to both train a classifier to select features and  
> to use as the classifier for testing (i.e. generalization).  Is this a  
> valid approach?

The point is that testing/generalization should be done after we have trained
the classifier. So if it is a FeatureSelectionClassifier, then it should select
the features during that training.


> Finally, with respect to PyMVPA, if we use a  
> FeatureSelectionClassifier (and presumably use it in the traditional  
> way Classifier objects are used), then what meaning do clf.train/ 
> clf.predict have.

FeatureSelectionClassifier is intended to be just a classifier which does
feature selection first. The feature selection can be as simple as 'select the
5% most diagnostic features according to some measure'.

> As I understand it, we can provide a testdataset to the  
> FeatureSelectionClassifier, which results in the following:

> first, we initialize a classifier of type FeatureSelectionClassifier,  
> along with split (2) above as the testdataset.  Then, we call  
> clf.train on split (1) above, then clf.predict on split (3) above, in  
> order to perform generalization.  I assume that, since we are using a  
> FeatureSelectionClassifier, it automatically selects the correct  
> features during generalization that were selected during training.


> Do I have the right understanding on this?

> Sorry for the confusion -- but I'd be very happy to write some of the  
> documentation for this stuff if/when I get a better handle on it.

> Thanks,
> James.



-- 
Yaroslav Halchenko
Research Assistant, Psychology Department, Rutgers-Newark
Student  Ph.D. @ CS Dept. NJIT
Office: (973) 353-5440x263 | FWD: 82823 | Fax: (973) 353-1171
        101 Warren Str, Smith Hall, Rm 4-105, Newark NJ 07102
WWW:     http://www.linkedin.com/in/yarik        


