[pymvpa] RFE question 2.0

Yaroslav Halchenko debian at onerussian.com
Thu Nov 20 07:29:11 UTC 2008


> Ok, I found this section of warehouse.py and am currently testing it  
> out, but I still have a question:
>
> can I use CVTE w/ the SplitClassifier(FeatureSelectionClassifier(RFE))  
> combo?  Otherwise it's irritating to have to come up w/ the data splits 
> myself, then call train and test on the split classifier...
;) Love to hear it -- PyMVPA imposes correct behavior onto you, so that you
don't mess with this "low level" splitting ;-)

As for the question: here is an excerpt from the warehouse which should
work and seems to be what you are asking for:

    # assumes: from mvpa.suite import *   (pulls in all the names used below)
    rfesvm = LinearCSVMC()

    # This classifier will do RFE while taking the transfer error to the
    # testing set of each split. The resulting classifier is a voting
    # classifier on top of all splits -- let's see what that would do ;-)
    rfe_sclf = \
      SplitClassifier(                     # which does splitting internally
       FeatureSelectionClassifier(
        clf=LinearCSVMC(),
        feature_selection=RFE(             # on features selected via RFE
            sensitivity_analyzer=\
                rfesvm.getSensitivityAnalyzer(transformer=Absolute),
            transfer_error=TransferError(rfesvm),
            stopping_criterion=FixedErrorThresholdStopCrit(0.05),
            feature_selector=FractionTailSelector(
                               0.2, mode='discard', tail='lower'),
                               # remove 20% of features at each step
            update_sensitivity=True)),
            # update sensitivity at each step
       descr='LinSVM+RFE(N-Fold)')

(In warehouse.py this block sits commented out -- just remove the leading #s
-- and is appended to the clfs collection; binding it to a name like rfe_sclf
lets you use it directly.)

> Also, I'm not entirely sure whether I understand the role of  
> FeatureSelectionClassifier here (and I think this may have been part of 
> my problem to begin with) -- is this the *same* classifier we'll use to 
> perform generalization?
In the code example you gave below -- it is such a classifier...
in the example above -- it is a SplitClassifier.

>   Or is it *just* there for feature selection?  
Well --
FeatureSelectionClassifier indeed creates a classifier which first selects
some features and then trains some other classifier only on those selected
features.
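
To make that concrete, here is a minimal sketch (the ANOVA-based selection
and the 500-feature cutoff are just illustrative choices, not something from
your code):

    # select the 500 features with the highest F-scores, then train an SVM
    # on just those
    fsclf = FeatureSelectionClassifier(
        clf=LinearCSVMC(),
        feature_selection=SensitivityBasedFeatureSelection(
            OneWayAnova(),               # score each feature by ANOVA F-score
            FixedNElementTailSelector(500, mode='select', tail='upper')))
    # fsclf.train(ds) runs the selection and trains the SVM on the selected
    # features only; fsclf.predict() then sees only those features as well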

> The way I currently understand it, the split classifier is passing the 
> FeatSelClassifier some training data used to perform feature selection 
> (in this case RFE), and then it's testing on the held-out exemplars for 
> that split.  If this is the case, then how is the internal splitting 
> happening for RFE to be statistically valid? (i.e., that the training data
> is split as well to get a stopping set for RFE)

Let me first make a remark -- CVTE and SplitClassifier have a lot in
common: they both split the given dataset and train the given classifier on
each split.

Differences are:

* The task of CVTE is to spit out a single number -- an error. The task of
SplitClassifier is to be a classifier -- for each split it generates a
separate trained classifier, and at the end, having N such classifiers (1 per
split), it makes a decision in 'predict' using a maximal vote (by default).

* CVTE operates on the classifier it is given without creating
clones/copies of it (actually it operates on the TransferError which wraps the
classifier, since we might want different 'error functions' -- for instance a
correlation coefficient for regression). SplitClassifier clones/copies the
given classifier N times, creating a list of them which it keeps alive to be
used in 'predict'.

For classification you could actually always use SplitClassifier and then just
ask for its .confusion, which should be the same as CVTE's .confusion -- but
then you would waste memory if you are only interested in the cross-validated
performance (for which no classifier cloning is needed).
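
For instance, a minimal sketch of the contrast (assuming ds is any labeled
dataset with chunks, e.g. from normalFeatureDataset):

    # CVTE: spits out one error value; no per-split classifiers kept around
    cvte = CrossValidatedTransferError(
        TransferError(LinearCSVMC()),   # wrap the clf to get an error function
        NFoldSplitter(),
        enable_states=['confusion'])
    error = cvte(ds)                    # a single number

    # SplitClassifier: keeps N trained clones, remains usable as a classifier
    sclf = SplitClassifier(LinearCSVMC(), splitter=NFoldSplitter(),
                           enable_states=['confusion'])
    sclf.train(ds)
    print sclf.confusion                # should match cvte.confusion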

Does it make sense so far?

Now let's consider what happens when a SplitClassifier is wrapped around a
FeatureSelectionClassifier, and in turn that SplitClassifier is given to
CVTE:

1. CVTE gets some dataset and generates splits: training/testing. Let's call
them the 'outer' splits.

2. Then CVTE (via TransferError) trains the SplitClassifier by providing it
only the 'outer training' data.

3. The SplitClassifier obtains the outer training data and generates its own
splits (training/testing; let's call them 'inner'), with 1 classifier per
split.

4. The SplitClassifier provides both the inner training and testing splits it
generated to the FeatureSelectionClassifier to train it. The testing set is
provided (for now) only for such convoluted feature selections as RFE, where
the inner testing set can be used to figure out when to stop removing
features. Since the SplitClassifier never sees the outer testing set,
everything is cool in terms of CVTE's estimate of the transfer error.
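
In code the whole nesting is just (a sketch, with rfe_sclf being the
SplitClassifier defined above):

    cvte = CrossValidatedTransferError(
        TransferError(rfe_sclf),        # steps 2-4 happen in here per split
        NFoldSplitter(),                # generates the 'outer' splits (step 1)
        enable_states=['confusion'])
    outer_error = cvte(ds)              # RFE only ever saw the inner splits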

Does it make more sense now?

> Here's my code now (note that I'm not sure if CVTE is really appropriate 
> here, but I'm still confused about the functionality) -- there are so 
> many levels!!!

Well -- RFE is indeed a complicated beast, and that is actually why even its
original 'use' was wrong -- just imagine having no outer CVTE, i.e. taking the
full dataset, generating the splits, and using their transfer error to decide
when to stop. Actually it might be even more evil than that, but let me stop
here for today ;-)
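
That flawed pattern would look something like this (a sketch of what NOT to
do; rfe stands for the RFE instance from the warehouse excerpt above):

    # no outer CVTE: the same inner 'testing' sets that drive RFE's stopping
    # criterion also produce the reported confusion
    sclf = SplitClassifier(
        FeatureSelectionClassifier(clf=LinearCSVMC(), feature_selection=rfe),
        splitter=NFoldSplitter(),
        enable_states=['confusion'])
    sclf.train(ds)                      # trained on the FULL dataset
    print sclf.confusion                # optimistically biased!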

We could chat over the phone tomorrow to clear things up interactively...
I am not sure yet though when and where would be the best time -- I will drop
you a line tomorrow (i.e. today ;))

-- 
Yaroslav Halchenko
Research Assistant, Psychology Department, Rutgers-Newark
Student  Ph.D. @ CS Dept. NJIT
Office: (973) 353-1412 | FWD: 82823 | Fax: (973) 353-1171
        101 Warren Str, Smith Hall, Rm 4-105, Newark NJ 07102
WWW:     http://www.linkedin.com/in/yarik        


