[pymvpa] RFE question 2.0

James M. Hughes james.m.hughes at Dartmouth.EDU
Thu Nov 20 01:37:39 UTC 2008


On Nov 19, 2008, at 5:19 PM, Yaroslav Halchenko wrote:

> the thing is that you should use
> Splitclassifier not around your basic SVM but around that
> FeatureSelectionClassifier -- look at the source of  
> mvpa.clfs.warehouse.
> RFE classifiers are commented out but there -- look at the one which
> starts with SplitClassifier
OK, I found this section of warehouse.py and am currently testing it  
out, but I still have a question:

Can I use CVTE with the SplitClassifier(FeatureSelectionClassifier(RFE))  
combo?  Otherwise it's tedious to have to come up with the data  
splits myself, then call train and test on the split classifier...

Also, I'm not entirely sure I understand the role of  
FeatureSelectionClassifier here (and I think this may have been part  
of my problem to begin with) -- is this the *same* classifier we'll  
use to perform generalization, or is it *just* there for feature  
selection?  The way I currently understand it, the split classifier  
passes the FeatureSelectionClassifier some training data on which to  
perform feature selection (in this case RFE), and then tests on the  
held-out exemplars for that split.  If that's the case, how does the  
internal splitting happen so that RFE is statistically valid (i.e.,  
so that the training data is itself split again to get a stopping set  
for RFE)?
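To check my own understanding, here is the nested-splitting structure I  
*think* should be happening, sketched in plain Python (no PyMVPA; the  
function names and the selection/stopping logic are made up purely for  
illustration):

```python
# Toy sketch of nested splitting: an outer CV split for generalization
# testing, plus an inner split of the *training* data that gives RFE a
# held-out "stopping set".  All names here are hypothetical.

def outer_folds(samples, n_folds):
    """Yield (train, test) index lists, leaving one fold out per turn."""
    fold_size = len(samples) // n_folds
    idx = list(range(len(samples)))
    for f in range(n_folds):
        test = idx[f * fold_size:(f + 1) * fold_size]
        train = idx[:f * fold_size] + idx[(f + 1) * fold_size:]
        yield train, test

def rfe_with_internal_split(train_idx):
    """RFE needs its own split of the training data: one part to fit on,
    one part held out as the stopping set that decides when to stop
    pruning features.  The outer test data never enters here."""
    half = len(train_idx) // 2
    fit_part, stop_part = train_idx[:half], train_idx[half:]
    # ... run RFE on fit_part, monitor transfer error on stop_part ...
    return fit_part, stop_part

samples = list(range(12))
for train, test in outer_folds(samples, n_folds=3):
    fit_part, stop_part = rfe_with_internal_split(train)
    # statistically valid only if the outer test fold stays disjoint
    # from everything RFE ever saw
    assert not set(test) & set(fit_part + stop_part)
```

My question is whether the SplitClassifier / FeatureSelectionClassifier  
nesting actually arranges the splits this way for me.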

Here's my code now (note that I'm not sure if CVTE is really  
appropriate here, but I'm still confused about the functionality) --  
there are so many levels!!!


def do_rfe(dataset, percent):
    debug.active = ['CLF']

    rfesvm = LinearCSVMC()

    # RFE driven by ANOVA sensitivities; sensitivities are computed once
    # (update_sensitivity=False), not recomputed at every pruning step
    feature_selection = RFE(
        sensitivity_analyzer=OneWayAnova(),
        transfer_error=TransferError(rfesvm),
        feature_selector=FractionTailSelector(percent / 100.0,
                                              mode='select', tail='upper'),
        update_sensitivity=False)

    # classifier trained only on the features selected via RFE
    clf = FeatureSelectionClassifier(
        clf=LinearCSVMC(),
        feature_selection=feature_selection,
        enable_states=['confusion'])

    #clf.states.enable('feature_ids')

    split_clf = SplitClassifier(clf=clf, enable_states=['confusion'])
    split_clf.states.enable('feature_ids')

    cv = CrossValidatedTransferError(TransferError(split_clf),
                                     NFoldSplitter(cvtype=1),
                                     enable_states=['confusion'])

    error = cv(dataset)

    print 'Error:  %s' % error

    return split_clf.confusion, split_clf.feature_ids

Many thanks for any input!

Cheers,
James.



More information about the Pkg-ExpPsy-PyMVPA mailing list