[pymvpa] RFE question 2.0

Yaroslav Halchenko debian at onerussian.com
Wed Nov 19 22:19:59 UTC 2008


Hi James,

sorry -- I can only give a sketchy reply atm (still away from home/work)

the thing is that you should wrap SplitClassifier not around your basic
SVM but around that FeatureSelectionClassifier -- look at the source of
mvpa.clfs.warehouse. The RFE classifiers are commented out but they are
there -- look at the one which starts with SplitClassifier
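To make the wrapping order concrete, here is a plain-Python sketch with stand-in classes (these are NOT the PyMVPA implementations, just empty shells named after them) showing which meta-classifier goes outside which:

```python
# Stand-in classes (NOT the real PyMVPA ones) just to illustrate the
# nesting order: SplitClassifier wraps the FeatureSelectionClassifier,
# which in turn wraps the basic SVM -- not the other way around.
class LinearCSVMC(object):
    """stand-in for the basic SVM"""

class FeatureSelectionClassifier(object):
    """stand-in: runs feature selection (e.g. RFE) around its classifier"""
    def __init__(self, clf):
        self.clf = clf

class SplitClassifier(object):
    """stand-in: trains one copy of its classifier per inner data split"""
    def __init__(self, clf):
        self.clf = clf

# splitting happens OUTSIDE the feature selection
fsclf = FeatureSelectionClassifier(clf=LinearCSVMC())
meta = SplitClassifier(clf=fsclf)

print(type(meta.clf).__name__)  # -> FeatureSelectionClassifier
```

This way each inner training split gets its own full feature-selection pass, rather than one RFE pass being shared across splits.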

actually it might be close to what you are doing already, but then you
shouldn't use TransferError... there might be an example for that in the
warehouse as well

I will reply in more detail tomorrow if it still doesn't work for you.

Yarik

On Wed, 19 Nov 2008, James M. Hughes wrote:

>
> Ok, so I'm really confused now about why my code doesn't work, as I thought
> I had it figured out.  I'm going to post the code, then my understanding of
> what it does, and then hopefully we can sort this out.
>
> On Nov 19, 2008, at 10:15 AM, Yaroslav Halchenko wrote:
>
>> To do RFE in an unbiased way with your scenario: in clfs.warehouse (where
>> I guess you took the RFE sample for your code), the sample classifier
>> with RFE is wrapped within a SplitClassifier, so within each outer
>> training split, it does splitting again to determine the stopping point in
>> an unbiased way. And then the SplitClassifier itself makes a decision by
>> voting across multiple classifiers (one per inner split).
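A toy plain-Python illustration of that scheme (made-up chunk numbers and labels, not PyMVPA code): within the outer training chunks, split again leave-one-out, and let the per-inner-split classifiers decide by majority vote:

```python
def loo_splits(chunks):
    # leave-one-chunk-out: one (training, held-out) pair per chunk
    return [([c for c in chunks if c != held], held) for held in chunks]

# one outer training split: chunk 5 is already held out by the outer loop
outer_training = [0, 1, 2, 3, 4]

# the RFE stopping point would be determined on these inner splits
inner = loo_splits(outer_training)
print(len(inner))  # -> 5 inner splits, one classifier per split

# SplitClassifier-style decision: majority vote across the predictions
# of the per-inner-split classifiers (labels made up for illustration)
def majority_vote(predictions):
    return max(set(predictions), key=predictions.count)

preds = ['face', 'house', 'face', 'face', 'house']
print(majority_vote(preds))  # -> 'face'
```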
>
> Based on this, I modified my code to the following:
>
> def do_rfe(dataset, percent):
>     debug.active = ['CLF']
>
>     rfesvm_split = SplitClassifier(LinearCSVMC())
>
>     FeatureSelection = RFE(
>         sensitivity_analyzer=OneWayAnova(),
>         transfer_error=TransferError(rfesvm_split),
>         feature_selector=FractionTailSelector(percent / 100.0,
>                                               mode='select', tail='upper'),
>         # update sensitivity at each step (since we're not using the
>         # same CLF as sensitivity analyzer)
>         update_sensitivity=False)
>
>     # classify on features selected via RFE
>     clf = FeatureSelectionClassifier(
>         clf=rfesvm_split,
>         feature_selection=FeatureSelection,
>         enable_states=['confusion'])
>
>     clf.states.enable('feature_ids')
>
>     cv = CrossValidatedTransferError(
>         TransferError(clf),
>         NFoldSplitter(cvtype=1),
>         enable_states=['confusion'])
>
>     error = cv(dataset)
>
>     print 'Error:  ' + repr(error)
>
>     return clf.confusion, clf.feature_ids
>
> and yet still the error
>
> <type 'exceptions.AttributeError'>: 'NoneType' object has no attribute  
> 'samples'
>
> persists; it gets thrown after what appears to be feature selection on  
> one leave-one-out set of splits.
>
> Now, based on what Yarik wrote, I thought this would be fine, for the  
> following reasons:
>
> CrossValidatedTransferError (CVTE) should do leave-one-out splits of the
> input data, computing transfer error based on clf's transfer error.  Each
> call to train clf should also include an RFE step, because we're using a
> FeatureSelectionClassifier, which in turn uses an SVM to perform RFE; but
> it's a SplitClassifier, so it should provide a held-out RFE/training set
> for training and use a single RFE/testing exemplar to calculate error
> (this is of course done for each split).  Then, once these features are
> extracted, the testing exemplar held out via CVTE is tested
> against the classifier trained with the selected features.
>
> So schematically, it would look like this (I thought a diagram might
> help -- note that I only filled in all the action for one split, but the
> rest would look similar):
>
> [schematic diagram not preserved in the plain-text archive]
>
> I'm still unclear as to where the classifier that's actually input into 
> the FeatureSelectionClassifier gets trained -- but I guess this is the 
> complexity issue that was brought up before; for each of n-1 original 
> splits, we have (technically) n-2 splits, each of which gets its own 
> classifier instance -- is this right?  So actually when the test data is 
> tested, at the Clf w/ selected features stage, it's really testing on 
> (n-2) classifiers, instead of a single one?  Each of which has its own 
> features selected?
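If it helps to check that arithmetic, here is a plain-Python count of the nesting, assuming leave-one-chunk-out splitting at both levels (n = 6 chunks is an arbitrary choice for illustration):

```python
def loo_splits(chunks):
    # leave-one-chunk-out: one split per chunk
    return [([c for c in chunks if c != held], held) for held in chunks]

chunks = list(range(6))            # n = 6 data chunks
outer = loo_splits(chunks)         # CVTE / NFoldSplitter(cvtype=1) level
print(len(outer))                  # -> 6 outer splits

train_chunks, test_chunk = outer[0]
inner = loo_splits(train_chunks)   # SplitClassifier level
print(len(inner))                  # -> 5 (= n - 1) inner splits, i.e.
                                   #    n - 1 classifier instances voting
                                   #    on each outer test exemplar
```

So under this assumption, each outer test exemplar is indeed judged by an ensemble of n - 1 inner classifiers, each potentially with its own selected features.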
>
> The next part below is a lingering uncertainty for me, because I don't
> really understand why we can't put an arbitrary testing CLF into the
> actual RFE object (the part in the schematic which says 'train
> classifier for criterion testing').  I guess if we want to actually
> *use* one of these trained classifiers, then we would need to have it
> trained there; still, I don't get why they can't just be "retrained" by
> the FeatureSelectionClassifier object when the features themselves are
> returned -- this is an unclear "feature," IMHO.  As for the next part
> below, feel free to read it or disregard it for the moment.  Thanks!
>
> ****
> Moreover, I'm not really sure why this breaks (apparently even worse,  
> because it doesn't even get through more than one data split) when I  
> replace 'transfer_error=TransferError(rfesvm_split)' with  
> 'transfer_error=TransferError(LinearCSVMC())' -- isn't the transfer  
> error being computed on the training and testing sets provided by the  
> SplitClassifier one level above to the RFE algorithm?  Thus, if we  
> wanted to use a different classifier altogether (and probably we  
> should), it shouldn't break!  Or is it that the transfer_error is  
> computed independently of the internal RFE algorithm?  (In this case, it 
> seems that for any particular iteration of RFE, training data is  
> supplied (along w/ validation data), the features are selected based on 
> whatever sensitivity measure is chosen, then these features only (or 
> feature ids) are returned, and the outer split classifier trains its own 
> classifier instances and uses these for testing on the held out testing 
> data from the CVTE original NFoldSplitter...?
> ***
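For what it's worth, the bare RFE loop itself can be sketched in plain Python (the sensitivity and error functions below are made up for illustration; this is not the PyMVPA implementation):

```python
def rfe(features, sensitivity, error, fraction=0.5):
    # recursive feature elimination: each iteration keep the upper
    # `fraction` of features by sensitivity, and remember the feature
    # set with the lowest validation error (the "stopping point")
    best, best_err = list(features), error(features)
    while len(features) > 1:
        scores = sensitivity(features)
        ranked = sorted(features, key=lambda f: scores[f], reverse=True)
        features = ranked[:max(1, int(len(ranked) * fraction))]
        err = error(features)
        if err < best_err:
            best, best_err = list(features), err
    return best

# toy problem: features 1, 3, 5 are the informative ones
informative = set([1, 3, 5])

def sensitivity(feats):
    return dict((f, 1.0 if f in informative else 0.1) for f in feats)

def error(feats):
    # made-up validation error: missing an informative feature costs 1.0,
    # keeping a noise feature costs 0.1
    missing = len(informative - set(feats))
    noise = len(set(feats) - informative)
    return missing + 0.1 * noise

selected = rfe(list(range(10)), sensitivity, error)
print(sorted(selected))  # -> [0, 1, 2, 3, 5]
```

Note that `error` here is only ever used to pick the stopping point; training the classifier that is finally *used* on the selected features is a separate step, which matches the question above about where retraining happens.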
>
> Thanks for ANY and ALL help!  Hopefully we can really clear up the API
> and the documentation after this -- I'm not sure, because I might be the
> only one who thinks it's confusing, but personally it seems a bit like
> magic to know just which classifiers go where when using
> meta-classifiers, although conceptually it's quite simple and
> straightforward.  I think the API could use a lot of work on clarity,
> though, and I'm willing, once I get this, to update it :)
>
> Thanks!!!
>
> James.


-- 
Yaroslav Halchenko
Research Assistant, Psychology Department, Rutgers-Newark
Student  Ph.D. @ CS Dept. NJIT
Office: (973) 353-1412 | FWD: 82823 | Fax: (973) 353-1171
        101 Warren Str, Smith Hall, Rm 4-105, Newark NJ 07102
WWW:     http://www.linkedin.com/in/yarik        


