[pymvpa] FeatureSelectionClassifier (in RFE) occasionally returns full features set

Yaroslav Halchenko debian at onerussian.com
Sun Apr 26 01:04:05 UTC 2009


actually I should have discovered the problem before asking you to
upload the data...

in your code you use
N_FEATURES = 30
...
      feature_selector=FixedNElementTailSelector(N_FEATURES,
	  tail='upper', mode='select'),   


so you aren't doing RFE per se ;) with a fixed-N selector you just select
30 features right on the first step of RFE.... then, those 30 features lead
to a higher generalization error than if you took all of them, therefore the
initial dataset with all features is taken as the result -- RFE keeps the
best-generalizing set it has seen.
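to make that fallback explicit -- here is a toy sketch in plain python of the
bookkeeping RFE does (just an illustration, not the actual pymvpa internals):

```python
def rfe_best_set(steps):
    """steps: list of (feature_ids, transfer_error) pairs, one per RFE step,
    with step 0 being the full feature set.  Returns the feature set with
    the lowest generalization error seen across all steps."""
    best_ids, best_error = steps[0]
    for ids, error in steps[1:]:
        if error < best_error:          # strictly better generalization
            best_ids, best_error = ids, error
    return best_ids

# mimicking the debug output below: the full set (3022 features) has
# error 0.2125, the 30-feature subset has error 0.2500, so the full
# set "wins" and is returned as the result
full = list(range(3022))
subset = list(range(30))
assert rfe_best_set([(full, 0.2125), (subset, 0.2500)]) == full
```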

to see that, you just had to enable the RFE debug target (or all RFE ones)
with

debug.active += ['RFE.*']

to see what is happening:

In [12]:## working on region in file /tmp/python-8102meB.py...
[RFEC ] DBG:           Step 0: nfeatures=3022
[RFEC ] DBG:           Step 0: nfeatures=3022 error=0.2125 best/stop=1/0 
[RFEC_] DBG:           Sensitivity: [-0.00507313  0.00025722  0.00159871 ..., -0.00212875  0.00078268
 -0.00027174], nfeatures_selected=30, selected_ids: [ 120  338  341  356  462  472  483  501  517  571  573  574  594  612  619
  634  635  636  659  676  677  760  778  779  796  872 1109 1338 1545 1677]
[RFEC ] DBG:           Step 1: nfeatures=30
[RFEC ] DBG:           Step 1: nfeatures=30 error=0.2500 best/stop=0/0 
[RFEC_] DBG:           Sensitivity: [ 0.09779742  0.16359045  0.02775154  0.09486282 -0.0804099  -0.04392221
 -0.06721182  0.09752928  0.03872871  0.08811431  0.14541801  0.13167303
  0.13925132  0.03046704  0.04748648  0.09525846 -0.04226041  0.06917038
  0.03207438  0.06333298  0.01423283  0.02703152  0.16574083  0.05634531
  0.11383484  0.03402658  0.07105218 -0.02116503  0.24369252  0.20591227], nfeatures_selected=30, selected_ids: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29]
[RFEC ] DBG:           Step 2: nfeatures=30
[RFEC ] DBG:           Step 2: nfeatures=30 error=0.2500 best/stop=0/0 
[RFEC_] DBG:           Sensitivity: [ 0.09779742  0.16359045  0.02775154  0.09486282 -0.0804099  -0.04392221
 -0.06721182  0.09752928  0.03872871  0.08811431  0.14541801  0.13167303
  0.13925132  0.03046704  0.04748648  0.09525846 -0.04226041  0.06917038
  0.03207438  0.06333298  0.01423283  0.02703152  0.16574083  0.05634531
  0.11383484  0.03402658  0.07105218 -0.02116503  0.24369252  0.20591227], nfeatures_selected=30, selected_ids: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29]
[RFEC ] DBG:           Step 3: nfeatures=30
[RFEC ] DBG:           Step 3: nfeatures=30 error=0.2500 best/stop=0/0 
[RFEC_] DBG:           Sensitivity: [ 0.09779742  0.16359045  0.02775154  0.09486282 -0.0804099  -0.04392221
 -0.06721182  0.09752928  0.03872871  0.08811431  0.14541801  0.13167303
  0.13925132  0.03046704  0.04748648  0.09525846 -0.04226041  0.06917038
  0.03207438  0.06333298  0.01423283  0.02703152  0.16574083  0.05634531
  0.11383484  0.03402658  0.07105218 -0.02116503  0.24369252  0.20591227], nfeatures_selected=30, selected_ids: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29]

....

see the original RFE definition for how to actually do RFE ;) or just try
SMLR, which might be more efficient, who knows ;)
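the difference is easy to see from the elimination schedules alone -- here is
a hypothetical illustration (function names are made up, not the pymvpa API)
of why a fixed-N selector collapses RFE into a single step, while a
fraction-based selector gives the gradual elimination RFE is meant to do:

```python
def fixed_n_schedule(nfeatures, n=30):
    """Selecting a fixed N features: jumps straight to n on the first step."""
    sizes = [nfeatures]
    while sizes[-1] > n:
        sizes.append(n)
    return sizes

def fraction_schedule(nfeatures, keep=0.5, stop=30):
    """Keeping a fraction of features each step: the gradual RFE schedule."""
    sizes = [nfeatures]
    while sizes[-1] > stop:
        sizes.append(max(stop, int(sizes[-1] * keep)))
    return sizes

print(fixed_n_schedule(3022))   # [3022, 30] -- one giant step, no recursion
print(fraction_schedule(3022))  # [3022, 1511, 755, 377, 188, 94, 47, 30]
```

with the fraction-based schedule the sensitivities get re-estimated at each
intermediate size, which is the whole point of *recursive* feature
elimination.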


On Sat, 25 Apr 2009, Yaroslav Halchenko wrote:

> at first I thought that I knew what the reason was, but then I realized
> that it shouldn't be the case... I didn't test it though. to expedite
> things, would you mind uploading your data + code to the address I will
> provide in a followup email? ;)

> On Sat, 25 Apr 2009, Vadim Axel wrote:

> >    Hi,
> >    I implemented some simple RFE logic, similar to what was described
> >    here: [1]http://www.pymvpa.org/featsel.html
> >    At the end of the classification procedure, I verify the features
> >    that were selected based on what was described here:
> >    [2]http://www.pymvpa.org/faq.html#how-do-i-know-which-features-were-fin
> >    ally-selected-by-a-classifier-doing-feature-selection
> >    Now the problem: sometimes the resulting number of selected features
> >    is exactly the number that is required (I use
> >    FixedNElementTailSelector), whereas in some other cases, for a
> >    completely unknown reason, I get the full set of features. The issue
> >    is really weird, since for two sessions of a subject I get the
> >    selected feature set, but for two other sessions of the same subject
> >    I get the full feature set. I suspect that the problem might be in
> >    updating the feature_ids variable and not with the classification,
> >    because the classification error rate was pretty low.
> >    Attached is my code. Is there any problem with it?
> >    I can also upload my dataset (~50 Mb zip). I didn't succeed in
> >    reproducing it with a smaller amount of data.
> >    Thanks for your help,
> >    Vadim

> > Links

> >    1. http://www.pymvpa.org/featsel.html
> >    2. http://www.pymvpa.org/faq.html#how-do-i-know-which-features-were-finally-selected-by-a-classifier-doing-feature-selection
-- 
Yaroslav Halchenko
Research Assistant, Psychology Department, Rutgers-Newark
Student  Ph.D. @ CS Dept. NJIT
Office: (973) 353-1412 | FWD: 82823 | Fax: (973) 353-1171
        101 Warren Str, Smith Hall, Rm 4-105, Newark NJ 07102
WWW:     http://www.linkedin.com/in/yarik        



More information about the Pkg-ExpPsy-PyMVPA mailing list