[pymvpa] FeatureSelectionClassifier (in RFE) occasionally returns full features set
Vadim Axel
axel.vadim at gmail.com
Sat Apr 25 08:10:42 UTC 2009
Hi,
I implemented some simple RFE logic, similar to what was described here:
http://www.pymvpa.org/featsel.html
At the end of the classification procedure, I verify which features were
selected, based on what is described here:
http://www.pymvpa.org/faq.html#how-do-i-know-which-features-were-finally-selected-by-a-classifier-doing-feature-selection
Now the problem: sometimes the resulting number of selected features is
exactly the number requested (I use FixedNElementTailSelector), whereas in
other cases, for a completely unknown reason, I get the full set of
features. The issue is really weird: for two sessions of a subject I get a
selected feature set, but for two other sessions of the same subject I get
the full feature set. I suspect the problem lies in updating the
feature_ids variable rather than in the classification itself, because the
classification error rate was pretty low.
My code is attached. Is there any problem with it?
I can also upload my dataset (~50 MB zip); I did not manage to reproduce
the issue with a smaller amount of data.
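For clarity, this is the behaviour I expect from the selector, sketched as my own minimal numpy stand-in (not the PyMVPA implementation itself):

```python
import numpy as N

def fixed_n_upper_tail(sensitivities, n):
    # Stand-in for FixedNElementTailSelector(n, tail='upper', mode='select'):
    # return the ids of the n elements with the largest sensitivity values.
    order = N.argsort(sensitivities)  # ids, ascending by sensitivity
    return N.sort(order[-n:])         # ids of the n largest, in id order

sens = N.array([0.1, 0.9, 0.3, 0.7, 0.5])
selected = fixed_n_upper_tail(sens, 3)  # -> ids 1, 3, 4
```

So with N_FEATURES = 30 I would expect feature_ids.size == 30 after training, never the full set.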
Thanks for your help,
Vadim
-------------- next part --------------
from mvpa.suite import *
import os
# data dir
data_dir = "Absolute path to where all the data (volumes, mask and design) is stored"
ROI_mask = "F_31_-70_1"
data_file_feature_select = "do_ns_07_09"
attr_file_feature_select = 'fosdesign_3_4_split_2.txt'
# load the dataset and its sample attributes
attr_feature_select = SampleAttributes(os.path.join(data_dir, attr_file_feature_select))
dataset_feature_select = NiftiDataset(
    samples=os.path.join(data_dir, data_file_feature_select),
    labels=attr_feature_select.labels,
    chunks=attr_feature_select.chunks,
    mask=os.path.join(data_dir, ROI_mask))
#dataset_feature_select = MaskedDataset(samples=N.random.normal(size=(480,5,5,5)),
#                                       labels=attr_feature_select.labels)
# do chunkswise linear detrending on dataset
detrend(dataset_feature_select, perchunk=True, model='linear')
# zscore dataset relative to baseline ('rest') mean
zscore(dataset_feature_select, perchunk=True, baselinelabels=[0],
       targetdtype='float32')
# select class 1 and 2 for this demo analysis
# would work with full datasets (just a little slower)
dataset_feature_select = dataset_feature_select.selectSamples(
    N.array([l in [1, 2] for l in dataset_feature_select.labels],
            dtype='bool'))
N_FEATURES = 30
rfesvm_split = SplitClassifier(LinearCSVMC())
clf = FeatureSelectionClassifier(
    clf=LinearCSVMC(),
    # on features selected via RFE
    feature_selection=RFE(
        # based on sensitivity of a classifier which does splitting
        # internally, and whose internal error we use
        sensitivity_analyzer=rfesvm_split.getSensitivityAnalyzer(),
        transfer_error=ConfusionBasedError(
            rfesvm_split,
            confusion_state="confusion"),
        # keep the N_FEATURES features with the highest sensitivity
        feature_selector=FixedNElementTailSelector(
            N_FEATURES, tail='upper', mode='select'),
        # alternatively, discard 20% of features at each step:
        #feature_selector=FractionTailSelector(
        #    0.2, mode='discard', tail='lower'),
        # update sensitivity at each step
        update_sensitivity=True),
    descr='LinSVM+RFE(splits_avg)',
    enable_states=['feature_ids'])
clf.train(dataset_feature_select)
print clf.feature_ids.size
#cv = CrossValidatedTransferError(
#    TransferError(clf),
#    OddEvenSplitter(),
#    enable_states=['confusion'],
#    harvest_attribs=['transerror.clf.getSensitivityAnalyzer(force_training=False)()'])
## and run it
#error = cv(dataset_feature_select)
#print clf.feature_ids.size