[pymvpa] redundant voxels

Yaroslav Halchenko debian at onerussian.com
Thu Apr 1 16:21:51 UTC 2010


> Hi all.  I have a question about methods that's not really about
> pymvpa per se -- if this is too inappropriate for this mailing list,
questions relevant to multivariate analysis of neural data are always
welcome on this list ;)

> please let me know.  I have a structural dataset with one volume per
> subject and a lot of redundant voxels (i.e., two voxels that have the
> same value for all subjects).  I have a preprocessing step outside
> python that creates a mask that effectively strips the redundancy (i.e.,
> masks out all but one of each set of identical voxels, separately within
> each ROI).
do you mean something like

In [12]: from mvpa.datasets.miscfx import *
In [13]: remove_invariant_features?
Type:		function
Base Class:	<type 'function'>
String Form:	<function remove_invariant_features at 0x3b8d848>
Namespace:	Interactive
File:		/home/yoh/proj/pymvpa/pymvpa/mvpa/datasets/miscfx.py
Definition:	remove_invariant_features(dataset)
Docstring:
    Returns a new dataset with all invariant features removed.

;-)
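e.g. a minimal sketch (reusing the toy-dataset helpers that also appear in the
example further down) -- append a constant feature to a small dataset and let
remove_invariant_features() drop it again:

from mvpa.suite import *
from mvpa.datasets.miscfx import remove_invariant_features

# toy dataset: 2 labels x 10 samples per label, 10 random features
ds = normal_feature_dataset(perlabel=10, nlabels=2, nchunks=2, nfeatures=10)
# append a copy of the first feature and make it constant across all samples
ds = hstack((ds, ds[:, [0]]))
ds.samples[:, -1] = 1.0
print "before:", ds.nfeatures                              # 11
print "after: ", remove_invariant_features(ds).nfeatures   # 10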

> Then I use shogun SVR in PyMVPA to get an error measure for
> each voxel.  My naive assumption was that removing redundant voxels
> would never result in a larger error.  But while this is true in most
> cases, a substantial minority (about a quarter) of the ROIs do benefit
> from including the redundant voxels.  The difference in error is
> relatively small (mostly in the 1-3% range), but I was surprised it
> happens so often.  So I was hoping I could get an expert opinion on
> whether or not I should be alarmed by this (i.e., it means I'm doing
> something wrong).
imho:

I would expect no difference in error iff (if and only if) you had "equivalent
redundancy" among the bogus and the relevant features -- in that case it should
indeed not alter the result of most classifiers.

A little example:

from mvpa.suite import *

# mean cross-validated transfer (prediction) error of a Shogun SVM
cvte = CrossValidatedTransferError(TransferError(sg.SVM()),
                                   NFoldSplitter(),
                                   postproc=mean_sample())
mvpa.seed(2)                            # fixed seed for reproducibility
# toy dataset: 10 features, of which only #3 and #7 carry signal
ds = normal_feature_dataset(perlabel=10, nlabels=2, nchunks=2, nfeatures=10,
                            nonbogus_features=[3,7], snr=1.5)
print "Original dataset: ", np.asscalar(cvte(ds))
# duplicating the whole dataset keeps the useful/bogus ratio intact
print "Equivalent redundancy: ", np.asscalar(cvte(hstack((ds, ds))))
# appending 5 extra copies of only the useful (or only the bogus) features
# changes that ratio, and with it the error
print "More of useful features: ", np.asscalar(cvte(hstack((ds,) + (ds[:, ds.a.nonbogus_features],)*5)))
print "More of bogus features: ", np.asscalar(cvte(hstack((ds,) + (ds[:, ds.a.bogus_features],)*5)))

results in

Original dataset:  0.25
Equivalent redundancy:  0.25
More of useful features:  0.15
More of bogus features:  0.35

The reason is that classifiers make different assumptions about different
things, such as

* distributions of the weights assigned to the features (coefficients of the
separating hyperplane) in the case of simple linear classification.  Therefore,
the relative number of useful (even if redundant) features in the pool of
features alters how much weight those useful features receive within the
derived classification function.

* distance functions -- how much each feature contributes to the distance,
 e.g. in the case of kNN (take kNN(2) in the example above; a small numeric
 sketch follows the numbers below):

Original dataset:  0.35
Equivalent redundancy:  0.35
More of useful features:  0.25
More of bogus features:  0.5
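
A minimal numeric sketch of that distance effect (plain numpy, made-up values):
replicating a feature inflates its share of the Euclidean distance, so whether
the replicated feature is informative or noise decides who ends up being the
nearest neighbour:

import numpy as np

# feature 0 is informative, feature 1 is noise
probe       = np.array([1.0, 0.2])
same_class  = np.array([0.9, 0.8])   # agrees with probe on the informative feature
other_class = np.array([0.1, 0.3])   # agrees with probe only on the noise feature

def dist(x, y, replicate=None, n=5):
    # Euclidean distance, optionally with one feature replicated n extra times
    if replicate is not None:
        x = np.r_[x, [x[replicate]] * n]
        y = np.r_[y, [y[replicate]] * n]
    return np.linalg.norm(x - y)

# with the original two features the same-class sample is the nearest neighbour
print dist(probe, same_class), dist(probe, other_class)        # ~0.61 vs ~0.91
# replicate the noise feature 5 extra times -- the neighbourhood flips
print dist(probe, same_class, 1), dist(probe, other_class, 1)  # ~1.47 vs ~0.93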

-- 
                                  .-.
=------------------------------   /v\  ----------------------------=
Keep in touch                    // \\     (yoh@|www.)onerussian.com
Yaroslav Halchenko              /(   )\               ICQ#: 60653192
                   Linux User    ^^-^^    [175555]
