[pymvpa] RFE and dataset splits

Wed Jul 6 20:32:16 UTC 2011

Hi Kimberly,

Sorry for the delay -- we are finally back from HBM and getting back
into the routine pace...  I am about to start the RFE on your dataset,
but I immediately spotted that the data wasn't normalized to be
SVM-friendly:

    In [8]: print ds.summary()
    Dataset: 48x135168 at float64, <sa: chunks,targets,time_coords,time_indices>, <fa: voxel_indices>, <a: imghdr,imgtype,mapper,voxel_dim,voxel_eldim>
    stats: mean=0.00193459 std=1.16806 var=1.36437 min=-139.459 max=116.782

So, please do the following (perhaps in your script, so you can check
yourself whether everything is ok):

1.
    zscore(ds, chunks_attr=None)

which would do standardization across all samples (it is meaningless to
do it per chunk in your case since you have only 2 samples per chunk),
which would lead you to

    stats: mean=9.39588e-16 std=0.574634 var=0.330204 min=-6.70418 max=6.77941

2.  I guess you didn't really mask the volume (i.e. exclude non-brain
voxels), and thus have lots of invariant features (e.g. having 0s across
all samples), so you could get rid of them:

    In [25]: ds = remove_invariant_features(ds)
    In [29]: print ds.summary()
    Dataset: 48x44633 at float64, <sa: chunks,targets,time_coords,time_indices>, <fa: voxel_indices>, <a: imghdr,imgtype,mapper,voxel_dim,voxel_eldim>
    stats: mean=2.84548e-15 std=1 var=1 min=-6.70418 max=6.77941

which should just help it to get through faster ;-) 
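
To make the two steps above a bit more tangible, here is a minimal
NumPy-only sketch on toy data.  It is not PyMVPA's implementation: the
helper names (zscore_all, zscore_per_chunk, drop_invariant) and the toy
dimensions are made up for illustration, and it assumes NumPy's default
population standard deviation, which may differ from what zscore() uses
internally.  It shows why per-chunk z-scoring is degenerate with 2
samples per chunk (every varying value collapses to +1 or -1), and why
the overall variance after step 1 is just the fraction of non-invariant
features (here 100/300; in your data 44633/135168, roughly 0.3302, which
agrees with the var=0.330204 above).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the real dataset: 48 samples in 24 chunks of 2 samples.
n_samples, n_features = 48, 300
chunks = np.repeat(np.arange(24), 2)
data = rng.normal(loc=5.0, scale=20.0, size=(n_samples, n_features))
data[:, 100:] = 0.0  # pretend these "voxels" lie outside the brain


def zscore_all(x):
    """Z-score each feature (column) across all given samples.

    Invariant features (std == 0) are left at 0 to avoid dividing by zero.
    """
    mean = x.mean(axis=0)
    std = x.std(axis=0)  # population std (ddof=0), an assumption here
    out = np.zeros_like(x)
    varying = std > 0
    out[:, varying] = (x[:, varying] - mean[varying]) / std[varying]
    return out


def zscore_per_chunk(x, chunks):
    """Z-score each feature separately within every chunk."""
    out = np.zeros_like(x)
    for c in np.unique(chunks):
        rows = chunks == c
        out[rows] = zscore_all(x[rows])
    return out


def drop_invariant(x):
    """Keep only the features that vary across samples."""
    return x[:, x.std(axis=0) > 0]


z = zscore_all(data)
print("after z-scoring:   shape", z.shape, "var", z.var())

z_masked = drop_invariant(z)
print("invariant removed: shape", z_masked.shape, "var", z_masked.var())

# With only 2 samples per chunk, all magnitude information is lost:
per_chunk = zscore_per_chunk(data, chunks)
print("distinct |values| per chunk:",
      np.unique(np.abs(per_chunk[:, :100]).round(6)))
```

The overall variance printed in the first line stays below 1 only because
the all-zero columns are still counted; once they are dropped it becomes
1, which mirrors the two summary() outputs above.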

Let's hope that was the problem ;)

On Thu, 23 Jun 2011, Yaroslav Halchenko wrote:

> On Thu, 23 Jun 2011, Kimberly Zhou wrote:
> >    cv_results=cvte(avgds)
> >    gives a 64.6% accuracy, shouldn't RFE start at something like 36% error
> >    and improve from there? (SMLR, slightly lower, 58.3% acc, but still
> >    better than chance?)
> >    >...<
> >    If the cross-validation performance is better than chance, I am
> >    guessing it must be RFE that is not working, and not the dataset. Would
> >    that be a correct assumption?

> I would come to the same conclusion ;)

> Is there a chance you could share that dataset?  would make it easier to
> figure out WTF

> just do

> h5save('/tmp/dataset.hdf5', avgds)

> and make that file available online or in dropbox or by whatever other
> means... not sure if email would tolerate the size though, but you could
> try emailing it to debian at onerussian.com
-- 
=------------------------------------------------------------------=
Keep in touch                                     www.onerussian.com
Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic