[pymvpa] Dataset with multidimensional feature vector per voxel

Yaroslav Halchenko debian at onerussian.com
Mon Nov 9 21:26:45 UTC 2015


A few fixups listed below, inline in the code.

On Mon, 09 Nov 2015, Ulrike Kuhl wrote:

> Hello again!

> I'm still stuck at my analysis using multiple diffusion-derived-parameter-values for the same voxel.
> Thanks to your help, I think I've got the dataset together, but I run into an error when attempting the final searchlight approach.


> This is what I've got by now for setting up the dataset (defined as a function):


> def ds_setup ( sub_list, param_list, data_path ):
>     "Setting up the dataset, loading data etc."

>     # make space:
>     dss = []
>     modality_all = []
>     modality_idx_all = []
>     voxel_idx_all = []
>     subject_all = []
>     targets_all = []
>     chunks_all = []

>     for sub_index, sub in enumerate(sub_list):
>         if sub.startswith('sub1'):
>             learn = 1
>         elif sub.startswith('sub0'):
>             learn = 0
>         else:
>             raise ValueError("Do not know what to do with %s" % sub)

>         # dss_subj will contain datasets for different features for the same subject
>         ds_param = []
>         # idea: collect the feature attributes and glue it all together in the end
>         modality_param = []
>         modality_idx_param = []
>         subject_param = []
>         targets_param = []
>         chunks_param = []
>         voxel_idx_param = []

>         for suf_index, suf in enumerate(param_list):
>             ds = fmri_dataset(data_path + '/%s/%s_%s.nii.gz' % (sub,sub,suf))

wasn't there some kind of a mask?
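
e.g. something along these lines (the mask filename here is just a
placeholder, of course) -- restricting the features to in-brain voxels
keeps the dataset, and the searchlight later on, considerably smaller:

ds = fmri_dataset(data_path + '/%s/%s_%s.nii.gz' % (sub, sub, suf),
                  mask=data_path + '/%s/brain_mask.nii.gz' % sub)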

>             ds.fa['modality'] = [suf]             # for each feature, set the modality attribute
>             ds.fa['modality_index'] = [suf_index] # this as numeric might come handy for searchlights later
>             ds.fa['targets'] = [learn]
>             ds.fa['chunks'] = [sub_index]

those targets and chunks should go to '.sa' (sample attributes), not
'.fa' (feature attributes)...  you also don't need all those manually
constructed _param lists -- everything should simply be carried inside
the dataset(s) themselves.  That is somewhat important, since the
necessary checks are then done while hstack/vstacking those datasets,
and if some fa or sa values don't match (i.e. your features or samples
don't align exactly) those fa/sa get dropped... is that perhaps what
you have observed?
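
To avoid all that bookkeeping, roughly something like this inside your
inner loop should do (just a sketch, reusing the names from your
function):

ds = fmri_dataset(data_path + '/%s/%s_%s.nii.gz' % (sub, sub, suf))
ds.sa['targets'] = [learn]                    # sample attributes -> .sa (one sample per volume)
ds.sa['chunks'] = [sub_index]
ds.sa['subject'] = [sub]
ds.fa['modality'] = [suf] * ds.nfeatures      # feature attributes stay in .fa
ds.fa['modality_index'] = [suf_index] * ds.nfeatures
ds_param.append(ds)

and then ds_subj = hstack(ds_param) per subject and DS = vstack(dss)
across subjects is all you need -- the attributes get carried along
(and checked) for you, without any *_all lists, and a final
Dataset(...) re-wrapping should not be needed either.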


>             modality_param.extend(ds.fa['modality'].value)
>             modality_idx_param.extend(ds.fa['modality_index'].value)
>             subject_param.extend(ds.fa['subject'].value)
>             targets_param.extend(ds.fa['targets'].value)
>             chunks_param.extend(ds.fa['chunks'].value)
>             voxel_idx_param.extend(ds.fa['voxel_indices'].value)
>             ds_param.append(ds)

>         ds_subj = hstack(ds_param)
>         # collect future feature attributes:
>         modality_all.append(modality_param)
>         modality_idx_all.append(modality_idx_param)
>         voxel_idx_all.append(voxel_idx_param)

>         # collect future sample attributes
>         targets_all.append(learn)
>         chunks_all.append(sub_index)
>         subject_all.append(sub)

>         dss.append(ds_subj)

>     dsall = vstack(dss)

>     # create the actual dataset
>     DS = Dataset(dsall)
>     DS.fa['modality'] = modality_all[0]
>     DS.fa['modality_idx'] = modality_idx_all[0]
>     DS.fa['voxel_indices'] = voxel_idx_all[0]
>     DS.sa['targets'] = targets_all
>     DS.sa['chunks'] = chunks_all
>     DS.sa['subject'] = subject_all

>     return DS;



> This gives me the following: DS is a dataset with dimensions 'no_subjects' * ('no_voxels' * 'no_parameters'). Importantly, each feature has the corresponding voxel indices associated with it - since the different parameters are just concatenated, the sequence of these index arrays repeats for each parameter.

> I've simulated a little set of perfect two group toy data (= 100% SNR) and fed the corresponding dataset into an SVM leave-one-out-cross-validation like this:

> clf = LinearCSVMC()
> cvte = CrossValidation(clf, NFoldPartitioner(),

you could say explicitly here what you would like to cross-validate
over -- makes code easier to read (and then you don't need to define
overgeneric 'chunks'):

NFoldPartitioner(attr='subject')
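
e.g., putting it together (just a sketch, with the same clf and errorfx
you already use):

clf = LinearCSVMC()
cvte = CrossValidation(clf, NFoldPartitioner(attr='subject'),
                       errorfx=lambda p, t: np.mean(p == t),
                       enable_ca=['stats'])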

> ...                        errorfx=lambda p, t: np.mean(p == t),
> ...                        enable_ca=['stats'])
> cv_results = cvte(DS)

> The result is indeed perfect - classification performance is 100%. :-) So far, so good. 



> However, when I want to use this dataset for a searchlight approach, I run into an error. Here's the code:

> sl = sphere_searchlight(cvte, radius=3, postproc=mean_sample())
> res = sl(DS)
> print res_clean

> ValueError: IndexQueryEngine has insufficient information about the dataset spaces. It is required to specify an ROI generator for each feature space in the dataset (got: ['voxel_indices'], #describable: 125000, #actual features: 375000).

> So the code picks up on the fact that there are multiple instances of the same voxel-index-array within one sample (three parameters = three times the same index-array). I wrongly expected that it would take all values with the corresponding indices into account for MVPA. 

> How do I "glue" these multiple values per sample for the same voxel index-array together such that the searchlight algorithm works? Am I on the right track at all or did I get carried away?

that is why I defined that 'modality_index' ... Let me briefly explain
what is happening, in the hope that it makes sense ;-) -- the "search
for the neighborhood" by default relies only on .fa.voxel_indices and
assumes that those are unique.  Now that you have 3 different values
per physical voxel, it gets confused.  So we just need to tell it more
about what the neighborhood really is, and not rely on the
sphere_searchlight helper:

sl = Searchlight(
    cvte,
    IndexQueryEngine(
        voxel_indices=Sphere(3),    # that is pretty much what sphere_searchlight(radius=3) does
        modality_index=Sphere(10)), # 10 is just big enough to cover all 3 values you have
    postproc=mean_sample())

if you run this sl(DS) it will work, BUT it would pretty much do the
same analysis 3 times (once centered on each modality's copy of every
voxel).  So we can restrict the searchlight centers to the features
with modality_index == 0, while still pulling in all three modalities
as neighbors:

sl = Searchlight(
    cvte,
    IndexQueryEngine(   # look for neighbors both in space and across modality_index
        voxel_indices=Sphere(3),    # that is pretty much what sphere_searchlight(radius=3) does
        modality_index=Sphere(10)), # 10 is just big enough to cover all 3 values you have
    postproc=mean_sample(),
    roi_ids=np.where(DS.fa.modality_index == 0)[0])  # center only on one modality's features

This should produce results of the necessary "1 brain" dimensionality ;)
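
A quick sanity check (assuming your toy dataset above -- the numbers
come from your error message): the result should now have one value per
physical voxel rather than per feature:

res = sl(DS)
print res.shape    # e.g. (1, 125000): one mean sample x number of voxels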

I know -- this is all a bit unnatural.  Such functionality was
primarily created to carry out "hybrid" spatio-temporal searchlighting,
where you could explore both space and time.  But it seems that this
use-case is becoming popular enough that we had better implement it
without requiring so much code/magic ;)

Let me know how it goes

-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834                       Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik        


