[pymvpa] Dataset with multidimensional feature vector per voxel

Ulrike Kuhl kuhl at cbs.mpg.de
Tue Nov 17 09:23:32 UTC 2015


Dear Yaroslav, dear all,

thanks for your explanations - generating the dataset is working out. :-) 

I have been playing around with classification for a while now. Unfortunately, there is one really basic thing I don't get when I look at the result: the 'accuracy' is perfect on perfect data (data that does not contain any noise). That is, the output accuracy map has a 1 at those voxels within the searchlight space that contain the signal difference between the two groups - nice!
All other voxels in the original data are identical for both groups. Classification accuracy at these voxels is 0 - but shouldn't accuracy be at the 0.5 level there, reflecting 'random' classification performance?

If I use another error function within the cross-validation, such as 'mean_mismatch_error', the situation is completely reversed: the error at the signal-encoding voxels is 0, and 1 everywhere else. Again, I would have expected 0.5, representing random classification error.

Am I just confused about how accuracy and error are computed, or is there still something wrong with my code? Likewise, if I add random noise to the toy data, classification fails completely (all voxels at 0 for accuracy, or at 1 for error).
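
Just to spell out how I understand the two error functions to relate - a minimal sketch in plain numpy (the predictions and targets here are made up): for discrete targets, accuracy and mismatch error are exact complements, so with two balanced classes both should sit around 0.5 at chance.

import numpy as np
p = np.array([1, 0, 1, 1])      # hypothetical predictions
t = np.array([1, 0, 0, 1])      # hypothetical targets
accuracy = np.mean(p == t)      # what the errorfx lambda below computes
error = np.mean(p != t)         # what mean_mismatch_error computes
assert accuracy + error == 1.0  # so accuracy 0 <=> error 1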

##############################
# Here's the code for classification, using the 'setup_ds' function we talked about so far
# (assuming the usual "from mvpa2.suite import *" and "import numpy as np" above):

[...]

DS_clean = setup_ds(sub_list, param_list, data_path)

clf_clean = LinearCSVMC()
cvte_clean = CrossValidation(clf_clean,
                             NFoldPartitioner(attr='subject'),
                             errorfx=lambda p, t: np.mean(p == t),  # accuracy, not error
                             enable_ca=['stats'],
                             postproc=mean_sample())
# alternatively, for computing an error instead of an accuracy:
# cvte_clean = CrossValidation(clf_clean,
#                              NFoldPartitioner(attr='subject'),
#                              errorfx=mean_mismatch_error,
#                              enable_ca=['stats'],
#                              postproc=mean_sample())

sl = Searchlight(
    cvte_clean,
    IndexQueryEngine(                                # look for neighbors both in space and across modality_index
        voxel_indices=Sphere(3),                     # pretty much what sphere_searchlight(radius=3) does
        modality_index=Sphere(len(param_list))),     # radius big enough to cover all parameter values we have
    postproc=mean_sample(),
    enable_ca=['stats'],
    roi_ids=np.where(DS_clean.fa.modality_index == 0)[0]  # compute only at modality_index==0 centers to avoid repeating the analysis per modality
)

res_clean = sl(DS_clean)

# get accuracy values per voxel and write them out as a NIfTI image
sphere_accuracy = res_clean.samples
out_fname = '/path/to/accuracy_map.nii.gz'  # placeholder output filename
map2nifti(DS_clean, sphere_accuracy).to_filename(out_fname)

##############################
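
For debugging, something like the following should show the confusion matrix directly, before any searchlighting - a sketch using the objects defined above (enable_ca=['stats'] makes .ca.stats available after the call):

ds0 = DS_clean[:, DS_clean.fa.modality_index == 0]      # one modality only
cv_results = cvte_clean(ds0)
print(cvte_clean.ca.stats.as_string(description=True))  # per-class hits/misses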

Thanks again for your help!
Ulrike



----- Original Message -----
From: "Yaroslav Halchenko" <debian at onerussian.com>
To: "pkg-exppsy-pymvpa" <pkg-exppsy-pymvpa at lists.alioth.debian.org>
Sent: Monday, 9 November, 2015 22:26:45
Subject: Re: [pymvpa] Dataset with multidimensional feature vector per voxel

A few fixups listed below, inline in the code

On Mon, 09 Nov 2015, Ulrike Kuhl wrote:

> Hello again!

> I'm still stuck at my analysis using multiple diffusion-derived-parameter-values for the same voxel.
> Thanks to your help, I think I've got the dataset together, but I run into an error when attempting the final searchlight approach.


> This is what I've got by now for setting up the dataset (defined as a function):


> def ds_setup ( sub_list, param_list, data_path ):
>     "Setting up the dataset, loading data etc."

>     # make space:
>     dss = []
>     modality_all = []
>     modality_idx_all = []
>     voxel_idx_all = []
>     subject_all = []
>     targets_all = []
>     chunks_all = []

>     for sub_index, sub in enumerate(sub_list):
>         if sub.startswith('sub1'):
>             learn = 1
>         elif sub.startswith('sub0'):
>             learn = 0
>         else:
>             raise ValueError("Do not know what to do with %s" % sub)

>         # dss_subj will contain datasets for different features for the same subject
>         ds_param = []
>         # idea: collect the feature attributes and glue it all together in the end
>         modality_param = []
>         modality_idx_param = []
>         subject_param = []
>         targets_param = []
>         chunks_param = []
>         voxel_idx_param = []

>         for suf_index, suf in enumerate(param_list):
>             ds = fmri_dataset(data_path + '/%s/%s_%s.nii.gz' % (sub,sub,suf))

wasn't there some kind of a mask?
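
fmri_dataset accepts one via its mask argument, which restricts the features to in-mask voxels - a sketch, with a hypothetical mask filename:

ds = fmri_dataset(data_path + '/%s/%s_%s.nii.gz' % (sub, sub, suf),
                  mask=data_path + '/brain_mask.nii.gz')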

>             ds.fa['modality'] = [suf]             # for each feature, set the modality attribute
>             ds.fa['modality_index'] = [suf_index] # this as numeric might come handy for searchlights later
>             ds.fa['targets'] = [learn]
>             ds.fa['chunks'] = [sub_index]

those targets and chunks should go into '.sa' (sample attributes),
not '.fa' (feature attributes)...   you don't need all those manually
constructed _param lists -- everything should simply be contained in the
dataset(s).  That is somewhat important, since then the necessary checks
would be done during hstack/vstacking those datasets, and if some fa or sa
values don't match (i.e. you don't have exactly aligning features or
samples) those fa/sa would be dropped... might that be what you
observed?
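
A minimal sketch of that fix, with the same names as in your code (each ds here comes from a single 3D volume, i.e. one sample):

ds.sa['targets'] = [learn]      # sample attributes: one value per sample
ds.sa['chunks'] = [sub_index]
ds.sa['subject'] = [sub]

hstack()/vstack() will then carry these along and verify that they agree across the stacked datasets, instead of you re-collecting them by hand.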


>             modality_param.extend(ds.fa['modality'].value)
>             modality_idx_param.extend(ds.fa['modality_index'].value)
>             subject_param.extend(ds.fa['subject'].value)
>             targets_param.extend(ds.fa['targets'].value)
>             chunks_param.extend(ds.fa['chunks'].value)
>             voxel_idx_param.extend(ds.fa['voxel_indices'].value)
>             ds_param.append(ds)

>         ds_subj = hstack(ds_param)
>         # collect future feature attributes:
>         modality_all.append(modality_param)
>         modality_idx_all.append(modality_idx_param)
>         voxel_idx_all.append(voxel_idx_param)

>         # collect future sample attributes
>         targets_all.append(learn)
>         chunks_all.append(sub_index)
>         subject_all.append(sub)

>         dss.append(ds_subj)

>     dsall = vstack(dss)

>     # create the actual dataset
>     DS = Dataset(dsall)
>     DS.fa['modality'] = modality_all[0]
>     DS.fa['modality_idx'] = modality_idx_all[0]
>     DS.fa['voxel_indices'] = voxel_idx_all[0]
>     DS.sa['targets'] = targets_all
>     DS.sa['chunks'] = chunks_all
>     DS.sa['subject'] = subject_all

>     return DS



> This gives me the following: DS is a dataset with dimensions 'no_subjects' * ('no_voxels' * 'no_parameters'). Importantly, each feature has its corresponding voxel indices associated with it - since the different parameters are just concatenated, the sequence of these index arrays repeats for each parameter. 
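
you can quickly verify that layout, e.g. (a sketch, assuming numpy as np):

print(DS.shape)                       # (no_subjects, no_voxels * no_parameters)
print(len(DS.fa.voxel_indices))       # no_voxels * no_parameters
print(np.unique(DS.fa.modality_idx))  # one entry per parameter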

> I've simulated a small set of perfect two-group toy data (= 100% SNR) and fed the corresponding dataset into an SVM leave-one-out cross-validation like this:

> clf = LinearCSVMC()
> cvte = CrossValidation(clf, NFoldPartitioner(),

you could say explicitly here what you would like to cross-validate
over -- it makes the code easier to read (and then you don't need to
define the overly generic 'chunks'):

NFoldPartitioner(attr='subject')

> ...                        errorfx=lambda p, t: np.mean(p == t),
> ...                        enable_ca=['stats'])
> cv_results = cvte(DS)

> The result is indeed perfect - classification performance is 100%. :-) So far, so good. 



> However, when I want to use this dataset for a searchlight approach, I run into an error. Here's the code:

> sl = sphere_searchlight(cvte, radius=3, postproc=mean_sample())
> res = sl(DS)
> print res

> ValueError: IndexQueryEngine has insufficient information about the dataset spaces. It is required to specify an ROI generator for each feature space in the dataset (got: ['voxel_indices'], #describable: 125000, #actual features: 375000).

> So the code picks up on the fact that there are multiple instances of the same voxel-index array within one sample (three parameters = three times the same index array). I wrongly expected that it would take all values with the corresponding indices into account for MVPA. 

> How do I "glue" these multiple values for the same voxel-index array together within one sample, such that the searchlight algorithm works? Am I on the right track at all, or did I get carried away?

that is why I defined that 'modality_index' ... Let me briefly explain
what is happening, in the hope that it makes sense ;-) -- our "search
for the neighborhood" by default relies only on .fa.voxel_indices and
assumes that those are reasonably unique.  Now that you have 3
different values for each physical voxel, it gets confused.  So we just
need to tell it more about what the neighborhood really is, and not
rely on the sphere_searchlight helper:

sl  = Searchlight(
   cvte, 
   IndexQueryEngine(
       voxel_indices=Sphere(3),    # that is pretty much what sphere_searchlight(radius=3) does
       modality_index=Sphere(10)), # 10 is just big enough to cover all 3 values you have
   postproc=mean_sample())

if you run this sl(DS) it will work, BUT it would pretty much do the
same analysis 3 times.  So we can restrict it to consider
"modality_index" neighbors only when we are looking at
modality_index==0:

sl  = Searchlight(
   cvte, 
   IndexQueryEngine(   # so we look for neighbors in both space and across modality_index
       voxel_indices=Sphere(3),    # that is pretty much what sphere_searchlight(radius=3) does
       modality_index=Sphere(10)), # 10 is just big enough to cover all 3 values you have
   postproc=mean_sample(),
   roi_ids=np.where(DS.fa.modality_index == 0)[0]
)

This should produce results of the necessary "1 brain" dimensionality ;)

I know -- this is all a bit unnatural.  Such functionality was primarily
created to carry out "hybrid" spatio-temporal searchlighting, where you
could explore both space and time.  But it seems that this use-case is
becoming popular enough that we had better implement it without
requiring so much code/magic ;)

Let me know how it goes

-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834                       Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik        

-- 
Max Planck Institute for Human Cognitive and Brain Sciences 
Department of Neuropsychology (A219) 
Stephanstraße 1a 
04103 Leipzig 

Phone: +49 (0) 341 9940 2625 
Mail: kuhl at cbs.mpg.de 
Internet: http://www.cbs.mpg.de/staff/kuhl-12160


