[pymvpa] advice for constructing a dataset for use in pyMVPA

Yaroslav Halchenko debian at onerussian.com
Thu Feb 12 14:58:31 UTC 2009


Thanks Scott,

let me comment a bit

> are you doing cross-fold validation?  if so, using subjects as chunks 
> would mean you train the classifiers on one set of subjects and use it 
> to classify the one(s) left out.  typically you don't want to do this, 
> so i would say keep each subject as a completely separate data set and 
> use runs as chunks, then decide how you want to compare the output over 
> all subjects (ie, t-test, whathaveyou...)

how to set up the analysis depends on what you actually believe the
effect to be. If it is, as in GLM, uniform activation of the areas,
then processing the full dataset (all subjects) and holding one subject
out for cross-validation might be quite reasonable. If the effects are
local distributed codes, then imho I would start by processing each
subject separately, holding out a single run for cross-validation. At
the end it is possible to just add up all the confusion matrices to get
a 'summary' confusion matrix across all subjects.
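
To illustrate the "add up all confusion matrices" step, here is a minimal
sketch in plain Python rather than PyMVPA calls. The helper names
(sum_confusions, accuracy) and the counts are hypothetical, made up only
for this example; rows are assumed to be true labels and columns
predicted labels.

```python
def sum_confusions(matrices):
    """Element-wise sum of equally sized square count matrices."""
    n = len(matrices[0])
    total = [[0] * n for _ in range(n)]
    for m in matrices:
        for i in range(n):
            for j in range(n):
                total[i][j] += m[i][j]
    return total


def accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    n_all = sum(sum(row) for row in matrix)
    return correct / n_all


# Hypothetical per-subject results (rows: true label, columns: predicted).
subj1 = [[8, 2], [3, 7]]
subj2 = [[6, 4], [1, 9]]

summary = sum_confusions([subj1, subj2])
print(summary, accuracy(summary))
```

Summing counts this way weights every test sample equally, so a subject
with more samples contributes more to the summary.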

> Jo Etzel wrote:
> > "conf", etc.). I never need to classify on more than one text label at a 
> > time (i.e. just "color", not "color" and "conf"), though I do need to 
so do you do 1-class classification? libsvm has that facility but I've
never used it yet within PyMVPA, so not sure on side-effects ;-)
or did you mean "color" - vs - all_other_labels?

> > subset the data based on these labels prior to classification (i.e. 
> > classify "conf" for certain "colors" only).

you can select samples by using .selectSamples(), or something like
	dataset_colors_only = dataset['labels', ['colors']]
should work too
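
The idea of subsetting on one labeling and then classifying on another
can be sketched in plain Python. This is not the PyMVPA API; it only
shows the bookkeeping, under the assumption that the two labelings are
kept as lists parallel to the samples. The function name and the values
are hypothetical.

```python
def select_by_label(samples, select_on, wanted, classify_on):
    """Keep samples whose `select_on` label is in `wanted`.

    Returns the kept samples together with their `classify_on` labels,
    so the second labeling can be used as the classification target.
    """
    keep = [i for i, lab in enumerate(select_on) if lab in wanted]
    return [samples[i] for i in keep], [classify_on[i] for i in keep]


# Hypothetical data: four samples with two parallel labelings.
samples = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
color = ["red", "blue", "red", "green"]
conf = ["hi", "lo", "lo", "hi"]

sub_samples, sub_conf = select_by_label(samples, color, {"red"}, conf)
print(sub_samples, sub_conf)
```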


> > In this case, I think that the "samples" are my blocks
do you mean that you want to look at average activation within a block
as a single sample?

> , and I have two 
> > levels of "chunks" - runs and subjects. My "labels" are my block types 
> > ("color", etc).

correct. Note that if you provide labels as literals (ie "color"
instead of a numeric code such as 1), you need to say so in the
constructor of *Dataset by passing labels_map=True, or even provide
your own explicit map, eg
labels_map={"color":1, "blah":2, ...}
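
What such a map does can be sketched without PyMVPA: each literal label
is translated to its numeric code, and a literal missing from the map is
reported. The helper name apply_labels_map and the sample labels are
hypothetical; how *Dataset handles this internally is not shown here.

```python
def apply_labels_map(literal_labels, labels_map):
    """Translate literal labels to numeric codes via an explicit map."""
    unknown = sorted(set(literal_labels) - set(labels_map))
    if unknown:
        raise ValueError("labels missing from the map: %s" % unknown)
    return [labels_map[lab] for lab in literal_labels]


# Explicit map in the style of the example above.
labels_map = {"color": 1, "blah": 2}
literals = ["color", "blah", "blah", "color"]

numeric = apply_labels_map(literals, labels_map)
print(numeric)
```

An explicit map keeps the numeric codes stable across datasets, which
matters if datasets are built separately and combined later.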

> > Do I want all of my data in *one* NiftiDataset object or separate ones 
> > for each subject?
once again -- depends on how you want to set things up and what effects
you are looking for

> > 1 - use fslmerge to convert my (one-for-each-block) analyze files into 
> > one large 4D nifti.gz file, containing all the files for all subjects.
as Scott mentioned, it is not strictly required, but overall might lead
to quicker "loading" of the data if it all comes from a single file

> > 2 - make attributes_literal.txt files, one for each labeling I need (one 
> > for "color", one for "conf", etc). These will be used for the labels 
> > part of NiftiDataset, read by SampleAttributes. The labels in these 
> > files need to be in the same order as my volumes in the nifti.gz files.
sounds correct

> > 3 - define arrays to label my files by chunks. I think I will need a 2D 
> > array: the first column giving the subject number and the second the 
> > run, with the rows in the same order as my nifti.gz files.
you can do it that way, or,
if you load each subject separately and do 'cross-subjects'
cross-validation, then just assign a unique chunk id to the
whole dataset of each subject prior to concatenating the datasets of
different subjects
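
A minimal sketch of that bookkeeping in plain Python, not the PyMVPA
API: every sample of a subject gets that subject's chunk id before the
per-subject data are concatenated, and leave-one-chunk-out folds then
hold out one whole subject at a time. All names and values below are
hypothetical.

```python
def concat_with_subject_chunks(per_subject):
    """Concatenate (samples, labels) pairs, one chunk id per subject."""
    samples, labels, chunks = [], [], []
    for chunk_id, (subj_samples, subj_labels) in enumerate(per_subject):
        samples.extend(subj_samples)
        labels.extend(subj_labels)
        # every sample of this subject shares the same chunk id
        chunks.extend([chunk_id] * len(subj_samples))
    return samples, labels, chunks


def leave_one_chunk_out(chunks):
    """Yield (held_out_chunk, train_indices, test_indices) per fold."""
    for held_out in sorted(set(chunks)):
        train = [i for i, c in enumerate(chunks) if c != held_out]
        test = [i for i, c in enumerate(chunks) if c == held_out]
        yield held_out, train, test


# Hypothetical data: subject 0 has two samples, subject 1 has three.
subj_a = ([[0.1], [0.2]], ["color", "blah"])
subj_b = ([[0.3], [0.4], [0.5]], ["blah", "color", "color"])

samples, labels, chunks = concat_with_subject_chunks([subj_a, subj_b])
folds = list(leave_one_chunk_out(chunks))
print(chunks, folds)
```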

> > 4 - write python code to create my NiftiDataset object, using my analyze 
> > image (0 for voxels to exclude >0 for voxels to include) as a "mask" if 
> > I want to restrict my analysis to those voxels.
you can load an Analyze image with NiftiDataset as easily as nifti
files, so there is no need to convert it manually, BUT it might be
desirable to convert the mask and all data to nifti first and check that
they remain correct (ie no evil things due to incorrect orientation,
order of slices etc)

-- 
Yaroslav Halchenko
Research Assistant, Psychology Department, Rutgers-Newark
Student  Ph.D. @ CS Dept. NJIT
Office: (973) 353-1412 | FWD: 82823 | Fax: (973) 353-1171
        101 Warren Str, Smith Hall, Rm 4-105, Newark NJ 07102
WWW:     http://www.linkedin.com/in/yarik        


