[pymvpa] Permutation testing and Nipype

Tue Aug 11 20:20:47 UTC 2015

Okay, so I investigated this a little more and I cannot replicate my
original problem. It now looks like the jobs are slow simply because
the permutation testing itself is slow.

At the bottom of this message is the script I used for testing the
timing. Using Python 2.7.6 and PyMVPA version 2.4.0, I time the script
as follows:

     python2.7 -O -m timeit -n 1 -r 1 'import test' 'test.main()'

The dataset I'm loading in has 3504 trials that we're using and 29462 voxels.

I get the following times:
     perm_num=1, ids=(0,1)  : 161sec
     perm_num=1, ids=(0,2)  : 316sec
     perm_num=1, ids=(0,3)  : 531sec
     perm_num=1, ids=(0,4)  : 687sec
     perm_num=5, ids=(0,1)  : 435sec

This makes me realize that there's no way I could have run 100
permutations on 5 searchlights (which is about what I was looking at
earlier) in 1.5 hours. I don't know what changed: going back through my
commits, I haven't changed any of the relevant code since then. It's
possible I made a mistake and accidentally ran only 10 permutations or
something like that.
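
As a rough back-of-the-envelope check, the timings above can be extrapolated linearly. This is only a sketch under assumptions that are not verified here: that runtime grows linearly with the number of permutations (estimated from the perm_num=1 and perm_num=5 runs on one searchlight) and that total time scales in proportion to the number of searchlights. The function name estimate_seconds is made up for illustration.

```python
# Timings reported above, in seconds, for a single searchlight (ids=(0,1)).
t_1perm_1sl = 161.0   # perm_num=1
t_5perm_1sl = 435.0   # perm_num=5

# Assumed linear model: time = fixed + n_perm * per_perm (one searchlight).
per_perm = (t_5perm_1sl - t_1perm_1sl) / (5 - 1)
fixed = t_1perm_1sl - per_perm


def estimate_seconds(n_perm, n_sl):
    """Extrapolated runtime, assuming cost is proportional to n_sl."""
    return n_sl * (fixed + n_perm * per_perm)


hours = estimate_seconds(100, 5) / 3600.0
print("100 permutations x 5 searchlights: about %.1f hours" % hours)
```

Under these assumptions a job with 100 permutations and 5 searchlights comes out at roughly 9.6 hours, far beyond 1.5 hours, which is consistent with the conclusion above.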

Regardless, this is still taking way too long. Does anyone have any
idea how to speed it up? It looks like it's a good idea to have jobs
run a bunch of permutations in one function, but split up the
searchlights, which is what I'm doing at the moment, but I still need
to do something else to speed it up.
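
The splitting described above (many permutations per job, a few searchlights per job) can be sketched with a small helper that produces the (start, stop) tuples fed to center_ids. The name make_sl_ranges is hypothetical, and using all 29462 voxels as searchlight centers is an assumption made for the example.

```python
def make_sl_ranges(n_centers, per_job):
    """Split center ids 0..n_centers-1 into consecutive (start, stop) tuples.

    Each tuple is meant to be used as range(start, stop), so stop is exclusive.
    """
    if per_job < 1:
        raise ValueError("per_job must be at least 1")
    return [(start, min(start + per_job, n_centers))
            for start in range(0, n_centers, per_job)]


# Assumed example: every voxel is a center, 5 searchlights per job.
ranges = make_sl_ranges(29462, 5)
print(len(ranges), ranges[0], ranges[-1])
```

Raising per_job gives fewer, longer jobs, so it trades per-job runtime against the number of jobs the scheduler and Nipype have to track.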

Thanks,
Bill

test.py script:

def main(perm_num=5, ids=(0, 1)):
    from mvpa2.suite import (h5load, LinearCSVMC, Repeater,
                             AttributePermutator, NFoldPartitioner,
                             CrossValidation, ChainNode, MCNullDist,
                             sphere_searchlight)

    ds = h5load('dataset.hdf5')
    clf = LinearCSVMC()
    # repeat the permuted cross-validation perm_num times
    repeater = Repeater(count=perm_num)
    # shuffle targets only in the training portion of each partitioning
    permutator = AttributePermutator('targets', limit={'partitions': 1}, count=1)
    # leave-one-subject-out partitioning
    nf = NFoldPartitioner(attr='subject', cvtype=1, count=None, selection_strategy='random')
    # cross-validation run on permuted targets, used to build the null distribution
    null_cv = CrossValidation(clf, ChainNode([nf, permutator], space=nf.get_space()))
    distr_est = MCNullDist(repeater, tail='left', measure=null_cv,
                           enable_ca=['dist_samples'])
    cv = CrossValidation(clf, nf, null_dist=distr_est, pass_attr=[('ca.null_prob', 'fa', 1)])
    print('running...')
    # only run the searchlights whose center ids fall in [ids[0], ids[1])
    sl = sphere_searchlight(cv, radius=3, center_ids=range(ids[0], ids[1]),
                            enable_ca='roi_sizes', pass_attr=[('ca.roi_sizes', 'fa')])
    res = sl(ds)

On Tue, Aug 11, 2015 at 12:00 PM, Bill Broderick <billbrod at gmail.com> wrote:
> On Mon, Aug 10, 2015 at 5:33 PM, Yaroslav Halchenko
> <debian at onerussian.com> wrote:
>> it would help to know what you are permuting and at what level, etc.,
>> and what that timing issue is (does nipype kill tasks if they run "too"
>> long? unlikely)
>
> I'm running my analysis with leave-one-subject-out cross-validation
> (so combining all runs for each subject), permuting the labels in the
> training set in two categories 100 times. I originally was running the
> whole brain in one job, but found that took too long (didn't get
> killed by nipype or our SGE cluster, but it was taking too long to be
> feasible), so I'm using sphere_searchlight's center_ids option to
> split permutation testing into a bunch of smaller jobs, each with
> about 5 searchlights. Here's what my function looks like:
>
>     clf = LinearCSVMC()
>     repeater = Repeater(count=100)
>     permutator = AttributePermutator('targets',limit={'partitions':1},count=1)
>     nf = NFoldPartitioner(attr='subject')
>     null_cv = CrossValidation(clf,ChainNode([nf,permutator],space=nf.get_space()),errorfx=mean_mismatch_error)
>     distr_est = MCNullDist(repeater,tail='left',measure=null_cv,enable_ca=['dist_samples'])
>     cv = CrossValidation(clf,nf,null_dist=distr_est,pass_attr=[('ca.null_prob','fa',1)],errorfx=mean_mismatch_error)
>     sl = sphere_searchlight(cv,radius=3,center_ids=range(sl_range[0],sl_range[1]),enable_ca='roi_sizes',pass_attr=[('ca.roi_sizes','fa')])
>     sl_res = sl(ds)
>     null_dist = cv.null_dist.ca.dist_samples
>
> where sl_range is a tuple, passed to the function, defining which
> searchlights to run. In my current set up, the above function is a
> Nipype MapNode, iterating on sl_range, such that when it reaches this
> function it creates many versions of this job (currently I'm working
> with about 5000), each running permutation testing on different
> searchlights. These are all submitted in parallel to the SGE cluster,
> which allows users to submit as many jobs as they want but limits them
> to running jobs at 200-some nodes at a time.
>
> When I split this into about 5000 jobs, I ran into an issue with
> Nipype where each of these jobs would finish running (in about 1.5
> hours) but the Nipype master job that spawned them would take a very
> long time to realize they were done (as in, it would detect about one
> finished job per hour), so it never finished and moved on. If I split this into fewer
> jobs, it doesn't run into this issue, but each job takes a lot longer.
> So I could either figure out what's going on with Nipype or make the
> permutations themselves take less time.
>
> Is that clear? Has anyone run into similar issues or found a way to
> run the permutation faster?
>
> Thanks,
> Bill