[pymvpa] [mvpa-toolbox] Q about faster GNB searchlight script: quick way to compute sphere voxels? Plus some Matlab code, in case it helps

Yaroslav Halchenko debian at onerussian.com
Thu Apr 29 15:45:39 UTC 2010


On Thu, 29 Apr 2010, Rajeev Raizada wrote:
> I was chatting with Yarik and Michael the other day
> about their ultra-fast Gaussian Naive Bayes searchlight script.
> I think it's a nice idea. Thanks for making it!
You are welcome!
And it never hurts to reiterate the acknowledgment:

$> grep -A1 Kudos mvpa/measures/gnbsearchlight.py
    Kudos for the idea and showing that it indeed might be beneficial
    over generic Searchlight with GNB go to Francisco Pereira.

> The script gains its speed by exploiting a nice trick
> that I hadn't thought of:
> because GNB doesn't use any info about particular
> combinations of voxels, you can calculate the means and s.d.s
> of all the voxels at once at the beginning of each cross-validation rep,
> and then reuse those precomputed values for any searchlight sphere.
and part of gnbsearchlight.py's speed is due to taking advantage of yet another trick:

precompute sums and sums of squares on all the data prior to splitting, per each
group of samples which would appear under the same split/label combination during
cross-validation (figured out in the steps marked 1-3 in the .py code).  So we
don't need to redo the sums across all samples for each split, but rather use the
previously computed sums (and sums of squares) to derive the means/stds for each
particular split without actually splitting the data.  A sketch of the idea is below.
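Roughly, in plain NumPy terms (a minimal sketch with made-up variable names, not
the actual gnbsearchlight.py code):

    import numpy as np

    rng = np.random.RandomState(0)
    data = rng.randn(120, 5000)                   # (n_samples, n_voxels)
    chunks = np.repeat(np.arange(6), 20)          # 6 cross-validation folds
    targets = np.tile(np.repeat([0, 1], 10), 6)   # 2 classes within each chunk

    # steps 1-3 analogue: per (chunk, class) counts, sums, sums of squares
    groups = sorted(set(zip(chunks, targets)))
    n = np.array([((chunks == c) & (targets == t)).sum() for c, t in groups])
    s1 = np.array([data[(chunks == c) & (targets == t)].sum(axis=0)
                   for c, t in groups])
    s2 = np.array([(data[(chunks == c) & (targets == t)] ** 2).sum(axis=0)
                   for c, t in groups])

    def train_stats(left_out_chunk, klass):
        """Means/variances of a split's training set from precomputed sums."""
        keep = [i for i, (c, t) in enumerate(groups)
                if c != left_out_chunk and t == klass]
        N = n[keep].sum()
        mean = s1[keep].sum(axis=0) / N
        var = s2[keep].sum(axis=0) / N - mean ** 2   # E[x^2] - (E[x])^2
        return mean, var

    mean0, var0 = train_stats(left_out_chunk=0, klass=0)

so each split only combines a handful of small precomputed arrays instead of
re-summing the raw samples.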

> I'm still far too lacking in fluency in Python to be able to really
> read through and understand individual PyMVPA scripts,
> so I reimplemented the GNB part in Matlab:
> http://www.nmr.mgh.harvard.edu/~raj/Matlab/raj_searchlight_gnb.m

> That script is nice and fast

Let's compare? ;)

Francisco, could you share the testbed dataset which I've used to
time the performance?

> but it relies on the coords of the sphere voxels
> having been pre-computed in advance.
sure -- since they are the same across splits; that is step 4 in the .py code

> That still takes a fair chunk of CPU time:
> http://www.nmr.mgh.harvard.edu/~raj/Matlab/premake_sphere_coords_list.m

hm... shouldn't be too much really -- how long does it take for you on a full
brain?

> As far as I can tell from the gnbsearchlight.py script,
> it assumes that the sphere radius is one voxel.
hm... where? maybe I managed to screw up?

any searchlight in PyMVPA (see BaseSearchlight in searchlight.py) takes

        queryengine : QueryEngine
          Engine to use to discover the "neighborhood" of each feature.
          See :class:`~mvpa.misc.neighborhood.QueryEngine`.

which is conventionally a sphere of arbitrary radius if you use the
sphere_*searchlight factory methods; those take the radius in voxels,
with a default of 1.
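So a larger radius is just a keyword argument.  For example, something along
these lines (a sketch only; the exact import paths and signatures are from
memory and may differ between PyMVPA versions):

    # sketch -- namespaces/signatures may differ across PyMVPA versions
    from mvpa.suite import GNB, NFoldPartitioner, sphere_gnbsearchlight

    # radius is given in voxels; the factory defaults to radius=1
    sl = sphere_gnbsearchlight(GNB(), NFoldPartitioner(), radius=3)
    results = sl(dataset)    # 'dataset' here is your fMRI dataset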

> One-voxel-radius sphere-coords can be computed on the fly very quickly,
> simply by listing the six +/-1 x,y,z voxels attached to the sphere center.
yeap -- so up to 7 voxels in a sphere (the center plus its 6 face neighbors)
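and the sphere grows quickly with the radius, which is why lookup tables become
unwieldy.  A quick count (plain NumPy, just for illustration):

    import numpy as np

    def sphere_offsets(radius):
        """Integer voxel offsets within Euclidean `radius` of the center."""
        r = int(np.ceil(radius))
        g = np.mgrid[-r:r + 1, -r:r + 1, -r:r + 1].reshape(3, -1).T
        return g[(g ** 2).sum(axis=1) <= radius ** 2]

    for radius in (1, 2, 3):
        print("radius %d -> %d voxels" % (radius, len(sphere_offsets(radius))))
        # radius 1 -> 7, radius 2 -> 33, radius 3 -> 123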

> But I was wondering if you guys have a quick trick
> for computing the voxel coords of searchlight spheres which
> are of larger radius. Once you get past radius 1 or 2,
> the lookup-table approach starts to get pretty impractical.
hm... shouldn't be too bad... I will do some timing and report back.

also, since GNBSearchlight is fast, this precomputation of neighbors indeed
becomes a significant portion of the runtime.  To eliminate that cost on
subsequent runs on the same dataset with possibly permuted attributes
(permutation testing), we have CachedQueryEngine, which reuses the results of
previous queries (unless the dataset has changed, in which case the cache is no
longer appropriate).
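The idea is plain memoization keyed on the dataset; a toy illustration of the
principle (hypothetical class, not the actual CachedQueryEngine code):

    class MemoizingQueryEngine(object):
        """Toy wrapper reusing neighborhood queries while the dataset
        stays the same (for illustration only)."""

        def __init__(self, queryengine):
            self._qe = queryengine
            self._cache = {}
            self._trained_on = None

        def train(self, dataset):
            # real code would compare dataset geometry, not object identity
            if self._trained_on != id(dataset):
                self._cache.clear()
                self._trained_on = id(dataset)
                self._qe.train(dataset)

        def query_byid(self, fid):
            if fid not in self._cache:
                self._cache[fid] = self._qe.query_byid(fid)
            return self._cache[fid]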

> My best attempt so far is that script
> http://www.nmr.mgh.harvard.edu/~raj/Matlab/premake_sphere_coords_list.m
> It's not very good though. It still takes a long time.
but what is the 'long time'?

> At least you only need to run it once, but it's still slow.
> Any advice on faster tricks greatly appreciated!
I am sorry, but Matlab scripting has already escaped my memory, so I can't
clearly deduce what premake_ is doing.

I see that the tight loop does some find(), which might be one of the
slowdowns... also, I don't know how fast Matlab's cell arrays are.  Isn't there
some kind of profiler to deduce which lines take most of the time?

also, to me it looks a bit more complex than necessary for 'template' lookups
in 3D, which, given just a mask and a radius, would need to (see the sketch
after the list):

1. considering only the radius and possibly the voxel sizes, generate a
'neighborhood template' once -- the set of coordinate increments defining the
tentative neighborhood elements.

2. sweep through the xyz coordinates, add the template from above to each of
them (so no distance_cube computation should be inside the loop), and discard
whatever 'sticks out' of the volume or doesn't belong to the mask, which should
be done efficiently with matrix operations.
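In NumPy terms, that might look something like this (a sketch of the two steps,
not PyMVPA's actual implementation; names are made up):

    import numpy as np

    def neighborhood_ids(mask, radius):
        """For each in-mask voxel, ids of in-mask voxels within `radius`."""
        # step 1: one offset template for the whole run
        # (voxel sizes assumed isotropic here for simplicity)
        r = int(np.ceil(radius))
        g = np.mgrid[-r:r + 1, -r:r + 1, -r:r + 1].reshape(3, -1).T
        template = g[(g ** 2).sum(axis=1) <= radius ** 2]

        centers = np.array(np.nonzero(mask)).T        # xyz of in-mask voxels
        feature_id = -np.ones(mask.shape, dtype=int)  # xyz -> feature id or -1
        feature_id[mask] = np.arange(len(centers))

        neighborhoods = []
        for c in centers:                     # step 2: sweep and add template
            coords = c + template
            # drop what sticks out of the volume...
            inside = np.all((coords >= 0) & (coords < mask.shape), axis=1)
            ids = feature_id[tuple(coords[inside].T)]
            neighborhoods.append(ids[ids >= 0])  # ...and what is off the mask
        return neighborhoods

    mask = np.zeros((40, 48, 40), dtype=bool)
    mask[5:35, 5:43, 5:35] = True
    nbrs = neighborhood_ids(mask, radius=3)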


-- 
                                  .-.
=------------------------------   /v\  ----------------------------=
Keep in touch                    // \\     (yoh@|www.)onerussian.com
Yaroslav Halchenko              /(   )\               ICQ#: 60653192
                   Linux User    ^^-^^    [175555]




