[pymvpa] Surface searchlight taking 6 to 8 hours

Thu Jul 23 10:45:42 UTC 2015

> On 22 Jul 2015, at 20:11, John Baublitz <jbaub at bu.edu> wrote:
> 
> I have been battling with a surface searchlight that has been taking 6 to 8 hours for a small dataset. It outputs a usable analysis but the time it takes is concerning given that our lab is looking to use even higher resolution fMRI datasets in the future. I profiled the searchlight call and it looks like approximately 90% of those hours is spent mapping in the function from feature IDs to linear voxel IDs (the function feature_id2linear_voxel_ids).

So you are using the SurfaceVoxelsQueryEngine from mvpa2.misc.surfing.queryengine, not the SurfaceVerticesQueryEngine? Only the former should be using the feature_id2linear_voxel_ids function.

(When instantiating a query engine through disc_surface_queryengine, the Vertices variant is the default; the Voxels variant is used when output_modality='volume'.)

For the typical surface-based analysis, the output is a surface-based dataset, and the SurfaceVerticesQueryEngine is used for that. When using the SurfaceVoxelsQueryEngine, the output is a volumetric dataset.

> I looked into the source code and it appears that it is using the in keyword on a list which has to search through every element of the list for each iteration of the list comprehension and then calls that function for each feature. This might account for the slowdown. I'm wondering if there is a way to work around this or speed it up.
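
If the membership test does turn out to matter, the pattern you describe and its usual fix look roughly like the sketch below. This is a generic illustration, not the PyMVPA source: the function names and the mask_ids argument are hypothetical.

```python
def ids_in_mask_list(feature_ids, mask_ids):
    # mask_ids is a list: every `in` test scans it from the start,
    # so the whole comprehension costs O(len(feature_ids) * len(mask_ids)).
    return [i for i in feature_ids if i in mask_ids]


def ids_in_mask_set(feature_ids, mask_ids):
    # Build the set once; each `in` test is then an average O(1) hash lookup.
    mask_set = set(mask_ids)
    return [i for i in feature_ids if i in mask_set]
```

Both functions return the same list; only the cost of each lookup differs.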

When using the SurfaceVoxelsQueryEngine, the Euclidean distance between each node (on the surface) and each voxel (in the volume) is computed. My guess is that this is responsible for the slow-down. This could probably be made faster by dividing the 3D space into blocks, assigning nodes and voxels to each block, and then computing distances between nodes and voxels only within each block and across neighbouring ones (a somewhat similar approach is taken in mvpa2.support.nibabel.Surface.map_to_high_resolution_surf). But that would take some time to implement and test. How important is this feature for you? Is there a particular reason why you would want the output to be a volumetric, not surface-based, dataset?
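
To make the block idea concrete, here is a minimal pure-Python sketch. It is not PyMVPA code and not tested against it: the function name, the fixed search radius and the choice of block size are all assumptions made for illustration. Voxels are binned into cubic blocks whose edge equals the radius, so each node only needs to be compared with voxels in its own block and the 26 neighbouring ones.

```python
import math
from collections import defaultdict
from itertools import product


def voxels_within_radius(nodes, voxels, radius):
    """For each node, return the sorted indices of voxels within `radius`.

    nodes and voxels are lists of (x, y, z) tuples in the same space.
    Hypothetical helper for illustration only.
    """
    def block_of(point):
        # Block edge equals the radius, so two points at most `radius`
        # apart differ by at most one block index along each axis.
        return tuple(int(math.floor(c / radius)) for c in point)

    # Assign every voxel index to its block once.
    blocks = defaultdict(list)
    for j, voxel in enumerate(voxels):
        blocks[block_of(voxel)].append(j)

    offsets = list(product((-1, 0, 1), repeat=3))  # own block + 26 neighbours
    radius_sq = radius * radius
    result = []
    for node in nodes:
        bx, by, bz = block_of(node)
        hits = []
        for dx, dy, dz in offsets:
            for j in blocks.get((bx + dx, by + dy, bz + dz), ()):
                # Compare squared distances to avoid a square root per pair.
                d_sq = sum((a - b) ** 2 for a, b in zip(node, voxels[j]))
                if d_sq <= radius_sq:
                    hits.append(j)
        result.append(sorted(hits))
    return result
```

With roughly uniformly spread voxels, the work per node depends on the number of voxels in 27 blocks rather than on the total number of voxels. A nearest-node assignment (rather than a fixed radius) would need the search to widen when the neighbouring blocks are empty, which this sketch does not attempt.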