[pymvpa] GNBSearchlight below/above chance accuracy ... again

Richard Dinga dinga92 at gmail.com
Sat Jul 23 09:30:01 UTC 2016


Hi,
I am sorry about your below-chance accuracy; it's always very annoying. Do
you also get below-chance accuracy with classifiers other than GNB?
Is it just speed that is your concern? You can try M1NNSearchlight, which
should also be efficiently implemented, but I think the results will be
very similar. Other than that, you can decrease the number of folds, mask
out white matter, downsample the number of SL spheres, or just rent an
Amazon cluster. Also, it seems like BROCCOLI will have a searchlight
implementation soon, so you might be able to run an SVM searchlight on a
GPU, but it won't be easy. You will have to time which option is fastest,
but definitely try Lasso from glmnet
http://www.pymvpa.org/generated/mvpa2.clfs.glmnet.GLMNET_C.html or from
sklearn.
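
If you go the sklearn route, something along these lines should work (a
minimal sketch; it assumes SKLLearnerAdapter, PyMVPA's generic wrapper for
sklearn estimators, and the penalty settings are just an example choice):

# Sketch: an L1-penalized (lasso-like) sklearn classifier wrapped so it
# can be used like any PyMVPA classifier (e.g. inside CrossValidation).
from sklearn.linear_model import LogisticRegression
from mvpa2.clfs.skl.base import SKLLearnerAdapter

# 'liblinear' supports the l1 penalty; C sets the regularization strength
clf_l1 = SKLLearnerAdapter(LogisticRegression(penalty='l1',
                                              solver='liblinear', C=1.0))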

In your little simulation, you will get the same results with an SVM too.
Comparing it with a binomial distribution is not fair, since binomial
samples are independent, but your accuracies over CV folds are not.
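
You can see the effect with plain numpy (a rough sketch; as a
simplification it assumes each sample gets one fixed prediction that is
merely re-tested across folds, which is exactly what makes the folds
dependent):

# Rough sketch: with 16 samples re-tested across 32 folds of 4, the mean
# accuracy spreads like a binomial with n=16, not n=32*4=128, because the
# per-sample outcomes are reused across folds.
import numpy as np

rng = np.random.RandomState(0)
n_sim, n_samples, n_folds, test_size = 10000, 16, 32, 4

acc = np.empty(n_sim)
for s in range(n_sim):
    # fixed per-sample outcome, reused by every fold that tests the sample
    correct = rng.rand(n_samples) < 0.5
    folds = [rng.choice(n_samples, test_size, replace=False)
             for _ in range(n_folds)]
    acc[s] = np.mean([correct[f].mean() for f in folds])

indep = rng.binomial(n_folds * test_size, 0.5, n_sim) / 128.0
print('dependent folds:    sd = %.3f' % acc.std())    # ~0.12
print('independent trials: sd = %.3f' % indep.std())  # ~0.04

With a standard deviation around 0.12 instead of 0.04, the most extreme of
~1000 such draws (roughly three standard deviations out) can easily reach
the ~10%/~90% you observed.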




On Fri, Jul 22, 2016 at 8:39 PM, basile pinsard <basile.pinsard at gmail.com>
wrote:

> Hi PyMVPA community,
>
> I wanted to ask for advice on a problem I am having using PyMVPA.
> My pipeline includes a Searchlight on BOLD data, for which I used the
> optimized GNBSearchlight because I plan to run ~100 permutations to perform
> statistical testing and it is the only one offering reasonable processing
> time (or maybe the optimized KNN).
>
> I have 2 classes x 8 samples each (1 sample per chunk), and the
> partitioner (thanks @Yaroslav) I use is:
>
> from mvpa2.suite import FactorialPartitioner, NFoldPartitioner
>
> prtnr_2fold_factpart = FactorialPartitioner(
>     NFoldPartitioner(cvtype=2, attr='chunks'),
>     attr='targets',
>     selection_strategy='equidistant',
>     count=32)
>
> This way I repeatedly take out 2 samples of each of the 2 classes for
> testing and train on the remaining 2x6 samples; 'equidistant' lets all
> the samples be tested approximately the same number of times, thus being
> equally represented in the final accuracy score.
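>
> To check the coverage, one can count how often each chunk lands in the
> testing set (a minimal sketch; it assumes the dataset is named ds and
> relies on PyMVPA's convention that partition value 2 marks the testing
> samples):
>
> # Sketch: verify 'equidistant' spreads testing evenly over the chunks
> from collections import Counter
>
> counts = Counter()
> for part in prtnr_2fold_factpart.generate(ds):
>     counts.update(part.sa.chunks[part.sa.partitions == 2])
> print(sorted(counts.items()))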
>
> The problem is that the distribution of accuracies in the searchlight map
> is very wide, with significantly below-chance classification, and the
> results are very variable across scans/subjects.
>
> So, to check whether there was any problem in the analysis, I replaced my
> BOLD signal with random data drawn from a normal distribution, thus
> removing any potential temporal dependency (even though the design used
> de Bruijn cycles to balance carry-over effects) that could also result
> from the GLM (GLM-LSS, Mumford 2012), detrending, or anything else.
>
> As a result, I get accuracies from ~10% to ~90%, far below and above the
> chance range expected from the normal approximation to the binomial
> distribution (25-75%). It seems that, whether it comes from the design,
> the pipeline, or the algorithm, information is being found by chance in
> the random data.
>
> I took the neighborhood where I got these results and ran a
> cross-validation using the same partitioner but with GNB, LinearCSVMC,
> and LDA. GNB gives the same accuracy, so it is not the optimized
> GNBSearchlight that causes this; LinearCSVMC and LDA give about chance
> (50%) accuracy for the same neighborhood.
>
> This can be reproduced by creating a random dataset from scratch with 2
> classes and repeatedly selecting random subsets of features:
>
> import numpy as np
> from mvpa2.suite import (dataset_wizard, CrossValidation, GNB,
>                          mean_match_accuracy)
>
> ds_rand2 = dataset_wizard(
>     np.random.normal(size=(16, 10000)),
>     targets=[0, 1] * 8,
>     chunks=np.arange(16))
> cvte = CrossValidation(
>     GNB(common_variance=True),
>     prtnr_2fold_factpart,
>     errorfx=mean_match_accuracy)
>
> accs = [cvte(ds_rand2[:, np.random.randint(0, ds_rand2.nfeatures,
>                                            size=64)]).samples.mean()
>         for i in range(1000)]
> print(np.max(accs))  # 0.8828125
> print(np.min(accs))  # 0.1484375
>
> So is there something specific to GNB that gives this kind of lucky
> overfitting of random data when it is used many times, as in a
> searchlight? Also, since these lucky features are included in multiple
> overlapping neighborhoods, they produce nice blobs in the searchlight
> map, whose sizes depend on the radius.
> I tried GNB with and without common_variance (thus piecewise quadratic or
> linear boundaries) and the results are quite similar; see the sketch
> below.
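>
> For example (a sketch reusing ds_rand2 and the partitioner from above;
> feat_idx is just one random feature subset):
>
> feat_idx = np.random.randint(0, ds_rand2.nfeatures, size=64)
> for pooled in (True, False):
>     cvte = CrossValidation(GNB(common_variance=pooled),
>                            prtnr_2fold_factpart,
>                            errorfx=mean_match_accuracy)
>     print(pooled, cvte(ds_rand2[:, feat_idx]).samples.mean())
>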
> Has anybody been using it to produce sensible results?
> Maybe it works better with more than 2 classes.
>
> LDA applied to more features than samples is incredibly slow, and thus
> unrealistic for a searchlight, even more so with permutation testing; but
> I have seen it used in many papers (maybe not with permutations, though),
> so I wonder whether it is the PyMVPA implementation or my Python setup.
> Do you think an optimized LDA searchlight would be possible, or is there
> a lengthy computation (e.g. matrix inversion) that cannot be factored
> out?
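>
> One direction I might try, sketched under the assumption that
> SKLLearnerAdapter can wrap it: sklearn's shrinkage LDA, which replaces
> the ill-conditioned covariance estimate with a regularized one and stays
> usable with more features than samples.
>
> # Sketch: shrinkage LDA from sklearn, wrapped for use in PyMVPA
> from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
> from mvpa2.clfs.skl.base import SKLLearnerAdapter
>
> lda_shrink = SKLLearnerAdapter(
>     LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto'))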
>
> Otherwise, what kind of classifier would you recommend that would not be
> too computationally intensive? Or maybe I just have to live with this?
>
> Many thanks for any idea about that.
>
> --
> Basile Pinsard
>
> PhD candidate,
> Laboratoire d'Imagerie Biomédicale, UMR S 1146 / UMR 7371, Sorbonne
> Universités, UPMC, INSERM, CNRS
> Brain-Cognition-Behaviour Doctoral School, ED3C, UPMC, Sorbonne
> Universités
> Biomedical Sciences Doctoral School, Faculty of Medicine, Université de
> Montréal
> CRIUGM, Université de Montréal
>

