[pymvpa] Parallelization
Matteo Visconti di Oleggio Castello
matteo.visconti at gmail.com
Tue Nov 28 20:45:47 UTC 2017
Hi Marco,
I think there are a bunch of conflated issues here.
- First, there was an error in my code, and that's why you got the error
"UnboundLocalError: local variable 'best_clf' referenced before assignment". I
updated the gist, and the example code for running the parallelization should
now be correct and serve as a blueprint for your code
(https://gist.github.com/mvdoc/0c2574079dfde78ea649e7dc0a3feab0).
- You are correct in changing the backend to 'threading' for this particular
case because of the pickling error (see the sketch below, after this list).
- However, I think that the example for nested_cv.py didn't work from the
start, even without parallelization. The last change was 6 years ago, and I'm
afraid things have changed since then and the code wasn't updated. I opened an
issue on GitHub to keep track of it
(https://github.com/PyMVPA/PyMVPA/issues/559).
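
For reference, here is roughly how the first two points combine (a minimal
sketch, not the verbatim gist; it assumes _run_one_partition, partitionerCD
and fds are defined as in your snippet, plus the usual mvpa2.suite imports):

from sklearn.externals.joblib import Parallel, delayed
from sklearn.externals.joblib.parallel import parallel_backend

# 'threading' avoids pickling the SWIG-backed libsvm objects between
# processes, at the cost of running the workers under the GIL
with parallel_backend('threading'):
    tms = Parallel(n_jobs=2)(
        delayed(_run_one_partition)(isplit, partitions)
        for isplit, partitions in enumerate(partitionerCD.generate(fds)))

# unravel the list of TransferMeasures returned by Parallel
best_clfs = {}
confusion = ConfusionMatrix()
for tm in tms:
    confusion += tm.ca.stats
    best_clfs[tm.measure.descr] = best_clfs.get(tm.measure.descr, 0) + 1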
On Fri, Nov 24, 2017 at 3:57 PM, marco tettamanti <mrctttmnt at gmail.com>
wrote:
> Dear Matteo,
> thank you for kindly replying!
>
> Yes, I do have the latest versions of joblib (0.11) and sklearn (0.19.1),
> see at the bottom of this email.
>
> The problem seems independent of either running in jupyter, or evoking
> ipython or python directly in the console.
>
> I am now wondering whether there may be something wrong in my snippet.
> When first running your gist, I encountered an:
>
> UnboundLocalError: local variable 'best_clf' referenced before
> assignment
>
> which I solved by moving the best_clfs declaration a few lines down:
>
> -----------------------------------------
> #best_clfs = {} #moved down 7 lines
> confusion = ConfusionMatrix()
> verbose(1, "Estimating error using nested CV for model selection")
> partitioner = partitionerCD
> splitter = Splitter('partitions')
> tms = Parallel(n_jobs=2)(delayed(_run_one_partition)(isplit, partitions)
>                          for isplit, partitions in enumerate(partitionerCD.generate(fds)))
> best_clfs = {}
> for tm in tms:
>     confusion += tm.ca.stats
>     best_clfs[tm.measure.descr] = best_clfs.get(tm.measure.descr, 0) + 1
> -----------------------------------------
>
> But now, running the snippet in ipython/python specifically for the SVM
> parallelization issue, I saw the error message popping up again:
>
> UnboundLocalError: local variable 'best_clf' referenced before
> assignment
>
> Could this be the culprit? As a reminder, the full snippet I am using is
> included in my previous email.
>
> Thank you and very best wishes,
> Marco
>
>
> In [21]: mvpa2.wtf(exclude=['runtime','process']) ##other possible
> arguments (['sources',
> Out[21]:
> Current date: 2017-11-24 21:23
> PyMVPA:
> Version: 2.6.3
> Hash: 9c07e8827819aaa79ff15d2db10c420a876d7785
> Path: /usr/lib/python2.7/dist-packages/mvpa2/__init__.pyc
> Version control (GIT):
> GIT information could not be obtained due "/usr/lib/python2.7/dist-packages/mvpa2/..
> is not under GIT"
> SYSTEM:
> OS: posix Linux 4.13.0-1-amd64 #1 SMP Debian 4.13.4-2
> (2017-10-15)
> Distribution: debian/buster/sid
> EXTERNALS:
> Present: atlas_fsl, cPickle, ctypes, good
> scipy.stats.rv_continuous._reduce_func(floc,fscale), good
> scipy.stats.rv_discrete.ppf, griddata, gzip, h5py, hdf5, ipython, joblib,
> liblapack.so, libsvm, libsvm verbosity control, lxml, matplotlib, mdp, mdp
> ge 2.4, mock, nibabel, nose, numpy, numpy_correct_unique, pprocess, pylab,
> pylab plottable, pywt, pywt wp reconstruct, reportlab, running ipython env,
> scipy, skl, statsmodels
> Absent: afni-3dinfo, atlas_pymvpa, cran-energy, datalad,
> elasticnet, glmnet, good scipy.stats.rdist, hcluster, lars, mass, nipy,
> nipy.neurospin, numpydoc, openopt, pywt wp reconstruct fixed, rpy2,
> scipy.weave, sg ge 0.6.4, sg ge 0.6.5, sg_fixedcachesize, shogun,
> shogun.krr, shogun.lightsvm, shogun.mpd, shogun.svmocas, shogun.svrlight,
> weave
> Versions of critical externals:
> ctypes : 1.1.0
> h5py : 2.7.1
> hdf5 : 1.10.0
> ipython : 5.5.0
> joblib : 0.11
> lxml : 4.1.0
> matplotlib : 2.0.0
> mdp : 3.5
> mock : 2.0.0
> nibabel : 2.3.0dev
> numpy : 1.13.1
> pprocess : 0.5
> pywt : 0.5.1
> reportlab : 3.4.0
> scipy : 0.19.1
> skl : 0.19.1
> Matplotlib backend: TkAgg
>
>
> On 24/11/2017 17:32, Matteo Visconti di Oleggio Castello wrote:
>
> Hi Marco,
>
> some ideas in random order
>
> - what version of sklearn/joblib are you using? I would make sure to use
> the latest joblib version (0.11), perhaps not importing it from sklearn
> (unless you have the latest sklearn version, 0.19.1) -- see the quick check
> below
> - are you running the code in a jupyter notebook? There might be some
> issues with that (see https://github.com/joblib/joblib/issues/174). As a
> test you might try to convert your notebook to a script and then run it
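>
> A quick way to double-check which versions actually get imported (simple
> sketch):
>
> import joblib, sklearn
> print(joblib.__version__)    # should print 0.11
> print(sklearn.__version__)   # should print 0.19.1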
>
>
>
> On 23/11/2017 12:07, marco tettamanti wrote:
>
> Dear Matteo (and others),
> sorry, I am again asking for your help!
>
> I have experimented with the analysis of my dataset using an adaptation of
> your joblib-based gist.
> As I wrote before, it works perfectly, but not with some classifiers: SVM
> classifiers always cause the code to terminate with an error.
>
> If I set:
> myclassif=clfswh['!gnpp','!skl','svm']   #Note that 'gnpp' and 'skl' were
> excluded for independent reasons
> the code runs through without errors.
>
> However, with:
> myclassif=clfswh['!gnpp','!skl']
> I get the following error:
> MaybeEncodingError: Error sending result:
> '[TransferMeasure(measure=SVM(svm_impl='C_SVC', kernel=LinearLSKernel(),
> weight=[], probability=1, weight_label=[]), splitter=Splitter(space='partitions'),
> postproc=BinaryFxNode(space='targets'), enable_ca=['stats'])]'.
> Reason: 'TypeError("can't pickle SwigPyObject objects",)'
>
> After googling for what may cause this particular error, I have found that
> the situation improves slightly (i.e. more splits executed, sometimes even
> all splits) by importing the following:
> import os
> from sklearn.externals.joblib import Parallel, delayed
> from sklearn.externals.joblib.parallel import parallel_backend
> and then specifying just before 'Parallel(n_jobs=2)':
> with parallel_backend('threading'):
> However, also in this case, the code invariably terminates with a long
> error message (I only report an extract, but I can send the whole error
> message if needed):
> <type 'str'>: (<type 'exceptions.UnicodeEncodeError'>,
> UnicodeEncodeError('ascii', u'JoblibAttributeError
> ___________________________________________________________________________
> Multiprocessing exception:
> ...........................................................................
> /usr/lib/python2.7/runpy.py in _run_module_as_main(mod_name='ipykernel_launcher', alter_argv=1)
>     169         pkg_name = mod_name.rpartition('.')[0]
>     170         main_globals = sys.modules["__main__"].__dict__
>     171         if alter_argv:
>     172             sys.argv[0] = fname
>     173         return _run_code(code, main_globals, None,
> --> 174
>
> I think I have sort of understood that the problem is due to some failure
> in pickling the parallelized jobs, but I have no clue whether and how it can
> be solved.
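>
> (For what it's worth, the failure can presumably be reproduced in isolation
> with something like the following hypothetical check, using clfswh and fds
> from the snippet below:)
>
> import cPickle as pickle
> clf = clfswh['!gnpp', '!skl'][5]   # the SVM classifier used in the snippet
> clf.train(fds)                     # training builds the SWIG-backed libsvm model
> pickle.dumps(clf)                  # presumably raises TypeError: can't pickle SwigPyObject objects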
> Do you have any suggestions?
>
> Thank you and very best wishes,
> Marco
>
> p.s. This is again the full code:
>
> ########## * ##########
> ##########
>
> PyMVPA:
> Version: 2.6.3
> Hash: 9c07e8827819aaa79ff15d2db10c420a876d7785
> Path: /usr/lib/python2.7/dist-packages/mvpa2/__init__.pyc
> Version control (GIT):
> GIT information could not be obtained due "/usr/lib/python2.7/dist-packages/mvpa2/..
> is not under GIT"
> SYSTEM:
> OS: posix Linux 4.13.0-1-amd64 #1 SMP Debian 4.13.4-2
> (2017-10-15)
>
>
> print fds.summary()
> Dataset: 36x534@float32, <sa: chunks,targets,time_coords,time_indices>,
> <fa: voxel_indices>, <a: imgaffine,imghdr,imgtype,mapper,voxel_dim,voxel_eldim>
> stats: mean=0.548448 std=1.40906 var=1.98546 min=-5.41163 max=9.88639
> No details due to large number of targets or chunks. Increase maxc and
> maxt if desired
> Summary for targets across chunks
> targets mean std min max #chunks
> C 0.5 0.5 0 1 18
> D 0.5 0.5 0 1 18
>
>
> #Evaluate prevalent best classifier with nested crossvalidation
> verbose.level = 5
>
> partitionerCD = ChainNode([NFoldPartitioner(cvtype=2, attr='chunks'),
>                            Sifter([('partitions', 2), ('targets', ['C', 'D'])])],
>                           space='partitions')
> # training partitions
> for fds_ in partitionerCD.generate(fds):
>     training = fds[fds_.sa.partitions == 1]
>     #print list(zip(training.sa.chunks, training.sa.targets))
> # testing partitions
> for fds_ in partitionerCD.generate(fds):
>     testing = fds[fds_.sa.partitions == 2]
>     #print list(zip(testing.sa.chunks, testing.sa.targets))
>
> #Helper function (partitionerCD recursively acting on dstrain, rather than
> on fds):
> def select_best_clf(dstrain_, clfs):
>     """Select best model according to CVTE
>     Helper function which we will use twice -- once for proper nested
>     cross-validation, and once to see how big an optimistic bias due
>     to model selection could be if we simply provide an entire dataset.
>     Parameters
>     ----------
>     dstrain_ : Dataset
>     clfs : list of Classifiers
>         Which classifiers to explore
>     Returns
>     -------
>     best_clf, best_error
>     """
>     best_error = None
>     for clf in clfs:
>         cv = CrossValidation(clf, partitionerCD)
>         # unfortunately we don't have ability to reassign clf atm
>         # cv.transerror.clf = clf
>         try:
>             error = np.mean(cv(dstrain_))
>         except LearnerError, e:
>             # skip the classifier if data was not appropriate and it
>             # failed to learn/predict at all
>             continue
>         if best_error is None or error < best_error:
>             best_clf = clf
>             best_error = error
>         verbose(4, "Classifier %s cv error=%.2f" % (clf.descr, error))
>     verbose(3, "Selected the best out of %i classifiers %s with error %.2f"
>             % (len(clfs), best_clf.descr, best_error))
>     return best_clf, best_error
>
> # This function will run all classifiers for one single partitions
> myclassif=clfswh['!gnpp','!skl'][5:6] #Testing a single SVM classifier
> def _run_one_partition(isplit, partitions, classifiers=myclassif):  #see §§
>     verbose(2, "Processing split #%i" % isplit)
>     dstrain, dstest = list(splitter.generate(partitions))
>     best_clf, best_error = select_best_clf(dstrain, classifiers)
>     # now that we have the best classifier, lets assess its transfer
>     # to the testing dataset while training on entire training
>     tm = TransferMeasure(best_clf, splitter,
>                          postproc=BinaryFxNode(mean_mismatch_error, space='targets'),
>                          enable_ca=['stats'])
>     tm(partitions)
>     return tm
>
> #import os
> #from sklearn.externals.joblib import Parallel, delayed
> #from sklearn.externals.joblib.parallel import parallel_backend
>
> # Parallel estimate error using nested CV for model selection
> confusion = ConfusionMatrix()
> verbose(1, "Estimating error using nested CV for model selection")
> partitioner = partitionerCD
> splitter = Splitter('partitions')
> # Here we are using joblib Parallel to parallelize each partition
> # Set n_jobs to the number of available cores (or how many you want to use)
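> # (e.g. import multiprocessing; n_jobs = multiprocessing.cpu_count())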
> #with parallel_backend('threading'):
> #    tms = Parallel(n_jobs=2)(delayed(_run_one_partition)(isplit, partitions)
> #                             for isplit, partitions in enumerate(partitionerCD.generate(fds)))
> tms = Parallel(n_jobs=2)(delayed(_run_one_partition)(isplit, partitions)
>                          for isplit, partitions in enumerate(partitionerCD.generate(fds)))
> # Parallel returns a list with the results of each parallel job, so we need
> # to unravel it to get the confusion matrix
> best_clfs = {}
> for tm in tms:
>     confusion += tm.ca.stats
>     best_clfs[tm.measure.descr] = best_clfs.get(tm.measure.descr, 0) + 1
>
> ##########
> ########## * ##########
>
>
>
>
>
>
>
> On 13/11/2017 09:12, marco tettamanti wrote:
>
> Dear Matteo,
> thank you so much, this is precisely the kind of thing I was looking for:
> it works like a charm!
> Ciao,
> Marco
>
> On 11/11/2017 21:44, Matteo Visconti di Oleggio Castello wrote:
>
> Hi Marco,
>
> in your case, I would then recommend looking into joblib to parallelize
> your for loops (https://pythonhosted.org/joblib/parallel.html).
>
> As an example, here's a gist containing part of PyMVPA's nested_cv example
> where I parallelized the loop across partitions. I feel this is what you
> might want to do in your case, since you have a lot more folds.
>
> Here's the gist:
> https://gist.github.com/mvdoc/0c2574079dfde78ea649e7dc0a3feab0
>
>
> On 10/11/2017 21:13, marco tettamanti wrote:
>
> Dear Matteo,
> thank you for your willingness to look into my code.
>
> This is taken almost verbatim from
> http://dev.pymvpa.org/examples/nested_cv.html, except for the
> leave-one-pair-out partitioning, and a slight reduction in the number of
> classifiers (in the original example, they are around 45).
>
> Any help or suggestion would be greatly appreciated!
> All the best,
> Marco
>
>
> ########## * ##########
> ##########
>
> PyMVPA:
> Version: 2.6.3
> Hash: 9c07e8827819aaa79ff15d2db10c420a876d7785
> Path: /usr/lib/python2.7/dist-packages/mvpa2/__init__.pyc
> Version control (GIT):
> GIT information could not be obtained due "/usr/lib/python2.7/dist-packages/mvpa2/..
> is not under GIT"
> SYSTEM:
> OS: posix Linux 4.13.0-1-amd64 #1 SMP Debian 4.13.4-2
> (2017-10-15)
>
>
> print fds.summary()
> Dataset: 36x534@float32, <sa: chunks,targets,time_coords,time_indices>,
> <fa: voxel_indices>, <a: imgaffine,imghdr,imgtype,mapper,voxel_dim,voxel_eldim>
> stats: mean=0.548448 std=1.40906 var=1.98546 min=-5.41163 max=9.88639
> No details due to large number of targets or chunks. Increase maxc and
> maxt if desired
> Summary for targets across chunks
> targets mean std min max #chunks
> C 0.5 0.5 0 1 18
> D 0.5 0.5 0 1 18
>
>
> #Evaluate prevalent best classifier with nested crossvalidation
> verbose.level = 5
>
> partitionerCD = ChainNode([NFoldPartitioner(cvtype=2, attr='chunks'),
>                            Sifter([('partitions', 2), ('targets', ['C', 'D'])])],
>                           space='partitions')
> # training partitions
> for fds_ in partitionerCD.generate(fds):
>     training = fds[fds_.sa.partitions == 1]
>     #print list(zip(training.sa.chunks, training.sa.targets))
> # testing partitions
> for fds_ in partitionerCD.generate(fds):
>     testing = fds[fds_.sa.partitions == 2]
>     #print list(zip(testing.sa.chunks, testing.sa.targets))
>
> #Helper function (partitionerCD recursively acting on dstrain, rather than
> on fds):
> def select_best_clf(dstrain_, clfs):
>     """Select best model according to CVTE
>     Helper function which we will use twice -- once for proper nested
>     cross-validation, and once to see how big an optimistic bias due
>     to model selection could be if we simply provide an entire dataset.
>     Parameters
>     ----------
>     dstrain_ : Dataset
>     clfs : list of Classifiers
>         Which classifiers to explore
>     Returns
>     -------
>     best_clf, best_error
>     """
>     best_error = None
>     for clf in clfs:
>         cv = CrossValidation(clf, partitionerCD)
>         # unfortunately we don't have ability to reassign clf atm
>         # cv.transerror.clf = clf
>         try:
>             error = np.mean(cv(dstrain_))
>         except LearnerError, e:
>             # skip the classifier if data was not appropriate and it
>             # failed to learn/predict at all
>             continue
>         if best_error is None or error < best_error:
>             best_clf = clf
>             best_error = error
>         verbose(4, "Classifier %s cv error=%.2f" % (clf.descr, error))
>     verbose(3, "Selected the best out of %i classifiers %s with error %.2f"
>             % (len(clfs), best_clf.descr, best_error))
>     return best_clf, best_error
>
> #Estimate error using nested CV for model selection:
> best_clfs = {}
> confusion = ConfusionMatrix()
> verbose(1, "Estimating error using nested CV for model selection")
> partitioner = partitionerCD
> splitter = Splitter('partitions')
> for isplit, partitions in enumerate(partitionerCD.generate(fds)):
>     verbose(2, "Processing split #%i" % isplit)
>     dstrain, dstest = list(splitter.generate(partitions))
>     best_clf, best_error = select_best_clf(dstrain, clfswh['!gnpp','!skl'])
>     best_clfs[best_clf.descr] = best_clfs.get(best_clf.descr, 0) + 1
>     # now that we have the best classifier, lets assess its transfer
>     # to the testing dataset while training on entire training
>     tm = TransferMeasure(best_clf, splitter,
>                          postproc=BinaryFxNode(mean_mismatch_error, space='targets'),
>                          enable_ca=['stats'])
>     tm(partitions)
>     confusion += tm.ca.stats
>
> ##########
> ########## * ##########
>
>
>
>
>
>
> On 10/11/2017 15:43, Matteo Visconti di Oleggio Castello wrote:
>
> What do you mean by "cycling over approx 40 different classifiers"? Are you
> testing different classifiers? If that's the case, a possibility is to
> create a script that takes the classifier type as an argument and runs the
> classification across all folds. That way you can submit 40 jobs and
> parallelize across classifiers (see the sketch below).
>
> If that's not the case, then because the folds are independent and
> deterministic I would create a script that performs the classification on
> blocks of folds (say folds 1 to 30, 31 to 60, etc.), and then submit
> different jobs, so as to parallelize there.
>
> I think that if you send a snippet of the code you're using, it will be
> easier to see which are the good points for parallelization.
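>
> As a rough illustration of the per-classifier approach (hypothetical script;
> the file names, saved dataset, and partitioner are placeholders):
>
> # run_one_clf.py -- one classifier per cluster job
> import sys
> from mvpa2.suite import *
>
> clf_index = int(sys.argv[1])               # which classifier this job handles
> fds = h5load('dataset.hdf5')               # assumes the dataset was saved beforehand with h5save
> clf = clfswh['!gnpp', '!skl'][clf_index]
> # NFoldPartitioner is just a placeholder; use your own partitioning scheme
> cv = CrossValidation(clf, NFoldPartitioner(), enable_ca=['stats'])
> res = cv(fds)
> h5save('result_clf%02d.hdf5' % clf_index, res)
>
> One would then submit "python run_one_clf.py 0" through
> "python run_one_clf.py 39" as 40 independent jobs.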
>
>
> On 10/11/2017 09:57, marco tettamanti wrote:
>
> Dear Matteo and Nick,
> thank you for your responses.
> I take the opportunity to ask some follow-up questions, because I am
> struggling to make pymvpa2 computations faster and more efficient.
>
> I often find myself in the situation of giving up on a particular analysis,
> because it is going to take far more time than I can bear (weeks, months!).
> This happens particularly with searchlight permutation testing
> (gnbsearchlight is much faster, but does not support pprocess), and nested
> cross-validation.
> As for the latter, for example, I recently wanted to run nested
> cross-validation in a sample of 18 patients and 18 controls (1 image per
> subject), training the classifiers to discriminate patients from controls in
> a leave-one-pair-out partitioning scheme. This yields 18*18=324 folds. For a
> small ROI of 36 voxels, cycling over approx 40 different classifiers takes
> about 2 hours for each fold on a decent PowerEdge T430 Dell server with
> 128GB RAM. This means approx. 27 days for all 324 folds!
> The same server is equipped with 32 CPUs. With full parallelization, the
> same analysis may be completed in less than one day. This is the reason for
> my interest and questions about parallelization.
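>
> In numbers (back-of-the-envelope, assuming perfect scaling across cores):
>
> n_folds = 18 * 18                    # leave-one-pair-out: 324 folds
> hours_serial = n_folds * 2.0         # ~648 hours, i.e. ~27 days
> hours_parallel = hours_serial / 32   # ~20 hours on 32 CPUs, i.e. under a day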
>
> Is there anything that you experts do in such situations to speed up or make the
> computation more efficient?
>
> Thank you again and best wishes,
> Marco
>
>
>
> On 10/11/2017 10:07, Nick Oosterhof wrote:
>
> There have been some plans / minor attempts for using parallelisation more
> parallel, but as far as I know we only support pprocces, and only for (1)
> searchlight; (2) surface-based voxel selection; and (3) hyperalignment. I
> do remember that parallelisation of other functions was challenging due to
> some getting the conditional attributes set right, but this is long time
> ago.
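>
> (For the searchlight case, the pprocess support is exposed via the nproc
> argument; a minimal sketch, assuming a CrossValidation measure cv and a
> dataset fds are already defined:)
>
> sl = sphere_searchlight(cv, radius=3, nproc=8)   # nproc > 1 enables pprocess
> sl_map = sl(fds)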
>
>
> On 09/11/2017 18:35, Matteo Visconti di Oleggio Castello wrote:
>
> Hi Marco,
> AFAIK, there is no support for parallelization at the level of
> cross-validation. Usually for a small ROI (such as a searchlight) and with
> standard CV schemes, the process is quite fast, and the bottleneck is really
> the number of searchlights to be computed (for which parallelization
> exists).
>
> In my experience, we tend to parallelize at the level of individual
> participants; for example we might set up a searchlight analysis with as
> many processes (nproc) as you have available, and then submit one such job
> for every participant to a cluster (using either torque or condor), as in
> the sketch below.
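>
> As a rough sketch (hypothetical file names and a torque-style submission;
> condor would use a submit description file instead):
>
> import subprocess
>
> subjects = ['sub%02d' % i for i in range(1, 19)]    # hypothetical subject IDs
> for subj in subjects:
>     # run_searchlight.py is a placeholder per-subject analysis script
>     script = '#!/bin/bash\npython run_searchlight.py %s\n' % subj
>     fname = 'job_%s.sh' % subj
>     with open(fname, 'w') as f:
>         f.write(script)
>     subprocess.call(['qsub', fname])                # torque/PBS submission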
>
> HTH,
> Matteo
>
> On 09/11/2017 10:08, marco tettamanti wrote:
>
> Dear all,
> forgive me if this has already been asked in the past, but I was wondering
> whether there has been any development in the meantime.
>
> Is there any chance that one can generally apply parallel computing
> (multiple CPUs or clusters) with pymvpa2, in addition to what is already
> implemented for searchlight (pprocess)? That is, also for general
> cross-validation, nested cross-validation, permutation testing, RFE, etc.?
>
> Has anyone had successful experience with parallelization schemes such as
> ipyparallel, condor, or others?
>
> Thank you and best wishes!
> Marco
>
>
>
> --
> Marco Tettamanti, Ph.D.
> Nuclear Medicine Department & Division of Neuroscience
> IRCCS San Raffaele Scientific Institute
> Via Olgettina 58
> I-20132 Milano, Italy
> Phone ++39-02-26434888
> Fax ++39-02-26434892
> Email: tettamanti.marco at hsr.it
> Skype: mtettamanti
> http://scholar.google.it/citations?user=x4qQl4AAAAAJ
>
>
>
>
>
>
--
Matteo Visconti di Oleggio Castello
Ph.D. Candidate in Cognitive Neuroscience
Dartmouth College
+1 (603) 646-8665
mvdoc.me || github.com/mvdoc || linkedin.com/in/matteovisconti