[pymvpa] Combinatorial MVPA

Bill Broderick billbrod at gmail.com
Thu Dec 10 19:42:21 UTC 2015


I should note that I created a modified Splitter (called
CombinatorialSplitter) which returns the datasets I want. The only edits are
to lines 119 and 133 of splitters.py
<https://github.com/PyMVPA/PyMVPA/blob/master/mvpa2/generators/splitters.py>:

line 119, before: for isplit, split in enumerate(cfgs):
line 119, after:  for isplit, split in enumerate(itertools.combinations(cfgs, self.__combination_val)):

line 133, before: filter_ = splattr_data == split
line 133, after:  filter_ = np.array([i in split for i in splattr_data])

(plus an "import itertools" at the top of the module, which the new line 119
needs)
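
For anyone who would rather not patch splitters.py, here is a minimal
standalone sketch of the same idea. It is illustrative only (not the actual
patched class); it assumes nothing beyond ds.fa[attr] and boolean feature
indexing, both standard PyMVPA Dataset behavior:

    import itertools
    import numpy as np

    class CombinatorialSplitter(object):
        """Yield one dataset per combination of `count` unique values
        of the feature attribute `attr`."""
        def __init__(self, attr, count):
            self._attr = attr
            self._count = count

        def generate(self, ds):
            values = np.unique(ds.fa[self._attr].value)
            for combo in itertools.combinations(values, self._count):
                # boolean mask: keep features whose attribute value is
                # in the current combination
                mask = np.array([v in combo for v in ds.fa[self._attr].value])
                yield ds[:, mask]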

My plan then is to run the analysis like so (each dataset has
fa['network_ids'], with a number identifying each network; I haven't fully
tested this yet):

from mvpa2.measures.base import CrossValidation
from mvpa2.datasets import hstack

splitter = CombinatorialSplitter('network_ids', 25)
# clf (classifier), nf (partitioner), and corr_error defined as normal
cv = CrossValidation(clf, nf, errorfx=corr_error,
                     pass_attr=[('fa.network_ids', 'fa', 1)])
res = []
for i in splitter.generate(ds):
    res.append(cv(i))
res_all = hstack(res)

Then res_all is 3 x 26 (3 runs; 26 columns because choosing 25 of the 26
networks gives C(26, 25) = 26 combinations), with a row for each run and a
column for each combination, and each column carries fa.network_ids
identifying which networks were included. My plan was then to use these data
to perform a calculation like the one I described in my last message, as
sketched below.
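
To make that concrete, here is a hypothetical sketch of the calculation
(names are illustrative; rather than relying on the pass_attr bookkeeping, it
regenerates the combinations in the same order the splitter produced them):

    import itertools
    import numpy as np

    values = np.unique(ds.fa.network_ids)
    # same order as CombinatorialSplitter yields the splits
    combos = list(itertools.combinations(values, 25))
    mean_err = res_all.samples.mean(axis=0)  # mean error per combination

    # for each network: mean error of analyses that excluded it minus mean
    # error of analyses that included it (positive means the network helped)
    contribution = {}
    for net in values:
        has = np.array([net in c for c in combos])
        contribution[net] = mean_err[~has].mean() - mean_err[has].mean()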

On Thu, Dec 10, 2015 at 2:34 PM, Bill Broderick <billbrod at gmail.com> wrote:

> Sorry, I realize my description of the problem wasn't very clear.
>
> ds.summary() looks like this (after dropping the time points without
> targets):
> Dataset: 54x26 at float64, <sa: chunks,targets>, <fa: network_ids>
> stats: mean=0.0879673 std=1.08415 var=1.17538 min=-3.56651 max=4.62532
> No details due to large number of targets or chunks. Increase maxc and
> maxt if desired
> Summary for chunks across targets
>   chunks  mean  std  min max #targets
>     1    0.333 0.471  0   1     18
>     2    0.333 0.471  0   1     18
>     3    0.333 0.471  0   1     18
> Number of unique targets > 20 thus no sequence statistics
>
> "Volume for each volume" was a typo, I meant value for each volume. For
> this subject, the full dataset is 1452 x 26, where we have 1452 time points
> (across all runs) and 26 time courses. We then label each time point as
> with the reaction time if it happens during an event (we're regressing to
> the reaction time) or 'other' if it doesn't. We then are left with 54 time
> points, across 3 runs.
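>
> In PyMVPA terms that labeling-and-dropping step is roughly the following (a
> sketch; rt is a hypothetical array holding the reaction time for event
> volumes and np.nan elsewhere):
>
>     import numpy as np
>
>     ds.sa['targets'] = rt              # reaction time, or np.nan off-event
>     ds = ds[~np.isnan(ds.sa.targets)]  # keep only labeled time points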
>
> Re: Richard's comment, we're interested in problem 1: we want to evaluate
> the predictive power of each feature. We hypothesize that three of these
> time courses are much more important than the rest. Based on earlier work
> in the lab (Carter et al. 2012
> <https://www.sciencemag.org/content/337/6090/109.full>), I was thinking
> of using something similar to the Unique Combinatorial Performance (UCP) to
> evaluate the contribution of each time course. UCP was used for pairs of
> ROIs: For each ROI i, the UCP is the average additional classification
> accuracy observed when a model was constructed using data from i with a
> second ROI j (for every j). Since we're looking at individual time courses,
> I was thinking of looking at the difference in accuracy between analyses
> containing time course i and those not containing it (or between analyses
> containing time courses i and j and those not containing them, etc.), as
> written out below.
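>
> In formula form (my paraphrase, writing A(S) for the accuracy of a model
> trained on the set of time courses S):
>
>     UCP(i) = mean over j != i of [ A({i, j}) - A({j}) ]
>
> and the variant proposed here:
>
>     contribution(i) = mean of A(S) over sets S containing i
>                       - mean of A(S) over sets S not containing i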
>
> Does that make sense?
>
> I agree that feature selection methods don't seem appropriate here and
> something like what I outlined above seems intuitively appealing -- does it
> sound like a reasonable way to evaluate predictive power of each time
> course?
>
> Thanks,
> Bill
>
> On Wed, Dec 9, 2015 at 7:10 PM, Richard Dinga <dinga92 at gmail.com> wrote:
>
>> Bill Broderick wrote:
>>
>> > However, to determine which timecourse is contributing the most to the
>> > classifier's performance,
>>
>> > see which timecourses or which combination
>> > of time courses caused the greatest drop in performance when removed.
>>
>> I wrote:
>> > You might take a look at the Relief algorithm (also implemented in
>> > PyMVPA), which is a less hacky approach to your feature weighting problem.
>>
>>
>> Yaroslav Halchenko wrote:
>>
>> > there is yet another black hole of methods to assess the contribution of
>> > each feature to the performance of the classifier.  The I-RELIEF, which
>> > was mentioned, is one of them...
>>
>> > So what is your classification performance if you just do
>> > classification on all features?  Which one would you obtain if you do
>> > feature selection, e.g. with SplitRFE (which would eliminate features to
>> > attain the best performance within each CV fold, in nested CV)?
>>
>>
>> I think there are (at least) 2 separate problems.
>>
>> 1. How to evaluate the predictive power of every feature in order to
>> interpret the data.
>> 2. How to evaluate the importance of features for a classifier in order to
>> understand the model and possibly select a set of features to get the best
>> performance.
>>
>> Feature selection methods like Lasso or RFE (as far as I know) would omit
>> most of the redundant/highly correlated features, therefore making 1.
>> impossible. It still might be a good idea for other reasons.
>>