[pymvpa] balancing leave-one-out
Yaroslav Halchenko
debian at onerussian.com
Fri Aug 22 01:56:13 UTC 2014
On Sun, 17 Aug 2014, Ben Acland wrote:
> ... is probably not exactly the right subject for this question, but here
> goes.
> I'm having some issues with leave-one-out cross-validation. Specifically,
> I've got two subject groups and am having trouble getting NFoldPartitioner
> to leave the same number of samples from each group in the training
> dataset.
> A little more detail. My dataset has two subject groups and three trial
> types. I've already made sure that the dataset (ds) is balanced. Below, I
> make a dataset containing only one trial type (whose name is contained in
> the variable 'ename'), then try to set up cross-validation as described:
> clf_ds = ds[ds.sa.name == ename, ds.fa.event_offsetidx == offset]
> clf_ds.sa["targets"] = clf_ds.sa.sub_group
> rep = mp.Repeater(count=PERM_COUNT)
> perm = mp.AttributePermutator('sub_group', limit={'partitions': 1},
>                               count=1)
> part = mp.NFoldPartitioner(attr="subject")
> clf = mp.LinearCSVMC()
> null_cv = mp.CrossValidation(
>     clf,
>     mp.ChainNode([part, perm], space=part.get_space()),
>     postproc=mp.mean_sample())
> distr_est = mp.MCNullDist(rep,
>                           tail='left',
>                           measure=null_cv,
>                           enable_ca=['dist_samples'])
> cvmcc = mp.CrossValidation(clf,
>                            part,
>                            postproc=mp.mean_sample(),
>                            null_dist=distr_est,
>                            enable_ca=['stats'])
> result = cvmcc(clf_ds)
> So, I'm clearly doing something wrong. If I build the generator myself,
> then look at its output, I get this:
> gen = mp.ChainNode([part, perm], space=part.get_space())
> asdf = gen.generate(clf_ds)
> dd = asdf.next()
> [(p, sg, len(np.where(np.logical_and(dd.sa.partitions == p,
>                                      dd.sa.sub_group == sg))[0]))
>  for p in np.unique(dd.sa.partitions)
>  for sg in np.unique(ds.sa.sub_group)]
> [(1, 'ctrl', 89), (1, 'scz', 83), (2, 'ctrl', 0), (2, 'scz', 6)]
> Wah wahh, not what we're looking for.
> I tried another approach using Balancer, but got a number of other
> exceptions. I can go into that if needed, but there's gotta be a simple
> way to do what I'm trying to do. To be specific, for each fold I want to
> leave out one sample from group 'ctrl,' and one from group 'scz.'
;-) I first searched the post for a "?" but didn't find one... so I found
this, which I would take as the question: sounds like you might achieve it
(maybe not in the most elegant way) through use of Sifter (copy/pasted from
our unittests):
npart = ChainNode([
    ## so we split based on superord
    NFoldPartitioner(len(ds.sa['superord'].unique),
                     attr='subord'),
    ## so it should select only those splits where we took 1 from
    ## each of the superord categories, leaving things in balance
    Sifter([('partitions', 2),
            ('superord',
             {'uvalues': ds.sa['superord'].unique,
              'balanced': True})
            ]),
    ], space='partitions')
A bit convoluted, but generic... let me know if it's still tricky or not
appropriate.
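Adapted to your dataset, a minimal sketch could look like the following. It
assumes the attribute names from your snippet ('subject' holding per-sample
subject IDs, 'sub_group' the group labels) and that every subject contributes
the same number of samples (Sifter's 'balanced' check needs equal counts);
the per-fold count check at the end mirrors your earlier diagnostic:

import numpy as np
import mvpa2.suite as mp

# leave out N subjects at a time, where N = number of sub_groups (here 2),
# then keep only those splits whose testing partition (2) contains an
# equal number of samples from each sub_group
npart = mp.ChainNode([
    mp.NFoldPartitioner(len(clf_ds.sa['sub_group'].unique),
                        attr='subject'),
    mp.Sifter([('partitions', 2),
               ('sub_group',
                {'uvalues': clf_ds.sa['sub_group'].unique,
                 'balanced': True})]),
    ], space='partitions')

# verify balance per fold, mirroring the diagnostic above
for dd in npart.generate(clf_ds):
    print([(p, sg, np.sum((dd.sa.partitions == p)
                          & (dd.sa.sub_group == sg)))
           for p in np.unique(dd.sa.partitions)
           for sg in np.unique(dd.sa.sub_group)])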
> The answer is probably sitting right there staring me in the face, but if
> so then it's been staring me in the face for days now with no result.
It seems to be becoming a more and more frequent use case for
cross-validation in factorial designs -- we should just cook up a proper
dedicated partitioner for it, I guess.
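Such a dedicated partitioner might then be used along these lines -- a sketch
only, with a FactorialPartitioner-style interface assumed here (wrapping
leave-one-subject-out so that each fold stays balanced across 'sub_group'),
not something confirmed in this thread:

# hypothetical usage of a dedicated balanced partitioner (interface assumed):
# leave out one subject per 'sub_group' level in each fold
part = mp.FactorialPartitioner(mp.NFoldPartitioner(attr='subject'),
                               attr='sub_group')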
--
Yaroslav O. Halchenko, Ph.D.
http://neuro.debian.net http://www.pymvpa.org http://www.fail2ban.org
Research Scientist, Psychological and Brain Sciences Dept.
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834 Fax: +1 (603) 646-1419
WWW: http://www.linkedin.com/in/yarik