[pymvpa] balancing leave-one-out
Yaroslav Halchenko
debian at onerussian.com
Fri Aug 22 01:56:13 UTC 2014
On Sun, 17 Aug 2014, Ben Acland wrote:
> ... is probably not exactly the right subject for this question, but here
> goes.
> I'm having some issues with leave-one-out cross-validation. Specifically,
> I've got two subject groups and am having trouble getting NFoldPartitioner
> to leave the same number of samples from each group in the training
> dataset.
> A little more detail. My dataset has two subject groups and three trial
> types. I've already made sure that the dataset (ds) is balanced. Below, I
> make a dataset containing only one trial type (whose name is contained in
> the variable 'ename'), then try to set up cross-validation as described:
> clf_ds = ds[ds.sa.name == ename, ds.fa.event_offsetidx == offset]
> clf_ds.sa["targets"] = clf_ds.sa.sub_group
> rep = mp.Repeater(count=PERM_COUNT)
> perm = mp.AttributePermutator('sub_group', limit={'partitions': 1},
>                               count=1)
> part = mp.NFoldPartitioner(attr="subject")
> clf = mp.LinearCSVMC()
> null_cv = mp.CrossValidation(
>     clf,
>     mp.ChainNode([part, perm], space=part.get_space()),
>     postproc=mp.mean_sample())
> distr_est = mp.MCNullDist(rep,
>                           tail='left',
>                           measure=null_cv,
>                           enable_ca=['dist_samples'])
> cvmcc = mp.CrossValidation(clf,
>                            part,
>                            postproc=mp.mean_sample(),
>                            null_dist=distr_est,
>                            enable_ca=['stats'])
> result = cvmcc(clf_ds)
> So, I'm clearly doing something wrong. If I build the generator myself,
> then look at its output, I get this:
> gen = mp.ChainNode([part, perm], space=part.get_space())
> asdf = gen.generate(clf_ds)
> dd = asdf.next()
> [(p, sg, len(np.where(np.logical_and(dd.sa.partitions == p,
>                                      dd.sa.sub_group == sg))[0]))
>  for p in np.unique(dd.sa.partitions)
>  for sg in np.unique(ds.sa.sub_group)]
> [(1, 'ctrl', 89), (1, 'scz', 83), (2, 'ctrl', 0), (2, 'scz', 6)]
> Wah wahh, not what we're looking for.
> I tried another approach using Balancer, but got a number of other
> exceptions. I can go into that if needed, but there's gotta be a simple
> way to do what I'm trying to do. To be specific, for each fold I want to
> leave out one sample from group 'ctrl,' and one from group 'scz.'
;-) I first searched the post for a "?" but didn't find one... so I found
this, which I would take as the question: sounds like you might achieve it
(maybe not in the most elegant way) through use of Sifter (copy/pasted from
our unittests):
npart = ChainNode([
    ## so we split based on superord
    NFoldPartitioner(len(ds.sa['superord'].unique),
                     attr='subord'),
    ## so it should select only those splits where we took 1 from
    ## each of the superord categories, leaving things in balance
    Sifter([('partitions', 2),
            ('superord',
             {'uvalues': ds.sa['superord'].unique,
              'balanced': True})
            ]),
    ], space='partitions')
A bit convoluted, but generic... let me know if it's still tricky or not
appropriate.
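Adapted to your dataset, a minimal sketch could look like the following. It
assumes the attribute names from your snippet ('subject' holding per-sample
subject IDs, 'sub_group' the group labels) and that every subject contributes
the same number of samples (Sifter's 'balanced' check needs equal counts);
the per-fold count check at the end mirrors your earlier diagnostic:

import numpy as np
import mvpa2.suite as mp

# leave out N subjects at a time, where N = number of sub_groups (here 2),
# then keep only those splits whose testing partition (2) contains an
# equal number of samples from each sub_group
npart = mp.ChainNode([
    mp.NFoldPartitioner(len(clf_ds.sa['sub_group'].unique),
                        attr='subject'),
    mp.Sifter([('partitions', 2),
               ('sub_group',
                {'uvalues': clf_ds.sa['sub_group'].unique,
                 'balanced': True})]),
    ], space='partitions')

# verify balance per fold, mirroring the diagnostic above
for dd in npart.generate(clf_ds):
    print([(p, sg, np.sum((dd.sa.partitions == p)
                          & (dd.sa.sub_group == sg)))
           for p in np.unique(dd.sa.partitions)
           for sg in np.unique(dd.sa.sub_group)])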
> The answer is probably sitting right there staring me in the face, but if
> so then it's been staring me in the face for days now with no result.
It seems to be becoming a more and more frequent use case for
cross-validation in factorial designs -- we should just cook up a proper
dedicated partitioner for it, I guess.
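Such a dedicated partitioner might then be used along these lines -- a sketch
only, with a FactorialPartitioner-style interface assumed here (wrapping
leave-one-subject-out so that each fold stays balanced across 'sub_group'),
not something confirmed in this thread:

# hypothetical usage of a dedicated balanced partitioner (interface assumed):
# leave out one subject per 'sub_group' level in each fold
part = mp.FactorialPartitioner(mp.NFoldPartitioner(attr='subject'),
                               attr='sub_group')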
--
Yaroslav O. Halchenko, Ph.D.
http://neuro.debian.net http://www.pymvpa.org http://www.fail2ban.org
Research Scientist, Psychological and Brain Sciences Dept.
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834 Fax: +1 (603) 646-1419
WWW: http://www.linkedin.com/in/yarik