[pymvpa] Pkg-ExpPsy-PyMVPA Digest, Vol 66, Issue 4
jason.d.gors at dartmouth.edu
Wed Aug 7 15:47:29 UTC 2013
> Today's Topics:
> 1. Re: No samples of a class in a chunk (J.A. Etzel)
> Message: 1
> Date: Tue, 06 Aug 2013 16:04:12 -0500
> From: "J.A. Etzel" <jetzel at artsci.wustl.edu>
> To: pkg-exppsy-pymvpa at lists.alioth.debian.org
> Subject: Re: [pymvpa] No samples of a class in a chunk
> It doesn't look like anyone's replied to this yet, so here's my two cents.
> I think of this sort of situation as a case of imbalance - there aren't
> equal numbers of examples of each class in each training/testing set
> (aka chunk). This happens in all sorts of situations, such as when which
> trials are included depends upon participant behavior (e.g.
> correctly-performed trials).
> There isn't a universally appropriate strategy to regain balance, but
> either the chunks or the examples will need to be changed.
> For example, in one dataset we wanted to do leave-one-run-out
> cross-validation, but the imbalance was too great (e.g. some runs with
> very few examples), so we combined runs, for leave-three-runs-out
> cross-validation. We combined temporally adjacent runs (e.g. 1-3, 4-6,
> 7-9) to make sure we didn't somehow inflate the accuracy.
Could you explain the reasoning behind combining temporally adjacent runs
to avoid inflating accuracy scores? First, I'm assuming that
what you're claiming is that adjacent runs would likely be more similar to
one another than farther apart runs -- so, like you mention, leaving out
adjacent runs might give you lower accuracy scores for those left out runs,
but why is something like this more desirable than, say, just leaving out a
random split of 1/3 of the data? Unless you had some kind of reason to
look at the temporal nature of classification with previous and/or
subsequent runs, it's not clear to me why this is needed. Maybe I'm
missing some detail of the pipeline, but I don't fully
understand the reasoning. Also, it seems that leaving out runs 1-3 or 7-9
(with 9 being the last run) could fulfill that assumption nicely, but if
4-6 were used as the left out split, then this seems less likely to fit
that assumption -- run 4 would be just as similar (i.e. temporally close)
to run 3 as to run 5, and run 6 would be just as similar to run 5 as to run
7; likewise, any set of 3 adjacent runs other than the first 3 or the last
3 would have the same issue: 2 of the 3 runs in the triple would be just
as similar to non-left-out runs as to the other runs in that left-out
triple anyway.
Whereas the first 3 runs would have only 1 run, run 3, as temporally close
to non-left-out runs; likewise for run 7 in the last triple of runs. Why
not just use a random split of 1/3 of the data, or split-halves (or
something similar), for the held-out data?
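
To make the counting above concrete, here is a minimal sketch in plain
Python (not PyMVPA). The helper names contiguous_folds, random_folds and
boundary_runs are made up for illustration, and it assumes 9 runs numbered
1-9 with "temporally close" meaning an immediately neighboring run. For
each held-out triple it lists the held-out runs that have a neighbor in
the training set.

```python
import random

def contiguous_folds(runs, size):
    """Group temporally adjacent runs into folds: [1, 2, 3], [4, 5, 6], ..."""
    return [runs[i:i + size] for i in range(0, len(runs), size)]

def random_folds(runs, size, seed=0):
    """Assign runs to folds at random, ignoring temporal order."""
    shuffled = runs[:]
    random.Random(seed).shuffle(shuffled)
    return [sorted(shuffled[i:i + size]) for i in range(0, len(shuffled), size)]

def boundary_runs(test_runs, all_runs):
    """Held-out runs that have an immediately adjacent run in the training set."""
    train = set(all_runs) - set(test_runs)
    return [r for r in test_runs if (r - 1) in train or (r + 1) in train]

runs = list(range(1, 10))  # assumption: 9 runs, numbered in acquisition order

for name, folds in [("adjacent", contiguous_folds(runs, 3)),
                    ("random", random_folds(runs, 3))]:
    for test in folds:
        print(name, "held out:", test, "next to training runs:",
              boundary_runs(test, runs))
```

With the adjacent grouping, the triples 1-3, 4-6 and 7-9 have boundary
runs [3], [4, 6] and [7] respectively, which is the asymmetry described
above. The random grouping depends on the seed, but it will typically put
more held-out runs next to training runs.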
Dept. of Psychological and Brain Sciences
6207 Moore Hall
Hanover, NH 03755
Phone: (603) 646-9689 Fax: (603) 646-1419