[pymvpa] Splitter and nperlabel='equal'

Yaroslav Halchenko debian at onerussian.com
Fri Mar 26 19:00:20 UTC 2010


I am sorry for being didactic today, but

why don't you simply look at what the splitter gives you?

I.e., something like what I do below on a silly dataset:

In [21]: print ds.summary()
Dataset / float64 10 x 1
uniq: 2 labels 2 chunks
stats: mean=-0.0652967 std=1.17973 var=1.39176 min=-2.27894 max=1.58153

Counts of labels in each chunk:
  chunks\labels  0   1
                --- ---
       0         1   2
       1         3   4

Summary per label across chunks
  label mean std min max #chunks
   0      2   1   1   3     2
   1      3   1   2   4     2

Summary per chunk across labels
  chunk mean std min max #labels
   0     1.5 0.5  1   2     2
   1     3.5 0.5  3   4     2


In [22]: print '\n'.join([d.summary() for d in list(NFoldSplitter()(ds))[0]])
Dataset / float64 7 x 1
uniq: 2 labels 1 chunks
stats: mean=-0.402729 std=1.1694 var=1.36749 min=-2.27894 max=1.25512

Counts of labels in each chunk:
  chunks\labels  0   1
                --- ---
       1         3   4

Summary per label across chunks
  label mean std min max #chunks
   0      3   0   3   3     1
   1      4   0   4   4     1

Summary per chunk across labels
  chunk mean std min max #labels
   1     3.5 0.5  3   4     2

Dataset / float64 3 x 1
uniq: 2 labels 1 chunks
stats: mean=0.722045 std=0.750201 var=0.562802 min=-0.246372 max=1.58153

Counts of labels in each chunk:
  chunks\labels  0   1
                --- ---
       0         1   2

Summary per label across chunks
  label mean std min max #chunks
   0      1   0   1   1     1
   1      2   0   2   2     1

Summary per chunk across labels
  chunk mean std min max #labels
   0     1.5 0.5  1   2     2


In [23]: print '\n'.join([d.summary() for d in list(NFoldSplitter(nperlabel='equal')(ds))[0]])
Dataset / float64 6 x 1
uniq: 2 labels 1 chunks
stats: mean=-0.612826 std=1.13421 var=1.28642 min=-2.27894 max=1.25512

Counts of labels in each chunk:
  chunks\labels  0   1
                --- ---
       1         3   3

Summary per label across chunks
  label mean std min max #chunks
   0      3   0   3   3     1
   1      3   0   3   3     1

Summary per chunk across labels
  chunk mean std min max #labels
   1      3   0   3   3     2

Dataset / float64 2 x 1
uniq: 2 labels 1 chunks
stats: mean=0.292305 std=0.538677 var=0.290173 min=-0.246372 max=0.830983

Counts of labels in each chunk:
  chunks\labels  0   1
                --- ---
       0         1   1

Summary per label across chunks
  label mean std min max #chunks
   0      1   0   1   1     1
   1      1   0   1   1     1

Summary per chunk across labels
  chunk mean std min max #labels
   0      1   0   1   1     2

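In case you can't run PyMVPA right now, the effect of nperlabel='equal'
shown above amounts to subsampling each label down to the count of the
least frequent label within the split. Here is a rough plain-Python
sketch of that idea (a hypothetical helper, NOT PyMVPA's actual
implementation):

```python
import random
from collections import Counter

def balance_equal(samples, labels, rng=random.Random(0)):
    """Keep the same number of samples per label: the count of the
    least frequent one. Surplus samples are dropped at random.
    (Illustrative only -- not the PyMVPA code path.)"""
    counts = Counter(labels)
    nmin = min(counts.values())
    keep = []
    for lab in counts:
        idx = [i for i, l in enumerate(labels) if l == lab]
        keep.extend(rng.sample(idx, nmin))
    keep.sort()
    return [samples[i] for i in keep], [labels[i] for i in keep]

# chunk 1 from the summary above: 3 samples of label 0, 4 of label 1
labels = [0, 0, 0, 1, 1, 1, 1]
samples = list(range(7))
s, l = balance_equal(samples, labels)
print(Counter(l))   # each label reduced to 3 samples
```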



On Fri, 26 Mar 2010, Timothy Vickery wrote:

>    Hi,

>    I have a couple of questions about the nperlabel parameter of the
>    Splitter class (NFoldSplitter, actually). I have unequal numbers of
>    each class within each scan, and also across scans, so I have been
>    manually balancing the number of exemplars used from each class in each
>    chunk by throwing out random trials from the over-represented class
>    before classification. I'd like to take advantage of the
>    nperlabel='equal' option on my splitter to do this for me, but I have a
>    couple of questions about how this affects the error rate, which I
>    could not figure out from the documentation (sorry if I missed
>    something obvious):

>    - Suppose I am using NFoldSplitter to leave one chunk out, and I have
>    11 examples of C1 and 13 examples of C2 in chunk 1, but only 8 C1 and
>    10 C2 for chunk 2. Will the NFoldSplitter with nperlabel='equal' force
>    the number of examples of each category from each chunk down to 8? Or
>    will it use 11 of each class for chunk 1, and 8 of each class for chunk
>    2?

>    - If it is the latter (balanced separately within chunks), how is the
>    error rate determined with the CrossValidatedTransferError class? Does
>    the error rate reflect the simple average error across folds (error run
>    1 + error run 2)/2, or is the average weighted by the number of
>    exemplars from each fold (equivalent to the total error / total number
>    of tests)? If it is averaging fold performance, is there a way to force
>    it to report the overall test case performance, instead? The simple
>    average over fold performance would seem to be skewed by better or
>    worse performance on chunk 2 in the example above, since it has fewer
>    test cases.

>    Thanks for your help!

>    -Tim

> _______________________________________________
> Pkg-ExpPsy-PyMVPA mailing list
> Pkg-ExpPsy-PyMVPA at lists.alioth.debian.org
> http://lists.alioth.debian.org/mailman/listinfo/pkg-exppsy-pymvpa


-- 
                                  .-.
=------------------------------   /v\  ----------------------------=
Keep in touch                    // \\     (yoh@|www.)onerussian.com
Yaroslav Halchenko              /(   )\               ICQ#: 60653192
                   Linux User    ^^-^^    [175555]
