[pymvpa] mvpa2.clfs.transerror.chisquare

Yaroslav Halchenko debian at onerussian.com
Fri May 12 19:00:39 UTC 2017


On Fri, 12 May 2017, Matteo Visconti di Oleggio Castello wrote:

>      On May 12, 2017, at 10:58, Yaroslav Halchenko <debian at onerussian.com>
>      wrote:
>      just to prevent possible confusion let me first state that chi2 (one way
>      or another) is not appropriate for majority of our use cases as a
>      criterion to state that we have an effect of interest!  why? because

>      In [9]: print chi2_contingency(np.array([[10,0],[0,10]]),
>      correction=False)[1]
>      7.74421643104e-06

>      In [10]: print chi2_contingency(np.array([[0,10],[10,0]]),
>      correction=False)[1]
>      7.74421643104e-06

>      i.e. it is not representative of accuracy but representative of some
>      notion of "information" present in the contingency table, and
>      perfect misclassification would be considered as good as perfect correct
>      classification.

>    Agreed that using chi square might not be the best way to test a
>    classifier's accuracy (for example, it won't catch biased classifiers,
>    and my intuition is that permutation testing would be better for that).
>    However, wouldn't perfect misclassification be as informative as perfect
>    classification? Or what does it mean when you have perfect
>    misclassification? When would you expect that such a case would happen
>    with real data (and does it happen)?

you can get perfect misclassification if you have no signal and the
classifier is susceptible to class imbalance (e.g. stats-based
classifiers such as GNB, LDA, etc. wouldn't care as much; SVM would)...
Suppose you have a perfectly balanced dataset and then do
leave-one-sample-out. Each training set then carries a slight imbalance
toward the class opposite the held-out sample, which the classifier
chooses to assign to any testing data, while the test sample has the
opposite label -- perfect misclassification
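A minimal sketch of that scenario (hypothetical illustration, not PyMVPA code): with no signal, a classifier that simply follows the training-set imbalance (here modeled as a majority-class vote) gets every fold wrong under leave-one-sample-out on a balanced dataset:

```python
# Perfectly balanced dataset, no signal in the features at all
labels = [0] * 5 + [1] * 5

correct = 0
for i, true_label in enumerate(labels):
    train = labels[:i] + labels[i + 1:]  # leave one sample out
    # "majority class" stands in for any imbalance-sensitive
    # classifier (e.g. SVM) that, lacking signal, follows the
    # slight training imbalance left by the held-out sample
    prediction = max(set(train), key=train.count)
    correct += (prediction == true_label)

accuracy = correct / len(labels)
print(accuracy)  # 0.0 -- perfect misclassification
```

Holding out a sample of class A leaves the training set with one fewer A, so the majority vote is always the opposite class of the test sample.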

but there were also CS papers about what special layouts of data points
could lead to systematic misclassification. someone would need to
search the history of the list here ;)

what we see in reality at times (as was also reported on the list) is
some bias toward misclassification. Sometimes it gets avoided by
changing the partitioning or preprocessing, without clearly grasping
what led to it in the first place ;)

>      to add more to the confusion, here in the confusion matrix we actually
>      report
>          stats['CHI^2'] = chisquare(self.__matrix, exp='indep_rows')
>      ;-)

>    As Marco pointed out, pymvpa computes goodness-of-fit, and this is the
>    output with your example:
>    In [53]: mchisquare(np.array([[10, 0], [0, 10]]), exp='indep_rows')
>    Out[53]: (20.0, 0.00016974243555282632)
>    In [54]: mchisquare(np.array([[0, 10], [10, 0]]), exp='indep_rows')
>    Out[54]: (20.0, 0.00016974243555282632)

but IIRC it is not a "plain" goodness of fit on the values of the
contingency table (as it would be with exp='uniform') because
exp='indep_rows' derives the expected frequencies from the marginals
under independence ;-)
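The numbers above can be reproduced with plain scipy (a sketch of my reading of exp='indep_rows', not the PyMVPA internals): the expected counts come from the table's marginals under independence, but the statistic is then evaluated as a flat goodness-of-fit over all four cells (df = 4 - 1 = 3), unlike chi2_contingency's df = (rows-1)*(cols-1) = 1:

```python
import numpy as np
from scipy.stats import chisquare

obs = np.array([[10, 0], [0, 10]])

# expected counts under independence, from the row/column marginals
# (here every cell comes out to 5)
exp = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()

# flat goodness-of-fit over the 4 cells, so df = 4 - 1 = 3
stat, p = chisquare(obs.ravel(), f_exp=exp.ravel())
print(stat, p)  # 20.0 0.000169742...
```

The different df is why the p-value (1.70e-4) differs from chi2_contingency's 7.74e-6 even though the statistic is 20.0 in both cases.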

-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834                       Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik        



More information about the Pkg-ExpPsy-PyMVPA mailing list