[pymvpa] same sensitivity values for all dataset splits from cross validation using tutorial script snippet
Michael Hanke
mih at debian.org
Mon Jun 30 19:33:42 UTC 2014
Hi,
I am sorry you had to go through this...
On Mon, Jun 30, 2014 at 02:18:07PM +0100, Meng Liang wrote:
> I am trying to obtain the sensitivity values for all splits of the
> dataset during leave-one-out cross-validation (classification using
> SVM). I found in the tutorial "Classification Model Parameters –
> Sensitivity Analysis" (
> http://www.pymvpa.org/tutorial_sensitivity.html ) that
> RepeatedMeasure(sensana, NFoldPartitioner()) should give the
> sensitivity values for each fold. Here is the code snippet I used in
> my script, slightly adapted from the tutorial:
>
> clf = LinearNuSVMC()
> cv = CrossValidation(clf, NFoldPartitioner(),enable_ca=['stats'])
> sensana = clf.get_sensitivity_analyzer()
> cv_sensana = RepeatedMeasure(sensana, NFoldPartitioner())
> error = cv(ds)
> sensmap_cv = cv_sensana(ds)
> 'print sensmap_cv.shape'
>
> gave me: (14L, 87L).
>
> I have 14 subjects and I am using leave-one-subject-out
> cross-validation, and there are 87 features. So the data structure
> seems correct. However, when I look at the values of this 14x87 array,
> all the rows in the array contain exactly the same values (i.e., the
> first row looks the same as all the other rows).
I am afraid you found a bug in the documentation (more specifically a
bit of code that has not been properly adjusted when we switched from
dataset splitters to dataset partitioners -- just mentioning it for
those who have been around for that long...).
The reason for the behavior you observe is that, in contrast to what is
advertised in the tutorial, RepeatedMeasure does not split any dataset.
It does what it says on the label: it repeats a measure, for whatever
datasets come out of the provided generator -- in your case
NFoldPartitioner. However, partitioners only add a sample attribute to a
dataset that indicates the current partitioning scheme -- they do not
split a dataset -- hence you are actually computing sensitivities,
repeatedly, from the identical dataset.
If you want to compute the sensitivities on the respective training
samples of each data fold (which I think you do) you need to change that
line to:
    cv_sensana = RepeatedMeasure(sensana,
                                 ChainNode((NFoldPartitioner(),
                                            Splitter('partitions',
                                                     attr_values=(1,)))))
This change amends the partitioner with a splitter that actually takes
out the training samples of each fold and feeds them into the
sensitivity measure.
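
To see the difference in isolation, here is a toy sketch in plain Python. It
does not use PyMVPA at all: partition(), take_training() and measure() are
hypothetical stand-ins that only mimic the behavior described here, and the
labels 1 (training) and 2 (testing) are assumed to follow the convention
implied by attr_values=(1,).

```python
# Toy illustration only: these helpers are hypothetical stand-ins,
# not the PyMVPA API.

def partition(samples, chunks):
    """Mimic a partitioner: yield the FULL dataset once per fold,
    plus a per-sample label (1 = training, 2 = testing)."""
    for held_out in sorted(set(chunks)):
        labels = [2 if c == held_out else 1 for c in chunks]
        yield samples, labels

def take_training(samples, labels):
    """Mimic a splitter: keep only the samples labeled 1 (training)."""
    return [s for s, l in zip(samples, labels) if l == 1]

def measure(samples):
    """Stand-in for a sensitivity measure: here just the mean."""
    return sum(samples) / len(samples)

samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
chunks = [0, 0, 1, 1, 2, 2]  # e.g. three subjects, two samples each

# Partitioning only: the measure sees the identical dataset in every fold.
unsplit = [measure(s) for s, _ in partition(samples, chunks)]

# Partitioning followed by splitting: the measure sees each fold's
# training samples only.
split = [measure(take_training(s, l)) for s, l in partition(samples, chunks)]
```

Here unsplit holds the same value for every fold, which is the symptom you
reported, while split differs from fold to fold.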
> A related question about normalizing the sensitivity values: in the
> "Closing Words" of the tutorial on the same webpage, it says: "It
> should also be noted that sensitivities can not be directly compared
> to each other, even if they stem from the same algorithm and are just
> computed on different dataset splits. In an analysis one would have to
> normalize them first." My question is: if we cannot compare the
> sensitivity values from different data splits without normalizing them
> first, why can we average them or take the maximum value across data
> splits without applying any normalization (the example script snippets
> in the tutorial seem to do so)? I would imagine that the average or
> the max value would also be affected by the scale of the data.
Yes, you are right: they could be normalized even more (the dataset in
the tutorial, however, is from a single subject and was z-scored
upfront, so it is not that bad...).
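
As a rough sketch of what normalizing before combining could look like, here
is a plain-Python example that L2-normalizes each fold's sensitivity vector
before averaging across folds. The helper names and the numbers are made up
for illustration, and L2 normalization is only one possible choice, not
necessarily the right one for your analysis.

```python
import math

# Hypothetical helpers for illustration; not part of PyMVPA.

def l2_normalize(row):
    """Scale a sensitivity vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else list(row)

def mean_over_folds(rows):
    """Average feature-wise across folds (rows = folds, columns = features)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

# Two made-up folds with the same pattern but a tenfold difference in scale.
fold_sens = [
    [3.0, 4.0],
    [30.0, 40.0],
]

# Without normalization the larger-scale fold dominates the average.
raw_mean = mean_over_folds(fold_sens)

# With normalization both folds contribute equally.
normed_mean = mean_over_folds([l2_normalize(r) for r in fold_sens])
```

In this made-up case both folds have the same normalized pattern, so the
normalized mean equals that pattern, whereas the raw mean is pulled towards
the fold with the larger scale.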
Sorry for the bug. I filed a bug report and we'll fix it ASAP.
Michael
--
J.-Prof. Dr. Michael Hanke
Psychoinformatik Labor, Institut für Psychologie II
Otto-von-Guericke-Universität Magdeburg, Universitätsplatz 2, Geb.24
Tel.: +49(0)391-67-18481 Fax: +49(0)391-67-11947 GPG: 4096R/7FFB9E9B