[pymvpa] effect of signal on null distributions
jetzel at artsci.wustl.edu
Wed Feb 13 23:06:34 UTC 2013
I think (at least some of the) signal-related null distribution effects
are coming from how we made the simulated datasets: I generated data
using (what I think is something like) the
mvpa2.misc.data_generators.normal_feature_dataset command and started
getting very non-normal, signal-dependent null distributions, despite
not changing any other part of my code.
I looked at how the simulated datasets were made after a comment from a
colleague, who suggested looking at the variance of the input data: how
does the variance change as the signal changes?
In my original simulation the variance increased minimally with signal.
This is how I made the values for one voxel:
c(rnorm(40, mean=-0.15, sd=1), rnorm(40, mean=0.15, sd=1));
In words, I chose 40 entries from a normal distribution with
standard deviation 1 and mean -0.15 to be the 40 values for that voxel
in class A, and 40 entries from a normal distribution with standard
deviation 1 and mean 0.15 to be the values for that voxel in class B.
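
For anyone who wants to try this on the Python side, the R line above can
be sketched with only the standard library. The function name
two_mean_voxel, its parameters, and the seed are made up for
illustration. If "variance" is read as the variance pooled over all 80
values, its expected value is sd^2 + shift^2 (1.0225 here), which would
fit the variance increasing only minimally with signal; that reading is
an assumption.

```python
import random
import statistics


def two_mean_voxel(n_per_class=40, shift=0.15, sd=1.0, seed=0):
    """Simulate one voxel: class A ~ N(-shift, sd), class B ~ N(+shift, sd).

    The two classes are drawn independently, as in the R call
    c(rnorm(40, mean=-0.15, sd=1), rnorm(40, mean=0.15, sd=1)).
    """
    rng = random.Random(seed)  # arbitrary seed, only for repeatability
    class_a = [rng.gauss(-shift, sd) for _ in range(n_per_class)]
    class_b = [rng.gauss(shift, sd) for _ in range(n_per_class)]
    return class_a, class_b


class_a, class_b = two_mean_voxel()

# Per-class sample variances: both classes share sd, so these should be similar.
print("var A:", statistics.variance(class_a))
print("var B:", statistics.variance(class_b))

# Variance pooled over all 80 values; its expectation is sd**2 + shift**2.
print("pooled var:", statistics.variance(class_a + class_b))
```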
To mimic normal_feature_dataset as I read its description, I first chose
40 entries from a standard normal distribution (mean 0, standard
deviation 1) to be the 40 values for that voxel in class A. I then
divided each of the 40 values by the signal-to-noise ratio (e.g. 0.3),
and used the results as the values for class B. Finally, I divided each
of the 80 values by the largest absolute value among them.
Dividing by a constant (the signal-to-noise) makes fairly large
differences in the variance of the A and B samples as the constant
changes, even after normalization, and I think that might be driving the
"widening" of the null distribution with increased signal.
Does my description seem accurate for normal_feature_dataset? In
particular, do you set the class B values as class A / snr, or generate
a separate set of normal values for A and B, then divide the B values?
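
A short Python sketch of both readings asked about above, using only the
standard library. It follows the description in this message, not the
normal_feature_dataset source, so it should not be taken as what PyMVPA
actually does. The name variance_ratio, the separate_draws flag, and the
seed are hypothetical. When class B is exactly class A / snr, the ratio
of the class variances is 1/snr^2, and the final normalization does not
change it because both classes are divided by the same constant.

```python
import random
import statistics


def variance_ratio(snr, n_per_class=40, separate_draws=False, seed=0):
    """Return var(class B) / var(class A) for one simulated voxel."""
    rng = random.Random(seed)  # arbitrary seed, only for repeatability
    class_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    if separate_draws:
        # Second reading: class B gets its own normal draws before the division.
        base_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    else:
        # First reading: class B reuses the class A draws.
        base_b = class_a
    class_b = [x / snr for x in base_b]

    # Normalize all 80 values by the largest absolute value among them.
    largest = max(abs(x) for x in class_a + class_b)
    class_a = [x / largest for x in class_a]
    class_b = [x / largest for x in class_b]
    return statistics.variance(class_b) / statistics.variance(class_a)


for snr in (0.1, 0.3, 1.0):
    # Columns: snr, ratio with shared draws, ratio with separate draws.
    print(snr, variance_ratio(snr), variance_ratio(snr, separate_draws=True))
```

With shared draws the ratio equals 1/snr^2 up to rounding; with separate
draws it only scatters around that value. Either way the two classes end
up with very different variances unless snr is close to 1, unlike the
two-means design with a common standard deviation.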
What do the null distributions for different amounts of signal look like
if you try the simulations making the data as I describe (values from
two normal distributions with a small difference in mean but the same
variance)?
We'll see if this is getting on the right track!
On 2/10/2013 12:03 PM, J.A. Etzel wrote:
> I've been running some simulations to look at the effect of permuting
> the training set only, testing set only, or both (together) under
> different amounts of signal and different numbers of examples and
> cross-validation folds.
> I do not see the widening of the null distribution as the amount of
> signal increases that appears in some of the example figures
> when the training labels are permuted.
> I posted my version of this comparison at:
> Some translation might be needed: my plots show accuracy, so larger
> numbers are better, and more "bias" corresponds to easier
> classification. The number of "runs" is the number of cross-validation
> folds. I set up the examples with 50 voxels ("features"), all equally
> informative, and this simulation is for just one person.
> Do you typically expect to see the null distribution wider for higher
> signal when the training set labels only are permuted?
> That seems a strange thing to expect, and I couldn't reproduce the
> pattern. We have a new lab member who knows python and can help me sort
> out your code; I suspect we are doing something different in terms of
> how the relabelings are done over the cross-validation folds or how the
> results are tabulated.
Joset A. Etzel, Ph.D.
Cognitive Control & Psychopathology Lab
Washington University in St. Louis