[pymvpa] effect of signal on null distributions

Wed Feb 13 23:06:34 UTC 2013

I think at least some of the signal-related null distribution effects 
come from how the simulated datasets were made. When I generated data 
in a way that (I think) resembles the 
mvpa2.misc.data_generators.normal_feature_dataset function, I started 
getting very non-normal, signal-dependent null distributions, despite 
not changing any other part of my code.

I looked at how the simulated datasets were made after a comment from a 
colleague, who suggested checking the variance of the input data: how 
does the variance change as the signal changes?

In my original simulation the variance increased minimally with signal. 
This is how I made the values for one voxel:
c(rnorm(40, mean=-0.15, sd=1), rnorm(40, mean=0.15, sd=1));
In words, I drew 40 entries from a normal distribution with standard 
deviation 1 and mean -0.15 to be the 40 values for that voxel in class 
A, and 40 entries from a normal distribution with standard deviation 1 
and mean 0.15 to be the values for that voxel in class B.
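
For anyone who would rather read this in Python, here is a minimal 
sketch of the same mean-shift scheme using only the standard library. 
The function names and parameters are hypothetical (they are not from 
PyMVPA or from my R code). The second function gives the population 
variance of the pooled 80 values, which is why the variance grows only 
slightly with signal: about 1.0225 for means of -0.15 and 0.15.

```python
import random
import statistics


def make_voxel_mean_shift(n_per_class=40, half_shift=0.15, sd=1.0, rng=None):
    """Draw one voxel's values: class A ~ N(-half_shift, sd), class B ~ N(+half_shift, sd).

    Hypothetical helper mirroring the R line
    c(rnorm(40, mean=-0.15, sd=1), rnorm(40, mean=0.15, sd=1)).
    """
    rng = rng or random.Random()
    class_a = [rng.gauss(-half_shift, sd) for _ in range(n_per_class)]
    class_b = [rng.gauss(half_shift, sd) for _ in range(n_per_class)]
    return class_a, class_b


def pooled_population_variance(half_shift, sd=1.0):
    """Variance of an equal mixture of N(-m, sd^2) and N(+m, sd^2): sd^2 + m^2."""
    return sd ** 2 + half_shift ** 2


if __name__ == "__main__":
    a, b = make_voxel_mean_shift(rng=random.Random(0))
    print("sample variance, A and B pooled:", statistics.pvariance(a + b))
    print("population variance of the mixture:", pooled_population_variance(0.15))
```

Both classes keep the same standard deviation here; only the means 
move apart as the signal increases.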

For the new version, following my reading of the description of 
normal_feature_dataset, I first drew 40 entries from a standard normal 
distribution (mean 0, standard deviation 1) to be the 40 values for 
that voxel in class A. I then divided each of those 40 values by the 
signal-to-noise ratio (e.g. 0.3) and used the results as the values 
for class B. Finally, I divided each of the 80 values by the largest 
absolute value among them.

Dividing by a constant (the signal-to-noise ratio) produces fairly 
large differences between the variances of the A and B samples as the 
constant changes, even after normalization, and I think that might be 
driving the "widening" of the null distribution with increased signal.
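
The divide-by-snr procedure described above can be sketched in Python 
as follows. This is only my reading of the description, not PyMVPA's 
actual code: the function name and the share_draws flag are 
hypothetical, and the flag exists only to cover the two possibilities 
(class B is class A divided by snr, or class B is a separate normal 
draw divided by snr). Since the final normalization multiplies every 
value by the same constant, it cannot change the ratio of the class 
variances, which is 1 / snr^2 when the draws are shared.

```python
import random
import statistics


def make_voxel_scaled(n_per_class=40, snr=0.3, share_draws=True, rng=None):
    """One voxel made by dividing by snr, then normalizing by the largest |value|.

    Hypothetical sketch of the procedure described in this message;
    it is NOT taken from mvpa2.misc.data_generators.normal_feature_dataset.
    """
    rng = rng or random.Random()
    class_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    if share_draws:
        source = class_a  # class B = class A / snr
    else:
        source = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]  # separate draw
    class_b = [value / snr for value in source]
    largest = max(abs(value) for value in class_a + class_b)
    return ([value / largest for value in class_a],
            [value / largest for value in class_b])


if __name__ == "__main__":
    # Compare the class variances for a few (arbitrary) snr values.
    for snr in (0.1, 0.3, 1.0, 3.0):
        a, b = make_voxel_scaled(snr=snr, rng=random.Random(0))
        ratio = statistics.variance(b) / statistics.variance(a)
        print(f"snr={snr}: var(B) / var(A) = {ratio:.3f}")
```

With separate draws (share_draws=False) the sample ratio fluctuates 
around 1 / snr^2 rather than matching it exactly.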

Two questions:
Does my description seem accurate for normal_feature_dataset? In 
particular, do you set the class B values as class A / snr, or generate 
a separate set of normal values for A and B, then divide the B values?

What do the null distributions for different amounts of signal look 
like if you run the simulations with data made as I describe (values 
from two normal distributions with a small difference in mean but the 
same standard deviation)?

We'll see if this is getting on the right track!

thanks,
Jo

On 2/10/2013 12:03 PM, J.A. Etzel wrote:
> I've been running some simulations to look at the effect of permuting
> the training set only, testing set only, or both (together) under
> different amounts of signal and different numbers of examples and
> cross-validation folds.
>
> I do not see the widening of the null distribution as the amount of
> signal increases that appears in some of the example figures
> (http://lists.alioth.debian.org/pipermail/pkg-exppsy-pymvpa/attachments/20130204/a36533de/attachment-0001.png)
> when the training labels are permuted.
>
> I posted my version of this comparison at:
> http://mvpa.blogspot.com/2013/02/comparing-null-distributions-changing.html
>
> Some translation might be needed: my plots show accuracy, so larger
> numbers are better, and more "bias" corresponds to easier
> classification. The number of "runs" is the number of cross-validation
> folds. I set up the examples with 50 voxels ("features"), all equally
> informative, and this simulation is for just one person.
>
> Do you typically expect to see the null distribution wider for higher
> signal when the training set labels only are permuted?
>
> That seems a strange thing to expect, and I couldn't reproduce the
> pattern. We have a new lab member who knows python and can help me sort
> out your code; I suspect we are doing something different in terms of
> how the relabelings are done over the cross-validation folds or how the
> results are tabulated.
>
> Jo
>
>

-- 
Joset A. Etzel, Ph.D.
Research Analyst
Cognitive Control & Psychopathology Lab
Washington University in St. Louis
http://mvpa.blogspot.com/