[pymvpa] effect of signal on null distributions

Mon Feb 18 15:12:48 UTC 2013

> Reading the description of normal_feature_dataset, I first chose 40 entries from a standard normal distribution with standard deviation 1 and mean 0 to 
> be the 40 values for that voxel in class A. I then divided each of the 40 values by the signal-to-noise ratio (e.g. 0.3), and set those values for class B.  
> Finally, I divided each value by the largest absolute value of any of the 80 values.

In my humble opinion, A and B have the same mean but differ in variance: before the final rescaling the standard deviation of class A is 1 and that of class B is 1/snr, so the variances are 1 and 1/snr^2. If one z-scores A and B separately, one gets exactly the same values for both classes. Moreover, if one uses a GNB classifier, there is a high probability of obtaining chance-level performance or only slightly higher (GNB classifies after estimating the mean and variance of each class). Dividing A and B by the largest absolute value is not crucial, since it is only intended to keep all values within [-1, 1].
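
To make the z-scoring point concrete, here is a minimal NumPy sketch. It is my own illustration, not the code of normal_feature_dataset: it assumes class B is built as A/snr, as in the quoted description, and that each class is z-scored separately. The zscore helper, the seed, and the variable names are made up for the example.

```python
import numpy as np


def zscore(x):
    """Z-score one class on its own: subtract its mean, divide by its std."""
    return (x - x.mean()) / x.std()


rng = np.random.default_rng(1)   # arbitrary seed, only for reproducibility
snr = 0.3                        # example value from the quoted message

A = rng.standard_normal(40)      # class A: 40 draws from N(0, 1)
B = A / snr                      # class B as described: class A divided by snr

# Scaling by a positive constant cancels out in the z-score.
same_after_zscore = np.allclose(zscore(A), zscore(B))
std_ratio = B.std() / A.std()    # equals 1/snr up to rounding

print(same_after_zscore, std_ratio)
```

The only thing that distinguishes the two classes here is the scale factor 1/snr, which is exactly what per-class z-scoring removes.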

Given that A and B have the same mean, I doubt that the previously shown behavior of the permutation distributions (permuting only the training set, only the testing set, or both) would occur in practice. I have to admit, though, that the distribution obtained with only the training set looked a bit odd.
Cheers, 
-Rawi

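Choose snr;
Generate A with zero mean and unit variance, so mA = mean(A) = 0;
B = A/snr;
gamma = max(abs([A, B]));
B = B/gamma;
A = A/gamma;

// with A below denoting the values before rescaling:
mA = mean(A/gamma) = mean(A)/gamma = 0;
mB = mean(A/(snr*gamma)) = mean(A)/(snr*gamma) = 0;
// similarly,
var_A = var(A)/gamma^2 = 1/gamma^2,
var_B = var(A)/(snr*gamma)^2 = 1/(snr*gamma)^2,
// so var_B/var_A = 1/snr^2.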
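
The steps above can be checked numerically with a small NumPy sketch. It follows the procedure as described in the quoted message, which may not match what normal_feature_dataset actually does; the seed, sample size, and variable names are assumptions for illustration. Note that the sample mean of 40 draws is only approximately zero, so the code checks how the two sample means are related rather than that they vanish.

```python
import numpy as np

rng = np.random.default_rng(0)    # arbitrary seed
snr = 0.3                         # example value from the thread
n = 40                            # values per class for one voxel

A_raw = rng.standard_normal(n)    # class A before rescaling
B_raw = A_raw / snr               # class B before rescaling

# gamma is the largest absolute value among all 80 values.
gamma = np.max(np.abs(np.concatenate([A_raw, B_raw])))
A = A_raw / gamma
B = B_raw / gamma

largest = np.max(np.abs(np.concatenate([A, B])))   # 1 after rescaling
mean_ratio_ok = np.isclose(B.mean(), A.mean() / snr)
var_ratio = B.var() / A.var()                      # 1/snr^2 up to rounding

print(largest, A.mean(), B.mean(), var_ratio)
```

With snr = 0.3 the variance ratio comes out near 11.1, independent of gamma, since the common rescaling cancels in the ratio.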

----- Original Message -----
> From: J.A. Etzel <jetzel at artsci.wustl.edu>
> To: pkg-exppsy-pymvpa at lists.alioth.debian.org
> Cc: 
> Sent: Wednesday, February 13, 2013 11:06 PM
> Subject: Re: [pymvpa] effect of signal on null distributions
> 
> I think (at least some of the) signal-related null distribution effects are 
> coming from how we made the simulated datasets: I generated data using (what I 
> think is something like) the mvpa2.misc.data_generators.normal_feature_dataset 
> command and started getting very non-normal, signal-dependent null 
> distributions, despite not changing any other part of my code.
> 
> I looked at how the simulated datasets were made from the comment of a 
> colleague, who suggested looking at the variance of the input data: how does the 
> variance change with the signal changes?
> 
> In my original simulation the variance increased minimally with signal. This is 
> how I made the values for one voxel:
> c(rnorm(40, mean=-0.15, sd=1), rnorm(40, mean=0.15, sd=1));
> In words, I chose 40 entries from a standard normal distribution with standard 
> deviation 1 and mean -0.15 to be the 40 values for that voxel in class A, and 40 
> entries from a normal distribution with standard deviation 1 and mean 0.15 to be 
> the values for that voxel in class B.
> 
> Reading the description of normal_feature_dataset, I first chose 40 entries from 
> a standard normal distribution with standard deviation 1 and mean 0 to be the 40 
> values for that voxel in class A. I then divided each of the 40 values by the 
> signal-to-noise ratio (e.g. 0.3), and set those values for class B. Finally, I 
> divided each value by the largest absolute value of any of the 80 values.
> 
> Dividing by a constant (the signal-to-noise) makes fairly large differences in 
> the variance of the A and B samples as the constant changes, even after 
> normalization, and I think that might be driving the "widening" of the 
> null distribution with increased signal.
> 
> Two questions:
> Does my description seem accurate for normal_feature_dataset? In particular, do 
> you set the class B values as class A / snr, or generate a separate set of 
> normal values for A and B, then divide the B values?
> 
> What do the null distribution for different amounts of signal look like if you 
> try the simulations making the data as I describe (values from two normal 
> distributions with a small difference in mean but same standard deviation)?
> 
> We'll see if this is getting on the right track!
> 
> thanks,
> Jo
> 
> 
> On 2/10/2013 12:03 PM, J.A. Etzel wrote:
>>  I've been running some simulations to look at the effect of permuting
>>  the training set only, testing set only, or both (together) under
>>  different amounts of signal and different numbers of examples and
>>  cross-validation folds.
>> 
>>  I do not see the widening of the null distribution as the amount of
>>  signal increases that appears in some of the example figures
>>  (http://lists.alioth.debian.org/pipermail/pkg-exppsy-pymvpa/attachments/20130204/a36533de/attachment-0001.png)
>>  when the training labels are permuted.
>> 
>>  I posted my version of this comparison at:
>>  http://mvpa.blogspot.com/2013/02/comparing-null-distributions-changing.html
>> 
>>  Some translation might be needed: my plots show accuracy, so larger
>>  numbers are better, and more "bias" corresponds to easier
>>  classification. The number of "runs" is the number of 
> cross-validation
>>  folds. I set up the examples with 50 voxels ("features"), all 
> equally
>>  informative, and this simulation is for just one person.
>> 
>>  Do you typically expect to see the null distribution wider for higher
>>  signal when the training set labels only are permuted?
>> 
>>  That seems a strange thing to expect, and I couldn't reproduce the
>>  pattern. We have a new lab member who knows python and can help me sort
>>  out your code; I suspect we are doing something different in terms of
>>  how the relabelings are done over the cross-validation folds or how the
>>  results are tabulated.
>> 
>>  Jo
>> 
>> 
> 
> -- Joset A. Etzel, Ph.D.
> Research Analyst
> Cognitive Control & Psychopathology Lab
> Washington University in St. Louis
> http://mvpa.blogspot.com/