[pymvpa] effect of signal on null distributions
rawi707 at yahoo.com
Mon Feb 18 15:12:48 UTC 2013
> Reading the description of normal_feature_dataset, I first chose 40 entries from a standard normal distribution with standard deviation 1 and mean 0 to
> be the 40 values for that voxel in class A. I then divided each of the 40 values by the signal-to-noise ratio (e.g. 0.3), and set those values for class B.
> Finally, I divided each value by the largest absolute value of any of the 80 values.
In my humble opinion, A and B have the same mean but differ in variance: the variance of class A is 1, and since B = A/snr, the standard deviation of class B is 1/snr and its variance is 1/snr^2. If one z-scores A and B separately, one gets exactly the same values for both. Moreover, with a GNB classifier there is a high probability of obtaining chance-level performance or only slightly higher (because GNB classifies after estimating the per-class mean and variance). Dividing A and B by max(|A|, |B|) is not crucial, since it was only intended to keep the values within [-1, 1].
Given that A and B have the same mean, I doubt that the previously shown behavior of the permutation distributions (permuting only the training set, only the testing set, or both) would occur in practice. I have to admit, though, that the distribution obtained by permuting only the training set looked a bit odd.
Generate A with zero mean and unit variance, so mA = mean(A) = 0;
B = A/snr;
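
To make the point concrete, here is a minimal sketch of the construction as it is described in the quoted message (A drawn from N(0, 1), B = A/snr, then everything divided by the largest absolute value). It only follows that verbal description; it is not taken from the PyMVPA source for normal_feature_dataset, and the names make_voxel and zscore are made up for illustration. It uses only the Python standard library.

```python
import random
import statistics


def make_voxel(n=40, snr=0.3, seed=0):
    """Build one voxel's values for classes A and B as described in the thread.

    Assumption: B reuses the same draws as A, divided by snr.
    """
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]  # class A: N(0, 1) draws
    b = [x / snr for x in a]                     # class B: same draws / snr
    peak = max(abs(x) for x in a + b)            # largest |value| of all 2n
    a = [x / peak for x in a]                    # rescale into [-1, 1]
    b = [x / peak for x in b]
    return a, b


def zscore(values):
    """Z-score one class on its own (population standard deviation)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return [(x - m) / s for x in values]


snr = 0.3
a, b = make_voxel(snr=snr)

# Variance of B relative to A: 1/snr^2, unaffected by the common rescaling.
var_ratio = statistics.pvariance(b) / statistics.pvariance(a)

# Largest difference between the separately z-scored classes.
max_z_diff = max(abs(x - y) for x, y in zip(zscore(a), zscore(b)))

print("mean(A) =", statistics.fmean(a))
print("mean(B) =", statistics.fmean(b))
print("var(B)/var(A) =", var_ratio, "expected", 1 / snr ** 2)
print("max |z(A) - z(B)| =", max_z_diff)
```

The sample means are not exactly zero for a finite draw (mean(B) is simply mean(A)/snr), but both are zero in expectation, so the only systematic difference between the classes is the scale. The variance ratio should equal 1/snr^2 and the z-scored values should agree up to floating-point error.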
----- Original Message -----
> From: J.A. Etzel <jetzel at artsci.wustl.edu>
> To: pkg-exppsy-pymvpa at lists.alioth.debian.org
> Sent: Wednesday, February 13, 2013 11:06 PM
> Subject: Re: [pymvpa] effect of signal on null distributions
> I think (at least some of the) signal-related null distribution effects are
> coming from how we made the simulated datasets: I generated data using (what I
> think is something like) the mvpa2.misc.data_generators.normal_feature_dataset
> command and started getting very non-normal, signal-dependent null
> distributions, despite not changing any other part of my code.
> I looked at how the simulated datasets were made from the comment of a
> colleague, who suggested looking at the variance of the input data: how does the
> variance change with the signal changes?
> In my original simulation the variance increased minimally with signal. This is
> how I made the values for one voxel:
> c(rnorm(40, mean=-0.15, sd=1), rnorm(40, mean=0.15, sd=1));
> In words, I chose 40 entries from a standard normal distribution with standard
> deviation 1 and mean -0.15 to be the 40 values for that voxel in class A, and 40
> entries from a normal distribution with standard deviation 1 and mean 0.15 to be
> the values for that voxel in class B.
> Reading the description of normal_feature_dataset, I first chose 40 entries from
> a standard normal distribution with standard deviation 1 and mean 0 to be the 40
> values for that voxel in class A. I then divided each of the 40 values by the
> signal-to-noise ratio (e.g. 0.3), and set those values for class B. Finally, I
> divided each value by the largest absolute value of any of the 80 values.
> Dividing by a constant (the signal-to-noise) makes fairly large differences in
> the variance of the A and B samples as the constant changes, even after
> normalization, and I think that might be driving the "widening" of the
> null distribution with increased signal.
> Two questions:
> Does my description seem accurate for normal_feature_dataset? In particular, do
> you set the class B values as class A / snr, or generate a separate set of
> normal values for A and B, then divide the B values?
> What do the null distribution for different amounts of signal look like if you
> try the simulations making the data as I describe (values from two normal
> distributions with a small difference in mean but same standard deviation)?
> We'll see if this is getting on the right track!
> On 2/10/2013 12:03 PM, J.A. Etzel wrote:
>> I've been running some simulations to look at the effect of permuting
>> the training set only, testing set only, or both (together) under
>> different amounts of signal and different numbers of examples and
>> cross-validation folds.
>> I do not see the widening of the null distribution as the amount of
>> signal increases that appears in some of the example figures
>> when the training labels are permuted.
>> I posted my version of this comparison at:
>> Some translation might be needed: my plots show accuracy, so larger
>> numbers are better, and more "bias" corresponds to easier
>> classification. The number of "runs" is the number of
>> folds. I set up the examples with 50 voxels ("features"), all
>> informative, and this simulation is for just one person.
>> Do you typically expect to see the null distribution wider for higher
>> signal when the training set labels only are permuted?
>> That seems a strange thing to expect, and I couldn't reproduce the
>> pattern. We have a new lab member who knows python and can help me sort
>> out your code; I suspect we are doing something different in terms of
>> how the relabelings are done over the cross-validation folds or how the
>> results are tabulated.
> -- Joset A. Etzel, Ph.D.
> Research Analyst
> Cognitive Control & Psychopathology Lab
> Washington University in St. Louis