[pymvpa] problem with high dimensional dataset

Yaroslav Halchenko debian at onerussian.com
Sat Mar 27 00:19:45 UTC 2010


On Fri, 26 Mar 2010, Alok Deshpande wrote:
>    here is some additional info about my dataset (named mdata):
>    >>> mdata.samples.shape
>    (41, 8673536)
way too many features for so few samples -- I wouldn't expect any good
generalization -- you would need to employ some feature selection

>    >>> mdata.samples.max()
>    23559.484
>    >>> mdata.samples.min()
>    -1214.7351
yep -- as I was afraid, the data is not scaled ;-)

>    print mdata.summary() goes into infinite loop!
how do you know that it is an infinite loop and not just that it takes a
while? ;)

>    In order to check for improper scaling, I rescaled the whole data so
>    that it lies in [0,1]^n
good... although it also depends on how you did that (per feature or
over the whole dataset); you could also make use of zscore instead
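
To make the "per feature or not" distinction concrete, here is a minimal
NumPy sketch. It is not PyMVPA code: the function names and the toy data
are made up for illustration, and the 41x5 array only stands in for the
real dataset.

```python
import numpy as np

def minmax_global(X):
    """Rescale with a single min/max taken over the whole array."""
    lo, hi = X.min(), X.max()
    return (X - lo) / (hi - lo)

def minmax_per_feature(X):
    """Rescale each column (feature) to [0, 1] independently."""
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # constant features map to 0, no division by zero
    return (X - lo) / span

def zscore_per_feature(X):
    """Center each column to mean 0 and scale it to unit standard deviation."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0  # leave constant features at 0
    return (X - mu) / sd

rng = np.random.RandomState(0)
# toy stand-in: 41 samples, 5 features with very different ranges
X = rng.randn(41, 5) * np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])

Xg = minmax_global(X)       # small-range features get squashed together
Xf = minmax_per_feature(X)  # every feature spans [0, 1]
Xz = zscore_per_feature(X)  # every feature has mean 0, std 1
```

With the global variant everything lies in [0,1]^n, but a feature with a
small raw range ends up occupying only a tiny sliver of that interval, so
the scaling problem is not really gone.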


>    Also, you asked about C=1 (I assume you meant the cvtype in
>    Nfoldsplitter) When I try with any other C value > 1, it throws back
>    the following error:
nope -- I meant C of the SVM you've used.

>    Has anybody worked on the dataset of this size successfully before?
well... I did 200-300k features, but once again -- you'd better do feature
selection (use FeatureSelectionClassifier; see the constructs in
mvpa/clfs/warehouse.py for examples which run Anova first to
sub-select features based on a univariate measure)
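
For an idea of what such a univariate Anova pre-selection does, here is a
rough sketch in plain NumPy. It does not use the PyMVPA API; the helpers
`anova_fscores` and `select_top_k` and the toy data are hypothetical and
only illustrate the idea of ranking features by a per-feature F-score and
keeping the best ones.

```python
import numpy as np

def anova_fscores(X, y):
    """One-way ANOVA F-score of every feature (column) for class labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n, k = X.shape[0], len(classes)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        ss_between += Xc.shape[0] * (mc - grand_mean) ** 2
        ss_within += ((Xc - mc) ** 2).sum(axis=0)
    with np.errstate(divide='ignore', invalid='ignore'):
        f = (ss_between / (k - 1)) / (ss_within / (n - k))
    f[np.isnan(f)] = 0.0  # constant features (0/0) carry no information
    return f

def select_top_k(f, k):
    """Indices (sorted) of the k features with the largest F-scores."""
    return np.sort(np.argsort(f)[::-1][:k])

rng = np.random.RandomState(0)
# toy stand-in: 41 samples, 1000 features, only the first 10 informative
y = np.array([0] * 20 + [1] * 21)
X = rng.randn(41, 1000)
X[y == 1, :10] += 5.0

selected = select_top_k(anova_fscores(X, y), 10)
X_small = X[:, selected]  # reduced dataset to hand to the classifier
```

Note that the scores should be computed on the training portion only within
each cross-validation split, otherwise the test data leaks into the
selection; wrapping the selection inside the classifier takes care of that.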

> My
>    question is not related to whether SVM successfully classifies or not,
>    it's related to whether it gives back output within reasonable time on
>    such large scale dataset?
for your case, if the data is properly scaled and a reasonable C is chosen
(so this part is SVM-specific), training on 41 samples should take
quite reasonable time ;)  if you use SMLR then expect a somewhat
lengthy computation, since it doesn't work in dual space and it takes
a while to select features while learning
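
To illustrate the dual space remark: a linear SVM solved in the dual sees
the data only through inner products between samples, so with 41 samples
the optimization operates on a 41x41 matrix no matter how many features
there are. The sketch below (plain NumPy, hypothetical helper name, toy
data) builds that matrix in chunks of features. It is meant as an
illustration of the cost argument, not as a description of what PyMVPA or
the underlying SVM library do internally.

```python
import numpy as np

def linear_gram_chunked(X, chunk=100000):
    """Linear kernel matrix K = X X^T, accumulated over blocks of features."""
    n, d = X.shape
    K = np.zeros((n, n), dtype=np.float64)
    for start in range(0, d, chunk):
        # only one block of columns is upcast to float64 at a time
        block = np.asarray(X[:, start:start + chunk], dtype=np.float64)
        K += block.dot(block.T)
    return K

rng = np.random.RandomState(0)
# toy stand-in: 41 samples, 5000 features stored as float32
X = rng.randn(41, 5000).astype(np.float32)
K = linear_gram_chunked(X, chunk=1024)  # shape (41, 41)
```

Building K costs one pass over the features, and after that the size of the
problem is set by the number of samples alone.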

-- 
                                  .-.
=------------------------------   /v\  ----------------------------=
Keep in touch                    // \\     (yoh@|www.)onerussian.com
Yaroslav Halchenko              /(   )\               ICQ#: 60653192
                   Linux User    ^^-^^    [175555]




