[pymvpa] problem with high dimensional dataset
Yaroslav Halchenko
debian at onerussian.com
Sat Mar 27 00:19:45 UTC 2010
On Fri, 26 Mar 2010, Alok Deshpande wrote:
> here is some additional info about my dataset (named mdata):
> >>> mdata.samples.shape
> (41, 8673536)
that is way too many features for so few samples -- I wouldn't expect
good generalization. You would need to employ some feature selection.
> >>> mdata.samples.max()
> 23559.484
> >>> mdata.samples.min()
> -1214.7351
yep -- as I was afraid, the data is not scaled ;-)
> print mdata.summary() goes into infinite loop!
how do you know that it is an infinite loop and not just that it takes a
while? ;)
> In order to check for improper scaling, I rescaled the whole data so
> that it lies in [0,1]^n
good... although the result depends on how you did that (per feature
or over the whole dataset). You could also make use of zscore.
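
To illustrate why "per feature or not" matters, here is a minimal sketch
in plain Python. It is not PyMVPA's zscore; the helper names
(minmax_global, zscore_per_feature) and the toy data are made up for
illustration. Rescaling with a single global min/max leaves a
small-range feature squashed near zero, while z-scoring each feature
separately puts all features on a comparable scale.

```python
from statistics import mean, pstdev

def minmax_global(samples):
    """Rescale into [0, 1] using one min/max taken over the whole matrix."""
    flat = [v for row in samples for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # guard against a constant matrix
    return [[(v - lo) / span for v in row] for row in samples]

def zscore_per_feature(samples):
    """Z-score every column (feature) separately across samples."""
    columns = list(zip(*samples))
    mus = [mean(c) for c in columns]
    sds = [pstdev(c) or 1.0 for c in columns]  # constant feature: avoid /0
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)]
            for row in samples]

# hypothetical toy data: 3 samples x 2 features with very different ranges
samples = [[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]]
global_scaled = minmax_global(samples)
z = zscore_per_feature(samples)
```

With the global variant the first feature ends up confined to a tiny
slice of [0, 1], so it still contributes almost nothing to a linear
kernel; after per-feature z-scoring both columns have the same spread.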
> Also, you asked about C=1 (I assume you meant the cvtype in
> NFoldSplitter). When I try with any other C value > 1, it throws back
> the following error:
nope -- I meant C of the SVM you've used.
> Has anybody worked on the dataset of this size successfully before?
well... I did 200-300k features, but once again -- you'd better do
feature selection: use FeatureSelectionClassifier. See the constructs in
mvpa/clfs/warehouse.py for examples that run Anova first to sub-select
features based on univariate measures.
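
As a rough picture of what such a univariate pre-selection does, here is
a plain-Python sketch. It does not use the PyMVPA API: anova_f and
select_top_features are hypothetical helpers and the data is a toy
example. It computes a one-way ANOVA F-score for every feature and keeps
the highest-scoring ones. To avoid an optimistic bias, the selection
should be done on the training portion of each cross-validation split
only, which is the point of wrapping it into the classifier.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across classes."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 classes and more samples than classes")
    grand = sum(values) / n
    means = {lab: sum(g) / len(g) for lab, g in groups.items()}
    ss_between = sum(len(g) * (means[lab] - grand) ** 2
                     for lab, g in groups.items())
    ss_within = sum((v - means[lab]) ** 2
                    for lab, g in groups.items() for v in g)
    if ss_within == 0:
        # no within-class variance at all: perfectly separating or constant
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def select_top_features(samples, labels, n_keep):
    """Return sorted indices of the n_keep best features, plus all scores."""
    columns = list(zip(*samples))
    scores = [anova_f(col, labels) for col in columns]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep]), scores

# hypothetical toy data: 6 samples x 3 features, only feature 1 is informative
samples = [
    [0.1, 1.0, 5.0],
    [0.3, 1.2, 4.0],
    [0.2, 0.9, 6.0],
    [0.2, 3.0, 5.5],
    [0.1, 3.1, 4.5],
    [0.3, 2.9, 5.0],
]
labels = ["a", "a", "a", "b", "b", "b"]
kept, scores = select_top_features(samples, labels, n_keep=1)
```

Each feature is scored independently, so the cost grows linearly with
the number of features and the columns can be processed in chunks if the
whole matrix does not fit comfortably in memory.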
> My question is not related to whether SVM successfully classifies or
> not, it's related to whether it gives back output within reasonable
> time on such a large-scale dataset?
for your case, if the data is properly scaled and a reasonable C is
chosen (hence it is SVM-specific here), then for 41 samples the run time
should be quite reasonable ;) If you use SMLR instead, you could expect a
somewhat lengthy computation, since it doesn't work in dual space and it
would take it a while to select features while learning.
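
A small sketch of why the dual formulation helps here. This is plain
Python, linear_gram is a made-up helper, and whether a given SVM backend
precomputes the kernel exactly like this is an assumption. For a linear
kernel the optimization only needs the matrix of pairwise dot products
between samples, whose size is n_samples x n_samples no matter how many
features there are.

```python
def linear_gram(samples):
    """Linear-kernel Gram matrix: pairwise dot products between samples."""
    n = len(samples)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # the matrix is symmetric, fill both halves
            dot = sum(a * b for a, b in zip(samples[i], samples[j]))
            gram[i][j] = gram[j][i] = dot
    return gram

# hypothetical toy data: 3 samples x 4 features
samples = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
    [2.0, 2.0, 0.0, 1.0],
]
gram = linear_gram(samples)  # 3 x 3, independent of the feature count
```

For 41 samples that is a 41 x 41 matrix, i.e. 861 distinct dot products,
each a single pass over the 8673536 features. The feature count affects
only the cost of filling the matrix, not the size of the problem the
solver then works on.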
--
.-.
=------------------------------ /v\ ----------------------------=
Keep in touch // \\ (yoh@|www.)onerussian.com
Yaroslav Halchenko /( )\ ICQ#: 60653192
Linux User ^^-^^ [175555]