[pymvpa] problem with high dimensional dataset

Alok Deshpande alokdesh at gmail.com
Fri Mar 26 23:58:31 UTC 2010


Hi Yaroslav,

Thanks for your suggestions, and sorry for the late reply; I was busy with
other work and did not get a chance to try them until now.

Here is some additional information about my dataset (named mdata):

>>> mdata.samples.shape
(41, 8673536)
>>> mdata.samples.dtype
dtype('float32')
>>> mdata.samples.max()
23559.484
>>> mdata.samples.min()
-1214.7351

print mdata.summary() goes into an infinite loop!

Following is the output of print mvpa.wtf(), captured before the call enters
the infinite loop:

Current date:   2010-03-26 18:52
PyMVPA:
 Version:       0.4.4
 Path:          /var/lib/python-support/python2.5/mvpa/__init__.pyc
 Version control (GIT):
 GIT information could not be obtained due
"/var/lib/python-support/python2.5/mvpa/.. is not under GIT"
SYSTEM:
 OS:            posix Linux 2.6.26-1-amd64 #1 SMP Mon Dec 15 17:25:36 UTC
2008
 Distribution:  debian/5.0
EXTERNALS:
 Present:       cPickle, ctypes, gzip, libsvm, libsvm verbosity control,
matplotlib, mdp, mdp ge 2.4, nifti, nifti ge 0.20090205.1, numpy, pylab,
pylab plottable, pywt, reportlab, scipy, sg_fixedcachesize, shogun,
shogun.krr, shogun.mpd
 Absent:        atlas_fsl, atlas_pymvpa, elasticnet, glmnet, good
scipy.stats.rdist, good scipy.stats.rv_discrete.ppf, griddata, hcluster,
lars, lxml, nose, openopt, pywt wp reconstruct, pywt wp reconstruct fixed,
rpy, running ipython env, sg ge 0.6.4, shogun.lightsvm, shogun.svrlight,
weave
 Versions of critical externals:
  ctypes      : 1.0.3
  matplotlib  : 0.98.1
  nifti       : 0.20090303.1
  numpy       : 1.1.0
  pywt        : 0.1.7
  scipy       : 0.6.0
  shogun      : v0.6.3_r3165_2008-06-13_11:03_
 Matplotlib backend: GTKAgg
RUNTIME:
 PyMVPA Environment Variables:
 PyMVPA Runtime Configuration:
  [externals]
  have griddata = no
  have atlas_pymvpa = no
  have good scipy.stats.rdist = no
  have pylab plottable = yes
  have pywt wp reconstruct = no
  have mdp = yes
  have lxml = no
  have running ipython env = no
  have sg_fixedcachesize = yes
  have elasticnet = no
  have shogun.mpd = yes
  have matplotlib = yes
  have pywt wp reconstruct fixed = no
  have scipy = yes
  have reportlab = yes
  have openopt = no
  have libsvm = yes
  have nifti ge 0.20090205.1 = yes
  have nose = no
  have weave = no
  have atlas_fsl = no
  have ctypes = yes
  have hcluster = no
  have sg ge 0.6.4 = no
  have good scipy.stats.rv_discrete.ppf = no
  have libsvm verbosity control = yes
  have mdp ge 2.4 = yes
  have shogun.svrlight = no
  have rpy = no
  have shogun = yes
  have glmnet = no
  have lars = no
  have nifti = yes
  have shogun.krr = yes
  have cpickle = yes
  have numpy = yes
  have pylab = yes
  have shogun.lightsvm = no
  have pywt = yes
  have gzip = yes

  [general]
  verbose = 1
 Process Information:
  Name:    python
  State:    R (running)
  Tgid:    6316
  Pid:    6316
  PPid:    6152
  TracerPid:    0
  Uid:    1001    1001    1001    1001
  Gid:    1001    1001    1001    1001
  FDSize:    256
  Groups:    1001
  VmPeak:      692040 kB
  VmSize:      630276 kB
  VmLck:           0 kB
  VmHWM:       73408 kB
  VmRSS:       73408 kB
  VmData:      125616 kB
  VmStk:         228 kB
  VmExe:        1172 kB
  VmLib:       71500 kB
  VmPTE:        1080 kB
  Threads:    2
  SigQ:    0/71680
  SigPnd:    0000000000000000
  ShdPnd:    0000000000000000
  SigBlk:    0000000000000000
  SigIgn:    0000000001001000
  SigCgt:    0000000180000002
  CapInh:    0000000000000000
  CapPrm:    0000000000000000
  CapEff:    0000000000000000
  CapBnd:    ffffffffffffffff
  Cpus_allowed:    000000ff
  Cpus_allowed_list:    0-7
  Mems_allowed:    00000000,00000001
  Mems_allowed_list:    0
  voluntary_ctxt_switches:    1482
  nonvoluntary_ctxt_switches:    101



To check for improper scaling, I rescaled the whole dataset so that it lies
in [0,1]^n, but the infinite loop still persists.
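
For reference, a minimal sketch of such per-feature min-max rescaling
(assuming mdata.samples behaves as a plain NumPy array; not necessarily the
exact code I ran):

import numpy as np

s = np.asarray(mdata.samples)          # (41, 8673536) float32
smin = s.min(axis=0)
srange = s.max(axis=0) - smin
srange[srange == 0] = 1.0              # guard constant features against division by zero
scaled = (s - smin) / srange           # each feature now lies in [0, 1]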
Also, you asked about C=1 (I assume you meant the cvtype argument of
NFoldSplitter). When I try any cvtype value greater than 1, it throws the
following error:


Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/var/lib/python-support/python2.5/mvpa/datasets/splitters.py", line 199, in __call__
    cfgs = self.splitcfg(dataset)
  File "/var/lib/python-support/python2.5/mvpa/datasets/splitters.py", line 388, in splitcfg
    return self._getSplitConfig(eval('dataset.unique' + self.__splitattr))
  File "/var/lib/python-support/python2.5/mvpa/datasets/splitters.py", line 626, in _getSplitConfig
    self.__cvtype)]
  File "/var/lib/python-support/python2.5/mvpa/misc/support.py", line 265, in getUniqueLengthNCombinations
    return _getUniqueLengthNCombinations_binary(L, n, sort=True)
  File "/var/lib/python-support/python2.5/mvpa/misc/support.py", line 236, in _getUniqueLengthNCombinations_binary
    for X in range(2**N):
OverflowError: range() result has too many items
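
If I read the traceback right, the overflow comes from
_getUniqueLengthNCombinations_binary scanning all 2**N bitmasks (N being,
presumably, the number of unique split-attribute values); in Python 2,
range(2**N) tries to build that entire list in memory. Just to illustrate
the failure mode (a sketch of mine, not PyMVPA's code), a lazy equivalent
using itertools.combinations (available from Python 2.6) never materializes
that range:

from itertools import combinations  # Python 2.6+

def unique_length_n_combinations(values, n):
    # Lazily yield each length-n combination of the unique values,
    # instead of enumerating all 2**N subsets as the binary variant does.
    for combo in combinations(sorted(set(values)), n):
        yield list(combo)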

Has anybody successfully worked with a dataset of this size before? My
question is not about whether the SVM classifies successfully; it is about
whether it returns output within a reasonable time on a dataset of this
scale.
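
(For scale: the samples array alone holds 41 * 8673536 float32 values, i.e.
41 * 8673536 * 4 bytes, roughly 1.3 GB.)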

Thanks in advance,
Alok

On Mon, Mar 22, 2010 at 6:58 PM, Yaroslav Halchenko
<debian at onerussian.com>wrote:

> Hi Alok,
>
> My guess is that you are using C=1 (or some other rigid positive value)
> and/or improperly scaled data... then libsvm's optimizer might get into
> such a loop.
>
> For best troubleshooting, provide
>
> print mvpa.wtf()
>
> print dataset.summary()
> for the dataset you are trying to process,
>
> and the actual classifier you are using.
>
> On Mon, 22 Mar 2010, Alok Deshpande wrote:
>
> >    Hi all,
> >    I am currently working on basic classification problems on resting
> >    state fMRI datasets (gender differences, for example). The
> >    dimensionality of the feature vector is quite large (128 time points x
> >    33 x 64 x 64). I am working on a Linux server with 8 cores and
> >    sufficient RAM.
> >    Following is the output of the command 'free -m':
> >
> >                 total       used       free     shared    buffers     cached
> >    Mem:          8006       1254       6752          0        100        854
> >    -/+ buffers/cache:        299       7706
> >    Swap:        23454        209      23244
> >    When I try to train a simple linear SVM classifier on the dataset (I
> --
>                                  .-.
> =------------------------------   /v\  ----------------------------=
> Keep in touch                    // \\     (yoh@|www.)onerussian.com
> Yaroslav Halchenko              /(   )\               ICQ#: 60653192
>                   Linux User    ^^-^^    [175555]
>
>
>
> _______________________________________________
> Pkg-ExpPsy-PyMVPA mailing list
> Pkg-ExpPsy-PyMVPA at lists.alioth.debian.org
> http://lists.alioth.debian.org/mailman/listinfo/pkg-exppsy-pymvpa
>



-- 
Alok S. Deshpande
Graduate Student
Electrical & Computer Engineering Department
University of Wisconsin Madison

