[pymvpa] Assembling PyMVPA 0.5: The new dataset

Wed Oct 28 17:12:28 UTC 2009

Hi all,

this is the first of a series of posts introducing new features and
changes that will be part of the upcoming PyMVPA release 0.5 (expected
some time in early 2010).

The new dataset
---------------

In comparison to the old dataset class, the new one is a lot simpler,
but at the same time considerably more powerful. The new class is
implemented to be a container and just that -- nothing else. All
convoluted logic (e.g. setting labels to a permuted state and resetting
them later on) has been stripped from the class.

The major differences are:

* A dataset can now have an arbitrary number of attributes per sample
  (i.e. not just labels!) and also _per feature_. This will allow us to
  have multiple label sets and also to add information for grouping
  features, as we had for samples before (i.e. to implement distinct ROI
  sets or add stat maps to a dataset).

* However, while it can have more attributes it can also have fewer. The
  new dataset allows for unlabeled data -- no need to invent pointless
  placeholder labels. In the simplest case a dataset can be created from
  just a 2D samples matrix/array.

* Datasets will no longer copy data over and over. While the former
  behavior might be considered safer, the new one is leaner and
  potentially leads to more speed. Datasets can now be sliced just like
  Numpy arrays (using boolean masks, index sequences or slicing
  arguments). Whenever Numpy allows for slicing without copying, PyMVPA
  datasets will offer the same.

  As a consequence more and more functions will simply return new
  datasets instead of modifying existing ones and keeping track of their
  modifications. For example, label permutation simply produces a
  shallow copy of a dataset, assigns permuted labels, but uses views
  for all remaining dataset information. When no longer needed the
  permuted dataset can simply be dumped -- no need to restore labels to
  their previous state.

* __init__ became much simpler. There are no fishy **kwargs anymore. The
  constructor is now generic and valid for all Dataset subclasses.
  Additional ways to create datasets are implemented as classmethods to
  avoid pseudo-overloading of __init__().
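
To make the points above a bit more concrete, here is a rough sketch of
the container idea using plain Numpy. Note that this is NOT the actual
PyMVPA code: the class ToyDataset, its `sa`/`fa` dictionaries, and the
methods from_labeled() and permuted() are names made up for this
illustration only. See the links below for the real implementation.

```python
import numpy as np


class ToyDataset:
    """Hypothetical stand-in for illustration; not the PyMVPA Dataset."""

    def __init__(self, samples, sa=None, fa=None):
        # np.asarray() does not copy if it already gets an ndarray
        self.samples = np.asarray(samples)
        if self.samples.ndim != 2:
            raise ValueError("samples must be a 2D array")
        # arbitrary number of per-sample (sa) and per-feature (fa)
        # attributes; both may be empty, i.e. unlabeled data is fine
        self.sa = {k: np.asarray(v) for k, v in (sa or {}).items()}
        self.fa = {k: np.asarray(v) for k, v in (fa or {}).items()}

    @classmethod
    def from_labeled(cls, samples, labels):
        # alternative constructor instead of overloading __init__
        return cls(samples, sa={"labels": labels})

    def __getitem__(self, sel):
        # select samples (rows); Numpy decides whether this is a view
        # (basic slices) or a copy (boolean masks, index sequences)
        return type(self)(self.samples[sel],
                          sa={k: v[sel] for k, v in self.sa.items()},
                          fa=self.fa)

    def permuted(self, name="labels", seed=None):
        # shallow copy: only the permuted attribute is a new array,
        # samples and all other attributes are shared with the original
        rng = np.random.default_rng(seed)
        sa = dict(self.sa)
        sa[name] = rng.permutation(self.sa[name])
        return type(self)(self.samples, sa=sa, fa=self.fa)


ds = ToyDataset.from_labeled(np.arange(12.0).reshape(4, 3),
                             ["a", "a", "b", "b"])
unlabeled = ToyDataset(np.zeros((2, 5)))      # just a 2D samples array

front = ds[:2]                                # basic slice
picked = ds[np.array([True, False, True, False])]  # boolean mask
perm = ds.permuted(seed=0)                    # original stays untouched

print(np.shares_memory(front.samples, ds.samples))
print(np.shares_memory(picked.samples, ds.samples))
print(np.shares_memory(perm.samples, ds.samples))
```

In this sketch the basic slice and the permuted dataset share their
samples with the original, while the boolean mask selection does not,
simply because Numpy itself copies in that case. The permuted dataset
can be thrown away when no longer needed; `ds` never changed.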

If you want to take a first look see here

  http://github.com/hanke/PyMVPA/blob/mh/master/mvpa/datasets/base.py

for the code and examples and here

  http://github.com/hanke/PyMVPA/blob/mh/master/mvpa/tests/test_datasetng.py

for a few test cases that show more functionality.

If you spot any problems or bugs, or have recommendations -- please
speak up! The reimplementation of the dataset is just the first (but
critical) step to allow for further necessary improvements coming with
PyMVPA 0.5.

Stay tuned...

Michael

-- 
GPG key:  1024D/3144BE0F Michael Hanke
http://mih.voxindeserto.de