[pymvpa] Assembling PyMVPA 0.5: The new dataset
Michael Hanke
michael.hanke at gmail.com
Wed Oct 28 17:12:28 UTC 2009
Hi all,
this is the first of a series of posts introducing new features and
changes that will be part of the upcoming PyMVPA release 0.5 (expected
some time in early 2010).
The new dataset
---------------
In comparison to the old dataset class, the new one is a lot simpler
but at the same time considerably more powerful. The new class is
implemented to be a container and just that -- nothing else. All
convoluted logic (e.g. setting labels to a permuted state and resetting
them later on) has been stripped from the class.
The major differences are:
* A dataset can now have an arbitrary number of attributes per sample
(i.e. not just labels!) and also _per feature_. This will allow us to
have multiple label sets and also to add information for grouping
features, as we had for samples before (i.e. to implement distinct ROI
sets or add stat maps to a dataset).
* However, while it can have more attributes it can also have fewer. The
new dataset allows for unlabeled data -- no need to invent pointless
placeholder labels. In the simplest case a dataset can be created from
just a 2D samples matrix/array.
* Datasets will no longer copy data over and over. While the former
behavior might be considered safer, the new one is leaner and
potentially faster. Datasets can now be sliced just like
NumPy arrays (using boolean masks, index sequences or slicing
arguments). Whenever NumPy allows for slicing without copying, PyMVPA
datasets will offer the same.
As a consequence more and more functions will simply return new
datasets instead of modifying existing ones and keeping track of their
modifications. For example, permuting labels simply produces a
shallow copy of a dataset, assigns permuted labels, and uses views
for all remaining dataset information. When no longer needed the
permuted dataset can simply be dumped -- no need to restore labels to
their previous state.
* __init__ became much simpler. There are no fishy **kwargs anymore. The
constructor is now generic and valid for all Dataset subclasses.
Additional ways to create datasets are implemented as classmethods to
avoid pseudo-overloading of __init__().
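
To make the points in the list above more concrete, here is a rough
sketch of the general idea. It is not the actual PyMVPA code: the names
ToyDataset, from_labeled, permuted, sa and fa are made up for this
illustration, and the real class may look quite different. The only
thing relied upon is standard NumPy behavior (basic slices give views,
boolean masks and index sequences give copies).

```python
import numpy as np


class ToyDataset:
    """Hypothetical stand-in for illustration; NOT the PyMVPA Dataset API."""

    def __init__(self, samples, sa=None, fa=None):
        # Plain, generic constructor: samples plus optional attribute dicts.
        self.samples = np.asarray(samples)  # no copy if already an ndarray
        self.sa = dict(sa or {})  # per-sample attributes, e.g. labels
        self.fa = dict(fa or {})  # per-feature attributes, e.g. ROI ids

    @classmethod
    def from_labeled(cls, samples, labels):
        # Alternative constructor as a classmethod, not an overloaded __init__.
        return cls(samples, sa={"labels": np.asarray(labels)})

    def __getitem__(self, sel):
        # Select samples (rows); NumPy decides whether we get a view or a copy.
        return ToyDataset(self.samples[sel],
                          {k: v[sel] for k, v in self.sa.items()},
                          self.fa)

    def permuted(self, attr, rng):
        # Shallow copy: the samples array is shared, one attribute is replaced.
        new_sa = dict(self.sa)
        new_sa[attr] = rng.permutation(self.sa[attr])
        return ToyDataset(self.samples, new_sa, self.fa)


ds = ToyDataset.from_labeled(np.arange(12.0).reshape(4, 3),
                             ["a", "b", "a", "b"])
unlabeled = ToyDataset(np.zeros((2, 3)))           # no labels needed

sliced = ds[1:3]                                   # basic slice: a view
masked = ds[np.array([True, False, True, False])]  # boolean mask: a copy
perm = ds.permuted("labels", np.random.default_rng(0))

print(np.shares_memory(sliced.samples, ds.samples))  # True
print(np.shares_memory(masked.samples, ds.samples))  # False
print(perm.samples is ds.samples)                    # True
```

The permuted dataset shares its samples with the original and leaves
the original labels untouched, so it can simply be thrown away when it
is no longer needed.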
If you want to take a first look see here
http://github.com/hanke/PyMVPA/blob/mh/master/mvpa/datasets/base.py
for the code and examples and here
http://github.com/hanke/PyMVPA/blob/mh/master/mvpa/tests/test_datasetng.py
for a few test cases that show more functionality.
If you spot any problems or bugs, or have recommendations -- please
speak up! The reimplementation of the dataset is just the first (but a
critical) step to allow for further necessary improvements coming with
PyMVPA 0.5.
Stay tuned...
Michael
--
GPG key: 1024D/3144BE0F Michael Hanke
http://mih.voxindeserto.de