[pymvpa] PyMVPA text-data format

Michael Hanke michael.hanke at gmail.com
Sat Feb 7 19:13:55 UTC 2009


Hi Eric,

On Sat, Feb 07, 2009 at 01:49:51PM -0500, Eric Trautmann wrote:
> Greeting,
>     I'm starting out with PyMVPA and attempting to use this for  
> machine learning problems in robotics.

Sounds cool!

> I haven't seen any documentation regarding what a text dataset should  
> look like.  I've pre-processed our experimental data with Matlab to  
> create the labels and samples, and I'm not concerned with chunks.

You might also consider loading the .mat files directly via scipy.io.loadmat().
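For example, a minimal sketch (the file name and the variable keys 'samples'
and 'labels' are placeholders for whatever you saved from Matlab):

  from scipy.io import loadmat

  # loadmat() returns a dict mapping variable names to arrays
  mat = loadmat('mydata.mat')
  samples = mat['samples']        # e.g. nsamples x nfeatures array
  labels = mat['labels'].ravel()  # flatten to a 1-d label vector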

> Could someone either refer me to an example of formatted text data or  
> let me know what the format should be?

You have a lot of flexibility with respect to the format of the files. I'd
recommend that you load them with NumPy's 'loadtxt()' function. That will give
you a data array that you can pass to the 'samples' argument of the Dataset
constructor (see the sketch below).
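A minimal sketch, assuming two whitespace-delimited text files ('samples.txt'
with one sample per row and one feature per column, 'labels.txt' with one
label per row -- both names are just placeholders) and the PyMVPA 0.x-style
import:

  import numpy as N
  from mvpa.suite import Dataset

  # load samples and labels exported from Matlab as plain text
  samples = N.loadtxt('samples.txt')
  labels = N.loadtxt('labels.txt', dtype=int)

  # wrap them into a PyMVPA dataset
  ds = Dataset(samples=samples, labels=labels)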

Regarding chunks: even if you don't have to care about them for your data, you
might still want to use them. For example, if you have 1000 samples, each of
them will automatically be associated with its own chunk -- a cross-validation
would then be 1000-fold. If that is not what you want (e.g. due to processing
time constraints), the chunks are the lever to control this behavior (see the
sketch after this paragraph).
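For instance, a sketch of assigning five chunks by hand ('samples' and
'labels' as above; the round-robin layout is just an illustration -- use
whatever grouping reflects your data):

  import numpy as N
  from mvpa.suite import Dataset

  nsamples = samples.shape[0]
  # assign samples to 5 chunks in a round-robin fashion -> 5-fold CV
  chunks = N.arange(nsamples) % 5
  ds = Dataset(samples=samples, labels=labels, chunks=chunks)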


Here is what loadtxt() can do:



In [1]: N.loadtxt?
Type:           function
Base Class:     <type 'function'>
String Form:    <function loadtxt at 0x973c7d4>
Namespace:      Interactive
File:           /usr/lib/python2.5/site-packages/numpy/lib/io.py
Definition:     N.loadtxt(fname, dtype=<type 'float'>, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False)
Docstring:
    Load ASCII data from fname into an array and return the array.

    The data must be regular, same number of values in every row

    Parameters
    ----------
    fname : filename or a file handle.
      Support for gzipped files is automatic, if the filename ends in .gz

    dtype : data-type
      Data type of the resulting array.  If this is a record data-type, the
      resulting array will be 1-d and each row will be interpreted as an
      element of the array. The number of columns used must match the number
      of fields in the data-type in this case.

    comments : str
      The character used to indicate the start of a comment in the file.

    delimiter : str
      A string-like character used to separate values in the file. If delimiter
      is unspecified or none, any whitespace string is a separator.

    converters : {}
      A dictionary mapping column number to a function that will convert that
      column to a float.  Eg, if column 0 is a date string:
      converters={0:datestr2num}. Converters can also be used to provide
      a default value for missing data: converters={3:lambda s: float(s or 0)}.

    skiprows : int
      The number of rows from the top to skip.

    usecols : sequence
      A sequence of integer column indexes to extract where 0 is the first
      column, eg. usecols=(1,4,5) will extract the 2nd, 5th and 6th columns.

    unpack : bool
      If True, will transpose the matrix allowing you to unpack into named
      arguments on the left hand side.

    Examples
    --------
      >>> X = loadtxt('test.dat')  # data in two columns
      >>> x,y,z = loadtxt('somefile.dat', usecols=(3,5,7), unpack=True)
      >>> r = np.loadtxt('record.dat', dtype={'names':('gender','age','weight'),
                'formats': ('S1','i4', 'f4')})

    SeeAlso: scipy.io.loadmat to read and write matfiles.
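For instance, if your Matlab export writes samples and labels into a single
comma-separated file (a hypothetical layout with the label in the last of,
say, 11 columns), the 'delimiter' and 'usecols' arguments let you split them
while loading:

  import numpy as N

  samples = N.loadtxt('data.csv', delimiter=',', usecols=range(10))
  labels = N.loadtxt('data.csv', delimiter=',', usecols=(10,), dtype=int)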

-- 
GPG key:  1024D/3144BE0F Michael Hanke
http://apsy.gse.uni-magdeburg.de/hanke
ICQ: 48230050


