[Debian-med-packaging] We need a global decision about R data in binary format, and stick to it.
Charles Plessy
plessy at debian.org
Mon Aug 5 23:29:14 UTC 2013
Hi Joerg and Paul,
thank you for your prompt answers, and thanks for everybody's contributions.
I would like to focus my questions on R binary objects that represent data that
was not entirely computer-generated (that is, for which the source cannot
be summarised by a mathematical formula and simple starting values). Note also
that many other programs, LibreOffice for instance, can store unformatted
textual data as a binary object. Therefore "binary object" does not mean that
the content is impractical to retrieve.
My first question is: to what extent do we need to verify that the object can
be regenerated?
- The starting point is a source package with an R binary object.
- With this starting point only, it may be impossible to know whether it has a
source or not. Has the upstream developer typed the results by hand
in an R session, for instance when collecting data from a table in a
printed report? Did he collect his data in a file that is not provided
in the source package? Or does he need a combination of data and scripts
to regenerate the binary object? Unless the answer can be found on the
Internet, one has to ask the author directly.
- If we have to ask, how long do we need to wait for the answer, and what
is the conclusion in case there is no answer?
My second question is: to what extent do we need the source?
- When the R binary object is a table that has been generated by hand,
my understanding is that it does not matter which format upstream
prefers, since it is trivial for anybody to export the R object into
his favorite format for modification.
- When the data in the R binary object has been produced by processing
another data file, how far back do we need to go? This
is an important question, because at the end of the chain of
rebuildability, there can be gigabytes of data.
- When the source of the binary object is not strictly necessary for
making relevant modifications, can we distribute the package in Debian?
My last question is: given the answers to the previous questions, what do we do
with the R packages that are already in the archive and contain data that is
editable as is but does have an original source? Who will do the work, and
what is the timeline in case of inaction?
Also, since the case of pictures has been discussed, here is a parallel
between R objects and PNG files.
1) In the PNG file's metadata, there is a field that can indicate, for instance,
that it was made by Inkscape. However, from the presence of that field, one
cannot tell whether the SVG source still exists, whether it exists only on the
computer of a contributor, or whether the upstream developers decided to
discard it.
2) If a program displays an image in PNG format and does not use its SVG
source, one can regret that the source is not available, but this does not
prevent anyone from editing the PNG, or even replacing it entirely.
3) One could consider scanning the Debian archive for PNG files made with
Inkscape with no corresponding SVG file in the source package. Would such
packages be non-free? If yes, how long would you wait before removing the
package?
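
To make point 3 concrete, here is a rough sketch of what such a scan could look
like for one unpacked source tree. It is only an illustration, under several
assumptions of mine: that the field in question is a PNG tEXt chunk with the
keyword "Software", that its value mentions "inkscape", and that the
"corresponding" SVG would sit next to the PNG with the same base name. The
function names are hypothetical, and compressed or international text chunks
(zTXt, iTXt) are ignored.

```python
import struct
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_text_chunks(data):
    """Return the keyword/value pairs of the tEXt chunks in raw PNG bytes."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    fields = {}
    pos = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if ctype == b"tEXt":
            # tEXt data is "keyword NUL text", both in Latin-1.
            keyword, _, text = body.partition(b"\x00")
            fields[keyword.decode("latin-1")] = text.decode("latin-1")
        if ctype == b"IEND":
            break
        pos += 12 + length
    return fields


def inkscape_pngs_without_svg(root):
    """List PNGs under root that mention Inkscape and have no sibling SVG."""
    hits = []
    for png in sorted(Path(root).rglob("*.png")):
        try:
            fields = png_text_chunks(png.read_bytes())
        except ValueError:
            continue  # file has a .png suffix but is not a PNG
        # Assumption: the creating program is recorded under "Software".
        software = fields.get("Software", "")
        if "inkscape" in software.lower() and not png.with_suffix(".svg").exists():
            hits.append(png)
    return hits
```

Note that an empty result would not prove anything either way: the field is
optional, and an SVG with a different name or location would be missed.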
While writing this answer, I also read Don's email advocating for Debian to
take the lead and change the current practice in the R community, which prefers
to distribute data as R binary objects in the source packages. This is
laudable, but I expect that it will take time, and it needs people who have
roots in both communities.
In the current situation, which I describe as "active bitrotting", we do not
apply the same rules to the packages that enter the archive and the packages
that are already in it, which causes the packages under active development to
become obsolete each time new dependencies cannot enter Debian. Given the
rotten tomatoes thrown at my face because I can no longer update the
r-cran-ggplot2 package, I do not feel up to the task of negotiating with the
R community to change its traditions.
In any case, I think that we need clear guidelines that help to foresee whether
an R package is acceptable or not in Debian, so that we can better decide
whether to undertake the work at all.
Currently, my take would be to move packages to non-free. This would also
allow us to ship the PDF documentation that we currently delete.
Cheers,
--
Charles Plessy
Debian Med packaging team,
http://www.debian.org/devel/debian-med
Tsurumi, Kanagawa, Japan