[Debian-med-packaging] We need a global decision about R data in binary format, and stick to it.

Don Armstrong don at debian.org
Tue Aug 6 00:44:16 UTC 2013


On Tue, 06 Aug 2013, Charles Plessy wrote:
> My first question is: to what extent do we need to verify that the
> object can be regenerated.
> 
>   - The starting point is a source package with a R binary object.
>   - With this starting point only, it may be impossible to know if it
>   has a source or not. 
[...]
>   Unless the answer can be found on the Internet, one has to ask the
>   author directly.
>   - If we have to ask, how long do we need to wait for the answer, and
>   what is the conclusion in case there is no answer.

If there is any doubt, we should ask upstream. If we get no answer, we
should use our best judgment as to the likely case. A non-responsive
upstream should also make us question whether we should be distributing
the package at all.
 
> My second question is: to what extent do we need the source.
> 
>   - When the R binary object is a table that has been generated by
>   hand, my understanding is that it does not matter whatever format
>   Upstream prefers, since it is trivial for anybody to export the R
>   object into his favorite format for modification.

The original table in any form is source, then. But if the table has
been altered afterwards, we should distribute those alterations as
well. In many cases, you take the original raw data and then alter it.
If the code to do that exists, we should take the original raw data and
apply the alterations ourselves. [This should really be standard
operating procedure for all modules in R, because otherwise it is very
difficult to reproduce your alterations when the data turns out to be
wrong or new data arrives.]
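
[A minimal sketch of that workflow, written in Python purely for
illustration; in an actual R package the equivalent would be an R
script run from the raw data. The gene table, the alter() step, and
every name below are hypothetical and not taken from any real package.
The idea is only that the derived table is rebuilt from the raw data
and compared with what would otherwise ship as an opaque binary
object.]

```python
import csv
import io

# Hypothetical raw data as upstream might provide it (the actual "source").
RAW_CSV = """gene,count
geneA,10
geneB,0
geneC,25
"""


def load_raw(text):
    """Parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def alter(rows):
    """Hypothetical alteration step: convert counts to int, drop zero counts."""
    return [
        {"gene": row["gene"], "count": int(row["count"])}
        for row in rows
        if int(row["count"]) > 0
    ]


def regenerate(text):
    """Rebuild the derived table from the raw data."""
    return alter(load_raw(text))


# Stand-in for the derived table that would otherwise ship only in binary form.
shipped = [
    {"gene": "geneA", "count": 10},
    {"gene": "geneC", "count": 25},
]

derived = regenerate(RAW_CSV)
matches = derived == shipped
print("regenerated table matches shipped object:", matches)
```

[If the comparison fails, either the raw data or the alteration code we
were given is not the real source of the shipped object.]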

>   - When the data in the R binary object has been produced by
>   processing another data file, to what point do we need to go
>   backwards ? This is an important question, because at the end of the
>   chain of rebuildability, there can be gigabytes of data.

This is a far more difficult case, but if this data exists and can be
digitally distributed, Debian should have it and distribute it. Perhaps
not in the source package, but almost certainly in a data package
somewhere. [And honestly, there are very few interesting R packages
which we can actually distribute where this is really the case. I can't
think of any we currently distribute, and the main ones I can think of
involve databases of sequences for microarrays, and there you actually
want the complete data anyway.]

>   - When the source of the binary object is not strictly necessary for
>   making relevant modifications, can we distribute the package in
>   Debian ?

If the source isn't strictly necessary, we should remove the binary
object, and distribute the package.
 
> My last question is, given the answers to the previous questions, what
> do we do with the R packages that are already in the archive and also
> contain data that is editable as is but do have an original source,
> who will do it, and what is the timeline in case of inaction.

The package maintainer should handle it; in the case of inaction from
upstream, the package maintainer can then remove the data, split the
package, move the package to non-free, or remove the package from
Debian entirely. The timeline should be the standard one that is used
for all RC bugs.

> In the current situation, that I describe as "active bitrotting", we
> do not apply the same rules to the packages that enter the archive and
> the packages that are already in, which causes the packages under
> active development to become obsolete each time new dependencies
> cannot enter Debian.

We actually do and should apply the same rules. Sometimes violations of
the rules are missed for a while, though, and we have to come back and
file bugs with severity serious to deal with the problem.

> Currently, my take would be to move packages to non-free. This would
> also allow us to ship the PDF documentation that we currently delete.

In these cases, we should split the package out into a non-free
component and a free component.

I should note that I'm currently distributing via debian-r.debian.net a
few hundred packages which probably have this particular problem too.

-- 
Don Armstrong                      http://www.donarmstrong.com

in Just-
spring      when the world is mud-
luscious the little lame balloonman

whistles       far       and wee 
 -- e.e. cummings "[in Just-]"


