Bug#941569: RFS: sentencepiece/0.1.83+dfsg-1 [ITP] -- Unsupervised text tokenizer for Neural Network-based text generation

Mo Zhou lumin at debian.org
Thu Oct 3 05:16:53 BST 2019


Hi NOKUBI,

Thank you for working on this.
Although it may sound boring or even frustrating, the data used for
training machine learning models, as well as pre-trained models
themselves, must be handled carefully with respect to licensing.

Your copyright file is not complete:
https://bitbucket.org/tsuchm/pkg-sentencepiece/src/master/debian/copyright

At least two files in the data/ directory are not Apache-2.0 licensed:
https://github.com/google/sentencepiece/blob/master/data/botchan.txt#L1-L12
https://github.com/google/sentencepiece/blob/master/data/Scripts.txt#L1-L13

I'm also wondering whether the Japanese poetry book is free
(I don't speak Japanese, but from the "Chinese characters" within the
text I guess it's a poetry book):
https://raw.githubusercontent.com/google/sentencepiece/master/data/wagahaiwa_nekodearu.txt
Its publisher is 青空文庫 (Aozora Bunko). Please confirm the copyright
information for this book and its DFSG compliance.
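
A rough way to find more such files is to look at the head of every
file in data/ for a licence notice. The following is only a sketch
under my own assumptions: the function name files_without_marker, the
marker string "Apache License" and the 20-line window are all made up
for illustration. A hit or a miss proves nothing about the real
licence; it only tells you which files need to be read by hand.

```python
from pathlib import Path


def files_without_marker(root, marker="Apache License", head_lines=20):
    """Return files under root whose first head_lines lines lack marker.

    Hypothetical helper: the marker and the window size are assumptions,
    not a rule from Debian policy or from the sentencepiece sources.
    """
    suspects = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Read only the head of the file; data files can be very large.
        with path.open("r", encoding="utf-8", errors="replace") as fh:
            head = "".join(line for _, line in zip(range(head_lines), fh))
        if marker.lower() not in head.lower():
            suspects.append(str(path.relative_to(root)))
    return suspects
```

Calling files_without_marker("data") on an unpacked source tree would
then list the files whose headers deserve a manual look before
debian/copyright is written.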

When there is DFSG-incompatible material in a source package, a common
practice in Debian is to strip those components from the original
tarball and append +dfsg to the upstream version string. However,
data-driven applications can become useless when the training data is
removed... This is an awkward difficulty, or rather a practical
conflict, between the free software world and the academic machine
learning (computational linguistics) community.
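
To make the repacking practice concrete, here is a small sketch. In
real packaging this is normally done by listing patterns in a
Files-Excluded field in debian/copyright and letting uscan repack the
tarball; the Python below only imitates the idea. The helper names
dfsg_version, keep_member and repacked_members, and the example
pattern, are hypothetical and not taken from the sentencepiece
packaging.

```python
from fnmatch import fnmatch


def dfsg_version(upstream, debian_revision="1", suffix="+dfsg"):
    """Build the Debian version string for a repacked upstream tarball."""
    return f"{upstream}{suffix}-{debian_revision}"


def keep_member(name, excluded_patterns):
    """Return True if a tarball member matches none of the patterns."""
    return not any(fnmatch(name, pattern) for pattern in excluded_patterns)


def repacked_members(names, excluded_patterns):
    """Return the member names that would survive the repack, in order."""
    return [name for name in names if keep_member(name, excluded_patterns)]


if __name__ == "__main__":
    # Illustrative input only; which files must go is a licensing decision.
    members = ["src/normalizer.cc", "data/botchan.txt", "README.md"]
    print(dfsg_version("0.1.83"))
    print(repacked_members(members, ["data/*.txt"]))
```

With upstream version 0.1.83 this yields the version from the subject
line, 0.1.83+dfsg-1. Whether the remaining tree is still useful without
the stripped data is exactly the problem described above.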

Besides, the packaging of tensorflow is stalled, as it's difficult
to tame the 4.5 million lines of code without a usable build system.
For a long time users (including myself) have had to (somewhat)
depend on third-party ecosystems, and will until the day Google starts
to rethink distribution integration (basically hopeless).

Apart from the science team, you are welcome to join the deep learning
team as well: https://salsa.debian.org/deeplearning-team
(it's an informal team)

On 2019-10-03 02:37, NOKUBI Takatsugu wrote:
> On Wed, 02 Oct 2019 14:52:23 +0900,
> Kentaro Hayashi wrote:
>>  * Vcs             : https://salsa.debian.org/debian/sentencepiece
> 
> It contains tensorflow binding, so I think it will be good to belong
> with Debian Science Team.
> 
> I, hayashi-san, and tsuchiya-san sent requests to join the team.
> tsuchiya-san also maintained it himself, so I'll merge them into
> the salsa repository.
> 
> https://bitbucket.org/tsuchm/pkg-sentencepiece/src/master/



More information about the debian-science-maintainers mailing list