Bug#941569: RFS: sentencepiece/0.1.83+dfsg-1 [ITP] -- Unsupervised text tokenizer for Neural Network-based text generation
Mo Zhou
lumin at debian.org
Thu Oct 3 05:16:53 BST 2019
Hi NOKUBI,
Thank you for working on this.
Although it may sound boring or even frustrating, the licensing of data
used for training machine learning models, and of pre-trained machine
learning models, has to be handled carefully.
Your copyright file is incomplete:
https://bitbucket.org/tsuchm/pkg-sentencepiece/src/master/debian/copyright
At least some files in the data/ directory are not Apache-2.0 licensed:
https://github.com/google/sentencepiece/blob/master/data/botchan.txt#L1-L12
https://github.com/google/sentencepiece/blob/master/data/Scripts.txt#L1-L13
I'm also wondering whether the Japanese poetry book is free (I don't
speak Japanese, but judging from the "Chinese characters" in the text
I guess it's a poetry book):
https://raw.githubusercontent.com/google/sentencepiece/master/data/wagahaiwa_nekodearu.txt
Its publisher is 青空文庫 (Aozora Bunko). Please confirm the copyright
information for this book and its DFSG compliance.
When a source package contains DFSG-incompatible material, a common
practice in Debian is to strip those components from the original
tarball and append +dfsg to the upstream version string. However,
data-driven applications can become useless once the training data is
removed...
This is an awkward difficulty, or rather a practical conflict, between
the free software world and the academic machine learning
(computational linguistics) community.
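
To make the stripping convention above concrete, here is a minimal
sketch in Python. It is only an illustration: the function names and
the exclusion pattern are hypothetical, and which files actually need
to be removed is for the maintainer to decide after checking the
licenses. In practice this is normally done by listing the paths under
Files-Excluded in debian/copyright and letting uscan repack the tarball.

```python
import fnmatch
import tarfile


def dfsg_version(upstream_version: str) -> str:
    """Append the +dfsg suffix to an upstream version string."""
    return upstream_version + "+dfsg"


def is_excluded(path: str, patterns: list[str]) -> bool:
    """Match a tarball member against glob patterns, ignoring the top-level dir."""
    # "pkg-1.0/data/foo.txt" -> "data/foo.txt"
    _, _, relative = path.partition("/")
    return any(fnmatch.fnmatchcase(relative, p) for p in patterns)


def repack(src: tarfile.TarFile, dst: tarfile.TarFile,
           patterns: list[str]) -> list[str]:
    """Copy every member of src into dst except those matching patterns.

    Returns the names of the members that were left out.
    """
    removed = []
    for member in src.getmembers():
        if is_excluded(member.name, patterns):
            removed.append(member.name)
            continue
        fileobj = src.extractfile(member) if member.isfile() else None
        dst.addfile(member, fileobj)
    return removed
```

With an upstream version of 0.1.83, dfsg_version("0.1.83") + "-1" gives
0.1.83+dfsg-1, which is the form used in the subject of this RFS.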
Besides, the packaging of tensorflow is stalled, as it's difficult
to tame the 4.5 million lines of code without a usable build system.
For a long time users (including myself) have had to depend, at least
to some extent, on third-party ecosystems, and will have to until the
day Google starts to rethink distribution integration (basically
hopeless).
Apart from the science team, you are welcome to join the deep learning
team as well: https://salsa.debian.org/deeplearning-team
(it's an informal team)
On 2019-10-03 02:37, NOKUBI Takatsugu wrote:
> On Wed, 02 Oct 2019 14:52:23 +0900,
> Kentaro Hayashi wrote:
>> * Vcs : https://salsa.debian.org/debian/sentencepiece
>
> It contains tensorflow binding, so I think it will be good to belong
> with Debian Science Team.
>
> I, hayashi-san, and tsuchiya-san sent requests to join the team.
> tsuchiya-san also maintained it himself, so I'll merge them into
> the salsa repository.
>
> https://bitbucket.org/tsuchm/pkg-sentencepiece/src/master/