Bug#925294: Does not work without extra downloads

Mo Zhou lumin at debian.org
Tue Mar 26 14:11:05 GMT 2019


control: tags -1 +wontfix

Hi Enrico,

> It would have been an entirely different story if the datasets that nltk
> needs were also packaged in Debian, so that it could have worked out of
> the box.

I totally understand your preference, and I also prefer libraries that
work out of the box without network access. However, recent advances in
computational linguistics (also known as natural language processing)
rely more and more on machine learning, including deep learning, i.e.
deep neural networks.

NLTK's data tarball includes various datasets (corpora) and pre-trained
models, and I suspect some of them are highly copyright-problematic.
The freedom status of pre-trained neural networks is quite involved, and
there is still no clear conclusion on it. (Topic: https://lwn.net/Articles/760142/)

To make nltk-data DFSG-compatible, we would need someone with NLP
experience[1] to review all the contents, including the pre-trained
models. Models trained on non-DFSG datasets would have to be removed
as well. I could do this myself, but the workload does not seem
worthwhile. I also use spaCy[2] in my research work, which likewise
needs pre-trained blobs. Anyone who wants to package spacy-models will
run into similar licensing problems.

> I am extremely reluctant to run unreviewed code that downloads random
> data from the internet in some unspecified way, and does unspecified
> things with it, to the point that I decided to give up using the library
> altogether.

If you trust Arch Linux's signing keys, you can download the data from:
https://www.archlinux.org/packages/community/any/nltk-data/
That's the best suggestion I have (and it is what I actually do myself).

> CC: debian-devel
Since this bug relates to the pre-trained neural network problem.

For NEW software, these keywords usually imply a high probability of
"non-DFSG blobs inside": computational linguistics, natural language
processing, computer vision, XXX (e.g. machine, deep, reinforcement,
supervised, unsupervised) learning, artificial intelligence.
My current attitude is to avoid packaging any related data packages.

[1] Experience really speeds up copyright review.
[2] This might have been packaged and uploaded by Andreas.
