Bug#828941: licensecheck: use binwalk to parse binary blobs?

Andrej Shadura andrew.shadura at collabora.co.uk
Wed Nov 13 13:29:13 GMT 2019


On Wed, 29 Jun 2016 08:24:43 +0000 (UTC) Gianfranco Costamagna
<locutusofborg at debian.org> wrote:
> Source: licensecheck
> Severity: wishlist
> Version: 3.0.0-1
> 
> Hi, as discussed on irc, it might be useful to use binwalk (now with a Python library/binding),
> to spot what is hidden/embedded into binary blobs, and then use the correct tool to search
> for copyrights/licenses.
> 
> Or, as Jonas suggested on irc, use it when in --strict mode, and the parse failed, to let
> the user know what was containing the blob, to better understand why licensecheck failed to parse it.
> 
> Pabs suggested hachoir tool

Using hachoir directly may slow licensecheck down significantly, but
perhaps licensecheck could invoke it optionally, only when asked to.
Jonas (in a private email) said the slowdown is already quite serious,
so this path needs to be taken with caution.

However, I have this idea.

My personal issue with licensecheck is that it tries to parse binary
files, but parses them as text, thus dumping huge lumps of binary junk
into the generated copyright file:

Files: ./data/icons/hicolor/48x48/apps/com.github.maoschanz.drawing.png
 ./help/C/figures/icon.png
Copyright: ^@CC Attribution-ShareAlike
http:creativecommons.org/licenses/by-sa/4.0/ÃTb^E^@^@^E<U+0091>IDATh<U+0081>í<U+0098>kl^TU^X<U+0086><U+009F>3{ë^h<U+008B>
  n¸¤¼Ñ^E;{Ö<%¥Ì^Qz'ü<U+0083>E<U+0082>ë½<U+009D>
  r^RtÅ|вxÒk³Ú ð¼
License: CC-BY-SA
 FIXME

This output is useless, wrong, and it takes a lot of time to generate
since binary files are typically bigger than text files.

What if we tried to detect binary files before parsing them?
A very dumb algorithm would:
1) Check the first few bytes (8, 16, 32, whatever is sensible) for the
magic sequences of file formats that might contain copyright/license
metadata, e.g. PNG, JPEG, SVG… (we need to keep this list short)
2) If a known format is detected, parse it in a format-specific way;
Perl seems to have a lot of modules for that
3) If nothing is found but the file looks binary (TBD how we detect
this), use hachoir or whatever is suitable if available, otherwise say
UNKNOWN
4) Never dump binary stuff
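
To make steps 1 and 3 a bit more concrete, here is a rough sketch in
Python (not Perl, and not licensecheck code). The names sniff() and
MAGIC are made up for illustration, and the "looks binary" test (a NUL
byte, or bytes that are not valid UTF-8) is only one assumed answer to
the TBD above. SVG is left out of the table because it is XML text with
no fixed magic sequence, so it would need a different check.

```python
import codecs

# Assumed short list of magic sequences; the real list is still to be decided.
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}


def sniff(head: bytes) -> str:
    """Classify the first bytes of a file as a known format, 'binary' or 'text'."""
    for magic, kind in MAGIC.items():
        if head.startswith(magic):
            return kind
    # Assumed heuristic: text files do not normally contain NUL bytes.
    if b"\x00" in head:
        return "binary"
    # An incremental decoder tolerates a multi-byte character that was cut
    # off at the end of the sample, but still rejects invalid sequences.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(head)
    except UnicodeDecodeError:
        return "binary"
    return "text"


def sniff_file(path: str, size: int = 32) -> str:
    """Read only the first `size` bytes of `path` and classify them."""
    with open(path, "rb") as handle:
        return sniff(handle.read(size))
```

Only the first few bytes are read, so the check itself should stay cheap
even for large blobs. A result of "binary" would then either go to an
optional heavier tool or be reported as UNKNOWN, as in step 3.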

At the very least, a filter that strips non-ASCII junk from
binary-looking files would be very useful.
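
Such a filter could look roughly like this. Again a Python sketch with
made-up names (strip_non_ascii, printable_runs), and the minimum run
length of 4 is an arbitrary assumption borrowed from how strings(1)
behaves by default.

```python
import re


def strip_non_ascii(raw: bytes) -> str:
    """Keep only printable ASCII bytes plus tab and newline."""
    kept = bytes(b for b in raw if 0x20 <= b <= 0x7E or b in (0x09, 0x0A))
    return kept.decode("ascii")


def printable_runs(raw: bytes, min_len: int = 4) -> list[str]:
    """Return runs of at least `min_len` printable ASCII characters."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [run.decode("ascii") for run in re.findall(pattern, raw)]
```

The plain filter still leaves short fragments behind wherever a few
printable bytes happen to sit between binary ones, so keeping only
longer runs is probably the more useful variant for the Copyright field.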

-- 
Cheers,
  Andrej


