Bug#1053668: diffoscope: Consider using `file -i` as fallback for unknown file output
FC Stegerman
flx at obfusk.net
Sun Oct 15 23:06:31 BST 2023
* Chris Lamb <lamby at debian.org> [2023-10-11 14:51]:
> Niels Thykier wrote:
>
> > Digging a bit deeper, it turns out that `file -i` correctly classifies
> > the changelog as `text/plain; charset=utf-8`. That is, `file` knows it
> > is text and I suspect `diffoscope` should try `file -i` as well when it
> > gets an unknown result from `file`.
>
> By "unknown result" I assume you mean that diffoscope cannot match
> the file type with any known comparator. :) Indeed, diffoscope
> doesn't recognise the bogus "Message Sequence Chart" so it falls
> back to using a hexdump as you intuited.

I would argue that this is a bug in file(1) as Magdir/communications
uses a "string" test, which is for binary files. If this is a text
file, not a binary format, it should be forcing a text file test by
using "string/t" instead.

That said, this is likely not the only such bug (I already encountered
one before [1]), so the suggestion below makes sense to me.

> I've got some WIP code that will treat unknown file types as text if
> they have a MIME type of text/plain. This avoids the use of hexdump
> with the examples you sent over at least.
>
> Do you think I should be further limiting that conditional to a
> whitelist of safe encodings, too? (eg. "utf-8" and "us-ascii", etc.)

I don't think we need to handle encodings differently from how we
already handle files identified as text by file(1): the TextFile
comparator tries to guess the encoding, but falls back to a hexdump
for e.g. euc-jp encoded files which are identified as "unknown-8bit"
by File.guess_encoding(), resulting in a LookupError from
codecs.open().
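
To illustrate why no separate whitelist seems necessary, here is a
minimal sketch (not diffoscope's actual code; can_decode_as_text is a
hypothetical helper made up for this mail). It assumes the encoding
name is whatever file(1) reported, and it relies only on Python
raising LookupError for a codec name it does not know:

```python
import codecs


def can_decode_as_text(data: bytes, encoding: str) -> bool:
    """Hypothetical helper: True if data decodes with the given encoding.

    False stands in for "fall back to a hexdump" in this sketch.
    """
    try:
        # Python has no codec called "unknown-8bit", so this lookup
        # raises LookupError, the same error codecs.open() would raise.
        codecs.lookup(encoding)
        data.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        return False
    return True
```

With this, an encoding name Python cannot resolve is rejected by the
same error path as before, so the text/plain fallback would not need
its own list of safe encodings.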

- Fay

[1] https://mailman.astron.com/pipermail/file/2023-February/001132.html