Bug#1053668: diffoscope: Consider using `file -i` as fallback for unknown file output

Sun Oct 15 23:06:31 BST 2023

* Chris Lamb <lamby at debian.org> [2023-10-11 14:51]:
> Niels Thykier wrote:
> 
> > Digging a bit deeper, it turns out that `file -i` correctly classifies 
> > the changelog as `text/plain; charset=utf-8`.  That is, `file` knows it 
> > is text and I suspect `diffoscope` should try `file -i` as well when it 
> > gets an unknown result from `file`.
> 
> By "unknown result" I assume you mean that diffoscope cannot match
> the file type with any known comparator. :)  Indeed, diffoscope
> doesn't recognise the bogus "Message Sequence Chart" so it falls
> back to using a hexdump as you intuited.

I would argue that this is a bug in file(1): Magdir/communications
uses a plain "string" test, which is meant for binary files.  If the
format is text rather than binary, the entry should force a text file
test by using "string/t" instead.

That said, this is likely not the only such bug (I already encountered
one before [1]), so the suggestion below makes sense to me.
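
To make the `file -i` fallback concrete, here is a rough sketch of how
I picture it.  This is not diffoscope's code: parse_mime() and
looks_like_text() are names I made up for illustration, and I am
assuming `file --brief --mime` prints a single line such as
"text/plain; charset=utf-8".

```python
import subprocess


def parse_mime(output):
    """Split 'text/plain; charset=utf-8' into ('text/plain', 'utf-8').

    The charset is None when file(1) did not report one.
    """
    mime_type, _, params = output.strip().partition(";")
    charset = None
    for param in params.split(";"):
        key, _, value = param.strip().partition("=")
        if key == "charset" and value:
            charset = value
    return mime_type.strip(), charset


def looks_like_text(path):
    """Hypothetical fallback check, meant to run only after the normal
    file(1) description failed to match any comparator."""
    result = subprocess.run(
        ["file", "--brief", "--mime", path],
        capture_output=True,
        text=True,
        check=True,
    )
    mime_type, _charset = parse_mime(result.stdout)
    return mime_type == "text/plain"
```

The charset is parsed but deliberately not used for the decision, so
that the check stays limited to the MIME type.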

> I've got some WIP code that will treat unknown file types as text if
> they have a MIME type of text/plain. This avoids the use of hexdump
> with the examples you sent over at least.
> 
> Do you think I should be further limiting that conditional to a
> whitelist of safe encodings, too? (eg. "utf-8" and "us-ascii", etc.)

I don't think we need to handle encodings differently from how we
already handle files that file(1) identifies as text.  The TextFile
comparator tries to guess the encoding and falls back to a hexdump
when that fails, e.g. for euc-jp encoded files, which
File.guess_encoding() reports as "unknown-8bit", resulting in a
LookupError from codecs.open().
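
The LookupError is easy to demonstrate outside diffoscope.  The sketch
below is only an illustration, with usable_encoding() being a name I
made up: it asks Python's codec registry whether it knows a given
encoding name, using codecs.lookup() rather than codecs.open() so that
no file is needed.

```python
import codecs


def usable_encoding(name):
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
    except LookupError:
        # Raised for names such as "unknown-8bit", which do not
        # correspond to any Python codec.
        return False
    return True
```

With this, usable_encoding("utf-8") and usable_encoding("us-ascii")
return True, while usable_encoding("unknown-8bit") returns False.  That
failure is what already sends us to the hexdump today, so an explicit
list of safe encodings would not add much.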

- Fay

[1] https://mailman.astron.com/pipermail/file/2023-February/001132.html