Bug#1053668: diffoscope: Consider using `file -i` as fallback for unknown file output

Niels Thykier niels at thykier.net
Wed Oct 11 16:57:45 BST 2023


Chris Lamb:
> Niels Thykier wrote:
> 
>> Digging a bit deeper, it turns out that `file -i` correctly classifies
>> the changelog as `text/plain; charset=utf-8`.  That is, `file` knows it
>> is text and I suspect `diffoscope` should try `file -i` as well when it
>> gets an unknown result from `file`.
> 
> By "unknown result" I assume you mean that diffoscope cannot match
> the file type with any known comparator. :)  Indeed, diffoscope
> doesn't recognise the bogus "Message Sequence Chart" so it falls
> back to using a hexdump as you intuited.
> 

Correct.

> I've got some WIP code that will treat unknown file types as text if
> they have a MIME type of text/plain. This avoids the use of hexdump
> with the examples you sent over at least.
> 

Sounds good. :)
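
For illustration, the fallback could look roughly like the sketch below. This
is not the WIP code mentioned above and not diffoscope's actual API; the
function names `parse_mime_output` and `looks_like_text` are made up for the
example. It assumes a `file` binary that accepts `--brief --mime` (the long
form of `-b -i` in the common Linux `file`).

```python
import subprocess


def parse_mime_output(output):
    """Split output such as 'text/plain; charset=utf-8' into (mime, charset).

    Returns None for the charset when no charset parameter is present.
    """
    mime_type, _, params = output.strip().partition(";")
    charset = None
    for param in params.split(";"):
        key, _, value = param.strip().partition("=")
        if key == "charset" and value:
            charset = value
    return mime_type.strip(), charset


def looks_like_text(path):
    """Hypothetical helper: True if `file` reports `path` as text/plain.

    Assumption: `file --brief --mime` is available on PATH.
    """
    result = subprocess.run(
        ["file", "--brief", "--mime", path],
        capture_output=True,
        text=True,
        check=True,
    )
    mime_type, _charset = parse_mime_output(result.stdout)
    return mime_type == "text/plain"
```

A caller would only consult `looks_like_text` after the normal description
from `file` failed to match any comparator, and would fall back to hexdump
when it returns False.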

> Do you think I should be further limiting that conditional to a
> whitelist of safe encodings, too? (eg. "utf-8" and "us-ascii", etc.)
> 
> 
> Regards,
> 

I am not sure what to do here.  Maybe you want to normalize the encoding
first for more reliable diffs.  If one side is utf-8 and the other is
utf-16 or something more exotic, normalizing the encoding before the diff
might produce more readable output, as long as diffoscope notes the
encoding difference somewhere.  But honestly, I feel I should defer to
your experience on this particular corner case.
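
As a rough sketch of the normalization idea, assuming the charset of each
side is already known (for example from `file -i`) and is a name Python's
codecs understand: decode both sides to text, record any encoding difference
as a separate note, and diff the decoded lines. The name `normalized_diff`
is hypothetical and this is not how diffoscope is implemented.

```python
import difflib


def normalized_diff(data_a, charset_a, data_b, charset_b):
    """Diff two byte strings after decoding each with its own charset.

    Returns (notes, diff_lines). `notes` records an encoding mismatch so
    that the difference is still reported even when the decoded text is
    identical. Decoding may raise LookupError or UnicodeDecodeError for
    charsets such as 'binary' or 'unknown-8bit'; a real caller would need
    to handle that, e.g. by falling back to a hexdump.
    """
    notes = []
    if charset_a.lower() != charset_b.lower():
        notes.append(f"Encoding differs: {charset_a} vs {charset_b}")
    lines_a = data_a.decode(charset_a).splitlines(keepends=True)
    lines_b = data_b.decode(charset_b).splitlines(keepends=True)
    diff = list(difflib.unified_diff(lines_a, lines_b, "a", "b"))
    return notes, diff
```

With this approach, the same text stored once as utf-8 and once as utf-16
yields an empty content diff plus a single note about the encoding, rather
than a diff in which every line differs.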

Best regards,
Niels


