Bug#1053668: diffoscope: Consider using `file -i` as fallback for unknown file output
Niels Thykier
niels at thykier.net
Wed Oct 11 16:57:45 BST 2023
Chris Lamb:
> Niels Thykier wrote:
>
>> Digging a bit deeper, it turns out that `file -i` correctly classifies
>> the changelog as `text/plain; charset=utf-8`. That is, `file` knows it
>> is text and I suspect `diffoscope` should try `file -i` as well when it
>> gets an unknown result from `file`.
>
> By "unknown result" I assume you mean that diffoscope cannot match
> the file type with any known comparator. :) Indeed, diffoscope
> doesn't recognise the bogus "Message Sequence Chart" so it falls
> back to using a hexdump as you intuited.
>
Correct.
> I've got some WIP code that will treat unknown file types as text if
> they have a MIME type of text/plain. This avoids the use of hexdump
> with the examples you sent over at least.
>
Sounds good. :)
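
For illustration, here is a minimal sketch of how such a fallback could look. It is not Chris's WIP code and does not use diffoscope's internals; the helper names `parse_mime`, `looks_like_text` and `query_mime` are hypothetical. It only assumes that `file --brief --mime` (the long form of `file -b -i`) prints something like `text/plain; charset=utf-8`.

```python
import subprocess


def parse_mime(file_output):
    """Split output such as 'text/plain; charset=utf-8' into (type, charset)."""
    mime_type, _, params = file_output.strip().partition(";")
    charset = None
    for param in params.split(";"):
        key, _, value = param.strip().partition("=")
        if key == "charset" and value:
            charset = value
    return mime_type.strip(), charset


def looks_like_text(file_output):
    """Return True when the reported MIME type is text/plain."""
    mime_type, _ = parse_mime(file_output)
    return mime_type == "text/plain"


def query_mime(path):
    """Ask file(1) for the MIME description of path (hypothetical helper)."""
    result = subprocess.run(
        ["file", "--brief", "--mime", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A caller would only reach `looks_like_text(query_mime(path))` after no comparator matched the regular `file` description, and would keep the hexdump fallback for everything else.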
> Do you think I should be further limiting that conditional to a
> whitelist of safe encodings, too? (eg. "utf-8" and "us-ascii", etc.)
>
>
> Regards,
>
I am not sure what to do here. Maybe you want to normalize the encoding
first for more reliable diffs. If one side is utf-8 and the other is
utf-16 or something more exotic, normalizing the encoding before the diff
might produce more readable diffs, as long as diffoscope notes the
encoding difference somewhere. But honestly, I feel I should defer to
your experience on this particular corner case.
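
To make the idea concrete, a rough sketch of "normalize first, but still report the encoding" might look like the following. The function name `compare_normalized` is made up for this example, and it assumes the charset names reported by `file -i` are ones Python's codecs understand (that holds for utf-8, utf-16 and us-ascii; a value like `unknown-8bit` would need separate handling).

```python
import difflib


def compare_normalized(raw_a, charset_a, raw_b, charset_b):
    """Decode both sides to text, then diff the text.

    Returns (notes, diff_lines). The notes record an encoding
    difference so that it is not lost by the normalization.
    """
    notes = []
    if charset_a.lower() != charset_b.lower():
        notes.append(f"Encoding differs: {charset_a} vs {charset_b}")

    # bytes.decode raises LookupError for charset names Python does not know
    # and UnicodeDecodeError for invalid input; a caller could fall back to
    # a hexdump in either case.
    text_a = raw_a.decode(charset_a)
    text_b = raw_b.decode(charset_b)

    diff_lines = list(
        difflib.unified_diff(
            text_a.splitlines(),
            text_b.splitlines(),
            lineterm="",
        )
    )
    return notes, diff_lines
```

With this approach, two files with identical content in different encodings yield an empty text diff plus a single note about the encoding, instead of a hexdump in which every byte differs.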
Best regards,
Niels