Bug#848049: diffoscope: Add detection of order-only differences in plain text formats
Jérémy Bobbio
lunar at debian.org
Sun Dec 25 11:37:21 UTC 2016
Маша Глухова:
> I believe the attached patch would provide the requested functionality.
Nice work! :)
> From: Maria Glukhova <siamezzze at gmail.com>
> Date: Sat, 24 Dec 2016 12:29:57 +0200
> Subject: [PATCH] Add detection of order-only difference in plain text format.
>
> Detect if the text files' contents differ only in line ordering, and give an appropriate comment.
> […]
> +def order_only_difference(unified_diff):
> + diff_lines = unified_diff.splitlines()
> + added_lines = [line[1:] for line in diff_lines if line.startswith('+')]
> + removed_lines = [line[1:] for line in diff_lines if line.startswith('-')]
> + # Faster check: does number of lines match?
> + if len(added_lines) != len(removed_lines):
> + return False
> + # Counter stores line and number of its occurrences.
> + return sorted(added_lines) == sorted(removed_lines)
I guess it's a fine approach to the problem, but I wonder if it would
not be better to use a slightly less accurate strategy that would be
nicer to memory and CPU.
What I have in mind is to incrementally compute a hash value that would
give the same result even if the lines are in different order.
Drawing from discussions on StackOverflow [1], I think doing a sum of
Python's hash() would work. My test was:
def unordered_hash(lines):
h = 0
for line in lines:
h += hash(line)
return h
h1 = unordered_hash(open('tests/data/text_order1').readlines())
h2 = unordered_hash(open('tests/data/text_order2').readlines())
print(h1, h2, h1 == h2)
That way, it could probably be implemented directly in the difference
module and work for other file types than just text files.
[1]: https://stackoverflow.com/questions/30734848/order-independant-hash-algorithm
--
Lunar .''`.
lunar at debian.org : :Ⓐ : # apt-get install anarchism
`. `'`
`-
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: Digital signature
URL: <http://lists.alioth.debian.org/pipermail/reproducible-builds/attachments/20161225/e212d5de/attachment.sig>
More information about the Reproducible-builds
mailing list