Bug#848049: diffoscope: Add detection of order-only differences in plain text formats

Jérémy Bobbio lunar at debian.org
Sun Dec 25 11:37:21 UTC 2016


Маша Глухова:
> I believe the attached patch would provide the requested functionality.

Nice work! :)

> From: Maria Glukhova <siamezzze at gmail.com>
> Date: Sat, 24 Dec 2016 12:29:57 +0200
> Subject: [PATCH] Add detection of order-only difference in plain text format.
> 
> Detect if the text files' contents differ only in line ordering, and give an appropriate comment.
> […]
> +def order_only_difference(unified_diff):
> +    diff_lines = unified_diff.splitlines()
> +    added_lines = [line[1:] for line in diff_lines if line.startswith('+')]
> +    removed_lines = [line[1:] for line in diff_lines if line.startswith('-')]
> +    # Faster check: does number of lines match?
> +    if len(added_lines) != len(removed_lines):
> +        return False
> +    # Counter stores line and number of its occurrences.
> +    return sorted(added_lines) == sorted(removed_lines)

I guess it's a fine approach to the problem, but I wonder if it would
not be better to use a slightly less accurate strategy that would be
nicer to memory and CPU.

What I have in mind is to incrementally compute a hash value that would
give the same result even if the lines are in different order.

Drawing from discussions on StackOverflow [1], I think doing a sum of
Python's hash() would work. My test was:

    def unordered_hash(lines):
        h = 0
        for line in lines:
            h += hash(line)
        return h

    h1 = unordered_hash(open('tests/data/text_order1').readlines())
    h2 = unordered_hash(open('tests/data/text_order2').readlines())
    print(h1, h2, h1 == h2)

That way, it could probably be implemented directly in the difference
module and work for other file types than just text files.

 [1]: https://stackoverflow.com/questions/30734848/order-independant-hash-algorithm

-- 
Lunar                                .''`. 
lunar at debian.org                    : :Ⓐ  :  # apt-get install anarchism
                                    `. `'` 
                                      `-   
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: Digital signature
URL: <http://lists.alioth.debian.org/pipermail/reproducible-builds/attachments/20161225/e212d5de/attachment.sig>


More information about the Reproducible-builds mailing list