Bug#848049: diffoscope: Add detection of order-only differences in plain text formats

Sun Dec 25 14:17:21 UTC 2016

Jeremy,
Thank you for sharing that!
The reason why I did not use some algorihm like that is that it requires to
read files for the second time. Right now, all the actual work with the
content of the files (except for the quick check for has_same_content) is
delegated to diff, and on big files, it occupies most of the time. Assuming
that for big files, reading them from drive would be the bottleneck, I
tried to avoid reading them again, instead working with the result of the
diff.
Still, I would be happily mistaken. I will implement your version and
compare the performance.
Thank you again :)

Maria Glukhova

вс, 25 дек. 2016, 13:37 Jérémy Bobbio <lunar at debian.org>:

> Маша Глухова:
> > I believe the attached patch would provide the requested functionality.
>
> Nice work! :)
>
> > From: Maria Glukhova <siamezzze at gmail.com>
> > Date: Sat, 24 Dec 2016 12:29:57 +0200
> > Subject: [PATCH] Add detection of order-only difference in plain text
> format.
> >
> > Detect if the text files' contents differ only in line ordering, and
> give an appropriate comment.
> > […]
> > +def order_only_difference(unified_diff):
> > +    diff_lines = unified_diff.splitlines()
> > +    added_lines = [line[1:] for line in diff_lines if
> line.startswith('+')]
> > +    removed_lines = [line[1:] for line in diff_lines if
> line.startswith('-')]
> > +    # Faster check: does number of lines match?
> > +    if len(added_lines) != len(removed_lines):
> > +        return False
> > +    # Counter stores line and number of its occurrences.
> > +    return sorted(added_lines) == sorted(removed_lines)
>
> I guess it's a fine approach to the problem, but I wonder if it would
> not be better to use a slightly less accurate strategy that would be
> nicer to memory and CPU.
>
> What I have in mind is to incrementally compute a hash value that would
> give the same result even if the lines are in different order.
>
> Drawing from discussions on StackOverflow [1], I think doing a sum of
> Python's hash() would work. My test was:
>
>     def unordered_hash(lines):
>         h = 0
>         for line in lines:
>             h += hash(line)
>         return h
>
>     h1 = unordered_hash(open('tests/data/text_order1').readlines())
>     h2 = unordered_hash(open('tests/data/text_order2').readlines())
>     print(h1, h2, h1 == h2)
>
> That way, it could probably be implemented directly in the difference
> module and work for other file types than just text files.
>
>  [1]:
> https://stackoverflow.com/questions/30734848/order-independant-hash-algorithm
>
> --
> Lunar                                .''`.
> lunar at debian.org                    : :Ⓐ  :  # apt-get install anarchism
>                                     `. `'`
>                                       `-
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.alioth.debian.org/pipermail/reproducible-builds/attachments/20161225/6372eee9/attachment.html>