[DRE-maint] Bug#853244: ruby-sshkit: Non-determistically FTBFS due to unreliable timing in tests

Wed Apr 17 12:49:06 BST 2019

tags 853244 + patch
thanks

On Mon, 6 Feb 2017, Chris Lamb wrote:

> The problem is that this package would need a *specific* note/mention
> that it should be ignored. Whilst we could certainly add that for
> Reproducible Builds, what happens, for example, when the next person
> runs a rebuild across the archive to test for something else (etc. etc.)

Hi. I'm afraid I am that next person...

This package FTBFS randomly for me on START1-S / DEV1-S instances from
Scaleway, and the failure rate is around 30% there.

I've put a bunch of build logs here:

https://people.debian.org/~sanvila/build-logs/ruby-sshkit/

and I've set up a Jenkins job here in case you want to see it failing
in real time:

https://jenkins-2.reliable-builds.org/job/ruby-sshkit/

Even if our packages are built "officially" on buildd.debian.org,
this does not mean that it is a good idea to "target" the specific
(or even approximate) CPU speed of buildd.debian.org autobuilders.

The end user must be able to build packages without failures as well,
because, after all, we don't just provide binary packages, we also
provide the freedom to modify them at will.

This is the reason why I would consider this bug "serious" as well.

In fact, imagine what would happen if I wanted to build all packages
from source and all upstream authors did the same (i.e. targeted a
specific CPU speed, making the package FTBFS 30% of the time on some
systems). Out of 30000 source packages I would have to retry 9000 of
them. Of those, 2700 would fail again on the second try, and I would
still have to keep retrying in a geometrically decreasing way until all
packages are successfully built. This certainly does not scale.
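
To make the retry arithmetic above concrete, here is a minimal sketch
in Ruby. It assumes every package fails independently with the same
probability, which is a simplification; the helper names
expected_failures and expected_total_builds are made up for this
illustration.

```ruby
# Expected number of packages still failing after each build round,
# assuming every package fails independently with the same probability.
def expected_failures(packages, failure_rate, rounds)
  (1..rounds).map { |n| packages * failure_rate**n }
end

# Expected total number of builds if every failure is retried until it
# succeeds: the geometric series packages * (1 + p + p^2 + ...).
def expected_total_builds(packages, failure_rate)
  packages / (1 - failure_rate)
end

# Rational avoids floating-point rounding for the 30% failure rate.
rate = Rational(3, 10)

expected_failures(30_000, rate, 4).each_with_index do |count, i|
  puts "after round #{i + 1}: #{count.to_i} packages left to retry"
end
puts "expected builds in total: #{expected_total_builds(30_000, rate).round}"
```

Under these assumptions the first two rounds leave 9000 and then 2700
packages to retry, matching the figures above, and the whole archive
needs roughly 43% more builds than packages.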


Anyway, let's see if we can improve the status of this problem in buster:

There are four tests which fail over and over again in my setup. If I
grep for "FAIL" in the build logs above and count occurrences, this is
what I get:

  23 test_the_connection_manager_can_run_things_with_custom_runner_configs
  21 test_the_connection_manager_can_run_things_in_groups
  15 test_the_connection_manager_can_run_things_in_custom_runner
  10 test_the_connection_manaager_runs_things_in_parallel_by_default
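
For anyone who wants to repeat the count, the tally can be sketched in
Ruby roughly as follows. The helper name count_failures and the sample
log lines are hypothetical; the real build logs may format failures
differently, so the "FAIL" match and the test-name pattern are
assumptions to adjust.

```ruby
# Count how often each test name appears on a line containing "FAIL".
# Returns [name, count] pairs, most frequent first.
def count_failures(lines)
  counts = Hash.new(0)
  lines.each do |line|
    next unless line.include?("FAIL")
    name = line[/\btest_\w+/]   # first test-style identifier on the line
    counts[name] += 1 if name
  end
  counts.sort_by { |name, n| [-n, name] }
end

# Hypothetical sample input, not copied from the real logs.
sample = [
  "FAIL test_example_a (0.02s)",
  "PASS test_example_b (0.01s)",
  "FAIL test_example_a (0.03s)",
  "FAIL test_example_c (0.01s)",
]

count_failures(sample).each { |name, n| puts format("%4d %s", n, name) }
```

To run it over real logs, pass the lines of each downloaded log file
(for example via File.foreach) instead of the sample array.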
      
So, I strongly suggest disabling those tests for buster, since they
seem to be the ones that fail most often.

The patch below (warning: untested) may help.

Thanks.

--- a/test/unit/test_coordinator.rb
+++ b/test/unit/test_coordinator.rb
@@ -43,12 +43,6 @@ module SSHKit
       assert_equal "Command: echo 1.example.com\n", actual_output_commands.last
     end
 
-    def test_the_connection_manaager_runs_things_in_parallel_by_default
-      Coordinator.new(%w{1.example.com 2.example.com}).each(&echo_time)
-      assert_equal 2, actual_execution_times.length
-      assert_within_10_ms(actual_execution_times)
-    end
-
     def test_the_connection_manager_can_run_things_in_sequence
       Coordinator.new(%w{1.example.com 2.example.com}).each in: :sequence, &echo_time
       assert_equal 2, actual_execution_times.length
@@ -68,46 +62,6 @@ module SSHKit
       end
     end
 
-    def test_the_connection_manager_can_run_things_in_custom_runner
-      begin
-        $original_runner = SSHKit.config.default_runner
-        SSHKit.config.default_runner = MyRunner
-
-        Coordinator.new(%w{1.example.com 2.example.com}).each(&echo_time)
-        assert_equal 2, actual_execution_times.length
-        assert_within_10_ms(actual_execution_times)
-        assert_match(/custom runner out/, @output)
-      ensure
-        SSHKit.config.default_runner = $original_runner
-      end
-    end
-
-    def test_the_connection_manager_can_run_things_with_custom_runner_configs
-      begin
-        $original_runner = SSHKit.config.default_runner
-        SSHKit.config.default_runner = :groups
-        $original_runner_config = SSHKit.config.default_runner_config
-        SSHKit.config.default_runner_config = { limit: 2, wait: 5 }
-
-        Coordinator.new(
-          %w{
-            1.example.com
-            2.example.com
-            3.example.com
-            4.example.com
-          }
-        ).each(&echo_time)
-        assert_equal 4, actual_execution_times.length
-        assert_within_10_ms(actual_execution_times[0..1])
-        assert_within_10_ms(actual_execution_times[2..3])
-        assert_at_least_5_sec_apart(actual_execution_times[0], actual_execution_times[2])
-        assert_at_least_5_sec_apart(actual_execution_times[1], actual_execution_times[3])
-      ensure
-        SSHKit.config.default_runner = $original_runner
-        SSHKit.config.default_runner_config = $original_runner_config
-      end
-    end
-
     def test_the_connection_manager_can_run_things_in_sequence_with_wait
       start = Time.now
       Coordinator.new(%w{1.example.com 2.example.com}).each in: :sequence, wait: 10, &echo_time
@@ -115,25 +69,6 @@ module SSHKit
       assert_operator(stop - start, :>=, 10.0)
     end
 
-    def test_the_connection_manager_can_run_things_in_groups
-      Coordinator.new(
-        %w{
-          1.example.com
-          2.example.com
-          3.example.com
-          4.example.com
-          5.example.com
-          6.example.com
-        }
-      ).each in: :groups, &echo_time
-      assert_equal 6, actual_execution_times.length
-      assert_within_10_ms(actual_execution_times[0..1])
-      assert_within_10_ms(actual_execution_times[2..3])
-      assert_within_10_ms(actual_execution_times[4..5])
-      assert_at_least_1_sec_apart(actual_execution_times[1], actual_execution_times[2])
-      assert_at_least_1_sec_apart(actual_execution_times[3], actual_execution_times[4])
-    end
-
     private
 
     def assert_at_least_1_sec_apart(first_time, last_time)