[Git][qa/jenkins.debian.net][master] 6 commits: djm: reduce wait time between rebooting nodes

Holger Levsen (@holger) gitlab at salsa.debian.org
Wed Jul 5 22:51:18 BST 2023



Holger Levsen pushed to branch master at Debian QA / jenkins.debian.net


Commits:
f3add0b0 by Holger Levsen at 2023-07-05T23:32:07+02:00
djm: reduce wait time between rebooting nodes

Signed-off-by: Holger Levsen <holger at layer-acht.org>

- - - - -
275e4dd9 by Holger Levsen at 2023-07-05T23:32:45+02:00
node health check: refactoring

Signed-off-by: Holger Levsen <holger at layer-acht.org>

- - - - -
fd0539ea by Holger Levsen at 2023-07-05T23:37:12+02:00
node health check: also try to restart failed munin-node services

Signed-off-by: Holger Levsen <holger at layer-acht.org>

- - - - -
a62a0f70 by Holger Levsen at 2023-07-05T23:38:27+02:00
jenkins monitoring: run system and node health checks less frequently

Signed-off-by: Holger Levsen <holger at layer-acht.org>

- - - - -
b1c14191 by Holger Levsen at 2023-07-05T23:49:59+02:00
reproducible Debian: treat more postgresql errors as serious (for some jobs)

Signed-off-by: Holger Levsen <holger at layer-acht.org>

- - - - -
680a3c08 by Holger Levsen at 2023-07-05T23:50:53+02:00
update TODO, mention o4+5 in THANKS

Signed-off-by: Holger Levsen <holger at layer-acht.org>

- - - - -


7 changed files:

- THANKS.head
- TODO
- bin/djm
- bin/reproducible_common.sh
- bin/reproducible_node_health_check.sh
- job-cfg/reproducible.yaml
- logparse/reproducible-extra.rules


Changes:

=====================================
THANKS.head
=====================================
@@ -39,9 +39,11 @@ link:https://jenkins.debian.net/["jenkins.debian.net"] would not be possible wit
  * link:https://letsencrypt.org[Let's encrypt] provides free of charge SSL certificates for jenkins.debian.net, reproducible.debian.net and tests.reproducible-builds.org.
  * In December 2018 we were given access to eight nodes which were donated by Facebook to the GCC Compile Farm project and are now hosted by link:https://osuosl.org/[OSUOSL] which each had 32 cores with 144 GB memory. Those machines have been retired now and OSUOSL offered different machines to us:
  * In spring 2023 we got access to some new nodes hosted by link:https://osuosl.org/[OSUOSL]:
-  ** 16 cores with 125 GB memory for osuosl1-amd64.reproducible.osuosl.org used for building Arch Linux, OpenWrt, coreboot and NetBSD for t.r-b.o
-  ** 16 cores with 125 GB memory for osuosl2-amd64.reproducible.osuosl.org used for building Arch Linux, OpenWrt, coreboot for t.r-b.o
-  ** 16 cores with 125 GB memory for osuosl3-amd64.reproducible.osuosl.org used for building Debian live, Debian bootstrapping jobs, Debian janitor jobs, mmdebstrap-jenkins jobs and openqa.d.n workers
+  ** 16 cores with 128 GB memory for osuosl1-amd64.reproducible.osuosl.org used for building Arch Linux, OpenWrt, coreboot and NetBSD for t.r-b.o
+  ** 16 cores with 128 GB memory for osuosl2-amd64.reproducible.osuosl.org used for building Arch Linux, OpenWrt, coreboot for t.r-b.o
+  ** 16 cores with 128 GB memory for osuosl3-amd64.reproducible.osuosl.org used for building Debian live, Debian bootstrapping jobs, Debian janitor jobs, mmdebstrap-jenkins jobs and openqa.d.n workers
+  ** osuosl4
+  ** osuosl5
 
 ==== Past sponsors
 


=====================================
TODO
=====================================
@@ -18,18 +18,17 @@ See link:https://jenkins.debian.net/userContent/about.html["about jenkins.debian
 
 == General ToDo
 
-* extend /etc/rc.local to do cleanup of lockfiles:
-** rm /var/cache/pbuilder/*tgz.tmp
 * run all bash scripts with set -eu and set -o pipefail: http://redsymbol.net/articles/unofficial-bash-strict-mode/
 ** add -o pipefail to all at once first. that should have less fallout that -u.
 ** though -u is also very nice. it will catch typos.
 
 === nodes at OSUOSL
 
-* mention o4+5 in THANKS and explain usage. mention facebook in past sponsors.
+* mention o4+5 in THANKS and explain usage.
 * mv snapshot.r-b.o from osuosl4 to osuosl5
 ** setup xfs on o5, then copy snapshot over
 * rebuilder on o4
+* jenkins backup on o5 (see below)
 
 === 2023 things
 
@@ -42,11 +41,16 @@ See link:https://jenkins.debian.net/userContent/about.html["about jenkins.debian
 ** maybe: rm /var/lib/schroot/unpack/d-i-manual* older than 5 days
 ** maybe: rm /tmp/mmdebstrap.* older than 3 days
 * djm:
+** move fetching logs at the end
+** write short djm.README, explain at least djm all d nt and djm --check-setup
+** option: --no-fetch (--local? maybe)
+** also maybe make --no-fetch default unless overwritten by config? (eg vagrant & mattia hardly ever trigger jobs via UI, while holger does...)
+** option: --check-setup to check whether one can login as user to all hosts and su to root (except jenkins where root login is expected)
 ** option: -r -y => report for year X
+** option: --no-new-xterm or some such
 ** option: -a/--action (default/implicit/optional), requiring $1 $2 $3 params...
 ** option: -t/--today to be used with action shell (and maybe others)
 ** option: --yolo/--dont-wait4-enter, default being wait for enter.
-** option: --no-fetch (--local? maybe)
 ** make special TARGETs . and jenkins implicit for those actions requiring that target only (jenkins-ui, jenkins-restart, etc)
 ** action: rk / remove-oldest-kernel
 ** action: sm / shell-monitor


=====================================
bin/djm
=====================================
@@ -552,7 +552,7 @@ djm_do() {
 		# action
 		#
 		case $ACTION in
-			reboot)	( ssh $NODE "sudo reboot || ( echo press enter ; read a ) " || true ) & sleep 2
+			reboot)	( ssh $NODE "sudo reboot || ( echo press enter ; read a ) " || true ) & sleep 1
 				run_xterm2wait4node_comeback
 				;;
 			powercycle)	case $SHORTNODE in


=====================================
bin/reproducible_common.sh
=====================================
@@ -8,6 +8,9 @@
 # included by all reproducible_*.sh scripts, so be quiet
 set +x
 
+# running in the future is easier when we know the real time...
+real_year=2023
+
 # postgres database definitions
 export PGDATABASE=reproducibledb
 


=====================================
bin/reproducible_node_health_check.sh
=====================================
@@ -92,10 +92,8 @@ fi
 #
 # check for correct future
 #
-# (XXX: yes this is hardcoded but meh…)
 echo "$(date -u) - testing whether the time is right..."
 get_node_information "$HOSTNAME"
-real_year=2023
 year=$(date +%Y)
 if "$NODE_RUN_IN_THE_FUTURE"; then
 	if [ "$year" -eq "$real_year" ]; then
@@ -165,7 +163,7 @@ if ! systemctl is-system-running > /dev/null; then
 	echo "$(date -u) - problematic services found:"
 	cat $SERVICES
 	echo "$(date -u) - trying to fix problematic services."
-	for UNIT in avahi-daemon acpid rtkit-daemon networking systemd-journal-flush haveged e2scrub_all apt-daily apt-daily-upgrade logrotate man-db; do
+	for UNIT in avahi-daemon acpid rtkit-daemon networking systemd-journal-flush haveged e2scrub_all apt-daily apt-daily-upgrade logrotate man-db munin-node; do
 		if grep -q $UNIT $SERVICES ; then
 			echo "$(date -u) - restarting failed service $UNIT..."
 		        sudo systemctl restart $UNIT

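Note on the hunk above: the loop restarts a unit whenever its name appears anywhere in the file of problematic services. The following is a minimal sketch of just that selection step, not the script's actual code. The function name `units_to_restart` and its argument layout are hypothetical, and the restart itself is left out so the sketch runs without root.

```shell
#!/bin/sh
# Hypothetical helper (not part of reproducible_node_health_check.sh):
# print every unit from the allow-list whose name occurs in the file
# listing problematic services. Arguments: FILE UNIT [UNIT...]
units_to_restart() {
    services_file=$1
    shift
    for unit in "$@"; do
        # same kind of test as in the hunk: a plain grep on the unit name,
        # which is a substring match, not an exact match
        if grep -q -- "$unit" "$services_file"; then
            echo "$unit"
        fi
    done
}
```

Because the match is by substring, a failed apt-daily-upgrade unit also selects apt-daily, so both would be restarted in that case.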

=====================================
job-cfg/reproducible.yaml
=====================================
@@ -291,7 +291,7 @@
             my_task:
                 - 'node_health_check':
                     my_description: 'Do some health checks.'
-                    my_timed: 'H/20 * * * *'
+                    my_timed: 'H/30 * * * *'
                     my_recipients: ''
                     my_timeout: '15'
             my_shell: '/srv/jenkins/bin/reproducible_node_health_check.sh'
@@ -350,7 +350,7 @@
             my_task:
                 - 'node_health_check':
                     my_description: 'Do some health checks'
-                    my_timed: 'H/15 * * * *'
+                    my_timed: 'H/30 * * * *'
                     my_recipients: ''
                     my_timeout: '15'
             my_hname:
@@ -510,7 +510,7 @@
                     my_shellext: ".py"
                 - 'system_health':
                     my_description: 'calculate overall tests.r-b.o system health for usage with https://github.com/jelly/reproduciblebuilds-display/'
-                    my_timed: 'H/15 * * * *'
+                    my_timed: 'H/59 * * * *'
                 - 'html_dashboard':
                     my_description: 'Generate HTML dashboard with graphs for reproducible builds.'
                     my_timed: '1 * * * *'


=====================================
logparse/reproducible-extra.rules
=====================================
@@ -9,5 +9,5 @@ warning /skipping.+because we're missing compilers for.+/
 
 # list of errors here
 error /ERROR:  permission denied for table.+/
-
-
+error /ERROR:  column .+ does not exist/
+error /ERROR:  relation .+ does not exist/

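Note on the rules above: the two added patterns flag PostgreSQL messages about missing columns and relations as errors, alongside the existing permission rule. The sketch below approximates the three error rules with `grep -E` to show which log lines they would catch. The function name `classify_line` and the sample log lines are hypothetical, and the real rules are evaluated by the Jenkins log parser, whose regex dialect may differ from grep's in details.

```shell
#!/bin/sh
# Hypothetical helper: print "error" if a log line matches one of the three
# error rules from reproducible-extra.rules, otherwise print "ok".
# Note the two spaces after "ERROR:", as in the rules themselves.
classify_line() {
    if printf '%s\n' "$1" | grep -Eq 'ERROR:  (permission denied for table.+|column .+ does not exist|relation .+ does not exist)'; then
        echo error
    else
        echo ok
    fi
}

# Sample lines (made up for illustration)
classify_line 'psql: ERROR:  relation "stats_builds" does not exist'
classify_line 'psql: ERROR:  column "suite" does not exist'
classify_line 'INFO: nothing to see here'
```

A line with only one space after "ERROR:" would not match, since the patterns spell out both spaces.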


View it on GitLab: https://salsa.debian.org/qa/jenkins.debian.net/-/compare/374c225595b5c8278678474b0eb771d72cc60590...680a3c084ede210831347bc1676f7f9b5c07db6b
