[Debian-salsa-ci] Question about artifacts and ccache
Otto Kekäläinen
otto at debian.org
Wed Sep 18 01:33:45 BST 2024
Hi!
Welcome to the team, looking forward to your contributions!
On Tue, 17 Sept 2024 at 15:25, Andrea Pappacoda via Debian-salsa-ci
<debian-salsa-ci at alioth-lists.debian.net> wrote:
>
> Hi all!
>
> I've been working on what I've described in
> <https://salsa.debian.org/salsa-ci-team/pipeline/-/issues/296#note_519376>,
> i.e., changing the pipeline to use a dsc instead of an extracted
> package.
>
> This of course requires touching pretty much all of the pipeline files,
> and over time I've accumulated some questions which I'd like to ask
> you, since you obviously know more about the pipeline than I do.
>
> 1. How are artifacts handled? From what I've gathered looking at the
> code, it seems that everything contained in the "$WORKING_DIR"
> directory gets zipped by Salsa as an artifact, which is then
> available for all the other jobs. Is this right?
>
> If it is, I find this a bit suboptimal. The artifacts represent the
> inputs and outputs of the various jobs, and if you think about jobs
> as functions, you should be very careful about what you take as input
> and what you return as output.
>
> I'd propose instead splitting the artifacts directory and the working
> directory into two well-separated variables, so that files only end up
> in the artifacts deliberately, via explicit moves/copies.
> This would also reduce the risk of hitting the artifacts size limit.
The artifacts are basically the build results: source code, binary
packages and build logs.
See example https://salsa.debian.org/mariadb-team/mariadb-10.5/-/jobs/6295411/artifacts/browse/debian/output/
The artifacts are then consumed by any later job that lists the build
job in its 'needs:', which pulls in both the dependency ordering and
the artifacts.
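In GitLab CI terms that looks roughly like this (a minimal sketch; the
job names and paths are illustrative, not the exact pipeline code):

  build:
    stage: build
    script:
      - ./debian/rules binary        # writes results under $WORKING_DIR
    artifacts:
      paths:
        - $WORKING_DIR

  autopkgtest:
    stage: test
    needs:
      - job: build
        artifacts: true              # build results get unpacked into this job
    script:
      - ls $WORKING_DIR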
I don't see a problem with this, as only the jobs that need the
artifacts depend on them, but I would gladly review an improvement
suggestion.
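If you want to prototype the split you describe, it could be as small
as one new variable plus an explicit copy step in each job. A sketch,
where ARTIFACTS_DIR is a made-up name:

  variables:
    WORKING_DIR: $CI_PROJECT_DIR/work         # scratch space, never uploaded
    ARTIFACTS_DIR: $CI_PROJECT_DIR/artifacts  # hypothetical, the only dir uploaded

  build:
    script:
      # build steps unchanged, still writing into $WORKING_DIR
      - mkdir -p "$ARTIFACTS_DIR"
      - cp "$WORKING_DIR"/*.deb "$WORKING_DIR"/*.changes "$ARTIFACTS_DIR"/
    artifacts:
      paths:
        - artifacts/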
> 2. How is ccache set up? Looking at the code, I really cannot wrap my
> head around it. ccache files are first deleted, then created, then
> ccache is installed, then some other files are deleted, then ccache is
> set up... It's a bit of a mess. I don't know much about ccache (I get
> the concept, but have never actually used it), but it definitely looks
> hacky to me. Also, is it working? Does it save significant time?
> Otherwise, I'd drop it in favour of simplicity. There are other parts
> of the pipeline which can be sped up (like I recently did in
> <https://salsa.debian.org/salsa-ci-team/pipeline/-/merge_requests/537>).
ccache speeds up C/C++-heavy projects remarkably (on the order of 10x),
so we should definitely keep it.
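For reference, the core of a ccache setup in GitLab CI is quite small,
even if our current code obscures it. A generic sketch (not the exact
Salsa CI code):

  variables:
    CCACHE_DIR: $CI_PROJECT_DIR/.ccache
  cache:
    key: ccache-$CI_JOB_NAME
    paths:
      - .ccache/
  before_script:
    - apt-get install -y ccache            # if not already in the image
    - export PATH=/usr/lib/ccache:$PATH    # Debian's compiler symlinks for ccache
    - ccache --zero-stats
  after_script:
    - ccache --show-stats                  # the hit rate shows whether it saves time

Checking the --show-stats output in a build log is also the easiest way
to verify it is actually working.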
> Another thing I find a bit suboptimal about the current pipeline setup
> is that builders repeatedly call apt update, apt upgrade, and apt
> install. I've observed that most of these calls are simply redundant,
> while others might be done in advance when building the images, so that
> they get run only once.
Yes, the images the pipeline uses are already rebuilt every 24h, so
running 'apt update' in all of them is probably unnecessary.
It also makes container versioning moot, since any old version of the
container will always self-update anyway. For example, if there is a
regression and I want to run the pipeline with an image from, say, one
week ago to see whether the regression already existed then, I cannot:
even the old image would just update itself, and the regression (if
introduced by a new package version) would be visible in all runs.
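One way out would be to make the in-job update/upgrade opt-out, so that
a pinned old image stays frozen. A sketch, with a hypothetical variable
name that does not exist in the pipeline today:

  before_script:
    - |
      # SALSA_CI_DISABLE_APT_UPDATE is made up for this example
      if [ "${SALSA_CI_DISABLE_APT_UPDATE:-no}" != "yes" ]; then
        apt-get update && apt-get upgrade -y
      fi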
The background to all your questions is, I think, the current slight
mess we have in the code base. Salsa CI has been expanding organically
for many years now, and it is hard for new contributors to reason about
what everything does. What could help here is if you teamed up with
Ahmed to work on
https://salsa.debian.org/salsa-ci-team/pipeline/-/merge_requests/528,
splitting the shell scripts out of the yaml files into separate scripts
that can, if needed, even be run locally to reproduce individual
steps/jobs. Such a refactoring would make the pipeline orchestration
code in the yaml files smaller and easier to understand, and the
separate scripts would be easier to test for input/output in various
situations.
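As a concrete illustration, a job definition would then shrink to
something like this (the scripts/ layout is hypothetical):

  build:
    stage: build
    script:
      - ./scripts/build.sh   # same steps as today, but also runnable locally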
- Otto