[med-svn] [Git][med-team/libatomic-queue][master] 14 commits: New upstream version 1.6.4
Étienne Mollier (@emollier)
gitlab@salsa.debian.org
Fri Aug 23 12:24:08 BST 2024
Étienne Mollier pushed to branch master at Debian Med / libatomic-queue
Commits:
af66280a by Étienne Mollier at 2024-08-23T12:13:35+02:00
New upstream version 1.6.4
- - - - -
418317ce by Étienne Mollier at 2024-08-23T12:13:44+02:00
Update upstream source from tag 'upstream/1.6.4'
Update to upstream version '1.6.4'
with Debian dir 218b6b058b0dd02417f4850fe5a1b688abcd3221
- - - - -
cb918b88 by Étienne Mollier at 2024-08-23T12:25:13+02:00
generate-shared-library.patch: refresh.
- - - - -
664bed07 by Étienne Mollier at 2024-08-23T12:26:15+02:00
no-native: refresh patch.
- - - - -
0a718316 by Étienne Mollier at 2024-08-23T12:27:05+02:00
no_thin_archives.patch: refresh.
- - - - -
455b97e9 by Étienne Mollier at 2024-08-23T12:27:28+02:00
compiler.patch: unfuzz.
- - - - -
07490788 by Étienne Mollier at 2024-08-23T12:27:37+02:00
fix_unused_variable.patch: unfuzz.
- - - - -
907ffb0c by Étienne Mollier at 2024-08-23T12:49:03+02:00
generate-shared-library.patch: fixup build bug and dep3 header.
Gbp-Dch: ignore
- - - - -
eeab79ed by Étienne Mollier at 2024-08-23T13:03:36+02:00
d/rules: remove workaround for gcc 13 bug.
- - - - -
c9ad7483 by Étienne Mollier at 2024-08-23T13:03:59+02:00
d/s/lintian-overrides: flag another source missing false positive.
- - - - -
497e2b15 by Étienne Mollier at 2024-08-23T13:06:32+02:00
d/rules: remove workaround against dh_strip failure.
- - - - -
ef689eaf by Étienne Mollier at 2024-08-23T13:19:29+02:00
d/rules: delete file landing in standard top directory.
- - - - -
e02baf24 by Étienne Mollier at 2024-08-23T13:19:54+02:00
d/control: deduplicate Section: libs field.
- - - - -
88e1fe9e by Étienne Mollier at 2024-08-23T13:21:51+02:00
Ready for upload to unstable.
- - - - -
25 changed files:
- − .github/workflows/c-cpp.yml
- + .github/workflows/ci.yml
- + .github/workflows/cmake-gcc-clang.yml
- + CMakeLists.txt
- + CONTRIBUTORS.txt
- Makefile
- README.md
- debian/changelog
- debian/control
- debian/patches/compiler.patch
- debian/patches/fix_unused_variable.patch
- debian/patches/generate-shared-library.patch
- debian/patches/no-native
- debian/patches/no_thin_archives.patch
- debian/rules
- debian/source/lintian-overrides
- html/benchmarks.html
- + include/CMakeLists.txt
- include/atomic_queue/atomic_queue.h
- include/atomic_queue/defs.h
- scripts/benchmark-prologue.sh
- + src/CMakeLists.txt
- src/benchmarks.cc
- src/huge_pages.h
- src/tests.cc
Changes:
=====================================
.github/workflows/c-cpp.yml deleted
=====================================
@@ -1,27 +0,0 @@
-name: Ubuntu continuous integration
-
-on:
- push:
- branches: [ master ]
- pull_request:
- branches: [ master ]
-
-jobs:
- build:
-
- runs-on: ubuntu-20.04
-
- steps:
- uses: actions/checkout@v3
- - name: Install Boost.Test
- run: sudo apt-get --quiet --yes install libboost-test-dev
- - name: Environment variables
- run: make env; make TOOLSET=gcc versions; make TOOLSET=clang versions
- - name: Unit tests with gcc
- run: make -rj2 TOOLSET=gcc example run_tests
- - name: Unit tests with gcc thread sanitizer
- run: make -rj2 TOOLSET=gcc BUILD=sanitize run_tests
- - name: Unit tests with clang
- run: make -rj2 TOOLSET=clang example run_tests
- - name: Unit tests with clang thread sanitizer
- run: make -rj2 TOOLSET=clang BUILD=sanitize run_tests
=====================================
.github/workflows/ci.yml
=====================================
@@ -0,0 +1,39 @@
+name: Makefile Continuous Integrations
+
+on:
+ push:
+ branches: [ master ]
+ pull_request:
+ branches: [ master ]
+
+jobs:
+ unit-test:
+ strategy:
+ matrix:
+ toolset: [gcc, clang]
+ os: [ubuntu-20.04, ubuntu-22.04, ubuntu-24.04]
+ include:
+ - sanitize: 1
+ - os: ubuntu-20.04 # Work-around for https://bugs.launchpad.net/ubuntu/+source/gcc-10/+bug/2029910
+ sanitize: 0
+
+ runs-on: ${{ matrix.os }}
+
+ steps:
+ - uses: actions/checkout@v4
+
+ - name: Install Boost.Test
+ run: sudo apt-get --quiet --yes install libboost-test-dev
+
+ - name: Environment variables
+ run: make -r TOOLSET=${{ matrix.toolset }} env
+
+ - name: Toolset versions
+ run: make -r TOOLSET=${{ matrix.toolset }} versions
+
+ - name: Build and run unit tests
+ run: make -rj2 TOOLSET=${{ matrix.toolset }} example run_tests
+
+ - if: ${{ matrix.sanitize }}
+ name: Build and run unit tests with thread sanitizer
+ run: make -rj2 TOOLSET=${{ matrix.toolset }} BUILD=sanitize run_tests
=====================================
.github/workflows/cmake-gcc-clang.yml
=====================================
@@ -0,0 +1,76 @@
+# This starter workflow is for a CMake project running on multiple platforms. There is a different starter workflow if you just want a single platform.
+# See: https://github.com/actions/starter-workflows/blob/main/ci/cmake-single-platform.yml
+name: CMake Continuous Integrations
+
+on:
+ push:
+ branches: [ "master", "CMake-support"]
+ pull_request:
+ branches: [ "master" ]
+
+jobs:
+ build:
+ runs-on: ${{ matrix.os }}
+
+ strategy:
+ # Set fail-fast to false to ensure that feedback is delivered for all matrix combinations. Consider changing this to true when your workflow is stable.
+ fail-fast: false
+
+ # Set up a matrix to run the following 3 configurations:
+ # 1. <Windows, Release, latest MSVC compiler toolchain on the default runner image, default generator>
+ # 2. <Linux, Release, latest GCC compiler toolchain on the default runner image, default generator>
+ # 3. <Linux, Release, latest Clang compiler toolchain on the default runner image, default generator>
+ #
+ # To add more build types (Release, Debug, RelWithDebInfo, etc.) customize the build_type list.
+ matrix:
+ os: [ubuntu-latest] # windows-latest
+ build_type: [Release]
+ c_compiler: [gcc, clang]
+ include:
+ - os: ubuntu-latest
+ c_compiler: gcc
+ cpp_compiler: g++
+ - os: ubuntu-latest
+ c_compiler: clang
+ cpp_compiler: clang++
+ exclude:
+ - os: ubuntu-latest
+ c_compiler: cl
+
+ steps:
+ - uses: actions/checkout@v4
+
+ - name: Set reusable strings
+ # Turn repeated input strings (such as the build output directory) into step outputs. These step outputs can be used throughout the workflow file.
+ id: strings
+ shell: bash
+ run: |
+ echo "build-output-dir=${{ github.workspace }}/build" >> "$GITHUB_OUTPUT"
+
+ - name: Install boost
+ if: matrix.os == 'ubuntu-latest'
+ run: |
+ sudo apt-get install -y libboost-all-dev
+
+ - name: Configure CMake Ubuntu
+ # Configure CMake in a 'build' subdirectory. `CMAKE_BUILD_TYPE` is only required if you are using a single-configuration generator such as make.
+ # See https://cmake.org/cmake/help/latest/variable/CMAKE_BUILD_TYPE.html?highlight=cmake_build_type
+ if: matrix.os == 'ubuntu-latest'
+ run: >
+ cmake -B ${{ steps.strings.outputs.build-output-dir }}
+ -DCMAKE_CXX_COMPILER=${{ matrix.cpp_compiler }}
+ -DCMAKE_C_COMPILER=${{ matrix.c_compiler }}
+ -DCMAKE_BUILD_TYPE=${{ matrix.build_type }}
+ -DATOMIC_QUEUE_BUILD_TESTS=ON
+ -DATOMIC_QUEUE_BUILD_EXAMPLES=ON
+ -S ${{ github.workspace }}
+
+ - name: Build
+ # Build your program with the given configuration. Note that --config is needed because the default Windows generator is a multi-config generator (Visual Studio generator).
+ run: cmake --build ${{ steps.strings.outputs.build-output-dir }} --config ${{ matrix.build_type }}
+
+ - name: Test
+ working-directory: ${{ steps.strings.outputs.build-output-dir }}
+ # Execute tests defined by the CMake configuration. Note that --build-config is needed because the default Windows generator is a multi-config generator (Visual Studio generator).
+ # See https://cmake.org/cmake/help/latest/manual/ctest.1.html for more detail
+ run: ctest --build-config ${{ matrix.build_type }}
=====================================
CMakeLists.txt
=====================================
@@ -0,0 +1,30 @@
+CMAKE_MINIMUM_REQUIRED( VERSION 3.25 )
+
+PROJECT(atomic_queue VERSION 1.5.0)
+
+OPTION( ATOMIC_QUEUE_BUILD_TESTS
+ "If the tests should be built."
+ OFF
+)
+
+OPTION( ATOMIC_QUEUE_BUILD_EXAMPLES
+ "If examples should be built."
+ OFF
+)
+
+if ( PROJECT_IS_TOP_LEVEL )
+ set(CMAKE_CXX_STANDARD 14)
+ set(CMAKE_CXX_STANDARD_REQUIRED)
+endif()
+
+add_subdirectory( include )
+
+if ( ATOMIC_QUEUE_BUILD_TESTS )
+ enable_testing()
+endif()
+
+if ( ATOMIC_QUEUE_BUILD_TESTS OR ATOMIC_QUEUE_BUILD_EXAMPLES)
+ add_subdirectory( src )
+endif()
+
+add_library(max0x7ba::atomic_queue ALIAS atomic_queue)
\ No newline at end of file
=====================================
CONTRIBUTORS.txt
=====================================
@@ -0,0 +1,18 @@
+Contributors:
+
+- Jean-Michaël Celerier (https://github.com/max0x7ba/atomic_queue/pull/1)
+- Manuel Pöter (https://github.com/max0x7ba/atomic_queue/pull/9)
+- Paul Ferrand (https://github.com/max0x7ba/atomic_queue/pull/13)
+- Paul Ferrand (https://github.com/max0x7ba/atomic_queue/pull/14)
+- JP Cimalando (https://github.com/max0x7ba/atomic_queue/pull/16)
+- Cameron (https://github.com/max0x7ba/atomic_queue/pull/22)
+- Ben Beasley (https://github.com/max0x7ba/atomic_queue/pull/30)
+- Xeonacid (https://github.com/max0x7ba/atomic_queue/pull/38)
+- XieJiSS (https://github.com/max0x7ba/atomic_queue/pull/43)
+- Jan Niklas Hasse (https://github.com/max0x7ba/atomic_queue/pull/53)
+- Jan Niklas Hasse (https://github.com/max0x7ba/atomic_queue/pull/54)
+- Jonathan Wakely (https://github.com/max0x7ba/atomic_queue/pull/56)
+- Andriy06 (https://github.com/max0x7ba/atomic_queue/pull/58)
+- RedSkittleFox (https://github.com/max0x7ba/atomic_queue/pull/61)
+- RedSkittleFox (https://github.com/max0x7ba/atomic_queue/pull/62)
+- Yvan (https://github.com/max0x7ba/atomic_queue/pull/63)
=====================================
Makefile
=====================================
@@ -1,15 +1,23 @@
# Copyright (c) 2019 Maxim Egorushkin. MIT License. See the full licence in file LICENSE.
# Usage examples (assuming this directory is ~/src/atomic_queue):
-# time make -rC ~/src/atomic_queue -j8 run_benchmarks
-# time make -rC ~/src/atomic_queue -j8 TOOLSET=clang run_benchmarks
-# time make -rC ~/src/atomic_queue -j8 BUILD=debug run_tests
-# time make -rC ~/src/atomic_queue -j8 BUILD=sanitize run_tests
+#
+# time make -rC ~/src/atomic_queue -j8
+# time make -rC ~/src/atomic_queue -j8 run_benchmarks
+# time make -rC ~/src/atomic_queue -j8 TOOLSET=clang run_benchmarks
+# time make -rC ~/src/atomic_queue -j8 BUILD=debug run_tests
+# time make -rC ~/src/atomic_queue -j8 BUILD=sanitize TOOLSET=clang run_tests
+#
+# Additional CPPFLAGS, CXXFLAGS, CFLAGS, LDLIBS, LDFLAGS can come from the command line, e.g. make CPPFLAGS='-I<my-include-dir>', or from environment variables. For example, also produce assembly outputs:
+#
+# time make -rC ~/src/atomic_queue -j8 CXXFLAGS="-save-temps=obj -fverbose-asm -masm=intel"
+#
SHELL := /bin/bash
-BUILD := release
+BUILD := release
TOOLSET := gcc
+
build_dir := ${CURDIR}/build/${BUILD}/${TOOLSET}
cxx.gcc := g++
@@ -30,7 +38,7 @@ AR := ${ar.${TOOLSET}}
cxxflags.gcc.debug := -Og -fstack-protector-all -fno-omit-frame-pointer # -D_GLIBCXX_DEBUG
cxxflags.gcc.release := -O3 -mtune=native -ffast-math -falign-{functions,loops}=64 -DNDEBUG
cxxflags.gcc.sanitize := ${cxxflags.gcc.release} -fsanitize=thread
-cxxflags.gcc := -pthread -march=native -std=gnu++14 -W{all,extra,error,no-{maybe-uninitialized,unused-variable,unused-function,unused-local-typedefs}} -g -fmessage-length=0 ${cxxflags.gcc.${BUILD}}
+cxxflags.gcc := -pthread -march=native -std=gnu++14 -W{all,extra,error,no-{maybe-uninitialized,unused-variable,unused-function,unused-local-typedefs,error=array-bounds}} -g -fmessage-length=0 ${cxxflags.gcc.${BUILD}}
ldflags.gcc.sanitize := ${ldflags.gcc.release} -fsanitize=thread
ldflags.gcc := ${ldflags.gcc.${BUILD}}
@@ -44,7 +52,6 @@ ldflags.clang.sanitize := ${ldflags.clang.release} -fsanitize=thread
ldflags.clang := -stdlib=libstdc++ ${ldflags.clang.${BUILD}}
# Additional CPPFLAGS, CXXFLAGS, CFLAGS, LDLIBS, LDFLAGS can come from the command line, e.g. make CPPFLAGS='-I<my-include-dir>', or from environment variables.
-# However, a clean build is required when changing the flags in the command line or in environment variables, this makefile doesn't detect such changes.
cxxflags := ${cxxflags.${TOOLSET}} ${CXXFLAGS}
cflags := ${cflags.${TOOLSET}} ${CFLAGS}
cppflags := ${CPPFLAGS} -Iinclude
@@ -60,13 +67,24 @@ ldlibs.moodycamel :=
cppflags.xenium := -I${abspath ../xenium}
ldlibs.xenium :=
+recompile := ${build_dir}/.make/recompile
+relink := ${build_dir}/.make/relink
+
COMPILE.CXX = ${CXX} -o $@ -c ${cppflags} ${cxxflags} -MD -MP $(abspath $<)
-COMPILE.S = ${CXX} -o- -S -masm=intel ${cppflags} ${cxxflags} $(abspath $<) | c++filt | egrep -v '^[[:space:]]*\.(loc|cfi|L[A-Z])' > $@
+COMPILE.S = ${CXX} -o- -S -fverbose-asm -masm=intel ${cppflags} ${cxxflags} $(abspath $<) | c++filt | egrep -v '^[[:space:]]*\.(loc|cfi|L[A-Z])' > $@
PREPROCESS.CXX = ${CXX} -o $@ -E ${cppflags} ${cxxflags} $(abspath $<)
COMPILE.C = ${CC} -o $@ -c ${cppflags} ${cflags} -MD -MP $(abspath $<)
-LINK.EXE = ${LD} -o $@ $(ldflags) $(filter-out Makefile,$^) $(ldlibs)
-LINK.SO = ${LD} -o $@ -shared $(ldflags) $(filter-out Makefile,$^) $(ldlibs)
-LINK.A = ${AR} rscT $@ $(filter-out Makefile,$^)
+LINK.EXE = ${LD} -o $@ $(ldflags) $(filter-out ${relink},$^) $(ldlibs)
+LINK.SO = ${LD} -o $@ -shared $(ldflags) $(filter-out ${relink},$^) $(ldlibs)
+LINK.A = ${AR} rscT $@ $(filter-out ${relink},$^)
+
+ifneq (,$(findstring n,$(firstword -${MAKEFLAGS})))
+# Perform bash parameter expansion when --just-print for rtags.
+strip2 = $(shell printf '%q ' ${1})
+else
+# Unduplicate whitespace.
+strip2 = $(strip ${1})
+endif
exes := benchmarks tests example
@@ -78,28 +96,65 @@ ${exes} : % : ${build_dir}/%
benchmarks_src := benchmarks.cc cpu_base_frequency.cc huge_pages.cc
${build_dir}/benchmarks : cppflags += ${cppflags.tbb} ${cppflags.moodycamel} ${cppflags.xenium}
${build_dir}/benchmarks : ldlibs += ${ldlibs.tbb} ${ldlibs.moodycamel} ${ldlibs.xenium} -ldl
-${build_dir}/benchmarks : ${benchmarks_src:%.cc=${build_dir}/%.o} Makefile | ${build_dir}
- $(strip ${LINK.EXE})
+${build_dir}/benchmarks : ${benchmarks_src:%.cc=${build_dir}/%.o} ${relink} | ${build_dir}
+ $(call strip2,${LINK.EXE})
-include ${benchmarks_src:%.cc=${build_dir}/%.d}
tests_src := tests.cc
-${build_dir}/tests : cppflags += ${boost_unit_test_framework_inc} -DBOOST_TEST_DYN_LINK=1
+${build_dir}/tests : cppflags += -DBOOST_TEST_DYN_LINK=1
${build_dir}/tests : ldlibs += -lboost_unit_test_framework
-${build_dir}/tests : ${tests_src:%.cc=${build_dir}/%.o} Makefile | ${build_dir}
- $(strip ${LINK.EXE})
+${build_dir}/tests : ${tests_src:%.cc=${build_dir}/%.o} ${relink} | ${build_dir}
+ $(call strip2,${LINK.EXE})
-include ${tests_src:%.cc=${build_dir}/%.d}
example_src := example.cc
-${build_dir}/example : ${example_src:%.cc=${build_dir}/%.o} Makefile | ${build_dir}
- $(strip ${LINK.EXE})
+${build_dir}/example : ${example_src:%.cc=${build_dir}/%.o} ${relink} | ${build_dir}
+ $(call strip2,${LINK.EXE})
-include ${example_src:%.cc=${build_dir}/%.d}
${build_dir}/%.so : cxxflags += -fPIC
-${build_dir}/%.so : Makefile | ${build_dir}
- $(strip ${LINK.SO})
+${build_dir}/%.so : ${relink} | ${build_dir}
+ $(call strip2,${LINK.SO})
+
+${build_dir}/%.a : ${relink} | ${build_dir}
+ $(call strip2,${LINK.A})
+
+${build_dir}/%.o : src/%.cc ${recompile} | ${build_dir}
+ $(call strip2,${COMPILE.CXX})
+
+${build_dir}/%.o : src/%.c ${recompile} | ${build_dir}
+ $(call strip2,${COMPILE.C})
+
+${build_dir}/%.S : cppflags += ${cppflags.tbb} ${cppflags.moodycamel} ${cppflags.xenium}
+${build_dir}/%.S : src/%.cc ${recompile} | ${build_dir}
+ $(call strip2,${COMPILE.S})
+
+${build_dir}/%.I : cppflags += ${cppflags.tbb} ${cppflags.moodycamel} ${cppflags.xenium}
+${build_dir}/%.I : src/%.cc ${recompile} | ${build_dir}
+ $(call strip2,${PREPROCESS.CXX})
+
+${build_dir}/%.d : ;
+
+${build_dir}/.make : | ${build_dir}
+${build_dir} ${build_dir}/.make:
+ mkdir -p $@
-${build_dir}/%.a : Makefile | ${build_dir}
- $(strip ${LINK.A})
+ver = "$(shell ${1} --version | head -n1)"
+# Trigger recompilation when compiler environment change.
+env.compile := $(call ver,${CXX}) ${cppflags} ${cxxflags} ${cppflags.tbb} ${cppflags.moodycamel} ${cppflags.xenium}
+# Trigger relink when linker environment change.
+env.link := $(call ver,${LD}) ${ldflags} ${ldlibs} ${ldlibs.tbb} ${ldlibs.moodycamel} ${ldlibs.xenium}
+
+define env_txt_rule
+${build_dir}/.make/env.${1}.txt : $(shell cmp --quiet ${build_dir}/.make/env.${1}.txt <(printf "%s\n" ${env.${1}}) || echo update_env_txt) Makefile | ${build_dir}/.make
+ @printf "%s\n" ${env.${1}} >$$@
+endef
+$(eval $(call env_txt_rule,compile))
+$(eval $(call env_txt_rule,link))
+${recompile} ${relink} : ${build_dir}/.make/re% : ${build_dir}/.make/env.%.txt Makefile
+ @[[ ! -f $@ ]] || { u="$?"; echo "Re-$* is triggered by changes in $${u// /, }."; }
+ touch $@
+# cd ~/src/atomic_queue; make -r clean; set -x; make -rj8; make -rj8; make -rj8 CPPFLAGS=-DXYZ=1; make -rj8 CPPFLAGS=-DXYZ=1; make -rj8; make -rj8; make -rj8 LDLIBS=-lrt; make -rj8 LDLIBS=-lrt; make -rj8; make -rj8
run_benchmarks : ${build_dir}/benchmarks
@echo "---- running $< ----"
@@ -115,33 +170,17 @@ run_% : ${build_dir}/%
@echo "---- running $< ----"
$<
-${build_dir}/%.o : src/%.cc Makefile | ${build_dir}
- $(strip ${COMPILE.CXX})
-
-${build_dir}/%.o : src/%.c Makefile | ${build_dir}
- $(strip ${COMPILE.C})
-
-%.S : cppflags += ${cppflags.tbb} ${cppflags.moodycamel} ${cppflags.xenium}
-%.S : src/%.cc Makefile | ${build_dir}
- $(strip ${COMPILE.S})
-
-%.I : src/%.cc
- $(strip ${PREPROCESS.CXX})
-
-${build_dir} :
- mkdir -p $@
-
-rtags : clean
- ${MAKE} -nk all | rtags-rc -c -; true
+rtags :
+ ${MAKE} --always-make --just-print all | { rtags-rc -c -; true; }
clean :
rm -rf ${build_dir} ${exes}
-env :
- env | sort --ignore-case
-
versions:
- ${MAKE} --version | head -n1
+ ${MAKE} --version | awk 'FNR<2'
${CXX} --version | head -n1
-.PHONY : env versions rtags run_benchmarks clean all run_%
+env :
+ env | sort --ignore-case
+
+.PHONY : update_env_txt env versions rtags run_benchmarks clean all run_%
=====================================
README.md
=====================================
@@ -1,30 +1,45 @@
[![C++14](https://img.shields.io/badge/dialect-C%2B%2B14-blue)](https://en.cppreference.com/w/cpp/14)
[![MIT license](https://img.shields.io/github/license/max0x7ba/atomic_queue)](https://github.com/max0x7ba/atomic_queue/blob/master/LICENSE)
-![platform Linux 64-bit](https://img.shields.io/badge/platform-Linux%2064--bit-yellow)
+![Latest release](https://img.shields.io/github/v/tag/max0x7ba/atomic_queue?label=latest%20release)
+<br>
+[![Makefile Continuous Integrations](https://github.com/max0x7ba/atomic_queue/actions/workflows/ci.yml/badge.svg)](https://github.com/max0x7ba/atomic_queue/actions/workflows/ci.yml)
+[![CMake Continuous Integrations](https://github.com/max0x7ba/atomic_queue/actions/workflows/cmake-gcc-clang.yml/badge.svg)](https://github.com/max0x7ba/atomic_queue/actions/workflows/cmake-gcc-clang.yml)
+<br>
+![platform Linux x86_64](https://img.shields.io/badge/platform-Linux%20x86_64--bit-yellow)
![platform Linux ARM](https://img.shields.io/badge/platform-Linux%20ARM-yellow)
![platform Linux RISC-V](https://img.shields.io/badge/platform-Linux%20RISC--V-yellow)
-![Latest release](https://img.shields.io/github/v/tag/max0x7ba/atomic_queue?label=latest%20release)
-[![Ubuntu continuous integration](https://github.com/max0x7ba/atomic_queue/workflows/Ubuntu%20continuous%20integration/badge.svg)](https://github.com/max0x7ba/atomic_queue/actions?query=workflow%3A%22Ubuntu%20continuous%20integration%22)
+![platform Linux PowerPC](https://img.shields.io/badge/platform-Linux%20PowerPC-yellow)
+![platform Linux IBM System/390](https://img.shields.io/badge/platform-Linux%20IBM%20System/390-yellow)
# atomic_queue
-C++14 multiple-producer-multiple-consumer *lockless* queues based on circular buffer with [`std::atomic`][3].
+C++14 multiple-producer-multiple-consumer *lock-free* queues based on circular buffer and [`std::atomic`][3]. Designed with a goal to minimize the latency between one thread pushing an element into a queue and another thread popping it from the queue.
+
+It has been developed, tested and benchmarked on Linux, but should support any C++14 platforms which implement `std::atomic`. Reported as compatible with Windows, but the continuous integrations hosted by GitHub are currently set up only for x86_64 platform on Ubuntu-20.04 and Ubuntu-22.04. Pull requests to extend the [continuous integrations][18] to run on other architectures and/or platforms are welcome.
+
+## Design Principles
+When minimizing latency a good design is not when there is nothing left to add, but rather when there is nothing left to remove, as these queues exemplify.
-It has been developed, tested and benchmarked on Linux, but should support any C++14 platforms which implement `std::atomic`.
+The main design principle these queues follow is _minimalism_, which results in such design choices as:
-The main design principle these queues follow is _minimalism_: the bare minimum of atomic operations, fixed size buffer, value semantics.
+* Bare minimum of atomic instructions. Inlinable by default push and pop functions can hardly be any cheaper in terms of CPU instruction number / L1i cache pressure.
+* Explicit contention/false-sharing avoidance for queue and its elements.
+* Linear fixed size ring-buffer array. No heap memory allocations after a queue object has constructed. It doesn't get any more CPU L1d or TLB cache friendly than that.
+* Value semantics. Meaning that the queues make a copy/move upon `push`/`pop`, no reference/pointer to elements in the queue can be obtained.
-These qualities are also limitations:
+The impact of each of these small design choices on their own is barely measurable, but their total impact is much greater than a simple sum of the constituents' impacts, aka super-scalar compounding or synergy. The synergy emerging from combining multiple of these small design choices together is what allows CPUs to perform at their peak capacities least impeded.
-* The maximum queue size must be set at compile time or construction time. The circular buffer side-steps the memory reclamation problem inherent in linked-list based queues for the price of fixed buffer size. See [Effective memory reclamation for lock-free data structures in C++][4] for more details. Fixed buffer size may not be that much of a limitation, since once the queue gets larger than the maximum expected size that indicates a problem that elements aren't processed fast enough, and if the queue keeps growing it may eventually consume all available memory which may affect the entire system, rather than the problematic process only. The only apparent inconvenience is that one has to do an upfront back-of-the-envelope calculation on what would be the largest expected/acceptable queue size.
+These design choices are also limitations:
+
+* The maximum queue size must be set at compile time or construction time. The circular buffer side-steps the memory reclamation problem inherent in linked-list based queues for the price of fixed buffer size. See [Effective memory reclamation for lock-free data structures in C++][4] for more details. Fixed buffer size may not be that much of a limitation, since once the queue gets larger than the maximum expected size that indicates a problem that elements aren't consumed fast enough, and if the queue keeps growing it may eventually consume all available memory which may affect the entire system, rather than the problematic process only. The only apparent inconvenience is that one has to do an upfront calculation on what would be the largest expected/acceptable number of unconsumed elements in the queue.
* There are no OS-blocking push/pop functions. This queue is designed for ultra-low-latency scenarios and using an OS blocking primitive would be sacrificing push-to-pop latency. For lowest possible latency one cannot afford blocking in the OS kernel because the wake-up latency of a blocked thread is about 1-3 microseconds, whereas this queue's round-trip time can be as low as 150 nanoseconds.
Ultra-low-latency applications need just that and nothing more. The minimalism pays off, see the [throughput and latency benchmarks][1].
Available containers are:
* `AtomicQueue` - a fixed size ring-buffer for atomic elements.
-* `OptimistAtomicQueue` - a faster fixed size ring-buffer for atomic elements which busy-waits when empty or full.
+* `OptimistAtomicQueue` - a faster fixed size ring-buffer for atomic elements which busy-waits when empty or full. It is `AtomicQueue` used with `push`/`pop` instead of `try_push`/`try_pop`.
* `AtomicQueue2` - a fixed size ring-buffer for non-atomic elements.
-* `OptimistAtomicQueue2` - a faster fixed size ring-buffer for non-atomic elements which busy-waits when empty or full.
+* `OptimistAtomicQueue2` - a faster fixed size ring-buffer for non-atomic elements which busy-waits when empty or full. It is `AtomicQueue2` used with `push`/`pop` instead of `try_push`/`try_pop`.
These containers have corresponding `AtomicQueueB`, `OptimistAtomicQueueB`, `AtomicQueueB2`, `OptimistAtomicQueueB2` versions where the buffer size is specified as an argument to the constructor.
@@ -34,12 +49,13 @@ Single-producer-single-consumer mode is supported. In this mode, no expensive at
Move-only queue element types are fully supported. For example, a queue of `std::unique_ptr<T>` elements would be `AtomicQueue2B<std::unique_ptr<T>>` or `AtomicQueue2<std::unique_ptr<T>, CAPACITY>`.
-A few other thread-safe containers are used for reference in the benchmarks:
+## Role Models
+Several other well established and popular thread-safe containers are used for reference in the [benchmarks][1]:
* `std::mutex` - a fixed size ring-buffer with `std::mutex`.
* `pthread_spinlock` - a fixed size ring-buffer with `pthread_spinlock_t`.
* `boost::lockfree::spsc_queue` - a wait-free single-producer-single-consumer queue from Boost library.
* `boost::lockfree::queue` - a lock-free multiple-producer-multiple-consumer queue from Boost library.
-* `moodycamel::ConcurrentQueue` - a lock-free multiple-producer-multiple-consumer queue used in non-blocking mode.
+* `moodycamel::ConcurrentQueue` - a lock-free multiple-producer-multiple-consumer queue used in non-blocking mode. This queue is designed to maximize throughput at the expense of latency and eschewing the global time order of elements pushed into one queue by different threads. It is not equivalent to other queues benchmarked here in this respect.
* `moodycamel::ReaderWriterQueue` - a lock-free single-producer-single-consumer queue used in non-blocking mode.
* `xenium::michael_scott_queue` - a lock-free multi-producer-multi-consumer queue proposed by [Michael and Scott](http://www.cs.rochester.edu/~scott/papers/1996_PODC_queues.pdf) (this queue is similar to `boost::lockfree::queue` which is also based on the same proposal).
* `xenium::ramalhete_queue` - a lock-free multi-producer-multi-consumer queue proposed by [Ramalhete and Correia](http://concurrencyfreaks.blogspot.com/2016/11/faaarrayqueue-mpmc-lock-free-queue-part.html).
@@ -80,7 +96,7 @@ make -r -j4 run_benchmarks
The benchmark also requires Intel TBB library to be available. It assumes that it is installed in `/usr/local/include` and `/usr/local/lib`. If it is installed elsewhere you may like to modify `cppflags.tbb` and `ldlibs.tbb` in `Makefile`.
# API
-The containers support the following APIs:
+The queue class templates provide the following member functions:
* `try_push` - Appends an element to the end of the queue. Returns `false` when the queue is full.
* `try_pop` - Removes an element from the front of the queue. Returns `false` when the queue is empty.
* `push` (optimist) - Appends an element to the end of the queue. Busy waits when the queue is full. Faster than `try_push` when the queue is not full. Optional FIFO producer queuing and total order.
@@ -90,24 +106,31 @@ The containers support the following APIs:
* `was_full` - Returns `true` if the container was full during the call. The state may have changed by the time the return value is examined.
* `capacity` - Returns the maximum number of elements the queue can possibly hold.
-_Atomic elements_ are those, for which [`std::atomic<T>{T{}}.is_lock_free()`][10] returns `true`, and, when C++17 features are available, [`std::atomic<T>::is_always_lock_free`][16] evaluates to `true` at compile time. In other words, the CPU can load, store and compare-and-exchange such elements atomically natively. On x86-64 such elements are all the [C++ standard arithmetic and pointer types][11]. The queues for atomic elements reserve one value to serve as an empty element marker `NIL`, its default value is `0`. `NIL` value must not be pushed into a queue and there is an [`assert`][13] statement in `push` functions to guard against that in debug mode builds. Pushing `NIL` element into a queue in release mode builds results in undefined behaviour, such as deadlocks and/or lost queue elements.
+_Atomic elements_ are those, for which [`std::atomic<T>{T{}}.is_lock_free()`][10] returns `true`, and, when C++17 features are available, [`std::atomic<T>::is_always_lock_free`][16] evaluates to `true` at compile time. In other words, the CPU can load, store and compare-and-exchange such elements atomically natively. On x86-64 such elements are all the [C++ standard arithmetic and pointer types][11].
-Note that _optimism_ is a choice of a queue modification operation control flow, rather than a queue type. An _optimist_ `push` is fastest when the queue is not full most of the time, an optimistic `pop` - when the queue is not empty most of the time. Optimistic and not so operations can be mixed with no restrictions. The `OptimistAtomicQueue`s in [the benchmarks][1] use only _optimist_ `push` and `pop`.
+The queues for atomic elements reserve one value to serve as an empty element marker `NIL`, its default value is `0`. `NIL` value must not be pushed into a queue and there is an [`assert`][13] statement in `push` functions to guard against that in debug mode builds. Pushing `NIL` element into a queue in release mode builds results in undefined behaviour, such as deadlocks and/or lost queue elements.
-`push` and `try_push` operations _synchronize-with_ (as defined in [`std::memory_order`][17]) with any subsequent `pop` or `try_pop` operation of the same queue object.
+Note that _optimism_ is a choice of a queue modification operation control flow, rather than a queue type. An _optimist_ `push` is fastest when the queue is not full most of the time, an optimistic `pop` - when the queue is not empty most of the time. Optimistic and not so operations can be mixed with no restrictions. The `OptimistAtomicQueue`s in [the benchmarks][1] use only _optimist_ `push` and `pop`.
See [example.cc](src/example.cc) for a usage example.
TODO: full API reference.
+## Memory order of non-atomic loads and stores
+`push` and `try_push` operations _synchronize-with_ (as defined in [`std::memory_order`][17]) with any subsequent `pop` or `try_pop` operation of the same queue object. Meaning that:
+* No non-atomic load/store gets reordered past `push`/`try_push`, which is a `memory_order::release` operation. Same memory order as that of `std::mutex::unlock`.
+* No non-atomic load/store gets reordered prior to `pop`/`try_pop`, which is a `memory_order::acquire` operation. Same memory order as that of `std::mutex::lock`.
+* The effects of a producer thread's non-atomic stores followed by `push`/`try_push` of an element into a queue become visible in the consumer's thread which `pop`/`try_pop` that particular element.
+
# Implementation Notes
-The available queues here use a ring-buffer array for storing elements. The size of the queue is fixed at compile time or construction time.
+## Ring-buffer capacity
+The available queues here use a ring-buffer array for storing elements. The capacity of the queue is fixed at compile time or construction time.
-In a production multiple-producer-multiple-consumer scenario the ring-buffer size should be set to the maximum expected queue size. When the ring-buffer gets full it means that the consumers cannot consume the elements fast enough. A fix for that is any of:
+In a production multiple-producer-multiple-consumer scenario the ring-buffer capacity should be set to the maximum expected queue size. When the ring-buffer gets full it means that the consumers cannot consume the elements fast enough. A fix for that is any of:
-* increase the buffer size to be able to handle temporary spikes of produced elements, or,
-* increase the number of consumers to consume elements faster, or,
-* decrease the number of producers to producer fewer elements.
+* Increase the queue capacity in order to handle temporary spikes of pending elements in the queue. This normally requires restarting the application after re-configuration/re-compilation has been done.
+* Increase the number of consumers to drain the queue faster. The number of consumers can be managed dynamically, e.g.: when a consumer observes that the number of elements pending in the queue keeps growing, that calls for deploying more consumer threads to drain the queue at a faster rate; mostly empty queue calls for suspending/terminating excess consumer threads.
+* Decrease the rate of pushing elements into the queue. `push` and `pop` calls always incur some expensive CPU cycles to maintain the integrity of queue state in atomic/consistent/isolated fashion with respect to other threads and these costs increase super-linearly as queue contention grows. Producer batching of multiple small elements or elements resulting from one event into one queue message is often a reasonable solution.
Using a power-of-2 ring-buffer array size allows a couple of important optimizations:
@@ -116,13 +139,35 @@ Using a power-of-2 ring-buffer array size allows a couple of important optimizat
The containers use `unsigned` type for size and internal indexes. On x86-64 platform `unsigned` is 32-bit wide, whereas `size_t` is 64-bit wide. 64-bit instructions utilise an extra byte instruction prefix resulting in slightly more pressure on the CPU instruction cache and the front-end. Hence, 32-bit `unsigned` indexes are used to maximise performance. That limits the queue size to 4,294,967,295 elements, which seems to be a reasonable hard limit for many applications.
-While the atomic queues can be used with any moveable element types (including `std::unique_ptr`), for best througput and latency the queue elements should be cheap to copy and lock-free (e.g. `unsigned` or `T*`), so that `push` and `pop` operations complete fastest.
+While the atomic queues can be used with any moveable element types (including `std::unique_ptr`), for best throughput and latency the queue elements should be cheap to copy and lock-free (e.g. `unsigned` or `T*`), so that `push` and `pop` operations complete fastest.
-`push` and `pop` both perform two atomic operations: increment the counter to claim the element slot and store the element into the array. If a thread calling `push` or `pop` is pre-empted between the two atomic operations that causes another thread calling `pop` or `push` (corresondingly) on the same slot to spin on loading the element until the element is stored; other threads calling `push` and `pop` are not affected. Using real-time `SCHED_FIFO` threads reduces the risk of pre-emption, however, a higher priority `SCHED_FIFO` thread or kernel interrupt handler can still preempt your `SCHED_FIFO` thread. If the queues are used on isolated cores with real-time priority threads, in which case no pre-emption or interrupts occur, the queues operations become _lock-free_.
+## Lock-free guarantees
+*Conceptually*, a `push` or `pop` operation does two atomic steps:
-So, ideally, you may like to run your critical low-latency code on isolated cores that also no other processes can possibly use. And disable [real-time thread throttling](#real-time-thread-throttling) to prevent `SCHED_FIFO` real-time threads from being throttled.
+1. Atomically and exclusively claims the queue slot index to store/load an element to/from. That's producers incrementing `head` index, consumers incrementing `tail` index. Each slot is accessed by one producer thread and one consumer thread only.
+2. Atomically store/load the element into/from the slot. Producer storing into a slot changes its state to be non-`NIL`, consumer loading from a slot changes its state to be `NIL`. The slot is a spinlock for its one producer and one consumer threads.
-People often propose limiting busy-waiting with a subsequent call to `sched_yield`/`pthread_yield`. However, `sched_yield` is a wrong tool for locking because it doesn't communicate to the OS kernel what the thread is waiting for, so that the OS thread scheduler can never reschedule the calling thread to resume when the shared state has changed (unless there are no other threads that can run on this CPU core, so that the caller resumes immediately). [More details about `sched_yield` and spinlocks from Linus Torvalds][5].
+These queues anticipate that a thread doing `push` or `pop` may complete step 1 and then be preempted before completing step 2.
+
+An algorithm is *lock-free* if there is guaranteed system-wide progress. These queues guarantee system-wide progress by the following properties:
+
+* Each `push` is independent of any preceding `push`. An incomplete (preempted) `push` by one producer thread doesn't affect `push` of any other thread.
+* Each `pop` is independent of any preceding `pop`. An incomplete (preempted) `pop` by one consumer thread doesn't affect `pop` of any other thread.
+* An incomplete (preempted) `push` from one producer thread affects only one consumer thread `pop`ing an element from this particular queue slot. All other threads' `pop`s are unaffected.
+* An incomplete (preempted) `pop` from one consumer thread affects only one producer thread `push`ing an element into this particular queue slot while expecting it to have been consumed a long time ago, in the rather unlikely scenario that producers have wrapped around the entire ring-buffer while this consumer hasn't completed its `pop`. All other threads' `push`es and `pop`s are unaffected.
+
+## Preemption
+Linux task scheduler thread preemption is something no user-space process should be able to affect or escape, otherwise any/every malicious application would exploit that.
+
+Still, there are a few things one can do to minimize preemption of one's mission critical application threads:
+
+* Use real-time `SCHED_FIFO` scheduling class for your threads, e.g. `chrt --fifo 50 <app>`. A higher priority `SCHED_FIFO` thread or kernel interrupt handler can still preempt your `SCHED_FIFO` threads.
+* Use one same fixed real-time scheduling priority for all threads accessing same queue objects. Real-time threads with different scheduling priorities modifying one queue object may cause priority inversion and deadlocks. Using the default scheduling class `SCHED_OTHER` with its dynamically adjusted priorities defeats the purpose of using these queues.
+* Disable [real-time thread throttling](#real-time-thread-throttling) to prevent `SCHED_FIFO` real-time threads from being throttled.
+* Isolate CPU cores, so that no interrupt handlers or applications ever run on it. Mission critical applications should be explicitly placed on these isolated cores with `taskset`.
+* Pin threads to specific cores, otherwise the task scheduler keeps moving threads to other idle CPU cores to level voltage/heat-induced wear-and-tear across CPU cores. Keeping a thread running on one same CPU core maximizes CPU cache hit rate. Moving a thread to another CPU core incurs otherwise unnecessary CPU cache thrashing.
+
+People often propose limiting busy-waiting with a subsequent call to `std::this_thread::yield()`/`sched_yield`/`pthread_yield`. However, `sched_yield` is a wrong tool for locking because it doesn't communicate to the OS kernel what the thread is waiting for, so that the OS thread scheduler can never schedule the calling thread to resume at the right time when the shared state has changed (unless there are no other threads that can run on this CPU core, so that the caller resumes immediately). See notes section in [`man sched_yield`][19] and [a Linux kernel thread about `sched_yield` and spinlocks][5] for more details.
[In Linux, there is mutex type `PTHREAD_MUTEX_ADAPTIVE_NP`][9] which busy-waits a locked mutex for a number of iterations and then makes a blocking syscall into the kernel to deschedule the waiting thread. In the benchmarks it was the worst performer and I couldn't find a way to make it perform better, and that's the reason it is not included in the benchmarks.
@@ -136,15 +181,15 @@ There are a few OS behaviours that complicate benchmarking:
* CPU scheduler can place threads on different CPU cores each run. To avoid that the threads are pinned to specific CPU cores.
* CPU scheduler can preempt threads. To avoid that real-time `SCHED_FIFO` priority 50 is used to disable scheduler time quantum expiry and make the threads non-preemptable by lower priority processes/threads.
* Real-time thread throttling disabled.
-* Adverse address space randomisation may cause extra CPU cache conflicts, as well as other processes running on the system. To minimise effects of that `benchmarks` executable is run at least 33 times. The benchmark charts show the average; the standard deviation, minimum and maximum values are shown in the chart tooltips.
+* Adverse address space randomisation may cause extra CPU cache conflicts, as well as other processes running on the system. To minimise effects of that `benchmarks` executable is run at least 33 times. The benchmark charts display average values. The chart tooltip also displays the standard deviation, minimum and maximum values.
I only have access to a few x86-64 machines. If you have access to different hardware feel free to submit the output file of `scripts/run-benchmarks.sh` and I will include your results into the benchmarks page.
### Huge pages
When huge pages are available the benchmarks use 1x1GB or 16x2MB huge pages for the queues to minimise TLB misses. To enable huge pages do one of:
```
-sudo hugeadm --pool-pages-min 1GB:1 --pool-pages-max 1GB:1
-sudo hugeadm --pool-pages-min 2MB:16 --pool-pages-max 2MB:16
+sudo hugeadm --pool-pages-min 1GB:1
+sudo hugeadm --pool-pages-min 2MB:16
```
Alternatively, you may like to enable [transparent hugepages][15] in your system and use a hugepage-aware allocator, such as [tcmalloc][14].
@@ -161,7 +206,13 @@ N producer threads push a 4-byte integer into one same queue, N consumer threads
One thread posts an integer to another thread through one queue and waits for a reply from another queue (2 queues in total). The benchmarks measures the total time of 100,000 ping-pongs, best of 10 runs. Contention is minimal here (1-producer-1-consumer, 1 element in the queue) to be able to achieve and measure the lowest latency. Reports the average round-trip time.
# Contributing
-The project uses `.editorconfig` and `.clang-format` to automate formatting. Pull requests are expected to be formatted using these settings.
+Contributions are more than welcome. `.editorconfig` and `.clang-format` can be used to automatically match code formatting.
+
+# Reading material
+Some books on the subject of multi-threaded programming I found instructive:
+
+* _Programming with POSIX Threads_ by David R. Butenhof.
+* _The Art of Multiprocessor Programming_ by Maurice Herlihy, Nir Shavit.
---
@@ -184,3 +235,5 @@ Copyright (c) 2019 Maxim Egorushkin. MIT License. See the full licence in file L
[15]: https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html
[16]: https://en.cppreference.com/w/cpp/atomic/atomic/is_always_lock_free
[17]: https://en.cppreference.com/w/cpp/atomic/memory_order
+[18]: https://github.com/max0x7ba/atomic_queue/blob/master/.github/workflows/ci.yml
+[19]: https://man7.org/linux/man-pages/man2/sched_yield.2.html
=====================================
debian/changelog
=====================================
@@ -1,3 +1,19 @@
+libatomic-queue (1.6.4-1) unstable; urgency=medium
+
+ * New upstream version 1.6.4
+ * generate-shared-library.patch: refresh.
+ * no-native: refresh patch.
+ * no_thin_archives.patch: refresh.
+ * compiler.patch: unfuzz.
+ * fix_unused_variable.patch: unfuzz.
+ * d/rules: remove workaround for gcc 13 bug.
+ * d/rules: remove workaround against dh_strip failure.
+ * d/rules: delete file landing in standard top directory.
+ * d/control: deduplicate Section: libs field.
+ * d/s/lintian-overrides: flag another source-missing false positive.
+
+ -- Étienne Mollier <emollier at debian.org> Fri, 23 Aug 2024 13:20:22 +0200
+
libatomic-queue (1.4-2) unstable; urgency=medium
* d/*.symbols: mark several compiler specific symbols as optional.
=====================================
debian/control
=====================================
@@ -21,7 +21,6 @@ Rules-Requires-Root: no
Package: libatomic-queue0
Architecture: any
-Section: libs
Depends: ${shlibs:Depends},
${misc:Depends}
Description: C++ atomic_queue library
=====================================
debian/patches/compiler.patch
=====================================
@@ -8,7 +8,7 @@ Last-Update: 2022-07-01
This patch header follows DEP-3: http://dep.debian.net/deps/dep3/
--- libatomic-queue.orig/Makefile
+++ libatomic-queue/Makefile
-@@ -23,8 +23,8 @@
+@@ -31,8 +31,8 @@
ld.clang := clang++
ar.clang := ar
=====================================
debian/patches/fix_unused_variable.patch
=====================================
@@ -2,9 +2,9 @@ Author: Andreas Tille <tille at debian.org>
Last-Update: Fri, 23 Oct 2020 22:10:01 +0200
Description: Fix unused variable
---- a/src/benchmarks.cc
-+++ b/src/benchmarks.cc
-@@ -197,7 +197,7 @@ void throughput_producer(unsigned N, Que
+--- libatomic-queue.orig/src/benchmarks.cc
++++ libatomic-queue/src/benchmarks.cc
+@@ -180,7 +180,7 @@
cycles_t expected = 0;
t0->compare_exchange_strong(expected, __builtin_ia32_rdtsc(), std::memory_order_acq_rel, std::memory_order_relaxed);
@@ -13,7 +13,7 @@ Description: Fix unused variable
ProducerOf<Queue> producer{*queue};
for(unsigned n = 1, stop = N + 1; n <= stop; ++n)
producer.push(*queue, n);
-@@ -208,7 +208,7 @@ void throughput_consumer_impl(unsigned N
+@@ -191,7 +191,7 @@
unsigned const stop = N + 1;
sum_t sum = 0;
@@ -22,7 +22,7 @@ Description: Fix unused variable
ConsumerOf<Queue> consumer{*queue};
for(;;) {
unsigned n = consumer.pop(*queue);
-@@ -408,7 +408,7 @@ void run_throughput_benchmarks(HugePages
+@@ -393,7 +393,7 @@
template<class Queue>
void ping_pong_thread_impl(Queue* q1, Queue* q2, unsigned N, cycles_t* time, std::false_type /*sender*/) {
cycles_t t0 = __builtin_ia32_rdtsc();
@@ -31,7 +31,7 @@ Description: Fix unused variable
ConsumerOf<Queue> consumer_q1{*q1};
ProducerOf<Queue> producer_q2{*q2};
for(unsigned i = 1, j = 0; j < N; ++i) {
-@@ -422,7 +422,7 @@ void ping_pong_thread_impl(Queue* q1, Qu
+@@ -407,7 +407,7 @@
template<class Queue>
void ping_pong_thread_impl(Queue* q1, Queue* q2, unsigned N, cycles_t* time, std::true_type /*sender*/) {
cycles_t t0 = __builtin_ia32_rdtsc();
=====================================
debian/patches/generate-shared-library.patch
=====================================
@@ -1,33 +1,30 @@
Author: Nilesh Patra <npatra974 at gmail.com>,
Andreas Tille <tille at debian.org>
-Last-Update: Fri, 23 Oct 2020 22:10:01 +0200
-Description: Fix unused variable
+Reviewed-By: Étienne Mollier <emollier at debian.org>
+Last-Update: 2024-08-23
+Description: add rules to generate a shared library.
--- libatomic-queue.orig/Makefile
+++ libatomic-queue/Makefile
-@@ -11,6 +11,7 @@
-
+@@ -19,6 +19,7 @@
TOOLSET := gcc
+
build_dir := ${CURDIR}/build/${BUILD}/${TOOLSET}
+build_dir_shared := ${CURDIR}/build_shared/${BUILD}/${TOOLSET}
cxx.gcc := g++
cc.gcc := gcc
-@@ -60,21 +61,30 @@
- cppflags.xenium := -I${abspath ../xenium}
- ldlibs.xenium :=
-
-+SOVERSION := 0
- COMPILE.CXX = ${CXX} -o $@ -c ${cppflags} ${cxxflags} -MD -MP $(abspath $<)
- COMPILE.S = ${CXX} -o- -S -masm=intel ${cppflags} ${cxxflags} $(abspath $<) | c++filt | egrep -v '^[[:space:]]*\.(loc|cfi|L[A-Z])' > $@
+@@ -75,7 +76,8 @@
PREPROCESS.CXX = ${CXX} -o $@ -E ${cppflags} ${cxxflags} $(abspath $<)
COMPILE.C = ${CC} -o $@ -c ${cppflags} ${cflags} -MD -MP $(abspath $<)
- LINK.EXE = ${LD} -o $@ $(ldflags) $(filter-out Makefile,$^) $(ldlibs)
--LINK.SO = ${LD} -o $@ -shared $(ldflags) $(filter-out Makefile,$^) $(ldlibs)
+ LINK.EXE = ${LD} -o $@ $(ldflags) $(filter-out ${relink},$^) $(ldlibs)
+-LINK.SO = ${LD} -o $@ -shared $(ldflags) $(filter-out ${relink},$^) $(ldlibs)
++SOVERSION := 0
+LINK.SO = ${LD} -o $@.$(SOVERSION) -shared -Wl,-soname,`basename $@`.$(SOVERSION) $(ldflags) $(filter-out Makefile,$^) $(ldlibs)
- LINK.A = ${AR} rscT $@ $(filter-out Makefile,$^)
+ LINK.A = ${AR} rscT $@ $(filter-out ${relink},$^)
- exes := benchmarks tests example
+ ifneq (,$(findstring n,$(firstword -${MAKEFLAGS})))
+@@ -90,9 +92,17 @@
all : ${exes}
@@ -46,47 +43,49 @@ Description: Fix unused variable
benchmarks_src := benchmarks.cc cpu_base_frequency.cc huge_pages.cc
${build_dir}/benchmarks : cppflags += ${cppflags.tbb} ${cppflags.moodycamel} ${cppflags.xenium}
${build_dir}/benchmarks : ldlibs += ${ldlibs.tbb} ${ldlibs.moodycamel} ${ldlibs.xenium} -ldl
-@@ -94,9 +104,10 @@
- $(strip ${LINK.EXE})
+@@ -112,9 +122,10 @@
+ $(call strip2,${LINK.EXE})
-include ${example_src:%.cc=${build_dir}/%.d}
-${build_dir}/%.so : cxxflags += -fPIC
--${build_dir}/%.so : Makefile | ${build_dir}
-- $(strip ${LINK.SO})
+-${build_dir}/%.so : ${relink} | ${build_dir}
+- $(call strip2,${LINK.SO})
+${build_dir_shared}/%.so : cxxflags += -fPIC
+${build_dir_shared}/%.so : Makefile | ${build_dir_shared}
+ ${LINK.SO}
+ ln -s `basename $@`.$(SOVERSION) $@
- ${build_dir}/%.a : Makefile | ${build_dir}
- $(strip ${LINK.A})
-@@ -121,6 +132,13 @@
- ${build_dir}/%.o : src/%.c Makefile | ${build_dir}
- $(strip ${COMPILE.C})
+ ${build_dir}/%.a : ${relink} | ${build_dir}
+ $(call strip2,${LINK.A})
+@@ -125,6 +136,12 @@
+ ${build_dir}/%.o : src/%.c ${recompile} | ${build_dir}
+ $(call strip2,${COMPILE.C})
+${build_dir_shared}/%.o : src/%.cc Makefile | ${build_dir_shared}
-+ $(strip ${COMPILE.CXX})
++ $(call strip2,${COMPILE.CXX})
+
+${build_dir_shared}/%.o : src/%.c Makefile | ${build_dir_shared}
-+ $(strip ${COMPILE.C})
-+
++ $(call strip2,${COMPILE.C})
+
- %.S : cppflags += ${cppflags.tbb} ${cppflags.moodycamel} ${cppflags.xenium}
- %.S : src/%.cc Makefile | ${build_dir}
- $(strip ${COMPILE.S})
-@@ -131,11 +149,14 @@
- ${build_dir} :
+ ${build_dir}/%.S : cppflags += ${cppflags.tbb} ${cppflags.moodycamel} ${cppflags.xenium}
+ ${build_dir}/%.S : src/%.cc ${recompile} | ${build_dir}
+ $(call strip2,${COMPILE.S})
+@@ -139,6 +156,10 @@
+ ${build_dir} ${build_dir}/.make:
mkdir -p $@
-+${build_dir_shared} :
++${build_dir_shared}/.make : | ${build_dir_shared}
++${build_dir_shared} ${build_dir_shared}/.make:
+ mkdir -p $@
+
- rtags : clean
- ${MAKE} -nk all | rtags-rc -c -; true
+ ver = "$(shell ${1} --version | head -n1)"
+ # Trigger recompilation when compiler environment change.
+ env.compile := $(call ver,${CXX}) ${cppflags} ${cxxflags} ${cppflags.tbb} ${cppflags.moodycamel} ${cppflags.xenium}
+@@ -175,6 +196,7 @@
clean :
-- rm -rf ${build_dir} ${exes}
-+ rm -rf ${build_dir} ${exes} ${build_dir_shared}
+ rm -rf ${build_dir} ${exes}
++ rm -rf ${build_dir_shared}
- env :
- env | sort --ignore-case
+ versions:
+ ${MAKE} --version | awk 'FNR<2'
=====================================
debian/patches/no-native
=====================================
@@ -6,15 +6,15 @@ Forwarded: not-needed
It violates Debian's architectual baseline and causes reproducibilty problems
--- libatomic-queue.orig/Makefile
+++ libatomic-queue/Makefile
-@@ -29,18 +29,18 @@
+@@ -37,18 +37,18 @@
AR := ${ar.${TOOLSET}}
cxxflags.gcc.debug := -Og -fstack-protector-all -fno-omit-frame-pointer # -D_GLIBCXX_DEBUG
-cxxflags.gcc.release := -O3 -mtune=native -ffast-math -falign-{functions,loops}=64 -DNDEBUG
+cxxflags.gcc.release := -O3 -ffast-math -falign-{functions,loops}=64 -DNDEBUG
cxxflags.gcc.sanitize := ${cxxflags.gcc.release} -fsanitize=thread
--cxxflags.gcc := -pthread -march=native -std=gnu++14 -W{all,extra,error,no-{maybe-uninitialized,unused-variable,unused-function,unused-local-typedefs}} -g -fmessage-length=0 ${cxxflags.gcc.${BUILD}}
-+cxxflags.gcc := -pthread -std=gnu++14 -W{all,extra,error,no-{maybe-uninitialized,unused-variable,unused-function,unused-local-typedefs}} -g -fmessage-length=0 ${cxxflags.gcc.${BUILD}}
+-cxxflags.gcc := -pthread -march=native -std=gnu++14 -W{all,extra,error,no-{maybe-uninitialized,unused-variable,unused-function,unused-local-typedefs,error=array-bounds}} -g -fmessage-length=0 ${cxxflags.gcc.${BUILD}}
++cxxflags.gcc := -pthread -std=gnu++14 -W{all,extra,error,no-{maybe-uninitialized,unused-variable,unused-function,unused-local-typedefs,error=array-bounds}} -g -fmessage-length=0 ${cxxflags.gcc.${BUILD}}
ldflags.gcc.sanitize := ${ldflags.gcc.release} -fsanitize=thread
ldflags.gcc := ${ldflags.gcc.${BUILD}}
=====================================
debian/patches/no_thin_archives.patch
=====================================
@@ -6,12 +6,12 @@ Origin: https://lists.debian.org/debian-med/2021/12/msg00131.html
--- libatomic-queue.orig/Makefile
+++ libatomic-queue/Makefile
-@@ -68,7 +68,7 @@
- COMPILE.C = ${CC} -o $@ -c ${cppflags} ${cflags} -MD -MP $(abspath $<)
- LINK.EXE = ${LD} -o $@ $(ldflags) $(filter-out Makefile,$^) $(ldlibs)
+@@ -78,7 +78,7 @@
+ LINK.EXE = ${LD} -o $@ $(ldflags) $(filter-out ${relink},$^) $(ldlibs)
+ SOVERSION := 0
LINK.SO = ${LD} -o $@.$(SOVERSION) -shared -Wl,-soname,`basename $@`.$(SOVERSION) $(ldflags) $(filter-out Makefile,$^) $(ldlibs)
--LINK.A = ${AR} rscT $@ $(filter-out Makefile,$^)
-+LINK.A = ${AR} rsc $@ $(filter-out Makefile,$^)
-
- exes := benchmarks tests example
+-LINK.A = ${AR} rscT $@ $(filter-out ${relink},$^)
++LINK.A = ${AR} rsc $@ $(filter-out ${relink},$^)
+ ifneq (,$(findstring n,$(firstword -${MAKEFLAGS})))
+ # Perform bash parameter expansion when --just-print for rtags.
=====================================
debian/rules
=====================================
@@ -1,13 +1,6 @@
#!/usr/bin/make -f
export DEB_BUILD_MAINT_OPTIONS = optimize=+lto
-# Workaround gcc-13 bug. See Gcc bug #110764[1]. FIXME: this is fixed in
-# upcoming upstream build flags[2], so won't be necessary for long.
-# [1]: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110764
-# [2]: https://github.com/max0x7ba/atomic_queue/commit/c79be00ef12a5f3dc4294ba4954194a712a2853d
-export DEB_CXXFLAGS_MAINT_APPEND = -Wno-error=array-bounds
-
-
%:
dh $@
@@ -20,13 +13,7 @@ override_dh_install:
--exclude-la \
--movedev include usr \
$$(find . -name "*.so")
-
-# FIXME: No idea why dh_strip fails with
-#strip: debian/libatomic-queue-dev/usr/lib/x86_64-linux-gnu/libatomic_queue.a: sorry: copying thin archives is not currently supported: invalid operation
-#dh_strip: error: strip --strip-debug --remove-section=.comment --remove-section=.note --enable-deterministic-archives -R .gnu.lto_\* -R .gnu.debuglto_\* -N __gnu_lto_slim -N __gnu_lto_v1 debian/libatomic-queue-dev/usr/lib/x86_64-linux-gnu/libatomic_queue.a returned exit code 1
-#dh_strip: error: Aborting due to earlier error
-override_dh_strip:
- dh_strip || true
+ rm -vf debian/libatomic-queue-dev/usr/include/CMakeLists.txt
override_dh_auto_test:
ifeq (,$(filter nocheck,$(DEB_BUILD_OPTIONS)))
=====================================
debian/source/lintian-overrides
=====================================
@@ -1,2 +1,3 @@
-# False positive
+# False positives tripped by long lines in the affected files.
libatomic-queue source: source-is-missing [html/benchmarks.js*]
+libatomic-queue source: source-is-missing [html/benchmarks.html]
=====================================
html/benchmarks.html
=====================================
@@ -20,12 +20,12 @@
<script src="theme.js"></script>
<script src="benchmarks.js"></script>
<meta charset="utf-8">
- <title>Scalaibilty and Latency Benchmarks</title>
+ <title>Scalability and Latency Benchmarks</title>
</head>
<body>
<h1 class="view-toggle">Scalability Benchmark</h1>
<div>
- <p>N producer threads push a 4-byte integer into one same queue, N consumer threads pop the integers from the queue. All producers posts 1,000,000 messages in total. Total time to send and receive all the messages is measured. The benchmark is run for from 1 producer and 1 consumer up to (total-number-of-cpus / 2) producers/consumers to measure the scalabilty of different queues.</p>
+      <p>N producer threads push a 4-byte integer into one same queue, N consumer threads pop the integers from the queue. All producers posts 1,000,000 messages in total. Total time to send and receive all the messages is measured. The benchmark is run for from 1 producer and 1 consumer up to (total-number-of-cpus / 2) producers/consumers to measure the scalability of different queues. The minimum, maximum, mean and standard deviation of at least 33 runs are reported in the tooltip.</p>
<h3 class="view-toggle">Scalability on Intel i9-9900KS</h3><div class="chart" id="scalability-9900KS-5GHz"></div>
<h3 class="view-toggle">Scalability on AMD Ryzen 7 5825U</h3><div class="chart" id="scalability-ryzen-5825u"></div>
<h3 class="view-toggle">Scalability on Intel Xeon Gold 6132</h3><div class="chart" id="scalability-xeon-gold-6132"></div>
@@ -34,7 +34,7 @@
<h1 class="view-toggle">Latency Benchmark</h1>
<div>
- <p>One thread posts a 4-byte integer to another thread through one queue and waits for a reply from another queue (2 queues in total). The benchmark measures the total time of 100,000 ping-pongs, best of 10 runs. Contention is minimal here (1-producer-1-consumer, 1 element in the queue) to be able to achieve and measure the lowest latency. Reports the average round-trip time.</p>
+ <p>One thread posts a 4-byte integer to another thread through one queue and waits for a reply from another queue (2 queues in total). The benchmark measures the total time of 100,000 ping-pongs, best of 10 runs. Contention is minimal here (1-producer-1-consumer, 1 element in the queue) to be able to achieve and measure the lowest latency. Reports the average round-trip time, i.e. the time it takes to post a message to another thread and receive a reply. The minimum, maximum, mean and standard deviation of at least 33 runs are reported in the tooltip.</p>
<h3 class="view-toggle">Latency on Intel i9-9900KS</h3><div class="chart" id="latency-9900KS-5GHz"></div>
<h3 class="view-toggle">Latency on AMD Ryzen 7 5825U</h3><div class="chart" id="latency-ryzen-5825u"></div>
<h3 class="view-toggle">Latency on Intel Xeon Gold 6132</h3><div class="chart" id="latency-xeon-gold-6132"></div>
=====================================
include/CMakeLists.txt
=====================================
@@ -0,0 +1,17 @@
+CMAKE_MINIMUM_REQUIRED( VERSION 3.25 )
+
+add_library(
+ atomic_queue
+ INTERFACE
+ ${CMAKE_CURRENT_SOURCE_DIR}/atomic_queue/atomic_queue.h
+ ${CMAKE_CURRENT_SOURCE_DIR}/atomic_queue/atomic_queue_mutex.h
+ ${CMAKE_CURRENT_SOURCE_DIR}/atomic_queue/barrier.h
+ ${CMAKE_CURRENT_SOURCE_DIR}/atomic_queue/defs.h
+ ${CMAKE_CURRENT_SOURCE_DIR}/atomic_queue/spinlock.h
+)
+
+target_include_directories(
+ atomic_queue
+ INTERFACE
+ ${CMAKE_CURRENT_SOURCE_DIR}
+)
\ No newline at end of file
=====================================
include/atomic_queue/atomic_queue.h
=====================================
@@ -11,7 +11,6 @@
#include <cstddef>
#include <cstdint>
#include <memory>
-#include <type_traits>
#include <utility>
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
@@ -56,14 +55,11 @@ struct GetIndexShuffleBits<false, array_size, elements_per_cache_line> {
// minimizes contention. This is done by swapping the lowest order N bits (which are the index of
// the element within the cache line) with the next N bits (which are the index of the cache line)
// of the element index.
-template<int BITS>
-constexpr unsigned remap_index_with_mix(unsigned index, unsigned mix) {
- return index ^ mix ^ (mix << BITS);
-}
-
template<int BITS>
constexpr unsigned remap_index(unsigned index) noexcept {
- return remap_index_with_mix<BITS>(index, (index ^ (index >> BITS)) & ((1u << BITS) - 1));
+ unsigned constexpr mix_mask{(1u << BITS) - 1};
+ unsigned const mix{(index ^ (index >> BITS)) & mix_mask};
+ return index ^ mix ^ (mix << BITS);
}
template<>
@@ -128,6 +124,12 @@ constexpr T nil() noexcept {
return {};
}
+template<class T>
+inline void destroy_n(T* p, unsigned n) noexcept {
+ for(auto q = p + n; p != q;)
+ (p++)->~T();
+}
+
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace details
@@ -200,7 +202,7 @@ protected:
q_element.store(element, R);
}
else {
- for(T expected = NIL; ATOMIC_QUEUE_UNLIKELY(!q_element.compare_exchange_strong(expected, element, R, X)); expected = NIL) {
+ for(T expected = NIL; ATOMIC_QUEUE_UNLIKELY(!q_element.compare_exchange_weak(expected, element, R, X)); expected = NIL) {
do
spin_loop_pause(); // (1) Wait for store (2) to complete.
while(Derived::maximize_throughput_ && q_element.load(X) != NIL);
@@ -223,7 +225,7 @@ protected:
else {
for(;;) {
unsigned char expected = STORED;
- if(ATOMIC_QUEUE_LIKELY(state.compare_exchange_strong(expected, LOADING, A, X))) {
+ if(ATOMIC_QUEUE_LIKELY(state.compare_exchange_weak(expected, LOADING, A, X))) {
T element{std::move(q_element)};
state.store(EMPTY, R);
return element;
@@ -248,7 +250,7 @@ protected:
else {
for(;;) {
unsigned char expected = EMPTY;
- if(ATOMIC_QUEUE_LIKELY(state.compare_exchange_strong(expected, STORING, A, X))) {
+ if(ATOMIC_QUEUE_LIKELY(state.compare_exchange_weak(expected, STORING, A, X))) {
q_element = std::forward<U>(element);
state.store(STORED, R);
return;
@@ -274,7 +276,7 @@ public:
do {
if(static_cast<int>(head - tail_.load(X)) >= static_cast<int>(static_cast<Derived&>(*this).size_))
return false;
- } while(ATOMIC_QUEUE_UNLIKELY(!head_.compare_exchange_strong(head, head + 1, X, X))); // This loop is not FIFO.
+ } while(ATOMIC_QUEUE_UNLIKELY(!head_.compare_exchange_weak(head, head + 1, X, X))); // This loop is not FIFO.
}
static_cast<Derived&>(*this).do_push(std::forward<T>(element), head);
@@ -293,7 +295,7 @@ public:
do {
if(static_cast<int>(head_.load(X) - tail) <= 0)
return false;
- } while(ATOMIC_QUEUE_UNLIKELY(!tail_.compare_exchange_strong(tail, tail + 1, X, X))); // This loop is not FIFO.
+ } while(ATOMIC_QUEUE_UNLIKELY(!tail_.compare_exchange_weak(tail, tail + 1, X, X))); // This loop is not FIFO.
}
element = static_cast<Derived&>(*this).do_pop(tail);
@@ -358,7 +360,7 @@ class AtomicQueue : public AtomicQueueCommon<AtomicQueue<T, SIZE, NIL, MINIMIZE_
static constexpr bool spsc_ = SPSC;
static constexpr bool maximize_throughput_ = MAXIMIZE_THROUGHPUT;
- alignas(CACHE_LINE_SIZE) std::atomic<T> elements_[size_] = {}; // Empty elements are NIL.
+ alignas(CACHE_LINE_SIZE) std::atomic<T> elements_[size_];
T do_pop(unsigned tail) noexcept {
std::atomic<T>& q_element = details::map<SHUFFLE_BITS>(elements_, tail % size_);
@@ -375,9 +377,8 @@ public:
AtomicQueue() noexcept {
assert(std::atomic<T>{NIL}.is_lock_free()); // Queue element type T is not atomic. Use AtomicQueue2/AtomicQueueB2 for such element types.
- if(details::nil<T>() != NIL)
- for(auto& element : elements_)
- element.store(NIL, X);
+ for(auto p = elements_, q = elements_ + size_; p != q; ++p)
+ p->store(NIL, X);
}
AtomicQueue(AtomicQueue const&) = delete;
@@ -423,8 +424,9 @@ public:
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
template<class T, class A = std::allocator<T>, T NIL = details::nil<T>(), bool MAXIMIZE_THROUGHPUT = true, bool TOTAL_ORDER = false, bool SPSC = false>
-class AtomicQueueB : public AtomicQueueCommon<AtomicQueueB<T, A, NIL, MAXIMIZE_THROUGHPUT, TOTAL_ORDER, SPSC>>,
- private std::allocator_traits<A>::template rebind_alloc<std::atomic<T>> {
+class AtomicQueueB : private std::allocator_traits<A>::template rebind_alloc<std::atomic<T>>,
+ public AtomicQueueCommon<AtomicQueueB<T, A, NIL, MAXIMIZE_THROUGHPUT, TOTAL_ORDER, SPSC>> {
+ using AllocatorElements = typename std::allocator_traits<A>::template rebind_alloc<std::atomic<T>>;
using Base = AtomicQueueCommon<AtomicQueueB<T, A, NIL, MAXIMIZE_THROUGHPUT, TOTAL_ORDER, SPSC>>;
friend Base;
@@ -432,8 +434,6 @@ class AtomicQueueB : public AtomicQueueCommon<AtomicQueueB<T, A, NIL, MAXIMIZE_T
static constexpr bool spsc_ = SPSC;
static constexpr bool maximize_throughput_ = MAXIMIZE_THROUGHPUT;
- using AllocatorElements = typename std::allocator_traits<A>::template rebind_alloc<std::atomic<T>>;
-
static constexpr auto ELEMENTS_PER_CACHE_LINE = CACHE_LINE_SIZE / sizeof(std::atomic<T>);
static_assert(ELEMENTS_PER_CACHE_LINE, "Unexpected ELEMENTS_PER_CACHE_LINE.");
@@ -457,25 +457,25 @@ class AtomicQueueB : public AtomicQueueCommon<AtomicQueueB<T, A, NIL, MAXIMIZE_T
public:
using value_type = T;
+ using allocator_type = A;
// The special member functions are not thread-safe.
- AtomicQueueB(unsigned size)
- : size_(std::max(details::round_up_to_power_of_2(size), 1u << (SHUFFLE_BITS * 2)))
+ AtomicQueueB(unsigned size, A const& allocator = A{})
+ : AllocatorElements(allocator)
+ , size_(std::max(details::round_up_to_power_of_2(size), 1u << (SHUFFLE_BITS * 2)))
, elements_(AllocatorElements::allocate(size_)) {
assert(std::atomic<T>{NIL}.is_lock_free()); // Queue element type T is not atomic. Use AtomicQueue2/AtomicQueueB2 for such element types.
- for(auto p = elements_, q = elements_ + size_; p < q; ++p)
- p->store(NIL, X);
+ std::uninitialized_fill_n(elements_, size_, NIL);
+ assert(get_allocator() == allocator); // The standard requires the original and rebound allocators to manage the same state.
}
AtomicQueueB(AtomicQueueB&& b) noexcept
- : Base(static_cast<Base&&>(b))
- , AllocatorElements(static_cast<AllocatorElements&&>(b)) // TODO: This must be noexcept, static_assert that.
- , size_(b.size_)
- , elements_(b.elements_) {
- b.size_ = 0;
- b.elements_ = 0;
- }
+ : AllocatorElements(static_cast<AllocatorElements&&>(b)) // TODO: This must be noexcept, static_assert that.
+ , Base(static_cast<Base&&>(b))
+ , size_(std::exchange(b.size_, 0))
+ , elements_(std::exchange(b.elements_, nullptr))
+ {}
AtomicQueueB& operator=(AtomicQueueB&& b) noexcept {
b.swap(*this);
@@ -483,19 +483,25 @@ public:
}
~AtomicQueueB() noexcept {
- if(elements_)
+ if(elements_) {
+ details::destroy_n(elements_, size_);
AllocatorElements::deallocate(elements_, size_); // TODO: This must be noexcept, static_assert that.
+ }
+ }
+
+ A get_allocator() const noexcept {
+ return *this; // The standard requires implicit conversion between rebound allocators.
}
void swap(AtomicQueueB& b) noexcept {
using std::swap;
- this->Base::swap(b);
swap(static_cast<AllocatorElements&>(*this), static_cast<AllocatorElements&>(b));
+ Base::swap(b);
swap(size_, b.size_);
swap(elements_, b.elements_);
}
- friend void swap(AtomicQueueB& a, AtomicQueueB& b) {
+ friend void swap(AtomicQueueB& a, AtomicQueueB& b) noexcept {
a.swap(b);
}
};
@@ -503,27 +509,25 @@ public:
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
template<class T, class A = std::allocator<T>, bool MAXIMIZE_THROUGHPUT = true, bool TOTAL_ORDER = false, bool SPSC = false>
-class AtomicQueueB2 : public AtomicQueueCommon<AtomicQueueB2<T, A, MAXIMIZE_THROUGHPUT, TOTAL_ORDER, SPSC>>,
- private A,
- private std::allocator_traits<A>::template rebind_alloc<std::atomic<uint8_t>> {
+class AtomicQueueB2 : private std::allocator_traits<A>::template rebind_alloc<unsigned char>,
+ public AtomicQueueCommon<AtomicQueueB2<T, A, MAXIMIZE_THROUGHPUT, TOTAL_ORDER, SPSC>> {
+ using StorageAllocator = typename std::allocator_traits<A>::template rebind_alloc<unsigned char>;
using Base = AtomicQueueCommon<AtomicQueueB2<T, A, MAXIMIZE_THROUGHPUT, TOTAL_ORDER, SPSC>>;
using State = typename Base::State;
+ using AtomicState = std::atomic<unsigned char>;
friend Base;
static constexpr bool total_order_ = TOTAL_ORDER;
static constexpr bool spsc_ = SPSC;
static constexpr bool maximize_throughput_ = MAXIMIZE_THROUGHPUT;
- using AllocatorElements = A;
- using AllocatorStates = typename std::allocator_traits<A>::template rebind_alloc<std::atomic<uint8_t>>;
-
// AtomicQueueCommon members are stored into by readers and writers.
// Allocate these immutable members on another cache line which never gets invalidated by stores.
alignas(CACHE_LINE_SIZE) unsigned size_;
- std::atomic<unsigned char>* states_;
+ AtomicState* states_;
T* elements_;
- static constexpr auto STATES_PER_CACHE_LINE = CACHE_LINE_SIZE / sizeof(State);
+ static constexpr auto STATES_PER_CACHE_LINE = CACHE_LINE_SIZE / sizeof(AtomicState);
static_assert(STATES_PER_CACHE_LINE, "Unexpected STATES_PER_CACHE_LINE.");
static constexpr auto SHUFFLE_BITS = details::GetCacheLineIndexBits<STATES_PER_CACHE_LINE>::value;
@@ -540,34 +544,43 @@ class AtomicQueueB2 : public AtomicQueueCommon<AtomicQueueB2<T, A, MAXIMIZE_THRO
Base::template do_push_any(std::forward<U>(element), states_[index], elements_[index]);
}
+ template<class U>
+ U* allocate_() {
+ U* p = reinterpret_cast<U*>(StorageAllocator::allocate(size_ * sizeof(U)));
+ assert(reinterpret_cast<uintptr_t>(p) % alignof(U) == 0); // Allocated storage must be suitably aligned for U.
+ return p;
+ }
+
+ template<class U>
+ void deallocate_(U* p) noexcept {
+ StorageAllocator::deallocate(reinterpret_cast<unsigned char*>(p), size_ * sizeof(U)); // TODO: This must be noexcept, static_assert that.
+ }
+
public:
using value_type = T;
+ using allocator_type = A;
// The special member functions are not thread-safe.
- AtomicQueueB2(unsigned size)
- : size_(std::max(details::round_up_to_power_of_2(size), 1u << (SHUFFLE_BITS * 2)))
- , states_(AllocatorStates::allocate(size_))
- , elements_(AllocatorElements::allocate(size_)) {
- for(auto p = states_, q = states_ + size_; p < q; ++p)
- p->store(Base::EMPTY, X);
-
- AllocatorElements& ae = *this;
+ AtomicQueueB2(unsigned size, A const& allocator = A{})
+ : StorageAllocator(allocator)
+ , size_(std::max(details::round_up_to_power_of_2(size), 1u << (SHUFFLE_BITS * 2)))
+ , states_(allocate_<AtomicState>())
+ , elements_(allocate_<T>()) {
+ std::uninitialized_fill_n(states_, size_, Base::EMPTY);
+ A a = get_allocator();
+ assert(a == allocator); // The standard requires the original and rebound allocators to manage the same state.
for(auto p = elements_, q = elements_ + size_; p < q; ++p)
- std::allocator_traits<AllocatorElements>::construct(ae, p);
+ std::allocator_traits<A>::construct(a, p);
}
AtomicQueueB2(AtomicQueueB2&& b) noexcept
- : Base(static_cast<Base&&>(b))
- , AllocatorElements(static_cast<AllocatorElements&&>(b)) // TODO: This must be noexcept, static_assert that.
- , AllocatorStates(static_cast<AllocatorStates&&>(b)) // TODO: This must be noexcept, static_assert that.
- , size_(b.size_)
- , states_(b.states_)
- , elements_(b.elements_) {
- b.size_ = 0;
- b.states_ = 0;
- b.elements_ = 0;
- }
+ : StorageAllocator(static_cast<StorageAllocator&&>(b)) // TODO: This must be noexcept, static_assert that.
+ , Base(static_cast<Base&&>(b))
+ , size_(std::exchange(b.size_, 0))
+ , states_(std::exchange(b.states_, nullptr))
+ , elements_(std::exchange(b.elements_, nullptr))
+ {}
AtomicQueueB2& operator=(AtomicQueueB2&& b) noexcept {
b.swap(*this);
@@ -576,19 +589,23 @@ public:
~AtomicQueueB2() noexcept {
if(elements_) {
- AllocatorElements& ae = *this;
+ A a = get_allocator();
for(auto p = elements_, q = elements_ + size_; p < q; ++p)
- std::allocator_traits<AllocatorElements>::destroy(ae, p);
- AllocatorElements::deallocate(elements_, size_); // TODO: This must be noexcept, static_assert that.
- AllocatorStates::deallocate(states_, size_); // TODO: This must be noexcept, static_assert that.
+ std::allocator_traits<A>::destroy(a, p);
+ deallocate_(elements_);
+ details::destroy_n(states_, size_);
+ deallocate_(states_);
}
}
+ A get_allocator() const noexcept {
+ return *this; // The standard requires implicit conversion between rebound allocators.
+ }
+
void swap(AtomicQueueB2& b) noexcept {
using std::swap;
- this->Base::swap(b);
- swap(static_cast<AllocatorElements&>(*this), static_cast<AllocatorElements&>(b));
- swap(static_cast<AllocatorStates&>(*this), static_cast<AllocatorStates&>(b));
+ swap(static_cast<StorageAllocator&>(*this), static_cast<StorageAllocator&>(b));
+ Base::swap(b);
swap(size_, b.size_);
swap(states_, b.states_);
swap(elements_, b.elements_);
=====================================
include/atomic_queue/defs.h
=====================================
@@ -14,7 +14,7 @@ static inline void spin_loop_pause() noexcept {
_mm_pause();
}
} // namespace atomic_queue
-#elif defined(__arm__) || defined(__aarch64__)
+#elif defined(__arm__) || defined(__aarch64__) || defined(_M_ARM64)
namespace atomic_queue {
constexpr int CACHE_LINE_SIZE = 64;
static inline void spin_loop_pause() noexcept {
@@ -30,6 +30,8 @@ static inline void spin_loop_pause() noexcept {
defined(__ARM_ARCH_8A__) || \
defined(__aarch64__))
asm volatile ("yield" ::: "memory");
+#elif defined(_M_ARM64)
+ __yield();
#else
asm volatile ("nop" ::: "memory");
#endif
@@ -55,7 +57,11 @@ static inline void spin_loop_pause() noexcept {
}
} // namespace atomic_queue
#else
+#ifdef _MSC_VER
+#pragma message("Unknown CPU architecture. Using L1 cache line size of 64 bytes and no spinloop pause instruction.")
+#else
#warning "Unknown CPU architecture. Using L1 cache line size of 64 bytes and no spinloop pause instruction."
+#endif
namespace atomic_queue {
constexpr int CACHE_LINE_SIZE = 64; // TODO: Review that this is the correct value.
static inline void spin_loop_pause() noexcept {}
=====================================
scripts/benchmark-prologue.sh
=====================================
@@ -2,7 +2,7 @@
set +e # Ignore failures.
-sudo hugeadm --pool-pages-min 1GB:1 --pool-pages-max 1GB:1
+sudo hugeadm --pool-pages-min 1GB:1
sudo cpupower frequency-set --related --governor performance >/dev/null
if [[ -e /proc/sys/kernel/sched_rt_runtime_us ]]; then
=====================================
src/CMakeLists.txt
=====================================
@@ -0,0 +1,33 @@
+CMAKE_MINIMUM_REQUIRED( VERSION 3.25 )
+
+if ( ATOMIC_QUEUE_BUILD_EXAMPLES )
+ add_executable(
+ atomic_queue_example
+ ${CMAKE_CURRENT_SOURCE_DIR}/example.cc
+ )
+
+ target_link_libraries(
+ atomic_queue_example
+ atomic_queue
+ )
+endif()
+
+if ( ATOMIC_QUEUE_BUILD_TESTS )
+ find_package(Boost REQUIRED COMPONENTS unit_test_framework)
+
+ add_executable(
+ atomic_queue_tests
+ ${CMAKE_CURRENT_SOURCE_DIR}/tests.cc
+ )
+
+ target_link_libraries(
+ atomic_queue_tests
+ atomic_queue
+ Boost::unit_test_framework
+ )
+
+ add_test(
+ NAME atomic_queue_tests
+ COMMAND atomic_queue_tests
+ )
+endif()
\ No newline at end of file
=====================================
src/benchmarks.cc
=====================================
@@ -149,23 +149,6 @@ void check_huge_pages_leaks(char const* name, HugePages& hp) {
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
-struct ThreadCound {
- unsigned producers;
- unsigned comsumers;
-};
-
-template<class T>
-struct ConstructorAdapter : T{
- using T::T;
-};
-
-// template<class T>
-// struct ConstructorAdapter : {
-// using T::T;
-// };
-
-////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
-
// According to my benchmarking, it looks like the best performance is achieved with the following parameters:
// * For SPSC: SPSC=true, MINIMIZE_CONTENTION=false, MAXIMIZE_THROUGHPUT=false.
// * For MPMC: SPSC=false, MINIMIZE_CONTENTION=true, MAXIMIZE_THROUGHPUT=true.
@@ -320,22 +303,24 @@ void run_throughput_benchmark(char const* name, HugePages& hp, std::vector<unsig
}
}
+constexpr int N_TROUGHPUT_MESSAGES = 1000000;
+
template<class Queue>
void run_throughput_mpmc_benchmark(char const* name, HugePages& hp, std::vector<unsigned> const& hw_thread_ids, Type<Queue>, unsigned thread_count_min = 1) {
unsigned const thread_count_max = hw_thread_ids.size() / 2;
- run_throughput_benchmark<Queue>(name, hp, hw_thread_ids, 1000000, thread_count_min, thread_count_max);
+ run_throughput_benchmark<Queue>(name, hp, hw_thread_ids, N_TROUGHPUT_MESSAGES, thread_count_min, thread_count_max);
}
template<class... Args>
void run_throughput_spsc_benchmark(char const* name, HugePages& hp, std::vector<unsigned> const& hw_thread_ids,
Type<BoostSpScAdapter<boost::lockfree::spsc_queue<Args...>>>) {
using Queue = BoostSpScAdapter<boost::lockfree::spsc_queue<Args...>>;
- run_throughput_benchmark<Queue>(name, hp, hw_thread_ids, 1000000, 1, 1); // spsc_queue can only handle 1 producer and 1 consumer.
+ run_throughput_benchmark<Queue>(name, hp, hw_thread_ids, N_TROUGHPUT_MESSAGES, 1, 1); // spsc_queue can only handle 1 producer and 1 consumer.
}
template<class Queue>
void run_throughput_spsc_benchmark(char const* name, HugePages& hp, std::vector<unsigned> const& hw_thread_ids, Type<Queue>) {
- run_throughput_benchmark<Queue>(name, hp, hw_thread_ids, 1000000, 1, 1); // Special case for 1 producer and 1 consumer.
+ run_throughput_benchmark<Queue>(name, hp, hw_thread_ids, N_TROUGHPUT_MESSAGES, 1, 1); // Special case for 1 producer and 1 consumer.
}
void run_throughput_benchmarks(HugePages& hp, std::vector<CpuTopologyInfo> const& cpu_topology) {
@@ -464,7 +449,7 @@ inline std::array<cycles_t, 2> ping_pong_benchmark(unsigned N, HugePages& hp, un
template<class Queue>
void run_ping_pong_benchmark(char const* name, HugePages& hp, std::vector<unsigned> const& hw_thread_ids) {
- int constexpr N = 100000;
+ int constexpr N_PING_PONG_MESSAGES = 100000;
int constexpr RUNS = 10;
unsigned const cpus[2] = {hw_thread_ids[0], hw_thread_ids[1]};
@@ -472,7 +457,7 @@ void run_ping_pong_benchmark(char const* name, HugePages& hp, std::vector<unsign
// select the best of RUNS runs.
std::array<cycles_t, 2> best_times = {std::numeric_limits<int64_t>::max(), std::numeric_limits<int64_t>::max()};
for(unsigned run = RUNS; run--;) {
- auto times = ping_pong_benchmark<Queue>(N, hp, cpus);
+ auto times = ping_pong_benchmark<Queue>(N_PING_PONG_MESSAGES, hp, cpus);
if(best_times[0] + best_times[1] > times[0] + times[1])
best_times = times;
@@ -480,7 +465,7 @@ void run_ping_pong_benchmark(char const* name, HugePages& hp, std::vector<unsign
}
auto avg_time = to_seconds((best_times[0] + best_times[1]) / 2);
- auto round_trip_time = avg_time / N;
+ auto round_trip_time = avg_time / N_PING_PONG_MESSAGES;
std::printf("%32s: %.9f sec/round-trip\n", name, round_trip_time);
}
=====================================
src/huge_pages.h
=====================================
@@ -143,6 +143,12 @@ struct HugePageAllocator : HugePageAllocatorBase
using value_type = T;
+ HugePageAllocator() noexcept = default;
+
+ template<class U>
+ HugePageAllocator(HugePageAllocator<U>) noexcept
+ {}
+
T* allocate(size_t n) const {
return static_cast<T*>(hp->allocate(n * sizeof(T)));
}
@@ -151,11 +157,13 @@ struct HugePageAllocator : HugePageAllocatorBase
hp->deallocate(p, n * sizeof(T));
}
- bool operator==(HugePageAllocator b) const {
+ template<class U>
+ bool operator==(HugePageAllocator<U> b) const {
return hp == b.hp;
}
- bool operator!=(HugePageAllocator b) const {
+ template<class U>
+ bool operator!=(HugePageAllocator<U> b) const {
return hp != b.hp;
}
};
=====================================
src/tests.cc
=====================================
@@ -11,6 +11,7 @@
#include <cstdint>
#include <thread>
+#include <string>
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
@@ -32,7 +33,7 @@ void stress() {
std::thread producers[PRODUCERS];
for(unsigned i = 0; i < PRODUCERS; ++i)
- producers[i] = std::thread([&q, &barrier]() {
+ producers[i] = std::thread([&q, &barrier, N=N]() {
barrier.wait();
for(unsigned n = N; n; --n)
q.push(n);
@@ -79,13 +80,13 @@ void test_unique_ptr_int(Q& q) {
BOOST_CHECK(q.was_empty());
BOOST_CHECK_EQUAL(q.was_size(), 0u);
std::unique_ptr<int> p{new int{1}};
- BOOST_REQUIRE(q.try_push(move(p)));
+ BOOST_REQUIRE(q.try_push(std::move(p)));
BOOST_CHECK(!p);
BOOST_CHECK(!q.was_empty());
BOOST_CHECK_EQUAL(q.was_size(), 1u);
p.reset(new int{2});
- q.push(move(p));
+ q.push(std::move(p));
BOOST_REQUIRE(!p);
BOOST_CHECK(!q.was_empty());
BOOST_CHECK_EQUAL(q.was_size(), 2u);
@@ -105,6 +106,55 @@ void test_unique_ptr_int(Q& q) {
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
+template<class T, class State>
+struct test_stateful_allocator : std::allocator<T> {
+ State state;
+ test_stateful_allocator() = delete;
+
+ // disambiguate constructor with std::nullptr_t
+ // std::in_place available since C++17
+ test_stateful_allocator(std::nullptr_t, const State& s) noexcept
+ : state(s) {}
+
+ test_stateful_allocator(const test_stateful_allocator& other) noexcept
+ : std::allocator<T>(other), state(other.state) {}
+
+ template<class U>
+ test_stateful_allocator(const test_stateful_allocator<U, State>& other) noexcept
+ : state(other.state) {}
+
+ test_stateful_allocator& operator=(const test_stateful_allocator& other) noexcept {
+ state = other.state;
+ return *this;
+ }
+
+ ~test_stateful_allocator() noexcept = default;
+
+ template<class U>
+ struct rebind {
+ using other = test_stateful_allocator<U, State>;
+ };
+};
+
+// Required by boost-test
+template<class T, class State>
+std::ostream& operator<<(std::ostream& os, const test_stateful_allocator<T, State>& allocator) {
+ return os << allocator.state;
+}
+
+template<class T1, class T2, class State>
+bool operator==(const test_stateful_allocator<T1, State>& lhs, const test_stateful_allocator<T2, State>& rhs) {
+ return lhs.state == rhs.state;
+}
+
+template<class T1, class T2, class State>
+bool operator!=(const test_stateful_allocator<T1, State>& lhs, const test_stateful_allocator<T2, State>& rhs) {
+ return !(lhs.state == rhs.state);
+}
+
+////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
} // namespace
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
@@ -137,6 +187,28 @@ BOOST_AUTO_TEST_CASE(move_only_b2) {
test_unique_ptr_int(q);
}
+BOOST_AUTO_TEST_CASE(allocator_constructor_only_b) {
+ using allocator_type = test_stateful_allocator<int, std::string>;
+ const auto allocator = allocator_type(nullptr, "Capybara");
+
+ AtomicQueueB<int, allocator_type> q(2, allocator);
+
+ BOOST_CHECK_EQUAL(q.get_allocator(), allocator);
+ auto q2 = std::move(q);
+ BOOST_CHECK_EQUAL(q2.get_allocator(), allocator);
+}
+
+BOOST_AUTO_TEST_CASE(allocator_constructor_only_b2) {
+ using allocator_type = test_stateful_allocator<std::unique_ptr<int>, std::string>;
+ const auto allocator = allocator_type(nullptr, "Fox");
+
+ AtomicQueueB2<std::unique_ptr<int>, allocator_type> q(2, allocator);
+
+ BOOST_CHECK_EQUAL(q.get_allocator(), allocator);
+ auto q2 = std::move(q);
+ BOOST_CHECK_EQUAL(q2.get_allocator(), allocator);
+}
+
BOOST_AUTO_TEST_CASE(move_constructor_assignment) {
AtomicQueueB2<std::unique_ptr<int>> q(2);
auto q2 = std::move(q);
View it on GitLab: https://salsa.debian.org/med-team/libatomic-queue/-/compare/42d12de5264f943fc601cf8d6c539635c0df6ee9...88e1fe9ecc5d5b95101905a25f10666b1cd101ca
--
View it on GitLab: https://salsa.debian.org/med-team/libatomic-queue/-/compare/42d12de5264f943fc601cf8d6c539635c0df6ee9...88e1fe9ecc5d5b95101905a25f10666b1cd101ca
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20240823/4ec86f88/attachment-0001.htm>
More information about the debian-med-commit
mailing list