[med-svn] [Git][med-team/libatomic-queue][master] 4 commits: routine-update: New upstream version

Andreas Tille (@tille) gitlab at salsa.debian.org
Fri Aug 25 08:48:19 BST 2023



Andreas Tille pushed to branch master at Debian Med / libatomic-queue


Commits:
beed9b57 by Andreas Tille at 2023-08-25T09:42:23+02:00
routine-update: New upstream version

- - - - -
55827781 by Andreas Tille at 2023-08-25T09:42:24+02:00
New upstream version 1.4
- - - - -
0a0879aa by Andreas Tille at 2023-08-25T09:42:31+02:00
Update upstream source from tag 'upstream/1.4'

Update to upstream version '1.4'
with Debian dir f96a1897f9fa927d5d6c1b0d18f3638063120519
- - - - -
6ae12289 by Andreas Tille at 2023-08-25T09:47:09+02:00
Upload to unstable

- - - - -


5 changed files:

- README.md
- debian/changelog
- html/benchmarks.html
- include/atomic_queue/atomic_queue.h
- include/atomic_queue/defs.h


Changes:

=====================================
README.md
=====================================
@@ -1,45 +1,30 @@
 [![C++14](https://img.shields.io/badge/dialect-C%2B%2B14-blue)](https://en.cppreference.com/w/cpp/14)
 [![MIT license](https://img.shields.io/github/license/max0x7ba/atomic_queue)](https://github.com/max0x7ba/atomic_queue/blob/master/LICENSE)
-![Latest release](https://img.shields.io/github/v/tag/max0x7ba/atomic_queue?label=latest%20release)
-[![Ubuntu continuous integration](https://github.com/max0x7ba/atomic_queue/workflows/Ubuntu%20continuous%20integration/badge.svg)](https://github.com/max0x7ba/atomic_queue/actions?query=workflow%3A%22Ubuntu%20continuous%20integration%22)
-<br>
-![platform Linux x86_64](https://img.shields.io/badge/platform-Linux%20x86_64--bit-yellow)
+![platform Linux 64-bit](https://img.shields.io/badge/platform-Linux%2064--bit-yellow)
 ![platform Linux ARM](https://img.shields.io/badge/platform-Linux%20ARM-yellow)
 ![platform Linux RISC-V](https://img.shields.io/badge/platform-Linux%20RISC--V-yellow)
-![platform Linux PowerPC](https://img.shields.io/badge/platform-Linux%20PowerPC-yellow)
-![platform Linux IBM System/390](https://img.shields.io/badge/platform-Linux%20IBM%20System/390-yellow)
+![Latest release](https://img.shields.io/github/v/tag/max0x7ba/atomic_queue?label=latest%20release)
+[![Ubuntu continuous integration](https://github.com/max0x7ba/atomic_queue/workflows/Ubuntu%20continuous%20integration/badge.svg)](https://github.com/max0x7ba/atomic_queue/actions?query=workflow%3A%22Ubuntu%20continuous%20integration%22)
 
 # atomic_queue
 C++14 multiple-producer-multiple-consumer *lockless* queues based on a circular buffer with [`std::atomic`][3].
 
 They have been developed, tested and benchmarked on Linux, but should work on any C++14 platform which implements `std::atomic`.
 
-These queues have been designed with a goal to minimize the latency between one thread pushing an element into a queue and another thread popping it from the queue.
-
-## Design Principles
-When minimizing latency a good design is not when there is nothing left to add, but rather when there is nothing left to remove, as these queues exemplify.
-
-The main design principle these queues follow is _minimalism_, which results in such design choices as:
-
-* Bare minimum of atomic instructions.
-* Explicit contention/false-sharing avoidance.
-* Fixed size buffer.
-* Value semantics. Meaning that the queues make a copy/move upon `push`/`pop`, no reference/pointer to elements in the queue can be obtained.
-
-The impact of each of these small design choices on their own is barely measurable, but their total impact is much greater than a simple sum of the constituents' impacts, aka super-scalar compounding or synergy (a layman's term). The synergy emerging from combining multiple of these small design choices together is what allows CPUs to perform at their peak capacities least impeded.
+The main design principle these queues follow is _minimalism_: the bare minimum of atomic operations, a fixed-size buffer, and value semantics.
 
-These design choices are also limitations:
+These qualities are also limitations:
 
-* The maximum queue size must be set at compile time or construction time. The circular buffer side-steps the memory reclamation problem inherent in linked-list based queues for the price of fixed buffer size. See [Effective memory reclamation for lock-free data structures in C++][4] for more details. Fixed buffer size may not be that much of a limitation, since once the queue gets larger than the maximum expected size that indicates a problem that elements aren't consumed fast enough, and if the queue keeps growing it may eventually consume all available memory which may affect the entire system, rather than the problematic process only. The only apparent inconvenience is that one has to do an upfront calculation on what would be the largest expected/acceptable number of unconsumed elements in the queue.
+* The maximum queue size must be set at compile time or construction time. The circular buffer side-steps the memory reclamation problem inherent in linked-list based queues at the price of a fixed buffer size. See [Effective memory reclamation for lock-free data structures in C++][4] for more details. A fixed buffer size may not be much of a limitation: once the queue grows larger than the maximum expected size, that indicates elements aren't being processed fast enough, and if the queue kept growing it would eventually consume all available memory and affect the entire system rather than only the problematic process. The only apparent inconvenience is an upfront back-of-the-envelope calculation of the largest expected/acceptable queue size.
 * There are no OS-blocking push/pop functions. This queue is designed for ultra-low-latency scenarios and using an OS blocking primitive would be sacrificing push-to-pop latency. For lowest possible latency one cannot afford blocking in the OS kernel because the wake-up latency of a blocked thread is about 1-3 microseconds, whereas this queue's round-trip time can be as low as 150 nanoseconds.
 
 Ultra-low-latency applications need just that and nothing more. The minimalism pays off, see the [throughput and latency benchmarks][1].
 
 Available containers are:
 * `AtomicQueue` - a fixed size ring-buffer for atomic elements.
-* `OptimistAtomicQueue` - a faster fixed size ring-buffer for atomic elements which busy-waits when empty or full. It is `AtomicQueue` used with `push`/`pop` instead of `try_push`/`try_pop`.
+* `OptimistAtomicQueue` - a faster fixed size ring-buffer for atomic elements which busy-waits when empty or full.
 * `AtomicQueue2` - a fixed size ring-buffer for non-atomic elements.
-* `OptimistAtomicQueue2` - a faster fixed size ring-buffer for non-atomic elements which busy-waits when empty or full. It is `AtomicQueue2` used with `push`/`pop` instead of `try_push`/`try_pop`.
+* `OptimistAtomicQueue2` - a faster fixed size ring-buffer for non-atomic elements which busy-waits when empty or full.
 
 These containers have corresponding `AtomicQueueB`, `OptimistAtomicQueueB`, `AtomicQueueB2`, `OptimistAtomicQueueB2` versions where the buffer size is specified as an argument to the constructor.
 
@@ -49,13 +34,12 @@ Single-producer-single-consumer mode is supported. In this mode, no expensive at
 
 Move-only queue element types are fully supported. For example, a queue of `std::unique_ptr<T>` elements would be `AtomicQueue2B<std::unique_ptr<T>>` or `AtomicQueue2<std::unique_ptr<T>, CAPACITY>`.
 
-## Role Models
-Several other well established and popular thread-safe containers are used for reference in the [benchmarks][1]:
+A few other thread-safe containers are used for reference in the benchmarks:
 * `std::mutex` - a fixed size ring-buffer with `std::mutex`.
 * `pthread_spinlock` - a fixed size ring-buffer with `pthread_spinlock_t`.
 * `boost::lockfree::spsc_queue` - a wait-free single-producer-single-consumer queue from Boost library.
 * `boost::lockfree::queue` - a lock-free multiple-producer-multiple-consumer queue from Boost library.
-* `moodycamel::ConcurrentQueue` - a lock-free multiple-producer-multiple-consumer queue used in non-blocking mode. This queue is designed to maximize throughput at the expense of latency and eschewing the global time order of elements pushed into one queue by different threads. It is not equivalent to other queues benchmarked here in this respect.
+* `moodycamel::ConcurrentQueue` - a lock-free multiple-producer-multiple-consumer queue used in non-blocking mode.
 * `moodycamel::ReaderWriterQueue` - a lock-free single-producer-single-consumer queue used in non-blocking mode.
 * `xenium::michael_scott_queue` - a lock-free multi-producer-multi-consumer queue proposed by [Michael and Scott](http://www.cs.rochester.edu/~scott/papers/1996_PODC_queues.pdf) (this queue is similar to `boost::lockfree::queue` which is also based on the same proposal).
 * `xenium::ramalhete_queue` - a lock-free multi-producer-multi-consumer queue proposed by [Ramalhete and Correia](http://concurrencyfreaks.blogspot.com/2016/11/faaarrayqueue-mpmc-lock-free-queue-part.html).


=====================================
debian/changelog
=====================================
@@ -1,4 +1,4 @@
-libatomic-queue (0.0+git20230629.b770bb2-1) UNRELEASED; urgency=medium
+libatomic-queue (1.4-1) unstable; urgency=medium
 
   [ Nilesh Patra ]
   * [ci skip] Remove myself from uploaders
@@ -21,7 +21,7 @@ libatomic-queue (0.0+git20230629.b770bb2-1) UNRELEASED; urgency=medium
   [ Andreas Tille ]
   * Update symbols
 
- -- Étienne Mollier <emollier at debian.org>  Fri, 21 Jul 2023 14:39:18 +0200
+ -- Andreas Tille <tille at debian.org>  Fri, 25 Aug 2023 09:45:57 +0200
 
 libatomic-queue (0.0+git20220518.83774a2-1) unstable; urgency=medium
 


=====================================
html/benchmarks.html
=====================================
@@ -25,7 +25,7 @@
   <body>
     <h1 class="view-toggle">Scalability Benchmark</h1>
     <div>
-      <p>N producer threads push a 4-byte integer into the same queue, and N consumer threads pop the integers from it. All producers post 1,000,000 messages in total. The total time to send and receive all the messages is measured. The benchmark is run from 1 producer and 1 consumer up to (total-number-of-cpus / 2) producers/consumers to measure the scalability of different queues. The minimum, maximum, mean and standard deviation of at least 33 runs are reported in the tooltip.</p>
+      <p>N producer threads push a 4-byte integer into the same queue, and N consumer threads pop the integers from it. All producers post 1,000,000 messages in total. The total time to send and receive all the messages is measured. The benchmark is run from 1 producer and 1 consumer up to (total-number-of-cpus / 2) producers/consumers to measure the scalability of different queues.</p>
       <h3 class="view-toggle">Scalability on Intel i9-9900KS</h3><div class="chart" id="scalability-9900KS-5GHz"></div>
       <h3 class="view-toggle">Scalability on AMD Ryzen 7 5825U</h3><div class="chart" id="scalability-ryzen-5825u"></div>
       <h3 class="view-toggle">Scalability on Intel Xeon Gold 6132</h3><div class="chart" id="scalability-xeon-gold-6132"></div>
@@ -34,7 +34,7 @@
 
     <h1 class="view-toggle">Latency Benchmark</h1>
     <div>
-      <p>One thread posts a 4-byte integer to another thread through one queue and waits for a reply from another queue (2 queues in total). The benchmark measures the total time of 100,000 ping-pongs, best of 10 runs. Contention is minimal here (1 producer, 1 consumer, 1 element in the queue) in order to achieve and measure the lowest latency. The average round-trip time is reported, i.e. the time it takes to post a message to another thread and receive a reply. The minimum, maximum, mean and standard deviation of at least 33 runs are reported in the tooltip.</p>
+      <p>One thread posts a 4-byte integer to another thread through one queue and waits for a reply from another queue (2 queues in total). The benchmark measures the total time of 100,000 ping-pongs, best of 10 runs. Contention is minimal here (1 producer, 1 consumer, 1 element in the queue) in order to achieve and measure the lowest latency. The average round-trip time is reported, i.e. the time it takes to post a message to another thread and receive a reply.</p>
       <h3 class="view-toggle">Latency on Intel i9-9900KS</h3><div class="chart" id="latency-9900KS-5GHz"></div>
       <h3 class="view-toggle">Latency on AMD Ryzen 7 5825U</h3><div class="chart" id="latency-ryzen-5825u"></div>
       <h3 class="view-toggle">Latency on Intel Xeon Gold 6132</h3><div class="chart" id="latency-xeon-gold-6132"></div>


=====================================
include/atomic_queue/atomic_queue.h
=====================================
@@ -57,12 +57,15 @@ struct GetIndexShuffleBits<false, array_size, elements_per_cache_line> {
 // the element within the cache line) with the next N bits (which are the index of the cache line)
 // of the element index.
 template<int BITS>
-constexpr unsigned remap_index(unsigned index) noexcept {
-    unsigned constexpr mix_mask{(1u << BITS) - 1};
-    unsigned const mix{(index ^ (index >> BITS)) & mix_mask};
+constexpr unsigned remap_index_with_mix(unsigned index, unsigned mix) {
     return index ^ mix ^ (mix << BITS);
 }
 
+template<int BITS>
+constexpr unsigned remap_index(unsigned index) noexcept {
+    return remap_index_with_mix<BITS>(index, (index ^ (index >> BITS)) & ((1u << BITS) - 1));
+}
+
 template<>
 constexpr unsigned remap_index<0>(unsigned index) noexcept {
     return index;


=====================================
include/atomic_queue/defs.h
=====================================
@@ -14,7 +14,7 @@ static inline void spin_loop_pause() noexcept {
     _mm_pause();
 }
 } // namespace atomic_queue
-#elif defined(__arm__) || defined(__aarch64__) || defined(_M_ARM64)
+#elif defined(__arm__) || defined(__aarch64__)
 namespace atomic_queue {
 constexpr int CACHE_LINE_SIZE = 64;
 static inline void spin_loop_pause() noexcept {
@@ -30,8 +30,6 @@ static inline void spin_loop_pause() noexcept {
      defined(__ARM_ARCH_8A__) || \
      defined(__aarch64__))
     asm volatile ("yield" ::: "memory");
-#elif defined(_M_ARM64)
-    __yield();
 #else
     asm volatile ("nop" ::: "memory");
 #endif
@@ -57,11 +55,7 @@ static inline void spin_loop_pause() noexcept {
 }
 } // namespace atomic_queue
 #else
-#ifdef _MSC_VER
-#pragma message("Unknown CPU architecture. Using L1 cache line size of 64 bytes and no spinloop pause instruction.")
-#else
 #warning "Unknown CPU architecture. Using L1 cache line size of 64 bytes and no spinloop pause instruction."
-#endif
 namespace atomic_queue {
 constexpr int CACHE_LINE_SIZE = 64; // TODO: Review that this is the correct value.
 static inline void spin_loop_pause() noexcept {}



View it on GitLab: https://salsa.debian.org/med-team/libatomic-queue/-/compare/143e3724d4c9642ed5b9a1ee38c00554f3439b1b...6ae122898bfa8eaa79061daede2e7600e49c287e


