[med-svn] [Git][med-team/libatomic-queue][upstream] New upstream version 0.0+git20201108.d9d66b6

Andreas Tille gitlab at salsa.debian.org
Sat Dec 5 20:09:25 GMT 2020



Andreas Tille pushed to branch upstream at Debian Med / libatomic-queue


Commits:
2b1e15cf by Andreas Tille at 2020-12-05T20:39:42+01:00
New upstream version 0.0+git20201108.d9d66b6
- - - - -


5 changed files:

- .github/workflows/c-cpp.yml
- Makefile
- README.md
- include/atomic_queue/atomic_queue.h
- src/example.cc


Changes:

=====================================
.github/workflows/c-cpp.yml
=====================================
@@ -16,6 +16,6 @@ jobs:
     - name: Environment variables
       run: make env; make TOOLSET=gcc versions; make TOOLSET=clang versions
     - name: Unit tests with gcc
-      run: make -rj2 TOOLSET=gcc run_tests
+      run: make -rj2 TOOLSET=gcc example run_tests
     - name: Unit tests with clang
-      run: make -rj2 TOOLSET=clang run_tests
+      run: make -rj2 TOOLSET=clang example run_tests


=====================================
Makefile
=====================================
@@ -69,7 +69,7 @@ LINK.EXE = ${LD} -o $@ $(ldflags) $(filter-out Makefile,$^) $(ldlibs)
 LINK.SO = ${LD} -o $@ -shared $(ldflags) $(filter-out Makefile,$^) $(ldlibs)
 LINK.A = ${AR} rscT $@ $(filter-out Makefile,$^)
 
-exes := benchmarks tests
+exes := benchmarks tests example
 
 all : ${exes}
 


=====================================
README.md
=====================================
@@ -3,16 +3,14 @@
 # atomic_queue
 C++14 multiple-producer-multiple-consumer *lockless* queues based on circular buffer with [`std::atomic`][3].
 
-The main design principle these queues follow is _simplicity_: the bare minimum of atomic operations, fixed size buffer, value semantics.
-
-The circular buffer side-steps the memory reclamation problem inherent in linked-list based queues for the price of fixed buffer size. See [Effective memory reclamation for lock-free data structures in C++][4] for more details.
+The main design principle these queues follow is _minimalism_: the bare minimum of atomic operations, fixed size buffer, value semantics.
 
 These qualities are also limitations:
 
-* The maximum queue size must be set at compile time or construction time.
-* There are no OS-blocking push/pop functions.
 +* The maximum queue size must be set at compile time or construction time. The circular buffer side-steps the memory reclamation problem inherent in linked-list based queues at the price of a fixed buffer size. See [Effective memory reclamation for lock-free data structures in C++][4] for more details. A fixed buffer size may not be much of a limitation: once the queue grows beyond the maximum expected size, that indicates that elements aren't being processed fast enough, and if the queue keeps growing it may eventually consume all available memory, affecting the entire system rather than only the problematic process. The only apparent inconvenience is that one has to make an upfront back-of-the-envelope estimate of the largest expected/acceptable queue size.
+* There are no OS-blocking push/pop functions. This queue is designed for ultra-low-latency scenarios and using an OS blocking primitive would be sacrificing push-to-pop latency. For lowest possible latency one cannot afford blocking in the OS kernel because the wake-up latency of a blocked thread is about 1-3 microseconds, whereas this queue's round-trip time can be as low as 150 nanoseconds.
 
-Nevertheless, ultra-low-latency applications need just that and nothing more. The simplicity pays off, see the [throughput and latency benchmarks][1].
+Ultra-low-latency applications need just that and nothing more. The minimalism pays off, see the [throughput and latency benchmarks][1].
 
 Available containers are:
 * `AtomicQueue` - a fixed size ring-buffer for atomic elements.
@@ -92,7 +90,7 @@ In a production multiple-producer-multiple-consumer scenario the ring-buffer siz
 
 Using a power-of-2 ring-buffer array size allows a couple of important optimizations:
 
-* The writer and reader indexes get mapped into the ring-buffer array index using modulo `% SIZE` binary operator and using a power-of-2 size turns that modulo operator into one plain `and` instruction and that is as fast as it gets.
 +* The writer and reader indexes get mapped into the ring-buffer array index using the remainder operator `% SIZE`; a power-of-2 size turns that remainder into a single plain `and` instruction, which is as fast as it gets.
 * The *element index within the cache line* gets swapped with the *cache line index* within the *ring-buffer array element index*, so that subsequent queue elements actually reside in different cache lines. This eliminates contention between producers and consumers on the ring-buffer cache lines. Instead of `N` producers together with `M` consumers competing on the same ring-buffer array cache line in the worst case, it is only one producer competing with one consumer. This optimisation scales better with the number of producers and consumers, and element size. With low number of producers and consumers (up to about 2 of each in these benchmarks) disabling this optimisation may yield better throughput (but higher variance across runs).
 
 The containers use `unsigned` type for size and internal indexes. On x86-64 platform `unsigned` is 32-bit wide, whereas `size_t` is 64-bit wide. 64-bit instructions utilise an extra byte instruction prefix resulting in slightly more pressure on the CPU instruction cache and the front-end. Hence, 32-bit `unsigned` indexes are used to maximise performance. That limits the queue size to 4,294,967,295 elements, which seems to be a reasonable hard limit for many applications.
@@ -141,7 +139,7 @@ The project uses `.editorconfig` and `.clang-format` to automate formatting. Pul
 
 ## Help needed
 * Submit pull requests with benchmarking code for other queues. The queues should be somewhat widely used or have exceptional performance, not my-first-mpmc-queue type projects.
-* Benchmarking results on different architectures or with much more cores. Run `scripts/run-benchmarks.sh` and email me the results file, or put it under `results/` and submit a pull request.
 +* Benchmarking results on different architectures or with many more cores, in particular on AMD Ryzen CPUs. Run `scripts/run-benchmarks.sh` and email me the results file, or put it under `results/` and submit a pull request.
 
 ---
 


=====================================
include/atomic_queue/atomic_queue.h
=====================================
@@ -211,7 +211,7 @@ protected:
         else {
             for(;;) {
                 unsigned char expected = STORED;
-                if(ATOMIC_QUEUE_LIKELY(state.compare_exchange_strong(expected, LOADING, X, X))) {
+                if(ATOMIC_QUEUE_LIKELY(state.compare_exchange_strong(expected, LOADING, A, X))) {
                     T element{std::move(q_element)};
                     state.store(EMPTY, R);
                     return element;
@@ -236,7 +236,7 @@ protected:
         else {
             for(;;) {
                 unsigned char expected = EMPTY;
-                if(ATOMIC_QUEUE_LIKELY(state.compare_exchange_strong(expected, STORING, X, X))) {
+                if(ATOMIC_QUEUE_LIKELY(state.compare_exchange_strong(expected, STORING, A, X))) {
                     q_element = std::forward<U>(element);
                     state.store(STORED, R);
                     return;
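
The two changes above tighten the success memory ordering of the CAS from relaxed (`X`) to acquire (`A`), so the claiming thread's access to the element cannot be reordered before the CAS that claims the slot. A reduced, hypothetical sketch of the pop-side state machine with the orderings spelled out (`pop_one` and `slot` are illustrative names, not the library's API):

```cpp
#include <atomic>

enum : unsigned char { EMPTY, STORING, STORED, LOADING };

// Spin until the slot is STORED, claim it with an acquire CAS,
// read the element, then release the slot back to EMPTY.
int pop_one(std::atomic<unsigned char>& state, int& slot) {
    for(;;) {
        unsigned char expected = STORED;
        // Acquire on success: the producer's write to `slot`, made visible
        // by its release-store of STORED, is guaranteed to be seen here.
        // Relaxed on failure: a failed CAS publishes nothing.
        if(state.compare_exchange_strong(expected, LOADING,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed)) {
            int element = slot;
            state.store(EMPTY, std::memory_order_release);
            return element;
        }
    }
}
```

With a relaxed success ordering the load of `slot` could in principle be reordered before the CAS and observe a stale value; acquire on success rules that out while remaining cheaper than a full sequentially consistent exchange.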


=====================================
src/example.cc
=====================================
@@ -9,20 +9,19 @@
 #include <iostream>
 
 int main() {
-    int constexpr PRODUCERS = 1;
-    int constexpr CONSUMERS = 2;
-    unsigned constexpr CAPACITY = 1024;
-    unsigned constexpr N = 1000000;
+    int constexpr PRODUCERS = 1; // Number of producer threads.
+    int constexpr CONSUMERS = 2; // Number of consumer threads.
+    unsigned constexpr N = 1000000; // Pass this many elements from producers to consumers.
+    unsigned constexpr CAPACITY = 1024; // Queue capacity. Since there are more consumers than producers this doesn't have to be large.
 
-    using Element = uint32_t;
-    Element constexpr NIL = static_cast<Element>(-1);
+    using Element = uint32_t; // Queue element type.
+    Element constexpr NIL = static_cast<Element>(-1); // Atomic elements require a special value that cannot be pushed/popped.
+    using Queue = atomic_queue::AtomicQueueB<Element, std::allocator<Element>, NIL>; // Use heap-allocated buffer.
 
-    using Queue = atomic_queue::AtomicQueueB<Element, std::allocator<Element>, NIL>;
-
-    // Create a queue shared between producers and consumers.
+    // Create a queue object shared between producers and consumers.
     Queue q{CAPACITY};
 
-    // Start consumers.
+    // Start the consumers.
     uint64_t results[CONSUMERS];
     std::thread consumers[CONSUMERS];
     for(int i = 0; i < CONSUMERS; ++i)
@@ -33,7 +32,7 @@ int main() {
             r = sum;
         });
 
-    // Start producers.
+    // Start the producers.
     std::thread producers[PRODUCERS];
     for(int i = 0; i < PRODUCERS; ++i)
         producers[i] = std::thread([&q]() {
@@ -45,14 +44,14 @@ int main() {
     for(auto& t : producers)
         t.join();
 
-    // Stop consumers.
+    // Tell each consumer to complete and terminate.
     for(int i = CONSUMERS; i--;)
         q.push(0);
     // Wait till consumers complete and terminate.
     for(auto& t : consumers)
         t.join();
 
-    // Verify the results.
 +    // Verify that each message was received by exactly one consumer.
     uint64_t result = 0;
     for(auto& r : results) {
         result += r;



View it on GitLab: https://salsa.debian.org/med-team/libatomic-queue/-/commit/2b1e15cf0af7bc41f82b575210df53ca3180ade0


