
[2/4] atomics: update documentation for C11

Message ID 20200406191320.13371-3-pbonzini@redhat.com (mailing list archive)
State New, archived
Series async: fix hangs on weakly-ordered architectures

Commit Message

Paolo Bonzini April 6, 2020, 7:13 p.m. UTC
Deprecate atomic_mb_read and atomic_mb_set; it is not really possible to
use them correctly because they do not interoperate with sequentially-consistent
RMW operations.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 docs/devel/atomics.rst | 290 ++++++++++++++++-------------------------
 1 file changed, 114 insertions(+), 176 deletions(-)
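
As a sketch of the problem (not part of the patch): modeling
atomic_mb_read()/atomic_mb_set() with C11 <stdatomic.h>, following the
definitions this patch removes from the docs below; mb_read/mb_set are
illustrative names, not QEMU API.

    #include <stdatomic.h>

    /* atomic_mb_read(p) = atomic_read(p); smp_mb_acquire() */
    static inline int mb_read(atomic_int *p)
    {
        int v = atomic_load_explicit(p, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        return v;
    }

    /* atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); smp_mb() */
    static inline void mb_set(atomic_int *p, int v)
    {
        atomic_thread_fence(memory_order_release);
        atomic_store_explicit(p, v, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
    }

The load inside mb_read() is relaxed, so it does not take part in the
single total order that C11 guarantees only among seq-cst operations;
a seq-cst read-modify-write (e.g. a seq-cst atomic_fetch_add) on another
variable therefore cannot be ordered against it by sequential-consistency
reasoning alone, which is the mismatch the commit message refers to.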

Comments

Eric Blake April 6, 2020, 8:03 p.m. UTC | #1
On 4/6/20 2:13 PM, Paolo Bonzini wrote:
> Deprecate atomic_mb_read and atomic_mb_set; it is not really possible to
> use them correctly because they do not interoperate with sequentially-consistent
> RMW operations.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   docs/devel/atomics.rst | 290 ++++++++++++++++-------------------------
>   1 file changed, 114 insertions(+), 176 deletions(-)
> 

> @@ -24,6 +29,14 @@ Macros defined by ``qemu/atomic.h`` fall in three camps:
>   
>   - sequentially consistent atomic access: everything else.
>   
> +In general, use of ``qemu/atomic.h`` should be wrapped with more easily
> +used data structures (e.g. the lock-free singly-liked list operations

linked

> +``QSLIST_INSERT_HEAD_ATOMIC`` and ``QSLIST_MOVE_ATOMIC``) or synchronization
> +primitives (such as RCU, ``QemuEvent`` or ``QemuLockCnt``).  Bare use of
> +atomic operations and memory barriers should be limited to inter-thread
> +checking of flags and documented thoroughly.
> +
> +
>   
>   Compiler memory barrier
>   =======================
> @@ -85,36 +98,14 @@ Similar operations return the new value of ``*ptr``::
>       typeof(*ptr) atomic_or_fetch(ptr, val)
>       typeof(*ptr) atomic_xor_fetch(ptr, val)
>   
> -Sequentially consistent loads and stores can be done using::
> -
> -    atomic_fetch_add(ptr, 0) for loads
> -    atomic_xchg(ptr, val) for stores
> +These operations operate on any type that is as wide as an int or smaller.
>   
> -However, they are quite expensive on some platforms, notably POWER and
> -Arm.  Therefore, qemu/atomic.h provides two primitives with slightly
> -weaker constraints::
> +``qemu/atomic.h`` also provides sequentially consistent loads and stores can::
>   

s/ can//

>       typeof(*ptr) atomic_mb_read(ptr)
>       void         atomic_mb_set(ptr, val)
>   
> -The semantics of these primitives map to Java volatile variables,
> -and are strongly related to memory barriers as used in the Linux
> -kernel (see below).
> -
> -As long as you use atomic_mb_read and atomic_mb_set, accesses cannot
> -be reordered with each other, and it is also not possible to reorder
> -"normal" accesses around them.
> -
> -However, and this is the important difference between
> -atomic_mb_read/atomic_mb_set and sequential consistency, it is important
> -for both threads to access the same volatile variable.  It is not the
> -case that everything visible to thread A when it writes volatile field f
> -becomes visible to thread B after it reads volatile field g. The store
> -and load have to "match" (i.e., be performed on the same volatile
> -field) to achieve the right semantics.
> -
> -
> -These operations operate on any type that is as wide as an int or smaller.
> +which however are deprecated.
>

Stefan Hajnoczi April 7, 2020, 9:06 a.m. UTC | #2
On Mon, Apr 06, 2020 at 03:13:18PM -0400, Paolo Bonzini wrote:
> -The semantics of these primitives map to Java volatile variables,
> -and are strongly related to memory barriers as used in the Linux
> -kernel (see below).
> -
> -As long as you use atomic_mb_read and atomic_mb_set, accesses cannot
> -be reordered with each other, and it is also not possible to reorder
> -"normal" accesses around them.
> -
> -However, and this is the important difference between
> -atomic_mb_read/atomic_mb_set and sequential consistency, it is important
> -for both threads to access the same volatile variable.  It is not the
> -case that everything visible to thread A when it writes volatile field f
> -becomes visible to thread B after it reads volatile field g. The store
> -and load have to "match" (i.e., be performed on the same volatile
> -field) to achieve the right semantics.
> -
> -
> -These operations operate on any type that is as wide as an int or smaller.
> +which however are deprecated.

Please indicate why they are deprecated and advise which alternative is
preferred.  This will help readers understand the current best practices
and make a decision about whether to avoid the deprecated APIs.

Paolo Bonzini April 7, 2020, 9:13 a.m. UTC | #3
On 07/04/20 11:06, Stefan Hajnoczi wrote:
>> -
>> -However, and this is the important difference between
>> -atomic_mb_read/atomic_mb_set and sequential consistency, it is important
>> -for both threads to access the same volatile variable.  It is not the
>> -case that everything visible to thread A when it writes volatile field f
>> -becomes visible to thread B after it reads volatile field g. The store
>> -and load have to "match" (i.e., be performed on the same volatile
>> -field) to achieve the right semantics.
>> -
>> -
>> -These operations operate on any type that is as wide as an int or smaller.
>> +which however are deprecated.
> Please indicate why they are deprecated and advise which alternative is
> preferred.  This will help readers understand the current best practices
> and make a decision about whether to avoid the deprecated APIs.

True, I will put a link to the "Comparison with Linux" section and clarify.

Paolo
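
A self-contained C11 sketch of the preferred acquire/release pairing that
the patch text below documents, with <stdatomic.h> and C11 threads standing
in for QEMU's atomic_store_release()/atomic_load_acquire() macros (an
illustration, not QEMU code):

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    static atomic_int a, b;

    /* Writer: QEMU's atomic_store_release() maps to a C11 release store. */
    static int writer(void *arg)
    {
        (void)arg;
        atomic_store_explicit(&a, 1, memory_order_release);
        atomic_store_explicit(&b, 1, memory_order_release);
        return 0;
    }

    /* Reader: QEMU's atomic_load_acquire() maps to a C11 acquire load. */
    static int reader(void *arg)
    {
        (void)arg;
        int y = atomic_load_explicit(&b, memory_order_acquire);
        int x = atomic_load_explicit(&a, memory_order_acquire);
        /* If y == 1, the acquire load synchronized with the release store
         * to b, so x == 1 is guaranteed; y == 1 && x == 0 cannot happen. */
        printf("x=%d y=%d\n", x, y);
        return 0;
    }

    int main(void)
    {
        thrd_t t1, t2;
        thrd_create(&t1, writer, NULL);
        thrd_create(&t2, reader, NULL);
        thrd_join(t1, NULL);
        thrd_join(t2, NULL);
        return 0;
    }

When a store must instead be ordered before a later load of a different
variable (the Dekker-style case that made atomic_mb_set() attractive), the
patch's "Comparison with Linux kernel memory barriers" hunk recommends an
explicit smp_mb() between the two accesses.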

Patch

diff --git a/docs/devel/atomics.rst b/docs/devel/atomics.rst
index 00f3f7d3ed..e92752a64d 100644
--- a/docs/devel/atomics.rst
+++ b/docs/devel/atomics.rst
@@ -11,10 +11,15 @@  that is consistent with the expectations of the programmer.
 The most basic tool is locking.  Mutexes, condition variables and
 semaphores are used in QEMU, and should be the default approach to
 synchronization.  Anything else is considerably harder, but it's
-also justified more often than one would like.  The two tools that
-are provided by ``qemu/atomic.h`` are memory barriers and atomic operations.
+also justified more often than one would like;
+the most performance-critical parts of QEMU in particular require
+a very low-level approach to concurrency, involving memory barriers
+and atomic operations.  The semantics of concurrent memory accesses are governed
+by the C11 memory model.
 
-Macros defined by ``qemu/atomic.h`` fall in three camps:
+QEMU provides a header, ``qemu/atomic.h``, which wraps C11 atomics to
+provide better portability and a less verbose syntax.  ``qemu/atomic.h``
+provides macros that fall in three camps:
 
 - compiler barriers: ``barrier()``;
 
@@ -24,6 +29,14 @@  Macros defined by ``qemu/atomic.h`` fall in three camps:
 
 - sequentially consistent atomic access: everything else.
 
+In general, use of ``qemu/atomic.h`` should be wrapped with more easily
+used data structures (e.g. the lock-free singly-liked list operations
+``QSLIST_INSERT_HEAD_ATOMIC`` and ``QSLIST_MOVE_ATOMIC``) or synchronization
+primitives (such as RCU, ``QemuEvent`` or ``QemuLockCnt``).  Bare use of
+atomic operations and memory barriers should be limited to inter-thread
+checking of flags and documented thoroughly.
+
+
 
 Compiler memory barrier
 =======================
@@ -85,36 +98,14 @@  Similar operations return the new value of ``*ptr``::
     typeof(*ptr) atomic_or_fetch(ptr, val)
     typeof(*ptr) atomic_xor_fetch(ptr, val)
 
-Sequentially consistent loads and stores can be done using::
-
-    atomic_fetch_add(ptr, 0) for loads
-    atomic_xchg(ptr, val) for stores
+These operations operate on any type that is as wide as an int or smaller.
 
-However, they are quite expensive on some platforms, notably POWER and
-Arm.  Therefore, qemu/atomic.h provides two primitives with slightly
-weaker constraints::
+``qemu/atomic.h`` also provides sequentially consistent loads and stores can::
 
     typeof(*ptr) atomic_mb_read(ptr)
     void         atomic_mb_set(ptr, val)
 
-The semantics of these primitives map to Java volatile variables,
-and are strongly related to memory barriers as used in the Linux
-kernel (see below).
-
-As long as you use atomic_mb_read and atomic_mb_set, accesses cannot
-be reordered with each other, and it is also not possible to reorder
-"normal" accesses around them.
-
-However, and this is the important difference between
-atomic_mb_read/atomic_mb_set and sequential consistency, it is important
-for both threads to access the same volatile variable.  It is not the
-case that everything visible to thread A when it writes volatile field f
-becomes visible to thread B after it reads volatile field g. The store
-and load have to "match" (i.e., be performed on the same volatile
-field) to achieve the right semantics.
-
-
-These operations operate on any type that is as wide as an int or smaller.
+which however are deprecated.
 
 
 Weak atomic access and manual memory barriers
@@ -208,135 +199,62 @@  They come in six kinds:
   dependency and a full read barrier or better is required.
 
 
-This is the set of barriers that is required *between* two ``atomic_read()``
-and ``atomic_set()`` operations to achieve sequential consistency:
-
-   +----------------+-------------------------------------------------------+
-   |                |                 2nd operation                         |
-   |                +------------------+-----------------+------------------+
-   | 1st operation  | (after last)     | atomic_read     | atomic_set       |
-   +----------------+------------------+-----------------+------------------+
-   | (before first) | ..               | none            | smp_mb_release() |
-   +----------------+------------------+-----------------+------------------+
-   | atomic_read    | smp_mb_acquire() | smp_rmb() [1]_  | [2]_             |
-   +----------------+------------------+-----------------+------------------+
-   | atomic_set     | none             | smp_mb() [3]_   | smp_wmb()        |
-   +----------------+------------------+-----------------+------------------+
-
-   .. [1] Or smp_read_barrier_depends().
-
-   .. [2] This requires a load-store barrier.  This is achieved by
-          either smp_mb_acquire() or smp_mb_release().
-
-   .. [3] This requires a store-load barrier.  On most machines, the only
-          way to achieve this is a full barrier.
-
-
-You can see that the two possible definitions of ``atomic_mb_read()``
-and ``atomic_mb_set()`` are the following:
-
-  1) | atomic_mb_read(p)   = atomic_read(p); smp_mb_acquire()
-     | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); smp_mb()
-
-  2) | atomic_mb_read(p)   = smp_mb() atomic_read(p); smp_mb_acquire()
-     | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v);
-
-Usually the former is used, because ``smp_mb()`` is expensive and a program
-normally has more reads than writes.  Therefore it makes more sense to
-make ``atomic_mb_set()`` the more expensive operation.
-
-There are two common cases in which atomic_mb_read and atomic_mb_set
-generate too many memory barriers, and thus it can be useful to manually
-place barriers, or use atomic_load_acquire/atomic_store_release instead:
-
-- when a data structure has one thread that is always a writer
-  and one thread that is always a reader, manual placement of
-  memory barriers makes the write side faster.  Furthermore,
-  correctness is easy to check for in this case using the "pairing"
-  trick that is explained below:
-
-    +----------------------------------------------------------------------+
-    | thread 1                                                             |
-    +-----------------------------------+----------------------------------+
-    | before                            | after                            |
-    +===================================+==================================+
-    | ::                                | ::                               |
-    |                                   |                                  |
-    |   (other writes)                  |                                  |
-    |   atomic_mb_set(&a, x)            |   atomic_store_release(&a, x)    |
-    |   atomic_mb_set(&b, y)            |   atomic_store_release(&b, y)    |
-    +-----------------------------------+----------------------------------+
-
-    +----------------------------------------------------------------------+
-    | thread 2                                                             |
-    +-----------------------------------+----------------------------------+
-    | before                            | after                            |
-    +===================================+==================================+
-    | ::                                | ::                               |
-    |                                   |                                  |
-    |   y = atomic_mb_read(&b)          |   y = atomic_load_acquire(&b)    |
-    |   x = atomic_mb_read(&a)          |   x = atomic_load_acquire(&a)    |
-    |   (other reads)                   |                                  |
-    +-----------------------------------+----------------------------------+
-
-  Note that the barrier between the stores in thread 1, and between
-  the loads in thread 2, has been optimized here to a write or a
-  read memory barrier respectively.  On some architectures, notably
-  ARMv7, smp_mb_acquire and smp_mb_release are just as expensive as
-  smp_mb, but smp_rmb and/or smp_wmb are more efficient.
-
-- sometimes, a thread is accessing many variables that are otherwise
-  unrelated to each other (for example because, apart from the current
-  thread, exactly one other thread will read or write each of these
-  variables).  In this case, it is possible to "hoist" the implicit
-  barriers provided by ``atomic_mb_read()`` and ``atomic_mb_set()`` outside
-  a loop.  For example, the above definition ``atomic_mb_read()`` gives
-  the following transformation:
-
-    +-----------------------------------+----------------------------------+
-    | before                            | after                            |
-    +===================================+==================================+
-    | ::                                | ::                               |
-    |                                   |                                  |
-    |   n = 0;                          |   n = 0;                         |
-    |   for (i = 0; i < 10; i++)        |   for (i = 0; i < 10; i++)       |
-    |     n += atomic_mb_read(&a[i]);   |     n += atomic_read(&a[i]);     |
-    |                                   |   smp_mb_acquire();              |
-    +-----------------------------------+----------------------------------+
-
-  Similarly, atomic_mb_set() can be transformed as follows:
-
-    +-----------------------------------+----------------------------------+
-    | before                            | after                            |
-    +===================================+==================================+
-    | ::                                | ::                               |
-    |                                   |                                  |
-    |                                   |   smp_mb_release();              |
-    |   for (i = 0; i < 10; i++)        |   for (i = 0; i < 10; i++)       |
-    |     atomic_mb_set(&a[i], false);  |     atomic_set(&a[i], false);    |
-    |                                   |   smp_mb();                      |
-    +-----------------------------------+----------------------------------+
-
-
-  The other thread can still use ``atomic_mb_read()``/``atomic_mb_set()``.
+Memory barriers and ``atomic_load_acquire``/``atomic_store_release`` are
+mostly used when a data structure has one thread that is always a writer
+and one thread that is always a reader:
+
+    +----------------------------------+----------------------------------+
+    | thread 1                         | thread 2                         |
+    +==================================+==================================+
+    | ::                               | ::                               |
+    |                                  |                                  |
+    |   atomic_store_release(&a, x)    |   y = atomic_load_acquire(&b)    |
+    |   atomic_store_release(&b, y)    |   x = atomic_load_acquire(&a)    |
+    +----------------------------------+----------------------------------+
+
+In this case, correctness is easy to check using the
+"pairing" trick that is explained below.
+
+Sometimes, a thread is accessing many variables that are otherwise
+unrelated to each other (for example because, apart from the current
+thread, exactly one other thread will read or write each of these
+variables).  In this case, it is possible to "hoist" the barriers
+outside a loop.  For example:
+
+    +------------------------------------------+----------------------------------+
+    | before                                   | after                            |
+    +==========================================+==================================+
+    | ::                                       | ::                               |
+    |                                          |                                  |
+    |   n = 0;                                 |   n = 0;                         |
+    |   for (i = 0; i < 10; i++)               |   for (i = 0; i < 10; i++)       |
+    |     n += atomic_load_acquire(&a[i]);     |     n += atomic_read(&a[i]);     |
+    |                                          |   smp_mb_acquire();              |
+    +------------------------------------------+----------------------------------+
+    | ::                                       | ::                               |
+    |                                          |                                  |
+    |                                          |   smp_mb_release();              |
+    |   for (i = 0; i < 10; i++)               |   for (i = 0; i < 10; i++)       |
+    |     atomic_store_release(&a[i], false);  |     atomic_set(&a[i], false);    |
+    +------------------------------------------+----------------------------------+
 
 The two tricks can be combined.  In this case, splitting a loop in
-two lets you hoist the barriers out of the loops _and_ eliminate the
-expensive ``smp_mb()``:
-
-    +-----------------------------------+----------------------------------+
-    | before                            | after                            |
-    +===================================+==================================+
-    | ::                                | ::                               |
-    |                                   |                                  |
-    |                                   |     smp_mb_release();            |
-    |   for (i = 0; i < 10; i++) {      |     for (i = 0; i < 10; i++)     |
-    |     atomic_mb_set(&a[i], false);  |       atomic_set(&a[i], false);  |
-    |     atomic_mb_set(&b[i], false);  |     smb_wmb();                   |
-    |   }                               |     for (i = 0; i < 10; i++)     |
-    |                                   |       atomic_set(&a[i], false);  |
-    |                                   |     smp_mb();                    |
-    +-----------------------------------+----------------------------------+
+two lets you hoist the barriers out of the loops, and replace an
+``smp_mb_release()`` with a possibly cheaper (and clearer as well)
+``smp_wmb()``:
+
+    +------------------------------------------+----------------------------------+
+    | before                                   | after                            |
+    +==========================================+==================================+
+    | ::                                       | ::                               |
+    |                                          |                                  |
+    |                                          |     smp_mb_release();            |
+    |   for (i = 0; i < 10; i++) {             |     for (i = 0; i < 10; i++)     |
+    |     atomic_store_release(&a[i], false);  |       atomic_set(&a[i], false);  |
+    |     atomic_store_release(&b[i], false);  |     smp_wmb();                   |
+    |   }                                      |     for (i = 0; i < 10; i++)     |
+    |                                          |       atomic_set(&b[i], false);  |
+    +------------------------------------------+----------------------------------+
 
 
 Memory barrier pairing
@@ -347,11 +265,13 @@  always, be paired with another barrier.  In the case of QEMU, however,
 note that the other barrier may actually be in a driver that runs in
 the guest!
 
-For the purposes of pairing, ``smp_read_barrier_depends()`` and ``smp_rmb()``
-both count as read barriers.  A read barrier shall pair with a write
-barrier or a full barrier; a write barrier shall pair with a read
-barrier or a full barrier.  A full barrier can pair with anything.
-For example:
+For the purposes of pairing, ``smp_read_barrier_depends()``, ``smp_rmb()``
+and ``smp_mb_acquire()`` all count as read barriers; ``smp_wmb()`` and
+``smp_mb_release()`` both count as write barriers.
+
+A read barrier shall pair with a write barrier or a full barrier;
+a write barrier shall pair with a read barrier or a full barrier.
+A full barrier can pair with anything.  For example:
 
       +--------------------+------------------------------+
       | thread 1           | thread 2                     |
@@ -387,9 +307,6 @@  access and for data dependency barriers:
       |                    |  z = b[y];                   |
       +--------------------+------------------------------+
 
-``smp_wmb()`` also pairs with ``atomic_mb_read()`` and ``smp_mb_acquire()``.
-and ``smp_rmb()`` also pairs with ``atomic_mb_set()`` and ``smp_mb_release()``.
-
 
 Comparison with Linux kernel memory barriers
 ============================================
@@ -424,24 +341,45 @@  and memory barriers, and the equivalents in QEMU:
          ``atomic_cmpxchg``    returns the old value of the variable
          ===================== =========================================
 
-  In QEMU, the second kind does not exist.  Currently Linux has
-  atomic_fetch_or only.  QEMU provides and, or, inc, dec, add, sub.
+  In QEMU, the second kind does not exist.
 
 - different atomic read-modify-write operations in Linux imply
   a different set of memory barriers; in QEMU, all of them enforce
-  sequential consistency, which means they imply full memory barriers
-  before and after the operation.
-
-- Linux does not have an equivalent of ``atomic_mb_set()``.  In particular,
-  note that ``smp_store_mb()`` is a little weaker than ``atomic_mb_set()``.
-  ``atomic_mb_read()`` compiles to the same instructions as Linux's
-  ``smp_load_acquire()``, but this should be treated as an implementation
-  detail.
+  sequential consistency.
+
+- in QEMU, ``atomic_read()`` and ``atomic_set()`` do not participate in
+  the total ordering enforced by sequentially-consistent operations.
+  This is because QEMU uses the C11 memory model.  The following example
+  is correct in Linux but not in QEMU:
+
+      +----------------------------------+--------------------------------+
+      | Linux (correct)                  | QEMU (incorrect)               |
+      +==================================+================================+
+      | ::                               | ::                             |
+      |                                  |                                |
+      |   a = atomic_fetch_add(&x, 2);   |   a = atomic_fetch_add(&x, 2); |
+      |   b = READ_ONCE(y);              |   b = atomic_read(&y);         |
+      +----------------------------------+--------------------------------+
+
+  because the read of ``y`` can be moved (by either the processor or the
+  compiler) before the write of ``x``.
+
+  Fixing this requires an ``smp_mb()`` memory barrier between the write
+  of ``x`` and the read of ``y``.  In the common case where only one thread
+  writes ``x``, it is also possible to write it like this:
+
+      +--------------------------------+
+      | QEMU (correct)                 |
+      +================================+
+      | ::                             |
+      |                                |
+      |   a = atomic_read(&x);         |
+      |   atomic_set(&x, a + 2);       |
+      |   smp_mb();                    |
+      |   b = atomic_read(&y);         |
+      +--------------------------------+
 
 Sources
 =======
 
 * Documentation/memory-barriers.txt from the Linux kernel
-
-* "The JSR-133 Cookbook for Compiler Writers", available at
-  http://g.oswego.edu/dl/jmm/cookbook.html
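
To close, the "QEMU (correct)" example from the final hunk rendered as
compilable C11 (a sketch: atomic_read()/atomic_set() are modeled as relaxed
accesses, smp_mb() as a seq-cst fence; update_and_check is an illustrative
name, not code from the patch):

    #include <stdatomic.h>

    static atomic_int x, y;

    int update_and_check(void)
    {
        int a = atomic_load_explicit(&x, memory_order_relaxed);  /* atomic_read(&x)      */
        atomic_store_explicit(&x, a + 2, memory_order_relaxed);  /* atomic_set(&x, a + 2) */
        /* smp_mb(): keeps the store to x ordered before the load of y;
         * without it, a weakly-ordered host may hoist the load of y
         * above the store to x. */
        atomic_thread_fence(memory_order_seq_cst);
        return atomic_load_explicit(&y, memory_order_relaxed);   /* atomic_read(&y)      */
    }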