
include/qemu/bswap.h: using atomic memory load/store for word access

Message ID 1621220021-17199-1-git-send-email-maobibo@loongson.cn (mailing list archive)
State New, archived

Commit Message

bibo mao May 17, 2021, 2:53 a.m. UTC
The virtio ring buffer uses a lockless ring buffer scheme. While a guest
vcpu reads the memory, a qemu io thread may be writing to the same
address. This requires atomic operations on the qemu side, but
__builtin_memcpy may be compiled into byte-by-byte accesses.

This patch fixes this; however, it may have a negative performance
impact on systems whose hardware does not support unaligned memory
access.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
---
 include/qemu/bswap.h | 34 ++++++++++++----------------------
 1 file changed, 12 insertions(+), 22 deletions(-)

Comments

Peter Maydell May 17, 2021, 9:23 a.m. UTC | #1
On Mon, 17 May 2021 at 03:54, Bibo Mao <maobibo@loongson.cn> wrote:
>
> The virtio ring buffer uses a lockless ring buffer scheme. While a guest
> vcpu reads the memory, a qemu io thread may be writing to the same
> address. This requires atomic operations on the qemu side, but
> __builtin_memcpy may be compiled into byte-by-byte accesses.
>
> This patch fixes this; however, it may have a negative performance
> impact on systems whose hardware does not support unaligned memory
> access.
>
> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
> ---
>  /*
> - * Any compiler worth its salt will turn these memcpy into native unaligned
> - * operations.  Thus we don't need to play games with packed attributes, or
> - * inline byte-by-byte stores.
> - * Some compilation environments (eg some fortify-source implementations)
> - * may intercept memcpy() in a way that defeats the compiler optimization,
> - * though, so we use __builtin_memcpy() to give ourselves the best chance
> - * of good performance.
> + * Drivers using a lockless ring buffer, such as the virtio ring, require
> + * these accesses to be atomic, since a guest vcpu thread may be reading
> + * the memory concurrently. This may hurt performance on architectures
> + * whose hardware does not support unaligned memory access, like mips,
> + * if ptr is not word aligned.
>   */
>
>  static inline int lduw_he_p(const void *ptr)
>  {
> -    uint16_t r;
> -    __builtin_memcpy(&r, ptr, sizeof(r));
> -    return r;
> +    return *(uint16_t *)ptr;

We rely on these functions handling unaligned pointers, as
the comment you delete notes. This change will not merely
"bring negative performance effect", it will cause crashes
on hosts like SPARC.

(Your change makes these functions have undefined behaviour in C,
which is why it will cause crashes on some systems.)

We do need to add support in various bits of the memory subsystem
for "this really does have to be atomic" (notably so that we can
do atomic read-modify-write actions when doing page table walks),
but that has to be done by adding extra APIs as necessary. And in
some places we already *have* those functions.
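
For example (purely a sketch with hypothetical names, assuming the
GCC/Clang __atomic builtins and a pointer the caller guarantees is
2-byte aligned), a must-be-atomic variant might look like:

#include <stdint.h>

/* Hypothetical must-be-atomic accessors -- not existing QEMU APIs. */
static inline uint16_t lduw_he_atomic_p(const void *ptr)
{
    /* One aligned half-word load; never split into byte loads. */
    return __atomic_load_n((const uint16_t *)ptr, __ATOMIC_RELAXED);
}

static inline void stw_he_atomic_p(void *ptr, uint16_t v)
{
    /* One aligned half-word store; never torn into byte stores. */
    __atomic_store_n((uint16_t *)ptr, v, __ATOMIC_RELAXED);
}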

Can you describe the problem you're trying to solve in greater
detail? Notably:
 * what are the stacktraces for the accesses that end up in these
   functions ? (ie what is the top level function call or set of
   function calls that need to have must-be-atomic variants?)
 * can we guarantee that the pointers really are aligned there?

thanks
-- PMM
bibo mao May 17, 2021, 1:40 p.m. UTC | #2
PMM,

Thanks for the kind response.

I reply inline.

On Mon, 17 May 2021 at 17:23, Peter Maydell wrote:
> On Mon, 17 May 2021 at 03:54, Bibo Mao <maobibo@loongson.cn> wrote:
>>
>> The virtio ring buffer uses a lockless ring buffer scheme. While a guest
>> vcpu reads the memory, a qemu io thread may be writing to the same
>> address. This requires atomic operations on the qemu side, but
>> __builtin_memcpy may be compiled into byte-by-byte accesses.
>>
>> This patch fixes this; however, it may have a negative performance
>> impact on systems whose hardware does not support unaligned memory
>> access.
>>
>> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
>> ---
>>  /*
>> - * Any compiler worth its salt will turn these memcpy into native unaligned
>> - * operations.  Thus we don't need to play games with packed attributes, or
>> - * inline byte-by-byte stores.
>> - * Some compilation environments (eg some fortify-source implementations)
>> - * may intercept memcpy() in a way that defeats the compiler optimization,
>> - * though, so we use __builtin_memcpy() to give ourselves the best chance
>> - * of good performance.
>> + * Drivers using a lockless ring buffer, such as the virtio ring, require
>> + * these accesses to be atomic, since a guest vcpu thread may be reading
>> + * the memory concurrently. This may hurt performance on architectures
>> + * whose hardware does not support unaligned memory access, like mips,
>> + * if ptr is not word aligned.
>>   */
>>
>>  static inline int lduw_he_p(const void *ptr)
>>  {
>> -    uint16_t r;
>> -    __builtin_memcpy(&r, ptr, sizeof(r));
>> -    return r;
>> +    return *(uint16_t *)ptr;
> 
> We rely on these functions handling unaligned pointers, as
> the comment you delete notes. This change will not merely
> "bring negative performance effect", it will cause crashes
> on hosts like SPARC.
> 
> (Your change makes these functions have undefined behaviour in C,
> which is why it will cause crashes on some systems.)
Oops. Maybe adding extra APIs with atomic semantics is better.
> 
> We do need to add support in various bits of the memory subsystem
> for "this really does have to be atomic" (notably so that we can
> do atomic read-modify-write actions when doing page table walks),
> but that has to be done by adding extra APIs as necessary. And in
> some places we already *have* those functions.
> 
> Can you describe the problem you're trying to solve in greater
> detail? Notably:
>  * what are the stacktraces for the accesses that end up in these
>    functions ? (ie what is the top level function call or set of
>    function calls that need to have must-be-atomic variants?)
>  * can we guarantee that the pointers really are aligned there?
When running virtio network stress test cases on my MIPS kvm guest,
the network is lost after some hours.

The reason is that a thread in the guest OS reads the vring used index
in the linux virtio driver, in drivers/virtio/virtio_ring.c:
static bool virtqueue_poll_split(struct virtqueue *_vq, unsigned last_used_idx)
{
        struct vring_virtqueue *vq = to_vvq(_vq);

        return (u16)last_used_idx != virtio16_to_cpu(_vq->vdev,
                        vq->split.vring.used->idx);
}


At the same time, the qemu iothread on the host is updating idx, in
hw/virtio/virtio.c:
static inline void vring_used_idx_set(VirtQueue *vq, uint16_t val)
{
    VRingMemoryRegionCaches *caches = vring_get_region_caches(vq);
    hwaddr pa = offsetof(VRingUsed, idx);

    if (caches) {
        virtio_stw_phys_cached(vq->vdev, &caches->used, pa, val);
        address_space_cache_invalidate(&caches->used, pa, sizeof(val));
    }

    vq->used_idx = val;
}
Here is the function call chain on the qemu side (innermost first):
	stw_he_p()
	stw_le_p()
	address_space_stw_le_cached()
	stw_le_phys_cached()
	virtio_stw_phys_cached ()
static inline void stw_he_p(void *ptr, uint16_t v)
{
    __builtin_memcpy(ptr, &v, sizeof(v));
}

On the linux kernel side the index is read directly as a half-word;
on the qemu side, however, the same buffer may be written byte-by-byte,
which causes problems in the guest virtio driver.
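
To make the hazard concrete, here is a little-endian sketch (illustration
only, the names are hypothetical) of idx being stored byte-by-byte while
the guest loads it as a half-word:

#include <stdint.h>

/* Shared ring index, standing in for VRingUsed.idx; assume it holds
 * 0x00ff and the host wants to publish 0x0100. */
static uint16_t used_idx = 0x00ff;

/* Host side: a byte-wise store, as a memcpy lowered to byte accesses
 * on a strict-alignment target would behave. */
static void host_store_bytewise(uint16_t v)
{
    uint8_t *p = (uint8_t *)&used_idx;
    p[0] = v & 0xff;    /* a guest load here observes 0x0000 */
    p[1] = v >> 8;      /* only now does it observe 0x0100   */
}

/* Guest side: one half-word load, as virtqueue_poll_split does. */
static uint16_t guest_load(void)
{
    return used_idx;    /* expects to see only 0x00ff or 0x0100 */
}

Between the two byte stores the guest can read 0x0000 (or 0x01ff if the
store order is reversed), values that were never written, which breaks
the lockless protocol.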


Per the spec at https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-430008
the pointer is half-word aligned. Here is the relevant layout comment from linux/include/uapi/linux/virtio_ring.h:

/* The standard layout for the ring is a continuous chunk of memory which looks
 * like this.  We assume num is a power of 2.
 *
 * struct vring
 * {
 *      // The actual descriptors (16 bytes each)
 *      struct vring_desc desc[num];
 *
 *      // A ring of available descriptor heads with free-running index.
 *      __virtio16 avail_flags;
 *      __virtio16 avail_idx;
 *      __virtio16 available[num];
 *      __virtio16 used_event_idx;
 *
 *      // Padding to the next align boundary.
 *      char pad[];
 *
 *      // A ring of used descriptor heads with free-running index.
 *      __virtio16 used_flags;
 *      __virtio16 used_idx;
 *      struct vring_used_elem used[num];
 *      __virtio16 avail_event_idx;
 * };
 */
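
A compile-time check against that header (a sketch, assuming C11 static
assertions and the Linux uapi header) confirms the half-word alignment
of the idx field:

#include <stddef.h>
#include <assert.h>
#include <linux/virtio_ring.h>

/* flags is a 16-bit field, so idx lands at offset 2 inside the used
 * ring, which itself starts on an aligned boundary per the spec. */
static_assert(offsetof(struct vring_used, idx) == 2,
              "idx follows the 16-bit flags field");
static_assert(offsetof(struct vring_used, idx) % 2 == 0,
              "used->idx is half-word aligned");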

> 
> thanks
> -- PMM
>

Patch

diff --git a/include/qemu/bswap.h b/include/qemu/bswap.h
index 2d3bb8b..b914d33 100644
--- a/include/qemu/bswap.h
+++ b/include/qemu/bswap.h
@@ -327,56 +327,46 @@  static inline void stb_p(void *ptr, uint8_t v)
 }
 
 /*
- * Any compiler worth its salt will turn these memcpy into native unaligned
- * operations.  Thus we don't need to play games with packed attributes, or
- * inline byte-by-byte stores.
- * Some compilation environments (eg some fortify-source implementations)
- * may intercept memcpy() in a way that defeats the compiler optimization,
- * though, so we use __builtin_memcpy() to give ourselves the best chance
- * of good performance.
+ * Drivers using a lockless ring buffer, such as the virtio ring, require
+ * these accesses to be atomic, since a guest vcpu thread may be reading
+ * the memory concurrently. This may hurt performance on architectures
+ * whose hardware does not support unaligned memory access, like mips,
+ * if ptr is not word aligned.
  */
 
 static inline int lduw_he_p(const void *ptr)
 {
-    uint16_t r;
-    __builtin_memcpy(&r, ptr, sizeof(r));
-    return r;
+    return *(uint16_t *)ptr;
 }
 
 static inline int ldsw_he_p(const void *ptr)
 {
-    int16_t r;
-    __builtin_memcpy(&r, ptr, sizeof(r));
-    return r;
+    return *(int16_t *)ptr;
 }
 
 static inline void stw_he_p(void *ptr, uint16_t v)
 {
-    __builtin_memcpy(ptr, &v, sizeof(v));
+    *(uint16_t *)ptr = v;
 }
 
 static inline int ldl_he_p(const void *ptr)
 {
-    int32_t r;
-    __builtin_memcpy(&r, ptr, sizeof(r));
-    return r;
+    return *(int32_t *)ptr;
 }
 
 static inline void stl_he_p(void *ptr, uint32_t v)
 {
-    __builtin_memcpy(ptr, &v, sizeof(v));
+    *(uint32_t *)ptr = v;
 }
 
 static inline uint64_t ldq_he_p(const void *ptr)
 {
-    uint64_t r;
-    __builtin_memcpy(&r, ptr, sizeof(r));
-    return r;
+    return *(uint64_t *)ptr;
 }
 
 static inline void stq_he_p(void *ptr, uint64_t v)
 {
-    __builtin_memcpy(ptr, &v, sizeof(v));
+    *(uint64_t *)ptr = v;
 }
 
 static inline int lduw_le_p(const void *ptr)