Message ID | 20220201171601.53316-4-cristian.marussi@arm.com (mailing list archive) |
---|---|
State | New, archived |
Series | Add SCMI Virtio & Clock atomic support |
Looks correct, thanks. Some minor comments below:

On Tue, Feb 01, 2022 at 05:15:55PM +0000, Cristian Marussi wrote:
> Exported API virtqueue_poll() can be used to support polling mode operation
> on top of virtio layer if needed; currently the parameter last_used_idx is
> the opaque value that needs to be passed to the virtqueue_poll() function
> to check if there are new pending used buffers in the queue: such opaque
> value would have been previously obtained by a call to the API function
> virtqueue_enable_cb_prepare().
>
> Since such opaque value is indeed containing simply a snapshot in time of
> the internal

to add: 16 bit

> last_used_index (roughly), it is possible that,

to add here:

if another thread calls virtqueue_add_*()
at the same time (which existing drivers don't do,
but does not seem to be documented as prohibited anywhere), and

> if exactly
> 2**16 buffers are marked as used between two successive calls to
> virtqueue_poll(), the caller is fooled into thinking that nothing is
> pending (ABA problem).
> Keep a full fledged internal wraps counter

s/full fledged/a 16 bit/

since I don't see why a 16 bit counter is "full" but e.g. a 32 bit one is not

> per virtqueue and embed it into
> the upper 16bits of the returned opaque value, so that the above scenario
> can be detected transparently by virtqueue_poll(): this way each single
> possible last_used_idx value is really belonging to a different wrap.

Just to add here: the ABA problem can in theory still happen but
now that's after 2^32 requests, which seems sufficient in practice.

> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Igor Skalkin <igor.skalkin@opensynergy.com>
> Cc: Peter Hilber <peter.hilber@opensynergy.com>
> Cc: virtualization@lists.linux-foundation.org
> Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
> ---
> Still no perf data on this, I was wondering what exactly to measure in
> terms of perf metrics to evaluate the impact of the rolling vq->wraps
> counter.
> ---
>  drivers/virtio/virtio_ring.c | 51 +++++++++++++++++++++++++++++++++---
>  1 file changed, 47 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 00f64f2f8b72..613ec0503509 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -12,6 +12,8 @@
>  #include <linux/hrtimer.h>
>  #include <linux/dma-mapping.h>
>  #include <linux/spinlock.h>
> +#include <linux/bits.h>
> +#include <linux/bitfield.h>
>  #include <xen/xen.h>
>
>  static bool force_used_validation = false;
> @@ -69,6 +71,17 @@ module_param(force_used_validation, bool, 0444);
>  #define LAST_ADD_TIME_INVALID(vq)
>  #endif
>
> +#define VRING_IDX_MASK				GENMASK(15, 0)
> +#define VRING_GET_IDX(opaque)			\
> +	((u16)FIELD_GET(VRING_IDX_MASK, (opaque)))
> +
> +#define VRING_WRAPS_MASK			GENMASK(31, 16)
> +#define VRING_GET_WRAPS(opaque)			\
> +	((u16)FIELD_GET(VRING_WRAPS_MASK, (opaque)))
> +
> +#define VRING_BUILD_OPAQUE(idx, wraps)		\
> +	(FIELD_PREP(VRING_WRAPS_MASK, (wraps)) | ((idx) & VRING_IDX_MASK))
> +

Maybe prefix with VRING_POLL_ since that is the only user.

>  struct vring_desc_state_split {
>  	void *data;			/* Data for callback. */
>  	struct vring_desc *indir_desc;	/* Indirect descriptor, if any. */
> @@ -117,6 +130,8 @@ struct vring_virtqueue {
>  	/* Last used index we've seen. */
>  	u16 last_used_idx;
>
> +	u16 wraps;
> +
>  	/* Hint for event idx: already triggered no need to disable. */
>  	bool event_triggered;
>
> @@ -806,6 +821,8 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
>  	ret = vq->split.desc_state[i].data;
>  	detach_buf_split(vq, i, ctx);
>  	vq->last_used_idx++;
> +	if (unlikely(!vq->last_used_idx))
> +		vq->wraps++;
>  	/* If we expect an interrupt for the next entry, tell host
>  	 * by writing event index and flush out the write before
>  	 * the read in the next get_buf call. */

So most drivers don't call virtqueue_poll.
Concerned about the overhead here: another option is
a flag that will have to be set whenever a driver
wants to use virtqueue_poll.
Could you pls do a quick perf test e.g. using tools/virtio/
to see what's faster?

> @@ -1508,6 +1525,7 @@ static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
>  	if (unlikely(vq->last_used_idx >= vq->packed.vring.num)) {
>  		vq->last_used_idx -= vq->packed.vring.num;
>  		vq->packed.used_wrap_counter ^= 1;
> +		vq->wraps++;
>  	}
>
>  	/*
> @@ -1744,6 +1762,7 @@ static struct virtqueue *vring_create_virtqueue_packed(
>  	vq->weak_barriers = weak_barriers;
>  	vq->broken = false;
>  	vq->last_used_idx = 0;
> +	vq->wraps = 0;
>  	vq->event_triggered = false;
>  	vq->num_added = 0;
>  	vq->packed_ring = true;
> @@ -2092,13 +2111,17 @@ EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
>   */
>  unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
>  {
> +	unsigned int last_used_idx;
>  	struct vring_virtqueue *vq = to_vvq(_vq);
>
>  	if (vq->event_triggered)
>  		vq->event_triggered = false;
>
> -	return vq->packed_ring ? virtqueue_enable_cb_prepare_packed(_vq) :
> -				 virtqueue_enable_cb_prepare_split(_vq);
> +	last_used_idx = vq->packed_ring ?
> +			virtqueue_enable_cb_prepare_packed(_vq) :
> +			virtqueue_enable_cb_prepare_split(_vq);
> +
> +	return VRING_BUILD_OPAQUE(last_used_idx, vq->wraps);
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
>
> @@ -2107,6 +2130,21 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
>   * @_vq: the struct virtqueue we're talking about.
>   * @last_used_idx: virtqueue state (from call to virtqueue_enable_cb_prepare).
>   *
> + * The provided last_used_idx, as returned by virtqueue_enable_cb_prepare(),
> + * is an opaque value representing the queue state and it is built as follows:
> + *
> + * ---------------------------------------------------------
> + * |       vq->wraps        |       vq->last_used_idx      |
> + * 31------------------------------------------------------0
> + *
> + * The MSB 16bits embedding the wraps counter for the underlying virtqueue
> + * is stripped out here before reaching into the lower layer helpers.
> + *
> + * This structure of the opaque value mitigates the scenario in which, when
> + * exactly 2**16 messages are marked as used between two successive calls to
> + * virtqueue_poll(), the caller is fooled into thinking nothing new has arrived
> + * since the pure last_used_idx is exactly the same.
> + *

Do you want to move this comment to where the macros implementing it
are?

>   * Returns "true" if there are pending used buffers in the queue.
>   *
>   * This does not need to be serialized.
> @@ -2118,9 +2156,13 @@ bool virtqueue_poll(struct virtqueue *_vq, unsigned last_used_idx)
>  	if (unlikely(vq->broken))
>  		return false;
>
> +	if (unlikely(vq->wraps != VRING_GET_WRAPS(last_used_idx)))
> +		return true;
> +
>  	virtio_mb(vq->weak_barriers);
> -	return vq->packed_ring ? virtqueue_poll_packed(_vq, last_used_idx) :
> -				 virtqueue_poll_split(_vq, last_used_idx);
> +	return vq->packed_ring ?
> +		virtqueue_poll_packed(_vq, VRING_GET_IDX(last_used_idx)) :
> +			virtqueue_poll_split(_vq, VRING_GET_IDX(last_used_idx));
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_poll);
>
> @@ -2245,6 +2287,7 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
>  	vq->weak_barriers = weak_barriers;
>  	vq->broken = false;
>  	vq->last_used_idx = 0;
> +	vq->wraps = 0;
>  	vq->event_triggered = false;
>  	vq->num_added = 0;
>  	vq->use_dma_api = vring_use_dma_api(vdev);
> --
> 2.17.1

On Tue, Feb 01, 2022 at 01:27:38PM -0500, Michael S. Tsirkin wrote:
> Looks correct, thanks. Some minor comments below:
>

Hi Michael,

thanks for the feedback.

> On Tue, Feb 01, 2022 at 05:15:55PM +0000, Cristian Marussi wrote:
> > Exported API virtqueue_poll() can be used to support polling mode operation
> > on top of virtio layer if needed; currently the parameter last_used_idx is
> > the opaque value that needs to be passed to the virtqueue_poll() function
> > to check if there are new pending used buffers in the queue: such opaque
> > value would have been previously obtained by a call to the API function
> > virtqueue_enable_cb_prepare().
> >
> > Since such opaque value is indeed containing simply a snapshot in time of
> > the internal
>
> to add: 16 bit
>
> > last_used_index (roughly), it is possible that,
>
> to add here:
>
> if another thread calls virtqueue_add_*()
> at the same time (which existing drivers don't do,
> but does not seem to be documented as prohibited anywhere), and
>
> > if exactly
> > 2**16 buffers are marked as used between two successive calls to
> > virtqueue_poll(), the caller is fooled into thinking that nothing is
> > pending (ABA problem).
> > Keep a full fledged internal wraps counter
>
> s/full fledged/a 16 bit/
>
> since I don't see why a 16 bit counter is "full" but e.g. a 32 bit one is not
>

... :D I wanted to stress the fact that this, being a 16-bit counter, has a
higher rollover than the 1-bit wrap_counter already in use... but indeed
they are all just counters in the end, it's just the wraparound that changes.

I'll fix.

> > per virtqueue and embed it into
> > the upper 16bits of the returned opaque value, so that the above scenario
> > can be detected transparently by virtqueue_poll(): this way each single
> > possible last_used_idx value is really belonging to a different wrap.
>
> Just to add here: the ABA problem can in theory still happen but
> now that's after 2^32 requests, which seems sufficient in practice.
>

Sure, I'll fix the commit message as advised above.

> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > Cc: Igor Skalkin <igor.skalkin@opensynergy.com>
> > Cc: Peter Hilber <peter.hilber@opensynergy.com>
> > Cc: virtualization@lists.linux-foundation.org
> > Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
> > ---
> > Still no perf data on this, I was wondering what exactly to measure in
> > terms of perf metrics to evaluate the impact of the rolling vq->wraps
> > counter.
> > ---
> >  drivers/virtio/virtio_ring.c | 51 +++++++++++++++++++++++++++++++++---
> >  1 file changed, 47 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 00f64f2f8b72..613ec0503509 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -12,6 +12,8 @@
> >  #include <linux/hrtimer.h>
> >  #include <linux/dma-mapping.h>
> >  #include <linux/spinlock.h>
> > +#include <linux/bits.h>
> > +#include <linux/bitfield.h>
> >  #include <xen/xen.h>
> >
> >  static bool force_used_validation = false;
> > @@ -69,6 +71,17 @@ module_param(force_used_validation, bool, 0444);
> >  #define LAST_ADD_TIME_INVALID(vq)
> >  #endif
> >
> > +#define VRING_IDX_MASK				GENMASK(15, 0)
> > +#define VRING_GET_IDX(opaque)			\
> > +	((u16)FIELD_GET(VRING_IDX_MASK, (opaque)))
> > +
> > +#define VRING_WRAPS_MASK			GENMASK(31, 16)
> > +#define VRING_GET_WRAPS(opaque)			\
> > +	((u16)FIELD_GET(VRING_WRAPS_MASK, (opaque)))
> > +
> > +#define VRING_BUILD_OPAQUE(idx, wraps)		\
> > +	(FIELD_PREP(VRING_WRAPS_MASK, (wraps)) | ((idx) & VRING_IDX_MASK))
> > +
>
> Maybe prefix with VRING_POLL_ since that is the only user.
>

Will do.

> >  struct vring_desc_state_split {
> >  	void *data;			/* Data for callback. */
> >  	struct vring_desc *indir_desc;	/* Indirect descriptor, if any. */
> > @@ -117,6 +130,8 @@ struct vring_virtqueue {
> >  	/* Last used index we've seen. */
> >  	u16 last_used_idx;
> >
> > +	u16 wraps;
> > +
> >  	/* Hint for event idx: already triggered no need to disable. */
> >  	bool event_triggered;
> >
> > @@ -806,6 +821,8 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
> >  	ret = vq->split.desc_state[i].data;
> >  	detach_buf_split(vq, i, ctx);
> >  	vq->last_used_idx++;
> > +	if (unlikely(!vq->last_used_idx))
> > +		vq->wraps++;
> >  	/* If we expect an interrupt for the next entry, tell host
> >  	 * by writing event index and flush out the write before
> >  	 * the read in the next get_buf call. */
>
> So most drivers don't call virtqueue_poll.
> Concerned about the overhead here: another option is
> a flag that will have to be set whenever a driver
> wants to use virtqueue_poll.

Do you mean a compile-time flag/Kconfig to just remove the possible
overhead instructions as a whole when not needed by the driver?

Or do you mean at runtime, since checking the flag every time should be
less costly than checking the wraps each time AND counting when it
happens?

> Could you pls do a quick perf test e.g. using tools/virtio/
> to see what's faster?

Yes, will do, thanks for the hint. I had some compilation issues in
tools/virtio due to my additions (missing mirrored headers) or to some
recently added stuff (missing drv_to_virtio & friends for the
suppressed_used_validation thing)... anyway I fixed those now and I'll
post the related tools/virtio patches with the next iteration.

Anyway, you mean perf data about vringh_test and virtio_test/vhost,
right? (ringtest/ excluded because it does not use any API, it is just
prototyping)

> >
> > @@ -1508,6 +1525,7 @@ static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
> >  	if (unlikely(vq->last_used_idx >= vq->packed.vring.num)) {
> >  		vq->last_used_idx -= vq->packed.vring.num;
> >  		vq->packed.used_wrap_counter ^= 1;
> > +		vq->wraps++;
> >  	}
> >
> >  	/*
> > @@ -1744,6 +1762,7 @@ static struct virtqueue *vring_create_virtqueue_packed(
> >  	vq->weak_barriers = weak_barriers;
> >  	vq->broken = false;
> >  	vq->last_used_idx = 0;
> > +	vq->wraps = 0;
> >  	vq->event_triggered = false;
> >  	vq->num_added = 0;
> >  	vq->packed_ring = true;
> > @@ -2092,13 +2111,17 @@ EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
> >   */
> >  unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
> >  {
> > +	unsigned int last_used_idx;
> >  	struct vring_virtqueue *vq = to_vvq(_vq);
> >
> >  	if (vq->event_triggered)
> >  		vq->event_triggered = false;
> >
> > -	return vq->packed_ring ? virtqueue_enable_cb_prepare_packed(_vq) :
> > -				 virtqueue_enable_cb_prepare_split(_vq);
> > +	last_used_idx = vq->packed_ring ?
> > +			virtqueue_enable_cb_prepare_packed(_vq) :
> > +			virtqueue_enable_cb_prepare_split(_vq);
> > +
> > +	return VRING_BUILD_OPAQUE(last_used_idx, vq->wraps);
> >  }
> >  EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
> >
> > @@ -2107,6 +2130,21 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
> >   * @_vq: the struct virtqueue we're talking about.
> >   * @last_used_idx: virtqueue state (from call to virtqueue_enable_cb_prepare).
> >   *
> > + * The provided last_used_idx, as returned by virtqueue_enable_cb_prepare(),
> > + * is an opaque value representing the queue state and it is built as follows:
> > + *
> > + * ---------------------------------------------------------
> > + * |       vq->wraps        |       vq->last_used_idx      |
> > + * 31------------------------------------------------------0
> > + *
> > + * The MSB 16bits embedding the wraps counter for the underlying virtqueue
> > + * is stripped out here before reaching into the lower layer helpers.
> > + *
> > + * This structure of the opaque value mitigates the scenario in which, when
> > + * exactly 2**16 messages are marked as used between two successive calls to
> > + * virtqueue_poll(), the caller is fooled into thinking nothing new has arrived
> > + * since the pure last_used_idx is exactly the same.
> > + *
>
> Do you want to move this comment to where the macros implementing it
> are?
>

Sure, will do.

Thanks,
Cristian

On Thu, Feb 03, 2022 at 10:51:19AM +0000, Cristian Marussi wrote:
> On Tue, Feb 01, 2022 at 01:27:38PM -0500, Michael S. Tsirkin wrote:
> > Looks correct, thanks. Some minor comments below:
> >
>
> Hi Michael,
>
> thanks for the feedback.
>
> > On Tue, Feb 01, 2022 at 05:15:55PM +0000, Cristian Marussi wrote:
> > > Exported API virtqueue_poll() can be used to support polling mode operation
> > > on top of virtio layer if needed; currently the parameter last_used_idx is
> > > the opaque value that needs to be passed to the virtqueue_poll() function
> > > to check if there are new pending used buffers in the queue: such opaque
> > > value would have been previously obtained by a call to the API function
> > > virtqueue_enable_cb_prepare().
> > >
> > > Since such opaque value is indeed containing simply a snapshot in time of
> > > the internal
> >
> > to add: 16 bit
> >
> > > last_used_index (roughly), it is possible that,
> >
> > to add here:
> >
> > if another thread calls virtqueue_add_*()
> > at the same time (which existing drivers don't do,
> > but does not seem to be documented as prohibited anywhere), and
> >
> > > if exactly
> > > 2**16 buffers are marked as used between two successive calls to
> > > virtqueue_poll(), the caller is fooled into thinking that nothing is
> > > pending (ABA problem).
> > > Keep a full fledged internal wraps counter
> >
> > s/full fledged/a 16 bit/
> >
> > since I don't see why a 16 bit counter is "full" but e.g. a 32 bit one is not
> >
>
> ... :D I wanted to stress the fact that this, being a 16-bit counter, has a
> higher rollover than the 1-bit wrap_counter already in use... but indeed
> they are all just counters in the end, it's just the wraparound that changes.
>
> I'll fix.
>
> > > per virtqueue and embed it into
> > > the upper 16bits of the returned opaque value, so that the above scenario
> > > can be detected transparently by virtqueue_poll(): this way each single
> > > possible last_used_idx value is really belonging to a different wrap.
> >
> > Just to add here: the ABA problem can in theory still happen but
> > now that's after 2^32 requests, which seems sufficient in practice.
> >
>
> Sure, I'll fix the commit message as advised above.
>
> > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > Cc: Igor Skalkin <igor.skalkin@opensynergy.com>
> > > Cc: Peter Hilber <peter.hilber@opensynergy.com>
> > > Cc: virtualization@lists.linux-foundation.org
> > > Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
> > > ---
> > > Still no perf data on this, I was wondering what exactly to measure in
> > > terms of perf metrics to evaluate the impact of the rolling vq->wraps
> > > counter.
> > > ---
> > >  drivers/virtio/virtio_ring.c | 51 +++++++++++++++++++++++++++++++++---
> > >  1 file changed, 47 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 00f64f2f8b72..613ec0503509 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -12,6 +12,8 @@
> > >  #include <linux/hrtimer.h>
> > >  #include <linux/dma-mapping.h>
> > >  #include <linux/spinlock.h>
> > > +#include <linux/bits.h>
> > > +#include <linux/bitfield.h>
> > >  #include <xen/xen.h>
> > >
> > >  static bool force_used_validation = false;
> > > @@ -69,6 +71,17 @@ module_param(force_used_validation, bool, 0444);
> > >  #define LAST_ADD_TIME_INVALID(vq)
> > >  #endif
> > >
> > > +#define VRING_IDX_MASK				GENMASK(15, 0)
> > > +#define VRING_GET_IDX(opaque)			\
> > > +	((u16)FIELD_GET(VRING_IDX_MASK, (opaque)))
> > > +
> > > +#define VRING_WRAPS_MASK			GENMASK(31, 16)
> > > +#define VRING_GET_WRAPS(opaque)			\
> > > +	((u16)FIELD_GET(VRING_WRAPS_MASK, (opaque)))
> > > +
> > > +#define VRING_BUILD_OPAQUE(idx, wraps)		\
> > > +	(FIELD_PREP(VRING_WRAPS_MASK, (wraps)) | ((idx) & VRING_IDX_MASK))
> > > +
> >
> > Maybe prefix with VRING_POLL_ since that is the only user.
> >
>
> Will do.
>
> > >  struct vring_desc_state_split {
> > >  	void *data;			/* Data for callback. */
> > >  	struct vring_desc *indir_desc;	/* Indirect descriptor, if any. */
> > > @@ -117,6 +130,8 @@ struct vring_virtqueue {
> > >  	/* Last used index we've seen. */
> > >  	u16 last_used_idx;
> > >
> > > +	u16 wraps;
> > > +
> > >  	/* Hint for event idx: already triggered no need to disable. */
> > >  	bool event_triggered;
> > >
> > > @@ -806,6 +821,8 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
> > >  	ret = vq->split.desc_state[i].data;
> > >  	detach_buf_split(vq, i, ctx);
> > >  	vq->last_used_idx++;
> > > +	if (unlikely(!vq->last_used_idx))
> > > +		vq->wraps++;
> > >  	/* If we expect an interrupt for the next entry, tell host
> > >  	 * by writing event index and flush out the write before
> > >  	 * the read in the next get_buf call. */
> >
> > So most drivers don't call virtqueue_poll.
> > Concerned about the overhead here: another option is
> > a flag that will have to be set whenever a driver
> > wants to use virtqueue_poll.
>
> Do you mean a compile-time flag/Kconfig to just remove the possible
> overhead instructions as a whole when not needed by the driver?
>
> Or do you mean at runtime, since checking the flag every time should be
> less costly than checking the wraps each time AND counting when it
> happens?

The latter.

> > Could you pls do a quick perf test e.g. using tools/virtio/
> > to see what's faster?
>
> Yes, will do, thanks for the hint. I had some compilation issues in
> tools/virtio due to my additions (missing mirrored headers) or to some
> recently added stuff (missing drv_to_virtio & friends for the
> suppressed_used_validation thing)... anyway I fixed those now and I'll
> post the related tools/virtio patches with the next iteration.
>
> Anyway, you mean perf data about vringh_test and virtio_test/vhost,
> right? (ringtest/ excluded because it does not use any API, it is just
> prototyping)

Can be either or both, virtio_test/vhost is a bit easier to use.

> > >
> > > @@ -1508,6 +1525,7 @@ static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
> > >  	if (unlikely(vq->last_used_idx >= vq->packed.vring.num)) {
> > >  		vq->last_used_idx -= vq->packed.vring.num;
> > >  		vq->packed.used_wrap_counter ^= 1;
> > > +		vq->wraps++;
> > >  	}
> > >
> > >  	/*
> > > @@ -1744,6 +1762,7 @@ static struct virtqueue *vring_create_virtqueue_packed(
> > >  	vq->weak_barriers = weak_barriers;
> > >  	vq->broken = false;
> > >  	vq->last_used_idx = 0;
> > > +	vq->wraps = 0;
> > >  	vq->event_triggered = false;
> > >  	vq->num_added = 0;
> > >  	vq->packed_ring = true;
> > > @@ -2092,13 +2111,17 @@ EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
> > >   */
> > >  unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
> > >  {
> > > +	unsigned int last_used_idx;
> > >  	struct vring_virtqueue *vq = to_vvq(_vq);
> > >
> > >  	if (vq->event_triggered)
> > >  		vq->event_triggered = false;
> > >
> > > -	return vq->packed_ring ? virtqueue_enable_cb_prepare_packed(_vq) :
> > > -				 virtqueue_enable_cb_prepare_split(_vq);
> > > +	last_used_idx = vq->packed_ring ?
> > > +			virtqueue_enable_cb_prepare_packed(_vq) :
> > > +			virtqueue_enable_cb_prepare_split(_vq);
> > > +
> > > +	return VRING_BUILD_OPAQUE(last_used_idx, vq->wraps);
> > >  }
> > >  EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
> > >
> > > @@ -2107,6 +2130,21 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
> > >   * @_vq: the struct virtqueue we're talking about.
> > >   * @last_used_idx: virtqueue state (from call to virtqueue_enable_cb_prepare).
> > >   *
> > > + * The provided last_used_idx, as returned by virtqueue_enable_cb_prepare(),
> > > + * is an opaque value representing the queue state and it is built as follows:
> > > + *
> > > + * ---------------------------------------------------------
> > > + * |       vq->wraps        |       vq->last_used_idx      |
> > > + * 31------------------------------------------------------0
> > > + *
> > > + * The MSB 16bits embedding the wraps counter for the underlying virtqueue
> > > + * is stripped out here before reaching into the lower layer helpers.
> > > + *
> > > + * This structure of the opaque value mitigates the scenario in which, when
> > > + * exactly 2**16 messages are marked as used between two successive calls to
> > > + * virtqueue_poll(), the caller is fooled into thinking nothing new has arrived
> > > + * since the pure last_used_idx is exactly the same.
> > > + *
> >
> > Do you want to move this comment to where the macros implementing it
> > are?
> >
>
> Sure, will do.
>
> Thanks,
> Cristian

On Thu, Feb 03, 2022 at 06:32:29AM -0500, Michael S. Tsirkin wrote:
> On Thu, Feb 03, 2022 at 10:51:19AM +0000, Cristian Marussi wrote:
> > On Tue, Feb 01, 2022 at 01:27:38PM -0500, Michael S. Tsirkin wrote:
> > > Looks correct, thanks. Some minor comments below:
> > >
> >
> > Hi Michael,
> >
> > thanks for the feedback.
> >

Hi Michael,

[snip]

> > > > @@ -806,6 +821,8 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
> > > >  	ret = vq->split.desc_state[i].data;
> > > >  	detach_buf_split(vq, i, ctx);
> > > >  	vq->last_used_idx++;
> > > > +	if (unlikely(!vq->last_used_idx))
> > > > +		vq->wraps++;
> > > >  	/* If we expect an interrupt for the next entry, tell host
> > > >  	 * by writing event index and flush out the write before
> > > >  	 * the read in the next get_buf call. */
> > >
> > > So most drivers don't call virtqueue_poll.
> > > Concerned about the overhead here: another option is
> > > a flag that will have to be set whenever a driver
> > > wants to use virtqueue_poll.
> >
> > Do you mean a compile-time flag/Kconfig to just remove the possible
> > overhead instructions as a whole when not needed by the driver?
> >
> > Or do you mean at runtime, since checking the flag every time should be
> > less costly than checking the wraps each time AND counting when it
> > happens?
>
> The latter.
>
> > > Could you pls do a quick perf test e.g. using tools/virtio/
> > > to see what's faster?
> >
> > Yes, will do, thanks for the hint. I had some compilation issues in
> > tools/virtio due to my additions (missing mirrored headers) or to some
> > recently added stuff (missing drv_to_virtio & friends for the
> > suppressed_used_validation thing)... anyway I fixed those now and I'll
> > post the related tools/virtio patches with the next iteration.
> >
> > Anyway, you mean perf data about vringh_test and virtio_test/vhost,
> > right? (ringtest/ excluded because it does not use any API, it is just
> > prototyping)
>
> Can be either or both, virtio_test/vhost is a bit easier to use.
>

After a number of test rounds with tools/virtio/virtio_test, below you
can find the most reliable results I had.

Using the flag as you suggested, as in:

	if (unlikely(vq->use_wrap_counter))
		vq->wraps += !vq->last_used_idx;

seems definitely better, as per the results for virtio_test_flag_V1.

The last run, with virtio_test_flag_V1 --wrap-counters, is the case in
which the test code requests to set the use_wrap_counter flag to true
at start.

Since such a flag is nothing spec related, I added a new EXPORT API
virtqueue_use_wrap_counter() to allow a driver willing to use polling
to ask for wrap counting at probe time.

Since all of this required a few build fixes in tools/virtio, both
before and after my additions, I'm going to post the proposed change in
a new series independent from this SCMI virtio series (and add the call
to virtqueue_use_wrap_counter() to the SCMI virtio driver later).

How does this sound?

Thanks,
Cristian

---

+ cset shield --exec -- perf stat --repeat 25 -- nice -n -20 /root/virtio_test_unpatched

 Performance counter stats for 'nice -n -20 /root/virtio_test_unpatched' (25 runs):

           6100.81 msec task-clock                #    1.002 CPUs utilized            ( +-  0.16% )
                19      context-switches          #    3.126 /sec                     ( +-  2.08% )
                 0      cpu-migrations            #    0.000 /sec
               134      page-faults               #   22.049 /sec                     ( +-  0.03% )
       18249525657      cycles                    #    3.003 GHz                      ( +-  0.07% )
       45583397473      instructions              #    2.52  insn per cycle           ( +-  0.09% )
       14009712668      branches                  #    2.305 G/sec                    ( +-  0.09% )
          10075872      branch-misses             #    0.07% of all branches          ( +-  0.83% )

            6.0908 +- 0.0107 seconds time elapsed  ( +-  0.18% )

+ cset shield --exec -- perf stat --repeat 25 -- nice -n -20 /root/virtio_test_wraps_noflag

 Performance counter stats for 'nice -n -20 /root/virtio_test_wraps_noflag' (25 runs):

           7982.99 msec task-clock                #    0.996 CPUs utilized            ( +-  0.14% )
                16      context-switches          #    1.999 /sec                     ( +-  2.56% )
                 0      cpu-migrations            #    0.000 /sec
               134      page-faults               #   16.744 /sec                     ( +-  0.03% )
       23691074946      cycles                    #    2.960 GHz                      ( +-  0.06% )
       68176350359      instructions              #    2.88  insn per cycle           ( +-  0.09% )
       21037768642      branches                  #    2.629 G/sec                    ( +-  0.09% )
           9083084      branch-misses             #    0.04% of all branches          ( +-  0.74% )

            8.0125 +- 0.0114 seconds time elapsed  ( +-  0.14% )

+ cset shield --exec -- perf stat --repeat 25 -- nice -n -20 /root/virtio_test_flag_V1

 Performance counter stats for 'nice -n -20 /root/virtio_test_flag_V1' (25 runs):

           6182.21 msec task-clock                #    1.007 CPUs utilized            ( +-  0.25% )
                19      context-switches          #    3.104 /sec                     ( +-  1.68% )
                 0      cpu-migrations            #    0.000 /sec
               134      page-faults               #   21.889 /sec                     ( +-  0.03% )
       18142274957      cycles                    #    2.963 GHz                      ( +-  0.13% )
       48973010013      instructions              #    2.71  insn per cycle           ( +-  0.18% )
       15064825126      branches                  #    2.461 G/sec                    ( +-  0.18% )
           8697800      branch-misses             #    0.06% of all branches          ( +-  0.89% )

            6.1382 +- 0.0172 seconds time elapsed  ( +-  0.28% )

+ cset shield --exec -- perf stat --repeat 25 -- nice -n -20 /root/virtio_test_flag_V1 --wrap-counters

 Performance counter stats for 'nice -n -20 /root/virtio_test_flag_V1 --wrap-counters' (25 runs):

           6051.58 msec task-clock                #    0.984 CPUs utilized            ( +-  0.22% )
                21      context-switches          #    3.424 /sec                     ( +-  1.25% )
                 0      cpu-migrations            #    0.000 /sec
               134      page-faults               #   21.846 /sec                     ( +-  0.03% )
       17928356478      cycles                    #    2.923 GHz                      ( +-  0.11% )
       48147192304      instructions              #    2.67  insn per cycle           ( +-  0.14% )
       14808798588      branches                  #    2.414 G/sec                    ( +-  0.15% )
           9108899      branch-misses             #    0.06% of all branches          ( +-  1.22% )

            6.1525 +- 0.0155 seconds time elapsed  ( +-  0.25% )

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 00f64f2f8b72..613ec0503509 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -12,6 +12,8 @@
 #include <linux/hrtimer.h>
 #include <linux/dma-mapping.h>
 #include <linux/spinlock.h>
+#include <linux/bits.h>
+#include <linux/bitfield.h>
 #include <xen/xen.h>
 
 static bool force_used_validation = false;
@@ -69,6 +71,17 @@ module_param(force_used_validation, bool, 0444);
 #define LAST_ADD_TIME_INVALID(vq)
 #endif
 
+#define VRING_IDX_MASK				GENMASK(15, 0)
+#define VRING_GET_IDX(opaque)			\
+	((u16)FIELD_GET(VRING_IDX_MASK, (opaque)))
+
+#define VRING_WRAPS_MASK			GENMASK(31, 16)
+#define VRING_GET_WRAPS(opaque)			\
+	((u16)FIELD_GET(VRING_WRAPS_MASK, (opaque)))
+
+#define VRING_BUILD_OPAQUE(idx, wraps)		\
+	(FIELD_PREP(VRING_WRAPS_MASK, (wraps)) | ((idx) & VRING_IDX_MASK))
+
 struct vring_desc_state_split {
 	void *data;			/* Data for callback. */
 	struct vring_desc *indir_desc;	/* Indirect descriptor, if any. */
@@ -117,6 +130,8 @@ struct vring_virtqueue {
 	/* Last used index we've seen. */
 	u16 last_used_idx;
 
+	u16 wraps;
+
 	/* Hint for event idx: already triggered no need to disable. */
 	bool event_triggered;
 
@@ -806,6 +821,8 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
 	ret = vq->split.desc_state[i].data;
 	detach_buf_split(vq, i, ctx);
 	vq->last_used_idx++;
+	if (unlikely(!vq->last_used_idx))
+		vq->wraps++;
 	/* If we expect an interrupt for the next entry, tell host
 	 * by writing event index and flush out the write before
 	 * the read in the next get_buf call. */
@@ -1508,6 +1525,7 @@ static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
 	if (unlikely(vq->last_used_idx >= vq->packed.vring.num)) {
 		vq->last_used_idx -= vq->packed.vring.num;
 		vq->packed.used_wrap_counter ^= 1;
+		vq->wraps++;
 	}
 
 	/*
@@ -1744,6 +1762,7 @@ static struct virtqueue *vring_create_virtqueue_packed(
 	vq->weak_barriers = weak_barriers;
 	vq->broken = false;
 	vq->last_used_idx = 0;
+	vq->wraps = 0;
 	vq->event_triggered = false;
 	vq->num_added = 0;
 	vq->packed_ring = true;
@@ -2092,13 +2111,17 @@ EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
  */
 unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
 {
+	unsigned int last_used_idx;
 	struct vring_virtqueue *vq = to_vvq(_vq);
 
 	if (vq->event_triggered)
 		vq->event_triggered = false;
 
-	return vq->packed_ring ? virtqueue_enable_cb_prepare_packed(_vq) :
-				 virtqueue_enable_cb_prepare_split(_vq);
+	last_used_idx = vq->packed_ring ?
+			virtqueue_enable_cb_prepare_packed(_vq) :
+			virtqueue_enable_cb_prepare_split(_vq);
+
+	return VRING_BUILD_OPAQUE(last_used_idx, vq->wraps);
 }
 EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
 
@@ -2107,6 +2130,21 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
  * @_vq: the struct virtqueue we're talking about.
  * @last_used_idx: virtqueue state (from call to virtqueue_enable_cb_prepare).
  *
+ * The provided last_used_idx, as returned by virtqueue_enable_cb_prepare(),
+ * is an opaque value representing the queue state and it is built as follows:
+ *
+ * ---------------------------------------------------------
+ * |       vq->wraps        |       vq->last_used_idx      |
+ * 31------------------------------------------------------0
+ *
+ * The MSB 16bits embedding the wraps counter for the underlying virtqueue
+ * is stripped out here before reaching into the lower layer helpers.
+ *
+ * This structure of the opaque value mitigates the scenario in which, when
+ * exactly 2**16 messages are marked as used between two successive calls to
+ * virtqueue_poll(), the caller is fooled into thinking nothing new has arrived
+ * since the pure last_used_idx is exactly the same.
+ *
  * Returns "true" if there are pending used buffers in the queue.
  *
  * This does not need to be serialized.
@@ -2118,9 +2156,13 @@ bool virtqueue_poll(struct virtqueue *_vq, unsigned last_used_idx)
 	if (unlikely(vq->broken))
 		return false;
 
+	if (unlikely(vq->wraps != VRING_GET_WRAPS(last_used_idx)))
+		return true;
+
 	virtio_mb(vq->weak_barriers);
-	return vq->packed_ring ? virtqueue_poll_packed(_vq, last_used_idx) :
-				 virtqueue_poll_split(_vq, last_used_idx);
+	return vq->packed_ring ?
+		virtqueue_poll_packed(_vq, VRING_GET_IDX(last_used_idx)) :
+			virtqueue_poll_split(_vq, VRING_GET_IDX(last_used_idx));
 }
 EXPORT_SYMBOL_GPL(virtqueue_poll);
 
@@ -2245,6 +2287,7 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
 	vq->weak_barriers = weak_barriers;
 	vq->broken = false;
 	vq->last_used_idx = 0;
+	vq->wraps = 0;
 	vq->event_triggered = false;
 	vq->num_added = 0;
 	vq->use_dma_api = vring_use_dma_api(vdev);

Exported API virtqueue_poll() can be used to support polling mode operation
on top of virtio layer if needed; currently the parameter last_used_idx is
the opaque value that needs to be passed to the virtqueue_poll() function
to check if there are new pending used buffers in the queue: such opaque
value would have been previously obtained by a call to the API function
virtqueue_enable_cb_prepare().

Since such opaque value is indeed containing simply a snapshot in time of
the internal last_used_index (roughly), it is possible that, if exactly
2**16 buffers are marked as used between two successive calls to
virtqueue_poll(), the caller is fooled into thinking that nothing is
pending (ABA problem).

Keep a full fledged internal wraps counter per virtqueue and embed it into
the upper 16bits of the returned opaque value, so that the above scenario
can be detected transparently by virtqueue_poll(): this way each single
possible last_used_idx value is really belonging to a different wrap.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Igor Skalkin <igor.skalkin@opensynergy.com>
Cc: Peter Hilber <peter.hilber@opensynergy.com>
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
---
Still no perf data on this, I was wondering what exactly to measure in
terms of perf metrics to evaluate the impact of the rolling vq->wraps
counter.
---
 drivers/virtio/virtio_ring.c | 51 +++++++++++++++++++++++++++++++++---
 1 file changed, 47 insertions(+), 4 deletions(-)