
[v2,3/9,RFC] virtio_ring: Embed a wrap counter in opaque poll index value

Message ID 20220201171601.53316-4-cristian.marussi@arm.com (mailing list archive)
State New, archived
Series Add SCMI Virtio & Clock atomic support

Commit Message

Cristian Marussi Feb. 1, 2022, 5:15 p.m. UTC
Exported API virtqueue_poll() can be used to support polling mode operation
on top of virtio layer if needed; currently the parameter last_used_idx is
the opaque value that needs to be passed to the virtqueue_poll() function
to check if there are new pending used buffers in the queue: such opaque
value would have been previously obtained by a call to the API function
virtqueue_enable_cb_prepare().

Since such opaque value is indeed containing simply a snapshot in time of
the internal last_used_index (roughly), it is possible that, if exactly
2**16 buffers are marked as used between two successive calls to
virtqueue_poll(), the caller is fooled into thinking that nothing is
pending (ABA problem).

Keep a full fledged internal wraps counter per virtqueue and embed it into
the upper 16bits of the returned opaque value, so that the above scenario
can be detected transparently by virtqueue_poll(): this way each single
possible last_used_idx value is really belonging to a different wrap.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Igor Skalkin <igor.skalkin@opensynergy.com>
Cc: Peter Hilber <peter.hilber@opensynergy.com>
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
---
Still no perf data on this, I was wondering what exactly to measure in
term of perf metrics to evaluate the impact of the rolling vq->wraps
counter.
---
 drivers/virtio/virtio_ring.c | 51 +++++++++++++++++++++++++++++++++---
 1 file changed, 47 insertions(+), 4 deletions(-)
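
For illustration, a minimal user-space sketch of the opaque value layout described in the commit message. The poll_* and toy_* names are hypothetical stand-ins, not the kernel macros, and the model tracks a single index instead of the separate driver-side and device-side indices, so it only approximates what virtqueue_poll() checks:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space equivalents of the masks used in the patch. */
#define POLL_IDX_MASK		0x0000ffffu
#define POLL_WRAPS_SHIFT	16

/* Pack the 16-bit wraps counter above the 16-bit index. */
static uint32_t poll_build_opaque(uint16_t idx, uint16_t wraps)
{
	return ((uint32_t)wraps << POLL_WRAPS_SHIFT) | idx;
}

static uint16_t poll_get_idx(uint32_t opaque)
{
	return (uint16_t)(opaque & POLL_IDX_MASK);
}

static uint16_t poll_get_wraps(uint32_t opaque)
{
	return (uint16_t)(opaque >> POLL_WRAPS_SHIFT);
}

/* Toy queue state: one 16-bit index plus a 16-bit wraps counter. */
struct toy_vq {
	uint16_t last_used_idx;
	uint16_t wraps;
};

/* Consume n entries, bumping wraps each time the index rolls over to 0. */
static void toy_consume(struct toy_vq *vq, uint32_t n)
{
	assert(vq);
	while (n--) {
		vq->last_used_idx++;
		if (!vq->last_used_idx)
			vq->wraps++;
	}
}

/* Return true if the queue state differs from the snapshot. */
static bool toy_changed(const struct toy_vq *vq, uint32_t opaque)
{
	assert(vq);
	return vq->wraps != poll_get_wraps(opaque) ||
	       vq->last_used_idx != poll_get_idx(opaque);
}
```

Taking a snapshot with poll_build_opaque(), consuming exactly 65536 entries with toy_consume() and then calling toy_changed() reports a change: the wraps half of the snapshot differs even though the 16-bit index matches again. In this model the combined 32-bit value repeats only after 2^32 entries.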

Comments

Michael S. Tsirkin Feb. 1, 2022, 6:27 p.m. UTC | #1
Looks correct, thanks. Some minor comments below:

On Tue, Feb 01, 2022 at 05:15:55PM +0000, Cristian Marussi wrote:
> Exported API virtqueue_poll() can be used to support polling mode operation
> on top of virtio layer if needed; currently the parameter last_used_idx is
> the opaque value that needs to be passed to the virtqueue_poll() function
> to check if there are new pending used buffers in the queue: such opaque
> value would have been previously obtained by a call to the API function
> virtqueue_enable_cb_prepare().
> 
> Since such opaque value is indeed containing simply a snapshot in time of
> the internal

to add: 16 bit

> last_used_index (roughly), it is possible that,

to add here: 

if another thread calls virtqueue_add_*()
at the same time (which existing drivers don't do,
but does not seem to be documented as prohibited anywhere), and

> if exactly
> 2**16 buffers are marked as used between two successive calls to
> virtqueue_poll(), the caller is fooled into thinking that nothing is
> pending (ABA problem).
> Keep a full fledged internal wraps counter

s/full fledged/a 16 bit/

since I don't see why is a 16 bit counter full but not e.g. a 32 bit one

> per virtqueue and embed it into
> the upper 16bits of the returned opaque value, so that the above scenario
> can be detected transparently by virtqueue_poll(): this way each single
> possible last_used_idx value is really belonging to a different wrap.

Just to add here: the ABA problem can in theory still happen but
now that's after 2^32 requests, which seems sufficient in practice.

> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Igor Skalkin <igor.skalkin@opensynergy.com>
> Cc: Peter Hilber <peter.hilber@opensynergy.com>
> Cc: virtualization@lists.linux-foundation.org
> Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
> ---
> Still no perf data on this, I was wondering what exactly to measure in
> term of perf metrics to evaluate the impact of the rolling vq->wraps
> counter.
> ---
>  drivers/virtio/virtio_ring.c | 51 +++++++++++++++++++++++++++++++++---
>  1 file changed, 47 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 00f64f2f8b72..613ec0503509 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -12,6 +12,8 @@
>  #include <linux/hrtimer.h>
>  #include <linux/dma-mapping.h>
>  #include <linux/spinlock.h>
> +#include <linux/bits.h>
> +#include <linux/bitfield.h>
>  #include <xen/xen.h>
>  
>  static bool force_used_validation = false;
> @@ -69,6 +71,17 @@ module_param(force_used_validation, bool, 0444);
>  #define LAST_ADD_TIME_INVALID(vq)
>  #endif
>  
> +#define VRING_IDX_MASK					GENMASK(15, 0)
> +#define VRING_GET_IDX(opaque)				\
> +	((u16)FIELD_GET(VRING_IDX_MASK, (opaque)))
> +
> +#define VRING_WRAPS_MASK				GENMASK(31, 16)
> +#define VRING_GET_WRAPS(opaque)				\
> +	((u16)FIELD_GET(VRING_WRAPS_MASK, (opaque)))
> +
> +#define VRING_BUILD_OPAQUE(idx, wraps)			\
> +	(FIELD_PREP(VRING_WRAPS_MASK, (wraps)) | ((idx) & VRING_IDX_MASK))
> +

Maybe prefix with VRING_POLL_  since that is the only user.


>  struct vring_desc_state_split {
>  	void *data;			/* Data for callback. */
>  	struct vring_desc *indir_desc;	/* Indirect descriptor, if any. */
> @@ -117,6 +130,8 @@ struct vring_virtqueue {
>  	/* Last used index we've seen. */
>  	u16 last_used_idx;
>  
> +	u16 wraps;
> +
>  	/* Hint for event idx: already triggered no need to disable. */
>  	bool event_triggered;
>  
> @@ -806,6 +821,8 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
>  	ret = vq->split.desc_state[i].data;
>  	detach_buf_split(vq, i, ctx);
>  	vq->last_used_idx++;
> +	if (unlikely(!vq->last_used_idx))
> +		vq->wraps++;
>  	/* If we expect an interrupt for the next entry, tell host
>  	 * by writing event index and flush out the write before
>  	 * the read in the next get_buf call. */

So most drivers don't call virtqueue_poll.
Concerned about the overhead here: another option is
with a flag that will have to be set whenever a driver
wants to use virtqueue_poll.
Could you pls do a quick perf test e.g. using tools/virtio/
to see what's faster?



> @@ -1508,6 +1525,7 @@ static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
>  	if (unlikely(vq->last_used_idx >= vq->packed.vring.num)) {
>  		vq->last_used_idx -= vq->packed.vring.num;
>  		vq->packed.used_wrap_counter ^= 1;
> +		vq->wraps++;
>  	}
>  
>  	/*
> @@ -1744,6 +1762,7 @@ static struct virtqueue *vring_create_virtqueue_packed(
>  	vq->weak_barriers = weak_barriers;
>  	vq->broken = false;
>  	vq->last_used_idx = 0;
> +	vq->wraps = 0;
>  	vq->event_triggered = false;
>  	vq->num_added = 0;
>  	vq->packed_ring = true;
> @@ -2092,13 +2111,17 @@ EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
>   */
>  unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
>  {
> +	unsigned int last_used_idx;
>  	struct vring_virtqueue *vq = to_vvq(_vq);
>  
>  	if (vq->event_triggered)
>  		vq->event_triggered = false;
>  
> -	return vq->packed_ring ? virtqueue_enable_cb_prepare_packed(_vq) :
> -				 virtqueue_enable_cb_prepare_split(_vq);
> +	last_used_idx = vq->packed_ring ?
> +			virtqueue_enable_cb_prepare_packed(_vq) :
> +			virtqueue_enable_cb_prepare_split(_vq);
> +
> +	return VRING_BUILD_OPAQUE(last_used_idx, vq->wraps);
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
>  
> @@ -2107,6 +2130,21 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
>   * @_vq: the struct virtqueue we're talking about.
>   * @last_used_idx: virtqueue state (from call to virtqueue_enable_cb_prepare).
>   *
> + * The provided last_used_idx, as returned by virtqueue_enable_cb_prepare(),
> + * is an opaque value representing the queue state and it is built as follows:
> + *
> + *	---------------------------------------------------------
> + *	|	vq->wraps	|	vq->last_used_idx	|
> + *	31------------------------------------------------------0
> + *
> + * The MSB 16bits embedding the wraps counter for the underlying virtqueue
> + * is stripped out here before reaching into the lower layer helpers.
> + *
> + * This structure of the opaque value mitigates the scenario in which, when
> + * exactly 2**16 messages are marked as used between two successive calls to
> + * virtqueue_poll(), the caller is fooled into thinking nothing new has arrived
> + * since the pure last_used_idx is exactly the same.
> + *

Do you want to move this comment to where the macros implementing it
are?

>   * Returns "true" if there are pending used buffers in the queue.
>   *
>   * This does not need to be serialized.
> @@ -2118,9 +2156,13 @@ bool virtqueue_poll(struct virtqueue *_vq, unsigned last_used_idx)
>  	if (unlikely(vq->broken))
>  		return false;
>  
> +	if (unlikely(vq->wraps != VRING_GET_WRAPS(last_used_idx)))
> +		return true;
> +
>  	virtio_mb(vq->weak_barriers);
> -	return vq->packed_ring ? virtqueue_poll_packed(_vq, last_used_idx) :
> -				 virtqueue_poll_split(_vq, last_used_idx);
> +	return vq->packed_ring ?
> +		virtqueue_poll_packed(_vq, VRING_GET_IDX(last_used_idx)) :
> +			virtqueue_poll_split(_vq, VRING_GET_IDX(last_used_idx));
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_poll);
>  
> @@ -2245,6 +2287,7 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
>  	vq->weak_barriers = weak_barriers;
>  	vq->broken = false;
>  	vq->last_used_idx = 0;
> +	vq->wraps = 0;
>  	vq->event_triggered = false;
>  	vq->num_added = 0;
>  	vq->use_dma_api = vring_use_dma_api(vdev);
> -- 
> 2.17.1

Cristian Marussi Feb. 3, 2022, 10:51 a.m. UTC | #2
On Tue, Feb 01, 2022 at 01:27:38PM -0500, Michael S. Tsirkin wrote:
> Looks correct, thanks. Some minor comments below:
> 

Hi Michael,

thanks for the feedback.

> On Tue, Feb 01, 2022 at 05:15:55PM +0000, Cristian Marussi wrote:
> > Exported API virtqueue_poll() can be used to support polling mode operation
> > on top of virtio layer if needed; currently the parameter last_used_idx is
> > the opaque value that needs to be passed to the virtqueue_poll() function
> > to check if there are new pending used buffers in the queue: such opaque
> > value would have been previously obtained by a call to the API function
> > virtqueue_enable_cb_prepare().
> > 
> > Since such opaque value is indeed containing simply a snapshot in time of
> > the internal
> 
> to add: 16 bit
> 
> > last_used_index (roughly), it is possible that,
> 
> to add here: 
> 
> if another thread calls virtqueue_add_*()
> at the same time (which existing drivers don't do,
> but does not seem to be documented as prohibited anywhere), and
> 
> > if exactly
> > 2**16 buffers are marked as used between two successive calls to
> > virtqueue_poll(), the caller is fooled into thinking that nothing is
> > pending (ABA problem).
> > Keep a full fledged internal wraps counter
> 
> s/full fledged/a 16 bit/
> 
> since I don't see why is a 16 bit counter full but not e.g. a 32 bit one
> 
.. :D I wanted to stress the fact that this, being a 16-bit counter, has a
higher rollover than the 1-bit wrap_counter already in use...but indeed
they are all just counters in the end, it's just the wraparound that changes...

I'll fix.

> > per virtqueue and embed it into
> > the upper 16bits of the returned opaque value, so that the above scenario
> > can be detected transparently by virtqueue_poll(): this way each single
> > possible last_used_idx value is really belonging to a different wrap.
> 
> Just to add here: the ABA problem can in theory still happen but
> now that's after 2^32 requests, which seems sufficient in practice.
> 

Sure, I'll fix the commit message as above advised.

> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > Cc: Igor Skalkin <igor.skalkin@opensynergy.com>
> > Cc: Peter Hilber <peter.hilber@opensynergy.com>
> > Cc: virtualization@lists.linux-foundation.org
> > Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
> > ---
> > Still no perf data on this, I was wondering what exactly to measure in
> > term of perf metrics to evaluate the impact of the rolling vq->wraps
> > counter.
> > ---
> >  drivers/virtio/virtio_ring.c | 51 +++++++++++++++++++++++++++++++++---
> >  1 file changed, 47 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 00f64f2f8b72..613ec0503509 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -12,6 +12,8 @@
> >  #include <linux/hrtimer.h>
> >  #include <linux/dma-mapping.h>
> >  #include <linux/spinlock.h>
> > +#include <linux/bits.h>
> > +#include <linux/bitfield.h>
> >  #include <xen/xen.h>
> >  
> >  static bool force_used_validation = false;
> > @@ -69,6 +71,17 @@ module_param(force_used_validation, bool, 0444);
> >  #define LAST_ADD_TIME_INVALID(vq)
> >  #endif
> >  
> > +#define VRING_IDX_MASK					GENMASK(15, 0)
> > +#define VRING_GET_IDX(opaque)				\
> > +	((u16)FIELD_GET(VRING_IDX_MASK, (opaque)))
> > +
> > +#define VRING_WRAPS_MASK				GENMASK(31, 16)
> > +#define VRING_GET_WRAPS(opaque)				\
> > +	((u16)FIELD_GET(VRING_WRAPS_MASK, (opaque)))
> > +
> > +#define VRING_BUILD_OPAQUE(idx, wraps)			\
> > +	(FIELD_PREP(VRING_WRAPS_MASK, (wraps)) | ((idx) & VRING_IDX_MASK))
> > +
> 
> Maybe prefix with VRING_POLL_  since that is the only user.
> 

I'll do.

> 
> >  struct vring_desc_state_split {
> >  	void *data;			/* Data for callback. */
> >  	struct vring_desc *indir_desc;	/* Indirect descriptor, if any. */
> > @@ -117,6 +130,8 @@ struct vring_virtqueue {
> >  	/* Last used index we've seen. */
> >  	u16 last_used_idx;
> >  
> > +	u16 wraps;
> > +
> >  	/* Hint for event idx: already triggered no need to disable. */
> >  	bool event_triggered;
> >  
> > @@ -806,6 +821,8 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
> >  	ret = vq->split.desc_state[i].data;
> >  	detach_buf_split(vq, i, ctx);
> >  	vq->last_used_idx++;
> > +	if (unlikely(!vq->last_used_idx))
> > +		vq->wraps++;
> >  	/* If we expect an interrupt for the next entry, tell host
> >  	 * by writing event index and flush out the write before
> >  	 * the read in the next get_buf call. */
> 
> So most drivers don't call virtqueue_poll.
> Concerned about the overhead here: another option is
> with a flag that will have to be set whenever a driver
> wants to use virtqueue_poll.

Do you mean a compile time flag/Kconfig to just remove the possible
overhead instructions as a whole when not needed by the driver ?

Or do you mean at runtime, since checking the flag every time should be
less costly than checking the wraps each time AND counting when it
happens?

> Could you pls do a quick perf test e.g. using tools/virtio/
> to see what's faster?

Yes, I'll do that, thanks for the hint. I had some compilation issues in
tools/virtio due to my additions (missing mirrored headers) and to some
recently added stuff (missing drv_to_virtio & friends for the
suppressed_used_validation thing)...anyway I fixed those now and I'll
post the related tools/virtio patches with the next iteration.

Anyway, you mean perf data about vringh_test and virtio_test/vhost,
right? (ringtest/ excluded because it does not use any API, it is just
prototyping)

> 
> 
> 
> > @@ -1508,6 +1525,7 @@ static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
> >  	if (unlikely(vq->last_used_idx >= vq->packed.vring.num)) {
> >  		vq->last_used_idx -= vq->packed.vring.num;
> >  		vq->packed.used_wrap_counter ^= 1;
> > +		vq->wraps++;
> >  	}
> >  
> >  	/*
> > @@ -1744,6 +1762,7 @@ static struct virtqueue *vring_create_virtqueue_packed(
> >  	vq->weak_barriers = weak_barriers;
> >  	vq->broken = false;
> >  	vq->last_used_idx = 0;
> > +	vq->wraps = 0;
> >  	vq->event_triggered = false;
> >  	vq->num_added = 0;
> >  	vq->packed_ring = true;
> > @@ -2092,13 +2111,17 @@ EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
> >   */
> >  unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
> >  {
> > +	unsigned int last_used_idx;
> >  	struct vring_virtqueue *vq = to_vvq(_vq);
> >  
> >  	if (vq->event_triggered)
> >  		vq->event_triggered = false;
> >  
> > -	return vq->packed_ring ? virtqueue_enable_cb_prepare_packed(_vq) :
> > -				 virtqueue_enable_cb_prepare_split(_vq);
> > +	last_used_idx = vq->packed_ring ?
> > +			virtqueue_enable_cb_prepare_packed(_vq) :
> > +			virtqueue_enable_cb_prepare_split(_vq);
> > +
> > +	return VRING_BUILD_OPAQUE(last_used_idx, vq->wraps);
> >  }
> >  EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
> >  
> > @@ -2107,6 +2130,21 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
> >   * @_vq: the struct virtqueue we're talking about.
> >   * @last_used_idx: virtqueue state (from call to virtqueue_enable_cb_prepare).
> >   *
> > + * The provided last_used_idx, as returned by virtqueue_enable_cb_prepare(),
> > + * is an opaque value representing the queue state and it is built as follows:
> > + *
> > + *	---------------------------------------------------------
> > + *	|	vq->wraps	|	vq->last_used_idx	|
> > + *	31------------------------------------------------------0
> > + *
> > + * The MSB 16bits embedding the wraps counter for the underlying virtqueue
> > + * is stripped out here before reaching into the lower layer helpers.
> > + *
> > + * This structure of the opaque value mitigates the scenario in which, when
> > + * exactly 2**16 messages are marked as used between two successive calls to
> > + * virtqueue_poll(), the caller is fooled into thinking nothing new has arrived
> > + * since the pure last_used_idx is exactly the same.
> > + *
> 
> Do you want to move this comment to where the macros implementing it
> are?
> 

Sure, I'll do.

Thanks,
Cristian

Michael S. Tsirkin Feb. 3, 2022, 11:32 a.m. UTC | #3
On Thu, Feb 03, 2022 at 10:51:19AM +0000, Cristian Marussi wrote:
> On Tue, Feb 01, 2022 at 01:27:38PM -0500, Michael S. Tsirkin wrote:
> > Looks correct, thanks. Some minor comments below:
> > 
> 
> Hi Michael,
> 
> thanks for the feedback.
> 
> > On Tue, Feb 01, 2022 at 05:15:55PM +0000, Cristian Marussi wrote:
> > > Exported API virtqueue_poll() can be used to support polling mode operation
> > > on top of virtio layer if needed; currently the parameter last_used_idx is
> > > the opaque value that needs to be passed to the virtqueue_poll() function
> > > to check if there are new pending used buffers in the queue: such opaque
> > > value would have been previously obtained by a call to the API function
> > > virtqueue_enable_cb_prepare().
> > > 
> > > Since such opaque value is indeed containing simply a snapshot in time of
> > > the internal
> > 
> > to add: 16 bit
> > 
> > > last_used_index (roughly), it is possible that,
> > 
> > to add here: 
> > 
> > if another thread calls virtqueue_add_*()
> > at the same time (which existing drivers don't do,
> > but does not seem to be documented as prohibited anywhere), and
> > 
> > > if exactly
> > > 2**16 buffers are marked as used between two successive calls to
> > > virtqueue_poll(), the caller is fooled into thinking that nothing is
> > > pending (ABA problem).
> > > Keep a full fledged internal wraps counter
> > 
> > s/full fledged/a 16 bit/
> > 
> > since I don't see why is a 16 bit counter full but not e.g. a 32 bit one
> > 
> .. :D I wanted to stress the fact that this, being a 16-bit counter, has a
> higher rollover than the 1-bit wrap_counter already in use...but indeed
> they are all just counters in the end, it's just the wraparound that changes...
> 
> I'll fix.
> 
> > > per virtqueue and embed it into
> > > the upper 16bits of the returned opaque value, so that the above scenario
> > > can be detected transparently by virtqueue_poll(): this way each single
> > > possible last_used_idx value is really belonging to a different wrap.
> > 
> > Just to add here: the ABA problem can in theory still happen but
> > now that's after 2^32 requests, which seems sufficient in practice.
> > 
> 
> Sure, I'll fix the commit message as above advised.
> 
> > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > Cc: Igor Skalkin <igor.skalkin@opensynergy.com>
> > > Cc: Peter Hilber <peter.hilber@opensynergy.com>
> > > Cc: virtualization@lists.linux-foundation.org
> > > Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
> > > ---
> > > Still no perf data on this, I was wondering what exactly to measure in
> > > term of perf metrics to evaluate the impact of the rolling vq->wraps
> > > counter.
> > > ---
> > >  drivers/virtio/virtio_ring.c | 51 +++++++++++++++++++++++++++++++++---
> > >  1 file changed, 47 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 00f64f2f8b72..613ec0503509 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -12,6 +12,8 @@
> > >  #include <linux/hrtimer.h>
> > >  #include <linux/dma-mapping.h>
> > >  #include <linux/spinlock.h>
> > > +#include <linux/bits.h>
> > > +#include <linux/bitfield.h>
> > >  #include <xen/xen.h>
> > >  
> > >  static bool force_used_validation = false;
> > > @@ -69,6 +71,17 @@ module_param(force_used_validation, bool, 0444);
> > >  #define LAST_ADD_TIME_INVALID(vq)
> > >  #endif
> > >  
> > > +#define VRING_IDX_MASK					GENMASK(15, 0)
> > > +#define VRING_GET_IDX(opaque)				\
> > > +	((u16)FIELD_GET(VRING_IDX_MASK, (opaque)))
> > > +
> > > +#define VRING_WRAPS_MASK				GENMASK(31, 16)
> > > +#define VRING_GET_WRAPS(opaque)				\
> > > +	((u16)FIELD_GET(VRING_WRAPS_MASK, (opaque)))
> > > +
> > > +#define VRING_BUILD_OPAQUE(idx, wraps)			\
> > > +	(FIELD_PREP(VRING_WRAPS_MASK, (wraps)) | ((idx) & VRING_IDX_MASK))
> > > +
> > 
> > Maybe prefix with VRING_POLL_  since that is the only user.
> > 
> 
> I'll do.
> 
> > 
> > >  struct vring_desc_state_split {
> > >  	void *data;			/* Data for callback. */
> > >  	struct vring_desc *indir_desc;	/* Indirect descriptor, if any. */
> > > @@ -117,6 +130,8 @@ struct vring_virtqueue {
> > >  	/* Last used index we've seen. */
> > >  	u16 last_used_idx;
> > >  
> > > +	u16 wraps;
> > > +
> > >  	/* Hint for event idx: already triggered no need to disable. */
> > >  	bool event_triggered;
> > >  
> > > @@ -806,6 +821,8 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
> > >  	ret = vq->split.desc_state[i].data;
> > >  	detach_buf_split(vq, i, ctx);
> > >  	vq->last_used_idx++;
> > > +	if (unlikely(!vq->last_used_idx))
> > > +		vq->wraps++;
> > >  	/* If we expect an interrupt for the next entry, tell host
> > >  	 * by writing event index and flush out the write before
> > >  	 * the read in the next get_buf call. */
> > 
> > So most drivers don't call virtqueue_poll.
> > Concerned about the overhead here: another option is
> > with a flag that will have to be set whenever a driver
> > wants to use virtqueue_poll.
> 
> Do you mean a compile time flag/Kconfig to just remove the possible
> overhead instructions as a whole when not needed by the driver ?
> 
> Or do you mean at runtime, since checking the flag every time should be
> less costly than checking the wraps each time AND counting when it
> happens?

The latter.

> > Could you pls do a quick perf test e.g. using tools/virtio/
> > to see what's faster?
> 
> Yes, I'll do that, thanks for the hint. I had some compilation issues in
> tools/virtio due to my additions (missing mirrored headers) and to some
> recently added stuff (missing drv_to_virtio & friends for the
> suppressed_used_validation thing)...anyway I fixed those now and I'll
> post the related tools/virtio patches with the next iteration.
> 
> Anyway, you mean perf data about vringh_test and virtio_test/vhost,
> right? (ringtest/ excluded because it does not use any API, it is just
> prototyping)

can be either or both, virtio_test/vhost is a bit easier to use.

> > 
> > 
> > 
> > > @@ -1508,6 +1525,7 @@ static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
> > >  	if (unlikely(vq->last_used_idx >= vq->packed.vring.num)) {
> > >  		vq->last_used_idx -= vq->packed.vring.num;
> > >  		vq->packed.used_wrap_counter ^= 1;
> > > +		vq->wraps++;
> > >  	}
> > >  
> > >  	/*
> > > @@ -1744,6 +1762,7 @@ static struct virtqueue *vring_create_virtqueue_packed(
> > >  	vq->weak_barriers = weak_barriers;
> > >  	vq->broken = false;
> > >  	vq->last_used_idx = 0;
> > > +	vq->wraps = 0;
> > >  	vq->event_triggered = false;
> > >  	vq->num_added = 0;
> > >  	vq->packed_ring = true;
> > > @@ -2092,13 +2111,17 @@ EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
> > >   */
> > >  unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
> > >  {
> > > +	unsigned int last_used_idx;
> > >  	struct vring_virtqueue *vq = to_vvq(_vq);
> > >  
> > >  	if (vq->event_triggered)
> > >  		vq->event_triggered = false;
> > >  
> > > -	return vq->packed_ring ? virtqueue_enable_cb_prepare_packed(_vq) :
> > > -				 virtqueue_enable_cb_prepare_split(_vq);
> > > +	last_used_idx = vq->packed_ring ?
> > > +			virtqueue_enable_cb_prepare_packed(_vq) :
> > > +			virtqueue_enable_cb_prepare_split(_vq);
> > > +
> > > +	return VRING_BUILD_OPAQUE(last_used_idx, vq->wraps);
> > >  }
> > >  EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
> > >  
> > > @@ -2107,6 +2130,21 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
> > >   * @_vq: the struct virtqueue we're talking about.
> > >   * @last_used_idx: virtqueue state (from call to virtqueue_enable_cb_prepare).
> > >   *
> > > + * The provided last_used_idx, as returned by virtqueue_enable_cb_prepare(),
> > > + * is an opaque value representing the queue state and it is built as follows:
> > > + *
> > > + *	---------------------------------------------------------
> > > + *	|	vq->wraps	|	vq->last_used_idx	|
> > > + *	31------------------------------------------------------0
> > > + *
> > > + * The MSB 16bits embedding the wraps counter for the underlying virtqueue
> > > + * is stripped out here before reaching into the lower layer helpers.
> > > + *
> > > + * This structure of the opaque value mitigates the scenario in which, when
> > > + * exactly 2**16 messages are marked as used between two successive calls to
> > > + * virtqueue_poll(), the caller is fooled into thinking nothing new has arrived
> > > + * since the pure last_used_idx is exactly the same.
> > > + *
> > 
> > Do you want to move this comment to where the macros implementing it
> > are?
> > 
> 
> Sure, I'll do.
> 
> Thanks,
> Cristian

Cristian Marussi Feb. 7, 2022, 6:52 p.m. UTC | #4
On Thu, Feb 03, 2022 at 06:32:29AM -0500, Michael S. Tsirkin wrote:
> On Thu, Feb 03, 2022 at 10:51:19AM +0000, Cristian Marussi wrote:
> > On Tue, Feb 01, 2022 at 01:27:38PM -0500, Michael S. Tsirkin wrote:
> > > Looks correct, thanks. Some minor comments below:
> > > 
> > 
> > Hi Michael,
> > 
> > thanks for the feedback.
> > 

Hi Michael,

 
[snip]

> > > > @@ -806,6 +821,8 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
> > > >  	ret = vq->split.desc_state[i].data;
> > > >  	detach_buf_split(vq, i, ctx);
> > > >  	vq->last_used_idx++;
> > > > +	if (unlikely(!vq->last_used_idx))
> > > > +		vq->wraps++;
> > > >  	/* If we expect an interrupt for the next entry, tell host
> > > >  	 * by writing event index and flush out the write before
> > > >  	 * the read in the next get_buf call. */
> > > 
> > > So most drivers don't call virtqueue_poll.
> > > Concerned about the overhead here: another option is
> > > with a flag that will have to be set whenever a driver
> > > wants to use virtqueue_poll.
> > 
> > Do you mean a compile time flag/Kconfig to just remove the possible
> > overhead instructions as a whole when not needed by the driver ?
> > 
> > Or do you mean at runtime, since checking the flag every time should be
> > less costly than checking the wraps each time AND counting when it
> > happens?
> 
> The latter.
> 
> > > Could you pls do a quick perf test e.g. using tools/virtio/
> > > to see what's faster?
> > 
> > Yes, I'll do that, thanks for the hint. I had some compilation issues in
> > tools/virtio due to my additions (missing mirrored headers) and to some
> > recently added stuff (missing drv_to_virtio & friends for the
> > suppressed_used_validation thing)...anyway I fixed those now and I'll
> > post the related tools/virtio patches with the next iteration.
> > 
> > Anyway, you mean perf data about vringh_test and virtio_test/vhost,
> > right? (ringtest/ excluded because it does not use any API, it is just
> > prototyping)
> 
> can be either or both, virtio_test/vhost is a bit easier to use.
> 

After a number of test rounds with tools/virtio/virtio_test, below you can find
the most reliable results I got.

Using the flag as you suggested as in:

if (unlikely(vq->use_wrap_counter))
	vq->wraps += !vq->last_used_idx;


seems definitely better as per the result in virtio_test_flag_V1.

The last run, with virtio_test_flag_V1 --wrap-counters, is the case in which
the test code requests that the use_wrap_counter flag be set to true at start.

Since such a flag is not spec related, I added a new exported API

virtqueue_use_wrap_counter()

to allow a driver willing to use polling to ask for wrap counting at
probe time.
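
A minimal user-space sketch of the flag-gated counting described above. The struct and the toy_* helpers are hypothetical; in particular toy_use_wrap_counter() only stands in for the proposed virtqueue_use_wrap_counter(), whose real signature is not shown in this thread:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy queue state mirroring the fields discussed in the thread. */
struct toy_vq {
	uint16_t last_used_idx;
	uint16_t wraps;
	bool use_wrap_counter;	/* opt-in, false by default */
};

/* Hypothetical stand-in for the proposed opt-in call made at probe time. */
static void toy_use_wrap_counter(struct toy_vq *vq)
{
	assert(vq);
	vq->use_wrap_counter = true;
}

/* Consume one used entry; count wraps only when the driver opted in. */
static void toy_get_buf(struct toy_vq *vq)
{
	assert(vq);
	vq->last_used_idx++;
	if (vq->use_wrap_counter)
		vq->wraps += !vq->last_used_idx;
}
```

A queue that never calls toy_use_wrap_counter() only pays for the flag test on each consumed buffer and its wraps field stays at 0; a queue that opts in sees wraps increase by one for every 65536 calls to toy_get_buf().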

Since all of this required a few build fixes in tools/virtio, both before
and after my additions, I'm going to post the proposed change in a new series
independent from this SCMI virtio series (and later add the call to
virtqueue_use_wrap_counter() to the SCMI virtio driver).

How does this sound ?

Thanks,
Cristian

---
+ cset shield --exec -- perf stat --repeat 25 -- nice -n -20 /root/virtio_test_unpatched
 Performance counter stats for 'nice -n -20 /root/virtio_test_unpatched' (25 runs):

           6100.81 msec task-clock                #    1.002 CPUs utilized            ( +-  0.16% )
                19      context-switches          #    3.126 /sec                     ( +-  2.08% )
                 0      cpu-migrations            #    0.000 /sec                   
               134      page-faults               #   22.049 /sec                     ( +-  0.03% )
       18249525657      cycles                    #    3.003 GHz                      ( +-  0.07% )
       45583397473      instructions              #    2.52  insn per cycle           ( +-  0.09% )
       14009712668      branches                  #    2.305 G/sec                    ( +-  0.09% )
          10075872      branch-misses             #    0.07% of all branches          ( +-  0.83% )

            6.0908 +- 0.0107 seconds time elapsed  ( +-  0.18% )


+ cset shield --exec -- perf stat --repeat 25 -- nice -n -20 /root/virtio_test_wraps_noflag
 Performance counter stats for 'nice -n -20 /root/virtio_test_wraps_noflag' (25 runs):

           7982.99 msec task-clock                #    0.996 CPUs utilized            ( +-  0.14% )
                16      context-switches          #    1.999 /sec                     ( +-  2.56% )
                 0      cpu-migrations            #    0.000 /sec                   
               134      page-faults               #   16.744 /sec                     ( +-  0.03% )
       23691074946      cycles                    #    2.960 GHz                      ( +-  0.06% )
       68176350359      instructions              #    2.88  insn per cycle           ( +-  0.09% )
       21037768642      branches                  #    2.629 G/sec                    ( +-  0.09% )
           9083084      branch-misses             #    0.04% of all branches          ( +-  0.74% )

            8.0125 +- 0.0114 seconds time elapsed  ( +-  0.14% )


+ cset shield --exec -- perf stat --repeat 25 -- nice -n -20 /root/virtio_test_flag_V1
 Performance counter stats for 'nice -n -20 /root/virtio_test_flag_V1' (25 runs):

           6182.21 msec task-clock                #    1.007 CPUs utilized            ( +-  0.25% )
                19      context-switches          #    3.104 /sec                     ( +-  1.68% )
                 0      cpu-migrations            #    0.000 /sec                   
               134      page-faults               #   21.889 /sec                     ( +-  0.03% )
       18142274957      cycles                    #    2.963 GHz                      ( +-  0.13% )
       48973010013      instructions              #    2.71  insn per cycle           ( +-  0.18% )
       15064825126      branches                  #    2.461 G/sec                    ( +-  0.18% )
           8697800      branch-misses             #    0.06% of all branches          ( +-  0.89% )

            6.1382 +- 0.0172 seconds time elapsed  ( +-  0.28% )

+ cset shield --exec -- perf stat --repeat 25 -- nice -n -20 /root/virtio_test_flag_V1 --wrap-counters
 Performance counter stats for 'nice -n -20 /root/virtio_test_flag_V1 --wrap-counters' (25 runs):

           6051.58 msec task-clock                #    0.984 CPUs utilized            ( +-  0.22% )
                21      context-switches          #    3.424 /sec                     ( +-  1.25% )
                 0      cpu-migrations            #    0.000 /sec                   
               134      page-faults               #   21.846 /sec                     ( +-  0.03% )
       17928356478      cycles                    #    2.923 GHz                      ( +-  0.11% )
       48147192304      instructions              #    2.67  insn per cycle           ( +-  0.14% )
       14808798588      branches                  #    2.414 G/sec                    ( +-  0.15% )
           9108899      branch-misses             #    0.06% of all branches          ( +-  1.22% )

            6.1525 +- 0.0155 seconds time elapsed  ( +-  0.25% )

Patch

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 00f64f2f8b72..613ec0503509 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -12,6 +12,8 @@ 
 #include <linux/hrtimer.h>
 #include <linux/dma-mapping.h>
 #include <linux/spinlock.h>
+#include <linux/bits.h>
+#include <linux/bitfield.h>
 #include <xen/xen.h>
 
 static bool force_used_validation = false;
@@ -69,6 +71,17 @@  module_param(force_used_validation, bool, 0444);
 #define LAST_ADD_TIME_INVALID(vq)
 #endif
 
+#define VRING_IDX_MASK					GENMASK(15, 0)
+#define VRING_GET_IDX(opaque)				\
+	((u16)FIELD_GET(VRING_IDX_MASK, (opaque)))
+
+#define VRING_WRAPS_MASK				GENMASK(31, 16)
+#define VRING_GET_WRAPS(opaque)				\
+	((u16)FIELD_GET(VRING_WRAPS_MASK, (opaque)))
+
+#define VRING_BUILD_OPAQUE(idx, wraps)			\
+	(FIELD_PREP(VRING_WRAPS_MASK, (wraps)) | ((idx) & VRING_IDX_MASK))
+
 struct vring_desc_state_split {
 	void *data;			/* Data for callback. */
 	struct vring_desc *indir_desc;	/* Indirect descriptor, if any. */
@@ -117,6 +130,8 @@  struct vring_virtqueue {
 	/* Last used index we've seen. */
 	u16 last_used_idx;
 
+	u16 wraps;
+
 	/* Hint for event idx: already triggered no need to disable. */
 	bool event_triggered;
 
@@ -806,6 +821,8 @@  static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
 	ret = vq->split.desc_state[i].data;
 	detach_buf_split(vq, i, ctx);
 	vq->last_used_idx++;
+	if (unlikely(!vq->last_used_idx))
+		vq->wraps++;
 	/* If we expect an interrupt for the next entry, tell host
 	 * by writing event index and flush out the write before
 	 * the read in the next get_buf call. */
@@ -1508,6 +1525,7 @@  static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
 	if (unlikely(vq->last_used_idx >= vq->packed.vring.num)) {
 		vq->last_used_idx -= vq->packed.vring.num;
 		vq->packed.used_wrap_counter ^= 1;
+		vq->wraps++;
 	}
 
 	/*
@@ -1744,6 +1762,7 @@  static struct virtqueue *vring_create_virtqueue_packed(
 	vq->weak_barriers = weak_barriers;
 	vq->broken = false;
 	vq->last_used_idx = 0;
+	vq->wraps = 0;
 	vq->event_triggered = false;
 	vq->num_added = 0;
 	vq->packed_ring = true;
@@ -2092,13 +2111,17 @@  EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
  */
 unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
 {
+	unsigned int last_used_idx;
 	struct vring_virtqueue *vq = to_vvq(_vq);
 
 	if (vq->event_triggered)
 		vq->event_triggered = false;
 
-	return vq->packed_ring ? virtqueue_enable_cb_prepare_packed(_vq) :
-				 virtqueue_enable_cb_prepare_split(_vq);
+	last_used_idx = vq->packed_ring ?
+			virtqueue_enable_cb_prepare_packed(_vq) :
+			virtqueue_enable_cb_prepare_split(_vq);
+
+	return VRING_BUILD_OPAQUE(last_used_idx, vq->wraps);
 }
 EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
 
@@ -2107,6 +2130,21 @@  EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
  * @_vq: the struct virtqueue we're talking about.
  * @last_used_idx: virtqueue state (from call to virtqueue_enable_cb_prepare).
  *
+ * The provided last_used_idx, as returned by virtqueue_enable_cb_prepare(),
+ * is an opaque value representing the queue state and it is built as follows:
+ *
+ *	---------------------------------------------------------
+ *	|	vq->wraps	|	vq->last_used_idx	|
+ *	31------------------------------------------------------0
+ *
+ * The MSB 16bits embedding the wraps counter for the underlying virtqueue
+ * is stripped out here before reaching into the lower layer helpers.
+ *
+ * This structure of the opaque value mitigates the scenario in which, when
+ * exactly 2**16 messages are marked as used between two successive calls to
+ * virtqueue_poll(), the caller is fooled into thinking nothing new has arrived
+ * since the pure last_used_idx is exactly the same.
+ *
  * Returns "true" if there are pending used buffers in the queue.
  *
  * This does not need to be serialized.
@@ -2118,9 +2156,13 @@  bool virtqueue_poll(struct virtqueue *_vq, unsigned last_used_idx)
 	if (unlikely(vq->broken))
 		return false;
 
+	if (unlikely(vq->wraps != VRING_GET_WRAPS(last_used_idx)))
+		return true;
+
 	virtio_mb(vq->weak_barriers);
-	return vq->packed_ring ? virtqueue_poll_packed(_vq, last_used_idx) :
-				 virtqueue_poll_split(_vq, last_used_idx);
+	return vq->packed_ring ?
+		virtqueue_poll_packed(_vq, VRING_GET_IDX(last_used_idx)) :
+			virtqueue_poll_split(_vq, VRING_GET_IDX(last_used_idx));
 }
 EXPORT_SYMBOL_GPL(virtqueue_poll);
 
@@ -2245,6 +2287,7 @@  struct virtqueue *__vring_new_virtqueue(unsigned int index,
 	vq->weak_barriers = weak_barriers;
 	vq->broken = false;
 	vq->last_used_idx = 0;
+	vq->wraps = 0;
 	vq->event_triggered = false;
 	vq->num_added = 0;
 	vq->use_dma_api = vring_use_dma_api(vdev);
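
For readers who want to try the opaque-value scheme outside the kernel, the
encoding and the wrap check from the patch above can be sketched in plain
userspace C. This is only an illustrative model, not the kernel code: the
demo_* names and struct demo_vq are hypothetical, the masks are open-coded
instead of using GENMASK()/FIELD_GET()/FIELD_PREP(), and a single counter
stands in for queue progress (the real virtqueue_poll_split() compares
against the device-written used index instead).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for VRING_IDX_MASK / VRING_WRAPS_MASK. */
#define DEMO_IDX_MASK		0x0000ffffu
#define DEMO_WRAPS_SHIFT	16

/* Simplified model: one 16-bit index plus the rolling wraps counter. */
struct demo_vq {
	uint16_t last_used_idx;	/* rolls over at 2**16 */
	uint16_t wraps;		/* bumped on every rollover */
};

/* Pack wraps into bits 31..16 and the index into bits 15..0. */
static inline uint32_t demo_build_opaque(uint16_t idx, uint16_t wraps)
{
	return ((uint32_t)wraps << DEMO_WRAPS_SHIFT) | idx;
}

static inline uint16_t demo_get_idx(uint32_t opaque)
{
	return (uint16_t)(opaque & DEMO_IDX_MASK);
}

static inline uint16_t demo_get_wraps(uint32_t opaque)
{
	return (uint16_t)(opaque >> DEMO_WRAPS_SHIFT);
}

/* Snapshot of the queue state, as virtqueue_enable_cb_prepare() returns. */
static inline uint32_t demo_snapshot(const struct demo_vq *vq)
{
	return demo_build_opaque(vq->last_used_idx, vq->wraps);
}

/* Advance the index by n buffers, counting rollovers as the patch does. */
static inline void demo_consume(struct demo_vq *vq, uint32_t n)
{
	while (n--) {
		vq->last_used_idx++;
		if (!vq->last_used_idx)
			vq->wraps++;
	}
}

/* Index-only comparison: the behaviour without the wraps counter. */
static inline bool demo_poll_idx_only(const struct demo_vq *vq, uint32_t opaque)
{
	return demo_get_idx(opaque) != vq->last_used_idx;
}

/* Wrap-aware poll: a wraps mismatch alone means something is pending. */
static inline bool demo_poll(const struct demo_vq *vq, uint32_t opaque)
{
	if (vq->wraps != demo_get_wraps(opaque))
		return true;
	return demo_poll_idx_only(vq, opaque);
}

/* ABA scenario: exactly 2**16 buffers between snapshot and poll. */
static inline void demo_self_check(void)
{
	struct demo_vq vq = { .last_used_idx = 0, .wraps = 0 };
	uint32_t snap = demo_snapshot(&vq);

	demo_consume(&vq, 1u << 16);

	assert(!demo_poll_idx_only(&vq, snap));	/* fooled: index is equal */
	assert(demo_poll(&vq, snap));		/* detected via wraps */
}
```

Calling demo_self_check() from a test program exercises the case described
in the commit message: the index-only comparison sees an unchanged index,
while the wrap-aware one reports pending buffers. Note that a 16-bit wraps
counter only moves the ambiguity out to exactly 2**32 buffers between two
polls in this model; it does not remove it entirely.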