virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)

Message ID 87obmym8jv.fsf@rustcorp.com.au (mailing list archive)
State New, archived

Commit Message

Rusty Russell July 29, 2012, 11:50 p.m. UTC
On Fri, 27 Jul 2012 10:11:26 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 27/07/2012 08:27, Rusty Russell wrote:
> >> > +int virtqueue_add_buf_sg(struct virtqueue *_vq,
> >> > +			 struct scatterlist *sg_out,
> >> > +			 unsigned int out,
> >> > +			 struct scatterlist *sg_in,
> >> > +			 unsigned int in,
> >> > +			 void *data,
> >> > +			 gfp_t gfp)
> > The point of chained scatterlists is they're self-terminated, so the
> > in & out counts should be calculated.
> > 
> > Counting them is not *that* bad, since we're about to read them all
> > anyway.
> > 
> > (Yes, the chained scatterlist stuff is complete crack, but I lost that
> > debate years ago.)
> > 
> > Here's my variant.  Networking, console and block seem OK, at least
> > (ie. it booted!).
> 
> I hate the for loops, even though we are indeed about to read all the
> scatterlists anyway... all they do is lengthen critical sections.

You're preaching to the choir: I agree.  But even more, I hate the
passing of a number and a scatterlist: it makes it doubly ambiguous
whether we should use the number or the terminator.  And I really hate
bad APIs, even more than a bit of icache loss.
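
To make the "self-terminated" point concrete, here is a minimal userspace sketch of counting a chained list by following its terminator instead of trusting a caller-supplied number. The struct sg_link_model type and the model_* helpers are hypothetical stand-ins invented for illustration; the kernel encodes the chain and end flags inside struct scatterlist rather than as separate fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one scatterlist slot (not the kernel layout). */
struct sg_link_model {
	bool is_chain;                  /* slot is a link to another array, not data */
	bool is_last;                   /* slot is the final data entry of the list */
	struct sg_link_model *chain_to; /* target array when is_chain is set */
	unsigned int length;            /* payload length for data entries */
};

/* Step to the next data entry, following a chain link if we land on one. */
static struct sg_link_model *model_sg_next(struct sg_link_model *sg)
{
	if (sg->is_last)
		return NULL;
	sg++;
	if (sg->is_chain) {
		assert(sg->chain_to != NULL);
		sg = sg->chain_to;
	}
	return sg;
}

/* Count data entries by walking until the terminator: one pass, no count argument. */
static unsigned int model_sg_count(struct sg_link_model *sg)
{
	unsigned int n = 0;

	for (; sg; sg = model_sg_next(sg))
		n++;
	return n;
}
```

The walk costs one extra pass over entries that are about to be read anyway, which is the trade-off discussed above: a slightly longer critical section in exchange for an API with a single source of truth for the list length.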

> Also, being the first user of chained scatterlist doesn't exactly give
> me warm fuzzies.

We're far from the first user: they've been in the kernel for well over
7 years.  They were introduced for the block layer, but they tended to
ignore private uses of scatterlists like this one.

> I think it's simpler if we provide an API to add individual buffers to
> the virtqueue, so that you can do multiple virtqueue_add_buf_more
> (whatever) before kicking the virtqueue.  The idea is that I can still
> use indirect buffers for the scatterlists that come from the block layer
> or from an skb, but I will use direct buffers for the request/response
> descriptors.  The direct buffers are always a small number (usually 2),
> so you can balance the effect by making the virtqueue bigger.  And for
> small reads and writes, you save a kmalloc on a very hot path.

This is why I hate chained scatterlists: there's no sane way to tell the
difference between passing a simple array and passing a chain.  We're
mugging the type system.

I think the right way of doing this is a flag.  We could abuse gfp_t,
but that's super-ugly.  Perhaps we should create our own
VQ_ATOMIC/VQ_KERNEL/VQ_DIRECT enum?

Or we could do what we previously suggested, and try to do some
intelligent size heuristic.  I've posted a patch from 3 years ago below.
 
> (BTW, scatterlists will have separate entries for each page; we do not
> need this in virtio buffers.  Collapsing physically-adjacent entries
> will speed up QEMU and will also help avoiding indirect buffers).

Yes, we should do this.  But note that this means an iteration, so we
might as well combine the loops :)

Cheers,
Rusty.

FIXME: remove printk
virtio: use indirect buffers based on demand (heuristic)

virtio_ring uses a ring buffer of descriptors: indirect support allows
a single descriptor to refer to a table of descriptors.  This saves
space in the ring, but requires a kmalloc/kfree.

Rather than try to figure out the right threshold at which to use
indirect buffers, we drop the threshold dynamically when the ring is
under stress.

Note: to stress this, I reduced the ring size to 32 in lguest, and a
1G send reduced the threshold to 9.

Note2: I moved the BUG_ON()s above the indirect test, where they belong
(indirect falls thru on OOM, so the constraints still apply).
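
For readers who want to play with the heuristic outside the kernel, the following is a minimal userspace sketch of the threshold logic. struct ring_model and the model_* functions are hypothetical names invented here; the logic mirrors adjust_threshold() and the try_indirect decision in the patch, but nothing below is kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the ring state (not struct vring_virtqueue). */
struct ring_model {
	unsigned int num;                /* total descriptors in the ring */
	unsigned int num_free;           /* descriptors currently free */
	unsigned int indirect_threshold; /* go indirect above this many parts */
};

/* Halve the threshold when the ring looks stressed; the |= 1 keeps it >= 1. */
static void model_adjust_threshold(struct ring_model *r,
				   unsigned int out, unsigned int in)
{
	if (out) {
		/* Send queue: leave the threshold unless this request fills the ring. */
		if (out + in < r->num_free)
			return;
	} else {
		/* Receive queue: leave the threshold unless the ring is empty. */
		if (r->num_free != r->num)
			return;
	}
	r->indirect_threshold /= 2;
	r->indirect_threshold |= 1;
}

/* Decide whether to attempt an indirect descriptor table for this request. */
static bool model_try_indirect(const struct ring_model *r,
			       unsigned int out, unsigned int in)
{
	assert(out + in != 0);
	if (r->num_free == 0)
		return false;	/* no slot even for the indirect descriptor */
	if (out + in > r->num_free)
		return true;	/* cannot fit directly, indirect is the only hope */
	return out + in > r->indirect_threshold;
}
```

Starting from a threshold of 32, two halvings give 17 and then 9, because the result is forced odd after each division.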
---
 drivers/virtio/virtio_ring.c |   61 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 52 insertions(+), 9 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Paolo Bonzini July 30, 2012, 7:12 a.m. UTC | #1
On 30/07/2012 01:50, Rusty Russell wrote:
>> Also, being the first user of chained scatterlist doesn't exactly give
>> me warm fuzzies.
> 
> We're far from the first user: they've been in the kernel for well over
> 7 years.  They were introduced for the block layer, but they tended to
> ignore private uses of scatterlists like this one.

Yeah, but sg_chain has no users in drivers, only a private one in
lib/scatterlist.c.  The internal API could be changed to something else
and leave virtio-scsi screwed...

> Yes, we should do this.  But note that this means an iteration, so we
> might as well combine the loops :)

I'm really bad at posting pseudo-code, but you can count the number of
physically-contiguous entries at the beginning of the list only.  So if
everything is contiguous, you use a single non-indirect buffer and save
a kmalloc.  If you use indirect buffers, I suspect it's much less
effective to collapse physically-contiguous entries.  More elaborate
heuristics do need a loop, though.
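
A rough sketch of the "count only the leading contiguous run" idea, written as a userspace model. struct seg_model and leading_contiguous() are hypothetical names, and the flat address/length pair is an assumption standing in for whatever the real scatterlist entries resolve to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat view of one entry: physical address plus length. */
struct seg_model {
	uint64_t addr;
	uint32_t len;
};

/*
 * Return how many entries at the start of the list form a single
 * physically contiguous run. The scan stops at the first gap, so it
 * never walks further than it has to.
 */
static size_t leading_contiguous(const struct seg_model *s, size_t n)
{
	size_t i;

	if (n == 0)
		return 0;
	assert(s != NULL);
	for (i = 1; i < n; i++) {
		if (s[i - 1].addr + s[i - 1].len != s[i].addr)
			break;
	}
	return i;
}
```

If the return value equals n, the whole list could be described by one direct descriptor and the kmalloc for an indirect table is avoided; any smaller value means falling back to the usual path.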

Paolo
Boaz Harrosh July 30, 2012, 8:56 a.m. UTC | #2
On 07/30/2012 10:12 AM, Paolo Bonzini wrote:

> On 30/07/2012 01:50, Rusty Russell wrote:
>>> Also, being the first user of chained scatterlist doesn't exactly give
>>> me warm fuzzies.
>>
>> We're far from the first user: they've been in the kernel for well over
>> 7 years.  They were introduced for the block layer, but they tended to
>> ignore private uses of scatterlists like this one.
> 
> Yeah, but sg_chain has no users in drivers, only a private one in
> lib/scatterlist.c.  The internal API could be changed to something else
> and leave virtio-scsi screwed...
> 
>> Yes, we should do this.  But note that this means an iteration, so we
>> might as well combine the loops :)
> 
> I'm really bad at posting pseudo-code, but you can count the number of
> physically-contiguous entries at the beginning of the list only.  So if
> everything is contiguous, you use a single non-indirect buffer and save
> a kmalloc.  If you use indirect buffers, I suspect it's much less
> effective to collapse physically-contiguous entries.  More elaborate
> heuristics do need a loop, though.
> 


[All the below with a grain of salt, from my senile memory]

You must not forget some facts about the scatterlist received here at the LLD.
It has already been DMA mapped and locked by the generic layer.
Which means that the DMA engine has already collapsed physically-contiguous
entries. Those you get here are already unique physically.
(There were bugs in the past, where this was not true, please complain
 if you find them again)

A scatterlist is two different lists taking the same space, but with two
different lengths.
- One list is the PAGE pointers plus offset && length, which is bigger or
  equal to the 2nd list. The end marker corresponds to this list.

  This list is the input into the DMA engine.

- Second list is the physical DMA addresses list. With their physical-lengths.
  Offset is not needed because it is incorporated in the DMA address.

  This list is the output from the DMA engine.

  The reason 2nd list is shorter is because the DMA engine tries to minimize
  the physical scatter-list entries which is usually a limited HW resource.

  This list might follow chains but its end is determined by the received
  sg_count from the DMA engine, not by the end marker.
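
The two-lists-in-one-array picture above can be sketched in userspace as follows. struct sg_mapped_model and both helpers are hypothetical; in the kernel the output side is read through sg_dma_address() and sg_dma_len(), and the mapped count is the value returned by dma_map_sg(). The example assumes that the mapping step merged entries, which depends on the platform.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: one array slot carries both views. */
struct sg_mapped_model {
	/* Input side: page offset and length, valid for all nents slots. */
	unsigned int offset;
	unsigned int length;
	/* Output side: valid only for the first `mapped` slots. */
	uint64_t dma_address;
	uint32_t dma_length;
};

/* Sum the input side, bounded by the number of page entries. */
static uint64_t total_page_bytes(const struct sg_mapped_model *sg,
				 unsigned int nents)
{
	uint64_t sum = 0;
	unsigned int i;

	for (i = 0; i < nents; i++)
		sum += sg[i].length;
	return sum;
}

/* Sum the output side, bounded by the mapped count, not by the end marker. */
static uint64_t total_dma_bytes(const struct sg_mapped_model *sg,
				unsigned int mapped, unsigned int nents)
{
	uint64_t sum = 0;
	unsigned int i;

	assert(mapped <= nents);
	for (i = 0; i < mapped; i++)
		sum += sg[i].dma_length;
	return sum;
}
```

Both walks cover the same bytes; only the number of entries differs, which is why the two counts must not be mixed up.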

At the time my opinion, and I think Rusty agreed, was that the scatterlist
should be split in two. The input page-ptr list is just the BIO, and the
output of the DMA-engine should just be the physical part of the sg_list,
as a separate parameter. But all this was buried under too many APIs, and
the noise was too strong for any single brave soul.

So I'd just blindly trust the sg_count returned from the DMA engine; it is
already optimized. I THINK

> Paolo


Boaz

Patch

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -63,6 +63,8 @@  struct vring_virtqueue
 
 	/* Host supports indirect buffers */
 	bool indirect;
+	/* Threshold before we go indirect. */
+	unsigned int indirect_threshold;
 
 	/* Number of free buffers */
 	unsigned int num_free;
@@ -139,6 +141,32 @@  static int vring_add_indirect(struct vri
 	return head;
 }
 
+static void adjust_threshold(struct vring_virtqueue *vq,
+			     unsigned int out, unsigned int in)
+{
+	/* There are really two species of virtqueue, and it matters here.
+	 * If there are no output parts, it's a "normally full" receive queue,
+	 * otherwise it's a "normally empty" send queue. */
+	if (out) {
+		/* Leave threshold unless we're full. */
+		if (out + in < vq->num_free)
+			return;
+	} else {
+		/* Leave threshold unless we're empty. */
+		if (vq->num_free != vq->vring.num)
+			return;
+	}
+
+	/* Never drop threshold below 1 */
+	vq->indirect_threshold /= 2;
+	vq->indirect_threshold |= 1;
+
+	printk("%s %s: indirect threshold %u (%u+%u vs %u)\n",
+	       dev_name(&vq->vq.vdev->dev),
+	       vq->vq.name, vq->indirect_threshold,
+	       out, in, vq->num_free);
+}
+
 static int vring_add_buf(struct virtqueue *_vq,
 			 struct scatterlist sg[],
 			 unsigned int out,
@@ -151,19 +179,33 @@  static int vring_add_buf(struct virtqueu
 	START_USE(vq);
 
 	BUG_ON(data == NULL);
-
-	/* If the host supports indirect descriptor tables, and we have multiple
-	 * buffers, then go indirect. FIXME: tune this threshold */
-	if (vq->indirect && (out + in) > 1 && vq->num_free) {
-		head = vring_add_indirect(vq, sg, out, in);
-		if (head != vq->vring.num)
-			goto add_head;
-	}
-
 	BUG_ON(out + in > vq->vring.num);
 	BUG_ON(out + in == 0);
 
 	vq->addbuf_total++;
+
+	/* If the host supports indirect descriptor tables, consider it. */
+	if (vq->indirect) {
+		bool try_indirect;
+
+		/* We tweak the threshold automatically. */
+		adjust_threshold(vq, out, in);
+
+		/* If we can't fit any at all, fall through. */
+		if (vq->num_free == 0)
+			try_indirect = false;
+		else if (out + in > vq->num_free)
+			try_indirect = true;
+		else
+			try_indirect = (out + in > vq->indirect_threshold);
+
+		if (try_indirect) {
+			head = vring_add_indirect(vq, sg, out, in);
+			if (head != vq->vring.num)
+				goto add_head;
+		}
+	}
+
 	if (vq->num_free < out + in) {
 		pr_debug("Can't add buf len %i - avail = %i\n",
 			 out + in, vq->num_free);
@@ -401,6 +443,7 @@  struct virtqueue *vring_new_virtqueue(un
 		= vq->other_notify = 0;
 
 	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
+	vq->indirect_threshold = num;
 
 	/* No callback?  Tell other side not to bother us. */
 	if (!callback)