diff mbox

[V3,3/3] virtio-blk: Add bio-based IO path for virtio-blk

Message ID 1342168731-11797-4-git-send-email-asias@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Asias He July 13, 2012, 8:38 a.m. UTC
This patch introduces bio-based IO path for virtio-blk.

Compared to request-based IO path, bio-based IO path uses driver
provided ->make_request_fn() method to bypasses the IO scheduler. It
handles the bio to device directly without allocating a request in block
layer. This reduces the IO path in guest kernel to achieve high IOPS
and lower latency. The downside is that guest can not use the IO
scheduler to merge and sort requests. However, this is not a big problem
if the backend disk in host side uses faster disk device.

When the bio-based IO path is not enabled, virtio-blk still uses the
original request-based IO path, no performance difference is observed.

Performance evaluation:
-----------------------------
1) Fio test is performed in a 8 vcpu guest with ramdisk based guest using
kvm tool.

Short version:
 With bio-based IO path, sequential read/write, random read/write
 IOPS boost         : 28%, 24%, 21%, 16%
 Latency improvement: 32%, 17%, 21%, 16%

Long version:
 With bio-based IO path:
  seq-read  : io=2048.0MB, bw=116996KB/s, iops=233991 , runt= 17925msec
  seq-write : io=2048.0MB, bw=100829KB/s, iops=201658 , runt= 20799msec
  rand-read : io=3095.7MB, bw=112134KB/s, iops=224268 , runt= 28269msec
  rand-write: io=3095.7MB, bw=96198KB/s,  iops=192396 , runt= 32952msec
    clat (usec): min=0 , max=2631.6K, avg=58716.99, stdev=191377.30
    clat (usec): min=0 , max=1753.2K, avg=66423.25, stdev=81774.35
    clat (usec): min=0 , max=2915.5K, avg=61685.70, stdev=120598.39
    clat (usec): min=0 , max=1933.4K, avg=76935.12, stdev=96603.45
  cpu : usr=74.08%, sys=703.84%, ctx=29661403, majf=21354, minf=22460954
  cpu : usr=70.92%, sys=702.81%, ctx=77219828, majf=13980, minf=27713137
  cpu : usr=72.23%, sys=695.37%, ctx=88081059, majf=18475, minf=28177648
  cpu : usr=69.69%, sys=654.13%, ctx=145476035, majf=15867, minf=26176375
 With request-based IO path:
  seq-read  : io=2048.0MB, bw=91074KB/s, iops=182147 , runt= 23027msec
  seq-write : io=2048.0MB, bw=80725KB/s, iops=161449 , runt= 25979msec
  rand-read : io=3095.7MB, bw=92106KB/s, iops=184211 , runt= 34416msec
  rand-write: io=3095.7MB, bw=82815KB/s, iops=165630 , runt= 38277msec
    clat (usec): min=0 , max=1932.4K, avg=77824.17, stdev=170339.49
    clat (usec): min=0 , max=2510.2K, avg=78023.96, stdev=146949.15
    clat (usec): min=0 , max=3037.2K, avg=74746.53, stdev=128498.27
    clat (usec): min=0 , max=1363.4K, avg=89830.75, stdev=114279.68
  cpu : usr=53.28%, sys=724.19%, ctx=37988895, majf=17531, minf=23577622
  cpu : usr=49.03%, sys=633.20%, ctx=205935380, majf=18197, minf=27288959
  cpu : usr=55.78%, sys=722.40%, ctx=101525058, majf=19273, minf=28067082
  cpu : usr=56.55%, sys=690.83%, ctx=228205022, majf=18039, minf=26551985

2) Fio test is performed in a 8 vcpu guest with Fusion-IO based guest using
kvm tool.

Short version:
 With bio-based IO path, sequential read/write, random read/write
 IOPS boost         : 11%, 11%, 13%, 10%
 Latency improvement: 10%, 10%, 12%, 10%
Long Version:
 With bio-based IO path:
  read : io=2048.0MB, bw=58920KB/s, iops=117840 , runt= 35593msec
  write: io=2048.0MB, bw=64308KB/s, iops=128616 , runt= 32611msec
  read : io=3095.7MB, bw=59633KB/s, iops=119266 , runt= 53157msec
  write: io=3095.7MB, bw=62993KB/s, iops=125985 , runt= 50322msec
    clat (usec): min=0 , max=1284.3K, avg=128109.01, stdev=71513.29
    clat (usec): min=94 , max=962339 , avg=116832.95, stdev=65836.80
    clat (usec): min=0 , max=1846.6K, avg=128509.99, stdev=89575.07
    clat (usec): min=0 , max=2256.4K, avg=121361.84, stdev=82747.25
  cpu : usr=56.79%, sys=421.70%, ctx=147335118, majf=21080, minf=19852517
  cpu : usr=61.81%, sys=455.53%, ctx=143269950, majf=16027, minf=24800604
  cpu : usr=63.10%, sys=455.38%, ctx=178373538, majf=16958, minf=24822612
  cpu : usr=62.04%, sys=453.58%, ctx=226902362, majf=16089, minf=23278105
 With request-based IO path:
  read : io=2048.0MB, bw=52896KB/s, iops=105791 , runt= 39647msec
  write: io=2048.0MB, bw=57856KB/s, iops=115711 , runt= 36248msec
  read : io=3095.7MB, bw=52387KB/s, iops=104773 , runt= 60510msec
  write: io=3095.7MB, bw=57310KB/s, iops=114619 , runt= 55312msec
    clat (usec): min=0 , max=1532.6K, avg=142085.62, stdev=109196.84
    clat (usec): min=0 , max=1487.4K, avg=129110.71, stdev=114973.64
    clat (usec): min=0 , max=1388.6K, avg=145049.22, stdev=107232.55
    clat (usec): min=0 , max=1465.9K, avg=133585.67, stdev=110322.95
  cpu : usr=44.08%, sys=590.71%, ctx=451812322, majf=14841, minf=17648641
  cpu : usr=48.73%, sys=610.78%, ctx=418953997, majf=22164, minf=26850689
  cpu : usr=45.58%, sys=581.16%, ctx=714079216, majf=21497, minf=22558223
  cpu : usr=48.40%, sys=599.65%, ctx=656089423, majf=16393, minf=23824409

How to use:
-----------------------------
Add 'virtio_blk.use_bio=1' to kernel cmdline or 'modprobe virtio_blk
use_bio=1' to enable ->make_request_fn() based I/O path.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Asias He <asias@redhat.com>
---
 drivers/block/virtio_blk.c |  203 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 163 insertions(+), 40 deletions(-)

Comments

Rusty Russell July 27, 2012, 12:33 a.m. UTC | #1
On Fri, 13 Jul 2012 16:38:51 +0800, Asias He <asias@redhat.com> wrote:
> Add 'virtio_blk.use_bio=1' to kernel cmdline or 'modprobe virtio_blk
> use_bio=1' to enable ->make_request_fn() based I/O path.

This patch conflicts with Paolo's Bonzini's 'virtio-blk: allow toggling
host cache between writeback and writethrough' which is also queued (see
linux-next).

I'm not sure what the correct behavior for bio & cacheflush is, if any.

But as to the patch itself: it's a hack.

1) Leaving the guest's admin to turn on the switch is a terrible choice.
2) The block layer should stop merging and sorting when a device is
   fast, not the driver.
3) I pointed out that slow disks have low IOPS, so why is this
   conditional?  Sure, more guest exits, but it's still a small number
   for a slow device.
4) The only case where we want merging is on a slow device when the host
   isn't doing it.

Now, despite this, I'm prepared to commit it.  But in my mind it's a
hack: we should aim for use_bio to be based on a feature bit fed from
the host, and use the module parameter only if we want to override it.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rusty Russell July 27, 2012, 6:29 a.m. UTC | #2
On Fri, 13 Jul 2012 16:38:51 +0800, Asias He <asias@redhat.com> wrote:
> This patch introduces bio-based IO path for virtio-blk.

Acked-by: Rusty Russell <rusty@rustcorp.com.au>

I just hope we can do better than a module option in future.

Thanks,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Asias He July 28, 2012, 2:38 a.m. UTC | #3
On 07/27/2012 08:33 AM, Rusty Russell wrote:
> On Fri, 13 Jul 2012 16:38:51 +0800, Asias He <asias@redhat.com> wrote:
>> Add 'virtio_blk.use_bio=1' to kernel cmdline or 'modprobe virtio_blk
>> use_bio=1' to enable ->make_request_fn() based I/O path.
>
> This patch conflicts with Paolo's Bonzini's 'virtio-blk: allow toggling
> host cache between writeback and writethrough' which is also queued (see
> linux-next).

Rebased against Paolo's patch in V4.

> I'm not sure what the correct behavior for bio & cacheflush is, if any.

REQ_FLUSH is not supported in the bio path.

> But as to the patch itself: it's a hack.
>
> 1) Leaving the guest's admin to turn on the switch is a terrible choice.
> 2) The block layer should stop merging and sorting when a device is
>     fast, not the driver.
> 3) I pointed out that slow disks have low IOPS, so why is this
>     conditional?  Sure, more guest exits, but it's still a small number
>     for a slow device.
> 4) The only case where we want merging is on a slow device when the host
>     isn't doing it.
>
> Now, despite this, I'm prepared to commit it.  But in my mind it's a
> hack: we should aim for use_bio to be based on a feature bit fed from
> the host, and use the module parameter only if we want to override it.

OK. A feature bit from host sound like a choice but a switch is also 
needed on host side. And for other OS, e.g. Windows, the bio thing does 
not apply at all.

Anyway, I have to admit that adding a module parameter here is not the 
best choice. Let's think more.
Paolo Bonzini July 28, 2012, 6:42 a.m. UTC | #4
> > I'm not sure what the correct behavior for bio & cacheflush is, if
> > any.
> 
> REQ_FLUSH is not supported in the bio path.

Ouch, that's correct:

@@ -414,7 +529,7 @@ static void virtblk_update_cache_mode(struct virtio_device *vdev)
 	u8 writeback = virtblk_get_cache_mode(vdev);
 	struct virtio_blk *vblk = vdev->priv;
 
-	if (writeback)
+	if (writeback && !use_bio)
 		blk_queue_flush(vblk->disk->queue, REQ_FLUSH);
 	else
 		blk_queue_flush(vblk->disk->queue, 0);

then it is not safe against power losses.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin July 29, 2012, 12:59 p.m. UTC | #5
On Sat, Jul 28, 2012 at 10:38:41AM +0800, Asias He wrote:
> On 07/27/2012 08:33 AM, Rusty Russell wrote:
> >On Fri, 13 Jul 2012 16:38:51 +0800, Asias He <asias@redhat.com> wrote:
> >>Add 'virtio_blk.use_bio=1' to kernel cmdline or 'modprobe virtio_blk
> >>use_bio=1' to enable ->make_request_fn() based I/O path.
> >
> >This patch conflicts with Paolo's Bonzini's 'virtio-blk: allow toggling
> >host cache between writeback and writethrough' which is also queued (see
> >linux-next).
> 
> Rebased against Paolo's patch in V4.
> 
> >I'm not sure what the correct behavior for bio & cacheflush is, if any.
> 
> REQ_FLUSH is not supported in the bio path.
> 
> >But as to the patch itself: it's a hack.
> >
> >1) Leaving the guest's admin to turn on the switch is a terrible choice.
> >2) The block layer should stop merging and sorting when a device is
> >    fast, not the driver.
> >3) I pointed out that slow disks have low IOPS, so why is this
> >    conditional?  Sure, more guest exits, but it's still a small number
> >    for a slow device.
> >4) The only case where we want merging is on a slow device when the host
> >    isn't doing it.
> >
> >Now, despite this, I'm prepared to commit it.  But in my mind it's a
> >hack: we should aim for use_bio to be based on a feature bit fed from
> >the host, and use the module parameter only if we want to override it.
> 
> OK. A feature bit from host sound like a choice but a switch is also
> needed on host side.

qemu automatically gives you the ability to control
any feature bit.

> And for other OS, e.g. Windows, the bio thing
> does not apply at all.

Let's try to define when it's a good idea. Is it a hint to guest that
backend handles small accesses efficiently so ok to disable batching?

> Anyway, I have to admit that adding a module parameter here is not
> the best choice. Let's think more.
> 
> -- 
> Asias
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Asias He July 30, 2012, 12:26 a.m. UTC | #6
On 07/29/2012 08:59 PM, Michael S. Tsirkin wrote:
> On Sat, Jul 28, 2012 at 10:38:41AM +0800, Asias He wrote:
>> On 07/27/2012 08:33 AM, Rusty Russell wrote:
>>> On Fri, 13 Jul 2012 16:38:51 +0800, Asias He <asias@redhat.com> wrote:
>>>> Add 'virtio_blk.use_bio=1' to kernel cmdline or 'modprobe virtio_blk
>>>> use_bio=1' to enable ->make_request_fn() based I/O path.
>>>
>>> This patch conflicts with Paolo's Bonzini's 'virtio-blk: allow toggling
>>> host cache between writeback and writethrough' which is also queued (see
>>> linux-next).
>>
>> Rebased against Paolo's patch in V4.
>>
>>> I'm not sure what the correct behavior for bio & cacheflush is, if any.
>>
>> REQ_FLUSH is not supported in the bio path.
>>
>>> But as to the patch itself: it's a hack.
>>>
>>> 1) Leaving the guest's admin to turn on the switch is a terrible choice.
>>> 2) The block layer should stop merging and sorting when a device is
>>>     fast, not the driver.
>>> 3) I pointed out that slow disks have low IOPS, so why is this
>>>     conditional?  Sure, more guest exits, but it's still a small number
>>>     for a slow device.
>>> 4) The only case where we want merging is on a slow device when the host
>>>     isn't doing it.
>>>
>>> Now, despite this, I'm prepared to commit it.  But in my mind it's a
>>> hack: we should aim for use_bio to be based on a feature bit fed from
>>> the host, and use the module parameter only if we want to override it.
>>
>> OK. A feature bit from host sound like a choice but a switch is also
>> needed on host side.
>
> qemu automatically gives you the ability to control
> any feature bit.

Automatically?

>> And for other OS, e.g. Windows, the bio thing
>> does not apply at all.
>
> Let's try to define when it's a good idea. Is it a hint to guest that
> backend handles small accesses efficiently so ok to disable batching?

Yes. It's also a hint for latency reduction.

>> Anyway, I have to admit that adding a module parameter here is not
>> the best choice. Let's think more.
>>
>> --
>> Asias
Asias He July 30, 2012, 4:43 a.m. UTC | #7
On 07/28/2012 02:42 PM, Paolo Bonzini wrote:
>>> I'm not sure what the correct behavior for bio & cacheflush is, if
>>> any.
>>
>> REQ_FLUSH is not supported in the bio path.
>
> Ouch, that's correct:
>
> @@ -414,7 +529,7 @@ static void virtblk_update_cache_mode(struct virtio_device *vdev)
>   	u8 writeback = virtblk_get_cache_mode(vdev);
>   	struct virtio_blk *vblk = vdev->priv;
>
> -	if (writeback)
> +	if (writeback && !use_bio)
>   		blk_queue_flush(vblk->disk->queue, REQ_FLUSH);
>   	else
>   		blk_queue_flush(vblk->disk->queue, 0);
>
> then it is not safe against power losses.

Yes. Something like this:

   qemu -drive file=foo.img,cache=writeback/unsafe

is not safe against power losses also?

I think we can add REQ_FLUSH & REQ_FUA support to bio path and that 
deserves another patch.
Paolo Bonzini July 30, 2012, 7:31 a.m. UTC | #8
Il 30/07/2012 06:43, Asias He ha scritto:
>>
> 
> Yes. Something like this:
> 
>   qemu -drive file=foo.img,cache=writeback/unsafe
> 
> is not safe against power losses also?

cache=writeback and cache=none are safe, cache=unsafe isn't.

> I think we can add REQ_FLUSH & REQ_FUA support to bio path and that
> deserves another patch.

You only need to add REQ_FLUSH support.  The virtio-blk protocol does
not support REQ_FUA, because there's no easy way to do it in userspace.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig July 30, 2012, 1:43 p.m. UTC | #9
On Mon, Jul 30, 2012 at 12:43:12PM +0800, Asias He wrote:
> I think we can add REQ_FLUSH & REQ_FUA support to bio path and that 
> deserves another patch.

Adding it is a requirement for merging the code.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig July 30, 2012, 1:44 p.m. UTC | #10
On Mon, Jul 30, 2012 at 09:31:06AM +0200, Paolo Bonzini wrote:
> You only need to add REQ_FLUSH support.  The virtio-blk protocol does
> not support REQ_FUA, because there's no easy way to do it in userspace.

A bio-based driver needs to handle both REQ_FLUSH and REQ_FUA as it does
not get the sequencing of REQ_FUA into REQ_FLUSH that request based drivers
can request.  To what the REQ_FUA request gets translated is a different story.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Asias He July 31, 2012, 8:43 a.m. UTC | #11
On 07/30/2012 09:43 PM, Christoph Hellwig wrote:
> On Mon, Jul 30, 2012 at 12:43:12PM +0800, Asias He wrote:
>> I think we can add REQ_FLUSH & REQ_FUA support to bio path and that
>> deserves another patch.
>
> Adding it is a requirement for merging the code.
>

OK. Will add that.
Asias He Aug. 2, 2012, 6:28 a.m. UTC | #12
On 07/30/2012 09:44 PM, Christoph Hellwig wrote:
> On Mon, Jul 30, 2012 at 09:31:06AM +0200, Paolo Bonzini wrote:
>> You only need to add REQ_FLUSH support.  The virtio-blk protocol does
>> not support REQ_FUA, because there's no easy way to do it in userspace.
>
> A bio-based driver needs to handle both REQ_FLUSH and REQ_FUA as it does
> not get the sequencing of REQ_FUA into REQ_FLUSH that request based drivers
> can request.  To what the REQ_FUA request gets translated is a different story.

I just sent out V5 to support both REQ_FLUSH AND REQ_FUA.
Thanks, Christoph!
diff mbox

Patch

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 774c31d..e137190 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -14,6 +14,9 @@ 
 
 #define PART_BITS 4
 
+static bool use_bio;
+module_param(use_bio, bool, S_IRUGO);
+
 static int major;
 static DEFINE_IDA(vd_index_ida);
 
@@ -23,6 +26,7 @@  struct virtio_blk
 {
 	struct virtio_device *vdev;
 	struct virtqueue *vq;
+	wait_queue_head_t queue_wait;
 
 	/* The disk structure for the kernel. */
 	struct gendisk *disk;
@@ -51,53 +55,87 @@  struct virtio_blk
 struct virtblk_req
 {
 	struct request *req;
+	struct bio *bio;
 	struct virtio_blk_outhdr out_hdr;
 	struct virtio_scsi_inhdr in_hdr;
 	u8 status;
+	struct scatterlist sg[];
 };
 
-static void blk_done(struct virtqueue *vq)
+static inline int virtblk_result(struct virtblk_req *vbr)
+{
+	switch (vbr->status) {
+	case VIRTIO_BLK_S_OK:
+		return 0;
+	case VIRTIO_BLK_S_UNSUPP:
+		return -ENOTTY;
+	default:
+		return -EIO;
+	}
+}
+
+static inline void virtblk_request_done(struct virtio_blk *vblk,
+					struct virtblk_req *vbr)
+{
+	struct request *req = vbr->req;
+	int error = virtblk_result(vbr);
+
+	if (req->cmd_type == REQ_TYPE_BLOCK_PC) {
+		req->resid_len = vbr->in_hdr.residual;
+		req->sense_len = vbr->in_hdr.sense_len;
+		req->errors = vbr->in_hdr.errors;
+	} else if (req->cmd_type == REQ_TYPE_SPECIAL) {
+		req->errors = (error != 0);
+	}
+
+	__blk_end_request_all(req, error);
+	mempool_free(vbr, vblk->pool);
+}
+
+static inline void virtblk_bio_done(struct virtio_blk *vblk,
+				    struct virtblk_req *vbr)
+{
+	bio_endio(vbr->bio, virtblk_result(vbr));
+	mempool_free(vbr, vblk->pool);
+}
+
+static void virtblk_done(struct virtqueue *vq)
 {
 	struct virtio_blk *vblk = vq->vdev->priv;
+	unsigned long bio_done = 0, req_done = 0;
 	struct virtblk_req *vbr;
-	unsigned int len;
 	unsigned long flags;
+	unsigned int len;
 
 	spin_lock_irqsave(vblk->disk->queue->queue_lock, flags);
 	while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
-		int error;
-
-		switch (vbr->status) {
-		case VIRTIO_BLK_S_OK:
-			error = 0;
-			break;
-		case VIRTIO_BLK_S_UNSUPP:
-			error = -ENOTTY;
-			break;
-		default:
-			error = -EIO;
-			break;
-		}
-
-		switch (vbr->req->cmd_type) {
-		case REQ_TYPE_BLOCK_PC:
-			vbr->req->resid_len = vbr->in_hdr.residual;
-			vbr->req->sense_len = vbr->in_hdr.sense_len;
-			vbr->req->errors = vbr->in_hdr.errors;
-			break;
-		case REQ_TYPE_SPECIAL:
-			vbr->req->errors = (error != 0);
-			break;
-		default:
-			break;
+		if (vbr->bio) {
+			virtblk_bio_done(vblk, vbr);
+			bio_done++;
+		} else {
+			virtblk_request_done(vblk, vbr);
+			req_done++;
 		}
-
-		__blk_end_request_all(vbr->req, error);
-		mempool_free(vbr, vblk->pool);
 	}
 	/* In case queue is stopped waiting for more buffers. */
-	blk_start_queue(vblk->disk->queue);
+	if (req_done)
+		blk_start_queue(vblk->disk->queue);
 	spin_unlock_irqrestore(vblk->disk->queue->queue_lock, flags);
+
+	if (bio_done)
+		wake_up(&vblk->queue_wait);
+}
+
+static inline struct virtblk_req *virtblk_alloc_req(struct virtio_blk *vblk,
+						    gfp_t gfp_mask)
+{
+	struct virtblk_req *vbr;
+
+	vbr = mempool_alloc(vblk->pool, gfp_mask);
+	if (vbr && use_bio)
+		sg_init_table(vbr->sg, vblk->sg_elems);
+
+	return vbr;
 }
 
 static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
@@ -106,13 +144,13 @@  static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 	unsigned long num, out = 0, in = 0;
 	struct virtblk_req *vbr;
 
-	vbr = mempool_alloc(vblk->pool, GFP_ATOMIC);
+	vbr = virtblk_alloc_req(vblk, GFP_ATOMIC);
 	if (!vbr)
 		/* When another request finishes we'll try again. */
 		return false;
 
 	vbr->req = req;
-
+	vbr->bio = NULL;
 	if (req->cmd_flags & REQ_FLUSH) {
 		vbr->out_hdr.type = VIRTIO_BLK_T_FLUSH;
 		vbr->out_hdr.sector = 0;
@@ -172,7 +210,8 @@  static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 		}
 	}
 
-	if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr, GFP_ATOMIC)<0) {
+	if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr,
+			      GFP_ATOMIC) < 0) {
 		mempool_free(vbr, vblk->pool);
 		return false;
 	}
@@ -180,7 +219,7 @@  static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 	return true;
 }
 
-static void do_virtblk_request(struct request_queue *q)
+static void virtblk_request(struct request_queue *q)
 {
 	struct virtio_blk *vblk = q->queuedata;
 	struct request *req;
@@ -203,6 +242,82 @@  static void do_virtblk_request(struct request_queue *q)
 		virtqueue_kick(vblk->vq);
 }
 
+static void virtblk_add_buf_wait(struct virtio_blk *vblk,
+				 struct virtblk_req *vbr,
+				 unsigned long out,
+				 unsigned long in)
+{
+	DEFINE_WAIT(wait);
+
+	for (;;) {
+		prepare_to_wait_exclusive(&vblk->queue_wait, &wait,
+					  TASK_UNINTERRUPTIBLE);
+
+		spin_lock_irq(vblk->disk->queue->queue_lock);
+		if (virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
+				      GFP_ATOMIC) < 0) {
+			spin_unlock_irq(vblk->disk->queue->queue_lock);
+			io_schedule();
+		} else {
+			virtqueue_kick(vblk->vq);
+			spin_unlock_irq(vblk->disk->queue->queue_lock);
+			break;
+		}
+
+	}
+
+	finish_wait(&vblk->queue_wait, &wait);
+}
+
+static void virtblk_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct virtio_blk *vblk = q->queuedata;
+	unsigned int num, out = 0, in = 0;
+	struct virtblk_req *vbr;
+
+	BUG_ON(bio->bi_phys_segments + 2 > vblk->sg_elems);
+	BUG_ON(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
+
+	vbr = virtblk_alloc_req(vblk, GFP_NOIO);
+	if (!vbr) {
+		bio_endio(bio, -ENOMEM);
+		return;
+	}
+
+	vbr->bio = bio;
+	vbr->req = NULL;
+	vbr->out_hdr.type = 0;
+	vbr->out_hdr.sector = bio->bi_sector;
+	vbr->out_hdr.ioprio = bio_prio(bio);
+
+	sg_set_buf(&vbr->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
+
+	num = blk_bio_map_sg(q, bio, vbr->sg + out);
+
+	sg_set_buf(&vbr->sg[num + out + in++], &vbr->status,
+		   sizeof(vbr->status));
+
+	if (num) {
+		if (bio->bi_rw & REQ_WRITE) {
+			vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
+			out += num;
+		} else {
+			vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
+			in += num;
+		}
+	}
+
+	spin_lock_irq(vblk->disk->queue->queue_lock);
+	if (unlikely(virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
+				       GFP_ATOMIC) < 0)) {
+		spin_unlock_irq(vblk->disk->queue->queue_lock);
+		virtblk_add_buf_wait(vblk, vbr, out, in);
+		return;
+	}
+	virtqueue_kick(vblk->vq);
+	spin_unlock_irq(vblk->disk->queue->queue_lock);
+}
+
 /* return id (s/n) string for *disk to *id_str
  */
 static int virtblk_get_id(struct gendisk *disk, char *id_str)
@@ -360,7 +475,7 @@  static int init_vq(struct virtio_blk *vblk)
 	int err = 0;
 
 	/* We expect one virtqueue, for output. */
-	vblk->vq = virtio_find_single_vq(vblk->vdev, blk_done, "requests");
+	vblk->vq = virtio_find_single_vq(vblk->vdev, virtblk_done, "requests");
 	if (IS_ERR(vblk->vq))
 		err = PTR_ERR(vblk->vq);
 
@@ -400,6 +515,8 @@  static int __devinit virtblk_probe(struct virtio_device *vdev)
 	struct virtio_blk *vblk;
 	struct request_queue *q;
 	int err, index;
+	int pool_size;
+
 	u64 cap;
 	u32 v, blk_size, sg_elems, opt_io_size;
 	u16 min_io_size;
@@ -429,10 +546,12 @@  static int __devinit virtblk_probe(struct virtio_device *vdev)
 		goto out_free_index;
 	}
 
+	init_waitqueue_head(&vblk->queue_wait);
 	vblk->vdev = vdev;
 	vblk->sg_elems = sg_elems;
 	sg_init_table(vblk->sg, vblk->sg_elems);
 	mutex_init(&vblk->config_lock);
+
 	INIT_WORK(&vblk->config_work, virtblk_config_changed_work);
 	vblk->config_enable = true;
 
@@ -440,7 +559,10 @@  static int __devinit virtblk_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vblk;
 
-	vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
+	pool_size = sizeof(struct virtblk_req);
+	if (use_bio)
+		pool_size += sizeof(struct scatterlist) * sg_elems;
+	vblk->pool = mempool_create_kmalloc_pool(1, pool_size);
 	if (!vblk->pool) {
 		err = -ENOMEM;
 		goto out_free_vq;
@@ -453,12 +575,14 @@  static int __devinit virtblk_probe(struct virtio_device *vdev)
 		goto out_mempool;
 	}
 
-	q = vblk->disk->queue = blk_init_queue(do_virtblk_request, NULL);
+	q = vblk->disk->queue = blk_init_queue(virtblk_request, NULL);
 	if (!q) {
 		err = -ENOMEM;
 		goto out_put_disk;
 	}
 
+	if (use_bio)
+		blk_queue_make_request(q, virtblk_make_request);
 	q->queuedata = vblk;
 
 	virtblk_name_format("vd", index, vblk->disk->disk_name, DISK_NAME_LEN);
@@ -471,7 +595,7 @@  static int __devinit virtblk_probe(struct virtio_device *vdev)
 	vblk->index = index;
 
 	/* configure queue flush support */
-	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
+	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH) && !use_bio)
 		blk_queue_flush(q, REQ_FLUSH);
 
 	/* If disk is read-only in the host, the guest should obey */
@@ -544,7 +668,6 @@  static int __devinit virtblk_probe(struct virtio_device *vdev)
 	if (!err && opt_io_size)
 		blk_queue_io_opt(q, blk_size * opt_io_size);
 
-
 	add_disk(vblk->disk);
 	err = device_create_file(disk_to_dev(vblk->disk), &dev_attr_serial);
 	if (err)