Message ID | 1343442065-15646-4-git-send-email-asias@redhat.com (mailing list archive) |
---|---|
State | New, archived |
On 07/28/2012 04:21 AM, Asias He wrote:
> This patch introduces a bio-based IO path for virtio-blk.
>
> Compared to the request-based IO path, the bio-based IO path uses the
> driver-provided ->make_request_fn() method to bypass the IO scheduler.
> It hands the bio to the device directly, without allocating a request
> in the block layer. This shortens the IO path in the guest kernel and
> achieves higher IOPS and lower latency. The downside is that the guest
> cannot use the IO scheduler to merge and sort requests. However, this
> is not a big problem if the backing disk on the host side is a fast
> device.
>
> When the bio-based IO path is not enabled, virtio-blk still uses the
> original request-based IO path, and no performance difference is
> observed.
>
> Performance evaluation:
> -----------------------------
> 1) Fio test performed in an 8-vcpu guest, with a ramdisk-based
>    backend, using kvm tool.
>
> Short version:
> With the bio-based IO path (seq-read, seq-write, rand-read, rand-write):
> IOPS boost         : 28%, 24%, 21%, 16%
> Latency improvement: 32%, 17%, 21%, 16%
>
> Long version:
> With bio-based IO path:
> seq-read  : io=2048.0MB, bw=116996KB/s, iops=233991, runt= 17925msec
> seq-write : io=2048.0MB, bw=100829KB/s, iops=201658, runt= 20799msec
> rand-read : io=3095.7MB, bw=112134KB/s, iops=224268, runt= 28269msec
> rand-write: io=3095.7MB, bw=96198KB/s, iops=192396, runt= 32952msec
> clat (usec): min=0, max=2631.6K, avg=58716.99, stdev=191377.30
> clat (usec): min=0, max=1753.2K, avg=66423.25, stdev=81774.35
> clat (usec): min=0, max=2915.5K, avg=61685.70, stdev=120598.39
> clat (usec): min=0, max=1933.4K, avg=76935.12, stdev=96603.45
> cpu : usr=74.08%, sys=703.84%, ctx=29661403, majf=21354, minf=22460954
> cpu : usr=70.92%, sys=702.81%, ctx=77219828, majf=13980, minf=27713137
> cpu : usr=72.23%, sys=695.37%, ctx=88081059, majf=18475, minf=28177648
> cpu : usr=69.69%, sys=654.13%, ctx=145476035, majf=15867, minf=26176375
> With request-based IO path:
> seq-read  : io=2048.0MB, bw=91074KB/s, iops=182147, runt= 23027msec
> seq-write : io=2048.0MB, bw=80725KB/s, iops=161449, runt= 25979msec
> rand-read : io=3095.7MB, bw=92106KB/s, iops=184211, runt= 34416msec
> rand-write: io=3095.7MB, bw=82815KB/s, iops=165630, runt= 38277msec
> clat (usec): min=0, max=1932.4K, avg=77824.17, stdev=170339.49
> clat (usec): min=0, max=2510.2K, avg=78023.96, stdev=146949.15
> clat (usec): min=0, max=3037.2K, avg=74746.53, stdev=128498.27
> clat (usec): min=0, max=1363.4K, avg=89830.75, stdev=114279.68
> cpu : usr=53.28%, sys=724.19%, ctx=37988895, majf=17531, minf=23577622
> cpu : usr=49.03%, sys=633.20%, ctx=205935380, majf=18197, minf=27288959
> cpu : usr=55.78%, sys=722.40%, ctx=101525058, majf=19273, minf=28067082
> cpu : usr=56.55%, sys=690.83%, ctx=228205022, majf=18039, minf=26551985
>
> 2) Fio test performed in an 8-vcpu guest, with a Fusion-IO-based
>    backend, using kvm tool.
>
> Short version:
> With the bio-based IO path (seq-read, seq-write, rand-read, rand-write):
> IOPS boost         : 11%, 11%, 13%, 10%
> Latency improvement: 10%, 10%, 12%, 10%
>
> Long version:
> With bio-based IO path:
> read : io=2048.0MB, bw=58920KB/s, iops=117840, runt= 35593msec
> write: io=2048.0MB, bw=64308KB/s, iops=128616, runt= 32611msec
> read : io=3095.7MB, bw=59633KB/s, iops=119266, runt= 53157msec
> write: io=3095.7MB, bw=62993KB/s, iops=125985, runt= 50322msec
> clat (usec): min=0, max=1284.3K, avg=128109.01, stdev=71513.29
> clat (usec): min=94, max=962339, avg=116832.95, stdev=65836.80
> clat (usec): min=0, max=1846.6K, avg=128509.99, stdev=89575.07
> clat (usec): min=0, max=2256.4K, avg=121361.84, stdev=82747.25
> cpu : usr=56.79%, sys=421.70%, ctx=147335118, majf=21080, minf=19852517
> cpu : usr=61.81%, sys=455.53%, ctx=143269950, majf=16027, minf=24800604
> cpu : usr=63.10%, sys=455.38%, ctx=178373538, majf=16958, minf=24822612
> cpu : usr=62.04%, sys=453.58%, ctx=226902362, majf=16089, minf=23278105
> With request-based IO path:
> read : io=2048.0MB, bw=52896KB/s, iops=105791, runt= 39647msec
> write: io=2048.0MB, bw=57856KB/s, iops=115711, runt= 36248msec
> read : io=3095.7MB, bw=52387KB/s, iops=104773, runt= 60510msec
> write: io=3095.7MB, bw=57310KB/s, iops=114619, runt= 55312msec
> clat (usec): min=0, max=1532.6K, avg=142085.62, stdev=109196.84
> clat (usec): min=0, max=1487.4K, avg=129110.71, stdev=114973.64
> clat (usec): min=0, max=1388.6K, avg=145049.22, stdev=107232.55
> clat (usec): min=0, max=1465.9K, avg=133585.67, stdev=110322.95
> cpu : usr=44.08%, sys=590.71%, ctx=451812322, majf=14841, minf=17648641
> cpu : usr=48.73%, sys=610.78%, ctx=418953997, majf=22164, minf=26850689
> cpu : usr=45.58%, sys=581.16%, ctx=714079216, majf=21497, minf=22558223
> cpu : usr=48.40%, sys=599.65%, ctx=656089423, majf=16393, minf=23824409

What are the cases where we'll see a performance degradation with using
the bio path? Could we measure performance for those as well?

> How to use:
> -----------------------------
> Add 'virtio_blk.use_bio=1' to the kernel cmdline, or 'modprobe
> virtio_blk use_bio=1', to enable the ->make_request_fn() based I/O
> path.

If there are, in fact, no cases where performance is degraded, can
use_bio=1 be the default?
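For readers skimming the thread: the mechanism under discussion comes
down to which entry point the driver registers with the block layer. A
minimal sketch, condensed from the patch at the bottom of this page
(not a complete probe function):

/* Request-based path: blk_init_queue() installs a request_fn, so bios
 * are merged/sorted by the guest I/O scheduler and arrive at the
 * driver as struct request.
 * Bio-based path: blk_queue_make_request() overrides that with a
 * make_request_fn, so each bio is handed to the driver directly. */
q = vblk->disk->queue = blk_init_queue(virtblk_request, NULL);
if (use_bio)
        blk_queue_make_request(q, virtblk_make_request);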
On Sat, Jul 28, 2012 at 10:21:05AM +0800, Asias He wrote:
> This patch introduces a bio-based IO path for virtio-blk.
>
> Compared to the request-based IO path, the bio-based IO path uses the
> driver-provided ->make_request_fn() method to bypass the IO scheduler.
> It hands the bio to the device directly, without allocating a request
> in the block layer. This shortens the IO path in the guest kernel and
> achieves higher IOPS and lower latency. The downside is that the guest
> cannot use the IO scheduler to merge and sort requests. However, this
> is not a big problem if the backing disk on the host side is a fast
> device.

If this optimization depends on the host, then it should be reported to
the guest using a feature bit, as opposed to being guest driven.

> When the bio-based IO path is not enabled, virtio-blk still uses the
> original request-based IO path, and no performance difference is
> observed.
>
> Performance evaluation:
> -----------------------------
> 1) Fio test performed in an 8-vcpu guest, with a ramdisk-based
>    backend, using kvm tool.
>
> Short version:
> With the bio-based IO path (seq-read, seq-write, rand-read, rand-write):
> IOPS boost         : 28%, 24%, 21%, 16%
> Latency improvement: 32%, 17%, 21%, 16%
>
> Long version:
> With bio-based IO path:
> seq-read  : io=2048.0MB, bw=116996KB/s, iops=233991, runt= 17925msec
> seq-write : io=2048.0MB, bw=100829KB/s, iops=201658, runt= 20799msec
> rand-read : io=3095.7MB, bw=112134KB/s, iops=224268, runt= 28269msec
> rand-write: io=3095.7MB, bw=96198KB/s, iops=192396, runt= 32952msec
> clat (usec): min=0, max=2631.6K, avg=58716.99, stdev=191377.30
> clat (usec): min=0, max=1753.2K, avg=66423.25, stdev=81774.35
> clat (usec): min=0, max=2915.5K, avg=61685.70, stdev=120598.39
> clat (usec): min=0, max=1933.4K, avg=76935.12, stdev=96603.45
> cpu : usr=74.08%, sys=703.84%, ctx=29661403, majf=21354, minf=22460954
> cpu : usr=70.92%, sys=702.81%, ctx=77219828, majf=13980, minf=27713137
> cpu : usr=72.23%, sys=695.37%, ctx=88081059, majf=18475, minf=28177648
> cpu : usr=69.69%, sys=654.13%, ctx=145476035, majf=15867, minf=26176375
> With request-based IO path:
> seq-read  : io=2048.0MB, bw=91074KB/s, iops=182147, runt= 23027msec
> seq-write : io=2048.0MB, bw=80725KB/s, iops=161449, runt= 25979msec
> rand-read : io=3095.7MB, bw=92106KB/s, iops=184211, runt= 34416msec
> rand-write: io=3095.7MB, bw=82815KB/s, iops=165630, runt= 38277msec
> clat (usec): min=0, max=1932.4K, avg=77824.17, stdev=170339.49
> clat (usec): min=0, max=2510.2K, avg=78023.96, stdev=146949.15
> clat (usec): min=0, max=3037.2K, avg=74746.53, stdev=128498.27
> clat (usec): min=0, max=1363.4K, avg=89830.75, stdev=114279.68
> cpu : usr=53.28%, sys=724.19%, ctx=37988895, majf=17531, minf=23577622
> cpu : usr=49.03%, sys=633.20%, ctx=205935380, majf=18197, minf=27288959
> cpu : usr=55.78%, sys=722.40%, ctx=101525058, majf=19273, minf=28067082
> cpu : usr=56.55%, sys=690.83%, ctx=228205022, majf=18039, minf=26551985

So bio based causes cpu to jump up by some 30%? What happens if you
divide IOPS/CPU? Any improvement that comes from increasing the cpu
share of the given guest on the host will not scale well on a typical
overcommitted host.

> 2) Fio test performed in an 8-vcpu guest, with a Fusion-IO-based
>    backend, using kvm tool.
>
> Short version:
> With the bio-based IO path (seq-read, seq-write, rand-read, rand-write):
> IOPS boost         : 11%, 11%, 13%, 10%
> Latency improvement: 10%, 10%, 12%, 10%
>
> Long version:
> With bio-based IO path:
> read : io=2048.0MB, bw=58920KB/s, iops=117840, runt= 35593msec
> write: io=2048.0MB, bw=64308KB/s, iops=128616, runt= 32611msec
> read : io=3095.7MB, bw=59633KB/s, iops=119266, runt= 53157msec
> write: io=3095.7MB, bw=62993KB/s, iops=125985, runt= 50322msec
> clat (usec): min=0, max=1284.3K, avg=128109.01, stdev=71513.29
> clat (usec): min=94, max=962339, avg=116832.95, stdev=65836.80
> clat (usec): min=0, max=1846.6K, avg=128509.99, stdev=89575.07
> clat (usec): min=0, max=2256.4K, avg=121361.84, stdev=82747.25
> cpu : usr=56.79%, sys=421.70%, ctx=147335118, majf=21080, minf=19852517
> cpu : usr=61.81%, sys=455.53%, ctx=143269950, majf=16027, minf=24800604
> cpu : usr=63.10%, sys=455.38%, ctx=178373538, majf=16958, minf=24822612
> cpu : usr=62.04%, sys=453.58%, ctx=226902362, majf=16089, minf=23278105
> With request-based IO path:
> read : io=2048.0MB, bw=52896KB/s, iops=105791, runt= 39647msec
> write: io=2048.0MB, bw=57856KB/s, iops=115711, runt= 36248msec
> read : io=3095.7MB, bw=52387KB/s, iops=104773, runt= 60510msec
> write: io=3095.7MB, bw=57310KB/s, iops=114619, runt= 55312msec
> clat (usec): min=0, max=1532.6K, avg=142085.62, stdev=109196.84
> clat (usec): min=0, max=1487.4K, avg=129110.71, stdev=114973.64
> clat (usec): min=0, max=1388.6K, avg=145049.22, stdev=107232.55
> clat (usec): min=0, max=1465.9K, avg=133585.67, stdev=110322.95
> cpu : usr=44.08%, sys=590.71%, ctx=451812322, majf=14841, minf=17648641
> cpu : usr=48.73%, sys=610.78%, ctx=418953997, majf=22164, minf=26850689
> cpu : usr=45.58%, sys=581.16%, ctx=714079216, majf=21497, minf=22558223
> cpu : usr=48.40%, sys=599.65%, ctx=656089423, majf=16393, minf=23824409

Is this host or guest cpu? We should probably measure host cpu, as this
includes device overhead which could vary by load.

> How to use:
> -----------------------------
> Add 'virtio_blk.use_bio=1' to the kernel cmdline, or 'modprobe
> virtio_blk use_bio=1', to enable the ->make_request_fn() based I/O
> path.
>
> Cc: Rusty Russell <rusty@rustcorp.com.au>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: virtualization@lists.linux-foundation.org
> Cc: kvm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Acked-by: Rusty Russell <rusty@rustcorp.com.au>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: Asias He <asias@redhat.com>
> ---
>  drivers/block/virtio_blk.c | 203 +++++++++++++++++++++++++++++++++++---------
>  1 file changed, 163 insertions(+), 40 deletions(-)
> [...]
On Sun, 29 Jul 2012 14:11:15 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Sat, Jul 28, 2012 at 10:21:05AM +0800, Asias He wrote:
>> This patch introduces a bio-based IO path for virtio-blk.
>>
>> Compared to the request-based IO path, the bio-based IO path uses the
>> driver-provided ->make_request_fn() method to bypass the IO scheduler.
>> It hands the bio to the device directly, without allocating a request
>> in the block layer. This shortens the IO path in the guest kernel and
>> achieves higher IOPS and lower latency. The downside is that the guest
>> cannot use the IO scheduler to merge and sort requests. However, this
>> is not a big problem if the backing disk on the host side is a fast
>> device.
>
> If this optimization depends on the host, then it should be reported
> to the guest using a feature bit, as opposed to being guest driven.

I consider this approach a half-way step. I made some quick attempts on
my laptop and couldn't find a case where the bio path was a loss, but
in theory, if the host wasn't doing any reordering and it was a slow
device, you'd want the guest to do so.

I'm not sure whether current qemu can be configured to do such a thing?

Cheers,
Rusty.
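For concreteness, here is a rough sketch of what host-advertised gating
could look like on the guest side, as Michael suggests. This is a
hypothetical illustration only: VIRTIO_BLK_F_FAST_BACKEND is an
invented name, no such bit exists in the virtio spec, and the bit
number is arbitrary.

/* Hypothetical sketch: VIRTIO_BLK_F_FAST_BACKEND is not a real
 * virtio-blk feature bit; the number 13 is arbitrary. */
#define VIRTIO_BLK_F_FAST_BACKEND 13

static bool virtblk_want_bio_path(struct virtio_device *vdev)
{
        /* Take the bio path only when the admin asked for it and the
         * host claims a backend fast enough that guest-side
         * merging/sorting in the I/O scheduler would not pay off. */
        return use_bio && virtio_has_feature(vdev, VIRTIO_BLK_F_FAST_BACKEND);
}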
On 07/29/2012 07:11 PM, Michael S. Tsirkin wrote:
> On Sat, Jul 28, 2012 at 10:21:05AM +0800, Asias He wrote:
>> [...]
>> cpu : usr=74.08%, sys=703.84%, ctx=29661403, majf=21354, minf=22460954
>> [...]
>> cpu : usr=53.28%, sys=724.19%, ctx=37988895, majf=17531, minf=23577622
>> [...]
>
> So bio based causes cpu to jump up by some 30%? What happens if you
> divide IOPS/CPU? Any improvement that comes from increasing the cpu
> share of the given guest on the host will not scale well on a typical
> overcommitted host.

If you add the sys and usr time up, the jump is not that big.

For the ramdisk-based device (usr + sys):

bio-based:
  74.08 + 703.84 = 777.92
  70.92 + 702.81 = 773.73
  72.23 + 695.37 = 767.60
  69.69 + 654.13 = 723.82

req-based:
  53.28 + 724.19 = 777.47
  49.03 + 633.20 = 682.23
  55.78 + 722.40 = 778.18
  56.55 + 690.83 = 747.38

And for the real device (Fusion-IO), the cpu time is smaller with the
bio path:

bio-based:
  56.79 + 421.70 = 478.49
  61.81 + 455.53 = 517.34
  63.10 + 455.38 = 518.48
  62.04 + 453.58 = 515.62

req-based:
  44.08 + 590.71 = 634.79
  48.73 + 610.78 = 659.51
  45.58 + 581.16 = 626.74
  48.40 + 599.65 = 648.05

> Is this host or guest cpu? We should probably measure host cpu, as
> this includes device overhead which could vary by load.

It's guest cpu. Yes, host cpu is also interesting.
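Taking Michael's IOPS/CPU suggestion literally for the ramdisk seq-read
numbers quoted earlier (guest CPU only, so this is indicative rather
than the host-side measurement he asks for):

bio-based : 233991 iops / 777.92 %cpu ~= 300.8 iops per %cpu
req-based : 182147 iops / 777.47 %cpu ~= 234.3 iops per %cpu

For that case the bio path delivers roughly 28% more IOPS per unit of
guest CPU, consistent with the usr+sys totals above being nearly equal
while IOPS rose 28%.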
Hello Sasha,

On 07/28/2012 02:35 PM, Sasha Levin wrote:
> On 07/28/2012 04:21 AM, Asias He wrote:
>> [...]
>
> What are the cases where we'll see a performance degradation with
> using the bio path? Could we measure performance for those as well?
>
>> How to use:
>> -----------------------------
>> Add 'virtio_blk.use_bio=1' to the kernel cmdline, or 'modprobe
>> virtio_blk use_bio=1', to enable the ->make_request_fn() based I/O
>> path.
>
> If there are, in fact, no cases where performance is degraded, can
> use_bio=1 be the default?

Here are some results on a SATA-based image file. In this case, the bio
path is slower than the req path for sequential read/write.
qemu use_bio=0 (req-based)
-------------------------------------
read : io=120964KB, bw=26098KB/s, iops=6470, runt= 4635msec
write: io=190236KB, bw=35505KB/s, iops=8829, runt= 5358msec
read : io=257400KB, bw=1963.7KB/s, iops=488, runt=131081msec
write: io=258944KB, bw=1302.7KB/s, iops=324, runt=198872msec
clat (msec): min=1, max=1527, avg=30.73, stdev=144.73
clat (usec): min=811, max=247072, avg=28451.71, stdev=16107.22
clat (msec): min=6, max=2519, avg=513.91, stdev=231.07
clat (msec): min=33, max=2621, avg=772.33, stdev=348.39
cpu : usr=4.05%, sys=14.56%, ctx=38199, majf=0, minf=4
cpu : usr=4.02%, sys=15.48%, ctx=53724, majf=0, minf=0
cpu : usr=0.15%, sys=0.30%, ctx=20535, majf=0, minf=16
cpu : usr=0.32%, sys=0.96%, ctx=101465, majf=0, minf=0

qemu use_bio=1 (bio-based)
-------------------------------------
read : io=202736KB, bw=25569KB/s, iops=6360, runt= 7929msec
write: io=217844KB, bw=20335KB/s, iops=5060, runt= 10713msec
read : io=256980KB, bw=1958.2KB/s, iops=487, runt=131235msec
write: io=258288KB, bw=1423.9KB/s, iops=354, runt=181405msec
clat (usec): min=922, max=1578.2K, avg=38702.18, stdev=99248.33
clat (usec): min=460, max=241314, avg=49326.52, stdev=18705.68
clat (msec): min=19, max=2370, avg=515.30, stdev=200.84
clat (msec): min=11, max=3751, avg=702.60, stdev=286.93
cpu : usr=2.54%, sys=8.75%, ctx=68522, majf=0, minf=6
cpu : usr=1.96%, sys=7.70%, ctx=70003, majf=0, minf=0
cpu : usr=0.39%, sys=1.46%, ctx=259459, majf=0, minf=16
cpu : usr=0.28%, sys=1.21%, ctx=265148, majf=0, minf=0

lkvm use_bio=0 (req-based)
-------------------------------------
read : io=150120KB, bw=40420KB/s, iops=10037, runt= 3714msec
write: io=194932KB, bw=27029KB/s, iops=6722, runt= 7212msec
read : io=257136KB, bw=2001.1KB/s, iops=498, runt=128443msec
write: io=258276KB, bw=1537.2KB/s, iops=382, runt=168028msec
clat (msec): min=1, max=1542, avg=24.84, stdev=32.45
clat (msec): min=3, max=628, avg=35.62, stdev=39.71
clat (msec): min=8, max=2540, avg=503.28, stdev=236.97
clat (msec): min=41, max=4398, avg=653.88, stdev=302.61
cpu : usr=3.91%, sys=15.75%, ctx=26968, majf=0, minf=23
cpu : usr=2.50%, sys=10.56%, ctx=19090, majf=0, minf=0
cpu : usr=0.16%, sys=0.43%, ctx=20159, majf=0, minf=16
cpu : usr=0.18%, sys=0.53%, ctx=81364, majf=0, minf=0

lkvm use_bio=1 (bio-based)
-------------------------------------
read : io=124812KB, bw=36537KB/s, iops=9060, runt= 3416msec
write: io=169180KB, bw=24406KB/s, iops=6065, runt= 6932msec
read : io=256200KB, bw=2089.3KB/s, iops=520, runt=122630msec
write: io=257988KB, bw=1545.7KB/s, iops=384, runt=166910msec
clat (msec): min=1, max=1527, avg=28.06, stdev=89.54
clat (msec): min=2, max=344, avg=41.12, stdev=38.70
clat (msec): min=8, max=1984, avg=490.63, stdev=207.28
clat (msec): min=33, max=4131, avg=659.19, stdev=304.71
cpu : usr=4.85%, sys=17.15%, ctx=31593, majf=0, minf=7
cpu : usr=3.04%, sys=11.45%, ctx=39377, majf=0, minf=0
cpu : usr=0.47%, sys=1.59%, ctx=262986, majf=0, minf=16
cpu : usr=0.47%, sys=1.46%, ctx=337410, majf=0, minf=0
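The thread never posts the fio job file itself, so the following is
only a guess at the shape of the four workloads (seq-read, seq-write,
rand-read, rand-write); every parameter below (I/O engine, block size,
queue depth, target device) is assumed, not taken from the thread:

[global]
ioengine=libaio   ; assumed
direct=1          ; assumed
filename=/dev/vdb ; the virtio-blk disk under test; name assumed
bs=4k             ; assumed
size=2g           ; matches the io=2048.0MB totals of the sequential jobs
iodepth=32        ; assumed

[seq-read]
rw=read
stonewall

[seq-write]
rw=write
stonewall

[rand-read]
rw=randread
stonewall

[rand-write]
rw=randwrite
stonewall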
On Mon, Jul 30, 2012 at 11:25:51AM +0930, Rusty Russell wrote:
> I consider this approach a half-way step. Quick attempts on my laptop
> and I couldn't find a case where the bio path was a loss, but in
> theory, if the host wasn't doing any reordering and it was a slow
> device, you'd want the guest to do so.
>
> I'm not sure if current qemu can be configured to do such a thing?

The host kernel will do the I/O scheduling for you unless you
explicitly disable it. And we should be able to assume an administrator
will only disable it when they have a reason for it - if not, they'll
get worse performance for non-virtualized workloads as well.
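The host-side knob Christoph refers to is the per-device scheduler
selection in sysfs; for example, on a host kernel of this era (the
device name sda is illustrative):

# show the available schedulers for the backing device
cat /sys/block/sda/queue/scheduler

# stop the host from reordering I/O for that device
echo noop > /sys/block/sda/queue/scheduler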
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index c0bbeb4..95cfeed 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -14,6 +14,9 @@
 
 #define PART_BITS 4
 
+static bool use_bio;
+module_param(use_bio, bool, S_IRUGO);
+
 static int major;
 static DEFINE_IDA(vd_index_ida);
 
@@ -23,6 +26,7 @@ struct virtio_blk
 {
 	struct virtio_device *vdev;
 	struct virtqueue *vq;
+	wait_queue_head_t queue_wait;
 
 	/* The disk structure for the kernel. */
 	struct gendisk *disk;
@@ -51,53 +55,87 @@ struct virtio_blk
 struct virtblk_req
 {
 	struct request *req;
+	struct bio *bio;
 	struct virtio_blk_outhdr out_hdr;
 	struct virtio_scsi_inhdr in_hdr;
 	u8 status;
+	struct scatterlist sg[];
 };
 
-static void blk_done(struct virtqueue *vq)
+static inline int virtblk_result(struct virtblk_req *vbr)
+{
+	switch (vbr->status) {
+	case VIRTIO_BLK_S_OK:
+		return 0;
+	case VIRTIO_BLK_S_UNSUPP:
+		return -ENOTTY;
+	default:
+		return -EIO;
+	}
+}
+
+static inline void virtblk_request_done(struct virtio_blk *vblk,
+					struct virtblk_req *vbr)
+{
+	struct request *req = vbr->req;
+	int error = virtblk_result(vbr);
+
+	if (req->cmd_type == REQ_TYPE_BLOCK_PC) {
+		req->resid_len = vbr->in_hdr.residual;
+		req->sense_len = vbr->in_hdr.sense_len;
+		req->errors = vbr->in_hdr.errors;
+	} else if (req->cmd_type == REQ_TYPE_SPECIAL) {
+		req->errors = (error != 0);
+	}
+
+	__blk_end_request_all(req, error);
+	mempool_free(vbr, vblk->pool);
+}
+
+static inline void virtblk_bio_done(struct virtio_blk *vblk,
+				    struct virtblk_req *vbr)
+{
+	bio_endio(vbr->bio, virtblk_result(vbr));
+	mempool_free(vbr, vblk->pool);
+}
+
+static void virtblk_done(struct virtqueue *vq)
 {
 	struct virtio_blk *vblk = vq->vdev->priv;
+	unsigned long bio_done = 0, req_done = 0;
 	struct virtblk_req *vbr;
-	unsigned int len;
 	unsigned long flags;
+	unsigned int len;
 
 	spin_lock_irqsave(vblk->disk->queue->queue_lock, flags);
 	while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
-		int error;
-
-		switch (vbr->status) {
-		case VIRTIO_BLK_S_OK:
-			error = 0;
-			break;
-		case VIRTIO_BLK_S_UNSUPP:
-			error = -ENOTTY;
-			break;
-		default:
-			error = -EIO;
-			break;
-		}
-
-		switch (vbr->req->cmd_type) {
-		case REQ_TYPE_BLOCK_PC:
-			vbr->req->resid_len = vbr->in_hdr.residual;
-			vbr->req->sense_len = vbr->in_hdr.sense_len;
-			vbr->req->errors = vbr->in_hdr.errors;
-			break;
-		case REQ_TYPE_SPECIAL:
-			vbr->req->errors = (error != 0);
-			break;
-		default:
-			break;
+		if (vbr->bio) {
+			virtblk_bio_done(vblk, vbr);
+			bio_done++;
+		} else {
+			virtblk_request_done(vblk, vbr);
+			req_done++;
 		}
-
-		__blk_end_request_all(vbr->req, error);
-		mempool_free(vbr, vblk->pool);
 	}
 	/* In case queue is stopped waiting for more buffers. */
-	blk_start_queue(vblk->disk->queue);
+	if (req_done)
+		blk_start_queue(vblk->disk->queue);
 	spin_unlock_irqrestore(vblk->disk->queue->queue_lock, flags);
+
+	if (bio_done)
+		wake_up(&vblk->queue_wait);
+}
+
+static inline struct virtblk_req *virtblk_alloc_req(struct virtio_blk *vblk,
+						    gfp_t gfp_mask)
+{
+	struct virtblk_req *vbr;
+
+	vbr = mempool_alloc(vblk->pool, gfp_mask);
+	if (vbr && use_bio)
+		sg_init_table(vbr->sg, vblk->sg_elems);
+
+	return vbr;
 }
 
 static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
@@ -106,13 +144,13 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 	unsigned long num, out = 0, in = 0;
 	struct virtblk_req *vbr;
 
-	vbr = mempool_alloc(vblk->pool, GFP_ATOMIC);
+	vbr = virtblk_alloc_req(vblk, GFP_ATOMIC);
 	if (!vbr)
 		/* When another request finishes we'll try again. */
 		return false;
 
 	vbr->req = req;
-
+	vbr->bio = NULL;
 	if (req->cmd_flags & REQ_FLUSH) {
 		vbr->out_hdr.type = VIRTIO_BLK_T_FLUSH;
 		vbr->out_hdr.sector = 0;
@@ -172,7 +210,8 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 		}
 	}
 
-	if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr, GFP_ATOMIC)<0) {
+	if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr,
+			      GFP_ATOMIC) < 0) {
 		mempool_free(vbr, vblk->pool);
 		return false;
 	}
@@ -180,7 +219,7 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 	return true;
 }
 
-static void do_virtblk_request(struct request_queue *q)
+static void virtblk_request(struct request_queue *q)
 {
 	struct virtio_blk *vblk = q->queuedata;
 	struct request *req;
@@ -203,6 +242,82 @@ static void do_virtblk_request(struct request_queue *q)
 	virtqueue_kick(vblk->vq);
 }
 
+static void virtblk_add_buf_wait(struct virtio_blk *vblk,
+				 struct virtblk_req *vbr,
+				 unsigned long out,
+				 unsigned long in)
+{
+	DEFINE_WAIT(wait);
+
+	for (;;) {
+		prepare_to_wait_exclusive(&vblk->queue_wait, &wait,
+					  TASK_UNINTERRUPTIBLE);
+
+		spin_lock_irq(vblk->disk->queue->queue_lock);
+		if (virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
+				      GFP_ATOMIC) < 0) {
+			spin_unlock_irq(vblk->disk->queue->queue_lock);
+			io_schedule();
+		} else {
+			virtqueue_kick(vblk->vq);
+			spin_unlock_irq(vblk->disk->queue->queue_lock);
+			break;
+		}
+
+	}
+
+	finish_wait(&vblk->queue_wait, &wait);
+}
+
+static void virtblk_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct virtio_blk *vblk = q->queuedata;
+	unsigned int num, out = 0, in = 0;
+	struct virtblk_req *vbr;
+
+	BUG_ON(bio->bi_phys_segments + 2 > vblk->sg_elems);
+	BUG_ON(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
+
+	vbr = virtblk_alloc_req(vblk, GFP_NOIO);
+	if (!vbr) {
+		bio_endio(bio, -ENOMEM);
+		return;
+	}
+
+	vbr->bio = bio;
+	vbr->req = NULL;
+	vbr->out_hdr.type = 0;
+	vbr->out_hdr.sector = bio->bi_sector;
+	vbr->out_hdr.ioprio = bio_prio(bio);
+
+	sg_set_buf(&vbr->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
+
+	num = blk_bio_map_sg(q, bio, vbr->sg + out);
+
+	sg_set_buf(&vbr->sg[num + out + in++], &vbr->status,
+		   sizeof(vbr->status));
+
+	if (num) {
+		if (bio->bi_rw & REQ_WRITE) {
+			vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
+			out += num;
+		} else {
+			vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
+			in += num;
+		}
+	}
+
+	spin_lock_irq(vblk->disk->queue->queue_lock);
+	if (unlikely(virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
+				       GFP_ATOMIC) < 0)) {
+		spin_unlock_irq(vblk->disk->queue->queue_lock);
+		virtblk_add_buf_wait(vblk, vbr, out, in);
+		return;
+	}
+	virtqueue_kick(vblk->vq);
+	spin_unlock_irq(vblk->disk->queue->queue_lock);
+}
+
 /* return id (s/n) string for *disk to *id_str
  */
 static int virtblk_get_id(struct gendisk *disk, char *id_str)
@@ -360,7 +475,7 @@ static int init_vq(struct virtio_blk *vblk)
 	int err = 0;
 
 	/* We expect one virtqueue, for output. */
-	vblk->vq = virtio_find_single_vq(vblk->vdev, blk_done, "requests");
+	vblk->vq = virtio_find_single_vq(vblk->vdev, virtblk_done, "requests");
 	if (IS_ERR(vblk->vq))
 		err = PTR_ERR(vblk->vq);
 
@@ -414,7 +529,7 @@ static void virtblk_update_cache_mode(struct virtio_device *vdev)
 	u8 writeback = virtblk_get_cache_mode(vdev);
 	struct virtio_blk *vblk = vdev->priv;
 
-	if (writeback)
+	if (writeback && !use_bio)
 		blk_queue_flush(vblk->disk->queue, REQ_FLUSH);
 	else
 		blk_queue_flush(vblk->disk->queue, 0);
@@ -477,6 +592,8 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 	struct virtio_blk *vblk;
 	struct request_queue *q;
 	int err, index;
+	int pool_size;
+
 	u64 cap;
 	u32 v, blk_size, sg_elems, opt_io_size;
 	u16 min_io_size;
@@ -506,10 +623,12 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 		goto out_free_index;
 	}
 
+	init_waitqueue_head(&vblk->queue_wait);
 	vblk->vdev = vdev;
 	vblk->sg_elems = sg_elems;
 	sg_init_table(vblk->sg, vblk->sg_elems);
 	mutex_init(&vblk->config_lock);
+
 	INIT_WORK(&vblk->config_work, virtblk_config_changed_work);
 	vblk->config_enable = true;
 
@@ -517,7 +636,10 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vblk;
 
-	vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
+	pool_size = sizeof(struct virtblk_req);
+	if (use_bio)
+		pool_size += sizeof(struct scatterlist) * sg_elems;
+	vblk->pool = mempool_create_kmalloc_pool(1, pool_size);
 	if (!vblk->pool) {
 		err = -ENOMEM;
 		goto out_free_vq;
@@ -530,12 +652,14 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 		goto out_mempool;
 	}
 
-	q = vblk->disk->queue = blk_init_queue(do_virtblk_request, NULL);
+	q = vblk->disk->queue = blk_init_queue(virtblk_request, NULL);
 	if (!q) {
 		err = -ENOMEM;
 		goto out_put_disk;
 	}
 
+	if (use_bio)
+		blk_queue_make_request(q, virtblk_make_request);
 	q->queuedata = vblk;
 
 	virtblk_name_format("vd", index, vblk->disk->disk_name, DISK_NAME_LEN);
@@ -620,7 +744,6 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 	if (!err && opt_io_size)
 		blk_queue_io_opt(q, blk_size * opt_io_size);
 
-
 	add_disk(vblk->disk);
 	err = device_create_file(disk_to_dev(vblk->disk), &dev_attr_serial);
 	if (err)
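Finally, a usage sketch for the knob the patch adds. Since use_bio is
declared with module_param(..., S_IRUGO), the chosen value is also
readable back through sysfs:

# pick the bio-based path at module load time
modprobe virtio_blk use_bio=1

# or, for a built-in driver, on the kernel command line:
#   virtio_blk.use_bio=1

# read the effective value back (S_IRUGO makes it world-readable)
cat /sys/module/virtio_blk/parameters/use_bio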