virtio-blk: set QUEUE_ORDERED_DRAIN by default

Message ID 20090820205616.GA5503@lst.de (mailing list archive)
State New, archived
Headers show

Commit Message

Christoph Hellwig Aug. 20, 2009, 8:56 p.m. UTC
Currently virtio-blk doesn't set any QUEUE_ORDERED_ flag by default, which
means it does not allow filesystems to use barriers.  But the typical use
case for virtio-blk is to use a backend that uses synchronous I/O, and in
that case we can simply set QUEUE_ORDERED_DRAIN to make the block layer
drain the request queue around barrier I/O and provide the semantics that
the filesystems need.  This is what the SCSI disk driver does for disks
that have the write cache disabled.

With this patch we incorrectly advertise barrier support if someone
configures qemu with writeback caching.  While this presents wrong
information in the guest, there is nothing the guest could have done
even if we rightfully told it that we do not support any barriers.


Signed-off-by: Christoph Hellwig <hch@lst.de>

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Christian Borntraeger Aug. 21, 2009, 7:30 a.m. UTC | #1
On Thursday, 20 August 2009 22:56:16, Christoph Hellwig wrote:
> Currently virtio-blk doesn't set any QUEUE_ORDERED_ flag by default, which
> means it does not allow filesystems to use barriers.  But the typical use
> case for virtio-blk is to use a backend that uses synchronous I/O, and in
> that case we can simply set QUEUE_ORDERED_DRAIN to make the block layer
> drain the request queue around barrier I/O and provide the semantics that
> the filesystems need.  This is what the SCSI disk driver does for disks
> that have the write cache disabled.
>
> With this patch we incorrectly advertise barrier support if someone
> configures qemu with writeback caching.  While this presents wrong
> information in the guest, there is nothing the guest could have done
> even if we rightfully told it that we do not support any barriers.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Makes sense to me.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>

[...]
> -	/* If barriers are supported, tell block layer that queue is ordered */
> +	/*
> +	 * If barriers are supported, tell block layer that queue is ordered.
> +	 *
> +	 * If no barriers are supported assume the host uses synchronous
> +	 * writes and just drain the queue before and after the barrier.
> +	 */
>  	if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER))
>  		blk_queue_ordered(vblk->disk->queue, QUEUE_ORDERED_TAG, NULL);
> +	else
> +		blk_queue_ordered(vblk->disk->queue, QUEUE_ORDERED_DRAIN, NULL);
[...]
Rusty Russell Aug. 25, 2009, 2:11 p.m. UTC | #2
On Fri, 21 Aug 2009 06:26:16 am Christoph Hellwig wrote:
> Currently virtio-blk doesn't set any QUEUE_ORDERED_ flag by default, which
> means it does not allow filesystems to use barriers.  But the typical use
> case for virtio-blk is to use a backend that uses synchronous I/O

Really?  Does qemu open with O_SYNC?

I'm definitely no block expert, but this seems strange...
Rusty.
Christoph Hellwig Aug. 25, 2009, 2:16 p.m. UTC | #3
On Tue, Aug 25, 2009 at 11:41:37PM +0930, Rusty Russell wrote:
> On Fri, 21 Aug 2009 06:26:16 am Christoph Hellwig wrote:
> > Currently virtio-blk doesn't set any QUEUE_ORDERED_ flag by default, which
> > means it does not allow filesystems to use barriers.  But the typical use
> > case for virtio-blk is to use a backend that uses synchronous I/O
> 
> Really?  Does qemu open with O_SYNC?
> 
> I'm definitely no block expert, but this seems strange...
> Rusty.

Qemu can open it in various ways, but the only one that is fully safe
is O_SYNC (cache=writethrough).  The O_DIRECT (cache=none) option is also
fully safe with the above patch under some limited circumstances 
(disk write caches off and using a host device or fully allocated file).

Fixing the cache=writeback option and the majority case for cache=none
requires implementing a cache flush command, and for the latter also
fixes to the host kernel that I'm working on.  You will get another
patch from me implementing the proper cache controls in virtio-blk in
a couple of days, too.
Rusty Russell Aug. 26, 2009, 12:06 p.m. UTC | #4
On Tue, 25 Aug 2009 11:46:08 pm Christoph Hellwig wrote:
> On Tue, Aug 25, 2009 at 11:41:37PM +0930, Rusty Russell wrote:
> > On Fri, 21 Aug 2009 06:26:16 am Christoph Hellwig wrote:
> > > Currently virtio-blk doesn't set any QUEUE_ORDERED_ flag by default, which
> > > means it does not allow filesystems to use barriers.  But the typical use
> > > case for virtio-blk is to use a backend that uses synchronous I/O
> > 
> > Really?  Does qemu open with O_SYNC?
> > 
> > I'm definitely no block expert, but this seems strange...
> > Rusty.
> 
> Qemu can open it various ways, but the only one that is fully safe
> is O_SYNC (cache=writethrough).

(Rusty goes away and reads the qemu man page).

	By default, if no explicit caching is specified for a qcow2 disk image,
	cache=writeback will be used.

Are you claiming qcow2 is unusual?  I can believe snapshot is less common,
though I use it all the time.

You'd normally have to add a feature for something like this.  I don't
think this is different.

Sorry,
Rusty.
Avi Kivity Aug. 26, 2009, 12:28 p.m. UTC | #5
On 08/26/2009 03:06 PM, Rusty Russell wrote:
> On Tue, 25 Aug 2009 11:46:08 pm Christoph Hellwig wrote:
>    
>> On Tue, Aug 25, 2009 at 11:41:37PM +0930, Rusty Russell wrote:
>>      
>>> On Fri, 21 Aug 2009 06:26:16 am Christoph Hellwig wrote:
>>>        
>>>> Currently virtio-blk doesn't set any QUEUE_ORDERED_ flag by default, which
>>>> means it does not allow filesystems to use barriers.  But the typical use
>>>> case for virtio-blk is to use a backed that uses synchronous I/O
>>>>          
>>> Really?  Does qemu open with O_SYNC?
>>>
>>> I'm definitely no block expert, but this seems strange...
>>> Rusty.
>>>        
>> Qemu can open it various ways, but the only one that is fully safe
>> is O_SYNC (cache=writethrough).
>>      
> (Rusty goes away and reads the qemu man page).
>
> 	By default, if no explicit caching is specified for a qcow2 disk image,
> 	cache=writeback will be used.
>    

It's now switched to writethrough.  In any case, cache=writeback means 
"lie to the guest, we don't care about integrity".

> Are you claiming qcow2 is unusual?  I can believe snapshot is less common,
> though I use it all the time.
>
> You'd normally have to add a feature for something like this.  I don't
> think this is different.
>    

Why do we need to add a feature for this?
Rusty Russell Aug. 27, 2009, 10:43 a.m. UTC | #6
On Wed, 26 Aug 2009 09:58:13 pm Avi Kivity wrote:
> On 08/26/2009 03:06 PM, Rusty Russell wrote:
> > On Tue, 25 Aug 2009 11:46:08 pm Christoph Hellwig wrote:
> >    
> >> On Tue, Aug 25, 2009 at 11:41:37PM +0930, Rusty Russell wrote:
> >>      
> >>> On Fri, 21 Aug 2009 06:26:16 am Christoph Hellwig wrote:
> >>>        
> >>>> Currently virtio-blk doesn't set any QUEUE_ORDERED_ flag by default, which
> >>>> means it does not allow filesystems to use barriers.  But the typical use
> > >>> case for virtio-blk is to use a backend that uses synchronous I/O
> >>>>          
> >>> Really?  Does qemu open with O_SYNC?
> >>>
> >>> I'm definitely no block expert, but this seems strange...
> >>> Rusty.
> >>>        
> >> Qemu can open it various ways, but the only one that is fully safe
> >> is O_SYNC (cache=writethrough).
> >>      
> > (Rusty goes away and reads the qemu man page).
> >
> > 	By default, if no explicit caching is specified for a qcow2 disk image,
> > 	cache=writeback will be used.
> >    
> 
> It's now switched to writethrough.  In any case, cache=writeback means 
> "lie to the guest, we don't care about integrity".

Well, that was the intent of the virtio barrier feature; *don't* lie to the
guest, make it aware of the limitations.

Of course, having read Christoph's excellent summary of the situation, it's
clear I failed.

> > Are you claiming qcow2 is unusual?  I can believe snapshot is less common,
> > though I use it all the time.
> >
> > You'd normally have to add a feature for something like this.  I don't
> > think this is different.
> 
> Why do we need to add a feature for this?

Because cache=writeback should *not* lie to the guest?

Rusty.
Avi Kivity Aug. 27, 2009, 11:04 a.m. UTC | #7
On 08/27/2009 01:43 PM, Rusty Russell wrote:
>
>>> Are you claiming qcow2 is unusual?  I can believe snapshot is less common,
>>> though I use it all the time.
>>>
>>> You'd normally have to add a feature for something like this.  I don't
>>> think this is different.
>>>        
>> Why do we need to add a feature for this?
>>      
> Because cache=writeback should *not* lie to the guest?
>
>    

No, it should.

There are two possible semantics to cache=writeback:

- simulate a drive with a huge write cache; use fsync() to implement 
barriers
- tell the host that we aren't interested in data integrity, lie to the 
guest to get best performance

The first semantic is not very useful; guests don't expect huge write 
caches so you can't be sure of your integrity guarantees, and it's 
slower than cache=none due to double caching and extra copies.  The 
second semantic is not useful for production, but is very useful for 
testing out things where you aren't worried about host crashes and 
you're usually rebooting the guest very often (you can't rely on guest 
caches, so you want the host to cache).
Rusty Russell Aug. 28, 2009, 1:15 a.m. UTC | #8
On Thu, 27 Aug 2009 08:34:19 pm Avi Kivity wrote:
> There are two possible semantics to cache=writeback:
> 
> - simulate a drive with a huge write cache; use fsync() to implement 
> barriers
> - tell the host that we aren't interested in data integrity, lie to the 
> guest to get best performance

Why lie to the guest?  Just say we're not ordered, and don't support barriers.
Gets even *better* performance since it won't drain the queues.

Maybe you're thinking of full virtualization, where guest ignorance is
bliss.  But lying always gets us in trouble later on when other cases come
up.

> The second semantic is not useful for production, but is very useful for 
> testing out things where you aren't worries about host crashes and 
> you're usually rebooting the guest very often (you can't rely on guest 
> caches, so you want the host to cache).

This is not the ideal world; people will do things for performance "in
production".

Cheers,
Rusty.
Avi Kivity Aug. 28, 2009, 6:33 a.m. UTC | #9
On 08/28/2009 04:15 AM, Rusty Russell wrote:
> On Thu, 27 Aug 2009 08:34:19 pm Avi Kivity wrote:
>    
>> There are two possible semantics to cache=writeback:
>>
>> - simulate a drive with a huge write cache; use fsync() to implement
>> barriers
>> - tell the host that we aren't interested in data integrity, lie to the
>> guest to get best performance
>>      
> Why lie to the guest?  Just say we're not ordered, and don't support barriers.
> Gets even *better* performance since it won't drain the queues.
>    

In that case, honesty is preferable.  It means testing with 
cache=writeback exercises different guest code paths, but that's acceptable.

> Maybe you're thinking of full virtualization, where guest ignorance is
> bliss.  But lying always gets us in trouble later on when other cases come
> up.
>
>    
>> The second semantic is not useful for production, but is very useful for
>> testing out things where you aren't worried about host crashes and
>> you're usually rebooting the guest very often (you can't rely on guest
>> caches, so you want the host to cache).
>>      
> This is not the ideal world; people will do things for performance "in
> production".
>
>    

We found that cache=none is faster than cache=writeback when you're 
really interested in performance (no qcow2).
Christoph Hellwig Sept. 17, 2009, 5:31 p.m. UTC | #10
Err, I'll take this one back for now pending some more discussion.
What we need more urgently is the writeback cache flag, which is now
implemented in qemu, patch following ASAP.

Rusty Russell Sept. 22, 2009, 6:27 a.m. UTC | #11
On Fri, 18 Sep 2009 03:01:42 am Christoph Hellwig wrote:
> Err, I'll take this one back for now pending some more discussion.
> What we need more urgently is the writeback cache flag, which is now
> implemented in qemu, patch following ASAP.

OK, still catching up on mail.  I'll push them out of the queue for now.

Thanks,
Rusty.

Patch

Index: linux-2.6/drivers/block/virtio_blk.c
===================================================================
--- linux-2.6.orig/drivers/block/virtio_blk.c	2009-08-20 17:41:37.019718433 -0300
+++ linux-2.6/drivers/block/virtio_blk.c	2009-08-20 17:45:40.511747922 -0300
@@ -336,9 +336,16 @@  static int __devinit virtblk_probe(struc
 	vblk->disk->driverfs_dev = &vdev->dev;
 	index++;
 
-	/* If barriers are supported, tell block layer that queue is ordered */
+	/*
+	 * If barriers are supported, tell block layer that queue is ordered.
+	 *
+	 * If no barriers are supported assume the host uses synchronous
+	 * writes and just drain the queue before and after the barrier.
+	 */
 	if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER))
 		blk_queue_ordered(vblk->disk->queue, QUEUE_ORDERED_TAG, NULL);
+	else
+		blk_queue_ordered(vblk->disk->queue, QUEUE_ORDERED_DRAIN, NULL);
 
 	/* If disk is read-only in the host, the guest should obey */
 	if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))