Message ID | Pine.LNX.4.64.0903241000010.29968@hs20-bc2-1.build.redhat.com (mailing list archive) |
---|---|
State | RFC, archived |
On Tue, Mar 24 2009, Mikulas Patocka wrote:
> On Mon, 23 Mar 2009, Eric Sandeen wrote:
>
> > I've noticed that on 2.6.29-rcX, with Andi's patch
> > (ab4c1424882be9cd70b89abf2b484add355712fa, dm: support barriers on
> > simple devices) barriers are still getting rejected on these simple
> > devices.
> >
> > The problem is in __generic_make_request():
> >
> >         if (bio_barrier(bio) && bio_has_data(bio) &&
> >             (q->next_ordered == QUEUE_ORDERED_NONE)) {
> >                 err = -EOPNOTSUPP;
> >                 goto end_io;
> >         }
> >
> > and dm isn't flagging its queue as supporting ordered writes, so it's
> > rejected here.
> >
> > Doing something like this:
> >
> > +       if (t->barriers_supported)
> > +               blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
> >
> > somewhere in dm (I stuck it in dm_table_set_restrictions() - almost
> > certainly the wrong thing to do) did get my dm-linear device to mount
> > with xfs, w/o xfs complaining that its mount-time barrier tests failed.
> >
> > So what's the right way around this? What should dm (or md for that
> > matter) advertise on their queues about ordered-ness? Should there be
> > some sort of "QUEUE_ORDERED_PASSTHROUGH" or something to say "this level
> > doesn't care, ask the next level" or somesuch? Or should it inherit the
> > flag from the next level down? Ideas?
> >
> > Thanks,
> > -Eric
>
> Hi
>
> This is a misdesign in the generic bio layer and it should be fixed
> there. I think it is blocking barrier support in md-raid1 too. Jens,
> please apply the attached patch.
>
> Mikulas
>
> ----
>
> Move the test for not-supported barriers to __make_request.
>
> This test prevents barriers from being dispatched to device mapper
> and md.
>
> This test is sensible only for drivers that use requests (such as disk
> drivers), not for drivers that use bios.
>
> It is better to fix it in generic code than to make a workaround for it
> in device mapper and md.

So you audited any ->make_request_fn style driver and made sure they
rejected barriers?

> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>
> ---
>  block/blk-core.c |   11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> Index: linux-2.6.29-rc6-devel/block/blk-core.c
> ===================================================================
> --- linux-2.6.29-rc6-devel.orig/block/blk-core.c	2009-02-23 18:43:37.000000000 +0100
> +++ linux-2.6.29-rc6-devel/block/blk-core.c	2009-02-23 18:44:27.000000000 +0100
> @@ -1145,6 +1145,12 @@ static int __make_request(struct request
>  	const int unplug = bio_unplug(bio);
>  	int rw_flags;
>
> +	if (bio_barrier(bio) && bio_has_data(bio) &&
> +	    (q->next_ordered == QUEUE_ORDERED_NONE)) {
> +		bio_endio(bio, -EOPNOTSUPP);
> +		return 0;
> +	}
> +
>  	nr_sectors = bio_sectors(bio);
>
>  	/*
> @@ -1450,11 +1456,6 @@ static inline void __generic_make_reques
>  			err = -EOPNOTSUPP;
>  			goto end_io;
>  		}
> -		if (bio_barrier(bio) && bio_has_data(bio) &&
> -		    (q->next_ordered == QUEUE_ORDERED_NONE)) {
> -			err = -EOPNOTSUPP;
> -			goto end_io;
> -		}
>
>  		ret = q->make_request_fn(q, bio);
>  	} while (ret);
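
If the test moves as the patch proposes, a bio-based ->make_request_fn
driver that cannot honour ordering (aoeblk and pktcdvd in the audit
that follows) is no longer shielded by the generic layer and has to
fail barrier bios itself. A minimal sketch against the 2.6.29-era bio
API; the function name is made up:

	#include <linux/bio.h>
	#include <linux/blkdev.h>

	static int example_make_request(struct request_queue *q, struct bio *bio)
	{
		/* This driver may cache or reorder writes and has no way
		 * to order around a barrier: refuse it explicitly, since
		 * __generic_make_request() no longer does it for us. */
		if (bio_barrier(bio) && bio_has_data(bio)) {
			bio_endio(bio, -EOPNOTSUPP);
			return 0;
		}

		/* ... normal bio handling ... */
		return 0;
	}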
On Tue, 24 Mar 2009, Jens Axboe wrote:

> On Tue, Mar 24 2009, Mikulas Patocka wrote:
>
> [...]
>
> > It is better to fix it in generic code than to make a workaround for
> > it in device mapper and md.
>
> So you audited any ->make_request_fn style driver and made sure they
> rejected barriers?

I didn't.

If you grep for it, you get:

./arch/powerpc/sysdev/axonram.c:
	doesn't reject barriers, but it is not needed, it ends all bios
	in the make_request routine

./drivers/block/aoe/aoeblk.c:
	* doesn't reject barriers, should be modified to do so

./drivers/block/brd.c
	doesn't reject barriers, doesn't need to, ends all bios in
	make_request

./drivers/block/loop.c:
	doesn't reject barriers, it's OK because it doesn't reorder
	requests

./drivers/block/pktcdvd.c
	* doesn't reject barriers, should be modified to do so

./drivers/block/umem.c
	* doesn't reject barriers, I don't know if it reorders requests
	or not

./drivers/s390/block/xpram.c
	doesn't reject barriers, doesn't need to, ends bios immediately

./drivers/md/raid0.c
	rejects barriers

./drivers/md/raid1.c
	supports barriers

./drivers/md/raid10.c
	rejects barriers

./drivers/md/raid5.c
	rejects barriers

./drivers/md/linear.c
	rejects barriers

./drivers/md/dm.c
	supports barriers partially

Mikulas
On Tue, Mar 24 2009, Mikulas Patocka wrote:

> On Tue, 24 Mar 2009, Jens Axboe wrote:
>
> > So you audited any ->make_request_fn style driver and made sure they
> > rejected barriers?
>
> I didn't.
>
> If you grep for it, you get:
>
> [...]
>
> ./drivers/block/loop.c:
> 	doesn't reject barriers, it's OK because it doesn't reorder
> 	requests
>
> [...]

Not reordering is not enough to support the barrier primitive, unless
you always go to the same device and pass the barrier flag down with
it.

I think having the check in generic_make_request() is perfectly fine,
even if the value doesn't completely apply to stacked devices. Perhaps
we can add such a value, then. My main point is that barrier support
should be opt-in, not a default thing. Over time we should have support
everywhere, but it needs to be checked, audited, and trusted.
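
For reference, the opt-in Jens is describing is the existing
request-based mechanism: a driver that has verified it never reorders
writes advertises an ordered mode on its queue. A minimal sketch
against the 2.6.29 API (example_init_queue is a made-up name):

	#include <linux/blkdev.h>

	static int example_init_queue(struct request_queue *q)
	{
		/* Without this call, q->next_ordered stays
		 * QUEUE_ORDERED_NONE and barrier writes fail with
		 * -EOPNOTSUPP. DRAIN means "drain the queue around the
		 * barrier, no cache flush needed", so it is only correct
		 * for write-through caching. */
		return blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
	}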
On Tue, 24 Mar 2009, Jens Axboe wrote:

> On Tue, Mar 24 2009, Mikulas Patocka wrote:
>
> [...]
>
> Not reordering is not enough to support the barrier primitive, unless
> you always go to the same device and pass the barrier flag down with
> it.

For single-device drivers (not md/dm), not reordering should be good
enough to claim barrier support.

> I think having the check in generic_make_request() is perfectly fine,
> even if the value doesn't completely apply to stacked devices. Perhaps
> we can add such a value, then. My main point is that barrier support
> should be opt-in, not a default thing.

So make some flag for these bio-based devices, so that they don't have
to use one of those request-based options (which are meaningless for a
non-request-based device).

> Over time we should have support everywhere, but it needs to be
> checked, audited, and trusted.

BTW, what is the rule for barriers if the device can't prevent the
requests from being delayed or reordered? (For example, ATA<=3 disks
with a cache that lack the cache-flush command, or flash cards that do
write caching anyway and it can't be turned off.) Should they support
barriers and make a best effort? Or should they reject barriers to
inform the calling code that they have no data consistency?

Mikulas
On Tue, Mar 24 2009, Mikulas Patocka wrote:

> [...]
>
> For single-device drivers (not md/dm), not reordering should be good
> enough to claim barrier support.

Not reordering is what the barrier is all about; the problem is how far
down you extend that guarantee. For the Linux barrier, it's ALL the way
down to and including the hardware. So it's only good enough if it
includes the device not reordering the write, and signalling completion
when it's safe. "Good enough" is not an option, it's all or nothing.

> So make some flag for these bio-based devices, so that they don't have
> to use one of those request-based options (which are meaningless for a
> non-request-based device).

Sure, but as I said, I think it's mainly a cosmetic issue. Signalling
simple barrier support is just fine.

> BTW, what is the rule for barriers if the device can't prevent the
> requests from being delayed or reordered? (For example, ATA<=3 disks
> with a cache that lack the cache-flush command, or flash cards that do
> write caching anyway and it can't be turned off.) Should they support
> barriers and make a best effort? Or should they reject barriers to
> inform the calling code that they have no data consistency?

If they can't flush the cache, then they must reject barriers unless
they have write-through caching.
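
The rule Jens states maps onto the ordered modes the block layer
already offers; roughly how a low-level driver would pick one, loosely
modelled on what sd does in this era (example_* names are made up, and
wce/has_flush stand in for the driver's own cache probing):

	#include <linux/blkdev.h>

	static void example_prepare_flush(struct request_queue *q,
					  struct request *rq)
	{
		/* Turn rq into this device's cache-flush command, the way
		 * sd builds a SYNCHRONIZE CACHE command here. Omitted. */
	}

	static void example_choose_ordered(struct request_queue *q,
					   int wce, int has_flush)
	{
		if (!wce)		/* write-through: draining is enough */
			blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
		else if (has_flush)	/* write-back, but flushable */
			blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH,
					  example_prepare_flush);
		else			/* write-back, no flush: reject barriers */
			blk_queue_ordered(q, QUEUE_ORDERED_NONE, NULL);
	}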
> > BTW, what is the rule for barriers if the device can't prevent the
> > requests from being delayed or reordered? [...] Should they support
> > barriers and make a best effort? Or should they reject barriers to
> > inform the calling code that they have no data consistency?
>
> If they can't flush the cache, then they must reject barriers unless
> they have write-through caching.

... and you suppose that journaled filesystems will use this error and
mark the filesystem for fsck if they are running over a device that
doesn't support consistency?

In theory it would be nice; in practice it doesn't work this way,
because many devices that *DO* support data consistency don't support
barriers (the most common are DM and MD when run over a disk without a
write cache).

So I think there should be a flag (this device does/doesn't support
data consistency) that the journaled filesystems can use to mark the
disk dirty for fsck. And if you implement this flag, you can accept
barriers always, for all kinds of devices, regardless of whether they
support consistency. You can then get rid of that -EOPNOTSUPP and
simplify filesystem code, because filesystems would no longer need two
commit paths and a clumsy way to restart -EOPNOTSUPPed requests.

Mikulas
On Wed, Mar 25 2009, Mikulas Patocka wrote:

> > If they can't flush the cache, then they must reject barriers unless
> > they have write-through caching.
>
> ... and you suppose that journaled filesystems will use this error and
> mark the filesystem for fsck if they are running over a device that
> doesn't support consistency?

No, but they can warn that data consistency isn't guaranteed. And they
all do, if you mount with barriers enabled and the barrier write fails.
If barriers aren't supported, the first one will fail. So either they
do lazy detection, or they do a trial barrier write at mount time.

So yes, I suppose that filesystems will use this error. Because that is
what they do.

> In theory it would be nice; in practice it doesn't work this way,
> because many devices that *DO* support data consistency don't support
> barriers (the most common are DM and MD when run over a disk without a
> write cache).

Your theory is nice, but most dm systems use write-back caching. Any
desktop uses write-back caching. Only higher-end disks default to
write-through caching.

> So I think there should be a flag (this device does/doesn't support
> data consistency) that the journaled filesystems can use to mark the
> disk dirty for fsck. [...]

And my point is that this case isn't interesting, because most setups
don't guarantee proper ordering. The error handling is complex, no
doubt about that. But the trial barrier test is pretty trivial and
could even be easily abstracted out. If a later barrier write fails,
then that's really no different than if a normal write fails. Error
handling is not easy in that case.
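
The lazy-detect path Jens mentions is roughly the pattern jbd uses for
its commit block in this era: try the commit write as a barrier, and on
-EOPNOTSUPP disable barriers and reissue it as a plain write. A
simplified sketch, not the verbatim jbd code (example_* names are made
up):

	#include <linux/buffer_head.h>

	static int example_write_commit_block(struct buffer_head *bh,
					      int *use_barrier)
	{
		int ret;

		if (*use_barrier)
			set_buffer_ordered(bh);	/* submit_bh maps this to a barrier bio */

		ret = sync_dirty_buffer(bh);

		if (ret == -EOPNOTSUPP && *use_barrier) {
			/* a layer below refused the barrier: remember that
			 * for future commits and reissue as a plain write */
			*use_barrier = 0;
			clear_buffer_ordered(bh);
			lock_buffer(bh);
			set_buffer_uptodate(bh);
			clear_buffer_dirty(bh);
			ret = submit_bh(WRITE, bh);
			wait_on_buffer(bh);
			if (!ret && !buffer_uptodate(bh))
				ret = -EIO;
		}
		return ret;
	}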
> > > If they can't flush the cache, then they must reject barriers
> > > unless they have write-through caching.
> >
> > ... and you suppose that journaled filesystems will use this error
> > and mark the filesystem for fsck if they are running over a device
> > that doesn't support consistency?
>
> No, but they can warn that data consistency isn't guaranteed. And they
> all do, if you mount with barriers enabled and the barrier write
> fails. If barriers aren't supported, the first one will fail. So
> either they do lazy detection, or they do a trial barrier write at
> mount time.

The user shouldn't really be required to know what barriers are, which
drivers support them and which don't, and which drivers maintain
consistency without barriers and which don't.

The user only needs to know whether he must run fsck in the case of a
power failure or not --- and that -EOPNOTSUPP error and warnings about
failed barriers give him no information about that.

> So yes, I suppose that filesystems will use this error. Because that
> is what they do.
>
> > In theory it would be nice; in practice it doesn't work this way,
> > because many devices that *DO* support data consistency don't
> > support barriers (the most common are DM and MD when run over a disk
> > without a write cache).
>
> Your theory is nice, but most dm systems use write-back caching.

If they do, the filesystem should know about it and fsck the partition
in the case of a crash.

> Any desktop uses write-back caching. Only higher-end disks default to
> write-through caching.
>
> [...]
>
> And my point is that this case isn't interesting, because most setups
> don't guarantee proper ordering.

If the ordering isn't guaranteed, the filesystem should know about it,
and mark the partition for fsck. That's why I'm suggesting to use a
flag for that. That flag could also be propagated up through md and dm.

The reasoning "write barriers aren't supported => the device doesn't
guarantee consistency" isn't valid.

> The error handling is complex, no doubt about that. But the trial
> barrier test is pretty trivial and could even be easily abstracted
> out. If a later barrier write fails, then that's really no different
> than if a normal write fails. Error handling is not easy in that case.

I had a discussion with Andi about it some time ago. The conclusion was
that all the current filesystems handle barriers failing in the middle
of the operation without functionality loss, but it makes barriers
useless for any performance-sensitive tasks (commits that wouldn't
block concurrent activity). Non-blocking commits could only be
implemented if barriers don't fail.

Mikulas
On Wed, Mar 25 2009, Mikulas Patocka wrote:

> The user shouldn't really be required to know what barriers are, which
> drivers support them and which don't, and which drivers maintain
> consistency without barriers and which don't.
>
> The user only needs to know whether he must run fsck in the case of a
> power failure or not --- and that -EOPNOTSUPP error and warnings about
> failed barriers give him no information about that.

I completely agree, but that's "just" a usability issue. Ext4 will tell
you that barriers failed and are now disabled; not very informative.
XFS will tell you something similar.

> If the ordering isn't guaranteed, the filesystem should know about it,
> and mark the partition for fsck. That's why I'm suggesting to use a
> flag for that. That flag could also be propagated up through md and
> dm.

We can do that, not a problem. The problem is that ordering is almost
never preserved; SCSI does not use ordered tags because it hasn't been
verified that its error path doesn't reorder by mistake. So right now
you can basically use 'false' as that flag.

> The reasoning "write barriers aren't supported => the device doesn't
> guarantee consistency" isn't valid.

It's valid in the sense that it's the only RELIABLE primitive we have.
Are you really suggesting that we just assume any device is fully
ordered, unless proven otherwise?

> I had a discussion with Andi about it some time ago. The conclusion
> was that all the current filesystems handle barriers failing in the
> middle of the operation without functionality loss, but it makes
> barriers useless for any performance-sensitive tasks (commits that
> wouldn't block concurrent activity). Non-blocking commits could only
> be implemented if barriers don't fail.

As long as you do a trial barrier like XFS does, barriers will not fail
unless you have a media error. Things would also be much easier if
writes never failed.
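
The mount-time trial Jens refers to can be a single throwaway barrier
write. A sketch at the bio level against the 2.6.29 API, in the spirit
of XFS's mount-time test rather than its actual code; names are made
up, and the caller must pass a page and a sector it is safe to rewrite
(e.g. a superblock copy):

	#include <linux/bio.h>
	#include <linux/blkdev.h>
	#include <linux/completion.h>
	#include <linux/fs.h>

	struct trial {
		struct completion done;
		int error;
	};

	static void trial_end_io(struct bio *bio, int error)
	{
		struct trial *t = bio->bi_private;

		t->error = error;
		complete(&t->done);
	}

	static int example_trial_barrier(struct block_device *bdev,
					 struct page *page, sector_t sector)
	{
		struct trial t;
		struct bio *bio;

		bio = bio_alloc(GFP_KERNEL, 1);
		if (!bio)
			return -ENOMEM;

		init_completion(&t.done);
		t.error = 0;

		bio->bi_bdev = bdev;
		bio->bi_sector = sector;
		bio->bi_end_io = trial_end_io;
		bio->bi_private = &t;
		bio_add_page(bio, page, 512, 0);

		submit_bio(WRITE_BARRIER, bio);
		wait_for_completion(&t.done);
		bio_put(bio);

		return t.error;	/* -EOPNOTSUPP: mount without barriers */
	}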
On Wed, 2009-03-25 at 18:39 -0400, Mikulas Patocka wrote:

> > The error handling is complex, no doubt about that. But the trial
> > barrier test is pretty trivial and could even be easily abstracted
> > out. If a later barrier write fails, then that's really no different
> > than if a normal write fails. Error handling is not easy in that
> > case.
>
> I had a discussion with Andi about it some time ago. The conclusion
> was that all the current filesystems handle barriers failing in the
> middle of the operation without functionality loss, but it makes
> barriers useless for any performance-sensitive tasks (commits that
> wouldn't block concurrent activity). Non-blocking commits could only
> be implemented if barriers don't fail.

If a barrier fails at runtime, the filesystems do fall back to not
doing barriers without real problems. But that's because they don't
actually rely on the barriers to decide whether an async commit is a
good idea.

One exception is reiserfs, which does one wait_on_buffer at a later
time if barriers are on. But this isn't an async commit, it's just
moving an unplug.

In general, async commits happen with threads, and they aren't related
to barriers. Barriers don't really give us error handling, and they are
at the very end of a long chain of technical problems around commits
that don't block concurrent activity.

-chris
On Thu, 26 Mar 2009, Jens Axboe wrote:

> On Wed, Mar 25 2009, Mikulas Patocka wrote:
>
> > If the ordering isn't guaranteed, the filesystem should know about
> > it, and mark the partition for fsck. That's why I'm suggesting to
> > use a flag for that. That flag could also be propagated up through
> > md and dm.
>
> We can do that, not a problem. The problem is that ordering is almost
> never preserved; SCSI does not use ordered tags because it hasn't been
> verified that its error path doesn't reorder by mistake. So right now
> you can basically use 'false' as that flag.

There are three ordering guarantees:

1. nothing (for devices with a write cache but without cache control)

2. non-cached ordering: the sequence [submit req a, end req a, submit
req b, end req b] establishes the ordering. It is guaranteed that when
a request ends successfully, it is on the medium. This is what all the
filesystems, md and dm assume about disks. This consistency model was
in use long before barriers came in.

3. barrier ordering: ordering is done with barriers. [submit req a, end
req a, submit req b, end req b] won't guarantee the ordering of a and
b; a barrier must be inserted.

--- so you can make two bitflags that differentiate these models. In
the current kernel, models (1) and (2) cannot be differentiated in any
way, and (3) can be differentiated only after a trial write, which
won't guarantee that (3) stays valid afterwards.

> > The reasoning "write barriers aren't supported => the device doesn't
> > guarantee consistency" isn't valid.
>
> It's valid in the sense that it's the only RELIABLE primitive we have.
> Are you really suggesting that we just assume any device is fully
> ordered, unless proven otherwise?

If someone implements "write barriers aren't supported => run fsck",
then a lot of systems start fscking needlessly (for example those using
md or dm without a write cache) and become inoperative for a long time
because of that. So no one can really implement this logic, and
filesystems don't run fsck at all when operated over a device that
doesn't support ordering. So you get data corruption if you get a crash
on those devices.

> As long as you do a trial barrier like XFS does, barriers will not
> fail unless you have a media error.

No.

The barrier can be cancelled with -EOPNOTSUPP at any time. Andi Kleen
submitted a patch that implements failing barriers for device mapper,
and he says that md-raid1 does the same thing.

Filesystems handle these randomly failed barriers, but the downside is
that they must not submit any request concurrently with the barrier.
Also, that -EOPNOTSUPP restarting code is really crap: the request
cannot be restarted from bi_end_io, so bi_end_io needs to hand it off
to another thread for a retry without the barrier.

See this patch: http://lkml.org/lkml/2008/12/4/433 (and the associated
thread). The patch is silly, but it shows what is really happening and
what the filesystem must be prepared to deal with.

> Things would also be much easier if writes never failed.

I definitely agree that barriers shouldn't fail. So remove that
-EOPNOTSUPP error code altogether, make barriers always pass down to
all kinds of devices, and inform the caller via queue flags that the
device doesn't support ordering.

Mikulas
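
The hand-off Mikulas describes looks roughly like this: bi_end_io may
run in interrupt context and must not resubmit the bio itself, so the
-EOPNOTSUPP completion is punted to a work item that strips the barrier
flag and reissues the bio. A skeleton sketch only, not the code from
the patch linked above; the struct and function names are made up, and
saving/restoring the original completion handler is glossed over:

	#include <linux/bio.h>
	#include <linux/fs.h>
	#include <linux/slab.h>
	#include <linux/workqueue.h>

	struct barrier_retry {
		struct work_struct work;
		struct bio *bio;
	};

	/* runs in process context, where resubmission is allowed */
	static void barrier_retry_fn(struct work_struct *work)
	{
		struct barrier_retry *r = container_of(work, struct barrier_retry, work);
		struct bio *bio = r->bio;

		bio->bi_rw &= ~(1 << BIO_RW_BARRIER);	/* retry as a plain write */
		set_bit(BIO_UPTODATE, &bio->bi_flags);	/* undo the failed completion */
		/* the original bi_end_io must have been saved and restored
		 * somewhere (e.g. via bi_private); glossed over here */
		generic_make_request(bio);
		kfree(r);
	}

	/* bi_end_io: may run in interrupt context, must not resubmit */
	static void barrier_end_io(struct bio *bio, int error)
	{
		if (error == -EOPNOTSUPP) {
			struct barrier_retry *r = kmalloc(sizeof(*r), GFP_ATOMIC);

			if (r) {
				r->bio = bio;
				INIT_WORK(&r->work, barrier_retry_fn);
				schedule_work(&r->work);
				return;	/* completion deferred to the retry */
			}
			/* allocation failed: fall through, surface the error */
		}
		/* normal completion handling goes here */
	}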
On Mon, Mar 30 2009, Mikulas Patocka wrote:

> There are three ordering guarantees:
>
> 1. nothing (for devices with a write cache but without cache control)
>
> 2. non-cached ordering: the sequence [submit req a, end req a, submit
> req b, end req b] establishes the ordering. It is guaranteed that when
> a request ends successfully, it is on the medium. This is what all the
> filesystems, md and dm assume about disks. This consistency model was
> in use long before barriers came in.
>
> 3. barrier ordering: ordering is done with barriers. [submit req a,
> end req a, submit req b, end req b] won't guarantee the ordering of a
> and b; a barrier must be inserted.

Plus the barrier also allows [submit req a, submit req b] and still
counts on ordering if either one of them is a barrier. It doesn't have
to be sync, like (2).

> --- so you can make two bitflags that differentiate these models. In
> the current kernel, models (1) and (2) cannot be differentiated in any
> way, and (3) can be differentiated only after a trial write, which
> won't guarantee that (3) stays valid afterwards.

But what's the point? Basically no devices are naturally ordered by
default. Either you need cache flushes, or you need to tell the device
not to reorder on a per-command basis.

> If someone implements "write barriers aren't supported => run fsck",
> then a lot of systems start fscking needlessly (for example those
> using md or dm without a write cache) and become inoperative for a
> long time because of that. So no one can really implement this logic,
> and filesystems don't run fsck at all when operated over a device that
> doesn't support ordering. So you get data corruption if you get a
> crash on those devices.

Nobody is suggesting that, it's just not a feasible approach. But you
have to warn if you don't know whether the device provides the ordering
guarantee you expect to provide consistency and integrity.

> The barrier can be cancelled with -EOPNOTSUPP at any time. Andi Kleen
> submitted a patch that implements failing barriers for device mapper,
> and he says that md-raid1 does the same thing.

You are right, if a device is reconfigured beneath you it may very well
begin to return -EOPNOTSUPP much later. I didn't take that into
account, I was considering only "plain" devices.

> Filesystems handle these randomly failed barriers, but the downside is
> that they must not submit any request concurrently with the barrier.
> Also, that -EOPNOTSUPP restarting code is really crap: the request
> cannot be restarted from bi_end_io, so bi_end_io needs to hand it off
> to another thread for a retry without the barrier.

It can, but it requires you to operate at the request level. So for
file systems that is problematic; it won't work, of course. It would
not be THAT hard to provide a helper to reissue the request. Not that
pretty, but...

> See this patch: http://lkml.org/lkml/2008/12/4/433 (and the associated
> thread). The patch is silly, but it shows what is really happening and
> what the filesystem must be prepared to deal with.

It's not that silly; we should add special barrier failing to the
CONFIG_FAIL stuff. You'd definitely want to exercise that in the file
system.

> I definitely agree that barriers shouldn't fail. So remove that
> -EOPNOTSUPP error code altogether, make barriers always pass down to
> all kinds of devices, and inform the caller via queue flags that the
> device doesn't support ordering.

Not a queue flag. Make it succeed to get rid of the whole retry
business, but flag the bio with the information anyway.
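
"Flag the bio" already has a precedent in this kernel:
blkdev_issue_flush() waits for its empty barrier and then inspects bio
flags rather than relying on the error code alone. Paraphrased below,
not the verbatim block/blk-barrier.c code (example_* names are mine):

	#include <linux/bio.h>
	#include <linux/blkdev.h>
	#include <linux/completion.h>
	#include <linux/fs.h>

	static void example_end_empty_barrier(struct bio *bio, int err)
	{
		complete(bio->bi_private);
	}

	static int example_issue_flush(struct block_device *bdev)
	{
		DECLARE_COMPLETION_ONSTACK(wait);
		struct bio *bio;
		int ret = 0;

		bio = bio_alloc(GFP_KERNEL, 0);
		bio->bi_end_io = example_end_empty_barrier;
		bio->bi_private = &wait;
		bio->bi_bdev = bdev;
		submit_bio(WRITE_BARRIER, bio);

		wait_for_completion(&wait);

		if (bio_flagged(bio, BIO_EOPNOTSUPP))
			ret = -EOPNOTSUPP;	/* completed, "can't order" flagged */
		else if (!bio_flagged(bio, BIO_UPTODATE))
			ret = -EIO;

		bio_put(bio);
		return ret;
	}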
On Tue, 31 Mar 2009, Jens Axboe wrote:

> Nobody is suggesting that, it's just not a feasible approach.

I am saying that the filesystem should run fsck if a journaled
filesystem is mounted on an unsafe device and a crash happens.

> But you have to warn if you don't know whether the device provides the
> ordering guarantee you expect to provide consistency and integrity.

The warning about missing barriers (or other actions) should be printed
only if the write cache is enabled. But there's no way for a filesystem
on top of several dm or md layers to find out whether the disk is
running with hdparm -W 0 or hdparm -W 1.

> You are right, if a device is reconfigured beneath you it may very
> well begin to return -EOPNOTSUPP much later. I didn't take that into
> account, I was considering only "plain" devices.
>
> [...]
>
> It can, but it requires you to operate at the request level. So for
> file systems that is problematic; it won't work, of course. It would
> not be THAT hard to provide a helper to reissue the request. Not that
> pretty, but...

And it makes barriers useless for ordering.

The filesystem can't do [submit req a], [submit barrier req b],
[submit req c] and assume that the requests will be ordered. If [b]
fails with -EOPNOTSUPP, [a] and [c] could already have been reordered,
and the data corruption has already happened. Even if you catch [b]'s
error and resubmit it as a non-barrier request, it's too late.

So, as a result of this complication, all the existing filesystems send
just one barrier request at a time and do not try to overlap it with
any other write requests.

So I'm wondering why Linux developers designed a barrier interface with
a complex specification and a complex implementation, when the
interface is useless for providing any request ordering and is no
better than the q->issue_flush_fn method or whatever was there before.
Obviously, the whole barrier thing was designed by a person who never
used it in a filesystem.

> Not a queue flag. Make it succeed to get rid of the whole retry
> business, but flag the bio with the information anyway.

That's a possibility too.

Mikulas
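
The overlapped submission Mikulas says filesystems cannot risk, spelled
out as code (an illustrative sketch only; WRITE_BARRIER is the
2.6.29-era request-flag macro):

	#include <linux/bio.h>
	#include <linux/fs.h>

	/* What a journal would like to do for a non-blocking commit: */
	static void unsafe_overlapped_commit(struct bio *a, struct bio *b,
					     struct bio *c)
	{
		submit_bio(WRITE, a);		/* journal blocks */
		submit_bio(WRITE_BARRIER, b);	/* commit record */
		submit_bio(WRITE, c);		/* post-commit writes */

		/* If b later completes with -EOPNOTSUPP, a and c may
		 * already have passed each other; reissuing b without the
		 * barrier flag cannot restore the ordering. Hence real
		 * filesystems drain all outstanding I/O and issue
		 * barriers one at a time. */
	}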
On Thu, Apr 02 2009, Mikulas Patocka wrote:
> 
> On Tue, 31 Mar 2009, Jens Axboe wrote:
> 
> > On Mon, Mar 30 2009, Mikulas Patocka wrote:
> > > On Thu, 26 Mar 2009, Jens Axboe wrote:
> > > > On Wed, Mar 25 2009, Mikulas Patocka wrote:
> > > > > > > So I think there should be a flag (this device does/doesn't
> > > > > > > support data consistency) that the journaled filesystems can
> > > > > > > use to mark the disk dirty for fsck. And if you implement this
> > > > > > > flag, you can always accept barriers, for all kinds of
> > > > > > > devices, regardless of whether they support consistency. You
> > > > > > > can then get rid of that -EOPNOTSUPP and simplify filesystem
> > > > > > > code, because they'd no longer need two commit paths and a
> > > > > > > clumsy way to restart -EOPNOTSUPPed requests.
> > > > > > 
> > > > > > And my point is that this case isn't interesting, because most
> > > > > > setups don't guarantee proper ordering.
> > > > > 
> > > > > If the ordering isn't guaranteed, the filesystem should know about
> > > > > it, and mark the partition for fsck. That's why I'm suggesting to
> > > > > use a flag for that. That flag could also be propagated up through
> > > > > md and dm.
> > > > 
> > > > We can do that, not a problem. The problem is that ordering is
> > > > almost never preserved; SCSI does not use ordered tags because it
> > > > hasn't verified that its error path doesn't reorder by mistake. So
> > > > right now you can basically use 'false' as that flag.
> > > 
> > > There are three ordering guarantees:
> > > 
> > > 1. - nothing (for devices with a write cache but without cache
> > > control)
> > > 
> > > 2. - non-cached ordering: the sequence [submit req a, end req a,
> > > submit req b, end req b] will enforce the ordering. It is guaranteed
> > > that when a request ends successfully, it is on the medium. This is
> > > what all the filesystems, md and dm assume about disks. This
> > > consistency model was in use long before barriers came in.
> > > 
> > > 3. - barrier ordering: ordering is done with barriers; [submit req a,
> > > end req a, submit req b, end req b] won't guarantee the ordering of a
> > > and b, a barrier must be inserted.
> > 
> > Plus the barrier also allows [submit req a, submit req b] while still
> > counting on ordering if either one of them is a barrier. It doesn't
> > have to be synchronous, like (2).
> > 
> > > --- so you can make two bit flags that differentiate these models. In
> > > the current kernel, models (1) and (2) cannot be differentiated in
> > > any way. (3) can be differentiated only after a trial write, and even
> > > that won't guarantee that (3) stays valid later.
> > 
> > But what's the point? Basically no devices are naturally ordered by
> > default. Either you need cache flushes, or you need to tell the device
> > not to reorder on a per-command basis.
> > 
> > > > > The reasoning "write barriers aren't supported => the device
> > > > > doesn't guarantee consistency" isn't valid.
> > > > 
> > > > It's valid in the sense that it's the only RELIABLE primitive we
> > > > have. Are you really suggesting that we just assume any device is
> > > > fully ordered, unless proven otherwise?
> > > 
> > > If someone implements "write barriers aren't supported => run fsck",
> > > then a lot of systems start fscking needlessly (for example those
> > > using md or dm without a write cache) and become inoperable for a
> > > long time because of that. So no one can really implement this logic,
> > > and filesystems don't run fsck at all when operated over a device
> > > that doesn't support ordering. So you get data corruption if you
> > > crash on those devices.
> > 
> > Nobody is suggesting that, it's just not a feasible approach. But you
> 
> I am saying that the filesystem should run fsck if a journaled filesystem
> is mounted on an unsafe device and a crash happens.
> 
> > have to warn if you don't know whether it provides the ordering
> > guarantee you expect to provide consistency and integrity.
> 
> The warning about missing barriers (or other actions) should be printed
> only if the write cache is enabled. But there's no way a filesystem on
> top of several dm or md layers can find out whether the disk is running
> with hdparm -W 0 or hdparm -W 1.

Right, you can't possibly know that. Hence we have to print the warning.

> > > The barrier can be cancelled with -EOPNOTSUPP at any time. Andi Kleen
> > > submitted a patch that implements failing barriers for device mapper,
> > > and he says that md-raid1 does the same thing.
> > 
> > You are right, if a device is reconfigured beneath you it may very well
> > begin to return -EOPNOTSUPP much later. I didn't take that into
> > account, I was considering only "plain" devices.
> > 
> > > Filesystems handle these randomly failed barriers, but the downside
> > > is that they must not submit any request concurrently with the
> > > barrier. Also, that -EOPNOTSUPP restarting code is really crap: the
> > > request cannot be restarted from bi_end_io, so bi_end_io needs to
> > > hand it off to another thread for a retry without the barrier.
> > 
> > It can, but it requires you to operate at the request level. So for
> > file systems that is problematic; it won't work, of course. It would
> > not be THAT hard to provide a helper to reissue the request. Not that
> > pretty, but...
> 
> And it makes barriers useless for ordering.
> 
> The filesystem can't do [submit req a], [submit barrier req b], [submit
> req c] and assume that the requests will be ordered. If [b] fails with
> -EOPNOTSUPP, [a] and [c] could already have been reordered and the data
> corruption has already happened. Even if you catch [b]'s error and
> resubmit it as a non-barrier request, it's too late.
> 
> So, as a result of this complication, all the existing filesystems send
> just one barrier request and do not try to overlap it with any other
> write requests.
> 
> So I'm wondering why Linux developers designed a barrier interface with
> a complex specification and a complex implementation, when the interface
> is useless for providing any request ordering and is no better than the
> q->issue_flush_fn method or whatever was there before. Obviously, the
> whole barrier thing was designed by a person who never used it in a
> filesystem.

That's not quite true, it was done in conjunction with file system
people. At a certain level, we are restricted by what the hardware can
actually do. It's certainly possible to make sure your storage stack can
support barriers and be safe in that regard, but it's certainly also true
that reconfiguring devices may void that guarantee. So it's not perfect,
but it's the best we can do. The worst part is that it's virtually
impossible to inform upper layers of such limitations.

If we get rid of -EOPNOTSUPP and just warn in such cases, then you should
never see -EOPNOTSUPP in the above sequence. You may not actually be
safe, hence we print a warning. It'll also make the whole thing a lot
less complex.

And to wrap up the history of barriers: there was NOTHING before.
->issue_flush_fn is a later addition to just force a flush for fsync()
and friends; the original implementation was just a data bio/bh with
barrier semantics, permitting no reordering before or after the data
passed.

Nobody was interested in barriers when they were done. Nobody. The fact
that it has taken 6 years or so for them to emerge as a hot topic for
data consistency should make that quite obvious. The original
implementation was basically a joint effort, with Chris on the reiser
side, EMC as the hw vendor, and me doing the block implementation.
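To make the -EOPNOTSUPP handling discussed above concrete, here is a
minimal sketch against the 2.6.29-era bio API of the synchronous fallback
that filesystems of the time had to implement. It is illustrative only:
commit_ctx, commit_end_io and write_commit_block are made-up names, and
the resubmission assumes the bio was rejected before any of its data was
transferred, which is what the early rejection path in the block layer
does.

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/completion.h>
#include <linux/fs.h>

struct commit_ctx {
	struct completion done;
	int error;
};

/* runs in interrupt context: no resubmission allowed from here */
static void commit_end_io(struct bio *bio, int error)
{
	struct commit_ctx *ctx = bio->bi_private;

	ctx->error = error;
	complete(&ctx->done);
}

static int write_commit_block(struct bio *bio)
{
	struct commit_ctx ctx;

	init_completion(&ctx.done);
	bio->bi_end_io = commit_end_io;
	bio->bi_private = &ctx;

	submit_bio(WRITE_BARRIER, bio);
	wait_for_completion(&ctx.done);

	if (ctx.error == -EOPNOTSUPP) {
		/*
		 * The device (or a dm/md layer under it) rejected the
		 * barrier.  Retry as a plain write: ordering against any
		 * other in-flight request is already lost, which is why
		 * filesystems never overlap a barrier with other writes.
		 */
		clear_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
		init_completion(&ctx.done);
		submit_bio(WRITE, bio);
		wait_for_completion(&ctx.done);
	}
	return ctx.error;
}

Because the caller waits for the barrier before issuing anything else,
nothing can be reordered around the retry; this is exactly the "send just
one barrier and never overlap it" pattern criticized in the message above.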
On 04/03/2009 04:11 AM, Jens Axboe wrote:
> That's not quite true, it was done in conjunction with file system
> people. At a certain level, we are restricted by what the hardware can
> actually do. It's certainly possible to make sure your storage stack can
> support barriers and be safe in that regard, but it's certainly also
> true that reconfiguring devices may void that guarantee. So it's not
> perfect, but it's the best we can do. The worst part is that it's
> virtually impossible to inform upper layers of such limitations.
>
> If we get rid of -EOPNOTSUPP and just warn in such cases, then you
> should never see -EOPNOTSUPP in the above sequence. You may not actually
> be safe, hence we print a warning. It'll also make the whole thing a lot
> less complex.
>
> And to wrap up the history of barriers: there was NOTHING before.
> ->issue_flush_fn is a later addition to just force a flush for fsync()
> and friends; the original implementation was just a data bio/bh with
> barrier semantics, permitting no reordering before or after the data
> passed.
>
> Nobody was interested in barriers when they were done. Nobody. The fact
> that it has taken 6 years or so for them to emerge as a hot topic for
> data consistency should make that quite obvious. The original
> implementation was basically a joint effort, with Chris on the reiser
> side, EMC as the hw vendor, and me doing the block implementation.

And I will restate that back at EMC we tested the original barriers (with
reiserfs mostly, a bit on ext3 and ext2) and saw a significant reduction
in file system integrity issues after power loss.

The vantage point I had at EMC while testing and deploying the original
barrier work done by Jens and Chris was pretty unique - full ability to
do root-cause analysis of any component failure when really needed, a
huge installed base which could send information home on a regular basis
about crashes/fsck instances/etc and the ability (with customer
permission) to dial into any box and diagnose issues remotely. Not to
mention access to drive vendors, to pressure them to make the flushes
more robust. The application was also able to validate that all
acknowledged writes were consistent.

Barriers do work as we have them, but as others have mentioned, it is not
a "free" win - fsync will actually move your data safely out to
persistent storage for a huge percentage of real users (including every
ATA/S-ATA and SAS drive I was able to test). The file systems I monitored
in production use without barriers were much less reliable.

As others have noted, some storage does not need barriers or flushes
(high end arrays, drives with no volatile write cache) and some need them
but stink (low cost USB flash sticks?), so warning is a good thing to
do...

ric

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On Sat, Apr 04, 2009 at 11:20:35AM -0400, Ric Wheeler wrote:
> Barriers do work as we have them, but as others have mentioned, it is
> not a "free" win - fsync will actually move your data safely out to
> persistent storage for a huge percentage of real users (including every
> ATA/S-ATA and SAS drive I was able to test). The file systems I
> monitored in production use without barriers were much less reliable.

The problem is that, as long as you're not under memory pressure and not
pushing the filesystem heavily, ext3 didn't corrupt *that* often without
barriers. So people got away with it "most of the time" --- just as
applications replacing files by rewriting them in place using truncate
and without fsync would "usually" not lose data after a crash if they
were using ext3 in data=ordered mode. This caused people to get
lazy/sloppy.

So yes, barriers were something that was largely ignored for a long time.
After all, in a server environment with UPSes and without crappy
proprietary video drivers, Linux systems didn't crash that often anyway.
So you really needed a large base of systems, and the ability to
root-cause failures such as Ric had at EMC, to see the problem.

Even now, the reason why ext3 doesn't have barriers enabled by default
(although we did make them the default for ext4) is that Andrew doesn't
believe Chris's replication case is likely to be true for most users in
practice, and he's concerned about the performance degradation of
barriers. He's basically depending on the fact that "usually" you can get
away without using barriers. Sigh....

- Ted

P.S. Of course, distributions should feel free to consider changing the
default on their kernels. SLES already has, if memory serves correctly. I
don't know if RHEL has yet.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On 04/04/2009 06:28 PM, Theodore Tso wrote:
> On Sat, Apr 04, 2009 at 11:20:35AM -0400, Ric Wheeler wrote:
>> Barriers do work as we have them, but as others have mentioned, it is
>> not a "free" win - fsync will actually move your data safely out to
>> persistent storage for a huge percentage of real users (including
>> every ATA/S-ATA and SAS drive I was able to test). The file systems I
>> monitored in production use without barriers were much less reliable.
>
> The problem is that, as long as you're not under memory pressure and
> not pushing the filesystem heavily, ext3 didn't corrupt *that* often
> without barriers. So people got away with it "most of the time" ---
> just as applications replacing files by rewriting them in place using
> truncate and without fsync would "usually" not lose data after a crash
> if they were using ext3 in data=ordered mode. This caused people to
> get lazy/sloppy.
>
> So yes, barriers were something that was largely ignored for a long
> time. After all, in a server environment with UPSes and without crappy
> proprietary video drivers, Linux systems didn't crash that often
> anyway. So you really needed a large base of systems, and the ability
> to root-cause failures such as Ric had at EMC, to see the problem.

One thing to point out here is that there are a lot of "servers" in high
end data centers that do not have UPS backup. Those racks full of 1U and
2U boxes that are used to make "grids", "clouds" and so on are often
built with as much gear as you can stuff in a rack - no batteries or UPS
to be seen, so they are really quite similar to the normal desktop or
home systems that we run at home :-)

ric

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On Sun, Apr 5, 2009 at 7:54 AM, Ric Wheeler <ricwheeler@gmail.com> wrote:
> One thing to point out here is that there are a lot of "servers" in
> high end data centers that do not have UPS backup. Those racks full of
> 1U and 2U boxes that are used to make "grids", "clouds" and so on are
> often built with as much gear as you can stuff in a rack - no batteries
> or UPS to be seen, so they are really quite similar to the normal
> desktop or home systems that we run at home :-)

These days even bargain basement data centers provide UPS functionality
for you, via generator backup and A/B power.

Lee

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On 04/05/2009 06:14 PM, Lee Revell wrote:
> On Sun, Apr 5, 2009 at 7:54 AM, Ric Wheeler <ricwheeler@gmail.com> wrote:
>> One thing to point out here is that there are a lot of "servers" in
>> high end data centers that do not have UPS backup. Those racks full of
>> 1U and 2U boxes that are used to make "grids", "clouds" and so on are
>> often built with as much gear as you can stuff in a rack - no
>> batteries or UPS to be seen, so they are really quite similar to the
>> normal desktop or home systems that we run at home :-)
>
> These days even bargain basement data centers provide UPS functionality
> for you, via generator backup and A/B power.
>
> Lee

In my experience, you will see multiple customers with and without UPS
backup. I have certainly dealt personally with many data centers that
lost power without them, and I see array vendors that continue to build
in their own batteries...

ric

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
> Even now, the reason why ext3 doesn't have barriers enabled by default
> (although we did make them the default for ext4) is that Andrew doesn't
> believe Chris's replication case is likely to be true for most users in
> practice, and he's concerned about the performance degradation of
> barriers. He's basically depending on the fact that "usually" you can
> get away without using barriers. Sigh....

What is the performance degradation of barriers? If the disk doesn't have
a write cache, the performance of a barrier-enabled filesystem should be
equal to or better than the performance of the same filesystem not using
barriers. If barriers degrade performance, there is something seriously
broken, either in the filesystem (XFS...) or in the block layer.

If the disk has a write cache and you disable barriers, you might get
some performance improvement. But you are getting this performance
improvement from the fact that the disk illegally reorders writes that
should be ordered. And you are going to damage your data on power
failure. --- definitely, very few admins want this "high-performance &
damage" setup (maybe for /tmp or /var/log?) --- such a condition should
only be enabled with the admin fully knowing what's going on. And where
should the admin get the knowledge? In the hdparm manpage, there is just:

-W Get/set the IDE/SATA drive's write-caching feature.

> - Ted
>
> P.S. Of course, distributions should feel free to consider changing the
> default on their kernels. SLES already has, if memory serves correctly.
> I don't know if RHEL has yet.

RHEL doesn't enable the write cache by default. And it doesn't use
barriers (it uses lvm2/device mapper and they won't get through).

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
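For context on who decides what a barrier costs: this is roughly how a
request-based driver of that era declared its ordering mode to the block
layer, loosely modeled on what drivers/scsi/sd.c did at the time.
example_prepare_flush and example_set_ordered are illustrative names, and
the flush command itself is left driver-specific.

#include <linux/blkdev.h>

/* driver-specific: turn rq into a cache-flush command for the hardware */
static void example_prepare_flush(struct request_queue *q,
				  struct request *rq)
{
	/* e.g. fill in a SYNCHRONIZE CACHE CDB for a SCSI disk */
}

static void example_set_ordered(struct request_queue *q, int wce)
{
	if (wce)
		/* volatile write cache: drain, then flush around a barrier */
		blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH,
				  example_prepare_flush);
	else
		/* no volatile cache: draining queued requests is enough */
		blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
}

With the cache off, QUEUE_ORDERED_DRAIN means a barrier costs no more
than draining the queue, which is the "equal or better" case Mikulas
describes; with the cache on, every barrier pays for the flushes.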
On Sun, 5 Apr 2009, Lee Revell wrote:
> On Sun, Apr 5, 2009 at 7:54 AM, Ric Wheeler <ricwheeler@gmail.com> wrote:
> > One thing to point out here is that there are a lot of "servers" in
> > high end data centers that do not have UPS backup. Those racks full
> > of 1U and 2U boxes that are used to make "grids", "clouds" and so on
> > are often built with as much gear as you can stuff in a rack - no
> > batteries or UPS to be seen, so they are really quite similar to the
> > normal desktop or home systems that we run at home :-)
>
> These days even bargain basement data centers provide UPS functionality
> for you, via generator backup and A/B power.
>
> Lee

I have seen an installation that had generator backup, but with another
serious flaw --- if someone plugged in a computer that caused a short
circuit, it tripped the circuit breaker for the whole rack :)

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
> And I will restate that back at EMC we tested the original barriers
> (with reiserfs mostly, a bit on ext3 and ext2) and saw a significant
> reduction in file system integrity issues after power loss.

You saw that the barrier-enabled filesystem was worse than the same
filesystem without barriers? And what kind of issues were they? Disks
writing damaged sectors if powered off in the middle of a write? Or data
corruption due to bugs in ReiserFS?

> The vantage point I had at EMC while testing and deploying the original
> barrier work done by Jens and Chris was pretty unique - full ability to
> do root-cause analysis of any component failure when really needed, a
> huge installed base which could send information home on a regular
> basis about crashes/fsck instances/etc and the ability (with customer
> permission) to dial into any box and diagnose issues remotely. Not to
> mention access to drive vendors, to pressure them to make the flushes
> more robust. The application was also able to validate that all
> acknowledged writes were consistent.
>
> Barriers do work as we have them, but as others have mentioned, it is
> not a "free" win - fsync will actually move your data safely out to
> persistent storage for a huge percentage of real users (including every
> ATA/S-ATA and SAS drive I was able to test). The file systems I
> monitored in production use without barriers were much less reliable.

With write cache or without write cache?

With cache and without barriers the system is violating the
specification. There simply could be data corruption ... and it will
eventually happen.

If you got corruption without cache and without barriers, there's a bug
and it needs to be investigated.

> As others have noted, some storage does not need barriers or flushes
> (high end arrays, drives with no volatile write cache) and some need
> them but stink (low cost USB flash sticks?), so warning is a good thing
> to do...
>
> ric

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
> > So I'm wondering why Linux developers designed a barrier interface
> > with a complex specification and a complex implementation, when the
> > interface is useless for providing any request ordering and is no
> > better than the q->issue_flush_fn method or whatever was there
> > before. Obviously, the whole barrier thing was designed by a person
> > who never used it in a filesystem.
> 
> That's not quite true, it was done in conjunction with file system
> people.
> ...
> Nobody was interested in barriers when they were done. Nobody.

That's a contradiction :-)

Some time ago I wrote a piece of code that uses barriers for performance
enhancement
(http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/download/spadfs-0.9.10.tar.gz).

The trick is basically to take a lock that prevents filesystem-wide
updates, submit the remaining writes (don't wait), submit the barrier
that causes the transition to the new generation (don't wait) and release
the lock. The lock is held for the minimum time; no I/O is waited for
inside the lock. This trick can't be done without barriers; without
barriers you'd have to wait inside the lock.

And the requirement for this code is that barriers are supported for the
whole lifetime of the filesystem --- which is what the Linux kernel
doesn't support! If barrier support is lost, consistency is damaged.

With barriers, the code does [submit A, submit barrier B, submit C].
If you don't have barriers, you must modify this sequence to: [submit A,
wait for A endio, submit B, wait for B endio, submit C]

--- and now you are getting the point why failing barriers can't ever
work --- by the time request B completes and you find out that the device
lost barrier support, you realize that you should have inserted the waits
in the past --- but it's too late, there is no way to insert them
retroactively.

AFAIK this is the only piece of code that uses barriers to improve
performance. All the other filesystems use barriers just as a way to
flush the cache and don't overlap the barrier request with any other
requests.

So there are two ways:

- either support only what all in-kernel filesystems do: using barrier
requests to flush the hw cache. You can remove support for barriers with
data and leave just the zero-data barrier, and you can remove the
ordering restrictions. In-kernel filesystems never overlap a barrier with
another metadata request (see above why such an overlap can't work), so
you can freely reorder zero-data barriers and simplify the code ...
because all the requests that could be submitted in parallel with the
barrier are either for a different partition or non-metadata requests to
the same partition from prefetch, direct I/O or so.

- or you can allow barriers to be used for purposes such as the one I
described. And then there must be a clean indicator: "this device
supports barriers *and*will*support*them*in*the*future*". Currently there
is no such indicator.

Mikulas

> --
> Jens Axboe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
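A schematic of the pattern described above, in kernel-style C. This is
not code from spadfs; fs_generation and commit_generation are
illustrative names. The point is that the commit lock covers only
submission, never I/O completion, which is exactly what stops working if
the barrier can fail with -EOPNOTSUPP after the lock is dropped.

#include <linux/bio.h>
#include <linux/fs.h>
#include <linux/mutex.h>

struct fs_generation {
	struct mutex commit_lock;	/* blocks filesystem-wide updates */
};

static void commit_generation(struct fs_generation *fs,
			      struct bio **pending, int npending,
			      struct bio *transition)
{
	int i;

	mutex_lock(&fs->commit_lock);

	for (i = 0; i < npending; i++)
		submit_bio(WRITE, pending[i]);		/* don't wait */

	/* the barrier orders the writes above against everything after */
	submit_bio(WRITE_BARRIER, transition);		/* don't wait */

	mutex_unlock(&fs->commit_lock);

	/*
	 * Without barriers, the transition write could only be submitted
	 * after waiting for every pending bio to complete, with those
	 * waits held inside the lock.  If the barrier later fails with
	 * -EOPNOTSUPP, those waits cannot be inserted retroactively.
	 */
}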
On Wed, Apr 08 2009, Mikulas Patocka wrote:
> > > So I'm wondering why Linux developers designed a barrier interface
> > > with a complex specification and a complex implementation, when the
> > > interface is useless for providing any request ordering and is no
> > > better than the q->issue_flush_fn method or whatever was there
> > > before. Obviously, the whole barrier thing was designed by a person
> > > who never used it in a filesystem.
> > 
> > That's not quite true, it was done in conjunction with file system
> > people.
> > ...
> > Nobody was interested in barriers when they were done. Nobody.
> 
> That's a contradiction :-)

So we have sunk to this level now, snip-and-edit citing?

> Some time ago I wrote a piece of code that uses barriers for
> performance enhancement
> (http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/download/spadfs-0.9.10.tar.gz).
> 
> The trick is basically to take a lock that prevents filesystem-wide
> updates, submit the remaining writes (don't wait), submit the barrier
> that causes the transition to the new generation (don't wait) and
> release the lock. The lock is held for the minimum time; no I/O is
> waited for inside the lock. This trick can't be done without barriers;
> without barriers you'd have to wait inside the lock.
> 
> And the requirement for this code is that barriers are supported for
> the whole lifetime of the filesystem --- which is what the Linux kernel
> doesn't support! If barrier support is lost, consistency is damaged.
> 
> With barriers, the code does [submit A, submit barrier B, submit C].
> If you don't have barriers, you must modify this sequence to: [submit
> A, wait for A endio, submit B, wait for B endio, submit C]
> 
> --- and now you are getting the point why failing barriers can't ever
> work --- by the time request B completes and you find out that the
> device lost barrier support, you realize that you should have inserted
> the waits in the past --- but it's too late, there is no way to insert
> them retroactively.
> 
> AFAIK this is the only piece of code that uses barriers to improve
> performance. All the other filesystems use barriers just as a way to
> flush the cache and don't overlap the barrier request with any other
> requests.
> 
> So there are two ways:
> 
> - either support only what all in-kernel filesystems do: using barrier
> requests to flush the hw cache. You can remove support for barriers
> with data and leave just the zero-data barrier, and you can remove the
> ordering restrictions. In-kernel filesystems never overlap a barrier
> with another metadata request (see above why such an overlap can't
> work), so you can freely reorder zero-data barriers and simplify the
> code ... because all the requests that could be submitted in parallel
> with the barrier are either for a different partition or non-metadata
> requests to the same partition from prefetch, direct I/O or so.
> 
> - or you can allow barriers to be used for purposes such as the one I
> described. And then there must be a clean indicator: "this device
> supports barriers *and*will*support*them*in*the*future*". Currently
> there is no such indicator.

I'm about to leave, so I won't comment on the above. But the below is
basically what we pretend right now, and I think that is perfectly fine.
If you go and reconfigure your device and it suddenly doesn't support
barriers anymore, call the doctor and tell him that your foot hurts when
you slam it in the door. Don't do that, it's pretty simple. We already
agreed to kill the -EOPNOTSUPP and just pretend it always works, with a
notifier that we MAY not be safe.

Not much point in harping on the same thing over and over again.
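What "kill the -EOPNOTSUPP and just warn" might look like is sketched
below. This is one reading of the proposal, not a merged patch;
example_warn_unordered is an illustrative name, and the placement (called
from the request submission path instead of failing the bio) is an
assumption.

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/kernel.h>

static void example_warn_unordered(struct request_queue *q,
				   struct bio *bio)
{
	static int warned;

	if (bio_barrier(bio) && q->next_ordered == QUEUE_ORDERED_NONE &&
	    !warned) {
		warned = 1;
		printk(KERN_WARNING "barrier issued to a queue without "
		       "ordering support; integrity across a power "
		       "failure is not guaranteed\n");
	}
	/* the bio is then passed on instead of being failed */
}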
On Wed, 08 Apr 2009, Mikulas Patocka wrote:
> I have seen an installation that had generator backup, but with another
> serious flaw --- if someone plugged in a computer that caused a short
> circuit, it tripped the circuit breaker for the whole rack :)

You will be hard-pressed to find datacenters where the above is NOT true
:-)

Although usually one has TWO independent power feeds per rack, so a short
is likely to bring down just half of it, and any equipment properly set
up with dual independent power feeds should survive...
On Wed, Apr 08, 2009 at 09:37:56AM -0400, Mikulas Patocka wrote:
> The trick is basically to take a lock that prevents filesystem-wide
> updates, submit the remaining writes (don't wait), submit the barrier
> that causes the transition to the new generation (don't wait) and
> release the lock. The lock is held for the minimum time; no I/O is
> waited for inside the lock. This trick can't be done without barriers;
> without barriers you'd have to wait inside the lock.

Woo! You just described the technique XFS uses to guarantee ordering of
metadata and log IO (i.e. asynchronous barriers). ;)

> AFAIK this is the only piece of code that uses barriers to improve
> performance. All the other filesystems use barriers just as a way to
> flush the cache and don't overlap the barrier request with any other
> requests.

The problem is, disks often slow down when you issue barriers. It doesn't
matter what purpose you are using barriers for; they change the order in
which the disk would retire the I/O, and hence they change performance.
Issue enough barriers and performance will drop noticeably.

In the case of XFS, we need to guarantee the ordering of every single log
write w.r.t. metadata writeback. Hence barriers are issued relatively
frequently (several a second) and so disks operate at spindle speed
rather than cache speed. It is the frequency of barrier IO that slows XFS
down, not the way barriers are implemented. Your technique will show
exactly the same behaviour if you issue barriers frequently enough.

Cheers,

Dave.
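A schematic of the access pattern Dave describes (generic, not XFS code):
the ordering comes without completion waits, but on a disk with a
volatile write cache every barrier still implies a queue drain plus a
cache flush, so issuing one per log write bounds throughput by media
speed. example_log_write is an illustrative name.

#include <linux/bio.h>
#include <linux/fs.h>

/*
 * One log write per transaction commit, with barrier semantics:
 * metadata writeback queued behind it can never pass the log write,
 * so no waits are needed for ordering.  Issued several times a
 * second, each of these costs a drain plus flush on cached disks.
 */
static void example_log_write(struct bio *log_bio)
{
	submit_bio(WRITE_BARRIER, log_bio);
}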
On Wed, 2009-04-08 at 09:37 -0400, Mikulas Patocka wrote:
> So there are two ways:
>
> - either support only what all in-kernel filesystems do: using barrier
> requests to flush the hw cache. You can remove support for barriers
> with data and leave just the zero-data barrier, and you can remove the
> ordering restrictions. In-kernel filesystems never overlap a barrier
> with another metadata request (see above why such an overlap can't
> work), so you can freely reorder zero-data barriers and simplify the
> code ... because all the requests that could be submitted in parallel
> with the barrier are either for a different partition or non-metadata
> requests to the same partition from prefetch, direct I/O or so.
>
> - or you can allow barriers to be used for purposes such as the one I
> described. And then there must be a clean indicator: "this device
> supports barriers *and*will*support*them*in*the*future*". Currently
> there is no such indicator.

I'm afraid that expecting barriers forever in the future isn't completely
compatible with dm or md. Both of these allow the storage to change over
time, and the filesystem needs to handle this without corruption.

-chris

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Lee Revell wrote:
> On Sun, Apr 5, 2009 at 7:54 AM, Ric Wheeler <ricwheeler@gmail.com> wrote:
>> One thing to point out here is that there are a lot of "servers" in
>> high end data centers that do not have UPS backup. Those racks full of
>> 1U and 2U boxes that are used to make "grids", "clouds" and so on are
>> often built with as much gear as you can stuff in a rack - no
>> batteries or UPS to be seen, so they are really quite similar to the
>> normal desktop or home systems that we run at home :-)
>
> These days even bargain basement data centers provide UPS functionality
> for you, via generator backup and A/B power.
>
> Lee

In that case, you can turn off barriers. It's still no excuse, IMHO, for
us to ship, by default, a journaling filesystem configuration which is
known to corrupt on power loss.

-Eric

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On Wed, Apr 08, 2009 at 09:27:20PM -0400, Chris Mason wrote:
> I'm afraid that expecting barriers forever in the future isn't
> completely compatible with dm or md. Both of these allow the storage to
> change over time, and the filesystem needs to handle this without
> corruption.

The missing piece of the jigsaw is notification to upper layers *ahead*
of such reconfigurations. (We have similar notification issues when
devices are resized - the current approach is to leave the sysadmin
responsible for co-ordinating changes through userspace.)

Alasdair
Mikulas Patocka wrote:
>> And I will restate that back at EMC we tested the original barriers
>> (with reiserfs mostly, a bit on ext3 and ext2) and saw a significant
>> reduction in file system integrity issues after power loss.
>
> You saw that the barrier-enabled filesystem was worse than the same
> filesystem without barriers? And what kind of issues were they? Disks
> writing damaged sectors if powered off in the middle of a write? Or
> data corruption due to bugs in ReiserFS?

No - I was not being clear. We saw a reduction in issues, which is a
confusing way to say that it was significantly better with barriers
enabled, for both ext3 & reiserfs.

>> The vantage point I had at EMC while testing and deploying the
>> original barrier work done by Jens and Chris was pretty unique - full
>> ability to do root-cause analysis of any component failure when really
>> needed, a huge installed base which could send information home on a
>> regular basis about crashes/fsck instances/etc and the ability (with
>> customer permission) to dial into any box and diagnose issues
>> remotely. Not to mention access to drive vendors, to pressure them to
>> make the flushes more robust. The application was also able to
>> validate that all acknowledged writes were consistent.
>>
>> Barriers do work as we have them, but as others have mentioned, it is
>> not a "free" win - fsync will actually move your data safely out to
>> persistent storage for a huge percentage of real users (including
>> every ATA/S-ATA and SAS drive I was able to test). The file systems I
>> monitored in production use without barriers were much less reliable.
>
> With write cache or without write cache?

Write cache enabled. Barriers are off when the write cache is disabled -
we probe the drive's write cache and enable barriers at mount time if and
only if the write cache is on.

> With cache and without barriers the system is violating the
> specification. There simply could be data corruption ... and it will
> eventually happen.
>
> If you got corruption without cache and without barriers, there's a bug
> and it needs to be investigated.
>
>> As others have noted, some storage does not need barriers or flushes
>> (high end arrays, drives with no volatile write cache) and some need
>> them but stink (low cost USB flash sticks?), so warning is a good
>> thing to do...
>>
>> ric
>
> Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel