Message ID | Pine.LNX.4.64.0903241000010.29968@hs20-bc2-1.build.redhat.com (mailing list archive) |
---|---|
State | RFC, archived |
On Tue, Mar 24 2009, Mikulas Patocka wrote:
> On Mon, 23 Mar 2009, Eric Sandeen wrote:
>
> > I've noticed that on 2.6.29-rcX, with Andi's patch
> > (ab4c1424882be9cd70b89abf2b484add355712fa, dm: support barriers on
> > simple devices) barriers are still getting rejected on these simple
> > devices.
> >
> > The problem is in __generic_make_request():
> >
> >         if (bio_barrier(bio) && bio_has_data(bio) &&
> >             (q->next_ordered == QUEUE_ORDERED_NONE)) {
> >                 err = -EOPNOTSUPP;
> >                 goto end_io;
> >         }
> >
> > and dm isn't flagging its queue as supporting ordered writes, so it's
> > rejected here.
> >
> > Doing something like this:
> >
> > +       if (t->barriers_supported)
> > +               blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
> >
> > somewhere in dm (I stuck it in dm_table_set_restrictions() - almost
> > certainly the wrong thing to do) did get my dm-linear device to mount
> > with xfs, w/o xfs complaining that its mount-time barrier tests failed.
> >
> > So what's the right way around this? What should dm (or md for that
> > matter) advertise on their queues about ordered-ness? Should there be
> > some sort of "QUEUE_ORDERED_PASSTHROUGH" or something to say "this level
> > doesn't care, ask the next level" or somesuch? Or should it inherit the
> > flag from the next level down? Ideas?
> >
> > Thanks,
> > -Eric
>
> Hi
>
> This is a misdesign in the generic bio layer and it should be fixed
> there. I think it is blocking barrier support in md-raid1 too. Jens,
> please apply the attached patch.
>
> Mikulas
>
> ----
>
> Move the test for not-supported barriers to __make_request.
>
> This test prevents barriers from being dispatched to device mapper
> and md.
>
> This test is sensible only for drivers that use requests (such as disk
> drivers), not for drivers that use bios.
>
> It is better to fix it in generic code than to make a workaround for it
> in device mapper and md.

So you audited any ->make_request_fn style driver and made sure they
rejected barriers?

> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>
> ---
>  block/blk-core.c |   11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> Index: linux-2.6.29-rc6-devel/block/blk-core.c
> ===================================================================
> --- linux-2.6.29-rc6-devel.orig/block/blk-core.c	2009-02-23 18:43:37.000000000 +0100
> +++ linux-2.6.29-rc6-devel/block/blk-core.c	2009-02-23 18:44:27.000000000 +0100
> @@ -1145,6 +1145,12 @@ static int __make_request(struct request
>  	const int unplug = bio_unplug(bio);
>  	int rw_flags;
>
> +	if (bio_barrier(bio) && bio_has_data(bio) &&
> +	    (q->next_ordered == QUEUE_ORDERED_NONE)) {
> +		bio_endio(bio, -EOPNOTSUPP);
> +		return 0;
> +	}
> +
>  	nr_sectors = bio_sectors(bio);
>
>  	/*
> @@ -1450,11 +1456,6 @@ static inline void __generic_make_reques
>  			err = -EOPNOTSUPP;
>  			goto end_io;
>  		}
> -		if (bio_barrier(bio) && bio_has_data(bio) &&
> -		    (q->next_ordered == QUEUE_ORDERED_NONE)) {
> -			err = -EOPNOTSUPP;
> -			goto end_io;
> -		}
>
>  		ret = q->make_request_fn(q, bio);
>  	} while (ret);
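
If the test moves as the patch proposes, a bio-based ->make_request_fn
driver that cannot honour ordering (aoeblk and pktcdvd in the audit
that follows) is no longer shielded by the generic layer and has to
fail barrier bios itself. A minimal sketch against the 2.6.29-era bio
API; the function name is made up:

	#include <linux/bio.h>
	#include <linux/blkdev.h>

	static int example_make_request(struct request_queue *q, struct bio *bio)
	{
		/* This driver may cache or reorder writes and has no way
		 * to order around a barrier: refuse it explicitly, since
		 * __generic_make_request() no longer does it for us. */
		if (bio_barrier(bio) && bio_has_data(bio)) {
			bio_endio(bio, -EOPNOTSUPP);
			return 0;
		}

		/* ... normal bio handling ... */
		return 0;
	}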
On Tue, 24 Mar 2009, Jens Axboe wrote:

> On Tue, Mar 24 2009, Mikulas Patocka wrote:
>
> [...]
>
> > It is better to fix it in generic code than to make a workaround for
> > it in device mapper and md.
>
> So you audited any ->make_request_fn style driver and made sure they
> rejected barriers?

I didn't.

If you grep for it, you get:

./arch/powerpc/sysdev/axonram.c:
	doesn't reject barriers, but it is not needed, it ends all bios
	in the make_request routine

./drivers/block/aoe/aoeblk.c:
	* doesn't reject barriers, should be modified to do so

./drivers/block/brd.c
	doesn't reject barriers, doesn't need to, ends all bios in
	make_request

./drivers/block/loop.c:
	doesn't reject barriers, it's OK because it doesn't reorder
	requests

./drivers/block/pktcdvd.c
	* doesn't reject barriers, should be modified to do so

./drivers/block/umem.c
	* doesn't reject barriers, I don't know if it reorders requests
	or not

./drivers/s390/block/xpram.c
	doesn't reject barriers, doesn't need to, ends bios immediately

./drivers/md/raid0.c
	rejects barriers

./drivers/md/raid1.c
	supports barriers

./drivers/md/raid10.c
	rejects barriers

./drivers/md/raid5.c
	rejects barriers

./drivers/md/linear.c
	rejects barriers

./drivers/md/dm.c
	supports barriers partially

Mikulas
On Tue, Mar 24 2009, Mikulas Patocka wrote:

> On Tue, 24 Mar 2009, Jens Axboe wrote:
>
> > So you audited any ->make_request_fn style driver and made sure they
> > rejected barriers?
>
> I didn't.
>
> If you grep for it, you get:
>
> [...]
>
> ./drivers/block/loop.c:
> 	doesn't reject barriers, it's OK because it doesn't reorder
> 	requests
>
> [...]

Not reordering is not enough to support the barrier primitive, unless
you always go to the same device and pass the barrier flag down with
it.

I think having the check in generic_make_request() is perfectly fine,
even if the value doesn't completely apply to stacked devices. Perhaps
we can add such a value, then. My main point is that barrier support
should be opt-in, not a default thing. Over time we should have support
everywhere, but it needs to be checked, audited, and trusted.
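
For reference, the opt-in Jens is describing is the existing
request-based mechanism: a driver that has verified it never reorders
writes advertises an ordered mode on its queue. A minimal sketch
against the 2.6.29 API (example_init_queue is a made-up name):

	#include <linux/blkdev.h>

	static int example_init_queue(struct request_queue *q)
	{
		/* Without this call, q->next_ordered stays
		 * QUEUE_ORDERED_NONE and barrier writes fail with
		 * -EOPNOTSUPP. DRAIN means "drain the queue around the
		 * barrier, no cache flush needed", so it is only correct
		 * for write-through caching. */
		return blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
	}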
On Tue, 24 Mar 2009, Jens Axboe wrote:

> On Tue, Mar 24 2009, Mikulas Patocka wrote:
>
> [...]
>
> Not reordering is not enough to support the barrier primitive, unless
> you always go to the same device and pass the barrier flag down with
> it.

For single-device drivers (not md/dm), not reordering should be good
enough to claim barrier support.

> I think having the check in generic_make_request() is perfectly fine,
> even if the value doesn't completely apply to stacked devices. Perhaps
> we can add such a value, then. My main point is that barrier support
> should be opt-in, not a default thing.

So make some flag for these bio-based devices, so that they don't have
to use one of those request-based options (which are meaningless for a
non-request-based device).

> Over time we should have support everywhere, but it needs to be
> checked, audited, and trusted.

BTW, what is the rule for barriers if the device can't prevent the
requests from being delayed or reordered? (For example, ATA<=3 disks
with a cache that lack the cache-flush command, or flash cards that do
write caching anyway and it can't be turned off.) Should they support
barriers and make a best effort? Or should they reject barriers to
inform the calling code that they have no data consistency?

Mikulas
On Tue, Mar 24 2009, Mikulas Patocka wrote:

> [...]
>
> For single-device drivers (not md/dm), not reordering should be good
> enough to claim barrier support.

Not reordering is what the barrier is all about; the problem is how far
down you extend that guarantee. For the Linux barrier, it's ALL the way
down to and including the hardware. So it's only good enough if it
includes the device not reordering the write, and signalling completion
when it's safe. "Good enough" is not an option, it's all or nothing.

> So make some flag for these bio-based devices, so that they don't have
> to use one of those request-based options (which are meaningless for a
> non-request-based device).

Sure, but as I said, I think it's mainly a cosmetic issue. Signalling
simple barrier support is just fine.

> BTW, what is the rule for barriers if the device can't prevent the
> requests from being delayed or reordered? (For example, ATA<=3 disks
> with a cache that lack the cache-flush command, or flash cards that do
> write caching anyway and it can't be turned off.) Should they support
> barriers and make a best effort? Or should they reject barriers to
> inform the calling code that they have no data consistency?

If they can't flush the cache, then they must reject barriers unless
they have write-through caching.
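
The rule Jens states maps onto the ordered modes the block layer
already offers; roughly how a low-level driver would pick one, loosely
modelled on what sd does in this era (example_* names are made up, and
wce/has_flush stand in for the driver's own cache probing):

	#include <linux/blkdev.h>

	static void example_prepare_flush(struct request_queue *q,
					  struct request *rq)
	{
		/* Turn rq into this device's cache-flush command, the way
		 * sd builds a SYNCHRONIZE CACHE command here. Omitted. */
	}

	static void example_choose_ordered(struct request_queue *q,
					   int wce, int has_flush)
	{
		if (!wce)		/* write-through: draining is enough */
			blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
		else if (has_flush)	/* write-back, but flushable */
			blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH,
					  example_prepare_flush);
		else			/* write-back, no flush: reject barriers */
			blk_queue_ordered(q, QUEUE_ORDERED_NONE, NULL);
	}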
> > BTW, what is the rule for barriers if the device can't prevent the
> > requests from being delayed or reordered? [...] Should they support
> > barriers and make a best effort? Or should they reject barriers to
> > inform the calling code that they have no data consistency?
>
> If they can't flush the cache, then they must reject barriers unless
> they have write-through caching.

... and you suppose that journaled filesystems will use this error and
mark the filesystem for fsck if they are running over a device that
doesn't support consistency?

In theory it would be nice; in practice it doesn't work this way,
because many devices that *DO* support data consistency don't support
barriers (the most common are DM and MD when run over a disk without a
write cache).

So I think there should be a flag (this device does/doesn't support
data consistency) that the journaled filesystems can use to mark the
disk dirty for fsck. And if you implement this flag, you can accept
barriers always, for all kinds of devices, regardless of whether they
support consistency. You can then get rid of that -EOPNOTSUPP and
simplify filesystem code, because filesystems would no longer need two
commit paths and a clumsy way to restart -EOPNOTSUPPed requests.

Mikulas
On Wed, Mar 25 2009, Mikulas Patocka wrote:

> > If they can't flush the cache, then they must reject barriers unless
> > they have write-through caching.
>
> ... and you suppose that journaled filesystems will use this error and
> mark the filesystem for fsck if they are running over a device that
> doesn't support consistency?

No, but they can warn that data consistency isn't guaranteed. And they
all do, if you mount with barriers enabled and the barrier write fails.
If barriers aren't supported, the first one will fail. So either they
do lazy detection, or they do a trial barrier write at mount time.

So yes, I suppose that filesystems will use this error. Because that is
what they do.

> In theory it would be nice; in practice it doesn't work this way,
> because many devices that *DO* support data consistency don't support
> barriers (the most common are DM and MD when run over a disk without a
> write cache).

Your theory is nice, but most dm systems use write-back caching. Any
desktop uses write-back caching. Only higher-end disks default to
write-through caching.

> So I think there should be a flag (this device does/doesn't support
> data consistency) that the journaled filesystems can use to mark the
> disk dirty for fsck. [...]

And my point is that this case isn't interesting, because most setups
don't guarantee proper ordering. The error handling is complex, no
doubt about that. But the trial barrier test is pretty trivial and
could even be easily abstracted out. If a later barrier write fails,
then that's really no different than if a normal write fails. Error
handling is not easy in that case.
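
The lazy-detect path Jens mentions is roughly the pattern jbd uses for
its commit block in this era: try the commit write as a barrier, and on
-EOPNOTSUPP disable barriers and reissue it as a plain write. A
simplified sketch, not the verbatim jbd code (example_* names are made
up):

	#include <linux/buffer_head.h>

	static int example_write_commit_block(struct buffer_head *bh,
					      int *use_barrier)
	{
		int ret;

		if (*use_barrier)
			set_buffer_ordered(bh);	/* submit_bh maps this to a barrier bio */

		ret = sync_dirty_buffer(bh);

		if (ret == -EOPNOTSUPP && *use_barrier) {
			/* a layer below refused the barrier: remember that
			 * for future commits and reissue as a plain write */
			*use_barrier = 0;
			clear_buffer_ordered(bh);
			lock_buffer(bh);
			set_buffer_uptodate(bh);
			clear_buffer_dirty(bh);
			ret = submit_bh(WRITE, bh);
			wait_on_buffer(bh);
			if (!ret && !buffer_uptodate(bh))
				ret = -EIO;
		}
		return ret;
	}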
> > > If they can't flush the cache, then they must reject barriers
> > > unless they have write-through caching.
> >
> > ... and you suppose that journaled filesystems will use this error
> > and mark the filesystem for fsck if they are running over a device
> > that doesn't support consistency?
>
> No, but they can warn that data consistency isn't guaranteed. And they
> all do, if you mount with barriers enabled and the barrier write
> fails. If barriers aren't supported, the first one will fail. So
> either they do lazy detection, or they do a trial barrier write at
> mount time.

The user shouldn't really be required to know what barriers are, which
drivers support them and which don't, and which drivers maintain
consistency without barriers and which don't.

The user only needs to know whether he must run fsck in the case of a
power failure or not --- and that -EOPNOTSUPP error and warnings about
failed barriers give him no information about that.

> So yes, I suppose that filesystems will use this error. Because that
> is what they do.
>
> > In theory it would be nice; in practice it doesn't work this way,
> > because many devices that *DO* support data consistency don't
> > support barriers (the most common are DM and MD when run over a disk
> > without a write cache).
>
> Your theory is nice, but most dm systems use write-back caching.

If they do, the filesystem should know about it and fsck the partition
in the case of a crash.

> Any desktop uses write-back caching. Only higher-end disks default to
> write-through caching.
>
> [...]
>
> And my point is that this case isn't interesting, because most setups
> don't guarantee proper ordering.

If the ordering isn't guaranteed, the filesystem should know about it,
and mark the partition for fsck. That's why I'm suggesting to use a
flag for that. That flag could also be propagated up through md and dm.

The reasoning "write barriers aren't supported => the device doesn't
guarantee consistency" isn't valid.

> The error handling is complex, no doubt about that. But the trial
> barrier test is pretty trivial and could even be easily abstracted
> out. If a later barrier write fails, then that's really no different
> than if a normal write fails. Error handling is not easy in that case.

I had a discussion with Andi about it some time ago. The conclusion was
that all the current filesystems handle barriers failing in the middle
of the operation without functionality loss, but it makes barriers
useless for any performance-sensitive tasks (commits that wouldn't
block concurrent activity). Non-blocking commits could only be
implemented if barriers don't fail.

Mikulas
On Wed, Mar 25 2009, Mikulas Patocka wrote:

> The user shouldn't really be required to know what barriers are, which
> drivers support them and which don't, and which drivers maintain
> consistency without barriers and which don't.
>
> The user only needs to know whether he must run fsck in the case of a
> power failure or not --- and that -EOPNOTSUPP error and warnings about
> failed barriers give him no information about that.

I completely agree, but that's "just" a usability issue. Ext4 will tell
you that barriers failed and are now disabled; not very informative.
XFS will tell you something similar.

> If the ordering isn't guaranteed, the filesystem should know about it,
> and mark the partition for fsck. That's why I'm suggesting to use a
> flag for that. That flag could also be propagated up through md and
> dm.

We can do that, not a problem. The problem is that ordering is almost
never preserved; SCSI does not use ordered tags because it hasn't been
verified that its error path doesn't reorder by mistake. So right now
you can basically use 'false' as that flag.

> The reasoning "write barriers aren't supported => the device doesn't
> guarantee consistency" isn't valid.

It's valid in the sense that it's the only RELIABLE primitive we have.
Are you really suggesting that we just assume any device is fully
ordered, unless proven otherwise?

> I had a discussion with Andi about it some time ago. The conclusion
> was that all the current filesystems handle barriers failing in the
> middle of the operation without functionality loss, but it makes
> barriers useless for any performance-sensitive tasks (commits that
> wouldn't block concurrent activity). Non-blocking commits could only
> be implemented if barriers don't fail.

As long as you do a trial barrier like XFS does, barriers will not fail
unless you have a media error. Things would also be much easier if
writes never failed.
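
The mount-time trial Jens refers to can be a single throwaway barrier
write. A sketch at the bio level against the 2.6.29 API, in the spirit
of XFS's mount-time test rather than its actual code; names are made
up, and the caller must pass a page and a sector it is safe to rewrite
(e.g. a superblock copy):

	#include <linux/bio.h>
	#include <linux/blkdev.h>
	#include <linux/completion.h>
	#include <linux/fs.h>

	struct trial {
		struct completion done;
		int error;
	};

	static void trial_end_io(struct bio *bio, int error)
	{
		struct trial *t = bio->bi_private;

		t->error = error;
		complete(&t->done);
	}

	static int example_trial_barrier(struct block_device *bdev,
					 struct page *page, sector_t sector)
	{
		struct trial t;
		struct bio *bio;

		bio = bio_alloc(GFP_KERNEL, 1);
		if (!bio)
			return -ENOMEM;

		init_completion(&t.done);
		t.error = 0;

		bio->bi_bdev = bdev;
		bio->bi_sector = sector;
		bio->bi_end_io = trial_end_io;
		bio->bi_private = &t;
		bio_add_page(bio, page, 512, 0);

		submit_bio(WRITE_BARRIER, bio);
		wait_for_completion(&t.done);
		bio_put(bio);

		return t.error;	/* -EOPNOTSUPP: mount without barriers */
	}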
On Wed, 2009-03-25 at 18:39 -0400, Mikulas Patocka wrote:

> > The error handling is complex, no doubt about that. But the trial
> > barrier test is pretty trivial and could even be easily abstracted
> > out. If a later barrier write fails, then that's really no different
> > than if a normal write fails. Error handling is not easy in that
> > case.
>
> I had a discussion with Andi about it some time ago. The conclusion
> was that all the current filesystems handle barriers failing in the
> middle of the operation without functionality loss, but it makes
> barriers useless for any performance-sensitive tasks (commits that
> wouldn't block concurrent activity). Non-blocking commits could only
> be implemented if barriers don't fail.

If a barrier fails at runtime, the filesystems do fall back to not
doing barriers without real problems. But that's because they don't
actually rely on the barriers to decide whether an async commit is a
good idea.

One exception is reiserfs, which does one wait_on_buffer at a later
time if barriers are on. But this isn't an async commit, it's just
moving an unplug.

In general, async commits happen with threads, and they aren't related
to barriers. Barriers don't really give us error handling, and they are
at the very end of a long chain of technical problems around commits
that don't block concurrent activity.

-chris
On Thu, 26 Mar 2009, Jens Axboe wrote:

> On Wed, Mar 25 2009, Mikulas Patocka wrote:
>
> > If the ordering isn't guaranteed, the filesystem should know about
> > it, and mark the partition for fsck. That's why I'm suggesting to
> > use a flag for that. That flag could also be propagated up through
> > md and dm.
>
> We can do that, not a problem. The problem is that ordering is almost
> never preserved; SCSI does not use ordered tags because it hasn't been
> verified that its error path doesn't reorder by mistake. So right now
> you can basically use 'false' as that flag.

There are three ordering guarantees:

1. nothing (for devices with a write cache but without cache control)

2. non-cached ordering: the sequence [submit req a, end req a, submit
req b, end req b] establishes the ordering. It is guaranteed that when
a request ends successfully, it is on the medium. This is what all the
filesystems, md and dm assume about disks. This consistency model was
in use long before barriers came in.

3. barrier ordering: ordering is done with barriers. [submit req a, end
req a, submit req b, end req b] won't guarantee the ordering of a and
b; a barrier must be inserted.

--- so you can make two bitflags that differentiate these models. In
the current kernel, models (1) and (2) cannot be differentiated in any
way, and (3) can be differentiated only after a trial write, which
won't guarantee that (3) stays valid afterwards.

> > The reasoning "write barriers aren't supported => the device doesn't
> > guarantee consistency" isn't valid.
>
> It's valid in the sense that it's the only RELIABLE primitive we have.
> Are you really suggesting that we just assume any device is fully
> ordered, unless proven otherwise?

If someone implements "write barriers aren't supported => run fsck",
then a lot of systems start fscking needlessly (for example those using
md or dm without a write cache) and become inoperative for a long time
because of that. So no one can really implement this logic, and
filesystems don't run fsck at all when operated over a device that
doesn't support ordering. So you get data corruption if you get a crash
on those devices.

> As long as you do a trial barrier like XFS does, barriers will not
> fail unless you have a media error.

No.

The barrier can be cancelled with -EOPNOTSUPP at any time. Andi Kleen
submitted a patch that implements failing barriers for device mapper,
and he says that md-raid1 does the same thing.

Filesystems handle these randomly failed barriers, but the downside is
that they must not submit any request concurrently with the barrier.
Also, that -EOPNOTSUPP restarting code is really crap: the request
cannot be restarted from bi_end_io, so bi_end_io needs to hand it off
to another thread for a retry without the barrier.

See this patch: http://lkml.org/lkml/2008/12/4/433 (and the associated
thread). The patch is silly, but it shows what is really happening and
what the filesystem must be prepared to deal with.

> Things would also be much easier if writes never failed.

I definitely agree that barriers shouldn't fail. So remove that
-EOPNOTSUPP error code altogether, make barriers always pass down to
all kinds of devices, and inform the caller via queue flags that the
device doesn't support ordering.

Mikulas
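
The hand-off Mikulas describes looks roughly like this: bi_end_io may
run in interrupt context and must not resubmit the bio itself, so the
-EOPNOTSUPP completion is punted to a work item that strips the barrier
flag and reissues the bio. A skeleton sketch only, not the code from
the patch linked above; the struct and function names are made up, and
saving/restoring the original completion handler is glossed over:

	#include <linux/bio.h>
	#include <linux/fs.h>
	#include <linux/slab.h>
	#include <linux/workqueue.h>

	struct barrier_retry {
		struct work_struct work;
		struct bio *bio;
	};

	/* runs in process context, where resubmission is allowed */
	static void barrier_retry_fn(struct work_struct *work)
	{
		struct barrier_retry *r = container_of(work, struct barrier_retry, work);
		struct bio *bio = r->bio;

		bio->bi_rw &= ~(1 << BIO_RW_BARRIER);	/* retry as a plain write */
		set_bit(BIO_UPTODATE, &bio->bi_flags);	/* undo the failed completion */
		/* the original bi_end_io must have been saved and restored
		 * somewhere (e.g. via bi_private); glossed over here */
		generic_make_request(bio);
		kfree(r);
	}

	/* bi_end_io: may run in interrupt context, must not resubmit */
	static void barrier_end_io(struct bio *bio, int error)
	{
		if (error == -EOPNOTSUPP) {
			struct barrier_retry *r = kmalloc(sizeof(*r), GFP_ATOMIC);

			if (r) {
				r->bio = bio;
				INIT_WORK(&r->work, barrier_retry_fn);
				schedule_work(&r->work);
				return;	/* completion deferred to the retry */
			}
			/* allocation failed: fall through, surface the error */
		}
		/* normal completion handling goes here */
	}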
On Mon, Mar 30 2009, Mikulas Patocka wrote:

> There are three ordering guarantees:
>
> 1. nothing (for devices with a write cache but without cache control)
>
> 2. non-cached ordering: the sequence [submit req a, end req a, submit
> req b, end req b] establishes the ordering. It is guaranteed that when
> a request ends successfully, it is on the medium. This is what all the
> filesystems, md and dm assume about disks. This consistency model was
> in use long before barriers came in.
>
> 3. barrier ordering: ordering is done with barriers. [submit req a,
> end req a, submit req b, end req b] won't guarantee the ordering of a
> and b; a barrier must be inserted.

Plus the barrier also allows [submit req a, submit req b] and still
counts on ordering if either one of them is a barrier. It doesn't have
to be sync, like (2).

> --- so you can make two bitflags that differentiate these models. In
> the current kernel, models (1) and (2) cannot be differentiated in any
> way, and (3) can be differentiated only after a trial write, which
> won't guarantee that (3) stays valid afterwards.

But what's the point? Basically no devices are naturally ordered by
default. Either you need cache flushes, or you need to tell the device
not to reorder on a per-command basis.

> If someone implements "write barriers aren't supported => run fsck",
> then a lot of systems start fscking needlessly (for example those
> using md or dm without a write cache) and become inoperative for a
> long time because of that. So no one can really implement this logic,
> and filesystems don't run fsck at all when operated over a device that
> doesn't support ordering. So you get data corruption if you get a
> crash on those devices.

Nobody is suggesting that, it's just not a feasible approach. But you
have to warn if you don't know whether the device provides the ordering
guarantee you expect to provide consistency and integrity.

> The barrier can be cancelled with -EOPNOTSUPP at any time. Andi Kleen
> submitted a patch that implements failing barriers for device mapper,
> and he says that md-raid1 does the same thing.

You are right, if a device is reconfigured beneath you it may very well
begin to return -EOPNOTSUPP much later. I didn't take that into
account, I was considering only "plain" devices.

> Filesystems handle these randomly failed barriers, but the downside is
> that they must not submit any request concurrently with the barrier.
> Also, that -EOPNOTSUPP restarting code is really crap: the request
> cannot be restarted from bi_end_io, so bi_end_io needs to hand it off
> to another thread for a retry without the barrier.

It can, but it requires you to operate at the request level. So for
file systems that is problematic; it won't work, of course. It would
not be THAT hard to provide a helper to reissue the request. Not that
pretty, but...

> See this patch: http://lkml.org/lkml/2008/12/4/433 (and the associated
> thread). The patch is silly, but it shows what is really happening and
> what the filesystem must be prepared to deal with.

It's not that silly; we should add special barrier failing to the
CONFIG_FAIL stuff. You'd definitely want to exercise that in the file
system.

> I definitely agree that barriers shouldn't fail. So remove that
> -EOPNOTSUPP error code altogether, make barriers always pass down to
> all kinds of devices, and inform the caller via queue flags that the
> device doesn't support ordering.

Not a queue flag. Make it succeed to get rid of the whole retry
business, but flag the bio with the information anyway.
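
"Flag the bio" already has a precedent in this kernel:
blkdev_issue_flush() waits for its empty barrier and then inspects bio
flags rather than relying on the error code alone. Paraphrased below,
not the verbatim block/blk-barrier.c code (example_* names are mine):

	#include <linux/bio.h>
	#include <linux/blkdev.h>
	#include <linux/completion.h>
	#include <linux/fs.h>

	static void example_end_empty_barrier(struct bio *bio, int err)
	{
		complete(bio->bi_private);
	}

	static int example_issue_flush(struct block_device *bdev)
	{
		DECLARE_COMPLETION_ONSTACK(wait);
		struct bio *bio;
		int ret = 0;

		bio = bio_alloc(GFP_KERNEL, 0);
		bio->bi_end_io = example_end_empty_barrier;
		bio->bi_private = &wait;
		bio->bi_bdev = bdev;
		submit_bio(WRITE_BARRIER, bio);

		wait_for_completion(&wait);

		if (bio_flagged(bio, BIO_EOPNOTSUPP))
			ret = -EOPNOTSUPP;	/* completed, "can't order" flagged */
		else if (!bio_flagged(bio, BIO_UPTODATE))
			ret = -EIO;

		bio_put(bio);
		return ret;
	}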
On Tue, 31 Mar 2009, Jens Axboe wrote:

> Nobody is suggesting that, it's just not a feasible approach.

I am saying that the filesystem should run fsck if a journaled
filesystem is mounted on an unsafe device and a crash happens.

> But you have to warn if you don't know whether the device provides the
> ordering guarantee you expect to provide consistency and integrity.

The warning about missing barriers (or other actions) should be printed
only if the write cache is enabled. But there's no way for a filesystem
on top of several dm or md layers to find out whether the disk is
running with hdparm -W 0 or hdparm -W 1.

> You are right, if a device is reconfigured beneath you it may very
> well begin to return -EOPNOTSUPP much later. I didn't take that into
> account, I was considering only "plain" devices.
>
> [...]
>
> It can, but it requires you to operate at the request level. So for
> file systems that is problematic; it won't work, of course. It would
> not be THAT hard to provide a helper to reissue the request. Not that
> pretty, but...

And it makes barriers useless for ordering.

The filesystem can't do [submit req a], [submit barrier req b],
[submit req c] and assume that the requests will be ordered. If [b]
fails with -EOPNOTSUPP, [a] and [c] could already have been reordered,
and the data corruption has already happened. Even if you catch [b]'s
error and resubmit it as a non-barrier request, it's too late.

So, as a result of this complication, all the existing filesystems send
just one barrier request at a time and do not try to overlap it with
any other write requests.

So I'm wondering why Linux developers designed a barrier interface with
a complex specification and a complex implementation, when the
interface is useless for providing any request ordering and is no
better than the q->issue_flush_fn method or whatever was there before.
Obviously, the whole barrier thing was designed by a person who never
used it in a filesystem.

> Not a queue flag. Make it succeed to get rid of the whole retry
> business, but flag the bio with the information anyway.

That's a possibility too.

Mikulas
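
The overlapped submission Mikulas says filesystems cannot risk, spelled
out as code (an illustrative sketch only; WRITE_BARRIER is the
2.6.29-era request-flag macro):

	#include <linux/bio.h>
	#include <linux/fs.h>

	/* What a journal would like to do for a non-blocking commit: */
	static void unsafe_overlapped_commit(struct bio *a, struct bio *b,
					     struct bio *c)
	{
		submit_bio(WRITE, a);		/* journal blocks */
		submit_bio(WRITE_BARRIER, b);	/* commit record */
		submit_bio(WRITE, c);		/* post-commit writes */

		/* If b later completes with -EOPNOTSUPP, a and c may
		 * already have passed each other; reissuing b without the
		 * barrier flag cannot restore the ordering. Hence real
		 * filesystems drain all outstanding I/O and issue
		 * barriers one at a time. */
	}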
On Thu, Apr 02 2009, Mikulas Patocka wrote:
> 
> On Tue, 31 Mar 2009, Jens Axboe wrote:
> 
> > On Mon, Mar 30 2009, Mikulas Patocka wrote:
> > > On Thu, 26 Mar 2009, Jens Axboe wrote:
> > > > On Wed, Mar 25 2009, Mikulas Patocka wrote:
> > > > > > > So I think there should be a flag (this device does/doesn't
> > > > > > > support data consistency) that the journaled filesystems can
> > > > > > > use to mark the disk dirty for fsck. And if you implement this
> > > > > > > flag, you can always accept barriers, for all kinds of
> > > > > > > devices, regardless of whether they support consistency. You
> > > > > > > can then get rid of that -EOPNOTSUPP and simplify filesystem
> > > > > > > code, because they'd no longer need two commit paths and a
> > > > > > > clumsy way to restart -EOPNOTSUPPed requests.
> > > > > > 
> > > > > > And my point is that this case isn't interesting, because most
> > > > > > setups don't guarantee proper ordering.
> > > > > 
> > > > > If the ordering isn't guaranteed, the filesystem should know about
> > > > > it, and mark the partition for fsck. That's why I'm suggesting to
> > > > > use a flag for that. That flag could also be propagated up through
> > > > > md and dm.
> > > > 
> > > > We can do that, not a problem. The problem is that ordering is
> > > > almost never preserved; SCSI does not use ordered tags because it
> > > > hasn't verified that its error path doesn't reorder by mistake. So
> > > > right now you can basically use 'false' as that flag.
> > > 
> > > There are three ordering guarantees:
> > > 
> > > 1. - nothing (for devices with a write cache but without cache
> > > control)
> > > 
> > > 2. - non-cached ordering: the sequence [submit req a, end req a,
> > > submit req b, end req b] will enforce the ordering. It is guaranteed
> > > that when a request ends successfully, it is on the medium. This is
> > > what all the filesystems, md and dm assume about disks. This
> > > consistency model was in use long before barriers came in.
> > > 
> > > 3. - barrier ordering: ordering is done with barriers; [submit req a,
> > > end req a, submit req b, end req b] won't guarantee the ordering of a
> > > and b, a barrier must be inserted.
> > 
> > Plus the barrier also allows [submit req a, submit req b] while still
> > counting on ordering if either one of them is a barrier. It doesn't
> > have to be synchronous, like (2).
> > 
> > > --- so you can make two bit flags that differentiate these models. In
> > > the current kernel, models (1) and (2) cannot be differentiated in
> > > any way. (3) can be differentiated only after a trial write, and even
> > > that won't guarantee that (3) stays valid later.
> > 
> > But what's the point? Basically no devices are naturally ordered by
> > default. Either you need cache flushes, or you need to tell the device
> > not to reorder on a per-command basis.
> > 
> > > > > The reasoning "write barriers aren't supported => the device
> > > > > doesn't guarantee consistency" isn't valid.
> > > > 
> > > > It's valid in the sense that it's the only RELIABLE primitive we
> > > > have. Are you really suggesting that we just assume any device is
> > > > fully ordered, unless proven otherwise?
> > > 
> > > If someone implements "write barriers aren't supported => run fsck",
> > > then a lot of systems start fscking needlessly (for example those
> > > using md or dm without a write cache) and become inoperable for a
> > > long time because of that. So no one can really implement this logic,
> > > and filesystems don't run fsck at all when operated over a device
> > > that doesn't support ordering. So you get data corruption if you
> > > crash on those devices.
> > 
> > Nobody is suggesting that, it's just not a feasible approach. But you
> 
> I am saying that the filesystem should run fsck if a journaled filesystem
> is mounted on an unsafe device and a crash happens.
> 
> > have to warn if you don't know whether it provides the ordering
> > guarantee you expect to provide consistency and integrity.
> 
> The warning about missing barriers (or other actions) should be printed
> only if the write cache is enabled. But there's no way a filesystem on
> top of several dm or md layers can find out whether the disk is running
> with hdparm -W 0 or hdparm -W 1.

Right, you can't possibly know that. Hence we have to print the warning.

> > > The barrier can be cancelled with -EOPNOTSUPP at any time. Andi Kleen
> > > submitted a patch that implements failing barriers for device mapper,
> > > and he says that md-raid1 does the same thing.
> > 
> > You are right, if a device is reconfigured beneath you it may very well
> > begin to return -EOPNOTSUPP much later. I didn't take that into
> > account, I was considering only "plain" devices.
> > 
> > > Filesystems handle these randomly failed barriers, but the downside
> > > is that they must not submit any request concurrently with the
> > > barrier. Also, that -EOPNOTSUPP restarting code is really crap: the
> > > request cannot be restarted from bi_end_io, so bi_end_io needs to
> > > hand it off to another thread for a retry without the barrier.
> > 
> > It can, but it requires you to operate at the request level. So for
> > file systems that is problematic; it won't work, of course. It would
> > not be THAT hard to provide a helper to reissue the request. Not that
> > pretty, but...
> 
> And it makes barriers useless for ordering.
> 
> The filesystem can't do [submit req a], [submit barrier req b], [submit
> req c] and assume that the requests will be ordered. If [b] fails with
> -EOPNOTSUPP, [a] and [c] could already have been reordered and the data
> corruption has already happened. Even if you catch [b]'s error and
> resubmit it as a non-barrier request, it's too late.
> 
> So, as a result of this complication, all the existing filesystems send
> just one barrier request and do not try to overlap it with any other
> write requests.
> 
> So I'm wondering why Linux developers designed a barrier interface with
> a complex specification and a complex implementation, when the interface
> is useless for providing any request ordering and is no better than the
> q->issue_flush_fn method or whatever was there before. Obviously, the
> whole barrier thing was designed by a person who never used it in a
> filesystem.

That's not quite true, it was done in conjunction with file system
people. At a certain level, we are restricted by what the hardware can
actually do. It's certainly possible to make sure your storage stack can
support barriers and be safe in that regard, but it's certainly also true
that reconfiguring devices may void that guarantee. So it's not perfect,
but it's the best we can do. The worst part is that it's virtually
impossible to inform upper layers of such limitations.

If we get rid of -EOPNOTSUPP and just warn in such cases, then you should
never see -EOPNOTSUPP in the above sequence. You may not actually be
safe, hence we print a warning. It'll also make the whole thing a lot
less complex.

And to wrap up the history of barriers: there was NOTHING before.
->issue_flush_fn is a later addition to just force a flush for fsync()
and friends; the original implementation was just a data bio/bh with
barrier semantics, permitting no reordering before or after the data
passed.

Nobody was interested in barriers when they were done. Nobody. The fact
that it has taken 6 years or so for them to emerge as a hot topic for
data consistency should make that quite obvious. The original
implementation was basically a joint effort, with Chris on the reiser
side, EMC as the hw vendor, and me doing the block implementation.
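To make the -EOPNOTSUPP handling discussed above concrete, here is a
minimal sketch against the 2.6.29-era bio API of the synchronous fallback
that filesystems of the time had to implement. It is illustrative only:
commit_ctx, commit_end_io and write_commit_block are made-up names, and
the resubmission assumes the bio was rejected before any of its data was
transferred, which is what the early rejection path in the block layer
does.

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/completion.h>
#include <linux/fs.h>

struct commit_ctx {
	struct completion done;
	int error;
};

/* runs in interrupt context: no resubmission allowed from here */
static void commit_end_io(struct bio *bio, int error)
{
	struct commit_ctx *ctx = bio->bi_private;

	ctx->error = error;
	complete(&ctx->done);
}

static int write_commit_block(struct bio *bio)
{
	struct commit_ctx ctx;

	init_completion(&ctx.done);
	bio->bi_end_io = commit_end_io;
	bio->bi_private = &ctx;

	submit_bio(WRITE_BARRIER, bio);
	wait_for_completion(&ctx.done);

	if (ctx.error == -EOPNOTSUPP) {
		/*
		 * The device (or a dm/md layer under it) rejected the
		 * barrier.  Retry as a plain write: ordering against any
		 * other in-flight request is already lost, which is why
		 * filesystems never overlap a barrier with other writes.
		 */
		clear_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
		init_completion(&ctx.done);
		submit_bio(WRITE, bio);
		wait_for_completion(&ctx.done);
	}
	return ctx.error;
}

Because the caller waits for the barrier before issuing anything else,
nothing can be reordered around the retry; this is exactly the "send just
one barrier and never overlap it" pattern criticized in the message above.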
On 04/03/2009 04:11 AM, Jens Axboe wrote:
> That's not quite true, it was done in conjunction with file system
> people. At a certain level, we are restricted by what the hardware can
> actually do. It's certainly possible to make sure your storage stack can
> support barriers and be safe in that regard, but it's certainly also
> true that reconfiguring devices may void that guarantee. So it's not
> perfect, but it's the best we can do. The worst part is that it's
> virtually impossible to inform upper layers of such limitations.
>
> If we get rid of -EOPNOTSUPP and just warn in such cases, then you
> should never see -EOPNOTSUPP in the above sequence. You may not actually
> be safe, hence we print a warning. It'll also make the whole thing a lot
> less complex.
>
> And to wrap up the history of barriers: there was NOTHING before.
> ->issue_flush_fn is a later addition to just force a flush for fsync()
> and friends; the original implementation was just a data bio/bh with
> barrier semantics, permitting no reordering before or after the data
> passed.
>
> Nobody was interested in barriers when they were done. Nobody. The fact
> that it has taken 6 years or so for them to emerge as a hot topic for
> data consistency should make that quite obvious. The original
> implementation was basically a joint effort, with Chris on the reiser
> side, EMC as the hw vendor, and me doing the block implementation.

And I will restate that back at EMC we tested the original barriers (with
reiserfs mostly, a bit on ext3 and ext2) and saw a significant reduction
in file system integrity issues after power loss.

The vantage point I had at EMC while testing and deploying the original
barrier work done by Jens and Chris was pretty unique - full ability to
do root-cause analysis of any component failure when really needed, a
huge installed base which could send information home on a regular basis
about crashes/fsck instances/etc and the ability (with customer
permission) to dial into any box and diagnose issues remotely. Not to
mention access to drive vendors, to pressure them to make the flushes
more robust. The application was also able to validate that all
acknowledged writes were consistent.

Barriers do work as we have them, but as others have mentioned, it is not
a "free" win - fsync will actually move your data safely out to
persistent storage for a huge percentage of real users (including every
ATA/S-ATA and SAS drive I was able to test). The file systems I monitored
in production use without barriers were much less reliable.

As others have noted, some storage does not need barriers or flushes
(high end arrays, drives with no volatile write cache) and some need them
but stink (low cost USB flash sticks?), so warning is a good thing to
do...

ric

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On Sat, Apr 04, 2009 at 11:20:35AM -0400, Ric Wheeler wrote:
> Barriers do work as we have them, but as others have mentioned, it is
> not a "free" win - fsync will actually move your data safely out to
> persistent storage for a huge percentage of real users (including every
> ATA/S-ATA and SAS drive I was able to test). The file systems I
> monitored in production use without barriers were much less reliable.

The problem is that, as long as you're not under memory pressure and not
pushing the filesystem heavily, ext3 didn't corrupt *that* often without
barriers. So people got away with it "most of the time" --- just as
applications replacing files by rewriting them in place using truncate
and without fsync would "usually" not lose data after a crash if they
were using ext3 in data=ordered mode. This caused people to get
lazy/sloppy.

So yes, barriers were something that was largely ignored for a long time.
After all, in a server environment with UPSes and without crappy
proprietary video drivers, Linux systems didn't crash that often anyway.
So you really needed a large base of systems, and the ability to
root-cause failures such as Ric had at EMC, to see the problem.

Even now, the reason why ext3 doesn't have barriers enabled by default
(although we did make them the default for ext4) is that Andrew doesn't
believe Chris's replication case is likely to be true for most users in
practice, and he's concerned about the performance degradation of
barriers. He's basically depending on the fact that "usually" you can get
away without using barriers. Sigh....

- Ted

P.S. Of course, distributions should feel free to consider changing the
default on their kernels. SLES already has, if memory serves correctly. I
don't know if RHEL has yet.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On 04/04/2009 06:28 PM, Theodore Tso wrote:
> On Sat, Apr 04, 2009 at 11:20:35AM -0400, Ric Wheeler wrote:
>> Barriers do work as we have them, but as others have mentioned, it is
>> not a "free" win - fsync will actually move your data safely out to
>> persistent storage for a huge percentage of real users (including
>> every ATA/S-ATA and SAS drive I was able to test). The file systems I
>> monitored in production use without barriers were much less reliable.
>
> The problem is that, as long as you're not under memory pressure and
> not pushing the filesystem heavily, ext3 didn't corrupt *that* often
> without barriers. So people got away with it "most of the time" ---
> just as applications replacing files by rewriting them in place using
> truncate and without fsync would "usually" not lose data after a crash
> if they were using ext3 in data=ordered mode. This caused people to
> get lazy/sloppy.
>
> So yes, barriers were something that was largely ignored for a long
> time. After all, in a server environment with UPSes and without crappy
> proprietary video drivers, Linux systems didn't crash that often
> anyway. So you really needed a large base of systems, and the ability
> to root-cause failures such as Ric had at EMC, to see the problem.

One thing to point out here is that there are a lot of "servers" in high
end data centers that do not have UPS backup. Those racks full of 1U and
2U boxes that are used to make "grids", "clouds" and so on are often
built with as much gear as you can stuff in a rack - no batteries or UPS
to be seen, so they are really quite similar to the normal desktop or
home systems that we run at home :-)

ric

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On Sun, Apr 5, 2009 at 7:54 AM, Ric Wheeler <ricwheeler@gmail.com> wrote:
> One thing to point out here is that there are a lot of "servers" in
> high end data centers that do not have UPS backup. Those racks full of
> 1U and 2U boxes that are used to make "grids", "clouds" and so on are
> often built with as much gear as you can stuff in a rack - no batteries
> or UPS to be seen, so they are really quite similar to the normal
> desktop or home systems that we run at home :-)

These days even bargain basement data centers provide UPS functionality
for you, via generator backup and A/B power.

Lee

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On 04/05/2009 06:14 PM, Lee Revell wrote:
> On Sun, Apr 5, 2009 at 7:54 AM, Ric Wheeler <ricwheeler@gmail.com> wrote:
>> One thing to point out here is that there are a lot of "servers" in
>> high end data centers that do not have UPS backup. Those racks full of
>> 1U and 2U boxes that are used to make "grids", "clouds" and so on are
>> often built with as much gear as you can stuff in a rack - no
>> batteries or UPS to be seen, so they are really quite similar to the
>> normal desktop or home systems that we run at home :-)
>
> These days even bargain basement data centers provide UPS functionality
> for you, via generator backup and A/B power.
>
> Lee

In my experience, you will see multiple customers with and without UPS
backup. I have certainly dealt personally with many data centers that
lost power without them, and I see array vendors that continue to build
in their own batteries...

ric

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
> Even now, the reason why ext3 doesn't have barriers enabled by default
> (although we did make them the default for ext4) is that Andrew doesn't
> believe Chris's replication case is likely to be true for most users in
> practice, and he's concerned about the performance degradation of
> barriers. He's basically depending on the fact that "usually" you can
> get away without using barriers. Sigh....

What is the performance degradation of barriers? If the disk doesn't have
a write cache, the performance of a barrier-enabled filesystem should be
equal to or better than the performance of the same filesystem not using
barriers. If barriers degrade performance, there is something seriously
broken, either in the filesystem (XFS...) or in the block layer.

If the disk has a write cache and you disable barriers, you might get
some performance improvement. But you are getting this performance
improvement from the fact that the disk illegally reorders writes that
should be ordered. And you are going to damage your data on power
failure. --- definitely, very few admins want this "high-performance &
damage" setup (maybe for /tmp or /var/log?) --- such a condition should
only be enabled with the admin fully knowing what's going on. And where
should the admin get the knowledge? In the hdparm manpage, there is just:

-W Get/set the IDE/SATA drive's write-caching feature.

> - Ted
>
> P.S. Of course, distributions should feel free to consider changing the
> default on their kernels. SLES already has, if memory serves correctly.
> I don't know if RHEL has yet.

RHEL doesn't enable the write cache by default. And it doesn't use
barriers (it uses lvm2/device mapper and they won't get through).

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
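For context on who decides what a barrier costs: this is roughly how a
request-based driver of that era declared its ordering mode to the block
layer, loosely modeled on what drivers/scsi/sd.c did at the time.
example_prepare_flush and example_set_ordered are illustrative names, and
the flush command itself is left driver-specific.

#include <linux/blkdev.h>

/* driver-specific: turn rq into a cache-flush command for the hardware */
static void example_prepare_flush(struct request_queue *q,
				  struct request *rq)
{
	/* e.g. fill in a SYNCHRONIZE CACHE CDB for a SCSI disk */
}

static void example_set_ordered(struct request_queue *q, int wce)
{
	if (wce)
		/* volatile write cache: drain, then flush around a barrier */
		blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH,
				  example_prepare_flush);
	else
		/* no volatile cache: draining queued requests is enough */
		blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
}

With the cache off, QUEUE_ORDERED_DRAIN means a barrier costs no more
than draining the queue, which is the "equal or better" case Mikulas
describes; with the cache on, every barrier pays for the flushes.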
On Sun, 5 Apr 2009, Lee Revell wrote:
> On Sun, Apr 5, 2009 at 7:54 AM, Ric Wheeler <ricwheeler@gmail.com> wrote:
> > One thing to point out here is that there are a lot of "servers" in
> > high end data centers that do not have UPS backup. Those racks full
> > of 1U and 2U boxes that are used to make "grids", "clouds" and so on
> > are often built with as much gear as you can stuff in a rack - no
> > batteries or UPS to be seen, so they are really quite similar to the
> > normal desktop or home systems that we run at home :-)
>
> These days even bargain basement data centers provide UPS functionality
> for you, via generator backup and A/B power.
>
> Lee

I have seen an installation that had generator backup, but with another
serious flaw --- if someone plugged in a computer that caused a short
circuit, it tripped the circuit breaker for the whole rack :)

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
> And I will restate that back at EMC we tested the original barriers
> (with reiserfs mostly, a bit on ext3 and ext2) and saw a significant
> reduction in file system integrity issues after power loss.

You saw that the barrier-enabled filesystem was worse than the same
filesystem without barriers? And what kind of issues were they? Disks
writing damaged sectors if powered off in the middle of a write? Or data
corruption due to bugs in ReiserFS?

> The vantage point I had at EMC while testing and deploying the original
> barrier work done by Jens and Chris was pretty unique - full ability to
> do root-cause analysis of any component failure when really needed, a
> huge installed base which could send information home on a regular
> basis about crashes/fsck instances/etc and the ability (with customer
> permission) to dial into any box and diagnose issues remotely. Not to
> mention access to drive vendors, to pressure them to make the flushes
> more robust. The application was also able to validate that all
> acknowledged writes were consistent.
>
> Barriers do work as we have them, but as others have mentioned, it is
> not a "free" win - fsync will actually move your data safely out to
> persistent storage for a huge percentage of real users (including every
> ATA/S-ATA and SAS drive I was able to test). The file systems I
> monitored in production use without barriers were much less reliable.

With write cache or without write cache?

With cache and without barriers the system is violating the
specification. There simply could be data corruption ... and it will
eventually happen.

If you got corruption without cache and without barriers, there's a bug
and it needs to be investigated.

> As others have noted, some storage does not need barriers or flushes
> (high end arrays, drives with no volatile write cache) and some need
> them but stink (low cost USB flash sticks?), so warning is a good thing
> to do...
>
> ric

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
> > So I'm wondering why Linux developers designed a barrier interface
> > with a complex specification and a complex implementation, when the
> > interface is useless for providing any request ordering and is no
> > better than the q->issue_flush_fn method or whatever was there
> > before. Obviously, the whole barrier thing was designed by a person
> > who never used it in a filesystem.
> 
> That's not quite true, it was done in conjunction with file system
> people.
> ...
> Nobody was interested in barriers when they were done. Nobody.

That's a contradiction :-)

Some time ago I wrote a piece of code that uses barriers for performance
enhancement
(http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/download/spadfs-0.9.10.tar.gz).

The trick is basically to take a lock that prevents filesystem-wide
updates, submit the remaining writes (don't wait), submit the barrier
that causes the transition to the new generation (don't wait) and release
the lock. The lock is held for the minimum time; no I/O is waited for
inside the lock. This trick can't be done without barriers; without
barriers you'd have to wait inside the lock.

And the requirement for this code is that barriers are supported for the
whole lifetime of the filesystem --- which is what the Linux kernel
doesn't support! If barrier support is lost, consistency is damaged.

With barriers, the code does [submit A, submit barrier B, submit C].
If you don't have barriers, you must modify this sequence to: [submit A,
wait for A endio, submit B, wait for B endio, submit C]

--- and now you are getting the point why failing barriers can't ever
work --- by the time request B completes and you find out that the device
lost barrier support, you realize that you should have inserted the waits
in the past --- but it's too late, there is no way to insert them
retroactively.

AFAIK this is the only piece of code that uses barriers to improve
performance. All the other filesystems use barriers just as a way to
flush the cache and don't overlap the barrier request with any other
requests.

So there are two ways:

- either support only what all in-kernel filesystems do: using barrier
requests to flush the hw cache. You can remove support for barriers with
data and leave just the zero-data barrier, and you can remove the
ordering restrictions. In-kernel filesystems never overlap a barrier with
another metadata request (see above why such an overlap can't work), so
you can freely reorder zero-data barriers and simplify the code ...
because all the requests that could be submitted in parallel with the
barrier are either for a different partition or non-metadata requests to
the same partition from prefetch, direct I/O or so.

- or you can allow barriers to be used for purposes such as the one I
described. And then there must be a clean indicator: "this device
supports barriers *and*will*support*them*in*the*future*". Currently there
is no such indicator.

Mikulas

> --
> Jens Axboe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
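A schematic of the pattern described above, in kernel-style C. This is
not code from spadfs; fs_generation and commit_generation are
illustrative names. The point is that the commit lock covers only
submission, never I/O completion, which is exactly what stops working if
the barrier can fail with -EOPNOTSUPP after the lock is dropped.

#include <linux/bio.h>
#include <linux/fs.h>
#include <linux/mutex.h>

struct fs_generation {
	struct mutex commit_lock;	/* blocks filesystem-wide updates */
};

static void commit_generation(struct fs_generation *fs,
			      struct bio **pending, int npending,
			      struct bio *transition)
{
	int i;

	mutex_lock(&fs->commit_lock);

	for (i = 0; i < npending; i++)
		submit_bio(WRITE, pending[i]);		/* don't wait */

	/* the barrier orders the writes above against everything after */
	submit_bio(WRITE_BARRIER, transition);		/* don't wait */

	mutex_unlock(&fs->commit_lock);

	/*
	 * Without barriers, the transition write could only be submitted
	 * after waiting for every pending bio to complete, with those
	 * waits held inside the lock.  If the barrier later fails with
	 * -EOPNOTSUPP, those waits cannot be inserted retroactively.
	 */
}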
On Wed, Apr 08 2009, Mikulas Patocka wrote:
> > > So I'm wondering why Linux developers designed a barrier interface
> > > with a complex specification and a complex implementation, when the
> > > interface is useless for providing any request ordering and is no
> > > better than the q->issue_flush_fn method or whatever was there
> > > before. Obviously, the whole barrier thing was designed by a person
> > > who never used it in a filesystem.
> > 
> > That's not quite true, it was done in conjunction with file system
> > people.
> > ...
> > Nobody was interested in barriers when they were done. Nobody.
> 
> That's a contradiction :-)

So we have sunk to this level now, snip-and-edit citing?

> Some time ago I wrote a piece of code that uses barriers for
> performance enhancement
> (http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/download/spadfs-0.9.10.tar.gz).
> 
> The trick is basically to take a lock that prevents filesystem-wide
> updates, submit the remaining writes (don't wait), submit the barrier
> that causes the transition to the new generation (don't wait) and
> release the lock. The lock is held for the minimum time; no I/O is
> waited for inside the lock. This trick can't be done without barriers;
> without barriers you'd have to wait inside the lock.
> 
> And the requirement for this code is that barriers are supported for
> the whole lifetime of the filesystem --- which is what the Linux kernel
> doesn't support! If barrier support is lost, consistency is damaged.
> 
> With barriers, the code does [submit A, submit barrier B, submit C].
> If you don't have barriers, you must modify this sequence to: [submit
> A, wait for A endio, submit B, wait for B endio, submit C]
> 
> --- and now you are getting the point why failing barriers can't ever
> work --- by the time request B completes and you find out that the
> device lost barrier support, you realize that you should have inserted
> the waits in the past --- but it's too late, there is no way to insert
> them retroactively.
> 
> AFAIK this is the only piece of code that uses barriers to improve
> performance. All the other filesystems use barriers just as a way to
> flush the cache and don't overlap the barrier request with any other
> requests.
> 
> So there are two ways:
> 
> - either support only what all in-kernel filesystems do: using barrier
> requests to flush the hw cache. You can remove support for barriers
> with data and leave just the zero-data barrier, and you can remove the
> ordering restrictions. In-kernel filesystems never overlap a barrier
> with another metadata request (see above why such an overlap can't
> work), so you can freely reorder zero-data barriers and simplify the
> code ... because all the requests that could be submitted in parallel
> with the barrier are either for a different partition or non-metadata
> requests to the same partition from prefetch, direct I/O or so.
> 
> - or you can allow barriers to be used for purposes such as the one I
> described. And then there must be a clean indicator: "this device
> supports barriers *and*will*support*them*in*the*future*". Currently
> there is no such indicator.

I'm about to leave, so I won't comment on the above. But the below is
basically what we pretend right now, and I think that is perfectly fine.
If you go and reconfigure your device and it suddenly doesn't support
barriers anymore, call the doctor and tell him that your foot hurts when
you slam it in the door. Don't do that, it's pretty simple. We already
agreed to kill the -EOPNOTSUPP and just pretend it always works, with a
notifier that we MAY not be safe.

Not much point in harping on the same thing over and over again.
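What "kill the -EOPNOTSUPP and just warn" might look like is sketched
below. This is one reading of the proposal, not a merged patch;
example_warn_unordered is an illustrative name, and the placement (called
from the request submission path instead of failing the bio) is an
assumption.

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/kernel.h>

static void example_warn_unordered(struct request_queue *q,
				   struct bio *bio)
{
	static int warned;

	if (bio_barrier(bio) && q->next_ordered == QUEUE_ORDERED_NONE &&
	    !warned) {
		warned = 1;
		printk(KERN_WARNING "barrier issued to a queue without "
		       "ordering support; integrity across a power "
		       "failure is not guaranteed\n");
	}
	/* the bio is then passed on instead of being failed */
}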
On Wed, 08 Apr 2009, Mikulas Patocka wrote:
> I have seen an installation that had generator backup, but with another
> serious flaw --- if someone plugged in a computer that caused a short
> circuit, it tripped the circuit breaker for the whole rack :)

You will be hard-pressed to find datacenters where the above is NOT true
:-)

Although usually one has TWO independent power feeds per rack, so a short
is likely to bring down just half of it, and any equipment properly set
up with dual independent power feeds should survive...
On Wed, Apr 08, 2009 at 09:37:56AM -0400, Mikulas Patocka wrote:
> The trick is basically to take a lock that prevents filesystem-wide
> updates, submit the remaining writes (don't wait), submit the barrier
> that causes the transition to the new generation (don't wait) and
> release the lock. The lock is held for the minimum time; no I/O is
> waited for inside the lock. This trick can't be done without barriers;
> without barriers you'd have to wait inside the lock.

Woo! You just described the technique XFS uses to guarantee ordering of
metadata and log IO (i.e. asynchronous barriers). ;)

> AFAIK this is the only piece of code that uses barriers to improve
> performance. All the other filesystems use barriers just as a way to
> flush the cache and don't overlap the barrier request with any other
> requests.

The problem is, disks often slow down when you issue barriers. It doesn't
matter what purpose you are using barriers for; they change the order in
which the disk would retire the I/O, and hence they change performance.
Issue enough barriers and performance will drop noticeably.

In the case of XFS, we need to guarantee the ordering of every single log
write w.r.t. metadata writeback. Hence barriers are issued relatively
frequently (several a second) and so disks operate at spindle speed
rather than cache speed. It is the frequency of barrier IO that slows XFS
down, not the way barriers are implemented. Your technique will show
exactly the same behaviour if you issue barriers frequently enough.

Cheers,

Dave.
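A schematic of the access pattern Dave describes (generic, not XFS code):
the ordering comes without completion waits, but on a disk with a
volatile write cache every barrier still implies a queue drain plus a
cache flush, so issuing one per log write bounds throughput by media
speed. example_log_write is an illustrative name.

#include <linux/bio.h>
#include <linux/fs.h>

/*
 * One log write per transaction commit, with barrier semantics:
 * metadata writeback queued behind it can never pass the log write,
 * so no waits are needed for ordering.  Issued several times a
 * second, each of these costs a drain plus flush on cached disks.
 */
static void example_log_write(struct bio *log_bio)
{
	submit_bio(WRITE_BARRIER, log_bio);
}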
On Wed, 2009-04-08 at 09:37 -0400, Mikulas Patocka wrote:
> So there are two ways:
>
> - either support only what all in-kernel filesystems do: using barrier
> requests to flush the hw cache. You can remove support for barriers
> with data and leave just the zero-data barrier, and you can remove the
> ordering restrictions. In-kernel filesystems never overlap a barrier
> with another metadata request (see above why such an overlap can't
> work), so you can freely reorder zero-data barriers and simplify the
> code ... because all the requests that could be submitted in parallel
> with the barrier are either for a different partition or non-metadata
> requests to the same partition from prefetch, direct I/O or so.
>
> - or you can allow barriers to be used for purposes such as the one I
> described. And then there must be a clean indicator: "this device
> supports barriers *and*will*support*them*in*the*future*". Currently
> there is no such indicator.

I'm afraid that expecting barriers forever in the future isn't completely
compatible with dm or md. Both of these allow the storage to change over
time, and the filesystem needs to handle this without corruption.

-chris

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Lee Revell wrote:
> On Sun, Apr 5, 2009 at 7:54 AM, Ric Wheeler <ricwheeler@gmail.com> wrote:
>> One thing to point out here is that there are a lot of "servers" in
>> high end data centers that do not have UPS backup. Those racks full of
>> 1U and 2U boxes that are used to make "grids", "clouds" and so on are
>> often built with as much gear as you can stuff in a rack - no
>> batteries or UPS to be seen, so they are really quite similar to the
>> normal desktop or home systems that we run at home :-)
>
> These days even bargain basement data centers provide UPS functionality
> for you, via generator backup and A/B power.
>
> Lee

In that case, you can turn off barriers. It's still no excuse, IMHO, for
us to ship, by default, a journaling filesystem configuration which is
known to corrupt on power loss.

-Eric

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On Wed, Apr 08, 2009 at 09:27:20PM -0400, Chris Mason wrote:
> I'm afraid that expecting barriers forever in the future isn't
> completely compatible with dm or md. Both of these allow the storage to
> change over time, and the filesystem needs to handle this without
> corruption.

The missing piece of the jigsaw is notification to upper layers *ahead*
of such reconfigurations. (We have similar notification issues when
devices are resized - the current approach is to leave the sysadmin
responsible for co-ordinating changes through userspace.)

Alasdair
Mikulas Patocka wrote:
>> And I will restate that back at EMC we tested the original barriers
>> (with reiserfs mostly, a bit on ext3 and ext2) and saw a significant
>> reduction in file system integrity issues after power loss.
>
> You saw that the barrier-enabled filesystem was worse than the same
> filesystem without barriers? And what kind of issues were they? Disks
> writing damaged sectors if powered off in the middle of a write? Or
> data corruption due to bugs in ReiserFS?

No - I was not being clear. We saw a reduction in issues, which is a
confusing way to say that it was significantly better with barriers
enabled, for both ext3 & reiserfs.

>> The vantage point I had at EMC while testing and deploying the
>> original barrier work done by Jens and Chris was pretty unique - full
>> ability to do root-cause analysis of any component failure when really
>> needed, a huge installed base which could send information home on a
>> regular basis about crashes/fsck instances/etc and the ability (with
>> customer permission) to dial into any box and diagnose issues
>> remotely. Not to mention access to drive vendors, to pressure them to
>> make the flushes more robust. The application was also able to
>> validate that all acknowledged writes were consistent.
>>
>> Barriers do work as we have them, but as others have mentioned, it is
>> not a "free" win - fsync will actually move your data safely out to
>> persistent storage for a huge percentage of real users (including
>> every ATA/S-ATA and SAS drive I was able to test). The file systems I
>> monitored in production use without barriers were much less reliable.
>
> With write cache or without write cache?

Write cache enabled. Barriers are off when the write cache is disabled -
we probe the drive's write cache and enable barriers at mount time if and
only if the write cache is on.

> With cache and without barriers the system is violating the
> specification. There simply could be data corruption ... and it will
> eventually happen.
>
> If you got corruption without cache and without barriers, there's a bug
> and it needs to be investigated.
>
>> As others have noted, some storage does not need barriers or flushes
>> (high end arrays, drives with no volatile write cache) and some need
>> them but stink (low cost USB flash sticks?), so warning is a good
>> thing to do...
>>
>> ric
>
> Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel