Message ID: 20150720151849.GA2282@redhat.com (mailing list archive)
State: New, archived
On Mon, Jul 20, 2015 at 11:18:49AM -0400, Mike Snitzer wrote:
> If XFS fails to write metadata it will retry the write indefinitely
> (with the hope that the write will succeed at some point in the future).
>
> Others can possibly speak to historic reason(s) why this is a sane
> default for XFS. But when XFS is deployed on top of DM thin provisioning
> this infinite retry is very unwelcome -- especially if DM thinp was
> configured to be automatically extended with free space but the admin
> hasn't provided (or restored) adequate free space.
>
> To fix this infinite retry a new bdev_has_space() hook is added to XFS
> to break out of its metadata retry loop if the underlying block device
> reports it no longer has free space. DM thin provisioning is now
> trained to respond accordingly, which enables XFS to not cause a cascade
> of tasks blocked on IO waiting for XFS's infinite retry.
>
> All other block devices, which don't implement a .has_space method in
> block_device_operations, will always return true for bdev_has_space().
>
> With this change XFS will fail the metadata IO, force shutdown, and the
> XFS filesystem may be unmounted. This enables an admin to recover from
> their oversight, of not having provided enough free space, without
> having to force a hard reset of the system to get XFS to unwedge.
>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>

Shouldn't dm-thinp just return the bio with ENOSPC as its error?
The scsi layers already do this for hardware thinp ENOSPC failures,
so dm-thinp should behave exactly the same (i.e. via
__scsi_error_from_host_byte()). The behaviour of the filesystem
should be the same in all cases - making it conditional on whether
the thinp implementation can be polled for available space is wrong,
as most hardware thinp can't be polled by the kernel for this info.

If dm-thinp just returns ENOSPC on the BIO like other hardware
thinp devices, then it is up to the filesystem to handle that
appropriately. i.e. whether an ENOSPC IO error is fatal to the
filesystem is determined by filesystem configuration and context of
the IO error, not whether the block device has no space (which we
should already know from the ENOSPC error delivered by IO
completion).

Cheers,

Dave.
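The hook the patch description proposes can be sketched as a small userspace model. This is an illustrative stand-in, not the actual patch: the kernel structures are simplified, and `thinp_has_space` is a hypothetical provider used only to show the fallback semantics (devices without a `.has_space` method always report "has space").

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures involved. */
struct block_device;

struct block_device_operations {
    /* Optional: report whether the device still has free space. */
    bool (*has_space)(struct block_device *bdev);
};

struct block_device {
    const struct block_device_operations *ops;
};

/*
 * bdev_has_space(): devices that do not implement .has_space always
 * report "has space", so only devices like dm-thinp that opt in can
 * break XFS out of its metadata retry loop.
 */
static bool bdev_has_space(struct block_device *bdev)
{
    if (!bdev->ops || !bdev->ops->has_space)
        return true;
    return bdev->ops->has_space(bdev);
}

/* Hypothetical provider: a dm-thinp-like device whose pool is exhausted. */
static bool thinp_has_space(struct block_device *bdev)
{
    (void)bdev;
    return false; /* pool out of space */
}
```

With this shape, XFS's retry loop would call `bdev_has_space()` on each metadata write failure and force shutdown once it returns false; all legacy devices keep today's retry-forever behaviour.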
On Mon, Jul 20 2015 at 6:36pm -0400, Dave Chinner <david@fromorbit.com> wrote:

> On Mon, Jul 20, 2015 at 11:18:49AM -0400, Mike Snitzer wrote:
> > If XFS fails to write metadata it will retry the write indefinitely
> > (with the hope that the write will succeed at some point in the future).
[...]
> Shouldn't dm-thinp just return the bio with ENOSPC as its error?
> The scsi layers already do this for hardware thinp ENOSPC failures,
> so dm-thinp should behave exactly the same (i.e. via
> __scsi_error_from_host_byte())

DM thinp does return -ENOSPC (uniformly so, as of 4.2-rc3 via commit
bcc696fac1). Before it was more nuanced, but it would either return
-ENOSPC or -EIO (the transition from -ENOSPC to -EIO happened when
thinp's no_space_timeout expired).

Anyway, XFS doesn't care what error is returned; it'll just keep
retrying metadata IO indefinitely.

> The behaviour of the filesystem
> should be the same in all cases - making it conditional on whether
> the thinp implementation can be polled for available space is wrong
> as most hardware thinp can't be polled by the kernel for this info.

There isn't an immediate corollary for SCSI, no. I've discussed flagging
the SCSI block device as out of space when the equivalent HW thinp
ASC/ASQ is returned from the SCSI target, but there isn't an existing
way to know space was added to reset this state (like is possible with
DM thinp).

But in the past you (or others) have claimed that ENOSPC isn't a rich
enough return to _know_ that it is a hard error. Now you seem to be
oscillating away from a richer interface between filesystems and the
underlying block device to: a simple error code is enough.

> If dm-thinp just returns ENOSPC on the BIO like other hardware
> thinp devices, then it is up to the filesystem to handle that
> appropriately. i.e. whether an ENOSPC IO error is fatal to the
> filesystem is determined by filesystem configuration and context of
> the IO error, not whether the block device has no space (which we
> should already know from the ENOSPC error delivered by IO
> completion).

OK, _please_ add that configurability to XFS. As I think you know, this
is a long-standing XFS on thinp issue (both DM and HW thinp).

Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jul 20, 2015 at 07:20:58PM -0400, Mike Snitzer wrote:
> On Mon, Jul 20 2015 at 6:36pm -0400, Dave Chinner <david@fromorbit.com> wrote:
> > Shouldn't dm-thinp just return the bio with ENOSPC as its error?
> > The scsi layers already do this for hardware thinp ENOSPC failures,
> > so dm-thinp should behave exactly the same (i.e. via
> > __scsi_error_from_host_byte())
>
> DM thinp does return -ENOSPC (uniformly so, as of 4.2-rc3 via commit
> bcc696fac1). Before it was more nuanced, but it would either return
> -ENOSPC or -EIO (the transition from -ENOSPC to -EIO happened when
> thinp's no_space_timeout expired).

Right, which meant we couldn't get a reliable ENOSPC detection, so we
just had to retry and hope....

> Anyway, XFS doesn't care what error is returned; it'll just keep
> retrying metadata IO indefinitely.

Sure, that's historical behaviour and it's a good default for most
systems. But see here for how we plan to handle this particular issue
of error categorisation:

http://oss.sgi.com/archives/xfs/2015-02/msg00343.html

BTW, it's not just ENOSPC we need this for - ENOMEM needs the same
configurable behaviour now that the mm subsystem has started treating
GFP_NOFAIL differently to just looping outside the allocator....

> > The behaviour of the filesystem
> > should be the same in all cases - making it conditional on whether
> > the thinp implementation can be polled for available space is wrong
> > as most hardware thinp can't be polled by the kernel for this info.
>
> There isn't an immediate corollary for SCSI, no. I've discussed flagging
> the SCSI block device as out of space when the equivalent HW thinp
> ASC/ASQ is returned from the SCSI target, but there isn't an existing
> way to know space was added to reset this state (like is possible with
> DM thinp).
>
> But in the past you (or others) have claimed that ENOSPC isn't a rich
> enough return to _know_ that it is a hard error.

Yes, I've said that in the past because we haven't had any agreement on
how block devices should report ENOSPC. However, this is slowly being
standardised, and I've seen enough people say "give us a knob to decide
that ourselves as we know our hardware" that means we can rely on ENOSPC
on the bio to detect this issue and handle it appropriately.

> Now you seem to be oscillating away from a richer interface between
> filesystems and the underlying block device to: a simple error code is
> enough.

Once we decide to give users the option to configure block device error
handling, the need for more information from the block device goes
away...

> > If dm-thinp just returns ENOSPC on the BIO like other hardware
> > thinp devices, then it is up to the filesystem to handle that
> > appropriately. i.e. whether an ENOSPC IO error is fatal to the
> > filesystem is determined by filesystem configuration and context of
> > the IO error, not whether the block device has no space (which we
> > should already know from the ENOSPC error delivered by IO
> > completion).
>
> OK, _please_ add that configurability to XFS. As I think you know, this
> is a long-standing XFS on thinp issue (both DM and HW thinp).

It got stuck on the side of the bikeshed, unfortunately, and then other
stuff got painted on top of it. I'll go uncover it now that dm-thinp is
returning reliable ENOSPC in ENOSPC conditions ;)

Cheers,

Dave.
On 7/20/15 5:36 PM, Dave Chinner wrote:
> On Mon, Jul 20, 2015 at 11:18:49AM -0400, Mike Snitzer wrote:
>> If XFS fails to write metadata it will retry the write indefinitely
>> (with the hope that the write will succeed at some point in the future).
[...]
> If dm-thinp just returns ENOSPC on the BIO like other hardware
> thinp devices, then it is up to the filesystem to handle that
> appropriately. i.e. whether an ENOSPC IO error is fatal to the
> filesystem is determined by filesystem configuration and context of
> the IO error, not whether the block device has no space (which we
> should already know from the ENOSPC error delivered by IO
> completion).

The issue we had discussed previously is that there is no agreement
across block devices about whether ENOSPC is a permanent or temporary
condition. Asking the admin to tune the fs to each block device's
behavior sucks, IMHO.

This interface could at least be defined to reflect a permanent and
unambiguous state...

-Eric
On Tue, Jul 21 2015 at 11:34am -0400, Eric Sandeen <sandeen@redhat.com> wrote:

> On 7/20/15 5:36 PM, Dave Chinner wrote:
> > On Mon, Jul 20, 2015 at 11:18:49AM -0400, Mike Snitzer wrote:
[...]
> The issue we had discussed previously is that there is no agreement
> across block devices about whether ENOSPC is a permanent or temporary
> condition. Asking the admin to tune the fs to each block device's
> behavior sucks, IMHO.

It does suck, but it beats the alternative of XFS continuing to do
nothing about the problem.

Discussing more with Vivek, it might be that XFS would be best served to
model what dm-thinp has provided with its 'no_space_timeout'. It
defaults to queueing IO for 60 seconds; once the timeout expires, the
queued IOs get errored. If set to 0, dm-thinp will queue IO
indefinitely. So for XFS's use-case: s/queue/retry/

> This interface could at least be defined to reflect a permanent and
> unambiguous state...

The proposed bdev_has_space() interface enabled XFS to defer to the
block device. But it obviously doesn't help at all if the block device
isn't providing a .has_space method -- so I can see value in XFS having
something like a 'no_space_timeout' knob.

But something needs to happen. No more bike-shedding allowed on this
one.. PLEASE DO SOMETHING! :)
On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
> On Tue, Jul 21 2015 at 11:34am -0400, Eric Sandeen <sandeen@redhat.com> wrote:
> > The issue we had discussed previously is that there is no agreement
> > across block devices about whether ENOSPC is a permanent or temporary
> > condition. Asking the admin to tune the fs to each block device's
> > behavior sucks, IMHO.
>
> It does suck, but it beats the alternative of XFS continuing to do
> nothing about the problem.

Just a comment on that: doing nothing is better than doing the wrong
thing and being stuck with it forever. :)

> Discussing more with Vivek, it might be that XFS would be best served to
> model what dm-thinp has provided with its 'no_space_timeout'. It
> defaults to queueing IO for 60 seconds; once the timeout expires, the
> queued IOs get errored. If set to 0, dm-thinp will queue IO
> indefinitely.

Yes, that's exactly what I proposed in the thread I referenced in my
previous email, and what got stuck on the bikeshed wall because of these
concerns about knob twiddling:

http://oss.sgi.com/archives/xfs/2015-02/msg00346.html

| e.g. if we need configurable error handling, it needs to be
| configurable for different error types, and it needs to be
| configurable on a per-mount basis. And it needs to be configurable
| at runtime, not just at mount time. That kind of leads to using
| sysfs for this. e.g. for each error type we need to handle different
| behaviour for:
|
| $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/type
| [transient] permanent
| $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/perm_timeout_seconds
| 300
| $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/perm_max_retry_attempts
| 50
| $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
| 1

I've rebased this patchset, and I'm cleaning it up now, so in a few
days I'll have something for review, likely for the 4.3 merge
window....

Cheers,

Dave.
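The decision logic behind the sysfs knobs sketched above can be modeled in a few lines. This is a hedged userspace sketch, not the eventual XFS code: the struct fields mirror the proposed knob names, and `xfs_buf_should_retry` is a hypothetical helper showing how type, timeout, retry count, and the unmount flag would combine.

```c
#include <stdbool.h>

/* Model of the proposed per-error-type config; field names follow
 * the sysfs knobs from the quoted proposal. */
enum xfs_error_class { XFS_ERR_TRANSIENT, XFS_ERR_PERMANENT };

struct xfs_error_cfg {
    enum xfs_error_class type;   /* .../enospc/type                     */
    int perm_timeout_seconds;    /* .../enospc/perm_timeout_seconds     */
    int perm_max_retry_attempts; /* .../enospc/perm_max_retry_attempts  */
    bool fail_at_umount;         /* .../enospc/transient_fail_at_umount */
};

/*
 * Should a failed metadata write be retried, given the config and
 * how long/often it has already been retried? (0 for a timeout or
 * retry limit means "no limit", matching dm-thinp's convention.)
 */
static bool xfs_buf_should_retry(const struct xfs_error_cfg *cfg,
                                 int elapsed_seconds, int retries,
                                 bool unmounting)
{
    if (cfg->type == XFS_ERR_PERMANENT)
        return false;   /* fail (and shut down) immediately */
    if (unmounting && cfg->fail_at_umount)
        return false;   /* don't wedge unmount on a transient error */
    if (cfg->perm_timeout_seconds &&
        elapsed_seconds >= cfg->perm_timeout_seconds)
        return false;   /* retried long enough: treat as permanent */
    if (cfg->perm_max_retry_attempts &&
        retries >= cfg->perm_max_retry_attempts)
        return false;   /* retried often enough: treat as permanent */
    return true;        /* transient: keep retrying */
}
```

With all limits set to 0 and type transient, this degenerates to today's retry-forever behaviour, which is why it can be the default.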
On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote:
> On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
[...]
> I've rebased this patchset, and I'm cleaning it up now, so in a few
> days I'll have something for review, likely for the 4.3 merge
> window....

Just thinking a bit more on how to make this simpler to configure,
is there a simple way for the filesystem to determine the current
config of the dm-thinp volume? i.e. if the dm-thinp volume is
configured to error out immediately on ENOSPC, then XFS should
default to doing the same thing. Having XFS be able to grab this
status at mount time and change the default ENOSPC error config from
transient to permanent on such dm-thinp volumes would go a long way
to making these configs Just Do The Right Thing on block dev ENOSPC
errors...

e.g. if dm-thinp is configured to queue for 60s and then fail on
ENOSPC, we want XFS to fail immediately on ENOSPC in metadata IO. If
dm-thinp is configured to ENOSPC instantly (i.e. no queueing) then
we want XFS to retry and use its default retry maximums before
failing permanently.

Cheers,

Dave.
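The mount-time defaulting rule described in that last paragraph can be written down concretely. A sketch under stated assumptions: the enum values and `default_enospc_class` are invented for illustration, encoding only the two cases given above (device already queued and timed out → fail immediately; device errors instantly → retry for a while first).

```c
/* How a thin device behaves once its pool is exhausted. */
enum nospace_behaviour {
    NOSPACE_FAIL_FAST,      /* error_if_no_space: immediate ENOSPC      */
    NOSPACE_FAIL_SLOW,      /* queue_if_no_space + timeout, then ENOSPC */
    NOSPACE_QUEUE_FOREVER,  /* queue_if_no_space, no timeout            */
    NOSPACE_UNKNOWN,        /* device can't report its behaviour        */
};

/* Filesystem-side classification of a metadata-write ENOSPC. */
enum enospc_class { ENOSPC_TRANSIENT, ENOSPC_PERMANENT };

/*
 * Pick the default ENOSPC error class at mount time:
 * - if the device already queued and timed out (fail slow), the error
 *   is final, so fail immediately rather than retrying on top of it;
 * - if the device errors instantly (fail fast), treat the error as
 *   transient and retry up to the default maximums, giving the admin
 *   a window to add space.
 */
static enum enospc_class default_enospc_class(enum nospace_behaviour b)
{
    switch (b) {
    case NOSPACE_FAIL_SLOW:
        return ENOSPC_PERMANENT;
    case NOSPACE_FAIL_FAST:
    case NOSPACE_QUEUE_FOREVER:
    case NOSPACE_UNKNOWN:
    default:
        return ENOSPC_TRANSIENT;
    }
}
```

A device that queues forever never surfaces ENOSPC to the filesystem, so the transient default is harmless there; the unknown case keeps today's retry behaviour for devices that can't report anything.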
On Tue, Jul 21 2015 at 9:00pm -0400, Dave Chinner <david@fromorbit.com> wrote:

> Just thinking a bit more on how to make this simpler to configure,
> is there a simple way for the filesystem to determine the current
> config of the dm-thinp volume? i.e. if the dm-thinp volume is
> configured to error out immediately on ENOSPC, then XFS should
> default to doing the same thing.
[...]
> e.g. if dm-thinp is configured to queue for 60s and then fail on
> ENOSPC, we want XFS to fail immediately on ENOSPC in metadata IO. If
> dm-thinp is configured to ENOSPC instantly (i.e. no queueing) then
> we want XFS to retry and use its default retry maximums before
> failing permanently.

Yes, that'd be nice. But there isn't a way to easily get the DM thinp
device's config from within the kernel (unless XFS wants to get into the
business of issuing ioctls to DM devices.. unlikely). I could be
persuaded to expose a per-device sysfs file to get the status (would
avoid the need for an ioctl), e.g.:

# cat /sys/block/dm-5/dm/status

(but that doesn't _really_ help in-kernel access; it's awkward for
filesystem code to be opening sysfs files!)

So userspace (mkfs.xfs) could easily check the thinp device's setup
using 'dmsetup status <device>' (output will either contain
'queue_if_no_space' or 'error_if_no_space'). The DM thinp
'no_space_timeout' (applicable if queue_if_no_space) is a thinp global
accessed using a module param:

# cat /sys/module/dm_thin_pool/parameters/no_space_timeout
60

I'm open to considering alternative interfaces for getting you the info
you need. I just don't have a great sense for what mechanism you'd like
to use. Do we invent a new block device operations table method that
sets values in a 'struct no_space_strategy' passed in to the block
device?
On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote: > On Tue, Jul 21 2015 at 9:00pm -0400, > Dave Chinner <david@fromorbit.com> wrote: > > > On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote: > > > On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote: > > > > On Tue, Jul 21 2015 at 11:34am -0400, Eric Sandeen <sandeen@redhat.com> wrote: > > > > > On 7/20/15 5:36 PM, Dave Chinner wrote: > > > > > The issue we had discussed previously is that there is no agreement > > > > > across block devices about whether ENOSPC is a permanent or temporary > > > > > condition. Asking the admin to tune the fs to each block device's > > > > > behavior sucks, IMHO. > > > > > > > > It does suck, but it beats the alternative of XFS continuing to do > > > > nothing about the problem. > > > > > > Just a comment on that: doing nothing is better than doing the wrong > > > thing and being stuck with it forever. :) > > > > > > > Disucssing more with Vivek, might be that XFS would be best served to > > > > model what dm-thinp has provided with its 'no_space_timeout'. It > > > > defaults to queueing IO for 60 seconds, once the timeout expires the > > > > queued IOs getted errored. If set to 0 dm-thinp will queue IO > > > > indefinitely. > > > > > > Yes, that's exactly what I proposed in the thread I referenced in > > > my previous email, and what got stuck on the bikeshed wall because > > > of these concerns about knob twiddling: > > > > > > http://oss.sgi.com/archives/xfs/2015-02/msg00346.html > > > > > > | e.g. if we need configurable error handling, it needs to be > > > | configurable for different error types, and it needs to be > > > | configurable on a per-mount basis. And it needs to be configurable > > > | at runtime, not just at mount time. That kind of leads to using > > > | sysfs for this. e.g. 
for each error type we ned to handle different > > > | behaviour for: > > > | > > > | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/type > > > | [transient] permanent > > > | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/perm_timeout_seconds > > > | 300 > > > | $ cat > > > | /sys/fs/xfs/vda/meta_write_errors/enospc/perm_max_retry_attempts > > > | 50 > > > | $ cat > > > | /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount > > > | 1 > > > > > > I've rebased this patchset, and I'm cleaning it up now, so in a few > > > days I'll have something for review, likely for the 4.3 merge > > > window.... > > > > Just thinking a bit more on how to make this simpler to configure, > > is there a simple way for the filesystem to determine the current > > config of the dm thinp volume? i.e. if the dm-thinp volume is > > configured to error out immediately on enospc, then XFS should > > default to doing the same thing. having XFS be able to grab this > > status at mount time and change the default ENOSPC error config from > > transient to permanent on such dm-thinp volumes would go a long way > > to making these configs Just Do The Right Thing on block dev enospc > > errors... > > > > e.g. if dm-thinp is configured to queue for 60s and then fail on > > ENOSPC, we want XFS to fail immediately on ENOSPC in metadata IO. If > > dm-thinp is configured to ENOSPC instantly (i.e. no queueing) then > > we want XFS to retry and use it's default retry maximums before > > failing permanently. > > Yes, that'd be nice. But there isn't a way to easily get the DM thinp > device's config from within the kernel (unless XFS wants to get into the > business of issuing ioctls to DM devices.. unlikely). Not really. > I could be > persuaded to expose a per-device sysfs file to get the status (would > avoid need for ioctl), e.g.: > # cat /sys/block/dm-5/dm/status > (but that doesn't _really_ help in-kernel access, awkward for filesystem > code to be opening sysfs files!) 
No, not going that way. We have direct access through the bdev we opened, so that's the communications channel we'd need to use. > SO userspace (mkfs.xfs) could easily check the thinp device's setup > using 'dmsetup status <device>' (output will either contain > 'queue_if_no_space' or 'error_if_no_space'). The DM thinp > 'no_space_timeout' (applicable if queue_if_no_space) is a thinp global > accessed using a module param: > # cat /sys/module/dm_thin_pool/parameters/no_space_timeout > 60 Mkfs is not the right interface - users can change dm-thinp behaviour long after the filesystem was created and so the XFS config needs to be configurable, too. Further, I really don't want to have to add anything to the on-disk format to support error configuration because, well, that drives the level of complexity up a couple or orders of magnitude (mkfs, repair, metadump, db, etc all need to support it), especially when it can be driven easily from userspace after mount with far less constraints and support burden. > I'm open to considering alternative interfaces for getting you the info > you need. I just don't have a great sense for what mechanism you'd like > to use. Do we invent a new block device operations table method that > sets values in a 'struct no_space_strategy' passed in to the > blockdevice? It's long been frowned on having the filesystems dig into block device structures. We have lots of wrapper functions for getting information from or performing operations on block devices. (e.g. bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(), blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd need to follow. If we do that - bdev_get_nospace_strategy() - then how that information gets to the filesystem is completely opaque at the fs level, and the block layer can implement it in whatever way is considered sane... And, realistically, all we really need returned is a enum to tell us how the bdev behaves on enospc: - bdev fails fast, (i.e. 
immediate ENOSPC) - bdev fails slow, (i.e. queue for some time, then ENOSPC) - bdev never fails (i.e. queue forever) - bdev doesn't support this (i.e. EOPNOTSUPP) Cheers, Dave.
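The wrapper Dave sketches could be modelled roughly as below. This is a hypothetical userspace sketch of the idea, not kernel code: the enum values, the `nospace_strategy` op, and `bdev_get_nospace_strategy()` are all invented names, and the EOPNOTSUPP case is expressed as a distinct enum value so the caller needs no separate error path.

```c
/* Hypothetical model of bdev_get_nospace_strategy(): an enum describing
 * how a block device behaves on ENOSPC, behind a wrapper so the
 * filesystem never digs into driver structures directly. */
enum bdev_nospace_behaviour {
	BDEV_NOSPACE_FAIL_FAST,		/* immediate ENOSPC */
	BDEV_NOSPACE_FAIL_SLOW,		/* queue for some time, then ENOSPC */
	BDEV_NOSPACE_FAIL_NEVER,	/* queue forever */
	BDEV_NOSPACE_UNSUPPORTED,	/* bdev can't report this (EOPNOTSUPP) */
};

/* Stand-ins for struct block_device and its ops table. */
struct block_device;
struct block_device_operations {
	enum bdev_nospace_behaviour (*nospace_strategy)(struct block_device *);
};
struct block_device {
	const struct block_device_operations *ops;
};

/* The fs-visible wrapper: like bdev_read_only(), the caller sees only
 * the answer, not how the block layer produced it. */
static enum bdev_nospace_behaviour
bdev_get_nospace_strategy(struct block_device *bdev)
{
	if (!bdev->ops || !bdev->ops->nospace_strategy)
		return BDEV_NOSPACE_UNSUPPORTED;
	return bdev->ops->nospace_strategy(bdev);
}

/* Example provider: a dm-thinp-like device that queues, then fails. */
static enum bdev_nospace_behaviour
thinp_nospace_strategy(struct block_device *bdev)
{
	(void)bdev;
	return BDEV_NOSPACE_FAIL_SLOW;
}
```

A driver that cannot report anything simply leaves the op unset, and the filesystem sees BDEV_NOSPACE_UNSUPPORTED, matching the "bdev doesn't support this" case above.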
On Tue, Jul 21 2015 at 10:37pm -0400, Dave Chinner <david@fromorbit.com> wrote: > On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote: > > > I'm open to considering alternative interfaces for getting you the info > > you need. I just don't have a great sense for what mechanism you'd like > > to use. Do we invent a new block device operations table method that > > sets values in a 'struct no_space_strategy' passed in to the > > blockdevice? > > It's long been frowned on having the filesystems dig into block > device structures. We have lots of wrapper functions for getting > information from or performing operations on block devices. (e.g. > bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(), > blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd > need to follow. If we do that - bdev_get_nospace_strategy() - then > how that information gets to the filesystem is completely opaque > at the fs level, and the block layer can implement it in whatever > way is considered sane... > > And, realistically, all we really need returned is an enum to tell us > how the bdev behaves on enospc: > - bdev fails fast, (i.e. immediate ENOSPC) > - bdev fails slow, (i.e. queue for some time, then ENOSPC) > - bdev never fails (i.e. queue forever) > - bdev doesn't support this (i.e. EOPNOTSUPP) This 'struct no_space_strategy' would be invented purely for informational purposes for upper layers' benefit -- I don't consider it a "block device structure" in the traditional sense. I was thinking upper layers would like to know the actual timeout value for the "fails slow" case. As such the 'struct no_space_strategy' would have the enum and the timeout. And would be returned with a call: bdev_get_nospace_strategy(bdev, &no_space_strategy) -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/22/15 8:34 AM, Mike Snitzer wrote: > On Tue, Jul 21 2015 at 10:37pm -0400, > Dave Chinner <david@fromorbit.com> wrote: > >> On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote: >> >>> I'm open to considering alternative interfaces for getting you the info >>> you need. I just don't have a great sense for what mechanism you'd like >>> to use. Do we invent a new block device operations table method that >>> sets values in a 'struct no_space_strategy' passed in to the >>> blockdevice? >> >> It's long been frowned on having the filesystems dig into block >> device structures. We have lots of wrapper functions for getting >> information from or performing operations on block devices. (e.g. >> bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(), >> blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd >> need to follow. If we do that - bdev_get_nospace_strategy() - then >> how that information gets to the filesystem is completely opaque >> at the fs level, and the block layer can implement it in whatever >> way is considered sane... >> >> And, realistically, all we really need returned is an enum to tell us >> how the bdev behaves on enospc: >> - bdev fails fast, (i.e. immediate ENOSPC) >> - bdev fails slow, (i.e. queue for some time, then ENOSPC) >> - bdev never fails (i.e. queue forever) >> - bdev doesn't support this (i.e. EOPNOTSUPP) I'm not sure how this is more useful than the bdev simply responding to a query of "should we keep trying IOs?" IOWs, do we really care if it's failing fast or slow, vs. simply knowing whether it has now permanently failed? So rather than "bdev_get_nospace_strategy" it seems like all we need to know is "bdev_has_failed" - do we really care about the details? > This 'struct no_space_strategy' would be invented purely for > informational purposes for upper layers' benefit -- I don't consider it > a "block device structure" in the traditional sense.
> > I was thinking upper layers would like to know the actual timeout value > for the "fails slow" case. As such the 'struct no_space_strategy' would > have the enum and the timeout. And would be returned with a call: > bdev_get_nospace_strategy(bdev, &no_space_strategy) Asking for the timeout value seems to add complexity. It could change after we ask, and knowing it now requires another layer to be handling timeouts... Thanks, -Eric
On Wed, Jul 22 2015 at 12:28pm -0400, Eric Sandeen <sandeen@redhat.com> wrote: > On 7/22/15 8:34 AM, Mike Snitzer wrote: > > On Tue, Jul 21 2015 at 10:37pm -0400, > > Dave Chinner <david@fromorbit.com> wrote: > > > >> On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote: > >> > >>> I'm open to considering alternative interfaces for getting you the info > >>> you need. I just don't have a great sense for what mechanism you'd like > >>> to use. Do we invent a new block device operations table method that > >>> sets values in a 'struct no_space_strategy' passed in to the > >>> blockdevice? > >> > >> It's long been frowned on having the filesystems dig into block > >> device structures. We have lots of wrapper functions for getting > >> information from or performing operations on block devices. (e.g. > >> bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(), > >> blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd > >> need to follow. If we do that - bdev_get_nospace_strategy() - then > >> how that information gets to the filesystem is completely opaque > >> at the fs level, and the block layer can implement it in whatever > >> way is considered sane... > >> > >> And, realistically, all we really need returned is an enum to tell us > >> how the bdev behaves on enospc: > >> - bdev fails fast, (i.e. immediate ENOSPC) > >> - bdev fails slow, (i.e. queue for some time, then ENOSPC) > >> - bdev never fails (i.e. queue forever) > >> - bdev doesn't support this (i.e. EOPNOTSUPP) > > I'm not sure how this is more useful than the bdev simply responding to > a query of "should we keep trying IOs?" > > IOWs, do we really care if it's failing fast or slow, vs. simply knowing > whether it has now permanently failed? > > So rather than "bdev_get_nospace_strategy" it seems like all we need > to know is "bdev_has_failed" - do we really care about the details? My bdev_has_space() proposal is no different than bdev_has_failed().
If you prefer the more generic name then fine. But bdev_has_failed() is of limited utility outside of devices that provide support. So I can see why Dave is resisting it. Anyway, the benefit of XFS tailoring its independent config based on dm-thinp's comparable config makes sense to me. The reason for XFS's independent config is that it could be deployed on any storage (e.g. not dm-thinp). It affords XFS the option to defer to DM thinp while still having comparable functionality for HW thinp or some other storage. > > This 'struct no_space_strategy' would be invented purely for > > informational purposes for upper layers' benefit -- I don't consider it > > a "block device structure" in the traditional sense. > > > > I was thinking upper layers would like to know the actual timeout value > > for the "fails slow" case. As such the 'struct no_space_strategy' would > > have the enum and the timeout. And would be returned with a call: > > bdev_get_nospace_strategy(bdev, &no_space_strategy) > > Asking for the timeout value seems to add complexity. It could change after > we ask, and knowing it now requires another layer to be handling timeouts... Dave is already saying XFS will have a timeout it'll be managing. It stands to reason that XFS would base its timeout on DM thinp's timeout. But yeah, it does allow the stacked timeout that XFS uses to be out of sync if the lower timeout changes (no different than blk_stack_limits). Please fix this however you see fit. I'll assist anywhere that makes sense.
On Wed, Jul 22, 2015 at 11:28:06AM -0500, Eric Sandeen wrote: > On 7/22/15 8:34 AM, Mike Snitzer wrote: > > On Tue, Jul 21 2015 at 10:37pm -0400, > > Dave Chinner <david@fromorbit.com> wrote: > >> On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote: > >>> I'm open to considering alternative interfaces for getting you the info > >>> you need. I just don't have a great sense for what mechanism you'd like > >>> to use. Do we invent a new block device operations table method that > >>> sets values in a 'struct no_space_strategy' passed in to the > >>> blockdevice? > >> > >> It's long been frowned on having the filesystems dig into block > >> device structures. We have lots of wrapper functions for getting > >> information from or performing operations on block devices. (e.g. > >> bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(), > >> blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd > >> need to follow. If we do that - bdev_get_nospace_strategy() - then > >> how that information gets to the filesystem is completely opaque > >> at the fs level, and the block layer can implement it in whatever > >> way is considered sane... > >> > >> And, realistically, all we really need returned is an enum to tell us > >> how the bdev behaves on enospc: > >> - bdev fails fast, (i.e. immediate ENOSPC) > >> - bdev fails slow, (i.e. queue for some time, then ENOSPC) > >> - bdev never fails (i.e. queue forever) > >> - bdev doesn't support this (i.e. EOPNOTSUPP) > > I'm not sure how this is more useful than the bdev simply responding to > a query of "should we keep trying IOs?" - bdev fails fast, (i.e. immediate ENOSPC) XFS should use a bound retry behaviour to allow the possibility of the admin adding more space before we shut down the fs. i.e. XFS fails slow. - bdev fails slow, (i.e.
queue for some time, then ENOSPC) We know that IOs are going to be delayed before they are failed, so there's no point in retrying as the admin has already had a chance to resolve the ENOSPC condition before failure was reported. i.e. XFS fails fast. - bdev never fails (i.e. queue forever) Block device will appear to hang when it runs out of space. Nothing XFS can do here because IOs never fail, but we need to note this in the log at mount time so that filesystem hangs are easily explained when reported to us. - bdev doesn't support this (i.e. EOPNOTSUPP) XFS uses default "retry forever" behaviour. > > This 'struct no_space_strategy' would be invented purely for > > informational purposes for upper layers' benefit -- I don't consider it > > a "block device structure" in the traditional sense. > > > > I was thinking upper layers would like to know the actual timeout value > > for the "fails slow" case. As such the 'struct no_space_strategy' would > > have the enum and the timeout. And would be returned with a call: > > bdev_get_nospace_strategy(bdev, &no_space_strategy) > > Asking for the timeout value seems to add complexity. It could change after > we ask, and knowing it now requires another layer to be handling timeouts... I don't think knowing the bdev timeout is necessary because the default is most likely to be "fail fast" in this case. i.e. no retries, just shut down. IOWs, if we describe the configs and actions in neutral terms, then the default configurations are easy for users to understand. i.e: bdev enospc XFS default ----------- ----------- Fail slow Fail fast Fail fast Fail slow Fail never Fail never, Record in log EOPNOTSUPP Fail never With that in mind, I'm thinking I should drop the "permanent/transient" error classifications, and change it to "failure behaviour" with the options "fast slow [never]" and only the slow option has retry/timeout configuration options.
I think the "never" option still needs a "fail at unmount" config variable, but we enable it by default rather than hanging unmount and requiring a manual shutdown like we do now.... Cheers, Dave.
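Dave's table can be read as a simple inversion function: the filesystem picks the opposite of the bdev's ENOSPC behaviour. A hypothetical sketch of that mount-time default selection (all names invented; nothing here is actual XFS code):

```c
/* Hypothetical model of the "bdev enospc -> XFS default" table from the
 * thread. The enum and function names are made up for illustration. */
enum nospace_behaviour { FAIL_FAST, FAIL_SLOW, FAIL_NEVER, NOT_SUPPORTED };

struct enospc_config {
	enum nospace_behaviour fail;	/* how XFS treats metadata ENOSPC */
	int record_in_log;		/* note "bdev never fails" at mount */
};

static struct enospc_config xfs_default_for(enum nospace_behaviour bdev)
{
	struct enospc_config cfg = { FAIL_NEVER, 0 };

	switch (bdev) {
	case FAIL_SLOW:		/* bdev queues then errors: no point retrying */
		cfg.fail = FAIL_FAST;
		break;
	case FAIL_FAST:		/* immediate ENOSPC: bound retries in the fs */
		cfg.fail = FAIL_SLOW;
		break;
	case FAIL_NEVER:	/* bdev queues forever: note it in the log */
		cfg.fail = FAIL_NEVER;
		cfg.record_in_log = 1;
		break;
	case NOT_SUPPORTED:	/* EOPNOTSUPP: keep the old retry-forever */
		cfg.fail = FAIL_NEVER;
		break;
	}
	return cfg;
}
```

The point of the inversion is that exactly one layer ends up providing the grace period for the admin: either the bdev queues, or the filesystem retries, never both.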
On Thu, Jul 23 2015 at 1:10am -0400, Dave Chinner <david@fromorbit.com> wrote: > On Wed, Jul 22, 2015 at 11:28:06AM -0500, Eric Sandeen wrote: > > On 7/22/15 8:34 AM, Mike Snitzer wrote: > > > On Tue, Jul 21 2015 at 10:37pm -0400, > > > Dave Chinner <david@fromorbit.com> wrote: > > >> On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote: > > >>> I'm open to considering alternative interfaces for getting you the info > > >>> you need. I just don't have a great sense for what mechanism you'd like > > >>> to use. Do we invent a new block device operations table method that > > >>> sets values in a 'struct no_space_strategy' passed in to the > > >>> blockdevice? > > >> > > >> It's long been frowned on having the filesystems dig into block > > >> device structures. We have lots of wrapper functions for getting > > >> information from or performing operations on block devices. (e.g. > > >> bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(), > > >> blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd > > >> need to follow. If we do that - bdev_get_nospace_strategy() - then > > >> how that information gets to the filesystem is completely opaque > > >> at the fs level, and the block layer can implement it in whatever > > >> way is considered sane... > > >> > > >> And, realistically, all we really need returned is a enum to tell us > > >> how the bdev behaves on enospc: > > >> - bdev fails fast, (i.e. immediate ENOSPC) > > >> - bdev fails slow, (i.e. queue for some time, then ENOSPC) > > >> - bdev never fails (i.e. queue forever) > > >> - bdev doesn't support this (i.e. EOPNOTSUPP) > > > > I'm not sure how this is more useful than the bdev simply responding to > > a query of "should we keep trying IOs?" > > - bdev fails fast, (i.e. immediate ENOSPC) > > XFS should use a bound retry behaviour for to allow the possiblity of > the admin adding more space before we shut down the fs. i.e. > XFS fails slow. > > - bdev fails slow, (i.e. 
queue for some time, then ENOSPC) > > We know that IOs are going to be delayed before they are failed, so > there's no point in retrying as the admin has already had a chance > to resolve the ENOSPC condition before failure was reported. i.e. > XFS fails fast. > > - bdev never fails (i.e. queue forever) > > Block device will appear to hang when it runs out of space. Nothing > XFS can do here because IOs never fail, but we need to note this in > the log at mount time so that filesystem hangs are easily explained > when reported to us. > > - bdev doesn't support this (i.e. EOPNOTSUPP) > > XFS uses default "retry forever" behaviour. > > > > This 'struct no_space_strategy' would be invented purely for > > > informational purposes for upper layers' benefit -- I don't consider it > > > a "block device structure" it the traditional sense. > > > > > > I was thinking upper layers would like to know the actual timeout value > > > for the "fails slow" case. As such the 'struct no_space_strategy' would > > > have the enum and the timeout. And would be returned with a call: > > > bdev_get_nospace_strategy(bdev, &no_space_strategy) > > > > Asking for the timeout value seems to add complexity. It could change after > > we ask, and knowing it now requires another layer to be handling timeouts... > > I don't think knowing the bdev timeout is necessary because the > default is most likely to be "fail fast" in this case. i.e. no > retries, just shut down. IOWs, if we describe the configs and > actions in neutral terms, then the default configurations easy for > users to understand. i.e: > > bdev enospc XFS default > ----------- ----------- > Fail slow Fail fast > Fail fast Fail slow > Fail never Fail never, Record in log > EOPNOTSUPP Fail never > > With that in mind, I'm thinking I should drop the > "permanent/transient" error classifications, and change it "failure > behaviour" with the options "fast slow [never]" and only the slow > option has retry/timeout configuration options. 
I think the "never" > option still needs to "fail at unmount" config variable, but we > enable it by default rather than hanging unmount and requiring a > manual shutdown like we do now.... This all sounds good to me. The simpler XFS configuration looks like a nice improvement. If you just want to stub out the call to bdev_get_nospace_strategy() I can crank through implementing it once I get a few minutes. Btw, not sure what I was thinking when suggesting XFS would benefit from knowing the duration of the thinp no_space_timeout.
On Thu, Jul 23, 2015 at 03:10:43PM +1000, Dave Chinner wrote: [..] > I don't think knowing the bdev timeout is necessary because the > default is most likely to be "fail fast" in this case. i.e. no > retries, just shut down. IOWs, if we describe the configs and > actions in neutral terms, then the default configurations easy for > users to understand. i.e: > > bdev enospc XFS default > ----------- ----------- > Fail slow Fail fast > Fail fast Fail slow > Fail never Fail never, Record in log > EOPNOTSUPP Fail never > > With that in mind, I'm thinking I should drop the > "permanent/transient" error classifications, and change it "failure > behaviour" with the options "fast slow [never]" and only the slow > option has retry/timeout configuration options. I think the "never" > option still needs to "fail at unmount" config variable, but we > enable it by default rather than hanging unmount and requiring a > manual shutdown like we do now.... I am wondering, instead of 4 knobs (fast, slow, never, retry-timeout), can we just make do with one knob per error type, and that is retry-timeout? retry-timeout=0 (Fail fast) retry-timeout=X (Fail slow) retry-timeout=-1 (Never Give up). Also, do we really need this timeout per error type? Also, it would be nice if this timeout was configurable using a mount option. Then we can just specify it during mount time and be done with it. The idea of auto-tuning based on what the block device is doing sounds reasonable, but that should not be a requirement for this patch and can go in even later. It is one of those nice-to-have features. Thanks Vivek
On Wed, 22 Jul 2015, Dave Chinner wrote: > On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote: > > On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote: > > > On Tue, Jul 21 2015 at 11:34am -0400, Eric Sandeen <sandeen@redhat.com> wrote: > > > > On 7/20/15 5:36 PM, Dave Chinner wrote: > > > > The issue we had discussed previously is that there is no agreement > > > > across block devices about whether ENOSPC is a permanent or temporary > > > > condition. Asking the admin to tune the fs to each block device's > > > > behavior sucks, IMHO. > > > > > > It does suck, but it beats the alternative of XFS continuing to do > > > nothing about the problem. > > > > Just a comment on that: doing nothing is better than doing the wrong > > thing and being stuck with it forever. :) > > > > > Disucssing more with Vivek, might be that XFS would be best served to > > > model what dm-thinp has provided with its 'no_space_timeout'. It > > > defaults to queueing IO for 60 seconds, once the timeout expires the > > > queued IOs getted errored. If set to 0 dm-thinp will queue IO > > > indefinitely. > > > > Yes, that's exactly what I proposed in the thread I referenced in > > my previous email, and what got stuck on the bikeshed wall because > > of these concerns about knob twiddling: > > > > http://oss.sgi.com/archives/xfs/2015-02/msg00346.html > > > > | e.g. if we need configurable error handling, it needs to be > > | configurable for different error types, and it needs to be > > | configurable on a per-mount basis. And it needs to be configurable > > | at runtime, not just at mount time. That kind of leads to using > > | sysfs for this. e.g. 
for each error type we need to handle different > > | behaviour for: > > | > > | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/type > > | [transient] permanent > > | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/perm_timeout_seconds > > | 300 > > | $ cat > > | /sys/fs/xfs/vda/meta_write_errors/enospc/perm_max_retry_attempts > > | 50 > > | $ cat > > | /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount > > | 1 > > > > I've rebased this patchset, and I'm cleaning it up now, so in a few > > days I'll have something for review, likely for the 4.3 merge > > window.... > > Just thinking a bit more on how to make this simpler to configure, > is there a simple way for the filesystem to determine the current > config of the dm thinp volume? You can just stop retrying the I/Os when the user attempts to unmount the filesystem - then, you don't need any configuration option. Mikulas > i.e. if the dm-thinp volume is > configured to error out immediately on enospc, then XFS should > default to doing the same thing. having XFS be able to grab this > status at mount time and change the default ENOSPC error config from > transient to permanent on such dm-thinp volumes would go a long way > to making these configs Just Do The Right Thing on block dev enospc > errors... > > e.g. if dm-thinp is configured to queue for 60s and then fail on > ENOSPC, we want XFS to fail immediately on ENOSPC in metadata IO. If > dm-thinp is configured to ENOSPC instantly (i.e. no queueing) then > we want XFS to retry and use its default retry maximums before > failing permanently. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > > -- > dm-devel mailing list > dm-devel@redhat.com > https://www.redhat.com/mailman/listinfo/dm-devel >
On Thu, Jul 23, 2015 at 12:43:58PM -0400, Vivek Goyal wrote: > On Thu, Jul 23, 2015 at 03:10:43PM +1000, Dave Chinner wrote: > > [..] > > I don't think knowing the bdev timeout is necessary because the > > default is most likely to be "fail fast" in this case. i.e. no > > retries, just shut down. IOWs, if we describe the configs and > > actions in neutral terms, then the default configurations easy for > > users to understand. i.e: > > > > bdev enospc XFS default > > ----------- ----------- > > Fail slow Fail fast > > Fail fast Fail slow > > Fail never Fail never, Record in log > > EOPNOTSUPP Fail never > > > > With that in mind, I'm thinking I should drop the > > "permanent/transient" error classifications, and change it "failure > > behaviour" with the options "fast slow [never]" and only the slow > > option has retry/timeout configuration options. I think the "never" > > option still needs to "fail at unmount" config variable, but we > > enable it by default rather than hanging unmount and requiring a > > manual shutdown like we do now.... > > I am wondering instead of 4 knobs (fast,slow,never,retry-timeout) can > we just make do with one knob per error type and that is retry-timeout. "retry-timeout" == "fail slow". i.e. a 5 minute retry timeout is configured as: # echo slow > fail_method # echo 0 > max_retries # echo 300 > retry_timeout > retry-timeout=0 (Fail fast) > retry-timeout=X (Fail slow) > retry-timeout=-1 (Never Give up). What do we do when we want to add a different failure type with different configuration requirements? > Also do we really need this timeout per error type. I don't follow your logic here. What do we need a timeout for with either the "never" or "fast" failure configurations? > Also would be nice if this timeout was configurable using a mount > option. Then we can just specify it during mount time and be done > with it. That way lies madness. The error configuration infrastructure we need is not just for ENOSPC errors on metadata buffers.
We need configurable error behaviour for multiple different errors in multiple different subsystems (e.g. data IO failure vs metadata buffer IO failure vs memory allocation failure vs inode corruption vs freespace corruption vs ....). And we still would need the sysfs interface for querying and configuring at runtime, so mount options are just a bad idea. And with sysfs, the potential future route for automatic configuration at mount time is via udev events and configuration files, similar to block devices. > Idea of auto tuning based on what block device is doing sounds reasonable > but that should not be a requirement for this patch and can go in even > later. It is one of those nice to have features. "this patch"? Just the core infrastructure so far: 11 files changed, 290 insertions(+), 60 deletions(-) and that will need to be split into 4-5 patches for review. There's a bunch of cleanup that precedes this, and then there's a patch per error type we are going to handle in metadata buffer IO completion. IOWs, the dm-thinp autotuning is just a simple, small patch at the end of a much larger series - it's maybe 10 lines of code in XFS... Cheers, Dave.
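The per-error sysfs knobs Dave describes above ("fast slow [never]", with retry limits only meaningful for "slow") could be modelled like this. This is a hypothetical sketch, not kernel code; the knob names follow the `# echo slow > fail_method` example in the thread, and the semantics (0 == unlimited retries, retries bounded by both a count and a timeout) are assumptions for illustration:

```c
#include <string.h>

/* Hypothetical model of the proposed per-error-type configuration. */
struct xfs_error_cfg {
	enum { EC_FAST, EC_SLOW, EC_NEVER } method;
	int max_retries;	/* 0 == no retry-count limit (assumption) */
	int retry_timeout;	/* seconds; used only when method == EC_SLOW */
};

/* Parse a write to the "fail_method" knob. Returns 0 on success,
 * -1 on an unrecognised method string. */
static int set_fail_method(struct xfs_error_cfg *cfg, const char *val)
{
	if (strcmp(val, "fast") == 0)
		cfg->method = EC_FAST;
	else if (strcmp(val, "slow") == 0)
		cfg->method = EC_SLOW;
	else if (strcmp(val, "never") == 0)
		cfg->method = EC_NEVER;
	else
		return -1;
	return 0;
}

/* Should a failed metadata write be retried, given 'elapsed' seconds
 * since the first failure and 'retries' attempts so far? */
static int should_retry(const struct xfs_error_cfg *cfg, int elapsed,
			int retries)
{
	switch (cfg->method) {
	case EC_FAST:
		return 0;		/* fail immediately */
	case EC_NEVER:
		return 1;		/* retry forever */
	case EC_SLOW:
		if (cfg->max_retries && retries >= cfg->max_retries)
			return 0;
		return elapsed < cfg->retry_timeout;
	}
	return 0;
}
```

This also illustrates Dave's answer to Vivek's question below: a separate `fail_method` knob keeps "fast" and "never" free of any timeout semantics, so new failure types can add their own configuration without overloading magic timeout values.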
On Thu, Jul 23, 2015 at 01:08:36PM -0400, Mikulas Patocka wrote: > On Wed, 22 Jul 2015, Dave Chinner wrote: > > On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote: > > > On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote: > > > | $ cat > > > | /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount > > > | 1 [...] > You can just stop retrying the I/Os when the user attempts to unmount the > filesystem - then, you don't need any configuration option. See above - the default will do that, but there are users who do not want that unmount behaviour.... -Dave.
On Fri, Jul 24, 2015 at 09:00:54AM +1000, Dave Chinner wrote: > On Thu, Jul 23, 2015 at 12:43:58PM -0400, Vivek Goyal wrote: > > On Thu, Jul 23, 2015 at 03:10:43PM +1000, Dave Chinner wrote: > > > > [..] > > > I don't think knowing the bdev timeout is necessary because the > > > default is most likely to be "fail fast" in this case. i.e. no > > > retries, just shut down. IOWs, if we describe the configs and > > > actions in neutral terms, then the default configurations easy for > > > users to understand. i.e: > > > > > > bdev enospc XFS default > > > ----------- ----------- > > > Fail slow Fail fast > > > Fail fast Fail slow > > > Fail never Fail never, Record in log > > > EOPNOTSUPP Fail never > > > > > > With that in mind, I'm thinking I should drop the > > > "permanent/transient" error classifications, and change it "failure > > > behaviour" with the options "fast slow [never]" and only the slow > > > option has retry/timeout configuration options. I think the "never" > > > option still needs to "fail at unmount" config variable, but we > > > enable it by default rather than hanging unmount and requiring a > > > manual shutdown like we do now.... > > > > I am wondering instead of 4 knobs (fast,slow,never,retry-timeout) can > > we just do with one knob per error type and that is retry-timeout. > > "retry-timeout" == "fail slow". i.e. a 5 minute retry timeout is > configured as: > > # echo slow > fail_method > # echo 0 > max_retries > # echo 300 > retry_timeout Hi Dave, I am sure I am missing something but I will ask anyway. Why do we need this knob "fail_method"? Isn't it sort of implied by the other two knobs based on their values? max_retries=0 retry_timeout=0 implies fail_method=fast. A non-zero value of max_retries or retry_timeout implies fail_method=slow. A very high value (-1) of either max_retries or retry_timeout implies fail_method="almost never". > > retry-timeout=0 (Fail fast) > > retry-timeout=X (Fail slow) > > retry-timeout=-1 (Never Give up).
> What do we do when we want to add a different failure type > with different configuration requirements? Ok, got it. So we are targeting something very generic so that other cases can be handled too. > > Also do we really need this timeout per error type. > I don't follow your logic here. What do we need a timeout for with > either the "never" or "fast" failure configurations? Ignore this. I had misunderstood it. > > Also would be nice if this timeout was configurable using a mount > > option. Then we can just specify it during mount time and be done > > with it. > That way lies madness. The error configuration infrastructure we > need is not just for ENOSPC errors on metadata buffers. We need > configurable error behaviour for multiple different errors in > multiple different subsystems (e.g. data IO failure vs metadata > buffer IO failure vs memory allocation failure vs inode corruption > vs freespace corruption vs ....). > And we still would need the sysfs interface for querying and > configuring at runtime, so mount options are just a bad idea. And > with sysfs, the potential future route for automatic configuration > at mount time is via udev events and configuration files, similar to > block devices. Agreed that sysfs provides lots of flexibility here. I guess I was just thinking in terms of solving this particular issue we are facing. > Idea of auto tuning based on what block device is doing sounds reasonable > but that should not be a requirement for this patch and can go in even > later. It is one of those nice to have features. > "this patch"? Just the core infrastructure so far: I was referring to Mike's patch where we add an additional method to block device operations. > 11 files changed, 290 insertions(+), 60 deletions(-) > and that will need to be split into 4-5 patches for review. There's > a bunch of cleanup that precedes this, and then there's a patch per > error type we are going to handle in metadata buffer IO completion.
> IOWs, the dm-thinp autotuning is just a simple, small patch at the end > of a much larger series - it's maybe 10 lines of code in XFS... Ok. I will wait for the final patches. Thanks Vivek
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 1c50c58..55ee3cf 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -3948,7 +3948,7 @@ static struct target_type pool_target = {
 	.name = "thin-pool",
 	.features = DM_TARGET_SINGLETON | DM_TARGET_ALWAYS_WRITEABLE |
 		    DM_TARGET_IMMUTABLE,
-	.version = {1, 16, 0},
+	.version = {1, 17, 0},
 	.module = THIS_MODULE,
 	.ctr = pool_ctr,
 	.dtr = pool_dtr,
@@ -4333,9 +4333,31 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
 		limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
 }
 
+static bool thin_has_space(struct dm_target *ti)
+{
+	struct thin_c *tc = ti->private;
+	struct pool *pool = tc->pool;
+	enum pool_mode m = get_pool_mode(pool);
+
+	/*
+	 * The thin-pool has space if it is either in write mode _or_
+	 * it is still waiting for space to be added.
+	 *
+	 * If 'error_if_no_space' was configured the pool will not queue
+	 * IO at all, even though the pool will stay in OODS mode, so
+	 * there is no point having upper layers (e.g. XFS) retry IO
+	 * given 'error_if_no_space' is meant to _not_ queue IO.
+	 */
+	if (m == PM_WRITE ||
+	    (m == PM_OUT_OF_DATA_SPACE && !pool->pf.error_if_no_space))
+		return true;
+
+	return false;
+}
+
 static struct target_type thin_target = {
 	.name = "thin",
-	.version = {1, 16, 0},
+	.version = {1, 17, 0},
 	.module = THIS_MODULE,
 	.ctr = thin_ctr,
 	.dtr = thin_dtr,
@@ -4348,6 +4370,7 @@ static struct target_type thin_target = {
 	.merge = thin_merge,
 	.iterate_devices = thin_iterate_devices,
 	.io_hints = thin_io_hints,
+	.has_space = thin_has_space,
 };
 
 /*----------------------------------------------------------------*/
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ab37ae1..14bf9df 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -597,6 +597,38 @@ out:
 	return r;
 }
 
+static bool dm_blk_has_space(struct block_device *bdev)
+{
+	struct mapped_device *md = bdev->bd_disk->private_data;
+	int srcu_idx;
+	struct dm_table *map;
+	struct dm_target *tgt;
+	bool r = true;
+
+	map = dm_get_live_table(md, &srcu_idx);
+
+	if (!map || !dm_table_get_size(map))
+		goto out;
+
+	/* We only support devices that have a single target */
+	if (dm_table_get_num_targets(map) != 1)
+		goto out;
+
+	tgt = dm_table_get_target(map, 0);
+	if (!tgt->type->has_space)
+		goto out;
+
+	if (dm_suspended_md(md))
+		goto out;
+
+	r = tgt->type->has_space(tgt);
+
+out:
+	dm_put_live_table(md, srcu_idx);
+
+	return r;
+}
+
 static struct dm_io *alloc_io(struct mapped_device *md)
 {
 	return mempool_alloc(md->io_pool, GFP_NOIO);
@@ -3647,6 +3679,7 @@ static const struct block_device_operations dm_blk_dops = {
 	.release = dm_blk_close,
 	.ioctl = dm_blk_ioctl,
 	.getgeo = dm_blk_getgeo,
+	.has_space = dm_blk_has_space,
 	.owner = THIS_MODULE
 };
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1982437..5034361 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -469,6 +469,16 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL_GPL(bdev_direct_access);
 
+bool bdev_has_space(struct block_device *bdev)
+{
+	const struct block_device_operations *ops = bdev->bd_disk->fops;
+
+	if (!ops->has_space)
+		return true;
+	return ops->has_space(bdev);
+}
+EXPORT_SYMBOL_GPL(bdev_has_space);
+
 /*
  * pseudo-fs
  */
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 092d652..98efbca 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1059,6 +1059,9 @@ xfs_buf_iodone_callbacks(
 	if (likely(!bp->b_error))
 		goto do_callbacks;
 
+	if (!bdev_has_space(bp->b_target->bt_bdev))
+		xfs_force_shutdown(mp, SHUTDOWN_REMOTE_REQ);
+
 	/*
 	 * If we've already decided to shutdown the filesystem because of
 	 * I/O errors, there's no point in giving this a retry.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d4068c17d..1ba66bc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1566,6 +1566,7 @@ struct block_device_operations {
 	int (*getgeo)(struct block_device *, struct hd_geometry *);
 	/* this callback is with swap_lock and sometimes page table lock held */
 	void (*swap_slot_free_notify) (struct block_device *, unsigned long);
+	bool (*has_space) (struct block_device *);
 	struct module *owner;
 };
 
@@ -1576,6 +1577,8 @@ extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 			struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, sector_t, void **addr,
 			unsigned long *pfn, long size);
+extern bool bdev_has_space(struct block_device *);
+
 #else /* CONFIG_BLOCK */
 
 struct block_device;
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 51cc1de..c55053e 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -119,6 +119,13 @@ typedef void (*dm_io_hints_fn) (struct dm_target *ti,
  */
 typedef int (*dm_busy_fn) (struct dm_target *ti);
 
+/*
+ * Returns:
+ *   false: The target has no space that may be allocated.
+ *   true:  The target has space that may be allocated.
+ */
+typedef bool (*dm_has_space_fn) (struct dm_target *ti);
+
 void dm_error(const char *message);
 
 struct dm_dev {
@@ -164,6 +171,7 @@ struct target_type {
 	dm_busy_fn busy;
 	dm_iterate_devices_fn iterate_devices;
 	dm_io_hints_fn io_hints;
+	dm_has_space_fn has_space;
 
 	/* For internal device-mapper use. */
 	struct list_head list;
If XFS fails to write metadata it will retry the write indefinitely
(with the hope that the write will succeed at some point in the future).

Others can possibly speak to the historic reason(s) why this is a sane
default for XFS. But when XFS is deployed on top of DM thin provisioning
this infinite retry is very unwelcome -- especially if DM thinp was
configured to be automatically extended with free space but the admin
hasn't provided (or restored) adequate free space.

To fix this infinite retry a new bdev_has_space() hook is added to XFS
to break out of its metadata retry loop if the underlying block device
reports it no longer has free space. DM thin provisioning is now
trained to respond accordingly, which enables XFS to not cause a cascade
of tasks blocked on IO waiting for XFS's infinite retry.

All other block devices, which don't implement a .has_space method in
block_device_operations, will always return true for bdev_has_space().

With this change XFS will fail the metadata IO, force shutdown, and the
XFS filesystem may be unmounted. This enables an admin to recover from
their oversight, of not having provided enough free space, without
having to force a hard reset of the system to get XFS to unwedge.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-thin.c          | 27 +++++++++++++++++++++++++--
 drivers/md/dm.c               | 33 +++++++++++++++++++++++++++++++++
 fs/block_dev.c                | 10 ++++++++++
 fs/xfs/xfs_buf_item.c         |  3 +++
 include/linux/blkdev.h        |  3 +++
 include/linux/device-mapper.h |  8 ++++++++
 6 files changed, 82 insertions(+), 2 deletions(-)