
[RFC] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

Message ID 20150720151849.GA2282@redhat.com (mailing list archive)
State New, archived

Commit Message

Mike Snitzer July 20, 2015, 3:18 p.m. UTC
If XFS fails to write metadata it will retry the write indefinitely
(with the hope that the write will succeed at some point in the future).

Others can possibly speak to historic reason(s) why this is a sane
default for XFS.  But when XFS is deployed on top of DM thin provisioning
this infinite retry is very unwelcome -- especially if DM thinp was
configured to be automatically extended with free space but the admin
hasn't provided (or restored) adequate free space.

To fix this infinite retry, a new bdev_has_space() hook is added to XFS
to break out of its metadata retry loop if the underlying block device
reports it no longer has free space.  DM thin provisioning is now
trained to respond accordingly, which enables XFS to not cause a cascade
of tasks blocked on IO waiting for XFS's infinite retry.

All other block devices, which don't implement a .has_space method in
block_device_operations, will always return true for bdev_has_space().

With this change XFS will fail the metadata IO, force shutdown, and the
XFS filesystem may be unmounted.  This enables an admin to recover from
their oversight, of not having provided enough free space, without
having to force a hard reset of the system to get XFS to unwedge.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-thin.c          | 27 +++++++++++++++++++++++++--
 drivers/md/dm.c               | 33 +++++++++++++++++++++++++++++++++
 fs/block_dev.c                | 10 ++++++++++
 fs/xfs/xfs_buf_item.c         |  3 +++
 include/linux/blkdev.h        |  3 +++
 include/linux/device-mapper.h |  8 ++++++++
 6 files changed, 82 insertions(+), 2 deletions(-)

Comments

Dave Chinner July 20, 2015, 10:36 p.m. UTC | #1
On Mon, Jul 20, 2015 at 11:18:49AM -0400, Mike Snitzer wrote:
> If XFS fails to write metadata it will retry the write indefinitely
> (with the hope that the write will succeed at some point in the future).
> 
> Others can possibly speak to historic reason(s) why this is a sane
> default for XFS.  But when XFS is deployed on top of DM thin provisioning
> this infinite retry is very unwelcome -- especially if DM thinp was
> configured to be automatically extended with free space but the admin
> hasn't provided (or restored) adequate free space.
> 
> To fix this infinite retry, a new bdev_has_space() hook is added to XFS
> to break out of its metadata retry loop if the underlying block device
> reports it no longer has free space.  DM thin provisioning is now
> trained to respond accordingly, which enables XFS to not cause a cascade
> of tasks blocked on IO waiting for XFS's infinite retry.
> 
> All other block devices, which don't implement a .has_space method in
> block_device_operations, will always return true for bdev_has_space().
> 
> With this change XFS will fail the metadata IO, force shutdown, and the
> XFS filesystem may be unmounted.  This enables an admin to recover from
> their oversight, of not having provided enough free space, without
> having to force a hard reset of the system to get XFS to unwedge.
> 
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>

Shouldn't dm-thinp just return the bio with ENOSPC as its error?
The scsi layers already do this for hardware thinp ENOSPC failures,
so dm-thinp should behave exactly the same (i.e. via
__scsi_error_from_host_byte()). The behaviour of the filesystem
should be the same in all cases - making it conditional on whether
the thinp implementation can be polled for available space is wrong
as most hardware thinp can't be polled by the kernel for this info.


If dm-thinp just returns ENOSPC on the BIO like other hardware
thinp devices, then it is up to the filesystem to handle that
appropriately.  i.e. whether an ENOSPC IO error is fatal to the
filesystem is determined by filesystem configuration and context of
the IO error, not whether the block device has no space (which we
should already know from the ENOSPC error delivered by IO
completion).

Cheers,

Dave.
Mike Snitzer July 20, 2015, 11:20 p.m. UTC | #2
On Mon, Jul 20 2015 at  6:36pm -0400,
Dave Chinner <david@fromorbit.com> wrote:

> On Mon, Jul 20, 2015 at 11:18:49AM -0400, Mike Snitzer wrote:
> > If XFS fails to write metadata it will retry the write indefinitely
> > (with the hope that the write will succeed at some point in the future).
> > 
> > Others can possibly speak to historic reason(s) why this is a sane
> > default for XFS.  But when XFS is deployed on top of DM thin provisioning
> > this infinite retry is very unwelcome -- especially if DM thinp was
> > configured to be automatically extended with free space but the admin
> > hasn't provided (or restored) adequate free space.
> > 
> > To fix this infinite retry, a new bdev_has_space() hook is added to XFS
> > to break out of its metadata retry loop if the underlying block device
> > reports it no longer has free space.  DM thin provisioning is now
> > trained to respond accordingly, which enables XFS to not cause a cascade
> > of tasks blocked on IO waiting for XFS's infinite retry.
> > 
> > All other block devices, which don't implement a .has_space method in
> > block_device_operations, will always return true for bdev_has_space().
> > 
> > With this change XFS will fail the metadata IO, force shutdown, and the
> > XFS filesystem may be unmounted.  This enables an admin to recover from
> > their oversight, of not having provided enough free space, without
> > having to force a hard reset of the system to get XFS to unwedge.
> > 
> > Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> 
> Shouldn't dm-thinp just return the bio with ENOSPC as its error?
> The scsi layers already do this for hardware thinp ENOSPC failures,
> so dm-thinp should behave exactly the same (i.e. via
> __scsi_error_from_host_byte())

DM thinp does return -ENOSPC (uniformly so, as of 4.2-rc3 via commit
bcc696fac1).  Before it was more nuanced, but it would either return
-ENOSPC or -EIO (transition from -ENOSPC to -EIO was when thinp's
no_space_timeout expired).

Anyway, XFS doesn't care what error is returned; it'll just keep
retrying metadata IO indefinitely.

> The behaviour of the filesystem
> should be the same in all cases - making it conditional on whether
> the thinp implementation can be polled for available space is wrong
> as most hardware thinp can't be polled by the kernel for this info.

There isn't an immediate corollary for SCSI, no.  I've discussed flagging
the SCSI block device as out of space when the equivalent HW thinp
ASC/ASQ is returned from the SCSI target but there isn't an existing way
to know space was added to reset this state (like is possible with DM
thinp).

But in the past you (or others) have claimed that ENOSPC isn't a rich
enough return to _know_ that it is a hard error.

Now you seem to be oscillating away from a richer interface between
filesystems and the underlying block device to: a simple error code is
enough.

> If dm-thinp just returns ENOSPC on the BIO like other hardware
> thinp devices, then it is up to the filesystem to handle that
> appropriately.  i.e. whether an ENOSPC IO error is fatal to the
> filesystem is determined by filesystem configuration and context of
> the IO error, not whether the block device has no space (which we
> should already know from the ENOSPC error delivered by IO
> completion).

OK, _please_ add that configurability to XFS.  As I think you know, this
is a long-standing XFS on thinp issue (both DM and HW thinp).

Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner July 21, 2015, 12:36 a.m. UTC | #3
On Mon, Jul 20, 2015 at 07:20:58PM -0400, Mike Snitzer wrote:
> On Mon, Jul 20 2015 at  6:36pm -0400,
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Mon, Jul 20, 2015 at 11:18:49AM -0400, Mike Snitzer wrote:
> > > If XFS fails to write metadata it will retry the write indefinitely
> > > (with the hope that the write will succeed at some point in the future).
> > > 
> > > Others can possibly speak to historic reason(s) why this is a sane
> > > default for XFS.  But when XFS is deployed on top of DM thin provisioning
> > > this infinite retry is very unwelcome -- especially if DM thinp was
> > > configured to be automatically extended with free space but the admin
> > > hasn't provided (or restored) adequate free space.
> > > 
> > > To fix this infinite retry, a new bdev_has_space() hook is added to XFS
> > > to break out of its metadata retry loop if the underlying block device
> > > reports it no longer has free space.  DM thin provisioning is now
> > > trained to respond accordingly, which enables XFS to not cause a cascade
> > > of tasks blocked on IO waiting for XFS's infinite retry.
> > > 
> > > All other block devices, which don't implement a .has_space method in
> > > block_device_operations, will always return true for bdev_has_space().
> > > 
> > > With this change XFS will fail the metadata IO, force shutdown, and the
> > > XFS filesystem may be unmounted.  This enables an admin to recover from
> > > their oversight, of not having provided enough free space, without
> > > having to force a hard reset of the system to get XFS to unwedge.
> > > 
> > > Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > 
> > Shouldn't dm-thinp just return the bio with ENOSPC as its error?
> > The scsi layers already do this for hardware thinp ENOSPC failures,
> > so dm-thinp should behave exactly the same (i.e. via
> > __scsi_error_from_host_byte())
> 
> DM thinp does return -ENOSPC (uniformly so, as of 4.2-rc3 via commit
> bcc696fac1).  Before it was more nuanced, but it would either return
> -ENOSPC or -EIO (transition from -ENOSPC to -EIO was when thinp's
> no_space_timeout expired).

Right, which meant we couldn't get a reliable ENOSPC detection, so
we just had to retry and hope....

> Anyway, XFS doesn't care what error is returned; it'll just keep
> retrying metadata IO indefinitely.

Sure, that's historical behaviour and it's a good default for most
systems. But see here for how we plan to handle this particular
issue of error categorisation:

http://oss.sgi.com/archives/xfs/2015-02/msg00343.html

BTW, it's not just ENOSPC we need this for - ENOMEM needs the same
configurable behaviour now that the mm subsystem has started
treating GFP_NOFAIL differently to just looping outside the
allocator....

> > The behaviour of the filesystem
> > should be the same in all cases - making it conditional on whether
> > the thinp implementation can be polled for available space is wrong
> > as most hardware thinp can't be polled by the kernel for this info.
> 
> There isn't an immediate corollary for SCSI no.  I've discussed flagging
> the SCSI block device as out of space when the equivalent HW thinp
> ASC/ASQ is returned from the SCSI target but there isn't an existing way
> to know space was added to reset this state (like is possible with DM
> thinp).
> 
> But in the past you (or others) have claimed that ENOSPC isn't a rich
> enough return to _know_ that it is a hard error.

Yes, I've said that in the past because we haven't had any agreement
on how block devices should report ENOSPC. However, this is slowly
being standardised and I've seen enough people say
"give us a knob to decide that ourselves as we know our hardware"
that we can now rely on ENOSPC on the bio to detect this issue and
handle it appropriately.

> Now you seem to be oscillating away from a richer interface between
> filesystems and the underlying block device to: a simple error code is
> enough.

Once we decided to give users the option to configure block device
error handling, the need for more information from the block device
goes away...

> > If dm-thinp just returns ENOSPC on the BIO like other hardware
> > thinp devices, then it is up to the filesystem to handle that
> > appropriately.  i.e. whether an ENOSPC IO error is fatal to the
> > filesystem is determined by filesystem configuration and context of
> > the IO error, not whether the block device has no space (which we
> > should already know from the ENOSPC error delivered by IO
> > completion).
> 
> OK, _please_ add that configurability to XFS.  As I think you know, this
> is a long-standing XFS on thinp issue (both DM and HW thinp).

It got stuck on the side of the bikeshed, unfortunately, and then
other stuff got painted on top of it. I'll go uncover it now that
dm-thinp is returning reliable ENOSPC in ENOSPC conditions ;)

Cheers,

Dave.
Eric Sandeen July 21, 2015, 3:34 p.m. UTC | #4
On 7/20/15 5:36 PM, Dave Chinner wrote:
> On Mon, Jul 20, 2015 at 11:18:49AM -0400, Mike Snitzer wrote:
>> If XFS fails to write metadata it will retry the write indefinitely
>> (with the hope that the write will succeed at some point in the future).
>>
>> Others can possibly speak to historic reason(s) why this is a sane
>> default for XFS.  But when XFS is deployed on top of DM thin provisioning
>> this infinite retry is very unwelcome -- especially if DM thinp was
>> configured to be automatically extended with free space but the admin
>> hasn't provided (or restored) adequate free space.
>>
>> To fix this infinite retry, a new bdev_has_space() hook is added to XFS
>> to break out of its metadata retry loop if the underlying block device
>> reports it no longer has free space.  DM thin provisioning is now
>> trained to respond accordingly, which enables XFS to not cause a cascade
>> of tasks blocked on IO waiting for XFS's infinite retry.
>>
>> All other block devices, which don't implement a .has_space method in
>> block_device_operations, will always return true for bdev_has_space().
>>
>> With this change XFS will fail the metadata IO, force shutdown, and the
>> XFS filesystem may be unmounted.  This enables an admin to recover from
>> their oversight, of not having provided enough free space, without
>> having to force a hard reset of the system to get XFS to unwedge.
>>
>> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> 
> Shouldn't dm-thinp just return the bio with ENOSPC as its error?
> The scsi layers already do this for hardware thinp ENOSPC failures,
> so dm-thinp should behave exactly the same (i.e. via
> __scsi_error_from_host_byte()). The behaviour of the filesystem
> should be the same in all cases - making it conditional on whether
> the thinp implementation can be polled for available space is wrong
> as most hardware thinp can't be polled by the kernel for this info.
> 
> 
> If dm-thinp just returns ENOSPC on the BIO like other hardware
> thinp devices, then it is up to the filesystem to handle that
> appropriately.  i.e. whether an ENOSPC IO error is fatal to the
> filesystem is determined by filesystem configuration and context of
> the IO error, not whether the block device has no space (which we
> should already know from the ENOSPC error delivered by IO
> completion).

The issue we had discussed previously is that there is no agreement
across block devices about whether ENOSPC is a permanent or temporary
condition.  Asking the admin to tune the fs to each block device's
behavior sucks, IMHO.

This interface could at least be defined to reflect a permanent and
unambiguous state...

-Eric

Mike Snitzer July 21, 2015, 5:47 p.m. UTC | #5
On Tue, Jul 21 2015 at 11:34am -0400,
Eric Sandeen <sandeen@redhat.com> wrote:

> On 7/20/15 5:36 PM, Dave Chinner wrote:
> > On Mon, Jul 20, 2015 at 11:18:49AM -0400, Mike Snitzer wrote:
> >> If XFS fails to write metadata it will retry the write indefinitely
> >> (with the hope that the write will succeed at some point in the future).
> >>
> >> Others can possibly speak to historic reason(s) why this is a sane
> >> default for XFS.  But when XFS is deployed on top of DM thin provisioning
> >> this infinite retry is very unwelcome -- especially if DM thinp was
> >> configured to be automatically extended with free space but the admin
> >> hasn't provided (or restored) adequate free space.
> >>
> >> To fix this infinite retry, a new bdev_has_space() hook is added to XFS
> >> to break out of its metadata retry loop if the underlying block device
> >> reports it no longer has free space.  DM thin provisioning is now
> >> trained to respond accordingly, which enables XFS to not cause a cascade
> >> of tasks blocked on IO waiting for XFS's infinite retry.
> >>
> >> All other block devices, which don't implement a .has_space method in
> >> block_device_operations, will always return true for bdev_has_space().
> >>
> >> With this change XFS will fail the metadata IO, force shutdown, and the
> >> XFS filesystem may be unmounted.  This enables an admin to recover from
> >> their oversight, of not having provided enough free space, without
> >> having to force a hard reset of the system to get XFS to unwedge.
> >>
> >> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > 
> > Shouldn't dm-thinp just return the bio with ENOSPC as its error?
> > The scsi layers already do this for hardware thinp ENOSPC failures,
> > so dm-thinp should behave exactly the same (i.e. via
> > __scsi_error_from_host_byte()). The behaviour of the filesystem
> > should be the same in all cases - making it conditional on whether
> > the thinp implementation can be polled for available space is wrong
> > as most hardware thinp can't be polled by the kernel for this info.
> > 
> > 
> > If dm-thinp just returns ENOSPC on the BIO like other hardware
> > thinp devices, then it is up to the filesystem to handle that
> > appropriately.  i.e. whether an ENOSPC IO error is fatal to the
> > filesystem is determined by filesystem configuration and context of
> > the IO error, not whether the block device has no space (which we
> > should already know from the ENOSPC error delivered by IO
> > completion).
> 
> The issue we had discussed previously is that there is no agreement
> across block devices about whether ENOSPC is a permanent or temporary
> condition.  Asking the admin to tune the fs to each block device's
> behavior sucks, IMHO.

It does suck, but it beats the alternative of XFS continuing to do
nothing about the problem.

Discussing more with Vivek, it might be that XFS would be best served by
modeling what dm-thinp has provided with its 'no_space_timeout'.  It
defaults to queueing IO for 60 seconds; once the timeout expires, the
queued IOs get errored.  If set to 0, dm-thinp will queue IO
indefinitely.

So for XFS's use-case: s/queue/retry/

> This interface could at least be defined to reflect a permanent and
> unambiguous state...

The proposed bdev_has_space() interface enabled XFS to defer to the
block device.  But it obviously doesn't help at all if the block device
isn't providing a .has_space method -- so I can see value in XFS
having something like a 'no_space_timeout' knob.

But something needs to happen.  No more bike-shedding allowed on this
one.. PLEASE DO SOMETHING! :)
Dave Chinner July 22, 2015, 12:09 a.m. UTC | #6
On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
> On Tue, Jul 21 2015 at 11:34am -0400, Eric Sandeen <sandeen@redhat.com> wrote:
> > On 7/20/15 5:36 PM, Dave Chinner wrote:
> > The issue we had discussed previously is that there is no agreement
> > across block devices about whether ENOSPC is a permanent or temporary
> > condition.  Asking the admin to tune the fs to each block device's
> > behavior sucks, IMHO.
> 
> It does suck, but it beats the alternative of XFS continuing to do
> nothing about the problem.

Just a comment on that: doing nothing is better than doing the wrong
thing and being stuck with it forever. :)

> Discussing more with Vivek, might be that XFS would be best served to
> model what dm-thinp has provided with its 'no_space_timeout'.  It
> defaults to queueing IO for 60 seconds, once the timeout expires the
> queued IOs get errored.  If set to 0 dm-thinp will queue IO
> indefinitely.

Yes, that's exactly what I proposed in the thread I referenced in
my previous email, and what got stuck on the bikeshed wall because
of these concerns about knob twiddling:

http://oss.sgi.com/archives/xfs/2015-02/msg00346.html

| e.g. if we need configurable error handling, it needs to be
| configurable for different error types, and it needs to be
| configurable on a per-mount basis. And it needs to be configurable
| at runtime, not just at mount time. That kind of leads to using
| sysfs for this. e.g. for each error type we need to handle different
| behaviour for:
| 
| $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/type
| [transient] permanent
| $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/perm_timeout_seconds
| 300
| $ cat
| /sys/fs/xfs/vda/meta_write_errors/enospc/perm_max_retry_attempts
| 50
| $ cat
| /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
| 1
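Assuming the proposed layout quoted above (it is a proposal, not a merged interface), tuning would reduce to plain sysfs writes.  The snippet mocks the directory under a temp dir so the commands can run anywhere; on a real future kernel the prefix would be /sys/fs/xfs/<dev>:

```shell
# Mock the *proposed* xfs error-config sysfs tree (illustrative only).
knobs=$(mktemp -d)/meta_write_errors/enospc
mkdir -p "$knobs"

# Admin decides ENOSPC from this block device is a permanent error:
echo permanent > "$knobs/type"
# ...and bounds how long a transient ENOSPC may be retried:
echo 300 > "$knobs/perm_timeout_seconds"

cat "$knobs/type"
```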

I've rebased this patchset, and I'm cleaning it up now, so in a few
days I'll have something for review, likely for the 4.3 merge
window....

Cheers,

Dave.
Dave Chinner July 22, 2015, 1 a.m. UTC | #7
On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote:
> On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
> > On Tue, Jul 21 2015 at 11:34am -0400, Eric Sandeen <sandeen@redhat.com> wrote:
> > > On 7/20/15 5:36 PM, Dave Chinner wrote:
> > > The issue we had discussed previously is that there is no agreement
> > > across block devices about whether ENOSPC is a permanent or temporary
> > > condition.  Asking the admin to tune the fs to each block device's
> > > behavior sucks, IMHO.
> > 
> > It does suck, but it beats the alternative of XFS continuing to do
> > nothing about the problem.
> 
> Just a comment on that: doing nothing is better than doing the wrong
> thing and being stuck with it forever. :)
> 
> > Discussing more with Vivek, might be that XFS would be best served to
> > model what dm-thinp has provided with its 'no_space_timeout'.  It
> > defaults to queueing IO for 60 seconds, once the timeout expires the
> > queued IOs get errored.  If set to 0 dm-thinp will queue IO
> > indefinitely.
> 
> Yes, that's exactly what I proposed in the thread I referenced in
> my previous email, and what got stuck on the bikeshed wall because
> of these concerns about knob twiddling:
> 
> http://oss.sgi.com/archives/xfs/2015-02/msg00346.html
> 
> | e.g. if we need configurable error handling, it needs to be
> | configurable for different error types, and it needs to be
> | configurable on a per-mount basis. And it needs to be configurable
> | at runtime, not just at mount time. That kind of leads to using
> | sysfs for this. e.g. for each error type we need to handle different
> | behaviour for:
> | 
> | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/type
> | [transient] permanent
> | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/perm_timeout_seconds
> | 300
> | $ cat
> | /sys/fs/xfs/vda/meta_write_errors/enospc/perm_max_retry_attempts
> | 50
> | $ cat
> | /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
> | 1
> 
> I've rebased this patchset, and I'm cleaning it up now, so in a few
> days I'll have something for review, likely for the 4.3 merge
> window....

Just thinking a bit more on how to make this simpler to configure,
is there a simple way for the filesystem to determine the current
config of the dm thinp volume? i.e. if the dm-thinp volume is
configured to error out immediately on enospc, then XFS should
default to doing the same thing.  Having XFS be able to grab this
status at mount time and change the default ENOSPC error config from
transient to permanent on such dm-thinp volumes would go a long way
to making these configs Just Do The Right Thing on block dev enospc
errors...

e.g. if dm-thinp is configured to queue for 60s and then fail on
ENOSPC, we want XFS to fail immediately on ENOSPC in metadata IO. If
dm-thinp is configured to ENOSPC instantly (i.e. no queueing) then
we want XFS to retry and use its default retry maximums before
failing permanently.

Cheers,

Dave.
Mike Snitzer July 22, 2015, 1:40 a.m. UTC | #8
On Tue, Jul 21 2015 at  9:00pm -0400,
Dave Chinner <david@fromorbit.com> wrote:

> On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote:
> > On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
> > > On Tue, Jul 21 2015 at 11:34am -0400, Eric Sandeen <sandeen@redhat.com> wrote:
> > > > On 7/20/15 5:36 PM, Dave Chinner wrote:
> > > > The issue we had discussed previously is that there is no agreement
> > > > across block devices about whether ENOSPC is a permanent or temporary
> > > > condition.  Asking the admin to tune the fs to each block device's
> > > > behavior sucks, IMHO.
> > > 
> > > It does suck, but it beats the alternative of XFS continuing to do
> > > nothing about the problem.
> > 
> > Just a comment on that: doing nothing is better than doing the wrong
> > thing and being stuck with it forever. :)
> > 
> > > Discussing more with Vivek, might be that XFS would be best served to
> > > model what dm-thinp has provided with its 'no_space_timeout'.  It
> > > defaults to queueing IO for 60 seconds, once the timeout expires the
> > > queued IOs get errored.  If set to 0 dm-thinp will queue IO
> > > indefinitely.
> > 
> > Yes, that's exactly what I proposed in the thread I referenced in
> > my previous email, and what got stuck on the bikeshed wall because
> > of these concerns about knob twiddling:
> > 
> > http://oss.sgi.com/archives/xfs/2015-02/msg00346.html
> > 
> > | e.g. if we need configurable error handling, it needs to be
> > | configurable for different error types, and it needs to be
> > | configurable on a per-mount basis. And it needs to be configurable
> > | at runtime, not just at mount time. That kind of leads to using
> > | sysfs for this. e.g. for each error type we need to handle different
> > | behaviour for:
> > | 
> > | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/type
> > | [transient] permanent
> > | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/perm_timeout_seconds
> > | 300
> > | $ cat
> > | /sys/fs/xfs/vda/meta_write_errors/enospc/perm_max_retry_attempts
> > | 50
> > | $ cat
> > | /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
> > | 1
> > 
> > I've rebased this patchset, and I'm cleaning it up now, so in a few
> > days I'll have something for review, likely for the 4.3 merge
> > window....
> 
> Just thinking a bit more on how to make this simpler to configure,
> is there a simple way for the filesystem to determine the current
> config of the dm thinp volume? i.e. if the dm-thinp volume is
> configured to error out immediately on enospc, then XFS should
> default to doing the same thing.  Having XFS be able to grab this
> status at mount time and change the default ENOSPC error config from
> transient to permanent on such dm-thinp volumes would go a long way
> to making these configs Just Do The Right Thing on block dev enospc
> errors...
> 
> e.g. if dm-thinp is configured to queue for 60s and then fail on
> ENOSPC, we want XFS to fail immediately on ENOSPC in metadata IO. If
> dm-thinp is configured to ENOSPC instantly (i.e. no queueing) then
> we want XFS to retry and use its default retry maximums before
> failing permanently.

Yes, that'd be nice.  But there isn't a way to easily get the DM thinp
device's config from within the kernel (unless XFS wants to get into the
business of issuing ioctls to DM devices.. unlikely).  I could be
persuaded to expose a per-device sysfs file to get the status (would
avoid need for ioctl), e.g.:
 # cat /sys/block/dm-5/dm/status
(but that doesn't _really_ help in-kernel access, awkward for filesystem
code to be opening sysfs files!)

So userspace (mkfs.xfs) could easily check the thinp device's setup
using 'dmsetup status <device>' (output will either contain
'queue_if_no_space' or 'error_if_no_space').  The DM thinp
'no_space_timeout' (applicable if queue_if_no_space) is a thinp global
accessed using a module param:
 # cat /sys/module/dm_thin_pool/parameters/no_space_timeout
 60
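For instance, a script could classify the pool's out-of-space policy from the status line.  The status string below is a canned example of thin-pool output (field values are made up); a live script would use status=$(dmsetup status "$pool") instead:

```shell
# Canned example of 'dmsetup status' output for a thin-pool target.
status="0 2097152 thin-pool 1 406/4096 1280/524288 - rw no_discard_passdown queue_if_no_space"

case "$status" in
  *error_if_no_space*) policy=error ;;    # pool errors IO immediately
  *queue_if_no_space*) policy=queue ;;    # pool queues, then errors after no_space_timeout
  *)                   policy=unknown ;;
esac
echo "$policy"
```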

I'm open to considering alternative interfaces for getting you the info
you need.  I just don't have a great sense for what mechanism you'd like
to use.  Do we invent a new block device operations table method that
sets values in a 'struct no_space_strategy' passed in to the
block device?
Dave Chinner July 22, 2015, 2:37 a.m. UTC | #9
On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote:
> On Tue, Jul 21 2015 at  9:00pm -0400,
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote:
> > > On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
> > > > On Tue, Jul 21 2015 at 11:34am -0400, Eric Sandeen <sandeen@redhat.com> wrote:
> > > > > On 7/20/15 5:36 PM, Dave Chinner wrote:
> > > > > The issue we had discussed previously is that there is no agreement
> > > > > across block devices about whether ENOSPC is a permanent or temporary
> > > > > condition.  Asking the admin to tune the fs to each block device's
> > > > > behavior sucks, IMHO.
> > > > 
> > > > It does suck, but it beats the alternative of XFS continuing to do
> > > > nothing about the problem.
> > > 
> > > Just a comment on that: doing nothing is better than doing the wrong
> > > thing and being stuck with it forever. :)
> > > 
> > > > Discussing more with Vivek, might be that XFS would be best served to
> > > > model what dm-thinp has provided with its 'no_space_timeout'.  It
> > > > defaults to queueing IO for 60 seconds, once the timeout expires the
> > > > queued IOs get errored.  If set to 0 dm-thinp will queue IO
> > > > indefinitely.
> > > 
> > > Yes, that's exactly what I proposed in the thread I referenced in
> > > my previous email, and what got stuck on the bikeshed wall because
> > > of these concerns about knob twiddling:
> > > 
> > > http://oss.sgi.com/archives/xfs/2015-02/msg00346.html
> > > 
> > > | e.g. if we need configurable error handling, it needs to be
> > > | configurable for different error types, and it needs to be
> > > | configurable on a per-mount basis. And it needs to be configurable
> > > | at runtime, not just at mount time. That kind of leads to using
> > > | sysfs for this. e.g. for each error type we need to handle different
> > > | behaviour for:
> > > | 
> > > | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/type
> > > | [transient] permanent
> > > | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/perm_timeout_seconds
> > > | 300
> > > | $ cat
> > > | /sys/fs/xfs/vda/meta_write_errors/enospc/perm_max_retry_attempts
> > > | 50
> > > | $ cat
> > > | /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
> > > | 1
> > > 
> > > I've rebased this patchset, and I'm cleaning it up now, so in a few
> > > days I'll have something for review, likely for the 4.3 merge
> > > window....
> > 
> > Just thinking a bit more on how to make this simpler to configure,
> > is there a simple way for the filesystem to determine the current
> > config of the dm thinp volume? i.e. if the dm-thinp volume is
> > configured to error out immediately on enospc, then XFS should
> > default to doing the same thing.  Having XFS be able to grab this
> > status at mount time and change the default ENOSPC error config from
> > transient to permanent on such dm-thinp volumes would go a long way
> > to making these configs Just Do The Right Thing on block dev enospc
> > errors...
> > 
> > e.g. if dm-thinp is configured to queue for 60s and then fail on
> > ENOSPC, we want XFS to fail immediately on ENOSPC in metadata IO. If
> > dm-thinp is configured to ENOSPC instantly (i.e. no queueing) then
> > we want XFS to retry and use it's default retry maximums before
> > failing permanently.
> 
> Yes, that'd be nice.  But there isn't a way to easily get the DM thinp
> device's config from within the kernel (unless XFS wants to get into the
> business of issuing ioctls to DM devices.. unlikely). 

Not really.

> I could be
> persuaded to expose a per-device sysfs file to get the status (would
> avoid need for ioctl), e.g.:
>  # cat /sys/block/dm-5/dm/status
> (but that doesn't _really_ help in-kernel access, awkward for filesystem
> code to be opening sysfs files!)

No, not going that way.  We have direct access through the bdev we
opened, so that's the communications channel we'd need to use.

> So userspace (mkfs.xfs) could easily check the thinp device's setup
> using 'dmsetup status <device>' (output will either contain
> 'queue_if_no_space' or 'error_if_no_space').  The DM thinp
> 'no_space_timeout' (applicable if queue_if_no_space) is a thinp global
> accessed using a module param:
>  # cat /sys/module/dm_thin_pool/parameters/no_space_timeout
>  60

Mkfs is not the right interface - users can change dm-thinp
behaviour long after the filesystem was created and so the XFS
config needs to be configurable, too. Further, I really don't want
to have to add anything to the on-disk format to support error
configuration because, well, that drives the level of complexity up
a couple of orders of magnitude (mkfs, repair, metadump, db, etc. all
need to support it), especially when it can be driven easily from
userspace after mount with far less constraints and support burden.

> I'm open to considering alternative interfaces for getting you the info
> you need.  I just don't have a great sense for what mechanism you'd like
> to use.  Do we invent a new block device operations table method that
> sets values in a 'struct no_space_strategy' passed in to the
> blockdevice?

It's long been frowned on having the filesystems dig into block
device structures. We have lots of wrapper functions for getting
information from or performing operations on block devices. (e.g.
bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(),
blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd
need to follow. If we do that - bdev_get_nospace_strategy() - then
how that information gets to the filesystem is completely opaque
at the fs level, and the block layer can implement it in whatever
way is considered sane...

And, realistically, all we really need returned is a enum to tell us
how the bdev behaves on enospc:
	- bdev fails fast, (i.e. immediate ENOSPC)
	- bdev fails slow, (i.e. queue for some time, then ENOSPC)
	- bdev never fails (i.e. queue forever)
	- bdev doesn't support this (i.e. EOPNOTSUPP)
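
In rough C, that contract might look like the sketch below. None of these
names exist in the kernel today; they are purely illustrative of the
bdev_get_nospace_strategy() wrapper idea, with a userspace stand-in in
place of struct block_device so the sketch is self-contained:

```c
#include <assert.h>

/* Hypothetical enum for the four behaviours listed above. */
enum bdev_nospace_behaviour {
	BDEV_NOSPACE_FAIL_FAST,		/* immediate ENOSPC */
	BDEV_NOSPACE_FAIL_SLOW,		/* queue for some time, then ENOSPC */
	BDEV_NOSPACE_FAIL_NEVER,	/* queue forever */
	BDEV_NOSPACE_UNSUPPORTED,	/* bdev can't report this (EOPNOTSUPP) */
};

/* Userspace stand-in for struct block_device, for illustration only. */
struct fake_bdev {
	int has_nospace_op;		/* does the driver implement the query? */
	enum bdev_nospace_behaviour nospace;
};

/*
 * Same pattern as bdev_read_only()/bdev_get_queue(): the filesystem asks
 * through an opaque wrapper and never digs into driver structures.
 */
static enum bdev_nospace_behaviour
bdev_get_nospace_strategy(const struct fake_bdev *bdev)
{
	if (!bdev->has_nospace_op)
		return BDEV_NOSPACE_UNSUPPORTED;
	return bdev->nospace;
}
```

A device that doesn't implement the hook simply reports "unsupported",
which maps to today's behaviour at the fs level.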

Cheers,

Dave.
Mike Snitzer July 22, 2015, 1:34 p.m. UTC | #10
On Tue, Jul 21 2015 at 10:37pm -0400,
Dave Chinner <david@fromorbit.com> wrote:

> On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote:
> 
> > I'm open to considering alternative interfaces for getting you the info
> > you need.  I just don't have a great sense for what mechanism you'd like
> > to use.  Do we invent a new block device operations table method that
> > sets values in a 'struct no_space_strategy' passed in to the
> > blockdevice?
> 
> It's long been frowned on having the filesystems dig into block
> device structures. We have lots of wrapper functions for getting
> information from or performing operations on block devices. (e.g.
> bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(),
> blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd
> need to follow. If we do that - bdev_get_nospace_strategy() - then
> how that information gets to the filesystem is completely opaque
> at the fs level, and the block layer can implement it in whatever
> way is considered sane...
> 
> And, realistically, all we really need returned is a enum to tell us
> how the bdev behaves on enospc:
> 	- bdev fails fast, (i.e. immediate ENOSPC)
> 	- bdev fails slow, (i.e. queue for some time, then ENOSPC)
> 	- bdev never fails (i.e. queue forever)
> 	- bdev doesn't support this (i.e. EOPNOTSUPP)

This 'struct no_space_strategy' would be invented purely for
informational purposes for upper layers' benefit -- I don't consider it
a "block device structure" it the traditional sense.

I was thinking upper layers would like to know the actual timeout value
for the "fails slow" case.  As such the 'struct no_space_strategy' would
have the enum and the timeout.  And would be returned with a call:
     bdev_get_nospace_strategy(bdev, &no_space_strategy)
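
As a sketch, such a struct might carry the enum plus the queue timeout.
The names are hypothetical, not existing kernel symbols; the stand-in
"bdev" below just models dm-thinp's queue_if_no_space/error_if_no_space
setting and its no_space_timeout (where 0 means queue indefinitely):

```c
#include <assert.h>

/* Hypothetical: the four behaviours from earlier in the thread. */
enum nospace_behaviour { NOSPACE_FAIL_FAST, NOSPACE_FAIL_SLOW,
			 NOSPACE_FAIL_NEVER, NOSPACE_UNSUPPORTED };

/* Hypothetical informational struct filled in for upper layers. */
struct no_space_strategy {
	enum nospace_behaviour behaviour;
	unsigned int timeout_secs;	/* only meaningful for fail-slow */
};

/* Userspace stand-in modelling the dm-thinp configuration. */
struct fake_bdev {
	int queue_if_no_space;		/* 0 == error_if_no_space */
	unsigned int no_space_timeout;	/* seconds; 0 == queue forever */
};

/* Sketch of the out-parameter call described above. */
static void bdev_get_nospace_strategy(const struct fake_bdev *bdev,
				      struct no_space_strategy *ns)
{
	if (!bdev->queue_if_no_space) {
		ns->behaviour = NOSPACE_FAIL_FAST;	/* immediate ENOSPC */
		ns->timeout_secs = 0;
	} else if (bdev->no_space_timeout == 0) {
		ns->behaviour = NOSPACE_FAIL_NEVER;	/* queue forever */
		ns->timeout_secs = 0;
	} else {
		ns->behaviour = NOSPACE_FAIL_SLOW;	/* queue, then fail */
		ns->timeout_secs = bdev->no_space_timeout;
	}
}
```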
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen July 22, 2015, 4:28 p.m. UTC | #11
On 7/22/15 8:34 AM, Mike Snitzer wrote:
> On Tue, Jul 21 2015 at 10:37pm -0400,
> Dave Chinner <david@fromorbit.com> wrote:
> 
>> On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote:
>>
>>> I'm open to considering alternative interfaces for getting you the info
>>> you need.  I just don't have a great sense for what mechanism you'd like
>>> to use.  Do we invent a new block device operations table method that
>>> sets values in a 'struct no_space_strategy' passed in to the
>>> blockdevice?
>>
>> It's long been frowned on having the filesystems dig into block
>> device structures. We have lots of wrapper functions for getting
>> information from or performing operations on block devices. (e.g.
>> bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(),
>> blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd
>> need to follow. If we do that - bdev_get_nospace_strategy() - then
>> how that information gets to the filesystem is completely opaque
>> at the fs level, and the block layer can implement it in whatever
>> way is considered sane...
>>
>> And, realistically, all we really need returned is a enum to tell us
>> how the bdev behaves on enospc:
>> 	- bdev fails fast, (i.e. immediate ENOSPC)
>> 	- bdev fails slow, (i.e. queue for some time, then ENOSPC)
>> 	- bdev never fails (i.e. queue forever)
>> 	- bdev doesn't support this (i.e. EOPNOTSUPP)

I'm not sure how this is more useful than the bdev simply responding to
a query of "should we keep trying IOs?"

IOWS do we really care if it's failing fast or slow, vs. simply knowing
whether it has now permanently failed?

So rather than "bdev_get_nospace_strategy" it seems like all we need
to know is "bdev_has_failed" - do we really care about the details?

> This 'struct no_space_strategy' would be invented purely for
> informational purposes for upper layers' benefit -- I don't consider it
> a "block device structure" it the traditional sense.
> 
> I was thinking upper layers would like to know the actual timeout value
> for the "fails slow" case.  As such the 'struct no_space_strategy' would
> have the enum and the timeout.  And would be returned with a call:
>      bdev_get_nospace_strategy(bdev, &no_space_strategy)

Asking for the timeout value seems to add complexity.  It could change after
we ask, and knowing it now requires another layer to be handling timeouts...

Thanks,
-Eric
Mike Snitzer July 22, 2015, 4:51 p.m. UTC | #12
On Wed, Jul 22 2015 at 12:28pm -0400,
Eric Sandeen <sandeen@redhat.com> wrote:

> On 7/22/15 8:34 AM, Mike Snitzer wrote:
> > On Tue, Jul 21 2015 at 10:37pm -0400,
> > Dave Chinner <david@fromorbit.com> wrote:
> > 
> >> On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote:
> >>
> >>> I'm open to considering alternative interfaces for getting you the info
> >>> you need.  I just don't have a great sense for what mechanism you'd like
> >>> to use.  Do we invent a new block device operations table method that
> >>> sets values in a 'struct no_space_strategy' passed in to the
> >>> blockdevice?
> >>
> >> It's long been frowned on having the filesystems dig into block
> >> device structures. We have lots of wrapper functions for getting
> >> information from or performing operations on block devices. (e.g.
> >> bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(),
> >> blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd
> >> need to follow. If we do that - bdev_get_nospace_strategy() - then
> >> how that information gets to the filesystem is completely opaque
> >> at the fs level, and the block layer can implement it in whatever
> >> way is considered sane...
> >>
> >> And, realistically, all we really need returned is a enum to tell us
> >> how the bdev behaves on enospc:
> >> 	- bdev fails fast, (i.e. immediate ENOSPC)
> >> 	- bdev fails slow, (i.e. queue for some time, then ENOSPC)
> >> 	- bdev never fails (i.e. queue forever)
> >> 	- bdev doesn't support this (i.e. EOPNOTSUPP)
> 
> I'm not sure how this is more useful than the bdev simply responding to
> a query of "should we keep trying IOs?"
> 
> IOWS do we really care if it's failing fast or slow, vs. simply knowing
> whether it has now permanently failed?
> 
> So rather than "bdev_get_nospace_strategy" it seems like all we need
> to know is "bdev_has_failed" - do we really care about the details?

My bdev_has_space() proposal is no different than bdev_has_failed().  If
you prefer the more generic name then fine.  But bdev_has_failed() is of
limited utility outside of devices that provide support.  So I can see
why Dave is resisting it.

Anyway, the benefit of XFS tailoring its independent config based on
dm-thinp's comparable config makes sense to me.  The reason for XFS's
independent config is it could be deployed on any storage (e.g. not
dm-thinp).

That affords XFS the ability to defer to DM thinp while still having
comparable functionality for HW thinp or some other storage.

> > This 'struct no_space_strategy' would be invented purely for
> > informational purposes for upper layers' benefit -- I don't consider it
> > a "block device structure" it the traditional sense.
> > 
> > I was thinking upper layers would like to know the actual timeout value
> > for the "fails slow" case.  As such the 'struct no_space_strategy' would
> > have the enum and the timeout.  And would be returned with a call:
> >      bdev_get_nospace_strategy(bdev, &no_space_strategy)
> 
> Asking for the timeout value seems to add complexity.  It could change after
> we ask, and knowing it now requires another layer to be handling timeouts...

Dave is already saying XFS will have a timeout it'll be managing.
It stands to reason that XFS would base its timeout on DM thinp's timeout.
But yeah it does allow the stacked timeout that XFS uses to be out of
sync if the lower timeout changes (no different than blk_stack_limits).

Please fix this however you see fit.  I'll assist anywhere that makes
sense.
Dave Chinner July 23, 2015, 5:10 a.m. UTC | #13
On Wed, Jul 22, 2015 at 11:28:06AM -0500, Eric Sandeen wrote:
> On 7/22/15 8:34 AM, Mike Snitzer wrote:
> > On Tue, Jul 21 2015 at 10:37pm -0400,
> > Dave Chinner <david@fromorbit.com> wrote:
> >> On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote:
> >>> I'm open to considering alternative interfaces for getting you the info
> >>> you need.  I just don't have a great sense for what mechanism you'd like
> >>> to use.  Do we invent a new block device operations table method that
> >>> sets values in a 'struct no_space_strategy' passed in to the
> >>> blockdevice?
> >>
> >> It's long been frowned on having the filesystems dig into block
> >> device structures. We have lots of wrapper functions for getting
> >> information from or performing operations on block devices. (e.g.
> >> bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(),
> >> blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd
> >> need to follow. If we do that - bdev_get_nospace_strategy() - then
> >> how that information gets to the filesystem is completely opaque
> >> at the fs level, and the block layer can implement it in whatever
> >> way is considered sane...
> >>
> >> And, realistically, all we really need returned is a enum to tell us
> >> how the bdev behaves on enospc:
> >> 	- bdev fails fast, (i.e. immediate ENOSPC)
> >> 	- bdev fails slow, (i.e. queue for some time, then ENOSPC)
> >> 	- bdev never fails (i.e. queue forever)
> >> 	- bdev doesn't support this (i.e. EOPNOTSUPP)
> 
> I'm not sure how this is more useful than the bdev simply responding to
> a query of "should we keep trying IOs?"

	- bdev fails fast, (i.e. immediate ENOSPC)

XFS should use a bounded retry behaviour to allow the possibility of
the admin adding more space before we shut down the fs. i.e.
XFS fails slow.

	- bdev fails slow, (i.e. queue for some time, then ENOSPC)

We know that IOs are going to be delayed before they are failed, so
there's no point in retrying as the admin has already had a chance
to resolve the ENOSPC condition before failure was reported. i.e.
XFS fails fast.

	- bdev never fails (i.e. queue forever)

Block device will appear to hang when it runs out of space. Nothing
XFS can do here because IOs never fail, but we need to note this in
the log at mount time so that filesystem hangs are easily explained
when reported to us.

	- bdev doesn't support this (i.e. EOPNOTSUPP)

XFS uses default "retry forever" behaviour.

> > This 'struct no_space_strategy' would be invented purely for
> > informational purposes for upper layers' benefit -- I don't consider it
> > a "block device structure" it the traditional sense.
> > 
> > I was thinking upper layers would like to know the actual timeout value
> > for the "fails slow" case.  As such the 'struct no_space_strategy' would
> > have the enum and the timeout.  And would be returned with a call:
> >      bdev_get_nospace_strategy(bdev, &no_space_strategy)
> 
> Asking for the timeout value seems to add complexity.  It could change after
> we ask, and knowing it now requires another layer to be handling timeouts...

I don't think knowing the bdev timeout is necessary because the
default is most likely to be "fail fast" in this case. i.e. no
retries, just shut down.  IOWs, if we describe the configs and
actions in neutral terms, then the default configurations easy for
users to understand. i.e:

bdev enospc		XFS default
-----------		-----------
Fail slow		Fail fast
Fail fast		Fail slow
Fail never		Fail never, Record in log
EOPNOTSUPP		Fail never
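
In code form, the table reduces to a trivial mapping (enum and function
names below are illustrative only, not a proposed interface):

```c
#include <assert.h>

/* Illustrative names for the failure behaviours discussed above. */
enum fail_behaviour { FAIL_FAST, FAIL_SLOW, FAIL_NEVER, FAIL_UNSUPPORTED };

/*
 * The proposed default: invert the bdev's behaviour so that exactly one
 * layer queues/retries, and fall back to "retry forever" when the bdev
 * can't tell us anything.
 */
static enum fail_behaviour xfs_default_enospc(enum fail_behaviour bdev)
{
	switch (bdev) {
	case FAIL_FAST:		return FAIL_SLOW;   /* bounded retries */
	case FAIL_SLOW:		return FAIL_FAST;   /* bdev already waited */
	case FAIL_NEVER:	return FAIL_NEVER;  /* plus a note in the log */
	default:		return FAIL_NEVER;  /* EOPNOTSUPP: retry forever */
	}
}
```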

With that in mind, I'm thinking I should drop the
"permanent/transient" error classifications, and change it "failure
behaviour" with the options "fast slow [never]" and only the slow
option has retry/timeout configuration options.  I think the "never"
option still needs to "fail at unmount" config variable, but we
enable it by default rather than hanging unmount and requiring a
manual shutdown like we do now....

Cheers,

Dave.
Mike Snitzer July 23, 2015, 2:33 p.m. UTC | #14
On Thu, Jul 23 2015 at  1:10am -0400,
Dave Chinner <david@fromorbit.com> wrote:

> On Wed, Jul 22, 2015 at 11:28:06AM -0500, Eric Sandeen wrote:
> > On 7/22/15 8:34 AM, Mike Snitzer wrote:
> > > On Tue, Jul 21 2015 at 10:37pm -0400,
> > > Dave Chinner <david@fromorbit.com> wrote:
> > >> On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote:
> > >>> I'm open to considering alternative interfaces for getting you the info
> > >>> you need.  I just don't have a great sense for what mechanism you'd like
> > >>> to use.  Do we invent a new block device operations table method that
> > >>> sets values in a 'struct no_space_strategy' passed in to the
> > >>> blockdevice?
> > >>
> > >> It's long been frowned on having the filesystems dig into block
> > >> device structures. We have lots of wrapper functions for getting
> > >> information from or performing operations on block devices. (e.g.
> > >> bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(),
> > >> blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd
> > >> need to follow. If we do that - bdev_get_nospace_strategy() - then
> > >> how that information gets to the filesystem is completely opaque
> > >> at the fs level, and the block layer can implement it in whatever
> > >> way is considered sane...
> > >>
> > >> And, realistically, all we really need returned is a enum to tell us
> > >> how the bdev behaves on enospc:
> > >> 	- bdev fails fast, (i.e. immediate ENOSPC)
> > >> 	- bdev fails slow, (i.e. queue for some time, then ENOSPC)
> > >> 	- bdev never fails (i.e. queue forever)
> > >> 	- bdev doesn't support this (i.e. EOPNOTSUPP)
> > 
> > I'm not sure how this is more useful than the bdev simply responding to
> > a query of "should we keep trying IOs?"
> 
> 	- bdev fails fast, (i.e. immediate ENOSPC)
> 
> XFS should use a bound retry behaviour for to allow the possiblity of
> the admin adding more space before we shut down the fs. i.e.
> XFS fails slow.
> 
> 	- bdev fails slow, (i.e. queue for some time, then ENOSPC)
> 
> We know that IOs are going to be delayed before they are failed, so
> there's no point in retrying as the admin has already had a chance
> to resolve the ENOSPC condition before failure was reported. i.e.
> XFS fails fast.
> 
> 	- bdev never fails (i.e. queue forever)
> 
> Block device will appear to hang when it runs out of space. Nothing
> XFS can do here because IOs never fail, but we need to note this in
> the log at mount time so that filesystem hangs are easily explained
> when reported to us.
> 
> 	- bdev doesn't support this (i.e. EOPNOTSUPP)
> 
> XFS uses default "retry forever" behaviour.
> 
> > > This 'struct no_space_strategy' would be invented purely for
> > > informational purposes for upper layers' benefit -- I don't consider it
> > > a "block device structure" it the traditional sense.
> > > 
> > > I was thinking upper layers would like to know the actual timeout value
> > > for the "fails slow" case.  As such the 'struct no_space_strategy' would
> > > have the enum and the timeout.  And would be returned with a call:
> > >      bdev_get_nospace_strategy(bdev, &no_space_strategy)
> > 
> > Asking for the timeout value seems to add complexity.  It could change after
> > we ask, and knowing it now requires another layer to be handling timeouts...
> 
> I don't think knowing the bdev timeout is necessary because the
> default is most likely to be "fail fast" in this case. i.e. no
> retries, just shut down.  IOWs, if we describe the configs and
> actions in neutral terms, then the default configurations easy for
> users to understand. i.e:
> 
> bdev enospc		XFS default
> -----------		-----------
> Fail slow		Fail fast
> Fail fast		Fail slow
> Fail never		Fail never, Record in log
> EOPNOTSUPP		Fail never
> 
> With that in mind, I'm thinking I should drop the
> "permanent/transient" error classifications, and change it "failure
> behaviour" with the options "fast slow [never]" and only the slow
> option has retry/timeout configuration options.  I think the "never"
> option still needs to "fail at unmount" config variable, but we
> enable it by default rather than hanging unmount and requiring a
> manual shutdown like we do now....

This all sounds good to me.  The simpler XFS configuration looks like a
nice improvement.

If you just want to stub out the call to bdev_get_nospace_strategy() I
can crank through implementing it once I get a few minutes.

Btw, not sure what I was thinking when suggesting XFS would benefit from
knowing the duration of the thinp no_space_timeout.
Vivek Goyal July 23, 2015, 4:43 p.m. UTC | #15
On Thu, Jul 23, 2015 at 03:10:43PM +1000, Dave Chinner wrote:

[..]
> I don't think knowing the bdev timeout is necessary because the
> default is most likely to be "fail fast" in this case. i.e. no
> retries, just shut down.  IOWs, if we describe the configs and
> actions in neutral terms, then the default configurations easy for
> users to understand. i.e:
> 
> bdev enospc		XFS default
> -----------		-----------
> Fail slow		Fail fast
> Fail fast		Fail slow
> Fail never		Fail never, Record in log
> EOPNOTSUPP		Fail never
> 
> With that in mind, I'm thinking I should drop the
> "permanent/transient" error classifications, and change it "failure
> behaviour" with the options "fast slow [never]" and only the slow
> option has retry/timeout configuration options.  I think the "never"
> option still needs to "fail at unmount" config variable, but we
> enable it by default rather than hanging unmount and requiring a
> manual shutdown like we do now....

I am wondering whether, instead of four knobs (fast, slow, never,
retry-timeout), we can do with just one knob per error type: retry-timeout.

retry-timeout=0 (Fail fast)
retry-timeout=X (Fail slow)
retry-timeout=-1 (Never Give up).

Also, do we really need this timeout per error type?

It would also be nice if this timeout were configurable using a mount
option. Then we could just specify it at mount time and be done
with it.

The idea of auto-tuning based on what the block device is doing sounds
reasonable, but that should not be a requirement for this patch and can
go in later. It is one of those nice-to-have features.

Thanks
Vivek
Mikulas Patocka July 23, 2015, 5:08 p.m. UTC | #16
On Wed, 22 Jul 2015, Dave Chinner wrote:

> On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote:
> > On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
> > > On Tue, Jul 21 2015 at 11:34am -0400, Eric Sandeen <sandeen@redhat.com> wrote:
> > > > On 7/20/15 5:36 PM, Dave Chinner wrote:
> > > > The issue we had discussed previously is that there is no agreement
> > > > across block devices about whether ENOSPC is a permanent or temporary
> > > > condition.  Asking the admin to tune  the fs to each block device's
> > > > behavior sucks, IMHO.
> > > 
> > > It does suck, but it beats the alternative of XFS continuing to do
> > > nothing about the problem.
> > 
> > Just a comment on that: doing nothing is better than doing the wrong
> > thing and being stuck with it forever. :)
> > 
> > > Disucssing more with Vivek, might be that XFS would be best served to
> > > model what dm-thinp has provided with its 'no_space_timeout'.  It
> > > defaults to queueing IO for 60 seconds, once the timeout expires the
> > > queued IOs getted errored.  If set to 0 dm-thinp will queue IO
> > > indefinitely.
> > 
> > Yes, that's exactly what I proposed in the thread I referenced in
> > my previous email, and what got stuck on the bikeshed wall because
> > of these concerns about knob twiddling:
> > 
> > http://oss.sgi.com/archives/xfs/2015-02/msg00346.html
> > 
> > | e.g. if we need configurable error handling, it needs to be
> > | configurable for different error types, and it needs to be
> > | configurable on a per-mount basis. And it needs to be configurable
> > | at runtime, not just at mount time. That kind of leads to using
> > | sysfs for this. e.g. for each error type we ned to handle different
> > | behaviour for:
> > | 
> > | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/type
> > | [transient] permanent
> > | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/perm_timeout_seconds
> > | 300
> > | $ cat
> > | /sys/fs/xfs/vda/meta_write_errors/enospc/perm_max_retry_attempts
> > | 50
> > | $ cat
> > | /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
> > | 1
> > 
> > I've rebased this patchset, and I'm cleaning it up now, so in a few
> > days I'll have something for review, likely for the 4.3 merge
> > window....
> 
> Just thinking a bit more on how to make this simpler to configure,
> is there a simple way for the filesystem to determine the current
> config of the dm thinp volume?

You can just stop retrying the I/Os when the user attempts to unmount the 
filesystem - then, you don't need any configuration option.

Mikulas

> i.e. if the dm-thinp volume is
> configured to error out immediately on enospc, then XFS should
> default to doing the same thing. having XFS be able to grab this
> status at mount time and change the default ENOSPC error config from
> transient to permanent on such dm-thinp volumes would go a long way
> to making these configs Just Do The Right Thing on block dev enospc
> errors...
> 
> e.g. if dm-thinp is configured to queue for 60s and then fail on
> ENOSPC, we want XFS to fail immediately on ENOSPC in metadata IO. If
> dm-thinp is configured to ENOSPC instantly (i.e. no queueing) then
> we want XFS to retry and use it's default retry maximums before
> failing permanently.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 
Dave Chinner July 23, 2015, 11 p.m. UTC | #17
On Thu, Jul 23, 2015 at 12:43:58PM -0400, Vivek Goyal wrote:
> On Thu, Jul 23, 2015 at 03:10:43PM +1000, Dave Chinner wrote:
> 
> [..]
> > I don't think knowing the bdev timeout is necessary because the
> > default is most likely to be "fail fast" in this case. i.e. no
> > retries, just shut down.  IOWs, if we describe the configs and
> > actions in neutral terms, then the default configurations easy for
> > users to understand. i.e:
> > 
> > bdev enospc		XFS default
> > -----------		-----------
> > Fail slow		Fail fast
> > Fail fast		Fail slow
> > Fail never		Fail never, Record in log
> > EOPNOTSUPP		Fail never
> > 
> > With that in mind, I'm thinking I should drop the
> > "permanent/transient" error classifications, and change it "failure
> > behaviour" with the options "fast slow [never]" and only the slow
> > option has retry/timeout configuration options.  I think the "never"
> > option still needs to "fail at unmount" config variable, but we
> > enable it by default rather than hanging unmount and requiring a
> > manual shutdown like we do now....
> 
> I am wondering instead of 4 knobs (fast,slow,never,retry-timeout) can
> we just do with one knob per error type and that is retry-timout.

"retry-timeout" == "fail slow". i.e. a 5 minute retry timeout is
configured as:

# echo slow > fail_method
# echo 0 > max_retries
# echo 300 > retry_timeout

> retry-timeout=0 (Fail fast)
> retry-timeout=X (Fail slow)
> retry-timeout=-1 (Never Give up).

What do we do when we want to add a different failure type
with different configuration requirements?

> Also do we really need this timeout per error type.

I don't follow your logic here.  What do we need a timeout for with
either the "never" or "fast" failure configurations?

> Also would be nice if this timeout was configurable using a mount
> option. Then we can just specify it during mount time and be done
> with it.

That way lies madness.  The error configuration infrastructure we
need is not just for ENOSPC errors on metadata buffers.  We need
configurable error behaviour for multiple different errors in
multiple different subsystems (e.g. data IO failure vs metadata
buffer IO failure vs memory allocation failure vs inode corruption
vs freespace corruption vs ....).

And we still would need the sysfs interface for querying and
configuring at runtime, so mount options are just a bad idea.  And
with sysfs, the potential future route for automatic configuration
at mount time is via udev events and configuration files, similar to
block devices.

> Idea of auto tuning based on what block device is doing sounds reasonable
> but that should not be a requirement for this patch and can go in even
> later. It is one of those nice to have features.

"this patch"? Just the core infrastructure so far:

11 files changed, 290 insertions(+), 60 deletions(-)

and that will need to be split into 4-5 patches for review. There's
a bunch of cleanup that precedes this, and then there's a patch per
error type we are going to handle in metadata buffer IO completion.
IOWs, the dm-thinp autotuning is just a simple, small patch at the
end of a much larger series - it's maybe 10 lines of code in XFS...

Cheers,

Dave.
Dave Chinner July 23, 2015, 11:05 p.m. UTC | #18
On Thu, Jul 23, 2015 at 01:08:36PM -0400, Mikulas Patocka wrote:
> On Wed, 22 Jul 2015, Dave Chinner wrote:
> > On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote:
> > > On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
> > > | $ cat
> > > | /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
> > > | 1

[...]

> You can just stop retrying the I/Os when the user attempts to unmount the 
> filesystem - then, you don't need any configuration option.

See above - the default will do that, but there are users who do not
want that unmount behaviour....

-Dave.
Vivek Goyal July 24, 2015, 2:34 a.m. UTC | #19
On Fri, Jul 24, 2015 at 09:00:54AM +1000, Dave Chinner wrote:
> On Thu, Jul 23, 2015 at 12:43:58PM -0400, Vivek Goyal wrote:
> > On Thu, Jul 23, 2015 at 03:10:43PM +1000, Dave Chinner wrote:
> > 
> > [..]
> > > I don't think knowing the bdev timeout is necessary because the
> > > default is most likely to be "fail fast" in this case. i.e. no
> > > retries, just shut down.  IOWs, if we describe the configs and
> > > actions in neutral terms, then the default configurations easy for
> > > users to understand. i.e:
> > > 
> > > bdev enospc		XFS default
> > > -----------		-----------
> > > Fail slow		Fail fast
> > > Fail fast		Fail slow
> > > Fail never		Fail never, Record in log
> > > EOPNOTSUPP		Fail never
> > > 
> > > With that in mind, I'm thinking I should drop the
> > > "permanent/transient" error classifications, and change it "failure
> > > behaviour" with the options "fast slow [never]" and only the slow
> > > option has retry/timeout configuration options.  I think the "never"
> > > option still needs to "fail at unmount" config variable, but we
> > > enable it by default rather than hanging unmount and requiring a
> > > manual shutdown like we do now....
> > 
> > I am wondering instead of 4 knobs (fast,slow,never,retry-timeout) can
> > we just do with one knob per error type and that is retry-timout.
> 
> "retry-timeout" == "fail slow". i.e. a 5 minute retry timeout is
> configured as:
> 
> # echo slow > fail_method
> # echo 0 > max_retries
> # echo 300 > retry_timeout

Hi Dave,

I am sure I am missing something, but I will ask anyway: why do we need
the "fail_method" knob?  Isn't it sort of implied by the other two knobs
based on their values?

max_retries=0 retry_timeout=0 implies fail_method=fast.

A non-zero value of max_retries or retry_timeout implies fail_method=slow

A very high value (-1) of either max_retries or retry_timeout implies
fail_method="almost never".

> > retry-timeout=0 (Fail fast)
> > retry-timeout=X (Fail slow)
> > retry-timeout=-1 (Never Give up).
> 
> What do we do when we want to add a different failure type
> with different configuration requirements?

Ok, got it. So we are targeting something very generic so that other
cases can be handled too.

> 
> > Also, do we really need this timeout per error type?
> 
> I don't follow your logic here.  What do we need a timeout for with
> either the "never" or "fast" failure configurations?

Ignore this. I had misunderstood it.

> 
> > Also, it would be nice if this timeout were configurable using a mount
> > option. Then we could just specify it at mount time and be done
> > with it.
> 
> That way lies madness.  The error configuration infrastructure we
> need is not just for ENOSPC errors on metadata buffers.  We need
> configurable error behaviour for multiple different errors in
> multiple different subsystems (e.g. data IO failure vs metadata
> buffer IO failure vs memory allocation failure vs inode corruption
> vs freespace corruption vs ....).
> 
> And we still would need the sysfs interface for querying and
> configuring at runtime, so mount options are just a bad idea.  And
> with sysfs, the potential future route for automatic configuration
> at mount time is via udev events and configuration files, similar to
> block devices.

Agreed that sysfs provides lots of flexibility here. I guess I was
just thinking in terms of solving the particular issue we are facing.

> 
> > Idea of auto tuning based on what block device is doing sounds reasonable
> > but that should not be a requirement for this patch and can go in even
> > later. It is one of those nice to have features.
> 
> "this patch"? Just the core infrastructure so far:

I was referring to Mike's patch where we add additional method to block
device operations.

> 
> 11 files changed, 290 insertions(+), 60 deletions(-)
> 
> and that will need to be split into 4-5 patches for review. There's
> a bunch of cleanup that precedes this, and then there's a patch per
> error type we are going to handle in metadata buffer IO completion.
> IOWs, the dm-thinp autotuning is just a simple, small patch at the
> end of a much larger series - it's maybe 10 lines of code in XFS...

Ok. I will wait for the final patches. 

Thanks
Vivek

Patch

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 1c50c58..55ee3cf 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -3948,7 +3948,7 @@  static struct target_type pool_target = {
 	.name = "thin-pool",
 	.features = DM_TARGET_SINGLETON | DM_TARGET_ALWAYS_WRITEABLE |
 		    DM_TARGET_IMMUTABLE,
-	.version = {1, 16, 0},
+	.version = {1, 17, 0},
 	.module = THIS_MODULE,
 	.ctr = pool_ctr,
 	.dtr = pool_dtr,
@@ -4333,9 +4333,31 @@  static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
 	limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
 }
 
+static bool thin_has_space(struct dm_target *ti)
+{
+	struct thin_c *tc = ti->private;
+	struct pool *pool = tc->pool;
+	enum pool_mode m = get_pool_mode(pool);
+
+	/*
+	 * The thin-pool has space if it is either in write mode _or_
+	 * it is still waiting for space to be added.
+	 *
+	 * If 'error_if_no_space' was configured the pool will not queue
+	 * IO at all, even though the pool will stay in OODS mode, so
+	 * there is no point having upper layers (e.g. XFS) retry IO
+	 * given 'error_if_no_space' is meant to _not_ queue IO.
+	 */
+	if (m == PM_WRITE ||
+	    (m == PM_OUT_OF_DATA_SPACE && !pool->pf.error_if_no_space))
+		return true;
+
+	return false;
+}
+
 static struct target_type thin_target = {
 	.name = "thin",
-	.version = {1, 16, 0},
+	.version = {1, 17, 0},
 	.module	= THIS_MODULE,
 	.ctr = thin_ctr,
 	.dtr = thin_dtr,
@@ -4348,6 +4370,7 @@  static struct target_type thin_target = {
 	.merge = thin_merge,
 	.iterate_devices = thin_iterate_devices,
 	.io_hints = thin_io_hints,
+	.has_space = thin_has_space,
 };
 
 /*----------------------------------------------------------------*/
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ab37ae1..14bf9df 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -597,6 +597,38 @@  out:
 	return r;
 }
 
+static bool dm_blk_has_space(struct block_device *bdev)
+{
+	struct mapped_device *md = bdev->bd_disk->private_data;
+	int srcu_idx;
+	struct dm_table *map;
+	struct dm_target *tgt;
+	bool r = true;
+
+	map = dm_get_live_table(md, &srcu_idx);
+
+	if (!map || !dm_table_get_size(map))
+		goto out;
+
+	/* We only support devices that have a single target */
+	if (dm_table_get_num_targets(map) != 1)
+		goto out;
+
+	tgt = dm_table_get_target(map, 0);
+	if (!tgt->type->has_space)
+		goto out;
+
+	if (dm_suspended_md(md))
+		goto out;
+
+	r = tgt->type->has_space(tgt);
+
+out:
+	dm_put_live_table(md, srcu_idx);
+
+	return r;
+}
+
 static struct dm_io *alloc_io(struct mapped_device *md)
 {
 	return mempool_alloc(md->io_pool, GFP_NOIO);
@@ -3647,6 +3679,7 @@  static const struct block_device_operations dm_blk_dops = {
 	.release = dm_blk_close,
 	.ioctl = dm_blk_ioctl,
 	.getgeo = dm_blk_getgeo,
+	.has_space = dm_blk_has_space,
 	.owner = THIS_MODULE
 };
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1982437..5034361 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -469,6 +469,16 @@  long bdev_direct_access(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL_GPL(bdev_direct_access);
 
+bool bdev_has_space(struct block_device *bdev)
+{
+	const struct block_device_operations *ops = bdev->bd_disk->fops;
+
+	if (!ops->has_space)
+		return true;
+	return ops->has_space(bdev);
+}
+EXPORT_SYMBOL_GPL(bdev_has_space);
+
 /*
  * pseudo-fs
  */
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 092d652..98efbca 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1059,6 +1059,9 @@  xfs_buf_iodone_callbacks(
 	if (likely(!bp->b_error))
 		goto do_callbacks;
 
+	if (!bdev_has_space(bp->b_target->bt_bdev))
+		xfs_force_shutdown(mp, SHUTDOWN_REMOTE_REQ);
+
 	/*
 	 * If we've already decided to shutdown the filesystem because of
 	 * I/O errors, there's no point in giving this a retry.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d4068c17d..1ba66bc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1566,6 +1566,7 @@  struct block_device_operations {
 	int (*getgeo)(struct block_device *, struct hd_geometry *);
 	/* this callback is with swap_lock and sometimes page table lock held */
 	void (*swap_slot_free_notify) (struct block_device *, unsigned long);
+	bool (*has_space) (struct block_device *);
 	struct module *owner;
 };
 
@@ -1576,6 +1577,8 @@  extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, sector_t, void **addr,
 						unsigned long *pfn, long size);
+extern bool bdev_has_space(struct block_device *);
+
 #else /* CONFIG_BLOCK */
 
 struct block_device;
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 51cc1de..c55053e 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -119,6 +119,13 @@  typedef void (*dm_io_hints_fn) (struct dm_target *ti,
  */
 typedef int (*dm_busy_fn) (struct dm_target *ti);
 
+/*
+ * Returns:
+ *    false: The target has no space that may be allocated.
+ *    true:  The target has space that may be allocated.
+ */
+typedef bool (*dm_has_space_fn) (struct dm_target *ti);
+
 void dm_error(const char *message);
 
 struct dm_dev {
@@ -164,6 +171,7 @@  struct target_type {
 	dm_busy_fn busy;
 	dm_iterate_devices_fn iterate_devices;
 	dm_io_hints_fn io_hints;
+	dm_has_space_fn has_space;
 
 	/* For internal device-mapper use. */
 	struct list_head list;