xfs: Document error handlers behavior

Message ID	1473326635-30209-1-git-send-email-cmaiolino@redhat.com (mailing list archive)
State	Superseded, archived

From: Carlos Maiolino <cmaiolino@redhat.com>
To: linux-xfs@vger.kernel.org, xfs@oss.sgi.com
Subject: [PATCH] xfs: Document error handlers behavior
Date: Thu,  8 Sep 2016 05:23:55 -0400
Message-Id: <1473326635-30209-1-git-send-email-cmaiolino@redhat.com>
Precedence: list
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: xfs-bounces@oss.sgi.com
Sender: xfs-bounces@oss.sgi.com

Commit Message

Carlos Maiolino Sept. 8, 2016, 9:23 a.m. UTC

Document the implementation of error handlers into sysfs.

Changelog:

V2:
	- Add a description of the precedence order of each option, focusing on
	  the behavior of "fail_at_unmount" which was not well explained in V1

V3:
	- Fix English spelling mistakes suggested by Eric

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---
 Documentation/filesystems/xfs.txt | 70 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

Comments

Eric Sandeen Sept. 8, 2016, 2:29 p.m. UTC | #1

On 9/8/16 4:23 AM, Carlos Maiolino wrote:
> Document the implementation of error handlers into sysfs.
> 
> Changelog:
> 
> V2:
> 	- Add a description of the precedence order of each option, focusing on
> 	  the behavior of "fail_at_unmount" which was not well explained in V1
> 
> V3:
> 	- Fix English spelling mistakes suggested by Eric

Please put the patch version changelog after the "---" so it doesn't become
part of the permanent commit log; it's for current patch reviewers, not for
future code archaeologists.

> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> ---
>  Documentation/filesystems/xfs.txt | 70 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 70 insertions(+)
> 
> diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
> index 8146e9f..8b6c861 100644
> --- a/Documentation/filesystems/xfs.txt
> +++ b/Documentation/filesystems/xfs.txt
> @@ -348,3 +348,73 @@ Removed Sysctls
>    ----				-------
>    fs.xfs.xfsbufd_centisec	v4.0
>    fs.xfs.age_buffer_centisecs	v4.0
> +
> +Error handling
> +==============
> +
> +XFS can act differently according to the type of error found
> +during its operation. The implementation introduces the following
> +concepts to the error handler:
> +
> + -failure speed:
> +	Defines how fast XFS should shut down when of a specific error is found

when a specific error is found

> +	during the filesystem operation. It can shut down immediately, after a
> +	defined number of retries, after a set time period, or simply retry
> +	forever. The old "retry forever" behavior is still the default, except
> +	during unmount, where any IOs retrying due to errors will be cancelled
> +	and unmount will be allowed to proceed.
> +
> + -error classes:
> +	Specifies the subsystem/location where the error handlers, such as

location of the error handlers

> +	metadata or memory allocation. Only metadata IO errors are handled
> +	at this time.
> +
> + -error handlers:
> +	Defines the behavior for a specific error.
> +
> +The filesystem behavior during an error can be set via sysfs files, where the
> +errors are organized with the structure below. Each configuration option works
> +independently, the first condition met for a specific configuration will cause
> +the filesystem to shut down:
> +
> +  /sys/fs/xfs/<dev>/error/<class>/<error>/

The above line kind of hangs there oddly, because the first thing you do below
is describe a file which isn't in the above hierarchy.

Maybe we should show something like:

+  /sys/fs/xfs/<dev>/error/fail_at_unmount
+  /sys/fs/xfs/<dev>/error/<class>/<error>/<configuration>

to show everything that might be under it?  Not sure if that's better.
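
For illustration only (not part of the patch): a minimal user-space C
sketch of how the suggested layout composes into a concrete file name.
The helper names and the "sda1" device in the usage note are hypothetical;
"metadata", "default" and "max_retries" are taken from the documentation
under review.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* A path component must be non-empty and must not contain '/'. */
static int component_ok(const char *s)
{
    return s != NULL && s[0] != '\0' && strchr(s, '/') == NULL;
}

/*
 * Build /sys/fs/xfs/<dev>/error/<class>/<error>/<configuration> into buf.
 * Returns 0 on success, -1 on a bad component or if buf is too small.
 * Hypothetical helper; not a kernel or xfsprogs interface.
 */
static int xfs_error_cfg_path(char *buf, size_t len, const char *dev,
                              const char *class_name, const char *error,
                              const char *cfg)
{
    int n;

    assert(buf != NULL && len > 0);
    if (!component_ok(dev) || !component_ok(class_name) ||
        !component_ok(error) || !component_ok(cfg))
        return -1;

    n = snprintf(buf, len, "/sys/fs/xfs/%s/error/%s/%s/%s",
                 dev, class_name, error, cfg);
    /* snprintf reports the length it wanted; treat truncation as failure. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Calling it with ("sda1", "metadata", "default", "max_retries") fills buf
with the per-error file, while fail_at_unmount sits directly under
/sys/fs/xfs/<dev>/error/ and needs no such helper.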

> +
> +Each directory contains:
> +
> + /sys/fs/xfs/<dev>/error/
> +
> +	fail_at_unmount		(Min:  0  Default:  1  Max: 1)
> +		Defines the global error behavior at unmount time. If set to the
> +		default value of 1, XFS will cancel any pending IO retries, shut
> +		down, and unmount. If set to 0, pending IO retries may prevent
> +		the filesystem from unmounting.
> +
> +	<class> subdirectories
> +		Contains specific error handlers configuration
> +		(Ex: /sys/fs/xfs/<dev>/error/metadata, see below).
> +
> + /sys/fs/xfs/<dev>/error/<class>/
> +
> +	Directory containing configuration for a specific error <class>;
> +	currently only the "metadata" <class> is implemented.
> +	The contents of this directory are <class> specific, since each <class>
> +	might need to handle different types of errors.
> +
> + /sys/fs/xfs/<dev>/error/<class>/<error>/
> +
> +	Contains the failure speed configuration files for specific errors in
> +	this <class, as well as a "default" behavior. Each <error> directory

<class>

> +	contains the following configuration files:
> +
> +	max_retries			(Min: -1  Default: -1  Max: INTMAX)
> +		Defines the allowed number of retries of a specific error before
> +		the filesystem will shut down.  The default value of "-1" will
> +		cause XFS to retry forever for this specific error.  Setting it
> +		to "0" will cause XFS to fail immediately when the specific
> +		error is found, and setting it to "N," where N is greater than 0,
> +		will make XFS retry "N" times before shutting down.
> +
> +	retry_timeout_seconds		(Min:  0  Default:  0  Max: INTMAX)
> +		Define the amount of time (in seconds) that the filesystem is
> +		allowed to retry its operations when the specific error is
> +		found. The default value of "0" will cause XFS to retry forever.

The default for ENODEV is different ... tricky to document that.  Good luck.  ;)

The maximum for retry_timeout_seconds is 86400 (1 day), not INTMAX:

retry_timeout_seconds_store()
{
...
        /* 1 day timeout maximum */
        if (val < 0 || val > 86400)
                return -EINVAL;
...
}

The default of -1 vs. 0 might change with the other patch I sent, but we can
fix this up if it's accepted.
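
As a rough illustration of the limits discussed here, the range checks can
be written as a small user-space C sketch. The function and macro names
are made up for this example; only the 0..86400 check comes from the
quoted kernel snippet, and the -1 lower bound for max_retries is the value
the patch documents (which may change, as noted above).

```c
#include <assert.h>
#include <errno.h>

#define RETRY_TIMEOUT_MAX_SECONDS 86400 /* 1 day, per the quoted check */

/* Mirrors the quoted range check: 0 if accepted, -EINVAL otherwise. */
static int check_retry_timeout_seconds(int val)
{
    if (val < 0 || val > RETRY_TIMEOUT_MAX_SECONDS)
        return -EINVAL;
    return 0;
}

/*
 * max_retries as documented in this patch: -1 (retry forever) is the
 * minimum; anything below that is rejected. Assumption: no upper bound
 * other than the range of int.
 */
static int check_max_retries(int val)
{
    if (val < -1)
        return -EINVAL;
    return 0;
}
```

With these checks, 86400 is the largest timeout accepted and 86401 is
rejected, which is why "Max: INTMAX" is wrong for retry_timeout_seconds.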

-Eric

Carlos Maiolino Sept. 13, 2016, 8:59 a.m. UTC | #2

On Thu, Sep 08, 2016 at 09:29:18AM -0500, Eric Sandeen wrote:
> On 9/8/16 4:23 AM, Carlos Maiolino wrote:
> > Document the implementation of error handlers into sysfs.
> > 
> > Changelog:
> > 
> > V2:
> > 	- Add a description of the precedence order of each option, focusing on
> > 	  the behavior of "fail_at_unmount" which was not well explained in V1
> > 
> > V3:
> > 	- Fix English spelling mistakes suggested by Eric
> 
> Please put the patch version changelog after the "---" so it doesn't become
> part of the permanent commit log; it's for current patch reviewers, not for
> future code archaeologists.

Thanks, I'll make sure to do this in future patches too.

> > +
> > +The filesystem behavior during an error can be set via sysfs files, where the
> > +errors are organized with the structure below. Each configuration option works
> > +independently, the first condition met for a specific configuration will cause
> > +the filesystem to shut down:
> > +
> > +  /sys/fs/xfs/<dev>/error/<class>/<error>/
> 
> The above line kind of hangs there oddly, because the first thing you do below
> is describe a file which isn't in the above hierarchy.
> 
> Maybe we should show something like:
> 
> +  /sys/fs/xfs/<dev>/error/fail_at_unmount
> +  /sys/fs/xfs/<dev>/error/<class>/<error>/<configuration>
> 
> to show everything that might be under it?  Not sure if that's better.
> 
> > +
> > +Each directory contains:
> > +
> > + /sys/fs/xfs/<dev>/error/
> > +
> > +	fail_at_unmount		(Min:  0  Default:  1  Max: 1)
> > +		Defines the global error behavior at unmount time. If set to the
> > +		default value of 1, XFS will cancel any pending IO retries, shut
> > +		down, and unmount. If set to 0, pending IO retries may prevent
> > +		the filesystem from unmounting.
> > +
> > +	<class> subdirectories
> > +		Contains specific error handlers configuration
> > +		(Ex: /sys/fs/xfs/<dev>/error/metadata, see below).
> > +
> > + /sys/fs/xfs/<dev>/error/<class>/
> > +
> > +	Directory containing configuration for a specific error <class>;
> > +	currently only the "metadata" <class> is implemented.
> > +	The contents of this directory are <class> specific, since each <class>
> > +	might need to handle different types of errors.
> > +
> > + /sys/fs/xfs/<dev>/error/<class>/<error>/
> > +
> 
> The default for ENODEV is different ... tricky to document that.  Good luck.  ;)
> 
> The maximum for retry_timeout_seconds is 86400 (1 day), not INTMAX:
> 
> retry_timeout_seconds_store()
> {
> ...
>         /* 1 day timeout maximum */
>         if (val < 0 || val > 86400)
>                 return -EINVAL;
> ...

Fixing it, thanks for catching it, copy/paste sux :)

> }
> 
> The default of -1 vs. 0 might change with the other patch I sent, but we can
> fix this up if it's accepted.
> 

Ok.

Thanks for the review, I'll submit a new version in a few

diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
index 8146e9f..8b6c861 100644
--- a/Documentation/filesystems/xfs.txt
+++ b/Documentation/filesystems/xfs.txt
@@ -348,3 +348,73 @@  Removed Sysctls
   ----				-------
   fs.xfs.xfsbufd_centisec	v4.0
   fs.xfs.age_buffer_centisecs	v4.0
+
+Error handling
+==============
+
+XFS can act differently according to the type of error found
+during its operation. The implementation introduces the following
+concepts to the error handler:
+
+ -failure speed:
+	Defines how fast XFS should shut down when of a specific error is found
+	during the filesystem operation. It can shut down immediately, after a
+	defined number of retries, after a set time period, or simply retry
+	forever. The old "retry forever" behavior is still the default, except
+	during unmount, where any IOs retrying due to errors will be cancelled
+	and unmount will be allowed to proceed.
+
+ -error classes:
+	Specifies the subsystem/location where the error handlers, such as
+	metadata or memory allocation. Only metadata IO errors are handled
+	at this time.
+
+ -error handlers:
+	Defines the behavior for a specific error.
+
+The filesystem behavior during an error can be set via sysfs files, where the
+errors are organized with the structure below. Each configuration option works
+independently, the first condition met for a specific configuration will cause
+the filesystem to shut down:
+
+  /sys/fs/xfs/<dev>/error/<class>/<error>/
+
+Each directory contains:
+
+ /sys/fs/xfs/<dev>/error/
+
+	fail_at_unmount		(Min:  0  Default:  1  Max: 1)
+		Defines the global error behavior at unmount time. If set to the
+		default value of 1, XFS will cancel any pending IO retries, shut
+		down, and unmount. If set to 0, pending IO retries may prevent
+		the filesystem from unmounting.
+
+	<class> subdirectories
+		Contains specific error handlers configuration
+		(Ex: /sys/fs/xfs/<dev>/error/metadata, see below).
+
+ /sys/fs/xfs/<dev>/error/<class>/
+
+	Directory containing configuration for a specific error <class>;
+	currently only the "metadata" <class> is implemented.
+	The contents of this directory are <class> specific, since each <class>
+	might need to handle different types of errors.
+
+ /sys/fs/xfs/<dev>/error/<class>/<error>/
+
+	Contains the failure speed configuration files for specific errors in
+	this <class, as well as a "default" behavior. Each <error> directory
+	contains the following configuration files:
+
+	max_retries			(Min: -1  Default: -1  Max: INTMAX)
+		Defines the allowed number of retries of a specific error before
+		the filesystem will shut down.  The default value of "-1" will
+		cause XFS to retry forever for this specific error.  Setting it
+		to "0" will cause XFS to fail immediately when the specific
+		error is found, and setting it to "N," where N is greater than 0,
+		will make XFS retry "N" times before shutting down.
+
+	retry_timeout_seconds		(Min:  0  Default:  0  Max: INTMAX)
+		Define the amount of time (in seconds) that the filesystem is
+		allowed to retry its operations when the specific error is
+		found. The default value of "0" will cause XFS to retry forever.
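
For illustration only (not part of the patch): the "first condition met"
rule described in the text above can be sketched in user-space C as
follows. The struct and function names are hypothetical, the semantics
follow this version of the documentation (max_retries -1 means retry
forever, retry_timeout_seconds 0 means no time limit), and treating
"elapsed equal to the timeout" as expired is an assumption of this sketch,
not something the patch states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-error settings as documented in this version of the patch. */
struct retry_cfg {
    int  max_retries;            /* -1: retry forever, 0: fail immediately */
    long retry_timeout_seconds;  /*  0: no time limit */
};

/*
 * Each option is evaluated independently; whichever limit is reached
 * first causes the shutdown.
 */
static bool should_shut_down(const struct retry_cfg *cfg,
                             int retries_done, long elapsed_seconds)
{
    assert(cfg != NULL && retries_done >= 0 && elapsed_seconds >= 0);

    /* Retry budget exhausted (a budget of 0 fails on the first error). */
    if (cfg->max_retries >= 0 && retries_done >= cfg->max_retries)
        return true;

    /* Time budget exhausted (only if a timeout is configured). */
    if (cfg->retry_timeout_seconds > 0 &&
        elapsed_seconds >= cfg->retry_timeout_seconds)
        return true;

    return false;
}
```

With both defaults (-1 and 0) the function never returns true, matching
the "retry forever" default; fail_at_unmount is a separate, global switch
and is deliberately left out of this sketch.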
