From patchwork Wed Sep 14 01:23:34 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 9330189 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 69D2A60231 for ; Wed, 14 Sep 2016 01:24:36 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4A57C29720 for ; Wed, 14 Sep 2016 01:24:36 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 3D8602980F; Wed, 14 Sep 2016 01:24:36 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.7 required=2.0 tests=BAYES_00, RCVD_IN_DNSWL_MED, RCVD_IN_SORBS_SPAM autolearn=ham version=3.3.1 Received: from oss.sgi.com (oss.sgi.com [192.48.182.195]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id E313E29720 for ; Wed, 14 Sep 2016 01:24:34 +0000 (UTC) Received: from oss.sgi.com (localhost [IPv6:::1]) by oss.sgi.com (Postfix) with ESMTP id 59E4D7CA3; Tue, 13 Sep 2016 20:24:33 -0500 (CDT) X-Original-To: xfs@oss.sgi.com Delivered-To: xfs@oss.sgi.com Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id 3F0777CA2 for ; Tue, 13 Sep 2016 20:24:30 -0500 (CDT) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay3.corp.sgi.com (Postfix) with ESMTP id 8ABC8AC002 for ; Tue, 13 Sep 2016 18:24:26 -0700 (PDT) X-ASG-Debug-ID: 1473816261-0bf57b0f52245890001-NocioJ Received: from ipmail05.adl6.internode.on.net (ipmail05.adl6.internode.on.net [150.101.137.143]) by cuda.sgi.com with ESMTP id ZSiyDTCyZN7MmOKx for ; Tue, 13 Sep 2016 18:24:22 -0700 (PDT) X-Barracuda-Envelope-From: david@fromorbit.com X-Barracuda-Effective-Source-IP: ipmail05.adl6.internode.on.net[150.101.137.143] X-Barracuda-Apparent-Source-IP: 150.101.137.143 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2BTBACVpdhXPAI1LHldGwEBAQMBAQEJAQEBgzkBAQEBAR6BBE+Gc4ZAlXEBAQEBAQEGjHmGGoIPggOGGAICAQECgUw5FAECAQEBAQEBAQYBAQEBIRoRL4RhAQEBBDocIxAIAw4HAwklDwUlAwcaE4hJvCcBAQEHAgEkHoVKhReJfx0FjiyLOoknhiGBeI10hnmFX4N7HoUmKjSHTQEBAQ Received: from ppp121-44-53-2.lns20.syd4.internode.on.net (HELO dastard) ([121.44.53.2]) by ipmail05.adl6.internode.on.net with ESMTP; 14 Sep 2016 10:53:35 +0930 Received: from dave by dastard with local (Exim 4.80) (envelope-from ) id 1bjyvG-0005JI-UL; Wed, 14 Sep 2016 11:23:34 +1000 Date: Wed, 14 Sep 2016 11:23:34 +1000 From: Dave Chinner To: Carlos Maiolino Subject: Re: [PATCH V4] xfs: Document error handlers behavior Message-ID: <20160914012334.GK30497@dastard> X-ASG-Orig-Subj: Re: [PATCH V4] xfs: Document error handlers behavior References: <1473757385-81633-1-git-send-email-cmaiolino@redhat.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <1473757385-81633-1-git-send-email-cmaiolino@redhat.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-Barracuda-Connect: ipmail05.adl6.internode.on.net[150.101.137.143] X-Barracuda-Start-Time: 1473816261 X-Barracuda-URL: https://192.48.176.25:443/cgi-mod/mark.cgi X-Barracuda-Scan-Msg-Size: 7456 X-Virus-Scanned: by bsmtpd at sgi.com X-Barracuda-BRTS-Status: 1 X-Barracuda-Spam-Score: 0.00 X-Barracuda-Spam-Status: No, SCORE=0.00 using per-user scores of TAG_LEVEL=1000.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.7 tests=BSF_SC0_MISMATCH_TO X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.3.32877 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.00 BSF_SC0_MISMATCH_TO Envelope rcpt doesn't match header Cc: linux-xfs@vger.kernel.org, xfs@oss.sgi.com X-BeenThere: xfs@oss.sgi.com X-Mailman-Version: 2.1.14 Precedence: list List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com X-Virus-Scanned: ClamAV using ClamSMTP On Tue, Sep 13, 2016 at 05:03:05AM -0400, Carlos Maiolino wrote: > Document the implementation of error handlers into sysfs. > > > Signed-off-by: Carlos Maiolino > --- > Changelog: > > V2: > - Add a description of the precedence order of each option, focusing on > the behavior of "fail_at_unmount" which was not well explained in V1 > > V3: > - Fix English spelling mistakes suggested by Eric > > V4: > - Typo mistakes, document ENODEV default value for max_retries, fix > directories's hierarchy description Ok, I had to update this for the change in retry timeout values from Eric, so I went and fixed all the other things I thought needed fixing, too. New patch below.... Dave. diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt index 8146e9f..705d064 100644 --- a/Documentation/filesystems/xfs.txt +++ b/Documentation/filesystems/xfs.txt @@ -348,3 +348,128 @@ Removed Sysctls ---- ------- fs.xfs.xfsbufd_centisec v4.0 fs.xfs.age_buffer_centisecs v4.0 + + +Error handling +============== + +XFS can act differently according to the type of error found during its +operation. The implementation introduces the following concepts to the error +handler: + + -failure speed: + Defines how fast XFS should propagate an error upwards when a specific + error is found during the filesystem operation. It can propagate + immediately, after a defined number of retries, after a set time period, + or simply retry forever. + + -error classes: + Specifies the subsystem the error configuration will apply to, such as + metadata IO or memory allocation. Different subsystems will have + different error handlers for which behaviour can be configured. + + -error handlers: + Defines the behavior for a specific error. + +The filesystem behavior during an error can be set via sysfs files, Each +error handler works independently, the first condition met by and error handler +for a specific class will cause the error to be propagated rather than reset and +retried. + +The action taken by the filesystem when the error is propagated is context +dependent - it may cause a shut down in the case of an unrecoverable error, +it may be reported back to userspace, or it may even be ignored because +there's nothing useful we can with the error or anyone we can report it to (e.g. +during unmount). + +The configuration files are organized into the following per-mounted filesystem +hierarchy: + + /sys/fs/xfs//error/// + +Where: + + The short device name of the mounted filesystem. This is the same device + name that shows up in XFS kernel error messages as "XFS(): ..." + + + The subsystem the error configuration belongs to. As of 4.9, the defined + classes are: + + - "metadata": applies metadata buffer write IO + + + The individual error handler configurations. + + +Each filesystem has "global" error configuration options defined in their top +level directory: + + /sys/fs/xfs//error/ + + fail_at_unmount (Min: 0 Default: 1 Max: 1) + Defines the filesystem error behavior at unmount time. + + If set to a value of 1, XFS will override all other error configurations + during unmount and replace them with "immediate fail" characteristics. + i.e. no retries, no retry timeout. This will always allow unmount to + succeed when there are persistent errors present. + + If set to 0, the configured retry behaviour will continue until all + retries and/or timeouts have been exhausted. This will delay unmount + completion when there are persistent errors, and it may prevent the + filesystem from ever unmounting fully in the case of "retry forever" + handler configurations. + + Note: there is no guarantee that fail_at_unmount can be set whilst an + unmount is in progress. It is possible that the sysfs entries are + removed by the unmounting filesystem before a "retry forever" error + handler configuration causes unmount to hang, and hence the filesystem + must be configured appropriately before unmount begins to prevent + unmount hangs. + +Each filesystem has specific error class handlers that define the error +propagation behaviour for specific errors. There is also a "default" error +handler defined, which defines the behaviour for all errors that don't have +specific handlers defined. The handler configurations are found in the +directory: + + /sys/fs/xfs//error/// + + max_retries (Min: -1 Default: Varies Max: INTMAX) + Defines the allowed number of retries of a specific error before + the filesystem will propagate the error. The retry count for a given + error context (e.g. a specific metadata buffer) is reset ever time there + is a successful completion of the operation. + + Setting the value to "-1" will cause XFS to retry forever for this + specific error. + + Setting the value to "0" will cause XFS to fail immediately when the + specific error is reported. + + Setting the value to "N" (where 0 < N < Max) will make XFS retry the + operation "N" times before propagating the error. + + retry_timeout_seconds (Min: -1 Default: Varies Max: 1 day) + Define the amount of time (in seconds) that the filesystem is + allowed to retry its operations when the specific error is + found. + + Setting the value to "-1" will set an infinite timeout, causing + error propagation behaviour to be determined solely by the "max_retries" + parameter. + + Setting the value to "0" will cause XFS to fail immediately when the + specific error is reported. + + Setting the value to "N" (where 0 < N < Max) will propagate the error + on the first retry that fails at least "N" seconds after the first error + was detected, unless the number of retries defined by max_retries + expires first. + +Note: The default behaviour for a specific error handler is dependent on both +the class and error context. For example, the default values for +"metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults +to "fail immediately" behaviour. This is done because ENODEV is a fatal, +unrecoverable error no matter how many times the metadata IO is retried.