From patchwork Tue Jul 26 19:49:14 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. Wong" X-Patchwork-Id: 12929743 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 85F48C19F29 for ; Tue, 26 Jul 2022 19:49:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239817AbiGZTtR (ORCPT ); Tue, 26 Jul 2022 15:49:17 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37250 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239654AbiGZTtQ (ORCPT ); Tue, 26 Jul 2022 15:49:16 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0571E357C7; Tue, 26 Jul 2022 12:49:16 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 97FD161588; Tue, 26 Jul 2022 19:49:15 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 00281C433C1; Tue, 26 Jul 2022 19:49:14 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1658864955; bh=djyo2JEBh3td34QgGNsyWl5ZLJ5jgS6nCmNZbIAPiGk=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=r8SN2XvpB5+7IvCOuIqzW/NPQptgEgS+U4OQaMqHQVXJSia8gXvwgLqKy3tLwK71A 0C7ouvzUkFcr3sycz3TpyJhxL4vzabtD1Ih9f6Ra8um9QHWKmS7djvi9RVRhDMrnlv WqCdNJXw3gFPZOkLPjf7G9UafGpe4GJB/q3akvYZVRsIOgmw0taGWp/P/Nmw0fT4FL 7KGYB/5Xnk6G9fefq7Cf3MhoGSQdQYu6rTkN1ojRTbxvtchwiOjvRM1RzdwFDXqQ3i 1oZFXJ/gUydGEy0NmXBVCb1Pnyl/8Nm7qRku38XJpJcAW88U6OLOBcGW/Tz5ghALlW In47rTw1kqwwQ== Subject: [PATCH 1/3] common/xfs: fix 
_reset_xfs_sysfs_error_handling reset to actual defaults From: "Darrick J. Wong" To: djwong@kernel.org, guaneryu@gmail.com, zlang@redhat.com Cc: linux-xfs@vger.kernel.org, fstests@vger.kernel.org, guan@eryu.me Date: Tue, 26 Jul 2022 12:49:14 -0700 Message-ID: <165886495460.1585306.10074516195471640063.stgit@magnolia> In-Reply-To: <165886494905.1585306.15343417924888857310.stgit@magnolia> References: <165886494905.1585306.15343417924888857310.stgit@magnolia> User-Agent: StGit/0.19 MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: fstests@vger.kernel.org From: Darrick J. Wong There's a slight mistake in _reset_xfs_sysfs_error_handling: it sets retry_timeout_seconds to 0, which is not the current default (-1) in upstream Linux. Fix this. Signed-off-by: Darrick J. Wong Reviewed-by: Christoph Hellwig --- common/xfs | 2 +- tests/xfs/006.out | 6 +++--- tests/xfs/264.out | 12 ++++++------ 3 files changed, 10 insertions(+), 10 deletions(-) diff --git a/common/xfs b/common/xfs index 9f84dffb..ba72027c 100644 --- a/common/xfs +++ b/common/xfs @@ -781,7 +781,7 @@ _reset_xfs_sysfs_error_handling() _get_fs_sysfs_attr $dev error/metadata/${e}/max_retries _set_fs_sysfs_attr $dev \ - error/metadata/${e}/retry_timeout_seconds 0 + error/metadata/${e}/retry_timeout_seconds -1 echo -n "error/metadata/${e}/retry_timeout_seconds=" _get_fs_sysfs_attr $dev \ error/metadata/${e}/retry_timeout_seconds diff --git a/tests/xfs/006.out b/tests/xfs/006.out index 3260b3a2..433b0bc3 100644 --- a/tests/xfs/006.out +++ b/tests/xfs/006.out @@ -1,8 +1,8 @@ QA output created by 006 error/fail_at_unmount=1 error/metadata/default/max_retries=-1 -error/metadata/default/retry_timeout_seconds=0 +error/metadata/default/retry_timeout_seconds=-1 error/metadata/EIO/max_retries=-1 -error/metadata/EIO/retry_timeout_seconds=0 +error/metadata/EIO/retry_timeout_seconds=-1 error/metadata/ENOSPC/max_retries=-1 -error/metadata/ENOSPC/retry_timeout_seconds=0 +error/metadata/ENOSPC/retry_timeout_seconds=-1 diff 
--git a/tests/xfs/264.out b/tests/xfs/264.out index 502e72d3..e45ac5a5 100644 --- a/tests/xfs/264.out +++ b/tests/xfs/264.out @@ -2,20 +2,20 @@ QA output created by 264 === Test EIO/max_retries === error/fail_at_unmount=1 error/metadata/default/max_retries=-1 -error/metadata/default/retry_timeout_seconds=0 +error/metadata/default/retry_timeout_seconds=-1 error/metadata/EIO/max_retries=-1 -error/metadata/EIO/retry_timeout_seconds=0 +error/metadata/EIO/retry_timeout_seconds=-1 error/metadata/ENOSPC/max_retries=-1 -error/metadata/ENOSPC/retry_timeout_seconds=0 +error/metadata/ENOSPC/retry_timeout_seconds=-1 error/fail_at_unmount=0 error/metadata/EIO/max_retries=1 === Test EIO/retry_timeout_seconds === error/fail_at_unmount=1 error/metadata/default/max_retries=-1 -error/metadata/default/retry_timeout_seconds=0 +error/metadata/default/retry_timeout_seconds=-1 error/metadata/EIO/max_retries=-1 -error/metadata/EIO/retry_timeout_seconds=0 +error/metadata/EIO/retry_timeout_seconds=-1 error/metadata/ENOSPC/max_retries=-1 -error/metadata/ENOSPC/retry_timeout_seconds=0 +error/metadata/ENOSPC/retry_timeout_seconds=-1 error/fail_at_unmount=0 error/metadata/EIO/retry_timeout_seconds=1 From patchwork Tue Jul 26 19:49:20 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. 
Wong" X-Patchwork-Id: 12929744 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A5C5CC00140 for ; Tue, 26 Jul 2022 19:49:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239698AbiGZTtZ (ORCPT ); Tue, 26 Jul 2022 15:49:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37362 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239654AbiGZTtY (ORCPT ); Tue, 26 Jul 2022 15:49:24 -0400 Received: from ams.source.kernel.org (ams.source.kernel.org [145.40.68.75]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 35567357E5; Tue, 26 Jul 2022 12:49:23 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id D6DDEB80919; Tue, 26 Jul 2022 19:49:21 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 92318C433C1; Tue, 26 Jul 2022 19:49:20 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1658864960; bh=ehxNhR8SMV5Z67mOm4YqokKlpX6kINV6GrgKDo46cZ0=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=pCyivbnCLzKXaRUk+mpJd1atRc9of+6IBC04nWhQclwAQxKeioERQCe8xn8b6x6Hs PX7f8AXUV+4SN9wA+ImDhOPs717eEbNENxTI0bRcEdtCD5XgSeQqs+nhMyMBdDL0Ui QMDrFxu8r/fx8pl3ebPpq2S1Jtgu1ki4VXixlqZfkBYrJbmBsRK/PszPUjTzCNy/yJ gIPJem2GJgGCBt5ufDzGeVQvlTKcW/qwfk9t/qzBM6ZFnnES2urKh/F645k+vVz7mh zcOXrCUUCFJDHn83IZx+n57sRSmnpVNcQGAcGaZf6RASYn6iAFg16VDwKQZsOe72EO uL7eVtXGsfIsg== Subject: [PATCH 2/3] common: disable infinite IO error retry for EIO shutdown tests From: "Darrick J. 
Wong" To: djwong@kernel.org, guaneryu@gmail.com, zlang@redhat.com Cc: linux-xfs@vger.kernel.org, fstests@vger.kernel.org, guan@eryu.me Date: Tue, 26 Jul 2022 12:49:20 -0700 Message-ID: <165886496017.1585306.14180522898371330403.stgit@magnolia> In-Reply-To: <165886494905.1585306.15343417924888857310.stgit@magnolia> References: <165886494905.1585306.15343417924888857310.stgit@magnolia> User-Agent: StGit/0.19 MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: fstests@vger.kernel.org From: Darrick J. Wong This patch fixes a rather hard-to-hit livelock in the tests that test how xfs handles shutdown behavior when the device suddenly dies and starts returning EIO all the time. The livelock happens if the AIL is stuck retrying failed metadata updates forever, the log itself is not being written, and there is no more log grant space, which prevents the frontend from shutting down the log due to EIO errors during transactions. While most users probably want the default retry-forever behavior because EIO can be transient, the circumstances are different here. The tests are designed to flip the device back to working status only after the unmount succeeds, so we know there's no point in the filesystem retrying writes until after the unmount. This fixes some of the periodic hangs in generic/019 and generic/475. Signed-off-by: Darrick J. 
Wong --- common/dmerror | 4 ++++ common/fail_make_request | 1 + common/rc | 31 +++++++++++++++++++++++++++---- common/xfs | 29 +++++++++++++++++++++++++++++ 4 files changed, 61 insertions(+), 4 deletions(-) diff --git a/common/dmerror b/common/dmerror index 85ef9a16..ed5afaa4 100644 --- a/common/dmerror +++ b/common/dmerror @@ -138,6 +138,10 @@ _dmerror_load_error_table() suspend_opt="$*" fi + # If the full environment is set up, configure ourselves for shutdown + type _prepare_for_eio_shutdown &>/dev/null && \ + _prepare_for_eio_shutdown $DMERROR_DEV + # Suspend the scratch device before the log and realtime devices so # that the kernel can freeze and flush the filesystem if the caller # wanted a freeze. diff --git a/common/fail_make_request b/common/fail_make_request index 9f8ea500..b5370ba6 100644 --- a/common/fail_make_request +++ b/common/fail_make_request @@ -44,6 +44,7 @@ _start_fail_scratch_dev() { echo "Force SCRATCH_DEV device failure" + _prepare_for_eio_shutdown $SCRATCH_DEV _bdev_fail_make_request $SCRATCH_DEV 1 [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \ _bdev_fail_make_request $SCRATCH_LOGDEV 1 diff --git a/common/rc b/common/rc index 09c81be6..317049bc 100644 --- a/common/rc +++ b/common/rc @@ -4372,6 +4372,20 @@ _check_dmesg() fi } +# Make whatever configuration changes we need ahead of testing fs shutdowns due +# to unexpected IO errors while updating metadata. The sole parameter should +# be the fs device, e.g. $SCRATCH_DEV. +_prepare_for_eio_shutdown() +{ + local dev="$1" + + case "$FSTYP" in + "xfs") + _xfs_prepare_for_eio_shutdown "$dev" + ;; + esac +} + # capture the kmemleak report _capture_kmemleak() { @@ -4634,7 +4648,7 @@ run_fsx() # # Usage example: # _require_fs_sysfs error/fail_at_unmount -_require_fs_sysfs() +_has_fs_sysfs() { local attr=$1 local dname @@ -4650,9 +4664,18 @@ _require_fs_sysfs() _fail "Usage: _require_fs_sysfs " fi - if [ ! 
-e /sys/fs/${FSTYP}/${dname}/${attr} ];then - _notrun "This test requires /sys/fs/${FSTYP}/${dname}/${attr}" - fi + test -e /sys/fs/${FSTYP}/${dname}/${attr} +} + +# Require the existence of a sysfs entry at /sys/fs/$FSTYP/DEV/$ATTR +_require_fs_sysfs() +{ + _has_fs_sysfs "$@" && return + + local attr=$1 + local dname=$(_short_dev $TEST_DEV) + + _notrun "This test requires /sys/fs/${FSTYP}/${dname}/${attr}" } _require_statx() diff --git a/common/xfs b/common/xfs index ba72027c..a7bc661e 100644 --- a/common/xfs +++ b/common/xfs @@ -800,6 +800,35 @@ _scratch_xfs_unmount_dirty() _scratch_unmount } +# Prepare a mounted filesystem for an IO error shutdown test by disabling retry +# for metadata writes. This prevents a (rare) log livelock when: +# +# - The log has given out all available grant space, preventing any new +# writers from tripping over IO errors (and shutting down the fs/log), +# - All log buffers were written to disk, and +# - The log tail is pinned because the AIL keeps hitting EIO trying to write +# committed changes back into the filesystem. +# +# Real users might want the default behavior of the AIL retrying writes forever +# but for testing purposes we don't want to wait. +# +# The sole parameter should be the filesystem data device, e.g. $SCRATCH_DEV. +_xfs_prepare_for_eio_shutdown() +{ + local dev="$1" + local ctlfile="error/fail_at_unmount" + + # Don't retry any writes during the (presumably) post-shutdown unmount + _has_fs_sysfs "$ctlfile" && _set_fs_sysfs_attr $dev "$ctlfile" 1 + + # Disable retry of metadata writes that fail with EIO + for ctl in max_retries retry_timeout_seconds; do + ctlfile="error/metadata/EIO/$ctl" + + _has_fs_sysfs "$ctlfile" && _set_fs_sysfs_attr $dev "$ctlfile" 0 + done +} + # Skip if we are running an older binary without the stricter input checks. # Make multiple checks to be sure that there is no regression on the one # selected feature check, which would skew the result. 
From patchwork Tue Jul 26 19:49:25 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. Wong" X-Patchwork-Id: 12929745 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 50F6CC00140 for ; Tue, 26 Jul 2022 19:49:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239654AbiGZTta (ORCPT ); Tue, 26 Jul 2022 15:49:30 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37410 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239623AbiGZTt3 (ORCPT ); Tue, 26 Jul 2022 15:49:29 -0400 Received: from ams.source.kernel.org (ams.source.kernel.org [IPv6:2604:1380:4601:e00::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C1247357E4; Tue, 26 Jul 2022 12:49:28 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id 7AC14B81A0C; Tue, 26 Jul 2022 19:49:27 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 2BBFCC433C1; Tue, 26 Jul 2022 19:49:26 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1658864966; bh=kGOZC86XoZIhEkIl9Rg3eU70eX/lOCt+6PenAyb4Ih8=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=VY+wxoe/a5jBDEh6Lxzt0+XdkA/ZHvB9hMSSpey0roy0PcndXg8kS2DzFJ/vv+WF0 Y07+y7txkJOxH6wdmGSui9DUR+aruufVoqG7NLcuIwYr8/V/gUu24JZM7onyOPn+Po 7WShlm5UZREWwK1tk5PKfGQmNmZ3rTYvpHofy32Qzvdx8M7NxMtND7vQJ175nSxL0i tLW8NOIaWtqoE2R9E8XjEt7ai8R+hNqawUpdyDfiMzaKCdiOaZ7jzF6po82L3jawqd H0gCDVfB8rtBOOCvuekxUCrYLrFQk8Ctkw8mlueitpFtMMq8qACWojrRfk4YXOu7jb 472/EJ6QZaA9g== Subject: [PATCH 3/3] common: filter internal 
errors during io error testing From: "Darrick J. Wong" To: djwong@kernel.org, guaneryu@gmail.com, zlang@redhat.com Cc: linux-xfs@vger.kernel.org, fstests@vger.kernel.org, guan@eryu.me Date: Tue, 26 Jul 2022 12:49:25 -0700 Message-ID: <165886496575.1585306.16047150077901464823.stgit@magnolia> In-Reply-To: <165886494905.1585306.15343417924888857310.stgit@magnolia> References: <165886494905.1585306.15343417924888857310.stgit@magnolia> User-Agent: StGit/0.19 MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: fstests@vger.kernel.org From: Darrick J. Wong The goal of an EIO shutdown test is to examine the shutdown and recovery behavior if we make the underlying storage device return EIO. On XFS, it's possible that the shutdown will come from a thread that cancels a dirty transaction due to the EIO. This is expected behavior, but _check_dmesg will flag it as a test failure. Make it so that we can add simple regexps to the default check_dmesg filter function, then add the "Internal error" string to the filter function when we invoke an EIO test. This fixes periodic regressions in generic/019 and generic/475. Signed-off-by: Darrick J. Wong --- check | 1 + common/rc | 19 ++++++++++++++++++- common/xfs | 7 +++++++ 3 files changed, 26 insertions(+), 1 deletion(-) diff --git a/check b/check index 0b2f10ed..000e31cb 100755 --- a/check +++ b/check @@ -896,6 +896,7 @@ function run_section() echo "run fstests $seqnum at $date_time" > /dev/kmsg # _check_dmesg depends on this log in dmesg touch ${RESULT_DIR}/check_dmesg + rm -f ${RESULT_DIR}/dmesg_filter fi _try_wipe_scratch_devs > /dev/null 2>&1 diff --git a/common/rc b/common/rc index 317049bc..12964ae2 100644 --- a/common/rc +++ b/common/rc @@ -4331,8 +4331,25 @@ _check_dmesg_for() # lockdep. 
_check_dmesg_filter() { + local extra_filter= + local filter_file="${RESULT_DIR}/dmesg_filter" + + test -e "$filter_file" && extra_filter="-f $filter_file" + egrep -v -e "BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low" \ - -e "BUG: MAX_STACK_TRACE_ENTRIES too low" + -e "BUG: MAX_STACK_TRACE_ENTRIES too low" \ + $extra_filter +} + +# Add a simple expression to the default dmesg filter +_add_dmesg_filter() +{ + local regexp="$1" + local filter_file="${RESULT_DIR}/dmesg_filter" + + if [ ! -e "$filter_file" ] || ! grep -q "$regexp" "$filter_file"; then + echo "$regexp" >> "${RESULT_DIR}/dmesg_filter" + fi } # check dmesg log for WARNING/Oops/etc. diff --git a/common/xfs b/common/xfs index a7bc661e..8c52f0bb 100644 --- a/common/xfs +++ b/common/xfs @@ -818,6 +818,13 @@ _xfs_prepare_for_eio_shutdown() local dev="$1" local ctlfile="error/fail_at_unmount" + # Once we enable IO errors, it's possible that a writer thread will + # trip over EIO, cancel the transaction, and shut down the system. + # This is expected behavior, so we need to remove the "Internal error" + # message from the list of things that can cause the test to be marked + # as failed. + _add_dmesg_filter "Internal error" + # Don't retry any writes during the (presumably) post-shutdown unmount _has_fs_sysfs "$ctlfile" && _set_fs_sysfs_attr $dev "$ctlfile" 1