From patchwork Wed Aug 3 04:22:29 2022
X-Patchwork-Submitter: "Darrick J. Wong"
X-Patchwork-Id: 12935178
Subject: [PATCH 1/3] common/xfs: fix _reset_xfs_sysfs_error_handling reset to actual defaults
From: "Darrick J. Wong"
To: djwong@kernel.org, guaneryu@gmail.com, zlang@redhat.com
Cc: Christoph Hellwig, linux-xfs@vger.kernel.org, fstests@vger.kernel.org, guan@eryu.me
Date: Tue, 02 Aug 2022 21:22:29 -0700
Message-ID: <165950054961.199222.14288700692275773893.stgit@magnolia>
In-Reply-To: <165950054404.199222.5615656337332007333.stgit@magnolia>
References: <165950054404.199222.5615656337332007333.stgit@magnolia>

From: Darrick J. Wong

There's a slight mistake in _reset_xfs_sysfs_error_handling: it sets
retry_timeout_seconds to 0, which is not the current default (-1) in
upstream Linux. Fix this.

Signed-off-by: Darrick J. Wong
Reviewed-by: Christoph Hellwig
---
 common/xfs        |    2 +-
 tests/xfs/006.out |    6 +++---
 tests/xfs/264.out |   12 ++++++------
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/common/xfs b/common/xfs
index f6f4cdd2..92c281c6 100644
--- a/common/xfs
+++ b/common/xfs
@@ -804,7 +804,7 @@ _reset_xfs_sysfs_error_handling()
 		_get_fs_sysfs_attr $dev error/metadata/${e}/max_retries
 
 		_set_fs_sysfs_attr $dev \
-			error/metadata/${e}/retry_timeout_seconds 0
+			error/metadata/${e}/retry_timeout_seconds -1
 		echo -n "error/metadata/${e}/retry_timeout_seconds="
 		_get_fs_sysfs_attr $dev \
 			error/metadata/${e}/retry_timeout_seconds
diff --git a/tests/xfs/006.out b/tests/xfs/006.out
index 3260b3a2..433b0bc3 100644
--- a/tests/xfs/006.out
+++ b/tests/xfs/006.out
@@ -1,8 +1,8 @@
 QA output created by 006
 error/fail_at_unmount=1
 error/metadata/default/max_retries=-1
-error/metadata/default/retry_timeout_seconds=0
+error/metadata/default/retry_timeout_seconds=-1
 error/metadata/EIO/max_retries=-1
-error/metadata/EIO/retry_timeout_seconds=0
+error/metadata/EIO/retry_timeout_seconds=-1
 error/metadata/ENOSPC/max_retries=-1
-error/metadata/ENOSPC/retry_timeout_seconds=0
+error/metadata/ENOSPC/retry_timeout_seconds=-1
diff --git a/tests/xfs/264.out b/tests/xfs/264.out
index 502e72d3..e45ac5a5 100644
--- a/tests/xfs/264.out
+++ b/tests/xfs/264.out
@@ -2,20 +2,20 @@ QA output created by 264
 === Test EIO/max_retries ===
 error/fail_at_unmount=1
 error/metadata/default/max_retries=-1
-error/metadata/default/retry_timeout_seconds=0
+error/metadata/default/retry_timeout_seconds=-1
 error/metadata/EIO/max_retries=-1
-error/metadata/EIO/retry_timeout_seconds=0
+error/metadata/EIO/retry_timeout_seconds=-1
 error/metadata/ENOSPC/max_retries=-1
-error/metadata/ENOSPC/retry_timeout_seconds=0
+error/metadata/ENOSPC/retry_timeout_seconds=-1
 error/fail_at_unmount=0
 error/metadata/EIO/max_retries=1
 === Test EIO/retry_timeout_seconds ===
 error/fail_at_unmount=1
 error/metadata/default/max_retries=-1
-error/metadata/default/retry_timeout_seconds=0
+error/metadata/default/retry_timeout_seconds=-1
 error/metadata/EIO/max_retries=-1
-error/metadata/EIO/retry_timeout_seconds=0
+error/metadata/EIO/retry_timeout_seconds=-1
 error/metadata/ENOSPC/max_retries=-1
-error/metadata/ENOSPC/retry_timeout_seconds=0
+error/metadata/ENOSPC/retry_timeout_seconds=-1
 error/fail_at_unmount=0
 error/metadata/EIO/retry_timeout_seconds=1

From patchwork Wed Aug 3 04:22:35 2022
X-Patchwork-Submitter: "Darrick J. Wong"
X-Patchwork-Id: 12935179
Subject: [PATCH 2/3] common: disable infinite IO error retry for EIO shutdown tests
From: "Darrick J. Wong"
To: djwong@kernel.org, guaneryu@gmail.com, zlang@redhat.com
Cc: linux-xfs@vger.kernel.org, fstests@vger.kernel.org, guan@eryu.me
Date: Tue, 02 Aug 2022 21:22:35 -0700
Message-ID: <165950055523.199222.9175019533516343488.stgit@magnolia>
In-Reply-To: <165950054404.199222.5615656337332007333.stgit@magnolia>
References: <165950054404.199222.5615656337332007333.stgit@magnolia>

From: Darrick J. Wong

This patch fixes a rather hard-to-hit livelock in the tests that examine
how xfs handles shutdown when the device suddenly dies and starts
returning EIO all the time. The livelock happens if the AIL is stuck
retrying failed metadata updates forever, the log itself is not being
written, and there is no more log grant space, which prevents the
frontend from shutting down the log due to EIO errors during
transactions.

While most users probably want the default retry-forever behavior
because EIO can be transient, the circumstances are different here.
The tests are designed to flip the device back to working status only
after the unmount succeeds, so we know there's no point in the
filesystem retrying writes until after the unmount.

This fixes some of the periodic hangs in generic/019 and generic/475.

Signed-off-by: Darrick J. Wong
---
 common/dmerror           |    4 ++++
 common/fail_make_request |    1 +
 common/rc                |   31 +++++++++++++++++++++++++++----
 common/xfs               |   29 +++++++++++++++++++++++++++++
 4 files changed, 61 insertions(+), 4 deletions(-)

diff --git a/common/dmerror b/common/dmerror
index 0934d220..54122b12 100644
--- a/common/dmerror
+++ b/common/dmerror
@@ -138,6 +138,10 @@ _dmerror_load_error_table()
 		suspend_opt="$*"
 	fi
 
+	# If the full environment is set up, configure ourselves for shutdown
+	type _prepare_for_eio_shutdown &>/dev/null && \
+		_prepare_for_eio_shutdown $DMERROR_DEV
+
 	# Suspend the scratch device before the log and realtime devices so
 	# that the kernel can freeze and flush the filesystem if the caller
 	# wanted a freeze.
diff --git a/common/fail_make_request b/common/fail_make_request
index 9f8ea500..b5370ba6 100644
--- a/common/fail_make_request
+++ b/common/fail_make_request
@@ -44,6 +44,7 @@ _start_fail_scratch_dev()
 {
 	echo "Force SCRATCH_DEV device failure"
 
+	_prepare_for_eio_shutdown $SCRATCH_DEV
 	_bdev_fail_make_request $SCRATCH_DEV 1
 	[ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
 		_bdev_fail_make_request $SCRATCH_LOGDEV 1
diff --git a/common/rc b/common/rc
index 63bafb4b..119cc477 100644
--- a/common/rc
+++ b/common/rc
@@ -4205,6 +4205,20 @@ _check_dmesg()
 	fi
 }
 
+# Make whatever configuration changes we need ahead of testing fs shutdowns due
+# to unexpected IO errors while updating metadata. The sole parameter should
+# be the fs device, e.g. $SCRATCH_DEV.
+_prepare_for_eio_shutdown()
+{
+	local dev="$1"
+
+	case "$FSTYP" in
+	"xfs")
+		_xfs_prepare_for_eio_shutdown "$dev"
+		;;
+	esac
+}
+
 # capture the kmemleak report
 _capture_kmemleak()
 {
@@ -4467,7 +4481,7 @@ run_fsx()
 #
 # Usage example:
 #   _require_fs_sysfs error/fail_at_unmount
-_require_fs_sysfs()
+_has_fs_sysfs()
 {
 	local attr=$1
 	local dname
@@ -4483,9 +4497,18 @@ _require_fs_sysfs()
 		_fail "Usage: _require_fs_sysfs "
 	fi
 
-	if [ ! -e /sys/fs/${FSTYP}/${dname}/${attr} ];then
-		_notrun "This test requires /sys/fs/${FSTYP}/${dname}/${attr}"
-	fi
+	test -e /sys/fs/${FSTYP}/${dname}/${attr}
+}
+
+# Require the existence of a sysfs entry at /sys/fs/$FSTYP/DEV/$ATTR
+_require_fs_sysfs()
+{
+	_has_fs_sysfs "$@" && return
+
+	local attr=$1
+	local dname=$(_short_dev $TEST_DEV)
+
+	_notrun "This test requires /sys/fs/${FSTYP}/${dname}/${attr}"
 }
 
 _require_statx()
diff --git a/common/xfs b/common/xfs
index 92c281c6..65234c8b 100644
--- a/common/xfs
+++ b/common/xfs
@@ -823,6 +823,35 @@ _scratch_xfs_unmount_dirty()
 	_scratch_unmount
 }
 
+# Prepare a mounted filesystem for an IO error shutdown test by disabling retry
+# for metadata writes. This prevents a (rare) log livelock when:
+#
+# - The log has given out all available grant space, preventing any new
+#   writers from tripping over IO errors (and shutting down the fs/log),
+# - All log buffers were written to disk, and
+# - The log tail is pinned because the AIL keeps hitting EIO trying to write
+#   committed changes back into the filesystem.
+#
+# Real users might want the default behavior of the AIL retrying writes forever
+# but for testing purposes we don't want to wait.
+#
+# The sole parameter should be the filesystem data device, e.g. $SCRATCH_DEV.
+_xfs_prepare_for_eio_shutdown()
+{
+	local dev="$1"
+	local ctlfile="error/fail_at_unmount"
+
+	# Don't retry any writes during the (presumably) post-shutdown unmount
+	_has_fs_sysfs "$ctlfile" && _set_fs_sysfs_attr $dev "$ctlfile" 1
+
+	# Disable retry of metadata writes that fail with EIO
+	for ctl in max_retries retry_timeout_seconds; do
+		ctlfile="error/metadata/EIO/$ctl"
+
+		_has_fs_sysfs "$ctlfile" && _set_fs_sysfs_attr $dev "$ctlfile" 0
+	done
+}
+
 # Skip if we are running an older binary without the stricter input checks.
 # Make multiple checks to be sure that there is no regression on the one
 # selected feature check, which would skew the result.
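
For reference, the effect of _xfs_prepare_for_eio_shutdown on the XFS error-configuration knobs can be sketched as a standalone bash function. This is a hypothetical illustration, not part of the patch: SYSFS_ROOT, set_knob_if_present and prepare_xfs_eio_shutdown are made-up names, and the sysfs root is overridable so the logic can be exercised against a scratch directory instead of /sys/fs/xfs. It relies on the documented XFS semantics in which a value of 0 for max_retries or retry_timeout_seconds means fail on the first error and -1 means retry forever.

```shell
#!/bin/bash
# Hypothetical sketch; these names are illustrative, not fstests helpers.
# Override SYSFS_ROOT to point at a fake tree when experimenting.
SYSFS_ROOT="${SYSFS_ROOT:-/sys/fs/xfs}"

# Write VALUE to ATTR in the per-device directory, but only if the knob
# exists (older kernels may lack some of them).
set_knob_if_present() {
	local devname="$1" attr="$2" value="$3"
	local path="$SYSFS_ROOT/$devname/$attr"

	[ -e "$path" ] || return 0
	echo "$value" > "$path"
}

# DEVNAME is the short device name, e.g. "sdb" or "dm-0".
prepare_xfs_eio_shutdown() {
	local devname="$1"
	local ctl

	# Don't retry failed writes during the post-shutdown unmount.
	set_knob_if_present "$devname" error/fail_at_unmount 1

	# 0 retries and a 0-second timeout: EIO metadata writeback fails at once
	# instead of pinning the log tail forever.
	for ctl in max_retries retry_timeout_seconds; do
		set_knob_if_present "$devname" "error/metadata/EIO/$ctl" 0
	done
}
```

Unlike the fstests helper, which probes for each knob with _has_fs_sysfs, this sketch checks the exact per-device path it is about to write, so a knob missing on an older kernel is simply skipped.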
From patchwork Wed Aug 3 04:22:40 2022
X-Patchwork-Submitter: "Darrick J. Wong"
X-Patchwork-Id: 12935180
Subject: [PATCH 3/3] common: filter internal errors during io error testing
From: "Darrick J. Wong"
To: djwong@kernel.org, guaneryu@gmail.com, zlang@redhat.com
Cc: linux-xfs@vger.kernel.org, fstests@vger.kernel.org, guan@eryu.me
Date: Tue, 02 Aug 2022 21:22:40 -0700
Message-ID: <165950056086.199222.18041854919038769019.stgit@magnolia>
In-Reply-To: <165950054404.199222.5615656337332007333.stgit@magnolia>
References: <165950054404.199222.5615656337332007333.stgit@magnolia>

From: Darrick J. Wong

The goal of an EIO shutdown test is to examine the shutdown and recovery
behavior if we make the underlying storage device return EIO. On XFS,
it's possible that the shutdown will come from a thread that cancels a
dirty transaction due to the EIO. This is expected behavior, but
_check_dmesg will flag it as a test failure.

Make it so that we can add simple regexps to the default _check_dmesg
filter function, then add the "Internal error" string to that filter
when an EIO test is set up.

This fixes periodic regressions in generic/019 and generic/475.

Signed-off-by: Darrick J. Wong
---
 check      |    1 +
 common/rc  |   19 ++++++++++++++++++-
 common/xfs |    7 +++++++
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/check b/check
index 0b2f10ed..000e31cb 100755
--- a/check
+++ b/check
@@ -896,6 +896,7 @@ function run_section()
 			echo "run fstests $seqnum at $date_time" > /dev/kmsg
 			# _check_dmesg depends on this log in dmesg
 			touch ${RESULT_DIR}/check_dmesg
+			rm -f ${RESULT_DIR}/dmesg_filter
 		fi
 		_try_wipe_scratch_devs > /dev/null 2>&1
 
diff --git a/common/rc b/common/rc
index 119cc477..0f233892 100644
--- a/common/rc
+++ b/common/rc
@@ -4164,8 +4164,25 @@ _check_dmesg_for()
 # lockdep.
 _check_dmesg_filter()
 {
+	local extra_filter=
+	local filter_file="${RESULT_DIR}/dmesg_filter"
+
+	test -e "$filter_file" && extra_filter="-f $filter_file"
+
 	egrep -v -e "BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low" \
-		-e "BUG: MAX_STACK_TRACE_ENTRIES too low"
+		-e "BUG: MAX_STACK_TRACE_ENTRIES too low" \
+		$extra_filter
+}
+
+# Add a simple expression to the default dmesg filter
+_add_dmesg_filter()
+{
+	local regexp="$1"
+	local filter_file="${RESULT_DIR}/dmesg_filter"
+
+	if [ ! -e "$filter_file" ] || ! grep -q "$regexp" "$filter_file"; then
+		echo "$regexp" >> "${RESULT_DIR}/dmesg_filter"
+	fi
 }
 
 # check dmesg log for WARNING/Oops/etc.
diff --git a/common/xfs b/common/xfs
index 65234c8b..ae81b3fe 100644
--- a/common/xfs
+++ b/common/xfs
@@ -841,6 +841,13 @@ _xfs_prepare_for_eio_shutdown()
 	local dev="$1"
 	local ctlfile="error/fail_at_unmount"
 
+	# Once we enable IO errors, it's possible that a writer thread will
+	# trip over EIO, cancel the transaction, and shut down the system.
+	# This is expected behavior, so we need to remove the "Internal error"
+	# message from the list of things that can cause the test to be marked
+	# as failed.
+	_add_dmesg_filter "Internal error"
+
 	# Don't retry any writes during the (presumably) post-shutdown unmount
 	_has_fs_sysfs "$ctlfile" && _set_fs_sysfs_attr $dev "$ctlfile" 1
 
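
The per-test dmesg filter file introduced here can be sketched in isolation. This is a hypothetical, self-contained bash illustration rather than the fstests code: add_filter and filter_log are made-up names, and the filter file path is passed explicitly instead of being derived from RESULT_DIR. It makes three small changes worth noting. Deduplication uses grep -qxF, so only an identical line counts as already present, whereas the patch's grep -q treats the new expression as a regexp matched against existing lines. The optional -f argument is held in a bash array so a path containing spaces survives word splitting. grep -E replaces the deprecated egrep.

```shell
#!/bin/bash
# Hypothetical sketch; add_filter and filter_log are illustrative names.

# Append REGEXP to FILE unless an identical line is already there.
add_filter() {
	local file="$1" regexp="$2"

	# -x: whole-line match, -F: fixed string, so dedup compares literally.
	if [ ! -e "$file" ] || ! grep -qxF -- "$regexp" "$file"; then
		printf '%s\n' "$regexp" >> "$file"
	fi
}

# Read a kernel log on stdin and drop lines matching the built-in patterns
# or any pattern listed in FILE (one extended regexp per line).
filter_log() {
	local file="$1"
	local -a extra=()

	[ -e "$file" ] && extra=(-f "$file")
	grep -E -v -e "BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low" \
		-e "BUG: MAX_STACK_TRACE_ENTRIES too low" \
		"${extra[@]}"
}
```

As in the patch, a missing filter file simply means only the built-in patterns apply, and removing the file at the start of each test (as check now does) keeps one test's filters from leaking into the next.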