From patchwork Tue Mar 19 02:15:23 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 13596090 Received: from mail-pf1-f182.google.com (mail-pf1-f182.google.com [209.85.210.182]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A1AE565197 for ; Tue, 19 Mar 2024 02:15:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.182 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1710814556; cv=none; b=Prt6h/mbGjS85FPkJMe/Sd7lRRf78E1GxpmZDzTjp7tz58aGi2QFIxlFKjZMdl1qgxsJiN2ABWTuFiCd6PpyeWMrfSUhQuBcdEOa6eklcaAGt9M1xXLqa1vq1IKfuQPrmFetfAy3dS/vXXgMVY1YRr1Dq4ofj1sXJ5nrZxSDZ/4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1710814556; c=relaxed/simple; bh=nQLBaarSRJFvBUEubxM7xhMwp0+AqzbgsVkJ8MFpntw=; h=From:To:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=TwEpclydMLnEyB7z6fFDCShHHdlERjD7eSUzkO4RrQNNxvV1nPIpdCDtyZS1DLUfECfPBiUP9+/b5qXePWOZ6SitRXULRtgTERlyhw2hKx8/Lzjk37SF0ugOF2Wcwh6ZQGfid4n5SXLQ6Z5zx/lMj1rc5bknwz7//K1ksoZuPkY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=fromorbit.com; spf=pass smtp.mailfrom=fromorbit.com; dkim=pass (2048-bit key) header.d=fromorbit-com.20230601.gappssmtp.com header.i=@fromorbit-com.20230601.gappssmtp.com header.b=JC+MbMp0; arc=none smtp.client-ip=209.85.210.182 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=fromorbit.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=fromorbit.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=fromorbit-com.20230601.gappssmtp.com header.i=@fromorbit-com.20230601.gappssmtp.com header.b="JC+MbMp0" Received: by mail-pf1-f182.google.com with SMTP id d2e1a72fcca58-6e6b5432439so4856416b3a.1 for ; Mon, 18 Mar 2024 19:15:54 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fromorbit-com.20230601.gappssmtp.com; s=20230601; t=1710814554; x=1711419354; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:to:from:from:to:cc:subject:date:message-id :reply-to; bh=VVhiFEWWawY+pfKjAMMd/EV/29oKOuM8OrVr5fN+bWo=; b=JC+MbMp0qYSABtkTA/9ty1pDdmEUnBMqKrjqCKZQMFjcrX+EtiUCI0gn1i/FpPrYmY zuj5LLPq+5LOfaxsRqEIOc+R5nXE1d3kVo4g3GKMAIdTMAb9NaWAYUtaQV58T/JhfAOx 6qkZ5VueKKa+nnvOGmQW12f872BD1y+dx7+zdNwTrh0O7WHqzuI2sT8+OQ5Lj19n1nTp fRPxhx0DyRQ2tKtDRIDtWFJHMrExavjAhN/fh6iRGGVJntjGOoKG6Vzn3JoEdHvuM/iA jJTwIwZkQJShTrpPaEitPbWvcyZ5zaAn4bdZjsTdsrcucIxB1kxzWE19EK/KzHZeuvGy SS2g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1710814554; x=1711419354; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=VVhiFEWWawY+pfKjAMMd/EV/29oKOuM8OrVr5fN+bWo=; b=nLrLLax47SbNexfwLTIeIclfjgCGaheSHbI4d3the6TiFg8UkuGco2pKl27Wuh+d4L /MPlNjpuKCK0U4iTwYgovl6A5hMmD+oslBhnblceHL9RuvEtW2hSpSqCcJQ5M2yea+Zq xMF237TNUy5KYeoZolBjDHhp5bFscOiaSuxG01ngtyoiPIoPQ3pv7Qp1RhzGXKELB1a5 13jaYxeb72VLS2HWkJ0j6OkIUifEqPdGJCS92uQtzOzy6/IiC9aQriJWX3sMuNIHZCl+ 1YhWiHkYAFfAC4HZ9OsUOBF2VoXfC+1WMfxcK/EkCK6q8DfwktgZmlzq6UM9/jbj84nW 9mRQ== X-Gm-Message-State: AOJu0Yw3Teu8zpSMgvyb0NcHGNH3tnZKfESIgBN8TZIO1tf/t8BORgtd fnyQaNq8eM3pBxnuaNEIXTlK+ahY/K3UfjeHOQsKgsyQnTV7rpkyv35fIxEr5pfa7htBnskcDu0 C X-Google-Smtp-Source: AGHT+IF3Mw+mTRjXdphe8AfZzES0fI10fHq8COKzzvXHHeZUNUQVSsS8baYiXYAX4mdiyea+84FhyA== X-Received: by 2002:a05:6a00:803:b0:6e5:cd5c:620a with SMTP id m3-20020a056a00080300b006e5cd5c620amr18684679pfk.16.1710814553602; Mon, 18 Mar 2024 19:15:53 -0700 (PDT) Received: from dread.disaster.area (pa49-180-185-123.pa.nsw.optusnet.com.au. [49.180.185.123]) by smtp.gmail.com with ESMTPSA id i4-20020aa787c4000000b006e6ab7cb10esm8571848pfo.186.2024.03.18.19.15.51 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 18 Mar 2024 19:15:52 -0700 (PDT) Received: from [192.168.253.23] (helo=devoid.disaster.area) by dread.disaster.area with esmtp (Exim 4.96) (envelope-from ) id 1rmP13-003s57-2K for linux-xfs@vger.kernel.org; Tue, 19 Mar 2024 13:15:49 +1100 Received: from dave by devoid.disaster.area with local (Exim 4.97) (envelope-from ) id 1rmP13-0000000Ec6x-0vu9 for linux-xfs@vger.kernel.org; Tue, 19 Mar 2024 13:15:49 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 4/5] xfs: detect partial buffer recovery operations Date: Tue, 19 Mar 2024 13:15:23 +1100 Message-ID: <20240319021547.3483050-5-david@fromorbit.com> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240319021547.3483050-1-david@fromorbit.com> References: <20240319021547.3483050-1-david@fromorbit.com> Precedence: bulk X-Mailing-List: linux-xfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Dave Chinner When a compound buffer is logged (e.g. fragmented large directory block) we record it in the log as a series of separate buffer log format items in the journal. These get recovered individually, and because they are for non-contiguous extent ranges, we cannot use buffer addresses to detect that the buffer format items are from the same directory block. Further, we cannot use LSN checks to determine if the partial block buffers should be recovered - apart from the first buffer we don't have a header with an LSN in it to check. Finally, we cannot add a verifier to a partial block buffer because, again, it will fail the verifier checks and report corruption. We already skip this step due to bad magic number detection, but we should be able to do better here. The one thing we can rely on, though, is that each buffer format item is written consecutively in the journal. They are built at commit time into a single log iovec and chained into the iclog write log vector chain as an unbroken sequence. Hence all the parts of a compound buffer should be consecutive buf log format items in the transaction being recovered. Unfortunately, we don't have the information available in recovery to do a full compound buffer instantiation for recovery. We only have the fragments that contained modifications in the journal, and so there may be missing fragments that are still clean and hence are not in the journal. Hence we cannot use journal state to rebuild the compound buffer entirely and hence recover it as a complete entity and run a verifier over it before writeback. Hence the first thing we need to do is detect such partial buffer recovery situations and track whether we need to skip all the partial buffers due to the LSN check in the initial header fragment read from disk. Signed-off-by: Dave Chinner --- fs/xfs/xfs_buf_item_recover.c | 178 +++++++++++++++++++++++++++------- 1 file changed, 143 insertions(+), 35 deletions(-) diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c index 90740fcf2fbe..9225baa62755 100644 --- a/fs/xfs/xfs_buf_item_recover.c +++ b/fs/xfs/xfs_buf_item_recover.c @@ -434,18 +434,17 @@ xlog_recover_validate_buf_type( } /* - * Perform a 'normal' buffer recovery. Each logged region of the + * Perform recovery of the logged regions. Each logged region of the * buffer should be copied over the corresponding region in the * given buffer. The bitmap in the buf log format structure indicates * where to place the logged data. */ -static int -xlog_recover_do_reg_buffer( +static void +xlog_recover_buffer( struct xfs_mount *mp, struct xlog_recover_item *item, struct xfs_buf *bp, - struct xfs_buf_log_format *buf_f, - xfs_lsn_t current_lsn) + struct xfs_buf_log_format *buf_f) { int i; int bit; @@ -489,7 +488,17 @@ xlog_recover_do_reg_buffer( /* Shouldn't be any more regions */ ASSERT(i == item->ri_total); +} +static int +xlog_recover_do_reg_buffer( + struct xfs_mount *mp, + struct xlog_recover_item *item, + struct xfs_buf *bp, + struct xfs_buf_log_format *buf_f, + xfs_lsn_t current_lsn) +{ + xlog_recover_buffer(mp, item, bp, buf_f); return xlog_recover_validate_buf_type(mp, bp, buf_f, current_lsn); } @@ -735,6 +744,67 @@ xlog_recover_do_inode_buffer( return 0; } +static bool +xlog_recovery_is_dir_buf( + struct xfs_buf_log_format *buf_f) +{ + switch (xfs_blft_from_flags(buf_f)) { + case XFS_BLFT_DIR_BLOCK_BUF: + case XFS_BLFT_DIR_DATA_BUF: + case XFS_BLFT_DIR_FREE_BUF: + case XFS_BLFT_DIR_LEAF1_BUF: + case XFS_BLFT_DIR_LEAFN_BUF: + case XFS_BLFT_DA_NODE_BUF: + return true; + default: + break; + } + return false; +} + +/* + * Partial dabuf recovery. + * + * There are two main cases here - a buffer that contains the dabuf header and + * hence can be magic number and LSN checked, and then everything else. + * + * We can determine if the former should be replayed or not via LSN checks, but + * we cannot do that with the latter, so the only choice we have here is to + * always recover the changes regardless of whether this means metadata on disk + * will go backwards in time. This, at least, means that the changes in each + * checkpoint are applied consistently to the dabuf and we don't do really + * stupid things like skip the header fragment replay and then replay all the + * other changes to the dabuf block. + * + * While this is not ideal, finishing log recovery should them replay all the + * remaining changes across this buffer and so bring it back to being consistent + * on disk at the completion of recovery. Hence this "going backwards in time" + * situation will only be relevant to failed journal replay situations. These + * are rare and will require xfs_repair to be run, anyway, so the inconsistency + * that results will be corrected before the filesystem goes back into service, + * anyway. + * + * Important: This partial fragment recovery relies on log recovery purging the + * buffer cache after completion of this recovery phase. These partial buffers + * are never used at runtime (discontiguous buffers will be used instead), so + * they must be removed from the buffer cache to prevent them from causing + * overlapping range lookup failures for the entire dabuf range. + */ +static void +xlog_recover_do_partial_dabuf( + struct xfs_mount *mp, + struct xlog_recover_item *item, + struct xfs_buf *bp, + struct xfs_buf_log_format *buf_f) +{ + /* + * Always recover without verification or write verifiers. Use delwri + * and rely on post pass2 recovery cache purge to clean these out of + * memory. + */ + xlog_recover_buffer(mp, item, bp, buf_f); +} + /* * V5 filesystems know the age of the buffer on disk being recovered. We can * have newer objects on disk than we are replaying, and so for these cases we @@ -886,6 +956,54 @@ xlog_recover_get_buf_lsn( } +/* + * Recover the buffer only if we get an LSN from it and it's less than the lsn + * of the transaction we are replaying. + * + * Note that we have to be extremely careful of readahead here. Readahead does + * not attach verfiers to the buffers so if we don't actually do any replay + * after readahead because of the LSN we found in the buffer if more recent than + * that current transaction then we need to attach the verifier directly. + * Failure to do so can lead to future recovery actions (e.g. EFI and unlinked + * list recovery) can operate on the buffers and they won't get the verifier + * attached. This can lead to blocks on disk having the correct content but a + * stale CRC. + * + * It is safe to assume these clean buffers are currently up to date. If the + * buffer is dirtied by a later transaction being replayed, then the verifier + * will be reset to match whatever recover turns that buffer into. + * + * Return true if the buffer needs to be recovered, false if it doesn't. + */ +static bool +xlog_recover_this_buffer( + struct xfs_mount *mp, + struct xfs_buf *bp, + struct xfs_buf_log_format *buf_f, + xfs_lsn_t current_lsn) +{ + xfs_lsn_t lsn; + + lsn = xlog_recover_get_buf_lsn(mp, bp, buf_f); + if (!lsn) + return true; + if (lsn == -1) + return true; + if (XFS_LSN_CMP(lsn, current_lsn) < 0) + return true; + + trace_xfs_log_recover_buf_skip(mp->m_log, buf_f); + xlog_recover_validate_buf_type(mp, bp, buf_f, NULLCOMMITLSN); + + /* + * We're skipping replay of this buffer log item due to the log + * item LSN being behind the ondisk buffer. Verify the buffer + * contents since we aren't going to run the write verifier. + */ + if (bp->b_ops) + bp->b_ops->verify_read(bp); + return false; +} /* * This routine replays a modification made to a buffer at runtime. * There are actually two types of buffer, regular and inode, which @@ -920,7 +1038,6 @@ xlog_recover_buf_commit_pass2( struct xfs_mount *mp = log->l_mp; struct xfs_buf *bp; int error; - xfs_lsn_t lsn; /* * In this pass we only want to recover all the buffers which have @@ -962,7 +1079,8 @@ xlog_recover_buf_commit_pass2( if (error) goto out_release; - error = xlog_recover_do_reg_buffer(mp, item, bp, buf_f, + xlog_recover_buffer(mp, item, bp, buf_f); + error = xlog_recover_validate_buf_type(mp, bp, buf_f, NULLCOMMITLSN); if (error) goto out_release; @@ -970,41 +1088,31 @@ xlog_recover_buf_commit_pass2( } /* - * Recover the buffer only if we get an LSN from it and it's less than - * the lsn of the transaction we are replaying. - * - * Note that we have to be extremely careful of readahead here. - * Readahead does not attach verfiers to the buffers so if we don't - * actually do any replay after readahead because of the LSN we found - * in the buffer if more recent than that current transaction then we - * need to attach the verifier directly. Failure to do so can lead to - * future recovery actions (e.g. EFI and unlinked list recovery) can - * operate on the buffers and they won't get the verifier attached. This - * can lead to blocks on disk having the correct content but a stale - * CRC. - * - * It is safe to assume these clean buffers are currently up to date. - * If the buffer is dirtied by a later transaction being replayed, then - * the verifier will be reset to match whatever recover turns that - * buffer into. + * Directory buffers can be larger than a single filesystem block and + * if they are they can be fragmented. There are lots of concerns about + * recovering these, so push them out of line where the concerns can be + * documented clearly. */ - lsn = xlog_recover_get_buf_lsn(mp, bp, buf_f); - if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0) { - trace_xfs_log_recover_buf_skip(log, buf_f); - xlog_recover_validate_buf_type(mp, bp, buf_f, NULLCOMMITLSN); + if (xlog_recovery_is_dir_buf(buf_f) && + mp->m_dir_geo->blksize != BBTOB(buf_f->blf_len)) { + xlog_recover_do_partial_dabuf(mp, item, bp, buf_f); + goto out_write; + } + /* + * Whole buffer recovery, dependent on the LSN in the on-disk structure. + */ + if (!xlog_recover_this_buffer(mp, bp, buf_f, current_lsn)) { /* - * We're skipping replay of this buffer log item due to the log - * item LSN being behind the ondisk buffer. Verify the buffer - * contents since we aren't going to run the write verifier. + * We may have verified this buffer even though we aren't + * recovering it. Return the verifier error for early detection + * of recovery inconsistencies. */ - if (bp->b_ops) { - bp->b_ops->verify_read(bp); - error = bp->b_error; - } + error = bp->b_error; goto out_release; } + error = xlog_recover_do_reg_buffer(mp, item, bp, buf_f, current_lsn); if (error) goto out_release;