[7/7] xfs: only run COW extent recovery when there are no live extents

From: Darrick J. Wong <djwong@kernel.org>

As part of multiple customer escalations due to file data corruption
after copy on write operations, I wrote some fstests that use fsstress
to hammer on COW to shake things loose.  Regrettably, I caught some
filesystem shutdowns due to incorrect rmap operations with the following
loop:

mount <filesystem>				# (0)
fsstress <run only readonly ops> &		# (1)
while true; do
	fsstress <run all ops>
	mount -o remount,ro			# (2)
	fsstress <run only readonly ops>
	mount -o remount,rw			# (3)
done

When (2) happens, notice that (1) is still running.  xfs_remount_ro will
call xfs_blockgc_stop to walk the inode cache to free all the COW
extents, but the blockgc mechanism races with (1)'s reader threads to
take IOLOCKs and loses, which means that it doesn't clean them all out.
Call such a file (A).

When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
walks the ondisk refcount btree and frees any COW extent that it finds.
This function does not check the inode cache, which means that the incore
COW fork of inode (A) is now inconsistent with the ondisk metadata.  If
one of those former COW extents is allocated and mapped into another
file (B) and someone triggers a COW to the stale reservation in (A), A's
dirty data will be written into (B) and once that's done, those blocks
will be transferred to (A)'s data fork without bumping the refcount.
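
The sequence above can be sketched with a toy model.  This is only an
illustration under loud assumptions: every name below (toy_fs,
toy_recover_cow, and so on) is hypothetical, one block stands in for an
extent, and none of it is the kernel's data structures or code.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NBLOCKS 4

/* Hypothetical toy filesystem; not the real XFS structures. */
struct toy_fs {
	bool cow_staging[TOY_NBLOCKS];	/* "ondisk" COW staging records */
	int  refcount[TOY_NBLOCKS];	/* "ondisk" refcount, 0 == unmapped */
	char data[TOY_NBLOCKS];		/* one byte of file data per block */
	int  a_cow_block;		/* incore COW fork of (A), -1 if empty */
	int  a_data_block;		/* data fork mapping of (A), -1 if none */
	int  b_data_block;		/* data fork mapping of (B), -1 if none */
};

static void toy_init(struct toy_fs *fs)
{
	for (int i = 0; i < TOY_NBLOCKS; i++) {
		fs->cow_staging[i] = false;
		fs->refcount[i] = 0;
		fs->data[i] = 0;
	}
	fs->a_cow_block = fs->a_data_block = fs->b_data_block = -1;
}

/* Find a block that the "ondisk" metadata considers free. */
static int toy_find_free(const struct toy_fs *fs)
{
	for (int i = 0; i < TOY_NBLOCKS; i++)
		if (!fs->cow_staging[i] && fs->refcount[i] == 0)
			return i;
	return -1;
}

/* (A) reserves a COW staging block; recorded ondisk and incore. */
static int toy_reserve_cow_a(struct toy_fs *fs)
{
	int blk = toy_find_free(fs);

	assert(blk >= 0);
	fs->cow_staging[blk] = true;
	fs->a_cow_block = blk;
	return blk;
}

/*
 * Models the ro->rw recovery pass: frees every ondisk staging record
 * but never looks at the incore COW fork, which therefore goes stale.
 */
static int toy_recover_cow(struct toy_fs *fs)
{
	int freed = 0;

	for (int i = 0; i < TOY_NBLOCKS; i++) {
		if (fs->cow_staging[i]) {
			fs->cow_staging[i] = false;
			freed++;
		}
	}
	return freed;
}

/* (B) allocates a block the metadata calls free and writes to it. */
static int toy_write_b(struct toy_fs *fs, char byte)
{
	int blk = toy_find_free(fs);

	assert(blk >= 0);
	fs->refcount[blk] = 1;
	fs->data[blk] = byte;
	fs->b_data_block = blk;
	return blk;
}

/*
 * (A) writes through its stale reservation, then remaps the block into
 * its data fork.  The refcount is deliberately not bumped, as above.
 */
static int toy_cow_write_a(struct toy_fs *fs, char byte)
{
	int blk = fs->a_cow_block;

	assert(blk >= 0);
	fs->data[blk] = byte;
	fs->a_data_block = blk;
	fs->a_cow_block = -1;
	return blk;
}
```

Running reserve, recover, write (B), then COW write (A) in that order
leaves both files mapping the same block with a refcount of 1, and the
byte that (B) wrote replaced by (A)'s data.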

The results are catastrophic -- file (B) and the refcount btree are now
corrupt.  In the first patch, we fixed the race condition in (2) so that
(A) will always flush the COW fork.  In this second patch, we move the
_recover_cow call to the initial mount call in (0) for safety.

As mentioned previously, xfs_reflink_recover_cow walks the refcount
btree looking for COW staging extents, and frees them.  This was
intended to be run at mount time (when we know there are no live inodes)
to clean up any leftover staging extents that may have been left behind
during an unclean shutdown.  As a time "optimization" for readonly
mounts, we deferred this to the ro->rw transition, not realizing that
any failure to clean all COW forks during a rw->ro transition would
result in catastrophic corruption.

Therefore, remove this optimization and only run the recovery routine
when we're guaranteed not to have any COW staging extents anywhere,
which means we always run this at mount time.  While we're at it, move
the callsite to xfs_log_mount_finish because any refcount btree
expansion (however unlikely given that we're removing records from the
right side of the index) must be fed by a per-AG reservation, which
doesn't exist in its current location.
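
As a rough sketch of the policy change only, the two predicates below
contrast when recovery runs before and after this patch, as described in
this message.  The names (toy_event, toy_old_runs_recovery,
toy_new_runs_recovery) are hypothetical and this does not model the
actual kernel call chain or the per-AG reservation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lifecycle events; not kernel identifiers. */
enum toy_event { TOY_MOUNT, TOY_REMOUNT_RO, TOY_REMOUNT_RW };

/*
 * Old policy as described above: skip recovery on a readonly mount and
 * defer it to the ro->rw transition, when inodes may already be live.
 */
static bool toy_old_runs_recovery(enum toy_event ev, bool mount_readonly)
{
	switch (ev) {
	case TOY_MOUNT:
		return !mount_readonly;
	case TOY_REMOUNT_RW:
		return true;
	default:
		return false;
	}
}

/*
 * New policy as described above: recovery belongs to mount time only,
 * regardless of readonly state, and never runs on a remount.
 */
static bool toy_new_runs_recovery(enum toy_event ev, bool mount_readonly)
{
	(void)mount_readonly;	/* no longer affects the decision */
	return ev == TOY_MOUNT;
}
```

The only event where the two differ in a dangerous way is the ro->rw
remount: the old predicate allows recovery there, the new one does not.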

Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/xfs_log_recover.c |   24 +++++++++++++++++++++++-
 fs/xfs/xfs_mount.c       |   10 ----------
 fs/xfs/xfs_reflink.c     |    5 ++++-
 fs/xfs/xfs_super.c       |    9 ---------
 4 files changed, 27 insertions(+), 21 deletions(-)

Message ID	163961699399.3129691.9449691191051808697.stgit@magnolia
State	Accepted
Date	Wed, 15 Dec 2021 17:09:54 -0800
Series	xfs: random fixes for 5.17
	[PATCHSET,0/7] xfs: random fixes for 5.17
	[1/7] xfs: take the ILOCK when accessing the inode core
	[2/7] xfs: shut down filesystem if we xfs_trans_cancel with deferred work items
	[3/7] xfs: fix a bug in the online fsck directory leaf1 bestcount check
	[4/7] xfs: prevent UAF in xfs_log_item_in_current_chkpt
	[5/7] xfs: fix quotaoff mutex usage now that we don't support disabling it
	[6/7] xfs: don't expose internal symlink metadata buffers to the vfs
	[7/7] xfs: only run COW extent recovery when there are no live extents
