From patchwork Thu Dec 16 01:09:21 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. Wong" X-Patchwork-Id: 12679781 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1D20EC433EF for ; Thu, 16 Dec 2021 01:09:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229739AbhLPBJY (ORCPT ); Wed, 15 Dec 2021 20:09:24 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58220 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229441AbhLPBJY (ORCPT ); Wed, 15 Dec 2021 20:09:24 -0500 Received: from ams.source.kernel.org (ams.source.kernel.org [IPv6:2604:1380:4601:e00::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D98B1C061574 for ; Wed, 15 Dec 2021 17:09:23 -0800 (PST) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id 95EFDB8226D for ; Thu, 16 Dec 2021 01:09:22 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 47549C36AE1; Thu, 16 Dec 2021 01:09:21 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1639616961; bh=y+64aMttUIyIIpHriH84iy+QSi4K638GQ6bE5Mq3Knc=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=prsw7XfNIBd/njBLQFHz8CDDl8esLQy8jfaiKHzaPWfHBK+Ik+pbp9l3JoRIH+zlo mmon4bPRevUYX0Cjv3Uix2mw84/OWNE4vwH2Fok3v6J/HIZMyKZ98knDX4Vd/0Lz04 c4IqCS/3X6kd5xYwtkyIzYXoCRSgHJ46SC6lUay4tLDeCjlDgTwoIpQg31Iyo1sPve Z+V3S2v4ZG+Uwy+++rx+gBTddK8Jc2KG2nZL4UBmaq3UAluTerg71mdB33plMx+alO Ta6kcKR4w/G3kyTfOFzVQgpyoW5fjo26F8MYG9W6eA5Vm/l1t42XJAWTSnvA87AQQj MaMgr93hMGwNw== Subject: [PATCH 1/7] xfs: take the ILOCK when accessing the inode core From: "Darrick J. Wong" To: djwong@kernel.org Cc: linux-xfs@vger.kernel.org Date: Wed, 15 Dec 2021 17:09:21 -0800 Message-ID: <163961696098.3129691.10143704907338536631.stgit@magnolia> In-Reply-To: <163961695502.3129691.3496134437073533141.stgit@magnolia> References: <163961695502.3129691.3496134437073533141.stgit@magnolia> User-Agent: StGit/0.19 MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Darrick J. Wong I was poking around in the directory code while diagnosing online fsck bugs, and noticed that xfs_readdir doesn't actually take the directory ILOCK when it calls xfs_dir2_isblock. xfs_dir_open most probably loaded the data fork mappings and the VFS took i_rwsem (aka IOLOCK_SHARED) so we're protected against writer threads, but we really need to follow the locking model like we do in other places. The same applies to the shortform getdents function. While we're at it, clean up the somewhat strange structure of this function. Signed-off-by: Darrick J. Wong --- fs/xfs/xfs_dir2_readdir.c | 28 +++++++++++++++++----------- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c index 8310005af00f..25560151c273 100644 --- a/fs/xfs/xfs_dir2_readdir.c +++ b/fs/xfs/xfs_dir2_readdir.c @@ -507,8 +507,9 @@ xfs_readdir( size_t bufsize) { struct xfs_da_args args = { NULL }; - int rval; - int v; + unsigned int lock_mode; + int error; + int isblock; trace_xfs_readdir(dp); @@ -522,14 +523,19 @@ xfs_readdir( args.geo = dp->i_mount->m_dir_geo; args.trans = tp; - if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) - rval = xfs_dir2_sf_getdents(&args, ctx); - else if ((rval = xfs_dir2_isblock(&args, &v))) - ; - else if (v) - rval = xfs_dir2_block_getdents(&args, ctx); - else - rval = xfs_dir2_leaf_getdents(&args, ctx, bufsize); + lock_mode = xfs_ilock_data_map_shared(dp); + if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) { + xfs_iunlock(dp, lock_mode); + return xfs_dir2_sf_getdents(&args, ctx); + } - return rval; + error = xfs_dir2_isblock(&args, &isblock); + xfs_iunlock(dp, lock_mode); + if (error) + return error; + + if (isblock) + return xfs_dir2_block_getdents(&args, ctx); + + return xfs_dir2_leaf_getdents(&args, ctx, bufsize); } From patchwork Thu Dec 16 01:09:26 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. Wong" X-Patchwork-Id: 12679783 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8A3D5C433F5 for ; Thu, 16 Dec 2021 01:09:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229866AbhLPBJb (ORCPT ); Wed, 15 Dec 2021 20:09:31 -0500 Received: from ams.source.kernel.org ([145.40.68.75]:48120 "EHLO ams.source.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229809AbhLPBJ3 (ORCPT ); Wed, 15 Dec 2021 20:09:29 -0500 Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id 0C654B8226B for ; Thu, 16 Dec 2021 01:09:28 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id C1461C36AE1; Thu, 16 Dec 2021 01:09:26 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1639616966; bh=LCsWpvfM7oVyUXPZrKubG7mceV7f1hUET1sWtlIb0Ew=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=ACtmmIcB1PWK0RVKjv70Jz4gj46+ge9WZxB28q9bSBzjOKHAnCkerGazImL1yC3gU Dy94vJhvAKozbnzptMxICIKrct3E7WNLvKMvJ3c1ApbnxsnPsak3yRgpOFzpQfmBIo 33KPEIYz16Gz68swOPyYJUUd8kldfd+4c7NLnZFC4YWw/DcQd6+3qqtytXDIPDzTaJ mVva7IDICTAGFCm0PS8k2h2CS2dBO+ujg0hj1C+q2m41JcQSGOzHw1iwlkfN0BXN7T fwA+SpIfuuK8H7jcGtqIzMSg7ANTRDick6eVGK14Zdl+FhraoUY3KOhcrk/I5rOT46 L8rv8izc2c2Vw== Subject: [PATCH 2/7] xfs: shut down filesystem if we xfs_trans_cancel with deferred work items From: "Darrick J. Wong" To: djwong@kernel.org Cc: linux-xfs@vger.kernel.org Date: Wed, 15 Dec 2021 17:09:26 -0800 Message-ID: <163961696648.3129691.5075630610079213754.stgit@magnolia> In-Reply-To: <163961695502.3129691.3496134437073533141.stgit@magnolia> References: <163961695502.3129691.3496134437073533141.stgit@magnolia> User-Agent: StGit/0.19 MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Darrick J. Wong While debugging some very strange rmap corruption reports in connection with the online directory repair code. I root-caused the error to the following incorrect sequence: Obviously, we should have committed the transaction instead of cancelling it. Thinking more broadly, however, xfs_trans_cancel should have warned us that we were throwing away work item that we already committed to performing. This is not correct, and we need to shut down the filesystem. Change xfs_trans_cancel to complain in the loudest manner if we're cancelling any transaction with deferred work items attached. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner Reviewed-by: Christoph Hellwig --- fs/xfs/xfs_trans.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index 234a9d9c2f43..59e2f9031b9f 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -942,8 +942,17 @@ xfs_trans_cancel( trace_xfs_trans_cancel(tp, _RET_IP_); - if (tp->t_flags & XFS_TRANS_PERM_LOG_RES) + /* + * It's never valid to cancel a transaction with deferred ops attached, + * because the transaction is effectively dirty. Complain about this + * loudly before freeing the in-memory defer items. + */ + if (!list_empty(&tp->t_dfops)) { + ASSERT(xfs_is_shutdown(mp) || list_empty(&tp->t_dfops)); + ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); + dirty = true; xfs_defer_cancel(tp); + } /* * See if the caller is relying on us to shut down the From patchwork Thu Dec 16 01:09:32 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. Wong" X-Patchwork-Id: 12679785 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8A74DC433EF for ; Thu, 16 Dec 2021 01:09:35 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229789AbhLPBJe (ORCPT ); Wed, 15 Dec 2021 20:09:34 -0500 Received: from ams.source.kernel.org ([145.40.68.75]:48136 "EHLO ams.source.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229441AbhLPBJe (ORCPT ); Wed, 15 Dec 2021 20:09:34 -0500 Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id 8087AB82163 for ; Thu, 16 Dec 2021 01:09:33 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 44F40C36AE0; Thu, 16 Dec 2021 01:09:32 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1639616972; bh=t/ZeqlaOqA8O6RfXXKrJLh80mUsLKCGZui2WrECJAyc=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=HM43La5ugHzvy/cY9z0+d8vJ31ChoUdnRzeaNbWGSj++1ARuznuQLkS5uWRVKzsri pATwxApTQqnognU7fXBWUIQ+rSwAxI0CdriOyrrdXE/MxR88xjjdXpa3AOv37dNS2E QvN9N5FPurVGXL04XM0v/zcqVSRkOxlEFj492pECU8YRGCMONVO5sTifz95AW/weIH s3biggE9ScDRgRwpiYgh34ax4SKA5rLPTkX/fu14vmfC6GolJUEDXiPrj98VRsPMsY 8mDh8dc9K9j6Kb+uV54eKIaeKXtjxRpsBc0R0piE4YqQulmGgoWl2X5IZCSPFYMsBS FDZN/ljuINfcg== Subject: [PATCH 3/7] xfs: fix a bug in the online fsck directory leaf1 bestcount check From: "Darrick J. Wong" To: djwong@kernel.org Cc: linux-xfs@vger.kernel.org Date: Wed, 15 Dec 2021 17:09:32 -0800 Message-ID: <163961697197.3129691.1911552605195534271.stgit@magnolia> In-Reply-To: <163961695502.3129691.3496134437073533141.stgit@magnolia> References: <163961695502.3129691.3496134437073533141.stgit@magnolia> User-Agent: StGit/0.19 MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Darrick J. Wong When xfs_scrub encounters a directory with a leaf1 block, it tries to validate that the leaf1 block's bestcount (aka the best free count of each directory data block) is the correct size. Previously, this author believed that comparing bestcount to the directory isize (since directory data blocks are under isize, and leaf/bestfree blocks are above it) was sufficient. Unfortunately during testing of online repair, it was discovered that it is possible to create a directory with a hole between the last directory block and isize. The directory code seems to handle this situation just fine and xfs_repair doesn't complain, which effectively makes this quirk part of the disk format. Fix the check to work properly. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner Reviewed-by: Christoph Hellwig --- fs/xfs/scrub/dir.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c index 200a63f58fe7..9a16932d77ce 100644 --- a/fs/xfs/scrub/dir.c +++ b/fs/xfs/scrub/dir.c @@ -497,6 +497,7 @@ STATIC int xchk_directory_leaf1_bestfree( struct xfs_scrub *sc, struct xfs_da_args *args, + xfs_dir2_db_t last_data_db, xfs_dablk_t lblk) { struct xfs_dir3_icleaf_hdr leafhdr; @@ -534,10 +535,14 @@ xchk_directory_leaf1_bestfree( } /* - * There should be as many bestfree slots as there are dir data - * blocks that can fit under i_size. + * There must be enough bestfree slots to cover all the directory data + * blocks that we scanned. It is possible for there to be a hole + * between the last data block and i_disk_size. This seems like an + * oversight to the scrub author, but as we have been writing out + * directories like this (and xfs_repair doesn't mind them) for years, + * that's what we have to check. */ - if (bestcount != xfs_dir2_byte_to_db(geo, sc->ip->i_disk_size)) { + if (bestcount != last_data_db + 1) { xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk); goto out; } @@ -669,6 +674,7 @@ xchk_directory_blocks( xfs_fileoff_t lblk; struct xfs_iext_cursor icur; xfs_dablk_t dabno; + xfs_dir2_db_t last_data_db = 0; bool found; int is_block = 0; int error; @@ -712,6 +718,7 @@ xchk_directory_blocks( args.geo->fsbcount); lblk < got.br_startoff + got.br_blockcount; lblk += args.geo->fsbcount) { + last_data_db = xfs_dir2_da_to_db(mp->m_dir_geo, lblk); error = xchk_directory_data_bestfree(sc, lblk, is_block); if (error) @@ -734,7 +741,7 @@ xchk_directory_blocks( xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk); goto out; } - error = xchk_directory_leaf1_bestfree(sc, &args, + error = xchk_directory_leaf1_bestfree(sc, &args, last_data_db, leaf_lblk); if (error) goto out; From patchwork Thu Dec 16 01:09:37 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. Wong" X-Patchwork-Id: 12679787 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B85C2C433EF for ; Thu, 16 Dec 2021 01:09:39 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229809AbhLPBJj (ORCPT ); Wed, 15 Dec 2021 20:09:39 -0500 Received: from dfw.source.kernel.org ([139.178.84.217]:45380 "EHLO dfw.source.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229441AbhLPBJi (ORCPT ); Wed, 15 Dec 2021 20:09:38 -0500 Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 6EF2261BB8 for ; Thu, 16 Dec 2021 01:09:38 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id C7725C36AE0; Thu, 16 Dec 2021 01:09:37 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1639616977; bh=uzRA+0mCayJ5o2F7FFi9w1Kg2ICJjpMEAdvgHCNllJI=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=piUOAjc7KH5jYJsrvmTZuGguZxYTDsNS+Qj02/Jq5sKuS5d6CYVYc7BiiocuMCHTL xxisD/IX1vm/LEC2Fz7If+miNCajD2KVg45DiH8uwiAtZwmCBsyyzS8WF50KRqQXzG RsraPcrrxOEVwOwG/2YQHBgMqVywQOjlJ64id8zRz+PgXstGymnHIsFHnqLGLr+MvZ YHT4FpB+c9h4mQTQ9DRDse+ahonzcWH5UQU2gZ84Z7xm2G7yhy/cEu+C5gGLfM0f3A HZT/dg3/9rd6O4GGPwbR4MlyYCmuJILUfQLZTj5Kw8o/Cu7vrEjaiv5SNAESevHHIn hPFiOA0dG4LUg== Subject: [PATCH 4/7] xfs: prevent UAF in xfs_log_item_in_current_chkpt From: "Darrick J. Wong" To: djwong@kernel.org Cc: linux-xfs@vger.kernel.org Date: Wed, 15 Dec 2021 17:09:37 -0800 Message-ID: <163961697749.3129691.10072192517670663885.stgit@magnolia> In-Reply-To: <163961695502.3129691.3496134437073533141.stgit@magnolia> References: <163961695502.3129691.3496134437073533141.stgit@magnolia> User-Agent: StGit/0.19 MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Darrick J. Wong While I was running with KASAN and lockdep enabled, I stumbled upon an KASAN report about a UAF to a freed CIL checkpoint. Looking at the comment for xfs_log_item_in_current_chkpt, it seems pretty obvious to me that the original patch to xfs_defer_finish_noroll should have done something to lock the CIL to prevent it from switching the CIL contexts while the predicate runs. For upper level code that needs to know if a given log item is new enough not to need relogging, add a new wrapper that takes the CIL context lock long enough to sample the current CIL context. This is kind of racy in that the CIL can switch the contexts immediately after sampling, but that's ok because the consequence is that the defer ops code is a little slow to relog items. ================================================================== BUG: KASAN: use-after-free in xfs_log_item_in_current_chkpt+0x139/0x160 [xfs] Read of size 8 at addr ffff88804ea5f608 by task fsstress/527999 CPU: 1 PID: 527999 Comm: fsstress Tainted: G D 5.16.0-rc4-xfsx #rc4 Call Trace: dump_stack_lvl+0x45/0x59 print_address_description.constprop.0+0x1f/0x140 kasan_report.cold+0x83/0xdf xfs_log_item_in_current_chkpt+0x139/0x160 xfs_defer_finish_noroll+0x3bb/0x1e30 __xfs_trans_commit+0x6c8/0xcf0 xfs_reflink_remap_extent+0x66f/0x10e0 xfs_reflink_remap_blocks+0x2dd/0xa90 xfs_file_remap_range+0x27b/0xc30 vfs_dedupe_file_range_one+0x368/0x420 vfs_dedupe_file_range+0x37c/0x5d0 do_vfs_ioctl+0x308/0x1260 __x64_sys_ioctl+0xa1/0x170 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f2c71a2950b Code: 0f 1e fa 48 8b 05 85 39 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 55 39 0d 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe8c0e03c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00005600862a8740 RCX: 00007f2c71a2950b RDX: 00005600862a7be0 RSI: 00000000c0189436 RDI: 0000000000000004 RBP: 000000000000000b R08: 0000000000000027 R09: 0000000000000003 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000005a R13: 00005600862804a8 R14: 0000000000016000 R15: 00005600862a8a20 Allocated by task 464064: kasan_save_stack+0x1e/0x50 __kasan_kmalloc+0x81/0xa0 kmem_alloc+0xcd/0x2c0 [xfs] xlog_cil_ctx_alloc+0x17/0x1e0 [xfs] xlog_cil_push_work+0x141/0x13d0 [xfs] process_one_work+0x7f6/0x1380 worker_thread+0x59d/0x1040 kthread+0x3b0/0x490 ret_from_fork+0x1f/0x30 Freed by task 51: kasan_save_stack+0x1e/0x50 kasan_set_track+0x21/0x30 kasan_set_free_info+0x20/0x30 __kasan_slab_free+0xed/0x130 slab_free_freelist_hook+0x7f/0x160 kfree+0xde/0x340 xlog_cil_committed+0xbfd/0xfe0 [xfs] xlog_cil_process_committed+0x103/0x1c0 [xfs] xlog_state_do_callback+0x45d/0xbd0 [xfs] xlog_ioend_work+0x116/0x1c0 [xfs] process_one_work+0x7f6/0x1380 worker_thread+0x59d/0x1040 kthread+0x3b0/0x490 ret_from_fork+0x1f/0x30 Last potentially related work creation: kasan_save_stack+0x1e/0x50 __kasan_record_aux_stack+0xb7/0xc0 insert_work+0x48/0x2e0 __queue_work+0x4e7/0xda0 queue_work_on+0x69/0x80 xlog_cil_push_now.isra.0+0x16b/0x210 [xfs] xlog_cil_force_seq+0x1b7/0x850 [xfs] xfs_log_force_seq+0x1c7/0x670 [xfs] xfs_file_fsync+0x7c1/0xa60 [xfs] __x64_sys_fsync+0x52/0x80 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff88804ea5f600 which belongs to the cache kmalloc-256 of size 256 The buggy address is located 8 bytes inside of 256-byte region [ffff88804ea5f600, ffff88804ea5f700) The buggy address belongs to the page: page:ffffea00013a9780 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88804ea5ea00 pfn:0x4ea5e head:ffffea00013a9780 order:1 compound_mapcount:0 flags: 0x4fff80000010200(slab|head|node=1|zone=1|lastcpupid=0xfff) raw: 04fff80000010200 ffffea0001245908 ffffea00011bd388 ffff888004c42b40 raw: ffff88804ea5ea00 0000000000100009 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88804ea5f500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88804ea5f580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88804ea5f600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88804ea5f680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88804ea5f700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ================================================================== Fixes: 4e919af7827a ("xfs: periodically relog deferred intent items") Signed-off-by: Darrick J. Wong --- fs/xfs/xfs_buf_item.c | 2 +- fs/xfs/xfs_log.h | 1 + fs/xfs/xfs_log_cil.c | 25 ++++++++++++++++++++++--- 3 files changed, 24 insertions(+), 4 deletions(-) diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index a7a8e4528881..41a77f4e42ab 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -430,7 +430,7 @@ xfs_buf_item_format( if (bip->bli_flags & XFS_BLI_INODE_BUF) { if (xfs_has_v3inodes(lip->li_mountp) || !((bip->bli_flags & XFS_BLI_INODE_ALLOC_BUF) && - xfs_log_item_in_current_chkpt(lip))) + __xfs_log_item_in_current_chkpt(lip))) bip->__bli_format.blf_flags |= XFS_BLF_INODE_BUF; bip->bli_flags &= ~XFS_BLI_INODE_BUF; } diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index dc1b77b92fc1..6c7b6981c443 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -132,6 +132,7 @@ struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket); void xfs_log_ticket_put(struct xlog_ticket *ticket); void xlog_cil_process_committed(struct list_head *list); +bool __xfs_log_item_in_current_chkpt(struct xfs_log_item *lip); bool xfs_log_item_in_current_chkpt(struct xfs_log_item *lip); void xfs_log_work_queue(struct xfs_mount *mp); diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 6c93c8ada6f3..62e84e040642 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -1441,10 +1441,10 @@ xlog_cil_force_seq( * transaction commit process when deciding what to format into the item. */ bool -xfs_log_item_in_current_chkpt( - struct xfs_log_item *lip) +__xfs_log_item_in_current_chkpt( + struct xfs_log_item *lip) { - struct xfs_cil_ctx *ctx = lip->li_mountp->m_log->l_cilp->xc_ctx; + struct xfs_cil_ctx *ctx = lip->li_mountp->m_log->l_cilp->xc_ctx; if (list_empty(&lip->li_cil)) return false; @@ -1457,6 +1457,25 @@ xfs_log_item_in_current_chkpt( return lip->li_seq == ctx->sequence; } +/* + * Check if the current log item was first committed in this sequence. + * This version locks out CIL flushes, but can only be used to gather + * a rough estimate of the answer. + */ +bool +xfs_log_item_in_current_chkpt( + struct xfs_log_item *lip) +{ + struct xfs_cil *cil = lip->li_mountp->m_log->l_cilp; + bool ret; + + down_read(&cil->xc_ctx_lock); + ret = __xfs_log_item_in_current_chkpt(lip); + up_read(&cil->xc_ctx_lock); + + return ret; +} + /* * Perform initial CIL structure initialisation. */ From patchwork Thu Dec 16 01:09:43 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. Wong" X-Patchwork-Id: 12679789 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 83108C433F5 for ; Thu, 16 Dec 2021 01:09:47 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229888AbhLPBJq (ORCPT ); Wed, 15 Dec 2021 20:09:46 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58298 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229907AbhLPBJo (ORCPT ); Wed, 15 Dec 2021 20:09:44 -0500 Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 52FAFC061574 for ; Wed, 15 Dec 2021 17:09:44 -0800 (PST) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id E7E9261BB8 for ; Thu, 16 Dec 2021 01:09:43 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 4FEACC36AE1; Thu, 16 Dec 2021 01:09:43 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1639616983; bh=yIIFDdIbod588gBMZ0B3qCiTukToiOPvPxFHRRU6mUA=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=aqubLzOpuKvoBvfC4SH0nxX0mKpvvoSK5fvyTF7QKyMq9/FQUcDtUWj6+B2Un3pOX CgDOL5ClpIbg1BugXOlZ+Yor4zX96l8gWEqZ3Ke0XzINTqQwp5Ll6Lndj8WR3nuYLq q1lOPnMeQ65isBs8ryxDVzq7sSSSDCrs6I/1p1ldGK+h8K8Q09nIRySvC/5/MRZpBU eHOVXPFpojf8RUQEF6GpiUV0t88rHN8t88JOuRGjzrtRcxz9VHGQCcOPP0n+G4bdHB LnovHpJloB5BOV2HADsiUj2WEeiybJjbnsEPnjr12Fn2+1eff7Drr/C0wRLyCQ/fR0 vGy/uuRp5kmMQ== Subject: [PATCH 5/7] xfs: fix quotaoff mutex usage now that we don't support disabling it From: "Darrick J. Wong" To: djwong@kernel.org Cc: linux-xfs@vger.kernel.org Date: Wed, 15 Dec 2021 17:09:43 -0800 Message-ID: <163961698300.3129691.4408918955613874796.stgit@magnolia> In-Reply-To: <163961695502.3129691.3496134437073533141.stgit@magnolia> References: <163961695502.3129691.3496134437073533141.stgit@magnolia> User-Agent: StGit/0.19 MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Darrick J. Wong Prior to commit 40b52225e58c ("xfs: remove support for disabling quota accounting on a mounted file system"), we used the quotaoff mutex to protect dquot operations against quotaoff trying to pull down dquots as part of disabling quota. Now that we only support turning off quota enforcement, the quotaoff mutex only protects changes in m_qflags/sb_qflags. We don't need it to protect dquots, which means we can remove it from setqlimits and the dquot scrub code. While we're at it, fix the function that forces quotacheck, since it should have been taking the quotaoff mutex. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner Reviewed-by: Christoph Hellwig --- fs/xfs/scrub/quota.c | 4 ++-- fs/xfs/scrub/repair.c | 3 +++ fs/xfs/scrub/scrub.c | 4 ---- fs/xfs/scrub/scrub.h | 1 - fs/xfs/xfs_qm_syscalls.c | 11 +---------- 5 files changed, 6 insertions(+), 17 deletions(-) diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c index d6c1b00a4fc8..3c7506c7553c 100644 --- a/fs/xfs/scrub/quota.c +++ b/fs/xfs/scrub/quota.c @@ -48,10 +48,10 @@ xchk_setup_quota( dqtype = xchk_quota_to_dqtype(sc); if (dqtype == 0) return -EINVAL; - sc->flags |= XCHK_HAS_QUOTAOFFLOCK; - mutex_lock(&sc->mp->m_quotainfo->qi_quotaofflock); + if (!xfs_this_quota_on(sc->mp, dqtype)) return -ENOENT; + error = xchk_setup_fs(sc); if (error) return error; diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c index 8f3cba14ada3..1e7b6b209ee8 100644 --- a/fs/xfs/scrub/repair.c +++ b/fs/xfs/scrub/repair.c @@ -25,6 +25,7 @@ #include "xfs_ag.h" #include "xfs_ag_resv.h" #include "xfs_quota.h" +#include "xfs_qm.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/trace.h" @@ -912,11 +913,13 @@ xrep_force_quotacheck( if (!(flag & sc->mp->m_qflags)) return; + mutex_lock(&sc->mp->m_quotainfo->qi_quotaofflock); sc->mp->m_qflags &= ~flag; spin_lock(&sc->mp->m_sb_lock); sc->mp->m_sb.sb_qflags &= ~flag; spin_unlock(&sc->mp->m_sb_lock); xfs_log_sb(sc->tp); + mutex_unlock(&sc->mp->m_quotainfo->qi_quotaofflock); } /* diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 8d528d35b725..b11870d07c56 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -173,10 +173,6 @@ xchk_teardown( mnt_drop_write_file(sc->file); if (sc->flags & XCHK_REAPING_DISABLED) xchk_start_reaping(sc); - if (sc->flags & XCHK_HAS_QUOTAOFFLOCK) { - mutex_unlock(&sc->mp->m_quotainfo->qi_quotaofflock); - sc->flags &= ~XCHK_HAS_QUOTAOFFLOCK; - } if (sc->buf) { kmem_free(sc->buf); sc->buf = NULL; diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h index 80e5026bba44..3de5287e98d8 100644 --- a/fs/xfs/scrub/scrub.h +++ b/fs/xfs/scrub/scrub.h @@ -88,7 +88,6 @@ struct xfs_scrub { /* XCHK state flags grow up from zero, XREP state flags grown down from 2^31 */ #define XCHK_TRY_HARDER (1 << 0) /* can't get resources, try again */ -#define XCHK_HAS_QUOTAOFFLOCK (1 << 1) /* we hold the quotaoff lock */ #define XCHK_REAPING_DISABLED (1 << 2) /* background block reaping paused */ #define XREP_ALREADY_FIXED (1 << 31) /* checking our repair work */ diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c index 47fe60e1a887..7d5a31827681 100644 --- a/fs/xfs/xfs_qm_syscalls.c +++ b/fs/xfs/xfs_qm_syscalls.c @@ -302,13 +302,6 @@ xfs_qm_scall_setqlim( if ((newlim->d_fieldmask & XFS_QC_MASK) == 0) return 0; - /* - * We don't want to race with a quotaoff so take the quotaoff lock. - * We don't hold an inode lock, so there's nothing else to stop - * a quotaoff from happening. - */ - mutex_lock(&q->qi_quotaofflock); - /* * Get the dquot (locked) before we start, as we need to do a * transaction to allocate it if it doesn't exist. Once we have the @@ -319,7 +312,7 @@ xfs_qm_scall_setqlim( error = xfs_qm_dqget(mp, id, type, true, &dqp); if (error) { ASSERT(error != -ENOENT); - goto out_unlock; + return error; } defq = xfs_get_defquota(q, xfs_dquot_type(dqp)); @@ -415,8 +408,6 @@ xfs_qm_scall_setqlim( out_rele: xfs_qm_dqrele(dqp); -out_unlock: - mutex_unlock(&q->qi_quotaofflock); return error; } From patchwork Thu Dec 16 01:09:48 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. Wong" X-Patchwork-Id: 12679791 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 797C7C433EF for ; Thu, 16 Dec 2021 01:09:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229772AbhLPBJw (ORCPT ); Wed, 15 Dec 2021 20:09:52 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58328 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229441AbhLPBJv (ORCPT ); Wed, 15 Dec 2021 20:09:51 -0500 Received: from ams.source.kernel.org (ams.source.kernel.org [IPv6:2604:1380:4601:e00::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 48F5DC061574 for ; Wed, 15 Dec 2021 17:09:51 -0800 (PST) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id 12303B8226C for ; Thu, 16 Dec 2021 01:09:50 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id C76C0C36AE0; Thu, 16 Dec 2021 01:09:48 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1639616988; bh=FZUTNG5EFGJx+rII8bxVz9Ui5NlHWS8npk+Z2sLBgC0=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=bYlWCZxFSuZVJpFF4XeV8pvJZFIw+FgLAXLAxsFnBf12XFwTJeTIe/VMDp07UEmOl BkB2RPLE5K2tXo1fb3f/I6QIYV+Su23mUdv4L8pFq14gU0ZWMhPDRlmInB2pmkbniP qVdvuOVcGYK2ZqkvQw9LPAbLSMl1LBUvLZ87zn2wWRQJd5nrI+iK8hq0yC+3waiJH/ 0wLwJgoy0bjNqA/CS3QlkoW7d5vcmMzWv+zae3ELJfNPmDiGGIpjEVCqxjwILHGmVM KNEHTVXbISMcLsU/pyDKy2fUIdGmdwXCkOJR7AfTe2y9TxWY3hQByrXu3LLH8T0rQf C1/EHpgXOn4ug== Subject: [PATCH 6/7] xfs: don't expose internal symlink metadata buffers to the vfs From: "Darrick J. Wong" To: djwong@kernel.org Cc: Ian Kent , linux-xfs@vger.kernel.org Date: Wed, 15 Dec 2021 17:09:48 -0800 Message-ID: <163961698851.3129691.1262560189729839928.stgit@magnolia> In-Reply-To: <163961695502.3129691.3496134437073533141.stgit@magnolia> References: <163961695502.3129691.3496134437073533141.stgit@magnolia> User-Agent: StGit/0.19 MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Darrick J. Wong Ian Kent reported that for inline symlinks, it's possible for vfs_readlink to hang on to the target buffer returned by _vn_get_link_inline long after it's been freed by xfs inode reclaim. This is a layering violation -- we should never expose XFS internals to the VFS. When the symlink has a remote target, we allocate a separate buffer, copy the internal information, and let the VFS manage the new buffer's lifetime. Let's adapt the inline code paths to do this too. It's less efficient, but fixes the layering violation and avoids the need to adapt the if_data lifetime to rcu rules. Clearly I don't care about readlink benchmarks. As a side note, this fixes the minor locking violation where we can access the inode data fork without taking any locks; proper locking (and eliminating the possibility of having to switch inode_operations on a live inode) is essential to online repair coordinating repairs correctly. Reported-by: Ian Kent Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner Acked-by: Ian Kent Reviewed-by: Christoph Hellwig --- fs/xfs/xfs_iops.c | 34 +--------------------------------- fs/xfs/xfs_symlink.c | 19 ++++++++++++++----- 2 files changed, 15 insertions(+), 38 deletions(-) diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index a607d6aca5c4..72bdd7c79e93 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -511,27 +511,6 @@ xfs_vn_get_link( return ERR_PTR(error); } -STATIC const char * -xfs_vn_get_link_inline( - struct dentry *dentry, - struct inode *inode, - struct delayed_call *done) -{ - struct xfs_inode *ip = XFS_I(inode); - char *link; - - ASSERT(ip->i_df.if_format == XFS_DINODE_FMT_LOCAL); - - /* - * The VFS crashes on a NULL pointer, so return -EFSCORRUPTED if - * if_data is junk. - */ - link = ip->i_df.if_u1.if_data; - if (XFS_IS_CORRUPT(ip->i_mount, !link)) - return ERR_PTR(-EFSCORRUPTED); - return link; -} - static uint32_t xfs_stat_blksize( struct xfs_inode *ip) @@ -1250,14 +1229,6 @@ static const struct inode_operations xfs_symlink_inode_operations = { .update_time = xfs_vn_update_time, }; -static const struct inode_operations xfs_inline_symlink_inode_operations = { - .get_link = xfs_vn_get_link_inline, - .getattr = xfs_vn_getattr, - .setattr = xfs_vn_setattr, - .listxattr = xfs_vn_listxattr, - .update_time = xfs_vn_update_time, -}; - /* Figure out if this file actually supports DAX. */ static bool xfs_inode_supports_dax( @@ -1408,10 +1379,7 @@ xfs_setup_iops( inode->i_fop = &xfs_dir_file_operations; break; case S_IFLNK: - if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) - inode->i_op = &xfs_inline_symlink_inode_operations; - else - inode->i_op = &xfs_symlink_inode_operations; + inode->i_op = &xfs_symlink_inode_operations; break; default: inode->i_op = &xfs_inode_operations; diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c index fc2c6a404647..4868bd986f52 100644 --- a/fs/xfs/xfs_symlink.c +++ b/fs/xfs/xfs_symlink.c @@ -22,6 +22,7 @@ #include "xfs_trace.h" #include "xfs_trans.h" #include "xfs_ialloc.h" +#include "xfs_error.h" /* ----- Kernel only functions below ----- */ int @@ -101,12 +102,10 @@ xfs_readlink( { struct xfs_mount *mp = ip->i_mount; xfs_fsize_t pathlen; - int error = 0; + int error = -EFSCORRUPTED; trace_xfs_readlink(ip); - ASSERT(ip->i_df.if_format != XFS_DINODE_FMT_LOCAL); - if (xfs_is_shutdown(mp)) return -EIO; @@ -121,12 +120,22 @@ xfs_readlink( __func__, (unsigned long long) ip->i_ino, (long long) pathlen); ASSERT(0); - error = -EFSCORRUPTED; goto out; } + if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) { + /* + * The VFS crashes on a NULL pointer, so return -EFSCORRUPTED if + * if_data is junk. + */ + if (XFS_IS_CORRUPT(ip->i_mount, !ip->i_df.if_u1.if_data)) + goto out; - error = xfs_readlink_bmap_ilocked(ip, link); + memcpy(link, ip->i_df.if_u1.if_data, ip->i_disk_size + 1); + error = 0; + } else { + error = xfs_readlink_bmap_ilocked(ip, link); + } out: xfs_iunlock(ip, XFS_ILOCK_SHARED); From patchwork Thu Dec 16 01:09:54 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. Wong" X-Patchwork-Id: 12679793 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 33CBAC433EF for ; Thu, 16 Dec 2021 01:09:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229907AbhLPBJz (ORCPT ); Wed, 15 Dec 2021 20:09:55 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58348 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229441AbhLPBJz (ORCPT ); Wed, 15 Dec 2021 20:09:55 -0500 Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4E3B6C061574 for ; Wed, 15 Dec 2021 17:09:55 -0800 (PST) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id E49E261BB8 for ; Thu, 16 Dec 2021 01:09:54 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 48F7BC36AE1; Thu, 16 Dec 2021 01:09:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1639616994; bh=68tLunglQaUDru1VRnVnSNOlJ+lStU8jjkicUwo/NuY=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=XK6kM+qse4a33UdUpfIAjhfGEiUSBrhz9XkowRRqEOvoxXUu+x6r3kh/J9ry/+e/I ocTq5onrt+2rOlLiIeKhFQuXzKrnjib4OgN1bKM173Lf4GQ8oKHTxgpaDCeISQ8pRq stzowCqMmkhw/qRR9E9ymefbHFepYquWWINdcBi0FsCOqtcaXck58SayxzzYy+pKfp jgm744Xh1cnbD1OEPYP8OItn66aZGL8gw7tbKqTPzocGPQP1P7k5yJTGW01ea9biDC OVjMXxCOSh0Z41SQvYcrv9zK31yQokLDRtWrkF5lO717JRlJ432pz6/Y6fQrOU50rH h8JLsQUrT2i7w== Subject: [PATCH 7/7] xfs: only run COW extent recovery when there are no live extents From: "Darrick J. Wong" To: djwong@kernel.org Cc: Chandan Babu R , linux-xfs@vger.kernel.org Date: Wed, 15 Dec 2021 17:09:54 -0800 Message-ID: <163961699399.3129691.9449691191051808697.stgit@magnolia> In-Reply-To: <163961695502.3129691.3496134437073533141.stgit@magnolia> References: <163961695502.3129691.3496134437073533141.stgit@magnolia> User-Agent: StGit/0.19 MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Darrick J. Wong As part of multiple customer escalations due to file data corruption after copy on write operations, I wrote some fstests that use fsstress to hammer on COW to shake things loose. Regrettably, I caught some filesystem shutdowns due to incorrect rmap operations with the following loop: mount # (0) fsstress & # (1) while true; do fsstress mount -o remount,ro # (2) fsstress mount -o remount,rw # (3) done When (2) happens, notice that (1) is still running. xfs_remount_ro will call xfs_blockgc_stop to walk the inode cache to free all the COW extents, but the blockgc mechanism races with (1)'s reader threads to take IOLOCKs and loses, which means that it doesn't clean them all out. Call such a file (A). When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which walks the ondisk refcount btree and frees any COW extent that it finds. This function does not check the inode cache, which means that incore COW forks of inode (A) is now inconsistent with the ondisk metadata. If one of those former COW extents are allocated and mapped into another file (B) and someone triggers a COW to the stale reservation in (A), A's dirty data will be written into (B) and once that's done, those blocks will be transferred to (A)'s data fork without bumping the refcount. The results are catastrophic -- file (B) and the refcount btree are now corrupt. In the first patch, we fixed the race condition in (2) so that (A) will always flush the COW fork. In this second patch, we move the _recover_cow call to the initial mount call in (0) for safety. As mentioned previously, xfs_reflink_recover_cow walks the refcount btree looking for COW staging extents, and frees them. This was intended to be run at mount time (when we know there are no live inodes) to clean up any leftover staging events that may have been left behind during an unclean shutdown. As a time "optimization" for readonly mounts, we deferred this to the ro->rw transition, not realizing that any failure to clean all COW forks during a rw->ro transition would result in catastrophic corruption. Therefore, remove this optimization and only run the recovery routine when we're guaranteed not to have any COW staging extents anywhere, which means we always run this at mount time. While we're at it, move the callsite to xfs_log_mount_finish because any refcount btree expansion (however unlikely given that we're removing records from the right side of the index) must be fed by a per-AG reservation, which doesn't exist in its current location. Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree") Signed-off-by: Darrick J. Wong Reviewed-by: Chandan Babu R Reviewed-by: Dave Chinner Reviewed-by: Christoph Hellwig --- fs/xfs/xfs_log_recover.c | 24 +++++++++++++++++++++++- fs/xfs/xfs_mount.c | 10 ---------- fs/xfs/xfs_reflink.c | 5 ++++- fs/xfs/xfs_super.c | 9 --------- 4 files changed, 27 insertions(+), 21 deletions(-) diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 53366cc0bc9e..8ecb9a8567b7 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -27,7 +27,7 @@ #include "xfs_buf_item.h" #include "xfs_ag.h" #include "xfs_quota.h" - +#include "xfs_reflink.h" #define BLK_AVG(blk1, blk2) ((blk1+blk2) >> 1) @@ -3498,6 +3498,28 @@ xlog_recover_finish( xlog_recover_process_iunlinks(log); xlog_recover_check_summary(log); + + /* + * Recover any CoW staging blocks that are still referenced by the + * ondisk refcount metadata. During mount there cannot be any live + * staging extents as we have not permitted any user modifications. + * Therefore, it is safe to free them all right now, even on a + * read-only mount. + */ + error = xfs_reflink_recover_cow(log->l_mp); + if (error) { + xfs_alert(log->l_mp, + "Failed to recover leftover CoW staging extents, err %d.", + error); + /* + * If we get an error here, make sure the log is shut down + * but return zero so that any log items committed since the + * end of intents processing can be pushed through the CIL + * and AIL. + */ + xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); + } + return 0; } diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index 359109b6f0d3..bed73e8002a5 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -936,15 +936,6 @@ xfs_mountfs( xfs_warn(mp, "Unable to allocate reserve blocks. Continuing without reserve pool."); - /* Recover any CoW blocks that never got remapped. */ - error = xfs_reflink_recover_cow(mp); - if (error) { - xfs_err(mp, - "Error %d recovering leftover CoW allocations.", error); - xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); - goto out_quota; - } - /* Reserve AG blocks for future btree expansion. */ error = xfs_fs_reserve_ag_blocks(mp); if (error && error != -ENOSPC) @@ -955,7 +946,6 @@ xfs_mountfs( out_agresv: xfs_fs_unreserve_ag_blocks(mp); - out_quota: xfs_qm_unmount_quotas(mp); out_rtunmount: xfs_rtunmount_inodes(mp); diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c index cb0edb1d68ef..8b6c7163f684 100644 --- a/fs/xfs/xfs_reflink.c +++ b/fs/xfs/xfs_reflink.c @@ -749,7 +749,10 @@ xfs_reflink_end_cow( } /* - * Free leftover CoW reservations that didn't get cleaned out. + * Free all CoW staging blocks that are still referenced by the ondisk refcount + * metadata. The ondisk metadata does not track which inode created the + * staging extent, so callers must ensure that there are no cached inodes with + * live CoW staging extents. */ int xfs_reflink_recover_cow( diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 778b57b1f020..c7ac486ca5d3 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1739,15 +1739,6 @@ xfs_remount_rw( */ xfs_restore_resvblks(mp); xfs_log_work_queue(mp); - - /* Recover any CoW blocks that never got remapped. */ - error = xfs_reflink_recover_cow(mp); - if (error) { - xfs_err(mp, - "Error %d recovering leftover CoW allocations.", error); - xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); - return error; - } xfs_blockgc_start(mp); /* Create the per-AG metadata reservation pool .*/