From patchwork Mon Nov 19 21:04:53 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 10689449 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6748C16B1 for ; Mon, 19 Nov 2018 21:05:39 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 57A052A4C7 for ; Mon, 19 Nov 2018 21:05:39 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 4C3262A60B; Mon, 19 Nov 2018 21:05:39 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id C0F982A550 for ; Mon, 19 Nov 2018 21:05:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730913AbeKTHbE (ORCPT ); Tue, 20 Nov 2018 02:31:04 -0500 Received: from ipmail07.adl2.internode.on.net ([150.101.137.131]:55865 "EHLO ipmail07.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730844AbeKTHbE (ORCPT ); Tue, 20 Nov 2018 02:31:04 -0500 Received: from ppp59-167-129-252.static.internode.on.net (HELO dastard) ([59.167.129.252]) by ipmail07.adl2.internode.on.net with ESMTP; 20 Nov 2018 07:35:03 +1030 Received: from discord.disaster.area ([192.168.1.111]) by dastard with esmtp (Exim 4.80) (envelope-from ) id 1gOqj6-0000XN-Vj for linux-xfs@vger.kernel.org; Tue, 20 Nov 2018 08:05:01 +1100 Received: from dave by discord.disaster.area with local (Exim 4.91) (envelope-from ) id 1gOqj6-0002Fq-UM for linux-xfs@vger.kernel.org; Tue, 20 Nov 2018 08:05:00 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 1/7] xfs: zero length symlinks are not valid Date: Tue, 20 Nov 2018 08:04:53 +1100 Message-Id: <20181119210459.8506-2-david@fromorbit.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181119210459.8506-1-david@fromorbit.com> References: <20181119210459.8506-1-david@fromorbit.com> MIME-Version: 1.0 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Dave Chinner A log recovery failure has been reproduced where a symlink inode has a zero length in extent form. It was caused by a shutdown during a combined fstress+fsmark workload. The underlying problem is the issue in xfs_inactive_symlink(): the inode is unlocked between the symlink inactivation/truncation and the inode being freed. This opens a window for the inode to be written to disk before it xfs_ifree() removes it from the unlinked list, marks it free in the inobt and zeros the mode. For shortform inodes, the fix is simple. xfs_ifree() clears the data fork state, so there's no need to do it in xfs_inactive_symlink(). This means the shortform fork verifier will not see a zero length data fork as it mirrors the inode size through to xfs_ifree()), and hence if the inode gets written back and the fork verifiers are run they will still see a fork that matches the on-disk inode size. For extent form (remote) symlinks, it is a little more tricky. Here we explicitly set the inode size to zero, so the above race can lead to zero length symlinks on disk. Because the inode is unlinked at this point (i.e. on the unlinked list) and unreferenced, it can never be seen again by a user. Hence when we set the inode size to zeor, also change the type to S_IFREG. xfs_ifree() expects S_IFREG inodes to be of zero length, and so this avoids all the problems of zero length symlinks ever hitting the disk. It also avoids the problem of needing to handle zero length symlink inodes in log recovery to replay the extent free intents and the remaining deferops to free the extents the symlink used. Also add a couple of asserts to warn us if zero length symlinks end up in either the symlink create or inactivation paths. Signed-off-by: Dave Chinner --- fs/xfs/libxfs/xfs_symlink_remote.c | 14 +++++++++---- fs/xfs/xfs_symlink.c | 33 +++++++++++++++--------------- 2 files changed, 27 insertions(+), 20 deletions(-) diff --git a/fs/xfs/libxfs/xfs_symlink_remote.c b/fs/xfs/libxfs/xfs_symlink_remote.c index 95374ab2dee7..77d80106f989 100644 --- a/fs/xfs/libxfs/xfs_symlink_remote.c +++ b/fs/xfs/libxfs/xfs_symlink_remote.c @@ -199,7 +199,10 @@ xfs_symlink_local_to_remote( ifp->if_bytes - 1); } -/* Verify the consistency of an inline symlink. */ +/* + * Verify the in-memory consistency of an inline symlink data fork. This + * does not do on-disk format checks. + */ xfs_failaddr_t xfs_symlink_shortform_verify( struct xfs_inode *ip) @@ -215,9 +218,12 @@ xfs_symlink_shortform_verify( size = ifp->if_bytes; endp = sfp + size; - /* Zero length symlinks can exist while we're deleting a remote one. */ - if (size == 0) - return NULL; + /* + * Zero length symlinks should never occur in memory as they are + * never alllowed to exist on disk. + */ + if (!size) + return __this_address; /* No negative sizes or overly long symlink targets. */ if (size < 0 || size > XFS_SYMLINK_MAXLEN) diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c index a3e98c64b6e3..b2c1177c717f 100644 --- a/fs/xfs/xfs_symlink.c +++ b/fs/xfs/xfs_symlink.c @@ -192,6 +192,7 @@ xfs_symlink( pathlen = strlen(target_path); if (pathlen >= XFS_SYMLINK_MAXLEN) /* total string too long */ return -ENAMETOOLONG; + ASSERT(pathlen > 0); udqp = gdqp = NULL; prid = xfs_get_initial_prid(dp); @@ -378,6 +379,12 @@ xfs_symlink( /* * Free a symlink that has blocks associated with it. + * + * Note: zero length symlinks are not allowed to exist. When we set the size to + * zero, also change it to a regular file so that it does not get written to + * disk as a zero length symlink. The inode is on the unlinked list already, so + * userspace cannot find this inode anymore, so this change is not user visible + * but allows us to catch corrupt zero-length symlinks in the verifiers. */ STATIC int xfs_inactive_symlink_rmt( @@ -412,13 +419,14 @@ xfs_inactive_symlink_rmt( xfs_trans_ijoin(tp, ip, 0); /* - * Lock the inode, fix the size, and join it to the transaction. - * Hold it so in the normal path, we still have it locked for - * the second transaction. In the error paths we need it + * Lock the inode, fix the size, turn it into a regular file and join it + * to the transaction. Hold it so in the normal path, we still have it + * locked for the second transaction. In the error paths we need it * held so the cancel won't rele it, see below. */ size = (int)ip->i_d.di_size; ip->i_d.di_size = 0; + VFS_I(ip)->i_mode = (VFS_I(ip)->i_mode & ~S_IFMT) | S_IFREG; xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); /* * Find the block(s) so we can inval and unmap them. @@ -494,17 +502,10 @@ xfs_inactive_symlink( return -EIO; xfs_ilock(ip, XFS_ILOCK_EXCL); - - /* - * Zero length symlinks _can_ exist. - */ pathlen = (int)ip->i_d.di_size; - if (!pathlen) { - xfs_iunlock(ip, XFS_ILOCK_EXCL); - return 0; - } + ASSERT(pathlen); - if (pathlen < 0 || pathlen > XFS_SYMLINK_MAXLEN) { + if (pathlen <= 0 || pathlen > XFS_SYMLINK_MAXLEN) { xfs_alert(mp, "%s: inode (0x%llx) bad symlink length (%d)", __func__, (unsigned long long)ip->i_ino, pathlen); xfs_iunlock(ip, XFS_ILOCK_EXCL); @@ -512,12 +513,12 @@ xfs_inactive_symlink( return -EFSCORRUPTED; } + /* + * Inline fork state gets removed by xfs_difree() so we have nothing to + * do here in that case. + */ if (ip->i_df.if_flags & XFS_IFINLINE) { - if (ip->i_df.if_bytes > 0) - xfs_idata_realloc(ip, -(ip->i_df.if_bytes), - XFS_DATA_FORK); xfs_iunlock(ip, XFS_ILOCK_EXCL); - ASSERT(ip->i_df.if_bytes == 0); return 0; } From patchwork Mon Nov 19 21:04:54 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 10689437 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id DD5C113B5 for ; Mon, 19 Nov 2018 21:05:34 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D0D3C2A550 for ; Mon, 19 Nov 2018 21:05:34 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id C54BE2A64B; Mon, 19 Nov 2018 21:05:34 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6BBEB2A550 for ; Mon, 19 Nov 2018 21:05:34 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730912AbeKTHbA (ORCPT ); Tue, 20 Nov 2018 02:31:00 -0500 Received: from ipmail07.adl2.internode.on.net ([150.101.137.131]:55865 "EHLO ipmail07.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730845AbeKTHbA (ORCPT ); Tue, 20 Nov 2018 02:31:00 -0500 Received: from ppp59-167-129-252.static.internode.on.net (HELO dastard) ([59.167.129.252]) by ipmail07.adl2.internode.on.net with ESMTP; 20 Nov 2018 07:35:03 +1030 Received: from discord.disaster.area ([192.168.1.111]) by dastard with esmtp (Exim 4.80) (envelope-from ) id 1gOqj7-0000XO-0R for linux-xfs@vger.kernel.org; Tue, 20 Nov 2018 08:05:01 +1100 Received: from dave by discord.disaster.area with local (Exim 4.91) (envelope-from ) id 1gOqj6-0002Ft-VV for linux-xfs@vger.kernel.org; Tue, 20 Nov 2018 08:05:00 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 2/7] xfs: uncached buffer tracing needs to print bno Date: Tue, 20 Nov 2018 08:04:54 +1100 Message-Id: <20181119210459.8506-3-david@fromorbit.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181119210459.8506-1-david@fromorbit.com> References: <20181119210459.8506-1-david@fromorbit.com> MIME-Version: 1.0 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Dave Chinner Useless: xfs_buf_get_uncached: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... xfs_buf_unlock: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... xfs_buf_submit: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... xfs_buf_hold: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... xfs_buf_iowait: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... xfs_buf_iodone: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... xfs_buf_iowait_done: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... xfs_buf_rele: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... Useful: xfs_buf_get_uncached: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... xfs_buf_unlock: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... xfs_buf_submit: dev 253:32 bno 0x200b5 nblks 0x1 ... xfs_buf_hold: dev 253:32 bno 0x200b5 nblks 0x1 ... xfs_buf_iowait: dev 253:32 bno 0x200b5 nblks 0x1 ... xfs_buf_iodone: dev 253:32 bno 0x200b5 nblks 0x1 ... xfs_buf_iowait_done: dev 253:32 bno 0x200b5 nblks 0x1 ... xfs_buf_rele: dev 253:32 bno 0x200b5 nblks 0x1 ... Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_trace.h | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index 3043e5ed6495..8a6532aae779 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -280,7 +280,10 @@ DECLARE_EVENT_CLASS(xfs_buf_class, ), TP_fast_assign( __entry->dev = bp->b_target->bt_dev; - __entry->bno = bp->b_bn; + if (bp->b_bn == XFS_BUF_DADDR_NULL) + __entry->bno = bp->b_maps[0].bm_bn; + else + __entry->bno = bp->b_bn; __entry->nblks = bp->b_length; __entry->hold = atomic_read(&bp->b_hold); __entry->pincount = atomic_read(&bp->b_pin_count); From patchwork Mon Nov 19 21:04:55 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 10689439 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1824B16B1 for ; Mon, 19 Nov 2018 21:05:35 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0C4072A550 for ; Mon, 19 Nov 2018 21:05:35 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 00D422A644; Mon, 19 Nov 2018 21:05:34 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 969F72A60B for ; Mon, 19 Nov 2018 21:05:34 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730845AbeKTHbA (ORCPT ); Tue, 20 Nov 2018 02:31:00 -0500 Received: from ipmail07.adl2.internode.on.net ([150.101.137.131]:15953 "EHLO ipmail07.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730844AbeKTHbA (ORCPT ); Tue, 20 Nov 2018 02:31:00 -0500 Received: from ppp59-167-129-252.static.internode.on.net (HELO dastard) ([59.167.129.252]) by ipmail07.adl2.internode.on.net with ESMTP; 20 Nov 2018 07:35:03 +1030 Received: from discord.disaster.area ([192.168.1.111]) by dastard with esmtp (Exim 4.80) (envelope-from ) id 1gOqj7-0000XP-1M for linux-xfs@vger.kernel.org; Tue, 20 Nov 2018 08:05:01 +1100 Received: from dave by discord.disaster.area with local (Exim 4.91) (envelope-from ) id 1gOqj7-0002Fw-0A for linux-xfs@vger.kernel.org; Tue, 20 Nov 2018 08:05:01 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 3/7] xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers Date: Tue, 20 Nov 2018 08:04:55 +1100 Message-Id: <20181119210459.8506-4-david@fromorbit.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181119210459.8506-1-david@fromorbit.com> References: <20181119210459.8506-1-david@fromorbit.com> MIME-Version: 1.0 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Dave Chinner When retrying a failed inode or dquot buffer, xfs_buf_resubmit_failed_buffers() clears all the failed flags from the inde/dquot log items. In doing so, it also drops all the reference counts on the buffer that the failed log items hold. This means it can drop all the active references on the buffer and hence free the buffer before it queues it for write again. Putting the buffer on the delwri queue takes a reference to the buffer (so that it hangs around until it has been written and completed), but this goes bang if the buffer has already been freed. Hence we need to add the buffer to the delwri queue before we remove the failed flags from the log items attached to the buffer to ensure it always remains referenced during the resubmit process. Reported-by: Josef Bacik Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_buf_item.c | 28 +++++++++++++++++++++------- 1 file changed, 21 insertions(+), 7 deletions(-) diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index 12d8455bfbb2..010db5f8fb00 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -1233,9 +1233,23 @@ xfs_buf_iodone( } /* - * Requeue a failed buffer for writeback + * Requeue a failed buffer for writeback. * - * Return true if the buffer has been re-queued properly, false otherwise + * We clear the log item failed state here as well, but we have to be careful + * about reference counts because the only active reference counts on the buffer + * may be the failed log items. Hence if we clear the log item failed state + * before queuing the buffer for IO we can release all active references to + * the buffer and free it, leading to use after free problems in + * xfs_buf_delwri_queue. It makes no difference to the buffer or log items which + * order we process them in - the buffer is locked, and we own the buffer list + * so nothing on them is going to change while we are performing this action. + * + * Hence we can safely queue the buffer for IO before we clear the failed log + * item state, therefore always having an active reference to the buffer and + * avoiding the transient zero-reference state that leads to use-after-free. + * + * Return true if the buffer was added to the buffer list, false if it was + * already on the buffer list. */ bool xfs_buf_resubmit_failed_buffers( @@ -1243,16 +1257,16 @@ xfs_buf_resubmit_failed_buffers( struct list_head *buffer_list) { struct xfs_log_item *lip; + bool ret; + + ret = xfs_buf_delwri_queue(bp, buffer_list); /* - * Clear XFS_LI_FAILED flag from all items before resubmit - * - * XFS_LI_FAILED set/clear is protected by ail_lock, caller this + * XFS_LI_FAILED set/clear is protected by ail_lock, caller of this * function already have it acquired */ list_for_each_entry(lip, &bp->b_li_list, li_bio_list) xfs_clear_li_failed(lip); - /* Add this buffer back to the delayed write list */ - return xfs_buf_delwri_queue(bp, buffer_list); + return ret; } From patchwork Mon Nov 19 21:04:56 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 10689443 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E9A6F16B1 for ; Mon, 19 Nov 2018 21:05:36 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id DCFB62A325 for ; Mon, 19 Nov 2018 21:05:36 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id D11742A550; Mon, 19 Nov 2018 21:05:36 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 776F72A325 for ; Mon, 19 Nov 2018 21:05:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730917AbeKTHbC (ORCPT ); Tue, 20 Nov 2018 02:31:02 -0500 Received: from ipmail07.adl2.internode.on.net ([150.101.137.131]:55865 "EHLO ipmail07.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730844AbeKTHbC (ORCPT ); Tue, 20 Nov 2018 02:31:02 -0500 Received: from ppp59-167-129-252.static.internode.on.net (HELO dastard) ([59.167.129.252]) by ipmail07.adl2.internode.on.net with ESMTP; 20 Nov 2018 07:35:03 +1030 Received: from discord.disaster.area ([192.168.1.111]) by dastard with esmtp (Exim 4.80) (envelope-from ) id 1gOqj7-0000XQ-2L for linux-xfs@vger.kernel.org; Tue, 20 Nov 2018 08:05:01 +1100 Received: from dave by discord.disaster.area with local (Exim 4.91) (envelope-from ) id 1gOqj7-0002Fz-15 for linux-xfs@vger.kernel.org; Tue, 20 Nov 2018 08:05:01 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 4/7] xfs: finobt AG reserves don't consider last AG can be a runt Date: Tue, 20 Nov 2018 08:04:56 +1100 Message-Id: <20181119210459.8506-5-david@fromorbit.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181119210459.8506-1-david@fromorbit.com> References: <20181119210459.8506-1-david@fromorbit.com> MIME-Version: 1.0 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Dave Chinner The last AG may be very small comapred to all other AGs, and hence AG reservations based on the superblock AG size may actually consume more space than the AG actually has. This results on assert failures like: XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 319 [ 48.932891] xfs_ag_resv_init+0x1bd/0x1d0 [ 48.933853] xfs_fs_reserve_ag_blocks+0x37/0xb0 [ 48.934939] xfs_mountfs+0x5b3/0x920 [ 48.935804] xfs_fs_fill_super+0x462/0x640 [ 48.936784] ? xfs_test_remount_options+0x60/0x60 [ 48.937908] mount_bdev+0x178/0x1b0 [ 48.938751] mount_fs+0x36/0x170 [ 48.939533] vfs_kern_mount.part.43+0x54/0x130 [ 48.940596] do_mount+0x20e/0xcb0 [ 48.941396] ? memdup_user+0x3e/0x70 [ 48.942249] ksys_mount+0xba/0xd0 [ 48.943046] __x64_sys_mount+0x21/0x30 [ 48.943953] do_syscall_64+0x54/0x170 [ 48.944835] entry_SYSCALL_64_after_hwframe+0x49/0xbe Hence we need to ensure the finobt per-ag space reservations take into account the size of the last AG rather than treat it like all the other full size AGs. Note that both refcountbt and rmapbt already take the size of the AG into account via reading the AGF length directly. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong --- fs/xfs/libxfs/xfs_ialloc_btree.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c index 86c50208a143..7fbf8af0b159 100644 --- a/fs/xfs/libxfs/xfs_ialloc_btree.c +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c @@ -538,15 +538,18 @@ xfs_inobt_rec_check_count( static xfs_extlen_t xfs_inobt_max_size( - struct xfs_mount *mp) + struct xfs_mount *mp, + xfs_agnumber_t agno) { + xfs_agblock_t agblocks = xfs_ag_block_count(mp, agno); + /* Bail out if we're uninitialized, which can happen in mkfs. */ if (mp->m_inobt_mxr[0] == 0) return 0; return xfs_btree_calc_size(mp->m_inobt_mnr, - (uint64_t)mp->m_sb.sb_agblocks * mp->m_sb.sb_inopblock / - XFS_INODES_PER_CHUNK); + (uint64_t)agblocks * mp->m_sb.sb_inopblock / + XFS_INODES_PER_CHUNK); } static int @@ -594,7 +597,7 @@ xfs_finobt_calc_reserves( if (error) return error; - *ask += xfs_inobt_max_size(mp); + *ask += xfs_inobt_max_size(mp, agno); *used += tree_len; return 0; } From patchwork Mon Nov 19 21:04:57 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 10689435 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 5FB3214BD for ; Mon, 19 Nov 2018 21:05:31 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 51D6A2A550 for ; Mon, 19 Nov 2018 21:05:31 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 462192A644; Mon, 19 Nov 2018 21:05:31 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E7A772A550 for ; Mon, 19 Nov 2018 21:05:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730840AbeKTHa5 (ORCPT ); Tue, 20 Nov 2018 02:30:57 -0500 Received: from ipmail07.adl2.internode.on.net ([150.101.137.131]:55865 "EHLO ipmail07.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729428AbeKTHa5 (ORCPT ); Tue, 20 Nov 2018 02:30:57 -0500 Received: from ppp59-167-129-252.static.internode.on.net (HELO dastard) ([59.167.129.252]) by ipmail07.adl2.internode.on.net with ESMTP; 20 Nov 2018 07:35:03 +1030 Received: from discord.disaster.area ([192.168.1.111]) by dastard with esmtp (Exim 4.80) (envelope-from ) id 1gOqj7-0000XR-3J for linux-xfs@vger.kernel.org; Tue, 20 Nov 2018 08:05:01 +1100 Received: from dave by discord.disaster.area with local (Exim 4.91) (envelope-from ) id 1gOqj7-0002G2-22 for linux-xfs@vger.kernel.org; Tue, 20 Nov 2018 08:05:01 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 5/7] xfs: extent shifting doesn't fully invalidate page cache Date: Tue, 20 Nov 2018 08:04:57 +1100 Message-Id: <20181119210459.8506-6-david@fromorbit.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181119210459.8506-1-david@fromorbit.com> References: <20181119210459.8506-1-david@fromorbit.com> MIME-Version: 1.0 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Dave Chinner The extent shifting code uses a flush and invalidate mechainsm prior to shifting extents around. This is similar to what xfs_free_file_space() does, but it doesn't take into account things like page cache vs block size differences, and it will fail if there is a page that it currently busy. xfs_flush_unmap_range() handles all of these cases, so just convert xfs_prepare_shift() to us that mechanism rather than having it's own special sauce. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_bmap_util.c | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index 5d263dfdb3bc..167ff4297e5c 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -1195,13 +1195,7 @@ xfs_prepare_shift( * Writeback and invalidate cache for the remainder of the file as we're * about to shift down every extent from offset to EOF. */ - error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping, offset, -1); - if (error) - return error; - error = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping, - offset >> PAGE_SHIFT, -1); - if (error) - return error; + error = xfs_flush_unmap_range(ip, offset, XFS_ISIZE(ip)); /* * Clean out anything hanging around in the cow fork now that From patchwork Mon Nov 19 21:04:58 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 10689441 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8AC4514BD for ; Mon, 19 Nov 2018 21:05:36 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 7D7742A550 for ; Mon, 19 Nov 2018 21:05:36 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 71F112A60B; Mon, 19 Nov 2018 21:05:36 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 221272A550 for ; Mon, 19 Nov 2018 21:05:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730916AbeKTHbC (ORCPT ); Tue, 20 Nov 2018 02:31:02 -0500 Received: from ipmail07.adl2.internode.on.net ([150.101.137.131]:15953 "EHLO ipmail07.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730913AbeKTHbC (ORCPT ); Tue, 20 Nov 2018 02:31:02 -0500 Received: from ppp59-167-129-252.static.internode.on.net (HELO dastard) ([59.167.129.252]) by ipmail07.adl2.internode.on.net with ESMTP; 20 Nov 2018 07:35:03 +1030 Received: from discord.disaster.area ([192.168.1.111]) by dastard with esmtp (Exim 4.80) (envelope-from ) id 1gOqj7-0000XW-4V for linux-xfs@vger.kernel.org; Tue, 20 Nov 2018 08:05:01 +1100 Received: from dave by discord.disaster.area with local (Exim 4.91) (envelope-from ) id 1gOqj7-0002G5-33 for linux-xfs@vger.kernel.org; Tue, 20 Nov 2018 08:05:01 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 6/7] xfs: don't ENOSPC on writeback when punching holes Date: Tue, 20 Nov 2018 08:04:58 +1100 Message-Id: <20181119210459.8506-7-david@fromorbit.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181119210459.8506-1-david@fromorbit.com> References: <20181119210459.8506-1-david@fromorbit.com> MIME-Version: 1.0 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Dave Chinner Just tried to punch a 40T sparse file when enospc was triggered due to extent size hints consuming more space than expected. It failed with: # sudo xfs_io -c "fpunch 0 40t" /mnt/fast/xfs-arekm.img fallocate: No space left on device # because the writeback error of ENOSPC was being reported. We're going to free that space, so we don't care if there was a ENOSPC writeback error. So ignore ENOSPC errors and punch anyway. Signed-off-by: Dave Chinner --- fs/xfs/xfs_bmap_util.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index 167ff4297e5c..cc7a0d47c529 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -1060,8 +1060,13 @@ xfs_flush_unmap_range( start = round_down(offset, rounding); end = round_up(offset + len, rounding) - 1; + /* + * We're going to trash the data in this range, so if writeback reports + * an enospc error, don't let it stop us from /freeing the space/ in the + * range to alleviate the ENOSPC condition. + */ error = filemap_write_and_wait_range(inode->i_mapping, start, end); - if (error) + if (error && error != -ENOSPC) return error; truncate_pagecache_range(inode, start, end); return 0; From patchwork Mon Nov 19 21:04:59 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 10689447 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BAF8414BD for ; Mon, 19 Nov 2018 21:05:38 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id ACDFE2A4C7 for ; Mon, 19 Nov 2018 21:05:38 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id A16D52A60B; Mon, 19 Nov 2018 21:05:38 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 3E9882A4C7 for ; Mon, 19 Nov 2018 21:05:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730915AbeKTHbE (ORCPT ); Tue, 20 Nov 2018 02:31:04 -0500 Received: from ipmail07.adl2.internode.on.net ([150.101.137.131]:15953 "EHLO ipmail07.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730913AbeKTHbE (ORCPT ); Tue, 20 Nov 2018 02:31:04 -0500 Received: from ppp59-167-129-252.static.internode.on.net (HELO dastard) ([59.167.129.252]) by ipmail07.adl2.internode.on.net with ESMTP; 20 Nov 2018 07:35:03 +1030 Received: from discord.disaster.area ([192.168.1.111]) by dastard with esmtp (Exim 4.80) (envelope-from ) id 1gOqj7-0000Xb-5W for linux-xfs@vger.kernel.org; Tue, 20 Nov 2018 08:05:01 +1100 Received: from dave by discord.disaster.area with local (Exim 4.91) (envelope-from ) id 1gOqj7-0002G8-48 for linux-xfs@vger.kernel.org; Tue, 20 Nov 2018 08:05:01 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 7/7] xfs: flush removing page cache in xfs_reflink_remap_prep Date: Tue, 20 Nov 2018 08:04:59 +1100 Message-Id: <20181119210459.8506-8-david@fromorbit.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181119210459.8506-1-david@fromorbit.com> References: <20181119210459.8506-1-david@fromorbit.com> MIME-Version: 1.0 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Dave Chinner On a sub-page block size filesystem, fsx is failing with a data corruption after a series of operations involving copying a file with the destination offset beyond EOF of the destination of the file: 8093(157 mod 256): TRUNCATE DOWN from 0x7a120 to 0x50000 ******WWWW 8094(158 mod 256): INSERT 0x25000 thru 0x25fff (0x1000 bytes) 8095(159 mod 256): COPY 0x18000 thru 0x1afff (0x3000 bytes) to 0x2f400 8096(160 mod 256): WRITE 0x5da00 thru 0x651ff (0x7800 bytes) HOLE 8097(161 mod 256): COPY 0x2000 thru 0x5fff (0x4000 bytes) to 0x6fc00 The second copy here is beyond EOF, and it is to sub-page (4k) but block aligned (1k) offset. The clone runs the EOF zeroing, landing in a pre-existing post-eof delalloc extent. This zeroes the post-eof extents in the page cache just fine, dirtying the pages correctly. The problem is that xfs_reflink_remap_prep() now truncates the page cache over the range that it is copying it to, and rounds that down to cover the entire start page. This removes the dirty page over the delalloc extent from the page cache without having written it back. Hence later, when the page cache is flushed, the page at offset 0x6f000 has not been written back and hence exposes stale data, which fsx trips over less than 10 operations later. Fix this by changing xfs_reflink_remap_prep() to use xfs_flush_unmap_range(). Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_bmap_util.c | 2 +- fs/xfs/xfs_bmap_util.h | 3 +++ fs/xfs/xfs_reflink.c | 17 +++++++++++++---- 3 files changed, 17 insertions(+), 5 deletions(-) diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index cc7a0d47c529..3e66cf0520a9 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -1042,7 +1042,7 @@ xfs_unmap_extent( goto out_unlock; } -static int +int xfs_flush_unmap_range( struct xfs_inode *ip, xfs_off_t offset, diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h index 87363d136bb6..7a78229cf1a7 100644 --- a/fs/xfs/xfs_bmap_util.h +++ b/fs/xfs/xfs_bmap_util.h @@ -80,4 +80,7 @@ int xfs_bmap_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip, int whichfork, xfs_extnum_t *nextents, xfs_filblks_t *count); +int xfs_flush_unmap_range(struct xfs_inode *ip, xfs_off_t offset, + xfs_off_t len); + #endif /* __XFS_BMAP_UTIL_H__ */ diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c index ecdb086bc23e..a41590b7c229 100644 --- a/fs/xfs/xfs_reflink.c +++ b/fs/xfs/xfs_reflink.c @@ -1351,10 +1351,19 @@ xfs_reflink_remap_prep( if (ret) goto out_unlock; - /* Zap any page cache for the destination file's range. */ - truncate_inode_pages_range(&inode_out->i_data, - round_down(pos_out, PAGE_SIZE), - round_up(pos_out + *len, PAGE_SIZE) - 1); + /* + * If pos_out > EOF, we make have dirtied blocks between EOF and + * pos_out. In that case, we need to extend the flush and unmap to cover + * from EOF to the end of the copy length. + */ + if (pos_out > XFS_ISIZE(dest)) { + loff_t flen = *len + (pos_out - XFS_ISIZE(dest)); + ret = xfs_flush_unmap_range(dest, XFS_ISIZE(dest), flen); + } else { + ret = xfs_flush_unmap_range(dest, pos_out, *len); + } + if (ret) + goto out_unlock; return 1; out_unlock: From patchwork Tue Nov 20 06:36:05 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 10689919 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8200413BF for ; Tue, 20 Nov 2018 06:36:11 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 7117929B87 for ; Tue, 20 Nov 2018 06:36:11 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 652E02A4CC; Tue, 20 Nov 2018 06:36:11 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 512DD29B87 for ; Tue, 20 Nov 2018 06:36:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726979AbeKTRDh (ORCPT ); Tue, 20 Nov 2018 12:03:37 -0500 Received: from ipmail06.adl6.internode.on.net ([150.101.137.145]:45232 "EHLO ipmail06.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726883AbeKTRDh (ORCPT ); Tue, 20 Nov 2018 12:03:37 -0500 Received: from ppp59-167-129-252.static.internode.on.net (HELO dastard) ([59.167.129.252]) by ipmail06.adl6.internode.on.net with ESMTP; 20 Nov 2018 17:06:07 +1030 Received: from dave by dastard with local (Exim 4.80) (envelope-from ) id 1gOzdl-0001BP-O5 for linux-xfs@vger.kernel.org; Tue, 20 Nov 2018 17:36:05 +1100 Date: Tue, 20 Nov 2018 17:36:05 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 8/7] xfs: delalloc -> unwritten COW fork allocation can go wrong Message-ID: <20181120063605.GN19305@dastard> References: <20181119210459.8506-1-david@fromorbit.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20181119210459.8506-1-david@fromorbit.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Dave Chinner Long saga. There have been days spent following this through dead end after dead end in multi-GB event traces. This morning, after writing a trace-cmd wrapper that enabled me to be more selective about XFS trace points, I discovered that I could get just enough essential tracepoints enabled that there was a 50:50 chance the fsx config would fail at ~115k ops. If it didn't fail at op 115547, I stopped fsx at op 115548 anyway. That gave me two traces - one where the problem manifested, and one where it didn't. After refining the traces to have the necessary information, I found that in the failing case there was a real extent in the COW fork compared to an unwritten extent in the working case. Walking back through the two traces to the point where the CWO fork extents actually diverged, I found that the bad case had an extra unwritten extent in it. This is likely because the bug it led me to had triggered multiple times in those 115k ops, leaving stray COW extents around. What I saw was a COW delalloc conversion to an unwritten extent (as they should always be through xfs_iomap_write_allocate()) resulted in a /written extent/: xfs_writepage: dev 259:0 ino 0x83 pgoff 0x17000 size 0x79a00 offset 0 length 0 xfs_iext_remove: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/2 offset 32 block 152 count 20 flag 1 caller xfs_bmap_add_extent_delay_real xfs_bmap_pre_update: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 4503599627239429 count 31 flag 0 caller xfs_bmap_add_extent_delay_real xfs_bmap_post_update: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 121 count 51 flag 0 caller xfs_bmap_add_ex Basically, Cow fork before: 0 1 32 52 +H+DDDDDDDDDDDD+UUUUUUUUUUU+ PREV RIGHT COW delalloc conversion allocates: 1 32 +uuuuuuuuuuuu+ NEW And the result according to the xfs_bmap_post_update trace was: 0 1 32 52 +H+wwwwwwwwwwwwwwwwwwwwwwww+ PREV Which is clearly wrong - it should be a merged unwritten extent, not an unwritten extent. That lead me to look at the LEFT_FILLING|RIGHT_FILLING|RIGHT_CONTIG case in xfs_bmap_add_extent_delay_real(), and sure enough, there's the bug. It takes the old delalloc extent (PREV) and adds the length of the RIGHT extent to it, takes the start block from NEW, removes the RIGHT extent and then updates PREV with the new extent. What it fails to do is update PREV.br_state. For delalloc, this is always XFS_EXT_NORM, while in this case we are converting the delayed allocation to unwritten, so it needs to be updated to XFS_EXT_UNWRITTEN. This LF|RF|RC case does not do this, and so the resultant extent is always written. And that's the bug I've been chasing for a week - a bmap btree bug, not a reflink/dedupe/copy_file_range bug, but a BMBT bug introduced with the recent in core extent tree scalability enhancements. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong --- fs/xfs/libxfs/xfs_bmap.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 2d78fd53e822..39eaa2b86060 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -1694,10 +1694,13 @@ xfs_bmap_add_extent_delay_real( case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING | BMAP_RIGHT_CONTIG: /* * Filling in all of a previously delayed allocation extent. - * The right neighbor is contiguous, the left is not. + * The right neighbor is contiguous, the left is not. Take care + * with delay -> unwritten extent allocation here because the + * delalloc record we are overwriting is always written. */ PREV.br_startblock = new->br_startblock; PREV.br_blockcount += RIGHT.br_blockcount; + PREV.br_state = new->br_state; xfs_iext_next(ifp, &bma->icur); xfs_iext_remove(bma->ip, &bma->icur, state);