From patchwork Mon Mar 7 05:32:50 2022
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Cc: willy@infradead.org
Subject: [PATCH 1/3] xfs: log worker needs to start before intent/unlink recovery
Date: Mon, 7 Mar 2022 16:32:50 +1100
Message-Id: <20220307053252.2534616-2-david@fromorbit.com>
In-Reply-To: <20220307053252.2534616-1-david@fromorbit.com>
References: <20220307053252.2534616-1-david@fromorbit.com>

From: Dave Chinner

After 963 iterations of generic/530, it deadlocked during recovery
on a pinned inode cluster buffer like so:

XFS (pmem1): Starting recovery (logdev: internal)
INFO: task kworker/8:0:306037 blocked for more than 122 seconds.
      Not tainted 5.17.0-rc6-dgc+ #975
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/8:0    state:D stack:13024 pid:306037 ppid:     2 flags:0x00004000
Workqueue: xfs-inodegc/pmem1 xfs_inodegc_worker
Call Trace:
 __schedule+0x30d/0x9e0
 schedule+0x55/0xd0
 schedule_timeout+0x114/0x160
 __down+0x99/0xf0
 down+0x5e/0x70
 xfs_buf_lock+0x36/0xf0
 xfs_buf_find+0x418/0x850
 xfs_buf_get_map+0x47/0x380
 xfs_buf_read_map+0x54/0x240
 xfs_trans_read_buf_map+0x1bd/0x490
 xfs_imap_to_bp+0x4f/0x70
 xfs_iunlink_map_ino+0x66/0xd0
 xfs_iunlink_map_prev.constprop.0+0x148/0x2f0
 xfs_iunlink_remove_inode+0xf2/0x1d0
 xfs_inactive_ifree+0x1a3/0x900
 xfs_inode_unlink+0xcc/0x210
 xfs_inodegc_worker+0x1ac/0x2f0
 process_one_work+0x1ac/0x390
 worker_thread+0x56/0x3c0
 kthread+0xf6/0x120
 ret_from_fork+0x1f/0x30

task:mount           state:D stack:13248 pid:324509 ppid:324233 flags:0x00004000
Call Trace:
 __schedule+0x30d/0x9e0
 schedule+0x55/0xd0
 schedule_timeout+0x114/0x160
 __down+0x99/0xf0
 down+0x5e/0x70
 xfs_buf_lock+0x36/0xf0
 xfs_buf_find+0x418/0x850
 xfs_buf_get_map+0x47/0x380
 xfs_buf_read_map+0x54/0x240
 xfs_trans_read_buf_map+0x1bd/0x490
 xfs_imap_to_bp+0x4f/0x70
 xfs_iget+0x300/0xb40
 xlog_recover_process_one_iunlink+0x4c/0x170
 xlog_recover_process_iunlinks.isra.0+0xee/0x130
 xlog_recover_finish+0x57/0x110
 xfs_log_mount_finish+0xfc/0x1e0
 xfs_mountfs+0x540/0x910
 xfs_fs_fill_super+0x495/0x850
 get_tree_bdev+0x171/0x270
 xfs_fs_get_tree+0x15/0x20
 vfs_get_tree+0x24/0xc0
 path_mount+0x304/0xba0
 __x64_sys_mount+0x108/0x140
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

task:xfsaild/pmem1   state:D stack:14544 pid:324525 ppid:     2 flags:0x00004000
Call Trace:
 __schedule+0x30d/0x9e0
 schedule+0x55/0xd0
 io_schedule+0x4b/0x80
 xfs_buf_wait_unpin+0x9e/0xf0
 __xfs_buf_submit+0x14a/0x230
 xfs_buf_delwri_submit_buffers+0x107/0x280
 xfs_buf_delwri_submit_nowait+0x10/0x20
 xfsaild+0x27e/0x9d0
 kthread+0xf6/0x120
 ret_from_fork+0x1f/0x30

We have the mount process waiting on an inode cluster buffer read,
inodegc doing unlink waiting on the same inode cluster buffer, and
the AIL push thread blocked in writeback waiting for the inode
cluster buffer to become unpinned.

What has happened here is that the AIL push thread has raced with
the inodegc process modifying, committing and pinning the inode
cluster buffer in xfs_buf_delwri_submit_buffers():

	blk_start_plug(&plug);
	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
		if (!wait_list) {
			if (xfs_buf_ispinned(bp)) {
				pinned++;
				continue;
			}
Here >>>>>>	if (!xfs_buf_trylock(bp))
				continue;

Basically, the AIL push thread found the buffer unpinned and got the
lock without blocking, but then the buffer was pinned. This implies
the push thread was preempted between the pin check and the lock,
because the pin count can only be increased while holding the buffer
locked. Hence when it went to submit the IO, it blocked waiting for
the buffer to be unpinned.
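Schematically, the race window looks like this (an illustrative
sketch of the sequence described above, not verbatim kernel code):

	/*
	 * AIL push (delwri submit)       inodegc worker
	 * ------------------------       ------------------------------
	 * xfs_buf_ispinned(bp) -> false
	 * <preempted>
	 *                                xfs_buf_lock(bp)
	 *                                modify, commit and pin the
	 *                                inode cluster buffer
	 *                                xfs_buf_unlock(bp)
	 * xfs_buf_trylock(bp) -> success
	 * __xfs_buf_submit()
	 *   xfs_buf_wait_unpin()  <- blocks: nothing is running that
	 *                            will force the log and unpin it
	 */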
With all executing threads now waiting on the buffer to be unpinned,
we normally get out of situations like this via the background log
worker issuing a log force, which will unpin stuck buffers like
this. But at this point in recovery, we haven't started the log
worker. In fact, the first thing we do after processing intents and
unlinked inodes is *start the log worker*. IOWs, we start it too
late to have it break deadlocks like this.

Avoid this and any other similar deadlock vectors in intent and
unlinked inode recovery by starting the log worker before we recover
intents and unlinked inodes. This part of recovery runs as though
the filesystem is fully active, so we really should have the same
infrastructure running as we normally do at runtime.

Signed-off-by: Dave Chinner
---
 fs/xfs/xfs_log.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 4196655e2ef4..8585305d1769 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -835,10 +835,9 @@ xfs_log_mount_finish(
 	 * mount failure occurs.
 	 */
 	mp->m_super->s_flags |= SB_ACTIVE;
+	xfs_log_work_queue(mp);
 	if (xlog_recovery_needed(log))
 		error = xlog_recover_finish(log);
-	if (!error)
-		xfs_log_work_queue(mp);
 	mp->m_super->s_flags &= ~SB_ACTIVE;
 	evict_inodes(mp->m_super);

From patchwork Mon Mar 7 05:32:51 2022
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Cc: willy@infradead.org
Subject: [PATCH 2/3] xfs: check buffer pin state after locking in delwri_submit
Date: Mon, 7 Mar 2022 16:32:51 +1100
Message-Id: <20220307053252.2534616-3-david@fromorbit.com>
In-Reply-To: <20220307053252.2534616-1-david@fromorbit.com>
References: <20220307053252.2534616-1-david@fromorbit.com>

From: Dave Chinner

AIL flushing can get stuck here:

[316649.005769] INFO: task xfsaild/pmem1:324525 blocked for more than 123 seconds.
[316649.007807]       Not tainted 5.17.0-rc6-dgc+ #975
[316649.009186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[316649.011720] task:xfsaild/pmem1   state:D stack:14544 pid:324525 ppid:     2 flags:0x00004000
[316649.014112] Call Trace:
[316649.015492]  __schedule+0x30d/0x9e0
[316649.017745]  schedule+0x55/0xd0
[316649.018681]  io_schedule+0x4b/0x80
[316649.019683]  xfs_buf_wait_unpin+0x9e/0xf0
[316649.021850]  __xfs_buf_submit+0x14a/0x230
[316649.023033]  xfs_buf_delwri_submit_buffers+0x107/0x280
[316649.024511]  xfs_buf_delwri_submit_nowait+0x10/0x20
[316649.025931]  xfsaild+0x27e/0x9d0
[316649.028283]  kthread+0xf6/0x120
[316649.030602]  ret_from_fork+0x1f/0x30

in the situation where flushing gets preempted between the unpin
check and the buffer trylock under nowait conditions:

	blk_start_plug(&plug);
	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
		if (!wait_list) {
			if (xfs_buf_ispinned(bp)) {
				pinned++;
				continue;
			}
Here >>>>>>	if (!xfs_buf_trylock(bp))
				continue;

This means submission is stuck until something else triggers a log
force to unpin the buffer.

To get onto the delwri list, the buffer pin state has already been
checked, so it is relatively rare that we race with flushing and
encounter a pinned buffer in delwri submission at all. Further, to
increase the pin count the buffer has to be locked, so the only way
we can hit this race without failing the trylock is to be preempted
between the pin count check seeing zero and the trylock being run.

Hence to avoid this problem, just invert the order of the trylock
vs the pin check. We shouldn't hit that many pinned buffers here, so
optimising away the trylock for pinned buffers should not matter for
performance at all.

Signed-off-by: Dave Chinner
---
 fs/xfs/xfs_buf.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 18a0b0b96071..63f27d20cf45 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2125,12 +2125,13 @@ xfs_buf_delwri_submit_buffers(
 	blk_start_plug(&plug);
 	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
 		if (!wait_list) {
+			if (!xfs_buf_trylock(bp))
+				continue;
 			if (xfs_buf_ispinned(bp)) {
+				xfs_buf_unlock(bp);
 				pinned++;
 				continue;
 			}
-			if (!xfs_buf_trylock(bp))
-				continue;
 		} else {
 			xfs_buf_lock(bp);
 		}
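For reference, the nowait branch of the submission loop reads like
this with the patch applied (reassembled from the hunk above, with
added comments on the ordering):

	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
		if (!wait_list) {
			/* Take the lock first ... */
			if (!xfs_buf_trylock(bp))
				continue;
			/*
			 * ... then check the pin count. Pinning
			 * requires the buffer lock, so the count
			 * cannot rise under us before submission.
			 */
			if (xfs_buf_ispinned(bp)) {
				xfs_buf_unlock(bp);
				pinned++;
				continue;
			}
		} else {
			xfs_buf_lock(bp);
		}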
From patchwork Mon Mar 7 05:32:52 2022
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Cc: willy@infradead.org
Subject: [PATCH 3/3] xfs: xfs_ail_push_all_sync() stalls when racing with updates
Date: Mon, 7 Mar 2022 16:32:52 +1100
Message-Id: <20220307053252.2534616-4-david@fromorbit.com>
In-Reply-To: <20220307053252.2534616-1-david@fromorbit.com>
References: <20220307053252.2534616-1-david@fromorbit.com>

From: Dave Chinner

xfs_ail_push_all_sync() has a loop like this:

	while max_ail_lsn {
		prepare_to_wait(ail_empty)
		target = max_ail_lsn
		wake_up(ail_task);
		schedule()
	}

which is designed to sleep until the AIL is emptied. When
xfs_ail_finish_update() moves the tail of the log, it does:

	if (list_empty(&ailp->ail_head))
		wake_up_all(&ailp->ail_empty);

so it will only wake up the sync push waiter when the AIL goes
empty. If, by the time the push waiter has woken, the AIL has more
in it, it will reset the target, wake the push task and go back to
sleep.

The problem here is that if the AIL is having items added to it
when xfs_ail_push_all_sync() is called, then they may get inserted
into the AIL at an LSN higher than the target LSN. At this point,
xfsaild_push() will see that the target is X, the item LSNs are
(X+N) and skip over them, hence never pushing them out.

The result of this is that the AIL will not get emptied by the AIL
push thread, hence xfs_ail_finish_update() will never see the AIL
being empty even if it moves the tail. Hence xfs_ail_push_all_sync()
never gets woken and so cannot update the push target to capture the
items beyond the current target LSN.

This is a TOCTOU type of issue, so the way to avoid it is to not
use the push target at all for sync pushes. We know that a sync push
is being requested by the fact that the ail_empty wait queue is
active, hence the xfsaild can just set the target to max_ail_lsn on
every push where it sees that wait queue active. Hence we no longer
leave items on the AIL that are beyond the LSN sampled at the start
of a sync push.
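Schematically, the lost wakeup looks like this (an illustrative
sketch of the sequence described above, not verbatim kernel code):

	/*
	 * xfs_ail_push_all_sync()       xfsaild / AIL updates
	 * -----------------------       -------------------------------
	 * target = max_ail_lsn (= X)
	 * wake_up_process(ail_task)
	 * schedule()  <- sleeps on ail_empty
	 *                               new items inserted at LSN X+N
	 *                               xfsaild_push() pushes up to X,
	 *                               skips the X+N items
	 *                               AIL never goes empty, so
	 *                               wake_up_all(ail_empty) never runs
	 * [never woken, never resamples max_ail_lsn -> stalls forever]
	 */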
Signed-off-by: Dave Chinner
---
 fs/xfs/xfs_trans_ail.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 2a8c8dc54c95..1b52952097c1 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -448,10 +448,22 @@ xfsaild_push(

 	spin_lock(&ailp->ail_lock);

-	/* barrier matches the ail_target update in xfs_ail_push() */
-	smp_rmb();
-	target = ailp->ail_target;
-	ailp->ail_target_prev = target;
+	/*
+	 * If we have a sync push waiter, we always have to push till the AIL
+	 * is empty. Update the target to point to the end of the AIL so that
+	 * we capture updates that occur after the sync push waiter has gone
+	 * to sleep.
+	 */
+	if (waitqueue_active(&ailp->ail_empty)) {
+		lip = xfs_ail_max(ailp);
+		if (lip)
+			target = lip->li_lsn;
+	} else {
+		/* barrier matches the ail_target update in xfs_ail_push() */
+		smp_rmb();
+		target = ailp->ail_target;
+		ailp->ail_target_prev = target;
+	}

 	/* we're done if the AIL is empty or our push has reached the end */
 	lip = xfs_trans_ail_cursor_first(ailp, &cur, ailp->ail_last_pushed_lsn);
@@ -724,7 +736,6 @@ xfs_ail_push_all_sync(
 	spin_lock(&ailp->ail_lock);
 	while ((lip = xfs_ail_max(ailp)) != NULL) {
 		prepare_to_wait(&ailp->ail_empty, &wait, TASK_UNINTERRUPTIBLE);
-		ailp->ail_target = lip->li_lsn;
 		wake_up_process(ailp->ail_task);
 		spin_unlock(&ailp->ail_lock);
 		schedule();
From patchwork Mon Mar 7 23:18:57 2022
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Cc: willy@infradead.org
Subject: [PATCH 4/3] xfs: async CIL flushes need pending pushes to be made stable
Date: Tue, 8 Mar 2022 10:18:57 +1100
Message-ID: <20220307231857.GR59715@dread.disaster.area>
In-Reply-To: <20220307053252.2534616-1-david@fromorbit.com>
References: <20220307053252.2534616-1-david@fromorbit.com>

From: Dave Chinner

When the AIL tries to flush the CIL, it relies on the CIL push
ending up on stable storage without having to wait for and
manipulate iclog state directly. However, if there is already a
pending CIL push when the AIL tries to flush the CIL, it won't set
the cil->xc_push_commit_stable flag and so the CIL push will not
actively flush the commit record iclog.

generic/530 when run on a single CPU test VM can trigger this fairly
reliably. This test exercises unlinked inode recovery, and can
result in inodes being pinned in memory by ongoing modifications to
the inode cluster buffer to record unlinked list modifications.

As a result, the first inode unlinked in a buffer can pin the tail
of the log whilst the inode cluster buffer is pinned by the current
checkpoint that has been pushed but isn't on stable storage because
the cil->xc_push_commit_stable flag was not set. This results in the
log/AIL effectively deadlocking until something triggers the commit
record iclog to be pushed to stable storage (i.e. the periodic log
worker calling xfs_log_force()).

The fix is simple - always set the cil->xc_push_commit_stable flag
when xlog_cil_flush() is called, regardless of whether there is
already a pending push or not.

Reported-by: Matthew Wilcox
Fixes: 0020a190cf3e ("xfs: AIL needs asynchronous CIL forcing")
Signed-off-by: Dave Chinner
---
 fs/xfs/xfs_log_cil.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 05cee602406c..9785ac0351fe 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -1490,11 +1490,21 @@ xlog_cil_push_now(
 	if (!async)
 		flush_workqueue(cil->xc_push_wq);

+	spin_lock(&cil->xc_push_lock);
+
+	/*
+	 * If this is an async flush request, we always need to set the
+	 * xc_push_commit_stable flag even if something else has already queued
+	 * a push. The flush caller is asking for the CIL to be on stable
+	 * storage when the next push completes, so regardless of who has
+	 * queued the push, the flush requires stable semantics from it.
+	 */
+	cil->xc_push_commit_stable = async;
+
 	/*
 	 * If the CIL is empty or we've already pushed the sequence then
-	 * there's no work we need to do.
+	 * there's no more work that we need to do.
 	 */
-	spin_lock(&cil->xc_push_lock);
 	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) ||
 	    push_seq <= cil->xc_push_seq) {
 		spin_unlock(&cil->xc_push_lock);
@@ -1502,7 +1512,6 @@ xlog_cil_push_now(
 	}

 	cil->xc_push_seq = push_seq;
-	cil->xc_push_commit_stable = async;
 	queue_work(cil->xc_push_wq, &cil->xc_ctx->push_work);
 	spin_unlock(&cil->xc_push_lock);
 }
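Schematically, the hang this fixes looks like this (an illustrative
sketch of the pre-patch behaviour described above, not verbatim
kernel code):

	/*
	 * background CIL push           AIL -> xlog_cil_flush() (async)
	 * -------------------           -------------------------------
	 * xlog_cil_push_now(seq)
	 *   xc_push_seq = seq
	 *   queue_work(push_work)
	 *                               xlog_cil_push_now(seq, async)
	 *                                 push_seq <= xc_push_seq
	 *                                 -> returns without setting
	 *                                    xc_push_commit_stable
	 * push completes, but the commit record iclog is never flushed
	 * to stable storage; the tail of the log stays pinned until
	 * something else (e.g. the log worker) forces the log
	 */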