From patchwork Tue Oct 30 11:20:43 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dave Chinner
X-Patchwork-Id: 10660681
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 7/7] repair: parallelise phase 6
Date: Tue, 30 Oct 2018 22:20:43 +1100
Message-Id: <20181030112043.6034-8-david@fromorbit.com>
X-Mailer: git-send-email 2.19.1
In-Reply-To: <20181030112043.6034-1-david@fromorbit.com>
References: <20181030112043.6034-1-david@fromorbit.com>
X-Mailing-List: linux-xfs@vger.kernel.org

From: Dave Chinner

A recent metadump provided to us caused repair to take hours in
phase 6. It wasn't IO bound - it was fully CPU bound the entire time.
The only way to speed it up is to make phase 6 run multiple
concurrent processing threads.

The obvious way to do this is to spread the concurrency across AGs,
like the other phases, and while this works it is not optimal. When
a processing thread hits a really large directory, it essentially
sits CPU bound until that directory is processed. If an AG has lots
of large directories, we end up with a really long single-threaded
tail that limits concurrency.

Hence we also need to have concurrency /within/ the AG. This is
relatively easy, as the inode chunk records allow for a simple
concurrency mechanism within an AG. We can simply feed each chunk
record to a workqueue, and we get concurrency within the AG for
free. However, this allows prefetch to run way ahead of processing,
which blows out the buffer cache size and can cause OOM.

To avoid this, we can use the new workqueue depth limiting to limit
the number of inode chunks queued, and this then backs up the inode
prefetching to its maximum queue depth. Hence, by throttling the
prefetch consumer, we prevent the prefetch code from queueing the
entire AG's inode chunks on the workqueue and blowing out memory.

This takes phase 6 from taking many, many hours down to:

Phase 6:        10/30 21:12:58  10/30 21:40:48  27 minutes, 50 seconds

And burning 20-30 CPUs that entire time on my test rig.

Signed-off-by: Dave Chinner
---
 repair/phase6.c | 43 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 35 insertions(+), 8 deletions(-)

diff --git a/repair/phase6.c b/repair/phase6.c
index 9477bc2527f3..422a53bca6c7 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -6,6 +6,7 @@
 
 #include "libxfs.h"
 #include "threads.h"
+#include "threads.h"
 #include "prefetch.h"
 #include "avl.h"
 #include "globals.h"
@@ -3169,20 +3170,45 @@ check_for_orphaned_inodes(
 }
 
 static void
-traverse_function(
+do_dir_inode(
 	struct workqueue	*wq,
-	xfs_agnumber_t 		agno,
+	xfs_agnumber_t		agno,
 	void			*arg)
 {
-	ino_tree_node_t 	*irec;
+	struct ino_tree_node	*irec = arg;
 	int			i;
+
+	for (i = 0; i < XFS_INODES_PER_CHUNK; i++) {
+		if (inode_isadir(irec, i))
+			process_dir_inode(wq->wq_ctx, agno, irec, i);
+	}
+}
+
+static void
+traverse_function(
+	struct workqueue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct ino_tree_node	*irec;
 	prefetch_args_t		*pf_args = arg;
+	struct workqueue	lwq;
+	struct xfs_mount	*mp = wq->wq_ctx;
+
 
 	wait_for_inode_prefetch(pf_args);
 
 	if (verbose)
 		do_log(_(" - agno = %d\n"), agno);
 
+	/*
+	 * The more AGs we have in flight at once, the fewer processing threads
+	 * per AG. This means we don't overwhelm the machine with hundreds of
+	 * threads when we start acting on lots of AGs at once. We just want
+	 * enough that we can keep multiple CPUs busy across multiple AGs.
+	 */
+	workqueue_create_bound(&lwq, mp, ag_stride, 1000);
+
 	for (irec = findfirst_inode_rec(agno); irec; irec = next_ino_rec(irec)) {
 		if (irec->ino_isa_dir == 0)
 			continue;
@@ -3190,18 +3216,19 @@ traverse_function(
 		if (pf_args) {
 			sem_post(&pf_args->ra_count);
 #ifdef XR_PF_TRACE
+			{
+			int	i;
 			sem_getvalue(&pf_args->ra_count, &i);
 			pftrace(
 		"processing inode chunk %p in AG %d (sem count = %d)",
 				irec, agno, i);
+			}
 #endif
 		}
 
-		for (i = 0; i < XFS_INODES_PER_CHUNK; i++) {
-			if (inode_isadir(irec, i))
-				process_dir_inode(wq->wq_ctx, agno, irec, i);
-		}
+		queue_work(&lwq, do_dir_inode, agno, irec);
 	}
+	destroy_work_queue(&lwq);
 	cleanup_inode_prefetch(pf_args);
 }
 
@@ -3229,7 +3256,7 @@ static void
 traverse_ags(
 	struct xfs_mount	*mp)
 {
-	do_inode_prefetch(mp, 0, traverse_function, false, true);
+	do_inode_prefetch(mp, ag_stride, traverse_function, false, true);
 }
 
 void
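
The throttling that the commit message describes (a depth-limited
workqueue that blocks the producer once enough inode chunks are
queued) can be sketched outside xfs_repair with plain pthreads. This
is only a minimal illustration under stated assumptions, not the
xfsprogs workqueue code: struct bounded_wq, bounded_worker(),
bounded_queue_work() and run_bounded_demo() are hypothetical names
invented here, and integer work items stand in for inode chunk
records.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical bounded queue for illustration; not xfsprogs' struct workqueue. */
struct bounded_wq {
	pthread_mutex_t	lock;
	pthread_cond_t	not_empty;	/* workers sleep here when idle */
	pthread_cond_t	not_full;	/* producer sleeps here when throttled */
	int		*items;		/* ring buffer of queued work items */
	int		head;		/* index of the oldest queued item */
	int		count;		/* items currently queued */
	int		max_queued;	/* depth limit */
	int		max_seen;	/* deepest the queue ever got */
	int		done;		/* producer has queued everything */
	long		sum;		/* stand-in for "work done" */
};

static void *bounded_worker(void *arg)
{
	struct bounded_wq	*wq = arg;

	for (;;) {
		pthread_mutex_lock(&wq->lock);
		while (wq->count == 0 && !wq->done)
			pthread_cond_wait(&wq->not_empty, &wq->lock);
		if (wq->count == 0) {
			/* Producer finished and the queue is drained. */
			pthread_mutex_unlock(&wq->lock);
			return NULL;
		}
		/*
		 * "Process" the item by adding it to the sum. A real worker
		 * would drop the lock before doing any expensive processing.
		 */
		wq->sum += wq->items[wq->head];
		wq->head = (wq->head + 1) % wq->max_queued;
		wq->count--;
		pthread_cond_signal(&wq->not_full);
		pthread_mutex_unlock(&wq->lock);
	}
}

static void bounded_queue_work(struct bounded_wq *wq, int item)
{
	pthread_mutex_lock(&wq->lock);
	/* Throttle: the producer blocks here while the queue is at its limit. */
	while (wq->count == wq->max_queued)
		pthread_cond_wait(&wq->not_full, &wq->lock);
	wq->items[(wq->head + wq->count) % wq->max_queued] = item;
	wq->count++;
	if (wq->count > wq->max_seen)
		wq->max_seen = wq->count;
	pthread_cond_signal(&wq->not_empty);
	pthread_mutex_unlock(&wq->lock);
}

/* Queue 0..nitems-1, return their sum; *max_seen is the deepest queue depth. */
static long run_bounded_demo(int nworkers, int max_queued, int nitems,
			     int *max_seen)
{
	struct bounded_wq	wq = { .max_queued = max_queued };
	pthread_t		*threads;
	int			i;

	assert(nworkers > 0 && max_queued > 0);
	threads = calloc(nworkers, sizeof(*threads));
	wq.items = calloc(max_queued, sizeof(*wq.items));
	assert(threads && wq.items);
	pthread_mutex_init(&wq.lock, NULL);
	pthread_cond_init(&wq.not_empty, NULL);
	pthread_cond_init(&wq.not_full, NULL);

	for (i = 0; i < nworkers; i++)
		pthread_create(&threads[i], NULL, bounded_worker, &wq);
	for (i = 0; i < nitems; i++)
		bounded_queue_work(&wq, i);

	/* Tell the workers that no more work is coming, then wait for them. */
	pthread_mutex_lock(&wq.lock);
	wq.done = 1;
	pthread_cond_broadcast(&wq.not_empty);
	pthread_mutex_unlock(&wq.lock);
	for (i = 0; i < nworkers; i++)
		pthread_join(threads[i], NULL);

	*max_seen = wq.max_seen;
	pthread_cond_destroy(&wq.not_full);
	pthread_cond_destroy(&wq.not_empty);
	pthread_mutex_destroy(&wq.lock);
	free(wq.items);
	free(threads);
	return wq.sum;
}
```

Calling run_bounded_demo(4, 8, 1000, &depth) queues 1000 items to
four workers, and depth never exceeds 8 because the producer sleeps
on not_full whenever the queue is at its limit. In the patch the
blocked producer is traverse_function(), so it stops posting to
pf_args->ra_count, which is the throttling of the prefetch consumer
that the commit message describes.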