[0/7] repair: Phase 6 performance improvements

Message ID: 20210319013355.776008-1-david@fromorbit.com
Series: repair: Phase 6 performance improvements

Message

Dave Chinner March 19, 2021, 1:33 a.m. UTC
Hi folks,

This is largely a repost of my current code so that Xiang can take
over and finish it off. It applies against 5.11.0 and the
performance numbers are still valid. I can't remember how much of
the review comments I addressed from the first time I posted it, so
the changelog is poor....

Original description:

Phase 6 is single threaded, processing a single AG at a time and a
single directory inode at a time.  Phase 6 is often IO latency bound
despite the prefetching it does, resulting in low disk utilisation
and high runtimes. The solution for this is the same as for phases 3
and 4 - scan multiple AGs at once for directory inodes to process.
This patch set enables phase 6 to scan multiple AGs at once, and
hence requires the inode records to be protected for concurrent
updates, as they can now be accessed and modified by multiple
scanning threads. We also need to protect the bad inode list from
concurrent access, and then we can enable concurrent processing of
directories.
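
As an illustration of the list locking, a minimal sketch follows -
the names below are made up for the example and are not the
xfs_repair identifiers:

/*
 * Illustrative sketch only: serialise access to a shared "bad inode"
 * list so that multiple AG scanning threads can add to and query it
 * safely.  The names (bad_ino_list, bad_ino_lock, ...) are invented
 * for the example.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct bad_ino {
	uint64_t	ino;
	struct bad_ino	*next;
};

static struct bad_ino	*bad_ino_list;
static pthread_mutex_t	bad_ino_lock = PTHREAD_MUTEX_INITIALIZER;

static void add_bad_ino(uint64_t ino)
{
	struct bad_ino	*e = malloc(sizeof(*e));

	if (!e)
		abort();	/* real code would report the error */
	e->ino = ino;
	pthread_mutex_lock(&bad_ino_lock);
	e->next = bad_ino_list;
	bad_ino_list = e;
	pthread_mutex_unlock(&bad_ino_lock);
}

static bool is_bad_ino(uint64_t ino)
{
	struct bad_ino	*e;
	bool		found = false;

	pthread_mutex_lock(&bad_ino_lock);
	for (e = bad_ino_list; e; e = e->next) {
		if (e->ino == ino) {
			found = true;
			break;
		}
	}
	pthread_mutex_unlock(&bad_ino_lock);
	return found;
}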

However, directory entry checking and reconstruction can also be CPU
bound - large directories overwhelm the directory name hash
structures because the algorithms have poor scalability - one is O(n
+ n^2), another is O(n^2) when the number of dirents greatly
exceeds the hash table size. Hence we need to do more than just
parallelise across AGs - we need to parallelise processing within
AGs so that a single large directory doesn't completely serialise
processing within an AG.  This is done by using bound-depth
workqueues to allow inode records to be processed asynchronously as
the inode records are fetched from disk.
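
The bound-depth workqueue idea, as a rough sketch (illustrative only,
not the libfrog implementation - the struct and function names here
are invented): producers block once the queue is full and are woken
again as workers retire items.

/*
 * Illustrative sketch of a bound-depth workqueue: producers block
 * once max_queued items are outstanding and are woken again as
 * workers retire items.  Names are invented for the example and are
 * not the libfrog workqueue API.
 */
#include <pthread.h>

struct bwq {
	pthread_mutex_t	lock;
	pthread_cond_t	queue_full;
	unsigned int	depth;		/* items currently outstanding */
	unsigned int	max_queued;	/* 0 means unbounded */
};

static void bwq_init(struct bwq *wq, unsigned int max_queued)
{
	pthread_mutex_init(&wq->lock, NULL);
	pthread_cond_init(&wq->queue_full, NULL);
	wq->depth = 0;
	wq->max_queued = max_queued;
}

/* Producer side: call before handing an item to a worker thread. */
static void bwq_throttle(struct bwq *wq)
{
	pthread_mutex_lock(&wq->lock);
	while (wq->max_queued && wq->depth >= wq->max_queued)
		pthread_cond_wait(&wq->queue_full, &wq->lock);
	wq->depth++;
	pthread_mutex_unlock(&wq->lock);
}

/* Worker side: call after an item has been processed. */
static void bwq_item_done(struct bwq *wq)
{
	pthread_mutex_lock(&wq->lock);
	wq->depth--;
	/*
	 * Broadcast rather than signal so that all throttled producers
	 * re-check the depth (the v2 change noted in the changelog).
	 */
	pthread_cond_broadcast(&wq->queue_full);
	pthread_mutex_unlock(&wq->lock);
}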

Further, we need to fix the bad algorithmic scalability of the
in-memory directory tracking structures. This is done through a
combination of better structures and more appropriate dynamic size
choices.
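
For example, the dynamic sizing amounts to deriving the name hash
table size from the expected dirent count rather than clamping it to
a small fixed maximum. A rough sketch (the helper name, floor and cap
are illustrative, not what the patches use):

/*
 * Illustrative sketch of the sizing idea: derive the name hash table
 * size from the number of directory entries instead of clamping it
 * to a small fixed maximum, so huge directories keep short chains.
 */
#include <stdint.h>

static unsigned int dir_hash_size(uint64_t nr_dirents)
{
	uint64_t	size = 64;		/* small floor */

	/* grow in powers of two until ~4 dirents per bucket, capped */
	while (size < nr_dirents / 4 && size < (1ULL << 24))
		size <<= 1;
	return size;
}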

The results on a filesystem with a single 10 million entry directory
containing 400MB of directory entry data are as follows:

v5.6.0 (Baseline)

       XFS_REPAIR Summary    Thu Oct 22 12:10:52 2020

Phase           Start           End             Duration
Phase 1:        10/22 12:06:41  10/22 12:06:41
Phase 2:        10/22 12:06:41  10/22 12:06:41
Phase 3:        10/22 12:06:41  10/22 12:07:00  19 seconds
Phase 4:        10/22 12:07:00  10/22 12:07:12  12 seconds
Phase 5:        10/22 12:07:12  10/22 12:07:13  1 second
Phase 6:        10/22 12:07:13  10/22 12:10:51  3 minutes, 38 seconds
Phase 7:        10/22 12:10:51  10/22 12:10:51

Total run time: 4 minutes, 10 seconds

real	4m11.151s
user	4m20.083s
sys	0m14.744s


5.9.0-rc1 + patchset:

        XFS_REPAIR Summary    Thu Oct 22 13:19:02 2020

Phase           Start           End             Duration
Phase 1:        10/22 13:18:09  10/22 13:18:09
Phase 2:        10/22 13:18:09  10/22 13:18:09
Phase 3:        10/22 13:18:09  10/22 13:18:31  22 seconds
Phase 4:        10/22 13:18:31  10/22 13:18:45  14 seconds
Phase 5:        10/22 13:18:45  10/22 13:18:45
Phase 6:        10/22 13:18:45  10/22 13:19:00  15 seconds
Phase 7:        10/22 13:19:00  10/22 13:19:00

Total run time: 51 seconds

real	0m52.375s
user	1m3.739s
sys	0m20.346s


Performance improvements on filesystems with small directories and
really fast storage are, at best, modest. The big improvements are
seen with really large directories and/or relatively slow devices
that are IO latency bound and can benefit from having more IO in
flight at once.

Cheers,

Dave.

Version 2
- use pthread_cond_broadcast() to wake up throttled waiters in
  bounded workqueues.
- other changes I didn't record when I made them, so I've forgotten
  about them.



Dave Chinner (7):
  workqueue: bound maximum queue depth
  repair: Protect bad inode list with mutex
  repair: protect inode chunk tree records with a mutex
  repair: parallelise phase 6
  repair: don't duplicate names in phase 6
  repair: convert the dir byaddr hash to a radix tree
  repair: scale duplicate name checking in phase 6.

 libfrog/radix-tree.c |  46 +++++
 libfrog/workqueue.c  |  42 ++++-
 libfrog/workqueue.h  |   4 +
 repair/dir2.c        |  32 ++--
 repair/incore.h      |  23 +++
 repair/incore_ino.c  |  15 ++
 repair/phase6.c      | 396 +++++++++++++++++++++----------------------
 7 files changed, 338 insertions(+), 220 deletions(-)

Comments

Gao Xiang March 19, 2021, 1:38 a.m. UTC | #1
On Fri, Mar 19, 2021 at 12:33:48PM +1100, Dave Chinner wrote:
> Hi folks,
> 
> This is largely a repost of my current code so that Xiang can take
> over and finish it off. It applies against 5.11.0 and the
> performance numbers are still valid. I can't remember how much of
> the review comments I addressed from the first time I posted it, so
> the changelog is poor....

Yeah, I will catch what's missing (now looking at the previous review),
and follow up then...

Thanks,
Gao Xiang
Darrick J. Wong March 19, 2021, 6:22 p.m. UTC | #2
On Fri, Mar 19, 2021 at 09:38:45AM +0800, Gao Xiang wrote:
> On Fri, Mar 19, 2021 at 12:33:48PM +1100, Dave Chinner wrote:
> > Hi folks,
> > 
> > This is largely a repost of my current code so that Xiang can take
> > over and finish it off. It applies against 5.11.0 and the
> > performance numbers are still valid. I can't remember how much of
> > the review comments I addressed from the first time I posted it, so
> > the changelog is poor....
> 
> Yeah, I will catch what's missing (now looking at the previous review),
> and follow up then...

:)

While you're revising the patches, you might as well convert:

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

into:

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

because Exchange is so awful for inline replies that I don't use that
email address anymore.

--D

> Thanks,
> Gao Xiang
>
Gao Xiang March 20, 2021, 2:09 a.m. UTC | #3
On Fri, Mar 19, 2021 at 11:22:21AM -0700, Darrick J. Wong wrote:
> On Fri, Mar 19, 2021 at 09:38:45AM +0800, Gao Xiang wrote:
> > On Fri, Mar 19, 2021 at 12:33:48PM +1100, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > This is largely a repost of my current code so that Xiang can take
> > > over and finish it off. It applies against 5.11.0 and the
> > > performance numbers are still valid. I can't remember how much of
> > > the review comments I addressed from the first time I posted it, so
> > > the changelog is poor....
> > 
> > Yeah, I will catch what's missing (now looking at the previous review),
> > and follow up then...
> 
> :)
> 
> While you're revising the patches, you might as well convert:
> 
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> into:
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> because Exchange is so awful for inline replies that I don't use that
> email address anymore.

Yeah, I'm just starting to sort out all the previous opinions
and patch diffs. Will update in the next version.

Thanks,
Gao Xiang

> 
> --D
> 
> > Thanks,
> > Gao Xiang
> > 
>
Gao Xiang March 24, 2021, 1:26 a.m. UTC | #4
Hi Dave and Darrick,

On Sat, Mar 20, 2021 at 10:09:31AM +0800, Gao Xiang wrote:
> On Fri, Mar 19, 2021 at 11:22:21AM -0700, Darrick J. Wong wrote:
> > On Fri, Mar 19, 2021 at 09:38:45AM +0800, Gao Xiang wrote:
> > > On Fri, Mar 19, 2021 at 12:33:48PM +1100, Dave Chinner wrote:
> > > > Hi folks,
> > > > 
> > > > This is largely a repost of my current code so that Xiang can take
> > > > over and finish it off. It applies against 5.11.0 and the
> > > > performance numbers are still valid. I can't remember how much of
> > > > the review comments I addressed from the first time I posted it, so
> > > > the changelog is poor....
> > > 
> > > Yeah, I will catch what's missing (now looking at the previous review),
> > > and follow up then...
> > 
> > :)
> > 
> > While you're revising the patches, you might as well convert:
> > 
> > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > into:
> > 
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > 
> > because Exchange is so awful for inline replies that I don't use that
> > email address anymore.
> 
> Yeah, I'm just starting to sort out all the previous opinions
> and patch diffs. Will update in the next version.
> 

Sorry for bothering... After reading through the previous discussion,
I'm fine with the trivial cleanups. Yet it seems there are mainly 2
open discussion points that are still unresolved...

1 is the magic number 1000,
https://lore.kernel.org/r/20201029172045.GP1061252@magnolia

I don't have a better idea for this either (and have no idea why a
queue depth of 1000 is optimal compared with other configurations),
so it'd be better to get your thoughts on this in advance (e.g. just
leave it as-is, or... plus, I don't have a test setup with that many
cpus).

2 is the hash size modification,
https://lore.kernel.org/r/20201029162922.GM1061252@magnolia/

It seems hash entries were previously limited to 64k, and this patch
relaxes that limitation. For huge directories I'm not sure about the
hash table utilization, but from the previous commit message it seems
the extra memory usage can be ignored.

Anyway, I'm fine with just leaving them as-is if we agree on these.

Thanks,
Gao Xiang

> Thanks,
> Gao Xiang
> 
> > 
> > --D
> > 
> > > Thanks,
> > > Gao Xiang
> > > 
> >
Darrick J. Wong March 24, 2021, 2:08 a.m. UTC | #5
On Wed, Mar 24, 2021 at 09:26:55AM +0800, Gao Xiang wrote:
> Hi Dave and Darrick,
> 
> On Sat, Mar 20, 2021 at 10:09:31AM +0800, Gao Xiang wrote:
> > On Fri, Mar 19, 2021 at 11:22:21AM -0700, Darrick J. Wong wrote:
> > > On Fri, Mar 19, 2021 at 09:38:45AM +0800, Gao Xiang wrote:
> > > > On Fri, Mar 19, 2021 at 12:33:48PM +1100, Dave Chinner wrote:
> > > > > Hi folks,
> > > > > 
> > > > > This is largely a repost of my current code so that Xiang can take
> > > > > over and finish it off. It applies against 5.11.0 and the
> > > > > performance numbers are still valid. I can't remember how much of
> > > > > the review comments I addressed from the first time I posted it, so
> > > > > the changelog is poor....
> > > > 
> > > > Yeah, I will catch what's missing (now looking at the previous review),
> > > > and follow up then...
> > > 
> > > :)
> > > 
> > > While you're revising the patches, you might as well convert:
> > > 
> > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > into:
> > > 
> > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > because Exchange is so awful for inline replies that I don't use that
> > > email address anymore.
> > 
> > Yeah, I'm just starting to sort out all the previous opinions
> > and patch diffs. Will update in the next version.
> > 
> 
> Sorry for bothering... After reading through the previous discussion,
> I'm fine with the trivial cleanups. Yet it seems there are mainly 2
> open discussion points that are still unresolved...
> 
> 1 is the magic number 1000,
> https://lore.kernel.org/r/20201029172045.GP1061252@magnolia
> 
> I don't have a better idea for this either (and have no idea why a
> queue depth of 1000 is optimal compared with other configurations),
> so it'd be better to get your thoughts on this in advance (e.g. just
> leave it as-is, or... plus, I don't have a test setup with that many
> cpus).
> 
> 2 is the hash size modification,
> https://lore.kernel.org/r/20201029162922.GM1061252@magnolia/
> 
> It seems hash entries were previously limited to 64k, and this patch
> relaxes that limitation. For huge directories I'm not sure about the
> hash table utilization, but from the previous commit message it seems
> the extra memory usage can be ignored.
> 
> Anyway, I'm fine with just leaving them as-is if we agree on these.

FWIW I didn't have any specific objections to either magic number, I
simply wanted to know where they came from and why. :)

--D

> Thanks,
> Gao Xiang
> 
> > Thanks,
> > Gao Xiang
> > 
> > > 
> > > --D
> > > 
> > > > Thanks,
> > > > Gao Xiang
> > > > 
> > > 
>