[v3,8/8] repair: scale duplicate name checking in phase 6.

Message ID	20210330142531.19809-9-hsiangkao@aol.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <linux-xfs-owner@kernel.org> From: Gao Xiang <hsiangkao@aol.com> To: linux-xfs@vger.kernel.org Cc: Dave Chinner <dchinner@redhat.com>, Gao Xiang <hsiangkao@redhat.com> Subject: [PATCH v3 8/8] repair: scale duplicate name checking in phase 6. Date: Tue, 30 Mar 2021 22:25:31 +0800 Message-Id: <20210330142531.19809-9-hsiangkao@aol.com> In-Reply-To: <20210330142531.19809-1-hsiangkao@aol.com> References: <20210330142531.19809-1-hsiangkao@aol.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	repair: Phase 6 performance improvements \| expand [v3,0/8] repair: Phase 6 performance improvements [v3,1/8] repair: turn bad inode list into array [v3,2/8] workqueue: bound maximum queue depth [v3,3/8] repair: Protect bad inode list with mutex [v3,4/8] repair: protect inode chunk tree records with a mutex [v3,5/8] repair: parallelise phase 6 [v3,6/8] repair: don't duplicate names in phase 6 [v3,7/8] repair: convert the dir byaddr hash to a radix tree [v3,8/8] repair: scale duplicate name checking in phase 6.

Message ID

20210330142531.19809-9-hsiangkao@aol.com (mailing list archive)

State

Superseded

Headers

From: Gao Xiang <hsiangkao@aol.com>
To: linux-xfs@vger.kernel.org
Cc: Dave Chinner <dchinner@redhat.com>,
        Gao Xiang <hsiangkao@redhat.com>
Subject: [PATCH v3 8/8] repair: scale duplicate name checking in phase 6.
Date: Tue, 30 Mar 2021 22:25:31 +0800
Message-Id: <20210330142531.19809-9-hsiangkao@aol.com>
In-Reply-To: <20210330142531.19809-1-hsiangkao@aol.com>
References: <20210330142531.19809-1-hsiangkao@aol.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

repair: Phase 6 performance improvements | expand

Commit Message

Gao Xiang March 30, 2021, 2:25 p.m. UTC

From: Dave Chinner <dchinner@redhat.com>

phase 6 on large directories is cpu bound on duplicate name checking
due to the algorithm having effectively O(n^2) scalability. Hence
when the duplicate name hash table  size is far smaller than the
number of directory entries, we end up with long hash chains that
are searched linearly on every new entry that is found in the
directory to do duplicate detection.

The in-memory hash table size is limited to 64k entries. Hence when
we have millions of entries in a directory, duplicate entry lookups
on the hash table have substantial overhead. Scale this table out to
larger sizes so that we keep the chain lengths short and hence the
O(n^2) scalability impact is limited because N is always small.

For a 10M entry directory consuming 400MB of directory data, the
hash table now sizes at 6.4 million entries instead of ~64k - it is
~100x larger. While the hash table now consumes ~50MB of RAM, the
xfs_repair footprint barely changes as it's using already consuming
~9GB of RAM at this point in time. IOWs, the incremental memory
usage change is noise, but the directory checking time:

Unpatched:

  97.11%  xfs_repair          [.] dir_hash_add
   0.38%  xfs_repair          [.] longform_dir2_entry_check_data
   0.34%  libc-2.31.so        [.] __libc_calloc
   0.32%  xfs_repair          [.] avl_ino_start

Phase 6:        10/22 12:11:40  10/22 12:14:28  2 minutes, 48 seconds

Patched:

  46.74%  xfs_repair          [.] radix_tree_lookup
  32.13%  xfs_repair          [.] dir_hash_see_all
   7.70%  xfs_repair          [.] radix_tree_tag_get
   3.92%  xfs_repair          [.] dir_hash_add
   3.52%  xfs_repair          [.] radix_tree_tag_clear
   2.43%  xfs_repair          [.] crc32c_le

Phase 6:        10/22 13:11:01  10/22 13:11:18  17 seconds

has been reduced by an order of magnitude.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
---
 repair/phase6.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

Comments

Darrick J. Wong March 30, 2021, 6:31 p.m. UTC | #1

On Tue, Mar 30, 2021 at 10:25:31PM +0800, Gao Xiang wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> phase 6 on large directories is cpu bound on duplicate name checking
> due to the algorithm having effectively O(n^2) scalability. Hence
> when the duplicate name hash table  size is far smaller than the
> number of directory entries, we end up with long hash chains that
> are searched linearly on every new entry that is found in the
> directory to do duplicate detection.
> 
> The in-memory hash table size is limited to 64k entries. Hence when
> we have millions of entries in a directory, duplicate entry lookups
> on the hash table have substantial overhead. Scale this table out to
> larger sizes so that we keep the chain lengths short and hence the
> O(n^2) scalability impact is limited because N is always small.
> 
> For a 10M entry directory consuming 400MB of directory data, the
> hash table now sizes at 6.4 million entries instead of ~64k - it is
> ~100x larger. While the hash table now consumes ~50MB of RAM, the
> xfs_repair footprint barely changes as it's using already consuming
> ~9GB of RAM at this point in time. IOWs, the incremental memory
> usage change is noise, but the directory checking time:
> 
> Unpatched:
> 
>   97.11%  xfs_repair          [.] dir_hash_add
>    0.38%  xfs_repair          [.] longform_dir2_entry_check_data
>    0.34%  libc-2.31.so        [.] __libc_calloc
>    0.32%  xfs_repair          [.] avl_ino_start
> 
> Phase 6:        10/22 12:11:40  10/22 12:14:28  2 minutes, 48 seconds
> 
> Patched:
> 
>   46.74%  xfs_repair          [.] radix_tree_lookup
>   32.13%  xfs_repair          [.] dir_hash_see_all
>    7.70%  xfs_repair          [.] radix_tree_tag_get
>    3.92%  xfs_repair          [.] dir_hash_add
>    3.52%  xfs_repair          [.] radix_tree_tag_clear
>    2.43%  xfs_repair          [.] crc32c_le
> 
> Phase 6:        10/22 13:11:01  10/22 13:11:18  17 seconds
> 
> has been reduced by an order of magnitude.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> ---
>  repair/phase6.c | 30 ++++++++++++++++++++++++------
>  1 file changed, 24 insertions(+), 6 deletions(-)
> 
> diff --git a/repair/phase6.c b/repair/phase6.c
> index 063329636500..aa991bf76da6 100644
> --- a/repair/phase6.c
> +++ b/repair/phase6.c
> @@ -288,19 +288,37 @@ dir_hash_done(
>  	free(hashtab);
>  }
>  
> +/*
> + * Create a directory hash index structure based on the size of the directory we
> + * are about to try to repair. The size passed in is the size of the data
> + * segment of the directory in bytes, so we don't really know exactly how many
> + * entries are in it. Hence assume an entry size of around 64 bytes - that's a
> + * name length of 40+ bytes so should cover a most situations with large
> + * really directories.

"...with really large directories."

> + */
>  static struct dir_hash_tab *
>  dir_hash_init(
>  	xfs_fsize_t		size)
>  {
> -	struct dir_hash_tab	*hashtab;
> +	struct dir_hash_tab	*hashtab = NULL;
>  	int			hsize;
>  
> -	hsize = size / (16 * 4);
> -	if (hsize > 65536)
> -		hsize = 63336;
> -	else if (hsize < 16)
> +	hsize = size / 64;
> +	if (hsize < 16)
>  		hsize = 16;
> -	if ((hashtab = calloc(DIR_HASH_TAB_SIZE(hsize), 1)) == NULL)
> +
> +	/*
> +	 * Try to allocate as large a hash table as possible. Failure to
> +	 * allocate isn't fatal, it will just result in slower performance as we
> +	 * reduce the size of the table.
> +	 */
> +	while (hsize >= 16) {
> +		hashtab = calloc(DIR_HASH_TAB_SIZE(hsize), 1);
> +		if (hashtab)
> +			break;
> +		hsize /= 2;
> +	}

I still kinda wonder if the hash size ought to be some prime number,
though I suppose /finding/ a prime number isn't necessarily easy.

Hm, let's assume you had the maximal 32GB directory.

size = 34359738368
hsize = size/64 = 536870912
DIR_HASH_TAB_SIZE(hsize) =
	(sizeof(dir_hash_tab_t) + (sizeof(dir_hash_ent_t *) * (hsize) * 2))

== 40 + (8 * 536870912 * 2) == 8589934632

That's ... kind of big?  I guess the saving grace is that we halve hsize
if calloc fails and the address space isn't populated with memory pages
until we actually use them.  It might turn out that users want a lower
cap than 8GB per 32GB directory, but I suppose we won't find /that/ out
until we try...

With the comment fixed,

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> +	if (!hashtab)
>  		do_error(_("calloc failed in dir_hash_init\n"));
>  	hashtab->size = hsize;
>  	hashtab->byhash = (struct dir_hash_ent **)((char *)hashtab +
> -- 
> 2.20.1
>

diff --git a/repair/phase6.c b/repair/phase6.c
index 063329636500..aa991bf76da6 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -288,19 +288,37 @@  dir_hash_done(
 	free(hashtab);
 }
 
+/*
+ * Create a directory hash index structure based on the size of the directory we
+ * are about to try to repair. The size passed in is the size of the data
+ * segment of the directory in bytes, so we don't really know exactly how many
+ * entries are in it. Hence assume an entry size of around 64 bytes - that's a
+ * name length of 40+ bytes so should cover a most situations with large
+ * really directories.
+ */
 static struct dir_hash_tab *
 dir_hash_init(
 	xfs_fsize_t		size)
 {
-	struct dir_hash_tab	*hashtab;
+	struct dir_hash_tab	*hashtab = NULL;
 	int			hsize;
 
-	hsize = size / (16 * 4);
-	if (hsize > 65536)
-		hsize = 63336;
-	else if (hsize < 16)
+	hsize = size / 64;
+	if (hsize < 16)
 		hsize = 16;
-	if ((hashtab = calloc(DIR_HASH_TAB_SIZE(hsize), 1)) == NULL)
+
+	/*
+	 * Try to allocate as large a hash table as possible. Failure to
+	 * allocate isn't fatal, it will just result in slower performance as we
+	 * reduce the size of the table.
+	 */
+	while (hsize >= 16) {
+		hashtab = calloc(DIR_HASH_TAB_SIZE(hsize), 1);
+		if (hashtab)
+			break;
+		hsize /= 2;
+	}
+	if (!hashtab)
 		do_error(_("calloc failed in dir_hash_init\n"));
 	hashtab->size = hsize;
 	hashtab->byhash = (struct dir_hash_ent **)((char *)hashtab +