[V2] Stop searching for free slots in an inode chunk when there are none
diff mbox

Message ID 20170807101710.18734-1-cmaiolino@redhat.com
State New
Headers show

Commit Message

Carlos Maiolino Aug. 7, 2017, 10:17 a.m. UTC
In a filesystem without finobt, the Space manager selects an AG to alloc a new
inode, where xfs_dialloc_ag_inobt() will search the AG for the free slot chunk.

When the new inode is in the samge AG as its parent, the btree will be searched
starting on the parent's record, and then retried from the top if no slot is
available beyond the parent's record.

To exit this loop though, xfs_dialloc_ag_inobt(), relies on the fact that the
btree must have a free slot available, once its callers relied on the
agi->freecount when deciding how/where to allocate this new inode.

In the case when the agi->freecount is corrupted, showing available inodes in an
AG, when in fact there is none, this becomes an infinite loop.

Add a way to stop the loop when a free slot is not found in the btree, making
the function to fall into the whole AG scan which will then, be able to detect
the corruption and shut the filesystem down.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---

V2: Use searchdistance variable to stop the infinite loop instead of adding a
    new mechanism.

 fs/xfs/libxfs/xfs_ialloc.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

Comments

Brian Foster Aug. 7, 2017, 3:19 p.m. UTC | #1
On Mon, Aug 07, 2017 at 12:17:10PM +0200, Carlos Maiolino wrote:
> In a filesystem without finobt, the Space manager selects an AG to alloc a new
> inode, where xfs_dialloc_ag_inobt() will search the AG for the free slot chunk.
> 
> When the new inode is in the samge AG as its parent, the btree will be searched
> starting on the parent's record, and then retried from the top if no slot is
> available beyond the parent's record.
> 
> To exit this loop though, xfs_dialloc_ag_inobt(), relies on the fact that the
> btree must have a free slot available, once its callers relied on the
> agi->freecount when deciding how/where to allocate this new inode.
> 
> In the case when the agi->freecount is corrupted, showing available inodes in an
> AG, when in fact there is none, this becomes an infinite loop.
> 
> Add a way to stop the loop when a free slot is not found in the btree, making
> the function to fall into the whole AG scan which will then, be able to detect
> the corruption and shut the filesystem down.
> 
> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> ---
> 
> V2: Use searchdistance variable to stop the infinite loop instead of adding a
>     new mechanism.
> 
>  fs/xfs/libxfs/xfs_ialloc.c | 34 ++++++++++++++++++----------------
>  1 file changed, 18 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index d41ade5d293e..8e535290390d 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -1123,6 +1123,7 @@ xfs_dialloc_ag_inobt(
>  	int			error;
>  	int			offset;
>  	int			i, j;
> +	int			searchdistance = 10;
>  
>  	pag = xfs_perag_get(mp, agno);
>  
> @@ -1149,7 +1150,6 @@ xfs_dialloc_ag_inobt(
>  	if (pagno == agno) {
>  		int		doneleft;	/* done, to the left */
>  		int		doneright;	/* done, to the right */
> -		int		searchdistance = 10;
>  
>  		error = xfs_inobt_lookup(cur, pagino, XFS_LOOKUP_LE, &i);
>  		if (error)
> @@ -1210,10 +1210,10 @@ xfs_dialloc_ag_inobt(
>  		/*
>  		 * Loop until we find an inode chunk with a free inode.
>  		 */
> -		while (!doneleft || !doneright) {
> +		while (--searchdistance > 0 && (!doneleft || !doneright)) {
>  			int	useleft;  /* using left inode chunk this time */
>  
> -			if (!--searchdistance) {
> +			if (searchdistance <= 0) {

If the loop terminates when searchdistance <= 0, when would we ever hit
this?

>  				/*
>  				 * Not in range - save last search
>  				 * location and allocate a new inode
> @@ -1268,19 +1268,21 @@ xfs_dialloc_ag_inobt(
>  				goto error1;
>  		}
>  
> -		/*
> -		 * We've reached the end of the btree. because
> -		 * we are only searching a small chunk of the
> -		 * btree each search, there is obviously free
> -		 * inodes closer to the parent inode than we
> -		 * are now. restart the search again.
> -		 */
> -		pag->pagl_pagino = NULLAGINO;
> -		pag->pagl_leftrec = NULLAGINO;
> -		pag->pagl_rightrec = NULLAGINO;
> -		xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
> -		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> -		goto restart_pagno;
> +		if (searchdistance > 0) {
> +			/*
> +			 * We've reached the end of the btree. because
> +			 * we are only searching a small chunk of the
> +			 * btree each search, there is obviously free
> +			 * inodes closer to the parent inode than we
> +			 * are now. restart the search again.
> +			 */
> +			pag->pagl_pagino = NULLAGINO;
> +			pag->pagl_leftrec = NULLAGINO;
> +			pag->pagl_rightrec = NULLAGINO;
> +			xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
> +			xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +			goto restart_pagno;
> +		}

Hmm.. so if we have a tree with just a handful of records (say 3, for
example), we'd end up running through the tree searchdistance times
before correctly identifying the state and moving on (to the full
search). We avoid the livelock, but fwiw, that still seems like
buggy/brittle code to me. That's just my .02. If we want to use this
approach, we should at least add a comment somewhere that explains
how/why we handle this situation as such (i.e., that we require/rely on
searchdistance to deplete in the worst case of a corruption).

Also note that this has potential for performance side effects in the
common (non-corruption) case. That may or may not ultimately be a
problem, but the fallback to the optimized search is the brute force AG
scan. Given that, I think that case at least has to be considered and
noted in the commit log.

It looks to me that it shouldn't be a major problem because it only
affects the situation where the cached search "wraps" to the outside of
the tree, and that probably doesn't happen often with a search distance
of 10 records and a large tree. I am a bit curious where the
searchdistance of 10 comes from though (we fit many more records in a
single inobt leaf block)..?

Brian

>  	}
>  
>  	/*
> -- 
> 2.13.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Aug. 10, 2017, 11:56 a.m. UTC | #2
On Mon, Aug 07, 2017 at 11:19:56AM -0400, Brian Foster wrote:
> On Mon, Aug 07, 2017 at 12:17:10PM +0200, Carlos Maiolino wrote:
> Also note that this has potential for performance side effects in the
> common (non-corruption) case.

I'm not seeing an issue here once the code is corrected. Probably
just me being dumb again, but could you point it out for me, Brian?

> It looks to me that it shouldn't be a major problem because it only
> affects the situation where the cached search "wraps" to the outside of
> the tree, and that probably doesn't happen often with a search distance
> of 10 records and a large tree. I am a bit curious where the
> searchdistance of 10 comes from though (we fit many more records in a
> single inobt leaf block)..?

It was chosen based on CPU profiles and performance measurement to
limit the CPU usage of the problem case the finobt now solves. i.e.
finding the frees inode in a tree that indexes several million
allocated inodes and the free inodes are few and far between. It was
chosen to cap inode allocation performance degradation when free
inodes were extremely sparse at around 50% of the "lots of free
inodes that are easy to find" performance.

Cheers,

Dave.
Brian Foster Aug. 10, 2017, 12:28 p.m. UTC | #3
On Thu, Aug 10, 2017 at 09:56:18PM +1000, Dave Chinner wrote:
> On Mon, Aug 07, 2017 at 11:19:56AM -0400, Brian Foster wrote:
> > On Mon, Aug 07, 2017 at 12:17:10PM +0200, Carlos Maiolino wrote:
> > Also note that this has potential for performance side effects in the
> > common (non-corruption) case.
> 
> I'm not seeing an issue here once the code is corrected. Probably
> just me being dumb again, but could you point it out for me, Brian?
> 

Oh, I'm just pointing out that this version tweaks the normal runtime
algorithm (by further limiting the record search window in certain
cases) whereas the previous version did not and I didn't see any mention
as to why that is safe. The first sentence below explains why I think
the change has minimal performance impact, if any, and is probably fine.
I'm basically just asking that if we fix this by tweaking the
optimization algorithm, we add a brief justification for why this does
not impact normal runtime performance in the commit log (if my
understanding is correct, feel free to steal the text below).

> > It looks to me that it shouldn't be a major problem because it only
> > affects the situation where the cached search "wraps" to the outside of
> > the tree, and that probably doesn't happen often with a search distance
> > of 10 records and a large tree. I am a bit curious where the
> > searchdistance of 10 comes from though (we fit many more records in a
> > single inobt leaf block)..?
> 
> It was chosen based on CPU profiles and performance measurement to
> limit the CPU usage of the problem case the finobt now solves. i.e.
> finding the frees inode in a tree that indexes several million
> allocated inodes and the free inodes are few and far between. It was
> chosen to cap inode allocation performance degradation when free
> inodes were extremely sparse at around 50% of the "lots of free
> inodes that are easy to find" performance.
> 

Ah, Ok. So it was more of a CPU oriented optimization than an I/O one.
That makes sense, thanks.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index d41ade5d293e..8e535290390d 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1123,6 +1123,7 @@  xfs_dialloc_ag_inobt(
 	int			error;
 	int			offset;
 	int			i, j;
+	int			searchdistance = 10;
 
 	pag = xfs_perag_get(mp, agno);
 
@@ -1149,7 +1150,6 @@  xfs_dialloc_ag_inobt(
 	if (pagno == agno) {
 		int		doneleft;	/* done, to the left */
 		int		doneright;	/* done, to the right */
-		int		searchdistance = 10;
 
 		error = xfs_inobt_lookup(cur, pagino, XFS_LOOKUP_LE, &i);
 		if (error)
@@ -1210,10 +1210,10 @@  xfs_dialloc_ag_inobt(
 		/*
 		 * Loop until we find an inode chunk with a free inode.
 		 */
-		while (!doneleft || !doneright) {
+		while (--searchdistance > 0 && (!doneleft || !doneright)) {
 			int	useleft;  /* using left inode chunk this time */
 
-			if (!--searchdistance) {
+			if (searchdistance <= 0) {
 				/*
 				 * Not in range - save last search
 				 * location and allocate a new inode
@@ -1268,19 +1268,21 @@  xfs_dialloc_ag_inobt(
 				goto error1;
 		}
 
-		/*
-		 * We've reached the end of the btree. because
-		 * we are only searching a small chunk of the
-		 * btree each search, there is obviously free
-		 * inodes closer to the parent inode than we
-		 * are now. restart the search again.
-		 */
-		pag->pagl_pagino = NULLAGINO;
-		pag->pagl_leftrec = NULLAGINO;
-		pag->pagl_rightrec = NULLAGINO;
-		xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
-		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
-		goto restart_pagno;
+		if (searchdistance > 0) {
+			/*
+			 * We've reached the end of the btree. because
+			 * we are only searching a small chunk of the
+			 * btree each search, there is obviously free
+			 * inodes closer to the parent inode than we
+			 * are now. restart the search again.
+			 */
+			pag->pagl_pagino = NULLAGINO;
+			pag->pagl_leftrec = NULLAGINO;
+			pag->pagl_rightrec = NULLAGINO;
+			xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
+			xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+			goto restart_pagno;
+		}
 	}
 
 	/*