
Stop searching for free slots in an inode chunk when there are none

Message ID 20170803151915.7861-1-cmaiolino@redhat.com (mailing list archive)
State Superseded, archived

Commit Message

Carlos Maiolino Aug. 3, 2017, 3:19 p.m. UTC
In a filesystem without finobt, the space manager selects an AG in which to
allocate a new inode, and xfs_dialloc_ag_inobt() then searches that AG for an
inode chunk with a free slot.

When the new inode is in the same AG as its parent, the btree is searched
starting at the parent's record, and the search is then retried from the top
if no slot is available beyond the parent's record.

To exit this loop, though, xfs_dialloc_ag_inobt() relies on the fact that the
btree must have a free slot available, since its callers relied on
agi->freecount when deciding how/where to allocate this new inode.

When agi->freecount is corrupted and shows available inodes in an AG that in
fact has none, this becomes an infinite loop.

Add a way to stop the loop when no free slot is found in the btree, making the
function fall through to the whole-AG scan, which will then be able to detect
the corruption and shut the filesystem down.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---

I have an xfstest that catches this agi->freecount corruption case; I'll send
it to the list soon.

 fs/xfs/libxfs/xfs_ialloc.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

Comments

Dave Chinner Aug. 3, 2017, 10:35 p.m. UTC | #1
On Thu, Aug 03, 2017 at 05:19:15PM +0200, Carlos Maiolino wrote:
> In a filesystem without finobt, the space manager selects an AG in which to
> allocate a new inode, and xfs_dialloc_ag_inobt() then searches that AG for an
> inode chunk with a free slot.
> 
> When the new inode is in the same AG as its parent, the btree is searched
> starting at the parent's record, and the search is then retried from the top
> if no slot is available beyond the parent's record.
> 
> To exit this loop, though, xfs_dialloc_ag_inobt() relies on the fact that the
> btree must have a free slot available, since its callers relied on
> agi->freecount when deciding how/where to allocate this new inode.
> 
> When agi->freecount is corrupted and shows available inodes in an AG that in
> fact has none, this becomes an infinite loop.
> 
> Add a way to stop the loop when no free slot is found in the btree, making the
> function fall through to the whole-AG scan, which will then be able to detect
> the corruption and shut the filesystem down.

That doesn't sound quite right. The initial scan and the restart
loop are both limited to scanning search_distance records - we never
search the entire tree except when it's really small (i.e. less than
10-20 records (640-1280 inodes) depending on balance). If the
pagino record to end of btree distance in both directions is shorter
than the search distance for a given loop (i.e. less than 10 records
from pagino to end-of-btree) then that is the only time a corrupted
agi->freecount can cause this problem.

IOWs, on production systems where there's more than a few hundred
inodes (i.e. the vast majority of installations) a corrupted
agi->freecount won't lead to an endless loop because search_distance
will terminate the retry loop and we'll allocate a new inode.

To tell the truth, I'd much rather we just use the search distance
to prevent endless looping than add a second method of limiting
the search loop. i.e. don't reset search_distance when we restart
the search loop at pagino.  That means even for small trees (<
search_distance * 2 records) we'll retry when we get to the end of
tree, but we'll still break out of the loop and allocate new inodes
as soon as we hit the search distance limit.
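
A toy model of that idea, as a rough sketch only. This is not kernel code:
rec_freecount, NRECS, SEARCH_BUDGET and search_free_record() are hypothetical
names, and the real search walks left and right from pagino rather than in one
direction. The only point is that a budget shared across restarts bounds the
search even when no record has a free inode:

```c
#include <assert.h>

#define NRECS         4   /* tiny tree: fewer records than the budget */
#define SEARCH_BUDGET 10  /* stands in for the search distance */

/* Hypothetical per-record free counts; all zero models the corrupted case. */
static const int rec_freecount[NRECS] = { 0, 0, 0, 0 };

/*
 * Scan records starting at 'start', wrapping to record 0 at the end of the
 * tree. The budget is deliberately NOT reset on a wrap, so the scan ends
 * after at most 'budget' record visits.
 *
 * Returns the index of a record with a free inode, or -1 once the budget
 * is exhausted. '*restarts' counts how often the scan wrapped around.
 */
static int search_free_record(int start, int budget, int *restarts)
{
	int i = start;

	assert(start >= 0 && start < NRECS);
	*restarts = 0;

	while (budget-- > 0) {
		if (rec_freecount[i] > 0)
			return i;
		if (++i == NRECS) {
			i = 0;          /* restart from the top of the tree */
			(*restarts)++;
		}
	}
	return -1;  /* caller would go on to allocate a new chunk */
}
```

With the all-zero table, search_free_record(2, SEARCH_BUDGET, &restarts)
wraps a few times and then gives up instead of spinning forever.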

Cheers,

Dave.
Carlos Eduardo Maiolino Aug. 4, 2017, 8:55 a.m. UTC | #2
Hi Dave.


> > Add a way to stop the loop when no free slot is found in the btree, making the
> > function fall through to the whole-AG scan, which will then be able to detect
> > the corruption and shut the filesystem down.
> 
> That doesn't sound quite right. The initial scan and the restart
> loop are both limited to scanning search_distance records - we never
> search the entire tree except when it's really small (i.e. less than
> 10-20 records (640-1280 inodes) depending on balance). If the
> pagino record to end of btree distance in both directions is shorter
> than the search distance for a given loop (i.e. less than 10 records
> from pagino to end-of-btree) then that is the only time a corrupted
> agi->freecount can cause this problem.
> 

I agree with you, but this corruption can still happen, and I've seen reports
of users hitting it.


> IOWs, on production systems where there's more than a few hundred
> inodes (i.e. the vast majority of installations) a corrupted
> agi->freecount won't lead to an endless loop because search_distance
> will terminate the retry loop and we'll allocate a new inode.
> 
> To tell the truth, I'd much rather we just use the search distance
> to prevent endless looping than add a second method of limiting
> the search loop. i.e. don't reset search_distance when we restart
> the search loop at pagino.  That means even for small trees (<
> search_distance * 2 records) we'll retry when we get to the end of
> tree, but we'll still break out of the loop and allocate new inodes
> as soon as we hit the search distance limit.
> 

Sounds reasonable, I'll try that and send a V2,

Thank you!!
Carlos Eduardo Maiolino Aug. 4, 2017, 9:36 a.m. UTC | #3
One more thing.




Here, are you assuming we enter the

while (!doneleft || !doneright) { }

loop on every pass, so that it can decrement searchdistance, or do you mean
moving the --searchdistance somewhere else?

In very small trees we don't even enter the while loop (both doneleft and
doneright are 1), so searchdistance isn't decremented at all; resetting it or
not makes no difference in this case.


My first thought about fixing this was to check both doneleft and doneright, with something like:

if (pagino == NULLAGINO && doneleft && doneright)
      /* nothing more to search, break the loop */

But after talking with Brian on IRC yesterday, we agreed it doesn't sound quite right.

Also, if you want to see the xfstests I'm using for it:

https://raw.githubusercontent.com/cmaiolino/xfstests-dev/872e98ce156292ae2ab69ee55812e22c58556efe/tests/xfs/057

I haven't sent it to the list yet because I want to add some better comments and the '-d' flag
to xfs_io as Darrick suggested, but it's in good enough shape to trigger the bug 100% of the time.


Dave Chinner Aug. 4, 2017, 11:17 p.m. UTC | #4
On Fri, Aug 04, 2017 at 05:36:01AM -0400, Carlos Eduardo Maiolino wrote:
> One more thing.
> 
> Here, are you assuming we enter the
> 
> while (!doneleft || !doneright) { }
> 
> loop on every pass, so that it can decrement searchdistance, or do you mean
> moving the --searchdistance somewhere else?
> 
> In very small trees we don't even enter the while loop (both doneleft and
> doneright are 1), so searchdistance isn't decremented at all; resetting it or
> not makes no difference in this case.

Seems like a minor issue - the first search step left+right is
outside the while loop, and we don't account for that. So change
where the search distance check happens to take that into account:

	while (--searchdistance > 0 && (!doneleft || !doneright)) {
		.....
	}

	if (searchdistance <= 0) {
		/* save current chunk indexes */
		....
		goto newino;
	}

	/* restart at pagino */
	.....
	goto restart_pagno;
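
To see why putting the decrement first in the loop condition helps, here is a
small standalone model of the control flow above. It is a sketch under stated
assumptions, not the kernel function: restarts_before_giving_up() and
recs_per_side are made-up names, and the model assumes no record ever has a
free inode (the corrupted agi->freecount case). The decrement in the while
condition is evaluated even when both directions are already done, so every
restart costs at least one unit of search distance:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Count how many times the search would restart at pagino before the
 * search distance runs out. 'recs_per_side' is the hypothetical number of
 * records reachable in each direction; 0 models a tiny tree where the
 * first left/right step already hits both ends.
 */
static int restarts_before_giving_up(int searchdistance, int recs_per_side)
{
	int restarts = 0;

	assert(searchdistance > 0 && recs_per_side >= 0);

	for (;;) {
		int remaining = recs_per_side;
		bool done = (remaining == 0);  /* doneleft && doneright */

		/* The decrement runs before the done check on every pass. */
		while (--searchdistance > 0 && !done) {
			if (--remaining == 0)
				done = true;
		}

		if (searchdistance <= 0)
			return restarts;  /* the "goto newino" path */

		restarts++;               /* the "goto restart_pagno" path */
	}
}
```

In this model a tiny tree (recs_per_side == 0) still terminates, because each
restart consumes one unit of search distance even though the loop body never
runs.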


-Dave.

Patch

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index d41ade5d293e..8ebe0d89bdc5 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1123,6 +1123,7 @@  xfs_dialloc_ag_inobt(
 	int			error;
 	int			offset;
 	int			i, j;
+	int			retry = true; /* Search tree from the top */
 
 	pag = xfs_perag_get(mp, agno);
 
@@ -1205,6 +1206,12 @@  xfs_dialloc_ag_inobt(
 			error = xfs_ialloc_next_rec(cur, &rec, &doneright, 0);
 			if (error)
 				goto error1;
+
+			/*
+			 * We've already scanned the whole btree, no need to
+			 * retry the search.
+			 */
+			retry = false;
 		}
 
 		/*
@@ -1268,19 +1275,23 @@  xfs_dialloc_ag_inobt(
 				goto error1;
 		}
 
-		/*
-		 * We've reached the end of the btree. because
-		 * we are only searching a small chunk of the
-		 * btree each search, there is obviously free
-		 * inodes closer to the parent inode than we
-		 * are now. restart the search again.
-		 */
-		pag->pagl_pagino = NULLAGINO;
-		pag->pagl_leftrec = NULLAGINO;
-		pag->pagl_rightrec = NULLAGINO;
-		xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
-		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
-		goto restart_pagno;
+		if (retry) {
+			/*
+			 * We've reached the end of the btree. because
+			 * we are only searching a small chunk of the
+			 * btree each search, there must be free
+			 * inodes (unless something is corrupted)
+			 * closer to the parent inode than we
+			 * are now. restart the search again.
+			 */
+			pag->pagl_pagino = NULLAGINO;
+			pag->pagl_leftrec = NULLAGINO;
+			pag->pagl_rightrec = NULLAGINO;
+			xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
+			xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+			goto restart_pagno;
+		}
 	}
 
 	/*