Reviewing determinism of xfs_reclaim_inodes_ag()

Message ID 20170425082503.GN28800@wotan.suse.de
State: New

Commit Message

Luis Chamberlain April 25, 2017, 8:25 a.m. UTC
I've been staring at xfs_reclaim_inodes_ag() for a bit now to determine
if and how deterministic it is. Fortunately the code makes it relatively
clear it chugs on 32 xfs_inodes at a time before cond_resched()'ing. I was
originally concerned about whether we cond_resched() even when we goto
restart, but it seems we do.

While reviewing this, however, I ran into the following question based on a
comment in the routine. Is a change such as the one below needed?


This is what the code looks like before but pay special attention to the
comment and a race with RCU:

                                if (done || xfs_reclaim_inode_grab(ip, flags))
                                        batch[i] = NULL;

                                /*
                                 * Update the index for the next lookup.
                                 * Catch overflows into the next AG range
                                 * which can occur if we have inodes in the
                                 * last block of the AG and we are
                                 * currently pointing to the last inode.
                                 *
                                 * Because we may see inodes that are from
                                 * the wrong AG due to RCU freeing and
                                 * reallocation, only update the index if
                                 * it lies in this AG. It was a race that
                                 * lead us to see this inode, so another
                                 * lookup from the same index will not find
                                 * it again.
                                 */
                                if (XFS_INO_TO_AGNO(mp, ip->i_ino) !=
                                                                pag->pag_agno)
                                        continue;
                                first_index = XFS_INO_TO_AGINO(mp, ip->i_ino + 1);
                                if (first_index < XFS_INO_TO_AGINO(mp, ip->i_ino))
                                        done = 1;

This leads me to believe that without the above change we might be doing an
xfs_reclaim_inode_grab() and a subsequent reclaim for an AG which does not
match the pag, given xfs_reclaim_inode_grab() does not seem to check that the
pag matches.

a) It's unclear whether xfs_reclaim_inode() will find the inode after the
first xfs_reclaim_inode_grab(), as it's a race.

b) It's unclear what the implications of trying xfs_reclaim_inode_grab()
followed by xfs_reclaim_inode() without holding the correct pag mutex could be.

I checked with Jan Kara and he believes the current code is correct but that
it's the comment that may be misleading. As per Jan, the race is between
getting an inode reclaimed and grabbing it. I.e., XFS frees the inodes by RCU.
However it doesn't actually *reuse* the inode until the RCU period passes
(unlike inodes allocated from a slab with SLAB_RCU, which can be). So it can
happen that the inode we have (ip) is actually already reclaimed by the time
we get to xfs_reclaim_inode_grab(), and that function detects this (as
XFS_IRECLAIM is set). Also, ip->i_ino will be set to 0 in that case, so we
must really avoid using it to update 'first_index' and 'done'. But it seems
to Jan that we take the 'continue' branch only if xfs_reclaim_inode_grab()
returned 1 before.
                                                                                                                                                                                
If Jan is correct, the comment saying the inode may be reallocated seems to
be wrong and pretty misleading, and the change I suggest above would not be
needed.

Thoughts?

  Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Dave Chinner April 26, 2017, 12:04 a.m. UTC | #1
[Luis, FYI - lots of stray whitespace in this email - I've trimmed
it from my reply...]

On Tue, Apr 25, 2017 at 10:25:03AM +0200, Luis R. Rodriguez wrote:
> I've been staring at xfs_reclaim_inodes_ag() for a bit now to determine
> if and how its deterministic it is. Fortunately the code makes it relatively
> clear it chugs on 32 xfs_inodes at at a time before cond_resched()'ing. I was
> concerned originally if we cond_resched() even if we goto restart but it seems
> we do.
> 
> While reviewing this however I ran into the following question based on a
> comment on the routine. Is such a thing as the below change needed ?
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 3531f8f72fa5..05f3c79b4f11 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1157,7 +1157,9 @@ xfs_reclaim_inodes_ag(
>  			for (i = 0; i < nr_found; i++) {
>  				struct xfs_inode *ip = batch[i];
>  
> -				if (done || xfs_reclaim_inode_grab(ip, flags))
> +				if (done ||
> +				    XFS_INO_TO_AGNO(mp, ip->i_ino) != pag->pag_agno ||
> +				    xfs_reclaim_inode_grab(ip, flags))
>  					batch[i] = NULL;

If a new check is required here, it belongs in
xfs_reclaim_inode_grab(), not as separate logic in the if statement
here, as it needs to be serialised with the other checks the grab does.

As it is, the grab code first checks for a zero ip->i_ino, which will
catch all cases where the inode is still in the RCU grace period and
a lookup has raced. If the inode is not under reclaim (i.e. the inode
number is non-zero) then the grab code checks and sets XFS_IRECLAIM
under a spinlock, thereby preventing all further inode cache
lookup+grab operations from succeeding until reclaim completes and
the RCU grace period expires and frees the inode. I.e. if the
XFS_IRECLAIM flag is already set, then the inode is skipped
regardless of the inode number because it's already on its way to
being freed by RCU.

Hence, IMO, adding this check here doesn't change lookup/reclaim
processing correctness.

> This is what the code looks like before but pay special attention to the
> comment and a race with RCU:
> 
>                                 if (done || xfs_reclaim_inode_grab(ip, flags))
>                                         batch[i] = NULL;
>
>                                 /*
>                                  * Update the index for the next lookup.
>                                  * Catch overflows into the next AG range
>                                  * which can occur if we have inodes in the
>                                  * last block of the AG and we are
>                                  * currently pointing to the last inode.
>                                  *
>                                  * Because we may see inodes that are from
>                                  * the wrong AG due to RCU freeing and
>                                  * reallocation, only update the index if
>                                  * it lies in this AG. It was a race that
>                                  * lead us to see this inode, so another
>                                  * lookup from
>                                  * the same index will not find it again.
>                                  */
>                                 if (XFS_INO_TO_AGNO(mp, ip->i_ino) !=
>                                                                 pag->pag_agno)
>                                         continue;

It's stale code left over from development of the RCU freeing of
inodes where .....

> I checked with Jan Kara and he believes the current code is correct but that
> its the comment that that may be misleading. As per Jan the race is between
> getting an inode reclaimed and grabbing it. Ie, XFS frees the inodes by RCU.
> However it doesn't actually *reuse* the inode until RCU period passes
> (unlike inodes allocated from slab with SLAB_RCU can be). So it can happen

..... I initially tried using SLAB_DESTROY_BY_RCU which meant the
RCU grace period did not prevent reallocation of inodes that had
been freed. Hence this check was (once) necessary to prevent the
reclaim index going whacky on a reallocated inode.

However, I dropped the SLAB_DESTROY_BY_RCU behaviour because I
couldn't get it to work reliably and race free, and there was no
measurable improvement in performance from just using call_rcu() to
delay the freeing of individual inodes to the end of the RCU grace
period.

In hindsight, I forgot to remove the second comment hunk and the
check from what was eventually committed as 1a3e8f3da09c ("xfs: convert
inode cache lookups to use RCU locking"). It also appears the commit
message was not updated. It says:

    To avoid the read vs write contention, change the cache to use
    RCU locking on the read side. To avoid needing to RCU free every
    single inode, use the built in slab RCU freeing mechanism. This
    requires us to be able to detect lookups of freed inodes, [...]

So, really, the comment+check you're confused about can simply be
removed....

Cheers,

Dave.
Jan Kara April 26, 2017, 7:34 a.m. UTC | #2
On Wed 26-04-17 10:04:26, Dave Chinner wrote:
> On Tue, Apr 25, 2017 at 10:25:03AM +0200, Luis R. Rodriguez wrote:
> > I've been staring at xfs_reclaim_inodes_ag() for a bit now to determine
> > if and how its deterministic it is. Fortunately the code makes it relatively
> > clear it chugs on 32 xfs_inodes at at a time before cond_resched()'ing. I was
> > concerned originally if we cond_resched() even if we goto restart but it seems
> > we do.
> > 
> > While reviewing this however I ran into the following question based on a
> > comment on the routine. Is such a thing as the below change needed ?
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 3531f8f72fa5..05f3c79b4f11 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -1157,7 +1157,9 @@ xfs_reclaim_inodes_ag(
> >  			for (i = 0; i < nr_found; i++) {
> >  				struct xfs_inode *ip = batch[i];
> >  
> > -				if (done || xfs_reclaim_inode_grab(ip, flags))
> > +				if (done ||
> > +				    XFS_INO_TO_AGNO(mp, ip->i_ino) != pag->pag_agno ||
> > +				    xfs_reclaim_inode_grab(ip, flags))
> >  					batch[i] = NULL;
> 
> If a new check is required here, it belongs in
> xfs_reclaim_inode_grab(), not as separate logic in the if statement
> here as it needs to be serialised with other checks the grab does.
> 
> As it is, the grab code first checks for a zero ip->ino, which will
> catch all cases where the inode is still in the RCU grace period and
> a lookup has raced. If the inode is not under reclaim (i.e the inode
> number is non-zero) then the grab code checks and sets XFS_IRECLAIM
> under a spinlock, thereby prevent all further inode cache
> lookup+grab operations from succeeding until reclaim completes and
> the RCU grace period expires and frees the inode. i.e. if the
> XFS_IRECLAIM flag is already set, then the inode is skipped
> regardless of the inode number because it's already on it's way to
> being freed by RCU.
> 
> Hence, IMO, adding this check here doesn't change lookup/reclaim
> processing correctness.

Agreed, the code looks correct as is to me.

> > This is what the code looks like before but pay special attention to the
> > comment and a race with RCU:
> > 
> >                                 if (done || xfs_reclaim_inode_grab(ip, flags))
> >                                         batch[i] = NULL;
> >
> >                                 /*
> >                                  * Update the index for the next lookup.
> >                                  * Catch overflows into the next AG range
> >                                  * which can occur if we have inodes in the
> >                                  * last block of the AG and we are
> >                                  * currently pointing to the last inode.
> >                                  *
> >                                  * Because we may see inodes that are from
> >                                  * the wrong AG due to RCU freeing and
> >                                  * reallocation, only update the index if
> >                                  * it lies in this AG. It was a race that
> >                                  * lead us to see this inode, so another
> >                                  * lookup from
> >                                  * the same index will not find it again.
> >                                  */
> >                                 if (XFS_INO_TO_AGNO(mp, ip->i_ino) !=
> >                                                                 pag->pag_agno)
> >                                         continue;
> 
> It's stale code left over from development of the RCU freeing of
> inodes where .....
> 
> > I checked with Jan Kara and he believes the current code is correct but that
> > its the comment that that may be misleading. As per Jan the race is between
> > getting an inode reclaimed and grabbing it. Ie, XFS frees the inodes by RCU.
> > However it doesn't actually *reuse* the inode until RCU period passes
> > (unlike inodes allocated from slab with SLAB_RCU can be). So it can happen
> 
> ..... I initially tried using SLAB_DESTROY_BY_RCU which meant the
> RCU grace period did not prevent reallocation of inodes that had
> been freed. Hence this check was (once) necessary to prevent the
> reclaim index going whacky on a reallocated inode.
> 
> However, I dropped the SLAB_DESTROY_BY_RCU behaviour because I
> couldn't get it to work reliably and race free, and there was no
> measurable improvement in performance just using call_rcu() t delay
> the freeing of individual inodes to the end of the RCU grace period.

Yeah, so SLAB_DESTROY_BY_RCU is beneficial only if you have a workload
where getting CPU cache-hot objects on allocation improves performance
measurably. I remember quite a few years ago Nick Piggin had a workload
where this was advantageous for inodes (like opening & closing tons of
files so that inodes were flowing in and out of the cache; I don't remember
the details anymore), but it was mostly a corner case.

> In hindsight, I forgot to remove the second comment hunk and the
> check that was eventually committed as 1a3e8f3da09c ("xfs: convert
> inode cache lookups to use RCU locking"). It also appears the commit
> message was not updated, too. It says:
> 
>     To avoid the read vs write contention, change the cache to use
>     RCU locking on the read side. To avoid needing to RCU free every
>     single inode, use the built in slab RCU freeing mechanism. This
>     requires us to be able to detect lookups of freed inodes, [...]
> 
> So, really, the comment+check you're confused about can simply be
> removed....

Thanks for the confirmation and reference! Looking at that commit, there seem
to be more comments speaking about reallocation which probably need
updating - e.g. the one in xfs_reclaim_inode_grab(). Luis, will you take
care of that?

								Honza
Luis Chamberlain April 26, 2017, 9:12 a.m. UTC | #3
On Wed, Apr 26, 2017 at 10:04:26AM +1000, Dave Chinner wrote:
> On Tue, Apr 25, 2017 at 10:25:03AM +0200, Luis R. Rodriguez wrote:
> > I checked with Jan Kara and he believes the current code is correct but that
> > its the comment that that may be misleading. As per Jan the race is between
> > getting an inode reclaimed and grabbing it. Ie, XFS frees the inodes by RCU.
> > However it doesn't actually *reuse* the inode until RCU period passes
> > (unlike inodes allocated from slab with SLAB_RCU can be). So it can happen
> 
> ..... I initially tried using SLAB_DESTROY_BY_RCU which meant the
> RCU grace period did not prevent reallocation of inodes that had
> been freed. Hence this check was (once) necessary to prevent the
> reclaim index going whacky on a reallocated inode.

Alright, this helps, but why does *having* the RCU grace period prevent
this type of race? I can see it helping, but does it completely remove
such a race as a possibility? Also, just so I'm sure I am following: does
this then imply our reclaim rate is directly linked to the RCU grace
period?

  Luis
Jan Kara April 26, 2017, 10:55 a.m. UTC | #4
On Wed 26-04-17 11:12:06, Luis R. Rodriguez wrote:
> On Wed, Apr 26, 2017 at 10:04:26AM +1000, Dave Chinner wrote:
> > On Tue, Apr 25, 2017 at 10:25:03AM +0200, Luis R. Rodriguez wrote:
> > > I checked with Jan Kara and he believes the current code is correct but that
> > > its the comment that that may be misleading. As per Jan the race is between
> > > getting an inode reclaimed and grabbing it. Ie, XFS frees the inodes by RCU.
> > > However it doesn't actually *reuse* the inode until RCU period passes
> > > (unlike inodes allocated from slab with SLAB_RCU can be). So it can happen
> > 
> > ..... I initially tried using SLAB_DESTROY_BY_RCU which meant the
> > RCU grace period did not prevent reallocation of inodes that had
> > been freed. Hence this check was (once) necessary to prevent the
> > reclaim index going whacky on a reallocated inode.
> 
> Alright this helps, but why does *having* the RCU grace period prevent
> such type of race ? I can see it helping but removing completely such
> a race as a possibility ?

Well, if the inode is freed only after the RCU period expires, and we are
doing xfs_reclaim_inode_grab() under rcu_read_lock() - which we are - then
this surely prevents us from seeing the inode reallocated. What are you
missing?

> Also, just so I understand I am following, this then implicates our
> reclaim rate is directly linked to the RCU grace period ?

Yes, as for any RCU-freed object...

								Honza
Luis Chamberlain May 6, 2017, 5:41 p.m. UTC | #5
On Wed, Apr 26, 2017 at 12:55:56PM +0200, Jan Kara wrote:
> On Wed 26-04-17 11:12:06, Luis R. Rodriguez wrote:
> > On Wed, Apr 26, 2017 at 10:04:26AM +1000, Dave Chinner wrote:
> > > On Tue, Apr 25, 2017 at 10:25:03AM +0200, Luis R. Rodriguez wrote:
> > > > I checked with Jan Kara and he believes the current code is correct but that
> > > > its the comment that that may be misleading. As per Jan the race is between
> > > > getting an inode reclaimed and grabbing it. Ie, XFS frees the inodes by RCU.
> > > > However it doesn't actually *reuse* the inode until RCU period passes
> > > > (unlike inodes allocated from slab with SLAB_RCU can be). So it can happen
> > > 
> > > ..... I initially tried using SLAB_DESTROY_BY_RCU which meant the
> > > RCU grace period did not prevent reallocation of inodes that had
> > > been freed. Hence this check was (once) necessary to prevent the
> > > reclaim index going whacky on a reallocated inode.
> > 
> > Alright this helps, but why does *having* the RCU grace period prevent
> > such type of race ? I can see it helping but removing completely such
> > a race as a possibility ?
> 
> Well, if the inode is freed only after RCU period expires and we are doing
> xfs_reclaim_inode_grab() under rcu_read_lock - which we are - then this
> surely prevents us from seeing inode reallocated. What are you missing?

Right, OK, fair - it's just simple RCU by definition.

> > Also, just so I understand I am following, this then implicates our
> > reclaim rate is directly linked to the RCU grace period ?
> 
> Yes, as for any RCU-freed object...

Right... I see, this is also by definition.

But also by definition the RCU grace period should be long enough that "any
readers accessing the item being deleted have since dropped their
references". What are the implications if, during xfs reclaim, this is not
true *often*? I'm not sure what types of situations could cause this;
perhaps a full rsync without first suspending work, plus heavy IO? Let's
call these contended xfs inodes. Could we not, in theory, reach:

∑ contended xfs inodes > free xfs inodes

If this situation is dire, what countermeasures are / should be in place for
it? If this is all expected and gravy then I suspect there is no issue and
the non-determinism of the above is fair game.

  Luis
Luis Chamberlain May 6, 2017, 5:52 p.m. UTC | #6
On Sat, May 06, 2017 at 07:41:10PM +0200, Luis R. Rodriguez wrote:
> On Wed, Apr 26, 2017 at 12:55:56PM +0200, Jan Kara wrote:
> > On Wed 26-04-17 11:12:06, Luis R. Rodriguez wrote:
> > > On Wed, Apr 26, 2017 at 10:04:26AM +1000, Dave Chinner wrote:
> > > > On Tue, Apr 25, 2017 at 10:25:03AM +0200, Luis R. Rodriguez wrote:
> > > > > I checked with Jan Kara and he believes the current code is correct but that
> > > > > its the comment that that may be misleading. As per Jan the race is between
> > > > > getting an inode reclaimed and grabbing it. Ie, XFS frees the inodes by RCU.
> > > > > However it doesn't actually *reuse* the inode until RCU period passes
> > > > > (unlike inodes allocated from slab with SLAB_RCU can be). So it can happen
> > > > 
> > > > ..... I initially tried using SLAB_DESTROY_BY_RCU which meant the
> > > > RCU grace period did not prevent reallocation of inodes that had
> > > > been freed. Hence this check was (once) necessary to prevent the
> > > > reclaim index going whacky on a reallocated inode.
> > > 
> > > Alright this helps, but why does *having* the RCU grace period prevent
> > > such type of race ? I can see it helping but removing completely such
> > > a race as a possibility ?
> > 
> > Well, if the inode is freed only after RCU period expires and we are doing
> > xfs_reclaim_inode_grab() under rcu_read_lock - which we are - then this
> > surely prevents us from seeing inode reallocated. What are you missing?
> 
> Right, OK fair, its just simple RCU by definition.
> 
> > > Also, just so I understand I am following, this then implicates our
> > > reclaim rate is directly linked to the RCU grace period ?
> > 
> > Yes, as for any RCU-freed object...
> 
> Right.. I see, this is also by definition.
> 
> But also by definition the RCU grace period should be long that "any readers
> accessing the item being deleted have since dropped their references".  What
> are the implications if during xfs reclaim this is not true *often* ? Not sure
> what types of situations could implicate this, perhaps a full rsync without
> first suspending work and heavy IO ? Lets call these contended xfs inodes.
> Could in theory we not reach:
> 
> ∑ contended xfs inodes > free xfs inodes
> 
> If this situation is dire, what counter measures are / should be in place for
> it ? If this is all expected and gravy then I suspect there is no issue and
> the non-determinism of the above is fair game.

Let's also recall that:

====
	Just as with spinlocks, RCU readers are not permitted to
	block, switch to user-mode execution, or enter the idle loop.
	Therefore, as soon as a CPU is seen passing through any of these
	three states, we know that that CPU has exited any previous RCU
	read-side critical sections.  So, if we remove an item from a
	linked list, and then wait until all CPUs have switched context,
	executed in user mode, or executed in the idle loop, we can
	safely free up that item.
====

So any "contended xfs inodes" should also be really busying out the CPU,
and if we only have X CPUs, well that gives us an upper limit before
we busy the hell out ?

  Luis
Jan Kara May 9, 2017, 10:41 a.m. UTC | #7
On Sat 06-05-17 19:52:12, Luis R. Rodriguez wrote:
> On Sat, May 06, 2017 at 07:41:10PM +0200, Luis R. Rodriguez wrote:
> > On Wed, Apr 26, 2017 at 12:55:56PM +0200, Jan Kara wrote:
> > > On Wed 26-04-17 11:12:06, Luis R. Rodriguez wrote:
> > > > On Wed, Apr 26, 2017 at 10:04:26AM +1000, Dave Chinner wrote:
> > > > > On Tue, Apr 25, 2017 at 10:25:03AM +0200, Luis R. Rodriguez wrote:
> > > > > > I checked with Jan Kara and he believes the current code is correct but that
> > > > > > its the comment that that may be misleading. As per Jan the race is between
> > > > > > getting an inode reclaimed and grabbing it. Ie, XFS frees the inodes by RCU.
> > > > > > However it doesn't actually *reuse* the inode until RCU period passes
> > > > > > (unlike inodes allocated from slab with SLAB_RCU can be). So it can happen
> > > > > 
> > > > > ..... I initially tried using SLAB_DESTROY_BY_RCU which meant the
> > > > > RCU grace period did not prevent reallocation of inodes that had
> > > > > been freed. Hence this check was (once) necessary to prevent the
> > > > > reclaim index going whacky on a reallocated inode.
> > > > 
> > > > Alright this helps, but why does *having* the RCU grace period prevent
> > > > such type of race ? I can see it helping but removing completely such
> > > > a race as a possibility ?
> > > 
> > > Well, if the inode is freed only after RCU period expires and we are doing
> > > xfs_reclaim_inode_grab() under rcu_read_lock - which we are - then this
> > > surely prevents us from seeing inode reallocated. What are you missing?
> > 
> > Right, OK fair, its just simple RCU by definition.
> > 
> > > > Also, just so I understand I am following, this then implicates our
> > > > reclaim rate is directly linked to the RCU grace period ?
> > > 
> > > Yes, as for any RCU-freed object...
> > 
> > Right.. I see, this is also by definition.
> > 
> > But also by definition the RCU grace period should be long that "any readers
> > accessing the item being deleted have since dropped their references".  What
> > are the implications if during xfs reclaim this is not true *often* ? Not sure
> > what types of situations could implicate this, perhaps a full rsync without
> > first suspending work and heavy IO ? Lets call these contended xfs inodes.
> > Could in theory we not reach:
> > 
> > ∑ contended xfs inodes > free xfs inodes
> > 
> > If this situation is dire, what counter measures are / should be in place for
> > it ? If this is all expected and gravy then I suspect there is no issue and
> > the non-determinism of the above is fair game.
> 
> Lets also recall that:
> 
> ====
> 	Just as with spinlocks, RCU readers are not permitted to
> 	block, switch to user-mode execution, or enter the idle loop.
> 	Therefore, as soon as a CPU is seen passing through any of these
> 	three states, we know that that CPU has exited any previous RCU
> 	read-side critical sections.  So, if we remove an item from a
> 	linked list, and then wait until all CPUs have switched context,
> 	executed in user mode, or executed in the idle loop, we can
> 	safely free up that item.
> ====
> 
> So any "contended xfs inodes" should also be really busying out the CPU,
> and if we only have X CPUs, well that gives us an upper limit before
> we busy the hell out ?

Well, the RCU grace period is a system-global thing - all rcu_read_lock()
users in the kernel block the grace period from finishing. You can read
more about RCU in Documentation/RCU/ or on LWN. Anyway, since holders of
rcu_read_lock() are not allowed to sleep, the expected length of the grace
period is in milliseconds at most. So inodes freed by xfs_inode_free() will
be released to the slab cache with that delay.

								Honza

Patch

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 3531f8f72fa5..05f3c79b4f11 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1157,7 +1157,9 @@  xfs_reclaim_inodes_ag(
 			for (i = 0; i < nr_found; i++) {
 				struct xfs_inode *ip = batch[i];
 
-				if (done || xfs_reclaim_inode_grab(ip, flags))
+				if (done ||
+				    XFS_INO_TO_AGNO(mp, ip->i_ino) != pag->pag_agno ||
+				    xfs_reclaim_inode_grab(ip, flags))
 					batch[i] = NULL;
 
 				/*