diff mbox series

[RFC,1/4] xfs: support deactivating AGs

Message ID 20210414195240.1802221-2-hsiangkao@redhat.com (mailing list archive)
State Deferred, archived
Headers show
Series xfs: support shrinking empty AGs | expand

Commit Message

Gao Xiang April 14, 2021, 7:52 p.m. UTC
To get rid of paralleled requests related to AGs which are pending
for shrinking, mark these perags as inactive rather than playing
with per-ag structures theirselves.

Since in that way, a per-ag lock can be used to stablize the inactive
status together with agi/agf buffer lock (which is much easier than
adding more complicated perag_{get, put} pairs..) Also, Such per-ags
can be released / reused when unmountfs / growfs.

On the read side, pag_inactive_rwsem can be unlocked immediately after
the agf or agi buffer lock is acquired. However, pag_inactive_rwsem
can only be unlocked after the agf/agi buffer locks are all acquired
with the inactive status on the write side.

XXX: maybe there are some missing cases.

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
---
 fs/xfs/libxfs/xfs_ag.c     | 16 +++++++++++++---
 fs/xfs/libxfs/xfs_alloc.c  | 12 +++++++++++-
 fs/xfs/libxfs/xfs_ialloc.c | 26 +++++++++++++++++++++++++-
 fs/xfs/xfs_mount.c         |  2 ++
 fs/xfs/xfs_mount.h         |  6 ++++++
 5 files changed, 57 insertions(+), 5 deletions(-)

Comments

Dave Chinner April 15, 2021, 3:42 a.m. UTC | #1
On Thu, Apr 15, 2021 at 03:52:37AM +0800, Gao Xiang wrote:
> To get rid of paralleled requests related to AGs which are pending
> for shrinking, mark these perags as inactive rather than playing
> with per-ag structures theirselves.
> 
> Since in that way, a per-ag lock can be used to stablize the inactive
> status together with agi/agf buffer lock (which is much easier than
> adding more complicated perag_{get, put} pairs..) Also, Such per-ags
> can be released / reused when unmountfs / growfs.
> 
> On the read side, pag_inactive_rwsem can be unlocked immediately after
> the agf or agi buffer lock is acquired. However, pag_inactive_rwsem
> can only be unlocked after the agf/agi buffer locks are all acquired
> with the inactive status on the write side.
> 
> XXX: maybe there are some missing cases.
> 
> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_ag.c     | 16 +++++++++++++---
>  fs/xfs/libxfs/xfs_alloc.c  | 12 +++++++++++-
>  fs/xfs/libxfs/xfs_ialloc.c | 26 +++++++++++++++++++++++++-
>  fs/xfs/xfs_mount.c         |  2 ++
>  fs/xfs/xfs_mount.h         |  6 ++++++
>  5 files changed, 57 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
> index c68a36688474..ba5702e5c9ad 100644
> --- a/fs/xfs/libxfs/xfs_ag.c
> +++ b/fs/xfs/libxfs/xfs_ag.c
> @@ -676,16 +676,24 @@ xfs_ag_get_geometry(
>  	if (agno >= mp->m_sb.sb_agcount)
>  		return -EINVAL;
>  
> +	pag = xfs_perag_get(mp, agno);
> +	down_read(&pag->pag_inactive_rwsem);

No need to encode the lock type in the lock name. We know it's a
rwsem from the lock API functions...

> +
> +	if (pag->pag_inactive) {
> +		error = -EBUSY;
> +		up_read(&pag->pag_inactive_rwsem);
> +		goto out;
> +	}

This looks kinda heavyweight. Having to take a rwsem whenever we do
a perag lookup to determine if we can access the perag completely
defeats the purpose of xfs_perag_get() being a lightweight, lockless
operation.

I suspect what we really want here is active/passive references like
are used for the superblock, and an API that hides the
implementation from all the callers.


> +
>  	/* Lock the AG headers. */
>  	error = xfs_ialloc_read_agi(mp, NULL, agno, &agi_bp);
> +	up_read(&pag->pag_inactive_rwsem);
>  	if (error)
> -		return error;
> +		goto out;
>  	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agf_bp);
>  	if (error)
>  		goto out_agi;

This seems to imply that we don't need to hold the inactive rwsem
to read the AGF. Either we need the semaphore held in read mode to
protect the inactive state for the entire operation we are
performing, or you haven't documented how the locking is supposed to
guarantee existence once the inactive_rwsem has been dropped.

Instead, lets' look at this as an active vs passive reference issue.
Currently everything takes passive references because we don't
enforce the reference count for anything - it's largely to tell us
that everything that took a reference also released it by the time
we tear down the perag.

So say that buffers take passive references because we need buffers
to be used while we tear down the AG, and that everything else that
does perag_get()...perag_put() in a loop or single logic context is
an active reference, we can actually use active references to
prevent a shrink from stepping into the perag while something else
is actively referencing the perag.

Hence we convert the sites in this patch to use
xfs_perag_get_active() ...  xfs_perag_put_active() pairs, and
everything else gets hidden away inside those functions.

That allows us to implement active references as an atomic counter
that is manipulated under RCU context to provide existence
guarantees for the structure.  i.e. something like this:

/*
 * Get an active reference to the perag for a given agno. If the AG
 * is in the process of being torn down, the active reference count
 * will be zero and this lookup will fail and return NULL.
 */
static inline struct xfs_perag *
xfs_perag_get_active(
	struct xfs_mount	*mp,
	xfs_agnumber_t		agno)
{
	struct xfs_perag        *pag;

	rcu_read_lock();
	pag = radix_tree_lookup(&mp->m_perag_tree, agno);
	if (pag) {
		if (!atomic_inc_not_zero(&pag->pag_active_ref))
			pag = NULL;
	}
	rcu_read_unlock();
	trace_xfs_perag_get(mp, agno, ref, _RET_IP_);
	return pag;
}

static inline void
xfs_perag_put_active(
	struct xfs_perag	*pag)
{
	atomic_dec(&pag->pag_active_ref);
}

For active references to work as an access barrier,
xfs_perag_initialise() needs to first take an active reference for
the mount when it is allocated. This makes the initial value
non-zero, so active references can be taken at any time after
initialisation.

When shrink needs to remove the AG, it first drops the mount's
active reference to the AG. This allows the active reference count
to fall to zero. It then waits for it to hit zero, and once it does
all future active reference lookups will fail to find this perag.
Hence shrink now can start the process of tearing down the AG
structures knowing that nobody is going to be using the AG for
per-ag based work...

This has no more overhead than the a passive lookup, and does not
require a rwsem or a separate inactive state flag to detect or
switch between active and inactive states. It's also trivially
reversable - the mount just has to take an active reference once the
perag has been re-initialised.

Cheers,

Dave.
Gao Xiang April 15, 2021, 4:28 a.m. UTC | #2
Hi Dave,

On Thu, Apr 15, 2021 at 01:42:55PM +1000, Dave Chinner wrote:
> On Thu, Apr 15, 2021 at 03:52:37AM +0800, Gao Xiang wrote:
> > To get rid of paralleled requests related to AGs which are pending
> > for shrinking, mark these perags as inactive rather than playing
> > with per-ag structures theirselves.
> > 
> > Since in that way, a per-ag lock can be used to stablize the inactive
> > status together with agi/agf buffer lock (which is much easier than
> > adding more complicated perag_{get, put} pairs..) Also, Such per-ags
> > can be released / reused when unmountfs / growfs.
> > 
> > On the read side, pag_inactive_rwsem can be unlocked immediately after
> > the agf or agi buffer lock is acquired. However, pag_inactive_rwsem
> > can only be unlocked after the agf/agi buffer locks are all acquired
> > with the inactive status on the write side.
> > 
> > XXX: maybe there are some missing cases.
> > 
> > Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_ag.c     | 16 +++++++++++++---
> >  fs/xfs/libxfs/xfs_alloc.c  | 12 +++++++++++-
> >  fs/xfs/libxfs/xfs_ialloc.c | 26 +++++++++++++++++++++++++-
> >  fs/xfs/xfs_mount.c         |  2 ++
> >  fs/xfs/xfs_mount.h         |  6 ++++++
> >  5 files changed, 57 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
> > index c68a36688474..ba5702e5c9ad 100644
> > --- a/fs/xfs/libxfs/xfs_ag.c
> > +++ b/fs/xfs/libxfs/xfs_ag.c
> > @@ -676,16 +676,24 @@ xfs_ag_get_geometry(
> >  	if (agno >= mp->m_sb.sb_agcount)
> >  		return -EINVAL;
> >  
> > +	pag = xfs_perag_get(mp, agno);
> > +	down_read(&pag->pag_inactive_rwsem);
> 
> No need to encode the lock type in the lock name. We know it's a
> rwsem from the lock API functions...
> 
> > +
> > +	if (pag->pag_inactive) {
> > +		error = -EBUSY;
> > +		up_read(&pag->pag_inactive_rwsem);
> > +		goto out;
> > +	}
> 
> This looks kinda heavyweight. Having to take a rwsem whenever we do
> a perag lookup to determine if we can access the perag completely
> defeats the purpose of xfs_perag_get() being a lightweight, lockless
> operation.

I'm not sure if it has some regression since write lock will be only
taken when shrinking (shrinking is a rare operation), for most cases
which is much similiar to perag radix root I think.

The locking logic is that, when pag->pag_inactive = false -> true,
the write lock, AGF/AGI locks all have to be taken in advance.

> 
> I suspect what we really want here is active/passive references like
> are used for the superblock, and an API that hides the
> implementation from all the callers.

If my understanding is correct, my own observation these months is
that the current XFS codebase is not well suitable to accept !pag
(due to many logic assumes pag structure won't go away, since some
 are indexed/passed by agno rather than some pag reference count).

Even I think we could introduce some active references, but handle
the cover range is still a big project. The current approach assumes
pag won't go away except for umounting and blocks allocation / imap/
... paths to access that.

My current thought is that we could implement it in that way as the
first step (in order to land the shrinking functionality to let
end-users benefit of this), and by the codebase evolves, it can be
transformed to a more gentle way.

Thanks,
Gao Xiang
Dave Chinner April 15, 2021, 6:28 a.m. UTC | #3
On Thu, Apr 15, 2021 at 12:28:37PM +0800, Gao Xiang wrote:
> Hi Dave,
> 
> On Thu, Apr 15, 2021 at 01:42:55PM +1000, Dave Chinner wrote:
> > On Thu, Apr 15, 2021 at 03:52:37AM +0800, Gao Xiang wrote:
> > > To get rid of paralleled requests related to AGs which are pending
> > > for shrinking, mark these perags as inactive rather than playing
> > > with per-ag structures theirselves.
> > > 
> > > Since in that way, a per-ag lock can be used to stablize the inactive
> > > status together with agi/agf buffer lock (which is much easier than
> > > adding more complicated perag_{get, put} pairs..) Also, Such per-ags
> > > can be released / reused when unmountfs / growfs.
> > > 
> > > On the read side, pag_inactive_rwsem can be unlocked immediately after
> > > the agf or agi buffer lock is acquired. However, pag_inactive_rwsem
> > > can only be unlocked after the agf/agi buffer locks are all acquired
> > > with the inactive status on the write side.
> > > 
> > > XXX: maybe there are some missing cases.
> > > 
> > > Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_ag.c     | 16 +++++++++++++---
> > >  fs/xfs/libxfs/xfs_alloc.c  | 12 +++++++++++-
> > >  fs/xfs/libxfs/xfs_ialloc.c | 26 +++++++++++++++++++++++++-
> > >  fs/xfs/xfs_mount.c         |  2 ++
> > >  fs/xfs/xfs_mount.h         |  6 ++++++
> > >  5 files changed, 57 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
> > > index c68a36688474..ba5702e5c9ad 100644
> > > --- a/fs/xfs/libxfs/xfs_ag.c
> > > +++ b/fs/xfs/libxfs/xfs_ag.c
> > > @@ -676,16 +676,24 @@ xfs_ag_get_geometry(
> > >  	if (agno >= mp->m_sb.sb_agcount)
> > >  		return -EINVAL;
> > >  
> > > +	pag = xfs_perag_get(mp, agno);
> > > +	down_read(&pag->pag_inactive_rwsem);
> > 
> > No need to encode the lock type in the lock name. We know it's a
> > rwsem from the lock API functions...
> > 
> > > +
> > > +	if (pag->pag_inactive) {
> > > +		error = -EBUSY;
> > > +		up_read(&pag->pag_inactive_rwsem);
> > > +		goto out;
> > > +	}
> > 
> > This looks kinda heavyweight. Having to take a rwsem whenever we do
> > a perag lookup to determine if we can access the perag completely
> > defeats the purpose of xfs_perag_get() being a lightweight, lockless
> > operation.
> 
> I'm not sure if it has some regression since write lock will be only
> taken when shrinking (shrinking is a rare operation), for most cases
> which is much similiar to perag radix root I think.

It's still an extra pair of atomic operation per xfs_perag_get/put()
call pairs. pag_inactive being true is the slow/rare path, so we
should be trying to keep the overhead of detecting that path out of
all our fast paths...

Indeed, xfs_perag_get() already shows up on profiles because of the
number of calls we make to it, so adding an extra atomic operation
for this operation is going to be noticable in terms of CPU usage,
if nothing else.

> The locking logic is that, when pag->pag_inactive = false -> true,
> the write lock, AGF/AGI locks all have to be taken in advance.
> 
> > 
> > I suspect what we really want here is active/passive references like
> > are used for the superblock, and an API that hides the
> > implementation from all the callers.
> 
> If my understanding is correct, my own observation these months is
> that the current XFS codebase is not well suitable to accept !pag
> (due to many logic assumes pag structure won't go away, since some
>  are indexed/passed by agno rather than some pag reference count).

Maybe so, but that's exactly what this patch is addressing - it's
adding a way to detect that perag has "gone away"i and should not be
referenced any more.

It wasn't until I saw this patch that I realised that there is a way
that we can, in fact, safely handle perags being freed by ensuring
that the RCU lookup fails for these "active" references....

> Even I think we could introduce some active references, but handle
> the cover range is still a big project. The current approach assumes
> pag won't go away except for umounting and blocks allocation / imap/
> ... paths to access that.

The way I described should work just fine - nobody should be
accessing the per-ag without a reference gained through a lookup in
some way.  Buffers carry a passive reference, because the per-ag
teardown will do the teardown of those references before the perag
is freed.  Everything else carries an active reference and so
teardown cannot begin until all active references go away.

> My current thought is that we could implement it in that way as the
> first step (in order to land the shrinking functionality to let
> end-users benefit of this), and by the codebase evolves, it can be
> transformed to a more gentle way.

I think converting this patchset to active/passive references ias
I've described solves the problem entirely - there's no "evolving"
needed as we can solve it with this one structural change...

Cheers,

Dave.
Gao Xiang April 15, 2021, 7:08 a.m. UTC | #4
Hi Dave,

On Thu, Apr 15, 2021 at 04:28:24PM +1000, Dave Chinner wrote:
> On Thu, Apr 15, 2021 at 12:28:37PM +0800, Gao Xiang wrote:
> > Hi Dave,
> > 
> > On Thu, Apr 15, 2021 at 01:42:55PM +1000, Dave Chinner wrote:
> > > On Thu, Apr 15, 2021 at 03:52:37AM +0800, Gao Xiang wrote:
> > > > To get rid of paralleled requests related to AGs which are pending
> > > > for shrinking, mark these perags as inactive rather than playing
> > > > with per-ag structures theirselves.
> > > > 
> > > > Since in that way, a per-ag lock can be used to stablize the inactive
> > > > status together with agi/agf buffer lock (which is much easier than
> > > > adding more complicated perag_{get, put} pairs..) Also, Such per-ags
> > > > can be released / reused when unmountfs / growfs.
> > > > 
> > > > On the read side, pag_inactive_rwsem can be unlocked immediately after
> > > > the agf or agi buffer lock is acquired. However, pag_inactive_rwsem
> > > > can only be unlocked after the agf/agi buffer locks are all acquired
> > > > with the inactive status on the write side.
> > > > 
> > > > XXX: maybe there are some missing cases.
> > > > 
> > > > Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_ag.c     | 16 +++++++++++++---
> > > >  fs/xfs/libxfs/xfs_alloc.c  | 12 +++++++++++-
> > > >  fs/xfs/libxfs/xfs_ialloc.c | 26 +++++++++++++++++++++++++-
> > > >  fs/xfs/xfs_mount.c         |  2 ++
> > > >  fs/xfs/xfs_mount.h         |  6 ++++++
> > > >  5 files changed, 57 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
> > > > index c68a36688474..ba5702e5c9ad 100644
> > > > --- a/fs/xfs/libxfs/xfs_ag.c
> > > > +++ b/fs/xfs/libxfs/xfs_ag.c
> > > > @@ -676,16 +676,24 @@ xfs_ag_get_geometry(
> > > >  	if (agno >= mp->m_sb.sb_agcount)
> > > >  		return -EINVAL;
> > > >  
> > > > +	pag = xfs_perag_get(mp, agno);
> > > > +	down_read(&pag->pag_inactive_rwsem);
> > > 
> > > No need to encode the lock type in the lock name. We know it's a
> > > rwsem from the lock API functions...
> > > 
> > > > +
> > > > +	if (pag->pag_inactive) {
> > > > +		error = -EBUSY;
> > > > +		up_read(&pag->pag_inactive_rwsem);
> > > > +		goto out;
> > > > +	}
> > > 
> > > This looks kinda heavyweight. Having to take a rwsem whenever we do
> > > a perag lookup to determine if we can access the perag completely
> > > defeats the purpose of xfs_perag_get() being a lightweight, lockless
> > > operation.
> > 
> > I'm not sure if it has some regression since write lock will be only
> > taken when shrinking (shrinking is a rare operation), for most cases
> > which is much similiar to perag radix root I think.
> 
> It's still an extra pair of atomic operation per xfs_perag_get/put()
> call pairs. pag_inactive being true is the slow/rare path, so we
> should be trying to keep the overhead of detecting that path out of
> all our fast paths...
> 
> Indeed, xfs_perag_get() already shows up on profiles because of the
> number of calls we make to it, so adding an extra atomic operation
> for this operation is going to be noticable in terms of CPU usage,
> if nothing else.
> 
> > The locking logic is that, when pag->pag_inactive = false -> true,
> > the write lock, AGF/AGI locks all have to be taken in advance.
> > 
> > > 
> > > I suspect what we really want here is active/passive references like
> > > are used for the superblock, and an API that hides the
> > > implementation from all the callers.
> > 
> > If my understanding is correct, my own observation these months is
> > that the current XFS codebase is not well suitable to accept !pag
> > (due to many logic assumes pag structure won't go away, since some
> >  are indexed/passed by agno rather than some pag reference count).
> 
> Maybe so, but that's exactly what this patch is addressing - it's
> adding a way to detect that perag has "gone away"i and should not be
> referenced any more.
> 
> It wasn't until I saw this patch that I realised that there is a way
> that we can, in fact, safely handle perags being freed by ensuring
> that the RCU lookup fails for these "active" references....
> 
> > Even I think we could introduce some active references, but handle
> > the cover range is still a big project. The current approach assumes
> > pag won't go away except for umounting and blocks allocation / imap/
> > ... paths to access that.
> 
> The way I described should work just fine - nobody should be
> accessing the per-ag without a reference gained through a lookup in
> some way.  Buffers carry a passive reference, because the per-ag
> teardown will do the teardown of those references before the perag
> is freed.  Everything else carries an active reference and so
> teardown cannot begin until all active references go away.
> 
> > My current thought is that we could implement it in that way as the
> > first step (in order to land the shrinking functionality to let
> > end-users benefit of this), and by the codebase evolves, it can be
> > transformed to a more gentle way.
> 
> I think converting this patchset to active/passive references ias
> I've described solves the problem entirely - there's no "evolving"
> needed as we can solve it with this one structural change...

Since currently even xfs_perag_put() reaches zero, it won't free
the per-ag anyway (it may just use to mark the pointer is no longer
used in the context? not sure what's the exact use of the such pairs),
so in practice I think after active/passive references are introduced,
there is still the only one real reference count that works for the
per-ag lifetime management and currently it doesn't manage whole
lifetime at all...

So (my own understanding is) I think in practice, that approachs
would be somewhat equal to relocate/rearrange xfs_perag_get()/put()
pair to manage the perag lifetime instead. and maybe use xfs_perag_ptr()
to access perag when some reference count is available.

Thanks,
Gao Xiang

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner April 15, 2021, 8:44 a.m. UTC | #5
On Thu, Apr 15, 2021 at 03:08:33PM +0800, Gao Xiang wrote:
> On Thu, Apr 15, 2021 at 04:28:24PM +1000, Dave Chinner wrote:
> > On Thu, Apr 15, 2021 at 12:28:37PM +0800, Gao Xiang wrote:
> > > My current thought is that we could implement it in that way as the
> > > first step (in order to land the shrinking functionality to let
> > > end-users benefit of this), and by the codebase evolves, it can be
> > > transformed to a more gentle way.
> > 
> > I think converting this patchset to active/passive references ias
> > I've described solves the problem entirely - there's no "evolving"
> > needed as we can solve it with this one structural change...
> 
> Since currently even xfs_perag_put() reaches zero, it won't free
> the per-ag anyway (it may just use to mark the pointer is no longer
> used in the context? not sure what's the exact use of the such pairs),

It tells us when we have an unbalanced get/put pair in the code or
we are failing to clean up everything at unmount time (e.g due to a
shutdown bug). i.e. Unmount on debug kernels will assert fail if the
ref count is not zero, and this indicates either a resource leak or
unbalanced get/put pairing.

I just used exactly this code to debug the active references patch I
just sent. Three different conversion bugs, all count at unmount
time in a couple of different ways. One was:

[   92.219489] XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 150
[   92.223416] ------------[ cut here ]------------
[   92.225597] kernel BUG at fs/xfs/xfs_message.c:110!
[   92.227547] invalid opcode: 0000 [#1] PREEMPT SMP
[   92.229308] CPU: 8 PID: 18310 Comm: umount Not tainted 5.12.0-rc6-dgc+ #3114
[   92.231859] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014
[   92.234801] RIP: 0010:assfail+0x27/0x2d
[   92.235908] Code: 0b 5d c3 66 66 66 66 90 55 41 89 c8 48 89 d1 48 89 f2 48 89 e5 48 c7 c6 f0 b9 58 82 e8 79 f9 ff ff 80 3d 4e 1d 85 02 00 74 02 <0f> 0b 0f 09
[   92.241143] RSP: 0018:ffffc9000ed3bd80 EFLAGS: 00010202
[   92.242623] RAX: 0000000000000000 RBX: ffff8881031de400 RCX: 0000000000000000
[   92.244626] RDX: 00000000ffffffc0 RSI: 0000000000000000 RDI: ffffffff82518041
[   92.246651] RBP: ffffc9000ed3bd80 R08: 0000000000000000 R09: 000000000000000a
[   92.248657] R10: 000000000000000a R11: f000000000000000 R12: 0000000000000001
[   92.250664] R13: ffff8881017775a0 R14: ffff888101777000 R15: ffff888101777578
[   92.252697] FS:  00007f69fe76ac80(0000) GS:ffff8885fec00000(0000) knlGS:0000000000000000
[   92.254973] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   92.256603] CR2: 00007f241f55bad0 CR3: 00000001030ae002 CR4: 0000000000060ee0
[   92.258617] Call Trace:
[   92.259347]  xfs_free_perag+0xc1/0xf0
[   92.260403]  xfs_unmountfs+0xa8/0x130
[   92.261462]  xfs_fs_put_super+0x3a/0xa0
[   92.262556]  generic_shutdown_super+0x6a/0x100
[   92.263815]  kill_block_super+0x27/0x50
[   92.264919]  deactivate_locked_super+0x36/0xa0
[   92.266179]  deactivate_super+0x40/0x50
[   92.267271]  cleanup_mnt+0x135/0x190
[   92.268299]  __cleanup_mnt+0x12/0x20
[   92.269327]  task_work_run+0x61/0xb0
[   92.270349]  exit_to_user_mode_prepare+0x122/0x130
[   92.271706]  syscall_exit_to_user_mode+0x17/0x40
[   92.273034]  do_syscall_64+0x3f/0x50
[   92.274056]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   92.275485] RIP: 0033:0x7f69fe9b0ee7

Another was:

[   75.056242] XFS: Assertion failed: atomic_read(&pag->pag_ref) > 0, file: fs/xfs/libxfs/xfs_sb.c, line: 90
[   75.060870] ------------[ cut here ]------------
[   75.062750] kernel BUG at fs/xfs/xfs_message.c:110!
[   75.064699] invalid opcode: 0000 [#1] PREEMPT SMP
[   75.066437] CPU: 6 PID: 5859 Comm: umount Not tainted 5.12.0-rc6-dgc+ #3113
[   75.068369] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014
[   75.070654] RIP: 0010:assfail+0x27/0x2d
[   75.071760] Code: 0b 5d c3 66 66 66 66 90 55 41 89 c8 48 89 d1 48 89 f2 48 89 e5 48 c7 c6 f0 b9 58 82 e8 79 f9 ff ff 80 3d 4e 1d 85 02 00 74 02 <0f> 0b 0f 09
[   75.077411] RSP: 0018:ffffc9000271bcd0 EFLAGS: 00010202
[   75.078841] RAX: 0000000000000000 RBX: ffff8885c156d000 RCX: 0000000000000000
[   75.080806] RDX: 00000000ffffffc0 RSI: 0000000000000000 RDI: ffffffff82518041
[   75.082755] RBP: ffffc9000271bcd0 R08: 0000000000000000 R09: 000000000000000a
[   75.087486] R10: 000000000000000a R11: f000000000000000 R12: ffff88825a85cc40
[   75.089465] R13: ffff8885c156d000 R14: ffff88825a85cca0 R15: ffff8883b9d7a800
[   75.091419] FS:  00007f5aa4206c80(0000) GS:ffff8883b9d00000(0000) knlGS:0000000000000000
[   75.093659] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   75.095254] CR2: 00007ffd3876cf48 CR3: 000000025897b001 CR4: 0000000000060ee0
[   75.097238] Call Trace:
[   75.097941]  xfs_perag_put+0x8f/0xa0
[   75.098949]  xfs_buf_rele+0x284/0x540
[   75.099999]  xfs_buftarg_drain+0x195/0x250
[   75.101142]  xfs_log_unmount+0x2a/0x80
[   75.102195]  xfs_unmountfs+0x88/0x130
[   75.103221]  xfs_fs_put_super+0x3a/0xa0
[   75.104320]  generic_shutdown_super+0x6a/0x100
[   75.105549]  kill_block_super+0x27/0x50
[   75.106626]  deactivate_locked_super+0x36/0xa0
[   75.107896]  deactivate_super+0x40/0x50
[   75.108969]  cleanup_mnt+0x135/0x190
[   75.109965]  __cleanup_mnt+0x12/0x20
[   75.110977]  task_work_run+0x61/0xb0
[   75.112141]  exit_to_user_mode_prepare+0x122/0x130
[   75.113473]  syscall_exit_to_user_mode+0x17/0x40
[   75.114770]  do_syscall_64+0x3f/0x50
[   75.115791]  entry_SYSCALL_64_after_hwframe+0x44/0xae

IOWs, the passive reference counting we have right now is extremely
useful for debug and triage purposes, even if it not actively used
by the code right now for life cycle management. It was always
intended to form the basis of dynamic perag management (e.g. for
huge filesystems with millions of AGs and a shrinker to manage perag
memory footprint like we do inodes....) and it turns out that it's
also useful for shrink....

> so in practice I think after active/passive references are introduced,
> there is still the only one real reference count that works for the
> per-ag lifetime management and currently it doesn't manage whole
> lifetime at all...

I'm not sure I follow you there - the perag reference count doesn't
manage the lifecycle of the perag object at all - it just keeps
track of the number of current users of the structure...

> So (my own understanding is) I think in practice, that approachs
> would be somewhat equal to relocate/rearrange xfs_perag_get()/put()
> pair to manage the perag lifetime instead. and maybe use xfs_perag_ptr()
> to access perag when some reference count is available.

We pass the actively referenced perag pointer around int eh
structures that use it. The lifetime of the active reference matches
the structure that keeps track of it, so nothing should be doing
lookups that don't take reference counts because they should already
have access to a the relevant perag object that has been looked
up...

Cheers,

Dave.
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index c68a36688474..ba5702e5c9ad 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -676,16 +676,24 @@  xfs_ag_get_geometry(
 	if (agno >= mp->m_sb.sb_agcount)
 		return -EINVAL;
 
+	pag = xfs_perag_get(mp, agno);
+	down_read(&pag->pag_inactive_rwsem);
+
+	if (pag->pag_inactive) {
+		error = -EBUSY;
+		up_read(&pag->pag_inactive_rwsem);
+		goto out;
+	}
+
 	/* Lock the AG headers. */
 	error = xfs_ialloc_read_agi(mp, NULL, agno, &agi_bp);
+	up_read(&pag->pag_inactive_rwsem);
 	if (error)
-		return error;
+		goto out;
 	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agf_bp);
 	if (error)
 		goto out_agi;
 
-	pag = agi_bp->b_pag;
-
 	/* Fill out form. */
 	memset(ageo, 0, sizeof(*ageo));
 	ageo->ag_number = agno;
@@ -707,5 +715,7 @@  xfs_ag_get_geometry(
 	xfs_buf_relse(agf_bp);
 out_agi:
 	xfs_buf_relse(agi_bp);
+out:
+	xfs_perag_put(pag);
 	return error;
 }
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index aaa19101bb2a..01d4e4d4c1d6 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2537,12 +2537,17 @@  xfs_alloc_fix_freelist(
 	/* deferred ops (AGFL block frees) require permanent transactions */
 	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
 
+	down_read(&pag->pag_inactive_rwsem);
+	if (pag->pag_inactive)
+		goto out_no_agbp;
+
 	if (!pag->pagf_init) {
 		error = xfs_alloc_read_agf(mp, tp, args->agno, flags, &agbp);
 		if (error) {
 			/* Couldn't lock the AGF so skip this AG. */
 			if (error == -EAGAIN)
 				error = 0;
+			up_read(&pag->pag_inactive_rwsem);
 			goto out_no_agbp;
 		}
 	}
@@ -2555,13 +2560,16 @@  xfs_alloc_fix_freelist(
 	if (pag->pagf_metadata && (args->datatype & XFS_ALLOC_USERDATA) &&
 	    (flags & XFS_ALLOC_FLAG_TRYLOCK)) {
 		ASSERT(!(flags & XFS_ALLOC_FLAG_FREEING));
+		up_read(&pag->pag_inactive_rwsem);
 		goto out_agbp_relse;
 	}
 
 	need = xfs_alloc_min_freelist(mp, pag);
 	if (!xfs_alloc_space_available(args, need, flags |
-			XFS_ALLOC_FLAG_CHECK))
+			XFS_ALLOC_FLAG_CHECK)) {
+		up_read(&pag->pag_inactive_rwsem);
 		goto out_agbp_relse;
+	}
 
 	/*
 	 * Get the a.g. freespace buffer.
@@ -2573,9 +2581,11 @@  xfs_alloc_fix_freelist(
 			/* Couldn't lock the AGF so skip this AG. */
 			if (error == -EAGAIN)
 				error = 0;
+			up_read(&pag->pag_inactive_rwsem);
 			goto out_no_agbp;
 		}
 	}
+	up_read(&pag->pag_inactive_rwsem);
 
 	/* reset a padding mismatched agfl before final free space check */
 	if (pag->pagf_agflreset)
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index eefdb518fe64..4df218eeb088 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -969,11 +969,15 @@  xfs_ialloc_ag_select(
 	flags = XFS_ALLOC_FLAG_TRYLOCK;
 	for (;;) {
 		pag = xfs_perag_get(mp, agno);
+		down_read(&pag->pag_inactive_rwsem);
 		if (!pag->pagi_inodeok) {
 			xfs_ialloc_next_ag(mp);
 			goto nextag;
 		}
 
+		if (pag->pag_inactive)
+			goto nextag;
+
 		if (!pag->pagi_init) {
 			error = xfs_ialloc_pagi_init(mp, tp, agno);
 			if (error)
@@ -981,6 +985,7 @@  xfs_ialloc_ag_select(
 		}
 
 		if (pag->pagi_freecount) {
+			up_read(&pag->pag_inactive_rwsem);
 			xfs_perag_put(pag);
 			return agno;
 		}
@@ -1016,10 +1021,12 @@  xfs_ialloc_ag_select(
 
 		if (pag->pagf_freeblks >= needspace + ineed &&
 		    longest >= ineed) {
+			up_read(&pag->pag_inactive_rwsem);
 			xfs_perag_put(pag);
 			return agno;
 		}
 nextag:
+		up_read(&pag->pag_inactive_rwsem);
 		xfs_perag_put(pag);
 		/*
 		 * No point in iterating over the rest, if we're shutting
@@ -1776,10 +1783,13 @@  xfs_dialloc_select_ag(
 	agno = start_agno;
 	for (;;) {
 		pag = xfs_perag_get(mp, agno);
+		down_read(&pag->pag_inactive_rwsem);
 		if (!pag->pagi_inodeok) {
 			xfs_ialloc_next_ag(mp);
 			goto nextag;
 		}
+		if (pag->pag_inactive)
+			goto nextag;
 
 		if (!pag->pagi_init) {
 			error = xfs_ialloc_pagi_init(mp, *tpp, agno);
@@ -1802,6 +1812,7 @@  xfs_dialloc_select_ag(
 			break;
 
 		if (pag->pagi_freecount) {
+			up_read(&pag->pag_inactive_rwsem);
 			xfs_perag_put(pag);
 			goto found_ag;
 		}
@@ -1825,6 +1836,7 @@  xfs_dialloc_select_ag(
 			 * allocate one of the new inodes.
 			 */
 			ASSERT(pag->pagi_freecount > 0);
+			up_read(&pag->pag_inactive_rwsem);
 			xfs_perag_put(pag);
 
 			error = xfs_dialloc_roll(tpp, agbp);
@@ -1838,13 +1850,14 @@  xfs_dialloc_select_ag(
 nextag_relse_buffer:
 		xfs_trans_brelse(*tpp, agbp);
 nextag:
+		up_read(&pag->pag_inactive_rwsem);
 		xfs_perag_put(pag);
 		if (++agno == mp->m_sb.sb_agcount)
 			agno = 0;
 		if (agno == start_agno)
 			return noroom ? -ENOSPC : 0;
 	}
-
+	up_read(&pag->pag_inactive_rwsem);
 	xfs_perag_put(pag);
 	return error;
 found_ag:
@@ -2263,11 +2276,22 @@  xfs_imap_lookup(
 {
 	struct xfs_inobt_rec_incore rec;
 	struct xfs_btree_cur	*cur;
+	struct xfs_perag	*pag;
 	struct xfs_buf		*agbp;
 	int			error;
 	int			i;
 
+	pag = xfs_perag_get(mp, agno);
+	down_read(&pag->pag_inactive_rwsem);
+	if (pag->pag_inactive) {
+		up_read(&pag->pag_inactive_rwsem);
+		xfs_perag_put(pag);
+		return -EINVAL;
+	}
+
 	error = xfs_ialloc_read_agi(mp, tp, agno, &agbp);
+	up_read(&pag->pag_inactive_rwsem);
+	xfs_perag_put(pag);
 	if (error) {
 		xfs_alert(mp,
 			"%s: xfs_ialloc_read_agi() returned error %d, agno %d",
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 1c97b155a8ee..f86360514828 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -235,6 +235,8 @@  xfs_initialize_perag(
 		if (error)
 			goto out_hash_destroy;
 		spin_lock_init(&pag->pag_state_lock);
+
+		init_rwsem(&pag->pag_inactive_rwsem);
 	}
 
 	index = xfs_set_inode_alloc(mp, agcount);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 81829d19596e..667dae0acaf9 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -309,6 +309,12 @@  typedef struct xfs_perag {
 	struct xfs_mount *pag_mount;	/* owner filesystem */
 	xfs_agnumber_t	pag_agno;	/* AG this structure belongs to */
 	atomic_t	pag_ref;	/* perag reference count */
+
+	struct rw_semaphore	pag_inactive_rwsem;
+	bool			pag_inactive;
+
+	/* zero the following fields when growfs pag_inactive == true pags */
+
 	char		pagf_init;	/* this agf's entry is initialized */
 	char		pagi_init;	/* this agi's entry is initialized */
 	char		pagf_metadata;	/* the agf is preferred to be metadata */