
[RFC,7/7] libfs: Re-arrange locking in offset_iterate_dir()

Message ID 170786028847.11135.14775608389430603086.stgit@91.116.238.104.host.secureserver.net (mailing list archive)
State New
Series Use Maple Trees for simple_offset utilities

Commit Message

Chuck Lever Feb. 13, 2024, 9:38 p.m. UTC
From: Chuck Lever <chuck.lever@oracle.com>

Liam says that, unlike with xarray, once the RCU read lock is
released ma_state is not safe to re-use for the next mas_find() call.
But the RCU read lock has to be released on each loop iteration so
that dput() can be called safely.

Thus we are forced to walk the offset tree with fresh state for each
directory entry. mt_find() can do this for us, though it might be a
little less efficient than maintaining ma_state locally.

Since offset_iterate_dir() doesn't build ma_state locally any more,
there's no longer a strong need for offset_find_next(). Clean up by
rolling these two helpers together.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/libfs.c |   39 +++++++++++++++++----------------------
 1 file changed, 17 insertions(+), 22 deletions(-)
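
For readers less familiar with the two iteration styles the commit message
contrasts, the difference looks roughly like this (an illustrative sketch
only, not part of the patch; the variable names are borrowed from the code
below):

	struct dentry *dentry;

	/* Stateful walk: ma_state carries the position plus a pointer into
	 * the tree's internal nodes, so it must not be reused after the RCU
	 * read lock is dropped unless the iteration is paused first.
	 */
	MA_STATE(mas, &octx->mt, ctx->pos, ctx->pos);

	rcu_read_lock();
	dentry = mas_find(&mas, ULONG_MAX);
	rcu_read_unlock();	/* mas may now reference stale nodes */

	/* Stateless walk: mt_find() takes the RCU read lock itself and
	 * re-walks from the root on every call, so only a plain integer
	 * offset is carried across loop iterations.
	 */
	unsigned long offset = ctx->pos;

	dentry = mt_find(&octx->mt, &offset, ULONG_MAX);
	/* on success, offset now points past the entry just returned */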

Comments

Jan Kara Feb. 15, 2024, 1:16 p.m. UTC | #1
On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Liam says that, unlike with xarray, once the RCU read lock is
> released ma_state is not safe to re-use for the next mas_find() call.
> But the RCU read lock has to be released on each loop iteration so
> that dput() can be called safely.
> 
> Thus we are forced to walk the offset tree with fresh state for each
> directory entry. mt_find() can do this for us, though it might be a
> little less efficient than maintaining ma_state locally.
> 
> Since offset_iterate_dir() doesn't build ma_state locally any more,
> there's no longer a strong need for offset_find_next(). Clean up by
> rolling these two helpers together.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

Well, in general I think even xas_next_entry() is not safe to use how
offset_find_next() was using it. Once you drop rcu_read_lock(),
xas->xa_node could go stale. But since you're holding inode->i_rwsem when
using offset_find_next() you should be protected from concurrent
modifications of the mapping (whatever the underlying data structure is) -
that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
maple tree? Am I missing something?

								Honza
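
For reference, the xarray-based helper Jan is describing looked roughly
like this before the series (reconstructed from memory, so details such
as the exact upper bound may differ):

	static struct dentry *offset_find_next(struct xa_state *xas)
	{
		struct dentry *child, *found = NULL;

		rcu_read_lock();
		child = xas_next_entry(xas, U32_MAX);
		if (!child)
			goto out;
		spin_lock(&child->d_lock);
		if (simple_positive(child))
			found = dget_dlock(child);
		spin_unlock(&child->d_lock);
	out:
		rcu_read_unlock();
		return found;
	}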

> ---
>  fs/libfs.c |   39 +++++++++++++++++----------------------
>  1 file changed, 17 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/libfs.c b/fs/libfs.c
> index f073e9aeb2bf..6e01fde1cf95 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -436,23 +436,6 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
>  	return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
>  }
>  
> -static struct dentry *offset_find_next(struct ma_state *mas)
> -{
> -	struct dentry *child, *found = NULL;
> -
> -	rcu_read_lock();
> -	child = mas_find(mas, ULONG_MAX);
> -	if (!child)
> -		goto out;
> -	spin_lock(&child->d_lock);
> -	if (simple_positive(child))
> -		found = dget_dlock(child);
> -	spin_unlock(&child->d_lock);
> -out:
> -	rcu_read_unlock();
> -	return found;
> -}
> -
>  static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
>  {
>  	unsigned long offset = dentry2offset(dentry);
> @@ -465,13 +448,22 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
>  static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
>  {
>  	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
> -	MA_STATE(mas, &octx->mt, ctx->pos, ctx->pos);
> -	struct dentry *dentry;
> +	struct dentry *dentry, *found;
> +	unsigned long offset;
>  
> +	offset = ctx->pos;
>  	while (true) {
> -		dentry = offset_find_next(&mas);
> +		found = mt_find(&octx->mt, &offset, ULONG_MAX);
> +		if (!found)
> +			goto out_noent;
> +
> +		dentry = NULL;
> +		spin_lock(&found->d_lock);
> +		if (simple_positive(found))
> +			dentry = dget_dlock(found);
> +		spin_unlock(&found->d_lock);
>  		if (!dentry)
> -			return ERR_PTR(-ENOENT);
> +			goto out_noent;
>  
>  		if (!offset_dir_emit(ctx, dentry)) {
>  			dput(dentry);
> @@ -479,9 +471,12 @@ static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
>  		}
>  
>  		dput(dentry);
> -		ctx->pos = mas.index + 1;
> +		ctx->pos = offset;
>  	}
>  	return NULL;
> +
> +out_noent:
> +	return ERR_PTR(-ENOENT);
>  }
>  
>  /**
> 
>
Liam R. Howlett Feb. 15, 2024, 5 p.m. UTC | #2
* Jan Kara <jack@suse.cz> [240215 08:16]:
> On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> > 
> > Liam says that, unlike with xarray, once the RCU read lock is
> > released ma_state is not safe to re-use for the next mas_find() call.
> > But the RCU read lock has to be released on each loop iteration so
> > that dput() can be called safely.
> > 
> > Thus we are forced to walk the offset tree with fresh state for each
> > directory entry. mt_find() can do this for us, though it might be a
> > little less efficient than maintaining ma_state locally.
> > 
> > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > there's no longer a strong need for offset_find_next(). Clean up by
> > rolling these two helpers together.
> > 
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> 
> Well, in general I think even xas_next_entry() is not safe to use how
> offset_find_next() was using it. Once you drop rcu_read_lock(),
> xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> using offset_find_next() you should be protected from concurrent
> modifications of the mapping (whatever the underlying data structure is) -
> that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> maple tree? Am I missing something?

If you are stopping, you should be pausing the iteration.  Although this
works today, it's not how it should be used because if we make changes
(ie: compaction requires movement of data), then you may end up with a
UAF issue.  We'd have no way of knowing you are depending on the tree
structure to remain consistent.

IOW the inode->i_rwsem is protecting writes of data but not the
structure holding the data.

This is true for both xarray and maple tree.

Thanks,
Liam
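
A minimal sketch of the pause-before-unlock pattern Liam describes, applied
to the loop in this patch (assumes the advanced mas_* interface; illustrative
only, not code from the posted series):

	MA_STATE(mas, &octx->mt, ctx->pos, ctx->pos);
	struct dentry *child, *dentry;

	rcu_read_lock();
	while ((child = mas_find(&mas, ULONG_MAX)) != NULL) {
		dentry = NULL;
		spin_lock(&child->d_lock);
		if (simple_positive(child))
			dentry = dget_dlock(child);
		spin_unlock(&child->d_lock);
		if (!dentry)
			break;

		/* Park the iterator before leaving the RCU read section... */
		mas_pause(&mas);
		rcu_read_unlock();

		/* ...so dput() and dir_emit() can safely run (and sleep). */
		if (!offset_dir_emit(ctx, dentry)) {
			dput(dentry);
			return NULL;
		}
		ctx->pos = mas.index + 1;
		dput(dentry);

		/* Resuming: mas_find() re-walks from the tree root. */
		rcu_read_lock();
	}
	rcu_read_unlock();
	return ERR_PTR(-ENOENT);
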
Jan Kara Feb. 15, 2024, 5:16 p.m. UTC | #3
On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> * Jan Kara <jack@suse.cz> [240215 08:16]:
> > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > From: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > Liam says that, unlike with xarray, once the RCU read lock is
> > > released ma_state is not safe to re-use for the next mas_find() call.
> > > But the RCU read lock has to be released on each loop iteration so
> > > that dput() can be called safely.
> > > 
> > > Thus we are forced to walk the offset tree with fresh state for each
> > > directory entry. mt_find() can do this for us, though it might be a
> > > little less efficient than maintaining ma_state locally.
> > > 
> > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > there's no longer a strong need for offset_find_next(). Clean up by
> > > rolling these two helpers together.
> > > 
> > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > 
> > Well, in general I think even xas_next_entry() is not safe to use how
> > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > using offset_find_next() you should be protected from concurrent
> > modifications of the mapping (whatever the underlying data structure is) -
> > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > maple tree? Am I missing something?
> 
> If you are stopping, you should be pausing the iteration.  Although this
> works today, it's not how it should be used because if we make changes
> (ie: compaction requires movement of data), then you may end up with a
> UAF issue.  We'd have no way of knowing you are depending on the tree
> structure to remain consistent.

I see. But we have versions of these structures that have locking external
to the structure itself, don't we? Then how do you imagine serializing the
background operations like compaction? As much as I agree your argument is
"theoretically clean", it seems a bit like a trap and there are definitely
xarray users that are going to be broken by this (e.g.
tag_pages_for_writeback())...

								Honza
Chuck Lever Feb. 15, 2024, 5:40 p.m. UTC | #4
On Thu, Feb 15, 2024 at 12:00:08PM -0500, Liam R. Howlett wrote:
> * Jan Kara <jack@suse.cz> [240215 08:16]:
> > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > From: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > Liam says that, unlike with xarray, once the RCU read lock is
> > > released ma_state is not safe to re-use for the next mas_find() call.
> > > But the RCU read lock has to be released on each loop iteration so
> > > that dput() can be called safely.
> > > 
> > > Thus we are forced to walk the offset tree with fresh state for each
> > > directory entry. mt_find() can do this for us, though it might be a
> > > little less efficient than maintaining ma_state locally.
> > > 
> > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > there's no longer a strong need for offset_find_next(). Clean up by
> > > rolling these two helpers together.
> > > 
> > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > 
> > Well, in general I think even xas_next_entry() is not safe to use how
> > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > using offset_find_next() you should be protected from concurrent
> > modifications of the mapping (whatever the underlying data structure is) -
> > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > maple tree? Am I missing something?
> 
> If you are stopping, you should be pausing the iteration.  Although this
> works today, it's not how it should be used because if we make changes
> (ie: compaction requires movement of data), then you may end up with a
> UAF issue.  We'd have no way of knowing you are depending on the tree
> structure to remain consistent.
> 
> IOW the inode->i_rwsem is protecting writes of data but not the
> structure holding the data.
> 
> This is true for both xarray and maple tree.

Would it be appropriate to reorder this series so 7/7 comes before
the transition to use Maple Tree?
Liam R. Howlett Feb. 15, 2024, 9:07 p.m. UTC | #5
* Jan Kara <jack@suse.cz> [240215 12:16]:
> On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> > * Jan Kara <jack@suse.cz> [240215 08:16]:
> > > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > > From: Chuck Lever <chuck.lever@oracle.com>
> > > > 
> > > > Liam says that, unlike with xarray, once the RCU read lock is
> > > > released ma_state is not safe to re-use for the next mas_find() call.
> > > > But the RCU read lock has to be released on each loop iteration so
> > > > that dput() can be called safely.
> > > > 
> > > > Thus we are forced to walk the offset tree with fresh state for each
> > > > directory entry. mt_find() can do this for us, though it might be a
> > > > little less efficient than maintaining ma_state locally.
> > > > 
> > > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > > there's no longer a strong need for offset_find_next(). Clean up by
> > > > rolling these two helpers together.
> > > > 
> > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > Well, in general I think even xas_next_entry() is not safe to use how
> > > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > > using offset_find_next() you should be protected from concurrent
> > > modifications of the mapping (whatever the underlying data structure is) -
> > > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > > maple tree? Am I missing something?
> > 
> > If you are stopping, you should be pausing the iteration.  Although this
> > works today, it's not how it should be used because if we make changes
> > (ie: compaction requires movement of data), then you may end up with a
> > UAF issue.  We'd have no way of knowing you are depending on the tree
> > structure to remain consistent.
> 
> I see. But we have versions of these structures that have locking external
> to the structure itself, don't we?

Ah, I do have them - but I don't want to propagate its use as the dream
is that it can be removed.


> Then how do you imagine serializing the
> background operations like compaction? As much as I agree your argument is
> "theoretically clean", it seems a bit like a trap and there are definitely
> xarray users that are going to be broken by this (e.g.
> tag_pages_for_writeback())...

I'm not sure I follow the trap logic.  There are locks for the data
structure that need to be followed for reading (rcu) and writing
(spinlock for the maple tree).  If you don't correctly lock the data
structure then you really are setting yourself up for potential issues
in the future.

The limitations are outlined in the documentation as to how and when to
lock.  I'm not familiar with the xarray users, but it does check for
locking with lockdep, but the way this is written bypasses the lockdep
checking as the locks are taken and dropped without the proper scope.

If you feel like this is a trap, then maybe we need to figure out a new
plan to detect incorrect use?

Looking through tag_pages_for_writeback(), it does what is necessary to
keep a safe state - before it unlocks it calls xas_pause().  We have the
same on maple tree; mas_pause().  This will restart the next operation
from the root of the tree (the root can also change), to ensure that it
is safe.

If you have other examples you think are unsafe then I can have a look
at them as well.

You can make the existing code safe by also calling xas_pause() before
the rcu lock is dropped, but that is essentially what Chuck has done in
the maple tree conversion by using mt_find().

Regarding compaction, I would expect the write lock to be taken to avoid
any writes happening while compaction occurs.  Readers would use rcu
locking to ensure they return either the old or new value.  During the
write critical section, other writers would be in a
"mas_pause()/xas_pause()" state - so once they continue, they will
re-start the walk to the next element in the new tree.

If external locks are used, then compaction would be sub-optimal as it
may unnecessarily hold up readers, or partial write work (before the
point of a store into the maple tree).

Thanks,
Liam
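
For comparison, the tag_pages_for_writeback() loop referenced above looks
roughly like this in mm/page-writeback.c (abridged and quoted from memory,
so minor details may differ):

	XA_STATE(xas, &mapping->i_pages, start);
	unsigned int tagged = 0;
	void *page;

	xas_lock_irq(&xas);
	xas_for_each_marked(&xas, page, end, PAGECACHE_TAG_DIRTY) {
		xas_set_mark(&xas, PAGECACHE_TAG_TOWRITE);
		if (++tagged % XA_CHECK_SCHED)
			continue;

		/* Park the iterator before dropping the lock... */
		xas_pause(&xas);
		xas_unlock_irq(&xas);
		cond_resched();
		/* ...so the next iteration re-walks from the root. */
		xas_lock_irq(&xas);
	}
	xas_unlock_irq(&xas);
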
Liam R. Howlett Feb. 15, 2024, 9:08 p.m. UTC | #6
* Chuck Lever <chuck.lever@oracle.com> [240215 12:40]:
> On Thu, Feb 15, 2024 at 12:00:08PM -0500, Liam R. Howlett wrote:
> > * Jan Kara <jack@suse.cz> [240215 08:16]:
> > > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > > From: Chuck Lever <chuck.lever@oracle.com>
> > > > 
> > > > Liam says that, unlike with xarray, once the RCU read lock is
> > > > released ma_state is not safe to re-use for the next mas_find() call.
> > > > But the RCU read lock has to be released on each loop iteration so
> > > > that dput() can be called safely.
> > > > 
> > > > Thus we are forced to walk the offset tree with fresh state for each
> > > > directory entry. mt_find() can do this for us, though it might be a
> > > > little less efficient than maintaining ma_state locally.
> > > > 
> > > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > > there's no longer a strong need for offset_find_next(). Clean up by
> > > > rolling these two helpers together.
> > > > 
> > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > Well, in general I think even xas_next_entry() is not safe to use how
> > > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > > using offset_find_next() you should be protected from concurrent
> > > modifications of the mapping (whatever the underlying data structure is) -
> > > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > > maple tree? Am I missing something?
> > 
> > If you are stopping, you should be pausing the iteration.  Although this
> > works today, it's not how it should be used because if we make changes
> > (ie: compaction requires movement of data), then you may end up with a
> > UAF issue.  We'd have no way of knowing you are depending on the tree
> > structure to remain consistent.
> > 
> > IOW the inode->i_rwsem is protecting writes of data but not the
> > structure holding the data.
> > 
> > This is true for both xarray and maple tree.
> 
> Would it be appropriate to reorder this series so 7/7 comes before
> the transition to use Maple Tree?

I think it would, yes.

Thanks,
Liam
Jan Kara Feb. 16, 2024, 10:15 a.m. UTC | #7
On Thu 15-02-24 16:07:42, Liam R. Howlett wrote:
> * Jan Kara <jack@suse.cz> [240215 12:16]:
> > On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> > > * Jan Kara <jack@suse.cz> [240215 08:16]:
> > > > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > > > From: Chuck Lever <chuck.lever@oracle.com>
> > > > > 
> > > > > Liam says that, unlike with xarray, once the RCU read lock is
> > > > > released ma_state is not safe to re-use for the next mas_find() call.
> > > > > But the RCU read lock has to be released on each loop iteration so
> > > > > that dput() can be called safely.
> > > > > 
> > > > > Thus we are forced to walk the offset tree with fresh state for each
> > > > > directory entry. mt_find() can do this for us, though it might be a
> > > > > little less efficient than maintaining ma_state locally.
> > > > > 
> > > > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > > > there's no longer a strong need for offset_find_next(). Clean up by
> > > > > rolling these two helpers together.
> > > > > 
> > > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > > 
> > > > Well, in general I think even xas_next_entry() is not safe to use how
> > > > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > > > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > > > using offset_find_next() you should be protected from concurrent
> > > > modifications of the mapping (whatever the underlying data structure is) -
> > > > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > > > maple tree? Am I missing something?
> > > 
> > > If you are stopping, you should be pausing the iteration.  Although this
> > > works today, it's not how it should be used because if we make changes
> > > (ie: compaction requires movement of data), then you may end up with a
> > > UAF issue.  We'd have no way of knowing you are depending on the tree
> > > structure to remain consistent.
> > 
> > I see. But we have versions of these structures that have locking external
> > to the structure itself, don't we?
> 
> Ah, I do have them - but I don't want to propagate its use as the dream
> is that it can be removed.
> 
> 
> > Then how do you imagine serializing the
> > background operations like compaction? As much as I agree your argument is
> > "theoretically clean", it seems a bit like a trap and there are definitely
> > xarray users that are going to be broken by this (e.g.
> > tag_pages_for_writeback())...
> 
> I'm not sure I follow the trap logic.  There are locks for the data
> structure that need to be followed for reading (rcu) and writing
> (spinlock for the maple tree).  If you don't correctly lock the data
> structure then you really are setting yourself up for potential issues
> in the future.
> 
> The limitations are outlined in the documentation as to how and when to
> lock.  I'm not familiar with the xarray users, but it does check for
> locking with lockdep, but the way this is written bypasses the lockdep
> checking as the locks are taken and dropped without the proper scope.
> 
> If you feel like this is a trap, then maybe we need to figure out a new
> plan to detect incorrect use?

OK, I was a bit imprecise. What I wanted to say is that this is a shift in
the paradigm in the sense that previously, we mostly had (and still have)
data structure APIs (lists, rb-trees, radix-tree, now xarray) that were
guaranteeing that unless you call into the function to mutate the data
structure it stays intact. Now maple trees are shifting more in a direction
of black-box API where you cannot assume what happens inside. Which is fine
but then we have e.g. these iterators which do not quite follow this
black-box design and you have to remember subtle details like calling
"mas_pause()" before unlocking which is IMHO error-prone. Ideally, users of
the black-box API shouldn't be exposed to the details of the internal
locking at all (but then the performance suffers so I understand why you do
things this way). Second to this ideal variant would be if we could detect
we unlocked the lock without calling xas_pause() and warn on that. Or maybe
xas_unlock*() should be calling xas_pause() automagically and we'd have
similar helpers for RCU to do the magic for you?

> Looking through tag_pages_for_writeback(), it does what is necessary to
> keep a safe state - before it unlocks it calls xas_pause().  We have the
> same on maple tree; mas_pause().  This will restart the next operation
> from the root of the tree (the root can also change), to ensure that it
> is safe.

OK, I've missed the xas_pause(). Thanks for correcting me.

> If you have other examples you think are unsafe then I can have a look
> at them as well.

I'm currently not aware of any but I'll let you know if I find some.
Missing xas/mas_pause() seems really easy.

								Honza
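
One way to read the suggestion above is a pair of wrapper helpers along
these lines (purely hypothetical names, not an existing API; each simply
parks the iterator before the protecting lock is dropped):

	/* Hypothetical: always park the iterator when the protecting
	 * lock is released, so a later resume is guaranteed to be safe.
	 */
	static inline void mas_rcu_read_unlock(struct ma_state *mas)
	{
		mas_pause(mas);
		rcu_read_unlock();
	}

	static inline void xas_unlock_paused(struct xa_state *xas)
	{
		xas_pause(xas);
		xas_unlock(xas);
	}
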
Matthew Wilcox Feb. 16, 2024, 3:57 p.m. UTC | #8
On Fri, Feb 16, 2024 at 11:15:46AM +0100, Jan Kara wrote:
> On Thu 15-02-24 16:07:42, Liam R. Howlett wrote:
> > The limitations are outlined in the documentation as to how and when to
> > lock.  I'm not familiar with the xarray users, but it does check for
> > locking with lockdep, but the way this is written bypasses the lockdep
> > checking as the locks are taken and dropped without the proper scope.
> > 
> > If you feel like this is a trap, then maybe we need to figure out a new
> > plan to detect incorrect use?
> 
> OK, I was a bit imprecise. What I wanted to say is that this is a shift in
> the paradigm in the sense that previously, we mostly had (and still have)
> data structure APIs (lists, rb-trees, radix-tree, now xarray) that were
> guaranteeing that unless you call into the function to mutate the data
> structure it stays intact.

hm, no.  The radix tree never guaranteed that to you; I just documented
that it wasn't guaranteed for the XArray.

> Now maple trees are shifting more in a direction
> of black-box API where you cannot assume what happens inside. Which is fine
> but then we have e.g. these iterators which do not quite follow this
> black-box design and you have to remember subtle details like calling
> "mas_pause()" before unlocking which is IMHO error-prone. Ideally, users of
> the black-box API shouldn't be exposed to the details of the internal
> locking at all (but then the performance suffers so I understand why you do
> things this way). Second to this ideal variant would be if we could detect
> we unlocked the lock without calling xas_pause() and warn on that. Or maybe
> xas_unlock*() should be calling xas_pause() automagically and we'd have
> similar helpers for RCU to do the magic for you?

If you're unlocking the lock that protects a data structure while still
using that data structure, you should always be aware that you're doing
something very dangerous!  It's no different from calling inode_unlock()
inside a filesystem.  Sure, you can do it, but you'd better be ready to
deal with the consequences.

The question is, do we want to be able to defragment slabs or not?
My thinking is "yes", for objects where we can ensure there are no
current users (at least after an RCU grace period), we want to be able
to move them.  That does impose certain costs (and subtleties), but just
like fast-GUP and lockless page-cache, I think it's worth doing.

Of course, we don't have slab defragmentation yet, so we're not getting
any benefit from this.  The most recent attempt was in 2019:
https://lore.kernel.org/linux-mm/20190603042637.2018-1-tobin@kernel.org/
but there were earlier attempts in 2017:
https://lore.kernel.org/linux-mm/20171227220636.361857279@linux.com/
and 2008:
https://lore.kernel.org/linux-mm/20080216004526.763643520@sgi.com/

so I have to believe it's something we want, just haven't been able to
push across the "merge this now" line.
Liam R. Howlett Feb. 16, 2024, 4:33 p.m. UTC | #9
* Jan Kara <jack@suse.cz> [240216 05:15]:
> On Thu 15-02-24 16:07:42, Liam R. Howlett wrote:
> > * Jan Kara <jack@suse.cz> [240215 12:16]:
> > > On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> > > > * Jan Kara <jack@suse.cz> [240215 08:16]:
> > > > > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > > > > From: Chuck Lever <chuck.lever@oracle.com>
> > > > > > 
> > > > > > Liam says that, unlike with xarray, once the RCU read lock is
> > > > > > released ma_state is not safe to re-use for the next mas_find() call.
> > > > > > But the RCU read lock has to be released on each loop iteration so
> > > > > > that dput() can be called safely.
> > > > > > 

...

> > 
> > > Then how do you imagine serializing the
> > > background operations like compaction? As much as I agree your argument is
> > > "theoretically clean", it seems a bit like a trap and there are definitely
> > > xarray users that are going to be broken by this (e.g.
> > > tag_pages_for_writeback())...
> > 
> > I'm not sure I follow the trap logic.  There are locks for the data
> > structure that need to be followed for reading (rcu) and writing
> > (spinlock for the maple tree).  If you don't correctly lock the data
> > structure then you really are setting yourself up for potential issues
> > in the future.
> > 
> > The limitations are outlined in the documentation as to how and when to
> > lock.  I'm not familiar with the xarray users, but it does check for
> > locking with lockdep, but the way this is written bypasses the lockdep
> > checking as the locks are taken and dropped without the proper scope.
> > 
> > If you feel like this is a trap, then maybe we need to figure out a new
> > plan to detect incorrect use?
> 
> OK, I was a bit imprecise. What I wanted to say is that this is a shift in
> the paradigm in the sense that previously, we mostly had (and still have)
> data structure APIs (lists, rb-trees, radix-tree, now xarray) that were
> guaranteeing that unless you call into the function to mutate the data
> structure it stays intact. Now maple trees are shifting more in a direction
> of black-box API where you cannot assume what happens inside. Which is fine
> but then we have e.g. these iterators which do not quite follow this
> black-box design and you have to remember subtle details like calling
> "mas_pause()" before unlocking which is IMHO error-prone. Ideally, users of
> the black-box API shouldn't be exposed to the details of the internal
> locking at all (but then the performance suffers so I understand why you do
> things this way). Second to this ideal variant would be if we could detect
> we unlocked the lock without calling xas_pause() and warn on that. Or maybe
> xas_unlock*() should be calling xas_pause() automagically and we'd have
> similar helpers for RCU to do the magic for you?
> 
...

Fair enough.  You can think of the radix-tree and xarray as black boxes
as well; not everyone knows what's going on in there.  The rbtree and
list are embedded in your own data structure and you have to do a lot of
work for your operations.

Right now, it is the way you have described; it's a data structure API.
It will work this way, but you lose the really neat and performant part
of the tree.  If we do this right, we can compact at the same time as
reading data.  We cannot do that when depending on the locks you use
today. 

You don't have to use the mas_pause() functions; there are less
error-prone methods, such as mt_find(), which Chuck used here.  If you want
to do really neat stuff though, you should be looking at the mas_*
interface.  And yes, we totally hand you enough rope to hang yourself.
I'm not sure we can have an advanced interface without doing this.

Using the correct calls will allow for us to smoothly transition to a
model where you don't depend on the data remaining in the same place in
the tree (or xarray).

> 
> > If you have other examples you think are unsafe then I can have a look
> > at them as well.
> 
> I'm currently not aware of any but I'll let you know if I find some.
> Missing xas/mas_pause() seems really easy.

What if we convert the rcu_read_lock() to a mas_read_lock() or
xas_read_lock() and we can check to ensure the state isn't being locked
without being in the 'parked' state (paused or otherwise)?

mas_read_lock(struct ma_state *mas) {
	assert(!mas_active(mas));
	rcu_read_lock();
}

Would that be a reasonable resolution to your concern?  Unfortunately,
what was done with the locking in this case would not be detected with
this change unless the rcu_read_lock() was replaced.  IOW, people could
still use the rcu_read_lock() and skip the detection.

Doing the same in the mas_unlock() doesn't make as much sense since that
may be called without the intent to reuse the state at all.  So we'd be
doing more work than necessary at the end of each loop or use.

Thanks,
Liam
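
To illustrate what the proposed check would catch, a hypothetical usage
sketch building on the mas_read_lock() sketch above:

	/* Buggy pattern the assertion would catch: */
	mas_read_lock(&mas);			/* ok: mas starts inactive */
	child = mas_find(&mas, ULONG_MAX);	/* mas is now active */
	rcu_read_unlock();			/* dropped without parking the state */
	/* ... work that may sleep ... */
	mas_read_lock(&mas);			/* fires assert(!mas_active(mas)) */

	/* Safe pattern: */
	mas_read_lock(&mas);
	child = mas_find(&mas, ULONG_MAX);
	mas_pause(&mas);			/* park before dropping the lock */
	rcu_read_unlock();
	/* ... work that may sleep ... */
	mas_read_lock(&mas);			/* ok: a paused state is not active */
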
Jan Kara Feb. 19, 2024, 6:06 p.m. UTC | #10
On Fri 16-02-24 11:33:18, Liam R. Howlett wrote:
> * Jan Kara <jack@suse.cz> [240216 05:15]:
> > > If you have other examples you think are unsafe then I can have a look
> > > at them as well.
> > 
> > I'm currently not aware of any but I'll let you know if I find some.
> > Missing xas/mas_pause() seems really easy.
> 
> What if we convert the rcu_read_lock() to a mas_read_lock() or
> xas_read_lock() and we can check to ensure the state isn't being locked
> without being in the 'parked' state (paused or otherwise)?
> 
> mas_read_lock(struct ma_state *mas) {
> 	assert(!mas_active(mas));
> 	rcu_read_lock();
> }
> 
> Would that be a reasonable resolution to your concern?  Unfortunately,
> what was done with the locking in this case would not be detected with
> this change unless the rcu_read_lock() was replaced.  IOW, people could
> still use the rcu_read_lock() and skip the detection.

Yes, I guess this is still better than nothing.

> Doing the same in the mas_unlock() doesn't make as much sense since that
> may be called without the intent to reuse the state at all.  So we'd be
> doing more work than necessary at the end of each loop or use.

Yes, understood.

								Honza

Patch

diff --git a/fs/libfs.c b/fs/libfs.c
index f073e9aeb2bf..6e01fde1cf95 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -436,23 +436,6 @@  static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
 	return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
 }
 
-static struct dentry *offset_find_next(struct ma_state *mas)
-{
-	struct dentry *child, *found = NULL;
-
-	rcu_read_lock();
-	child = mas_find(mas, ULONG_MAX);
-	if (!child)
-		goto out;
-	spin_lock(&child->d_lock);
-	if (simple_positive(child))
-		found = dget_dlock(child);
-	spin_unlock(&child->d_lock);
-out:
-	rcu_read_unlock();
-	return found;
-}
-
 static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
 {
 	unsigned long offset = dentry2offset(dentry);
@@ -465,13 +448,22 @@  static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
 static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
 {
 	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
-	MA_STATE(mas, &octx->mt, ctx->pos, ctx->pos);
-	struct dentry *dentry;
+	struct dentry *dentry, *found;
+	unsigned long offset;
 
+	offset = ctx->pos;
 	while (true) {
-		dentry = offset_find_next(&mas);
+		found = mt_find(&octx->mt, &offset, ULONG_MAX);
+		if (!found)
+			goto out_noent;
+
+		dentry = NULL;
+		spin_lock(&found->d_lock);
+		if (simple_positive(found))
+			dentry = dget_dlock(found);
+		spin_unlock(&found->d_lock);
 		if (!dentry)
-			return ERR_PTR(-ENOENT);
+			goto out_noent;
 
 		if (!offset_dir_emit(ctx, dentry)) {
 			dput(dentry);
@@ -479,9 +471,12 @@  static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
 		}
 
 		dput(dentry);
-		ctx->pos = mas.index + 1;
+		ctx->pos = offset;
 	}
 	return NULL;
+
+out_noent:
+	return ERR_PTR(-ENOENT);
 }
 
 /**