diff mbox series

[8/8] xarray: Don't clear marks in xas_store()

Message ID 20200204142514.15826-9-jack@suse.cz (mailing list archive)
State New, archived
Headers show
Series mm: Speedup page cache truncation | expand

Commit Message

Jan Kara Feb. 4, 2020, 2:25 p.m. UTC
When storing NULL in xarray, xas_store() has been clearing all marks
because it could otherwise confuse xas_for_each_marked(). That is
however no longer true and no current user relies on this behavior.
Furthermore it seems as a cleaner API to not do clearing behind caller's
back in case we store NULL.

This provides a nice boost to truncate numbers due to saving unnecessary
tag initialization when clearing shadow entries. Sample benchmark
showing time to truncate 128 files 1GB each on machine with 64GB of RAM
(so about half of entries are shadow entries):

         AVG      STDDEV
Vanilla  4.825s   0.036
Patched  4.516s   0.014

So we can see about 6% reduction in overall truncate time.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 lib/xarray.c | 9 ---------
 1 file changed, 9 deletions(-)

Comments

Jason Gunthorpe Feb. 5, 2020, 6:43 p.m. UTC | #1
On Tue, Feb 04, 2020 at 03:25:14PM +0100, Jan Kara wrote:
> When storing NULL in xarray, xas_store() has been clearing all marks
> because it could otherwise confuse xas_for_each_marked(). That is
> however no longer true and no current user relies on this behavior.
> Furthermore it seems as a cleaner API to not do clearing behind caller's
> back in case we store NULL.
> 
> This provides a nice boost to truncate numbers due to saving unnecessary
> tag initialization when clearing shadow entries. Sample benchmark
> showing time to truncate 128 files 1GB each on machine with 64GB of RAM
> (so about half of entries are shadow entries):
> 
>          AVG      STDDEV
> Vanilla  4.825s   0.036
> Patched  4.516s   0.014
> 
> So we can see about 6% reduction in overall truncate time.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
>  lib/xarray.c | 9 ---------
>  1 file changed, 9 deletions(-)
> 
> diff --git a/lib/xarray.c b/lib/xarray.c
> index 4e32497c51bd..f165e83652f1 100644
> +++ b/lib/xarray.c
> @@ -799,17 +799,8 @@ void *xas_store(struct xa_state *xas, void *entry)
>  		if (xas->xa_sibs)
>  			xas_squash_marks(xas);
>  	}
> -	if (!entry)
> -		xas_init_marks(xas);
>  
>  	for (;;) {
> -		/*
> -		 * Must clear the marks before setting the entry to NULL,
> -		 * otherwise xas_for_each_marked may find a NULL entry and
> -		 * stop early.  rcu_assign_pointer contains a release barrier
> -		 * so the mark clearing will appear to happen before the
> -		 * entry is set to NULL.
> -		 */
>  		rcu_assign_pointer(*slot, entry);

The above removed comment doesn't sound right (the release is paired
with READ_ONCE, which is only an acquire for data dependent accesses),
is this a reflection of the original bug in this thread?

How is RCU mark reading used anyhow?

There is no guarenteed ordering of the mark and the value, so nothing
iterating under RCU over marks can rely on the marks being accurate.

Are the algorithms using this tolerant of that, or provide some kind
of external locking?

This series looks good to me, and does seem to be an improvement.

Actually the clearing of marks by xa_store(, NULL) is creating a very
subtle bug in drivers/infiniband/core/device.c :( Can you add a Fixes
line too:

ib_set_client_data() is assuming the marks for the entry will not
change, but if the caller passed in NULL they get wrongly reset, and
three call sites pass in NULL:
 drivers/infiniband/ulp/srpt/ib_srpt.c
 net/rds/ib.c
 net/smc/smc_ib.c
Fixes: 0df91bb67334 ("RDMA/devices: Use xarray to store the client_data")

Thanks,
Jason
Matthew Wilcox Feb. 5, 2020, 9:59 p.m. UTC | #2
On Wed, Feb 05, 2020 at 02:43:44PM -0400, Jason Gunthorpe wrote:
> On Tue, Feb 04, 2020 at 03:25:14PM +0100, Jan Kara wrote:
> > When storing NULL in xarray, xas_store() has been clearing all marks
> > because it could otherwise confuse xas_for_each_marked(). That is
> > however no longer true and no current user relies on this behavior.
> > Furthermore it seems as a cleaner API to not do clearing behind caller's
> > back in case we store NULL.
> > 
> > This provides a nice boost to truncate numbers due to saving unnecessary
> > tag initialization when clearing shadow entries. Sample benchmark
> > showing time to truncate 128 files 1GB each on machine with 64GB of RAM
> > (so about half of entries are shadow entries):
> > 
> >          AVG      STDDEV
> > Vanilla  4.825s   0.036
> > Patched  4.516s   0.014
> > 
> > So we can see about 6% reduction in overall truncate time.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> >  lib/xarray.c | 9 ---------
> >  1 file changed, 9 deletions(-)
> > 
> > diff --git a/lib/xarray.c b/lib/xarray.c
> > index 4e32497c51bd..f165e83652f1 100644
> > +++ b/lib/xarray.c
> > @@ -799,17 +799,8 @@ void *xas_store(struct xa_state *xas, void *entry)
> >  		if (xas->xa_sibs)
> >  			xas_squash_marks(xas);
> >  	}
> > -	if (!entry)
> > -		xas_init_marks(xas);
> >  
> >  	for (;;) {
> > -		/*
> > -		 * Must clear the marks before setting the entry to NULL,
> > -		 * otherwise xas_for_each_marked may find a NULL entry and
> > -		 * stop early.  rcu_assign_pointer contains a release barrier
> > -		 * so the mark clearing will appear to happen before the
> > -		 * entry is set to NULL.
> > -		 */
> >  		rcu_assign_pointer(*slot, entry);
> 
> The above removed comment doesn't sound right (the release is paired
> with READ_ONCE, which is only an acquire for data dependent accesses),
> is this a reflection of the original bug in this thread?

Yes.  I was thinking about a classical race like so:

read mark
			clear mark
load entry
			store NULL

but of course CPUs can execute many instructions asynchronously with
each other, and

read mark
			clear mark
			store NULL
load entry

can't be prevented against for an RCU reader.

> How is RCU mark reading used anyhow?

We iterate over pages in the page cache with, eg, the dirty bit set.
This bug will lead to the loop terminating early and failing to find
dirty pages that it should.

> Actually the clearing of marks by xa_store(, NULL) is creating a very
> subtle bug in drivers/infiniband/core/device.c :( Can you add a Fixes
> line too:
> 
> ib_set_client_data() is assuming the marks for the entry will not
> change, but if the caller passed in NULL they get wrongly reset, and
> three call sites pass in NULL:
>  drivers/infiniband/ulp/srpt/ib_srpt.c
>  net/rds/ib.c
>  net/smc/smc_ib.c
> Fixes: 0df91bb67334 ("RDMA/devices: Use xarray to store the client_data")

There's no bug here.

If you're actually storing NULL in the array, then the marks would go
away.  That's inherent -- imagine you have an array with a single entry
at 64.  Then you store NULL there.  That causes the node to be deleted,
and the marks must necessarily disappear with it -- there's nowhere to
store them!

But you aren't storing NULL in the array.  I mean, you think you are,
and if you load back the entry from the array, you'll get a NULL.
But this is an allocating array, and so when you go to store NULL in
the array it _actually_ stores an XA_ZERO_ENTRY.  Which is converted
back to NULL when you load it.

You have to call xa_erase() to make an entry disappear from an allocating
array.  Just storing NULL isn't going to do it.
John Hubbard Feb. 5, 2020, 10:19 p.m. UTC | #3
On 2/4/20 6:25 AM, Jan Kara wrote:
> When storing NULL in xarray, xas_store() has been clearing all marks
> because it could otherwise confuse xas_for_each_marked(). That is
> however no longer true and no current user relies on this behavior.


However, let's not forget that the API was also documented to behave
in this way--it's not an accidental detail. Below...


> Furthermore it seems as a cleaner API to not do clearing behind caller's
> back in case we store NULL.
> 
> This provides a nice boost to truncate numbers due to saving unnecessary
> tag initialization when clearing shadow entries. Sample benchmark
> showing time to truncate 128 files 1GB each on machine with 64GB of RAM
> (so about half of entries are shadow entries):
> 
>          AVG      STDDEV
> Vanilla  4.825s   0.036
> Patched  4.516s   0.014
> 
> So we can see about 6% reduction in overall truncate time.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  lib/xarray.c | 9 ---------
>  1 file changed, 9 deletions(-)
> 
> diff --git a/lib/xarray.c b/lib/xarray.c
> index 4e32497c51bd..f165e83652f1 100644
> --- a/lib/xarray.c
> +++ b/lib/xarray.c
> @@ -799,17 +799,8 @@ void *xas_store(struct xa_state *xas, void *entry)
>  		if (xas->xa_sibs)
>  			xas_squash_marks(xas);
>  	}
> -	if (!entry)
> -		xas_init_marks(xas);
>  
>  	for (;;) {
> -		/*
> -		 * Must clear the marks before setting the entry to NULL,
> -		 * otherwise xas_for_each_marked may find a NULL entry and
> -		 * stop early.  rcu_assign_pointer contains a release barrier
> -		 * so the mark clearing will appear to happen before the
> -		 * entry is set to NULL.
> -		 */

So if we do this, I think we'd also want something like this (probably with
better wording, this is just a first draft):

diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst
index 640934b6f7b4..8adeaa8c012e 100644
--- a/Documentation/core-api/xarray.rst
+++ b/Documentation/core-api/xarray.rst
@@ -66,10 +66,11 @@ pointer at every index.
 You can then set entries using xa_store() and get entries
 using xa_load().  xa_store will overwrite any entry with the
 new entry and return the previous entry stored at that index.  You can
-use xa_erase() instead of calling xa_store() with a
+use xa_erase() plus xas_init_marks(), instead of calling xa_store() with a
 ``NULL`` entry.  There is no difference between an entry that has never
-been stored to, one that has been erased and one that has most recently
-had ``NULL`` stored to it.
+been stored to and one that has been erased. Those, in turn, are the same
+as an entry that has had ``NULL`` stored to it and also had its marks
+erased via xas_init_marks().
 
 You can conditionally replace an entry at an index by using
 xa_cmpxchg().  Like cmpxchg(), it will only succeed if



>  		rcu_assign_pointer(*slot, entry);
>  		if (xa_is_node(next) && (!node || node->shift))
>  			xas_free_nodes(xas, xa_to_node(next));
> 



thanks,
Matthew Wilcox Feb. 6, 2020, 2:21 a.m. UTC | #4
On Wed, Feb 05, 2020 at 02:19:34PM -0800, John Hubbard wrote:
> So if we do this, I think we'd also want something like this (probably with
> better wording, this is just a first draft):
> 
> diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst
> index 640934b6f7b4..8adeaa8c012e 100644
> --- a/Documentation/core-api/xarray.rst
> +++ b/Documentation/core-api/xarray.rst
> @@ -66,10 +66,11 @@ pointer at every index.
>  You can then set entries using xa_store() and get entries
>  using xa_load().  xa_store will overwrite any entry with the
>  new entry and return the previous entry stored at that index.  You can
> -use xa_erase() instead of calling xa_store() with a
> +use xa_erase() plus xas_init_marks(), instead of calling xa_store() with a

Woah, woah, woah.  xa_erase() re-initialises the marks.  Nobody's going
to change that.  Don't confuse the porcelain and plumbing APIs.
John Hubbard Feb. 6, 2020, 3:48 a.m. UTC | #5
On 2/5/20 6:21 PM, Matthew Wilcox wrote:
> On Wed, Feb 05, 2020 at 02:19:34PM -0800, John Hubbard wrote:
>> So if we do this, I think we'd also want something like this (probably with
>> better wording, this is just a first draft):
>>
>> diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst
>> index 640934b6f7b4..8adeaa8c012e 100644
>> --- a/Documentation/core-api/xarray.rst
>> +++ b/Documentation/core-api/xarray.rst
>> @@ -66,10 +66,11 @@ pointer at every index.
>>   You can then set entries using xa_store() and get entries
>>   using xa_load().  xa_store will overwrite any entry with the
>>   new entry and return the previous entry stored at that index.  You can
>> -use xa_erase() instead of calling xa_store() with a
>> +use xa_erase() plus xas_init_marks(), instead of calling xa_store() with a
> 
> Woah, woah, woah.  xa_erase() re-initialises the marks.  Nobody's going


Yes, I get that. But I mis-wrote it, it should have read more like:

You can then set entries using xa_store() and get entries
using xa_load().  xa_store will overwrite any entry with the
new entry and return the previous entry stored at that index.  You can
use xa_erase(), instead of calling xa_store() with a
``NULL`` entry followed by xas_init_marks().  There is no difference between
an entry that has never been stored to and one that has been erased. Those,
in turn, are the same as an entry that has had ``NULL`` stored to it and
also had its marks erased via xas_init_marks().


> to change that.  Don't confuse the porcelain and plumbing APIs.
> 

The API still documents a different behavior than this patchset changes it
to, despite my imperfect attempts to update the documentation. So let's
please keep the porcelain aligned with the plumbing. :)

thanks,
Matthew Wilcox Feb. 6, 2020, 4:28 a.m. UTC | #6
On Wed, Feb 05, 2020 at 07:48:57PM -0800, John Hubbard wrote:
> You can then set entries using xa_store() and get entries
> using xa_load().  xa_store will overwrite any entry with the
> new entry and return the previous entry stored at that index.  You can
> use xa_erase(), instead of calling xa_store() with a
> ``NULL`` entry followed by xas_init_marks().  There is no difference between
> an entry that has never been stored to and one that has been erased. Those,
> in turn, are the same as an entry that has had ``NULL`` stored to it and
> also had its marks erased via xas_init_marks().

There's a fundamental misunderstanding here.  If you store a NULL, the
marks go away.  There is no such thing as a marked NULL entry.  If you
observe such a thing, it can only exist through some kind of permitted
RCU race, and the entry must be ignored.  If you're holding the xa_lock,
there is no way to observe a NULL entry with a search mark set.

What Jan is trying to do is allow code that knows what it's doing
the ability to say "Skip clearing the marks for performance reasons.
The marks are already clear."

I'm still mulling over the patches from Jan.  There's something I don't
like about them, but I can't articulate it in a useful way yet.  I'm on
board with the general principle, and obviously the xas_for_each_marked()
bug needs to be fixed.
John Hubbard Feb. 6, 2020, 4:37 a.m. UTC | #7
On 2/5/20 8:28 PM, Matthew Wilcox wrote:
> On Wed, Feb 05, 2020 at 07:48:57PM -0800, John Hubbard wrote:
>> You can then set entries using xa_store() and get entries
>> using xa_load().  xa_store will overwrite any entry with the
>> new entry and return the previous entry stored at that index.  You can
>> use xa_erase(), instead of calling xa_store() with a
>> ``NULL`` entry followed by xas_init_marks().  There is no difference between
>> an entry that has never been stored to and one that has been erased. Those,
>> in turn, are the same as an entry that has had ``NULL`` stored to it and
>> also had its marks erased via xas_init_marks().
> 
> There's a fundamental misunderstanding here.  If you store a NULL, the
> marks go away.  There is no such thing as a marked NULL entry.  If you
> observe such a thing, it can only exist through some kind of permitted
> RCU race, and the entry must be ignored.  If you're holding the xa_lock,
> there is no way to observe a NULL entry with a search mark set.


aha, the key point. Thanks for explaining it.

thanks,
Jan Kara Feb. 6, 2020, 8:04 a.m. UTC | #8
On Wed 05-02-20 14:19:34, John Hubbard wrote:
> On 2/4/20 6:25 AM, Jan Kara wrote:
> > When storing NULL in xarray, xas_store() has been clearing all marks
> > because it could otherwise confuse xas_for_each_marked(). That is
> > however no longer true and no current user relies on this behavior.
> 
> 
> However, let's not forget that the API was also documented to behave
> in this way--it's not an accidental detail. Below...

Yeah, we'll need to sort out details how we want the API to be used but
thanks for reminding me I should update the documentation :). It was on my
radar but then I forgot...

								Honza
Jan Kara Feb. 6, 2020, 8:36 a.m. UTC | #9
On Wed 05-02-20 20:28:01, Matthew Wilcox wrote:
> On Wed, Feb 05, 2020 at 07:48:57PM -0800, John Hubbard wrote:
> > You can then set entries using xa_store() and get entries
> > using xa_load().  xa_store will overwrite any entry with the
> > new entry and return the previous entry stored at that index.  You can
> > use xa_erase(), instead of calling xa_store() with a
> > ``NULL`` entry followed by xas_init_marks().  There is no difference between
> > an entry that has never been stored to and one that has been erased. Those,
> > in turn, are the same as an entry that has had ``NULL`` stored to it and
> > also had its marks erased via xas_init_marks().
> 
> There's a fundamental misunderstanding here.  If you store a NULL, the
> marks go away.  There is no such thing as a marked NULL entry.  If you
> observe such a thing, it can only exist through some kind of permitted
> RCU race, and the entry must be ignored.  If you're holding the xa_lock,
> there is no way to observe a NULL entry with a search mark set.
> 
> What Jan is trying to do is allow code that knows what it's doing
> the ability to say "Skip clearing the marks for performance reasons.
> The marks are already clear."
> 
> I'm still mulling over the patches from Jan.  There's something I don't
> like about them, but I can't articulate it in a useful way yet.  I'm on
> board with the general principle, and obviously the xas_for_each_marked()
> bug needs to be fixed.

There are different ways how to look at what I'm doing :) I was thinking
about it more like "xas_store() is for storing value at some index",
"xas_erase() is when I want the value at some index removed from the data
structure". Because these are principially different operations for any
data structure (as much as erasing can be *implemented* by just storing
NULL at some index). You seem to recognize this for xa_ functions but you
probably considered xas_ functions internal enough that they follow more
the "how it is implemented" way of thinking.

Now I agree that there are holes in my way of thinking about xas_store()
because if you happen to store NULL at some index, marks may get destroyed
as a side-effect. And some users of __xa_cmpxchg() (BTW nobody seems to be
using xa_cmpxchg_bh()) do use the fact that storing NULL does effectively
erase an entry which is BTW inconsistent with xa_store() itself as well...

You've been probably thinking more about xarray API semantics than I was so
I can be convinced otherwise but at this point, I'd rather move the API
more towards "erase is different from storing NULL".

									Honza
Jason Gunthorpe Feb. 6, 2020, 1:49 p.m. UTC | #10
On Wed, Feb 05, 2020 at 01:59:04PM -0800, Matthew Wilcox wrote:

> > How is RCU mark reading used anyhow?
> 
> We iterate over pages in the page cache with, eg, the dirty bit set.
> This bug will lead to the loop terminating early and failing to find
> dirty pages that it should.

But the inhernetly weak sync of marks and pointers means that
iteration will miss some dirty pages and return some clean pages. It
is all OK for some reason?
 
> > Actually the clearing of marks by xa_store(, NULL) is creating a very
> > subtle bug in drivers/infiniband/core/device.c :( Can you add a Fixes
> > line too:
> > 
> > ib_set_client_data() is assuming the marks for the entry will not
> > change, but if the caller passed in NULL they get wrongly reset, and
> > three call sites pass in NULL:
> >  drivers/infiniband/ulp/srpt/ib_srpt.c
> >  net/rds/ib.c
> >  net/smc/smc_ib.c
> > Fixes: 0df91bb67334 ("RDMA/devices: Use xarray to store the client_data")
> 
> There's no bug here.
> 
> If you're actually storing NULL in the array, then the marks would go
> away.  That's inherent -- imagine you have an array with a single entry
> at 64.  Then you store NULL there.  That causes the node to be deleted,
> and the marks must necessarily disappear with it -- there's nowhere to
> store them!

Ah, it is allocating! These little behavior differences are tricky to
remember over with infrequent use :(

Jason
Jan Kara Feb. 6, 2020, 2:36 p.m. UTC | #11
On Thu 06-02-20 09:49:04, Jason Gunthorpe wrote:
> On Wed, Feb 05, 2020 at 01:59:04PM -0800, Matthew Wilcox wrote:
> 
> > > How is RCU mark reading used anyhow?
> > 
> > We iterate over pages in the page cache with, eg, the dirty bit set.
> > This bug will lead to the loop terminating early and failing to find
> > dirty pages that it should.
> 
> But the inhernetly weak sync of marks and pointers means that
> iteration will miss some dirty pages and return some clean pages. It
> is all OK for some reason?

Yes. The reason is - the definitive source of dirtiness is in page flags so
if iteration returns more pages, we don't care. WRT missing pages we only
need to make sure pages that were dirty before the iteration started are
returned and the current code fulfills that.

> > > Actually the clearing of marks by xa_store(, NULL) is creating a very
> > > subtle bug in drivers/infiniband/core/device.c :( Can you add a Fixes
> > > line too:
> > > 
> > > ib_set_client_data() is assuming the marks for the entry will not
> > > change, but if the caller passed in NULL they get wrongly reset, and
> > > three call sites pass in NULL:
> > >  drivers/infiniband/ulp/srpt/ib_srpt.c
> > >  net/rds/ib.c
> > >  net/smc/smc_ib.c
> > > Fixes: 0df91bb67334 ("RDMA/devices: Use xarray to store the client_data")
> > 
> > There's no bug here.
> > 
> > If you're actually storing NULL in the array, then the marks would go
> > away.  That's inherent -- imagine you have an array with a single entry
> > at 64.  Then you store NULL there.  That causes the node to be deleted,
> > and the marks must necessarily disappear with it -- there's nowhere to
> > store them!
> 
> Ah, it is allocating! These little behavior differences are tricky to
> remember over with infrequent use :(

Yeah, that's why I'd prefer if NULL was not "special value" at all and if
someone wanted to remove index from xarray he'd always have to use a
special function. My patches go towards that direction but not the full way
because there's still xa_cmpxchg() whose users use the fact that NULL is in
fact 'erase'.

								Honza
Jason Gunthorpe Feb. 6, 2020, 2:49 p.m. UTC | #12
On Thu, Feb 06, 2020 at 03:36:27PM +0100, Jan Kara wrote:

> Yeah, that's why I'd prefer if NULL was not "special value" at all and if
> someone wanted to remove index from xarray he'd always have to use a
> special function. My patches go towards that direction but not the full way
> because there's still xa_cmpxchg() whose users use the fact that NULL is in
> fact 'erase'.

IMHO, this is more appealing. The fact that xa_store(NULL) on
non-allocating arrays changes marks seems very surprising/counter
intuitive. It feels wise to avoid subtle differences like this between
allocating/non-allocating mode.

So, it would be more uniform if xa_store and xa_cmpxchg never altered
marks. I suppose in practice this means that xa_store(NULL) has to
store a XA_ZERO_ENTRY even for non-allocating arrays, and related.

Perhaps xa_cmp_erase() could be introduced to take the place of
cmpxchg(NULL), and the distinction between erase and store NULL is
that erase has the mark-destroying property and guarentees the tree
can be pruned.

Jason
diff mbox series

Patch

diff --git a/lib/xarray.c b/lib/xarray.c
index 4e32497c51bd..f165e83652f1 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -799,17 +799,8 @@  void *xas_store(struct xa_state *xas, void *entry)
 		if (xas->xa_sibs)
 			xas_squash_marks(xas);
 	}
-	if (!entry)
-		xas_init_marks(xas);
 
 	for (;;) {
-		/*
-		 * Must clear the marks before setting the entry to NULL,
-		 * otherwise xas_for_each_marked may find a NULL entry and
-		 * stop early.  rcu_assign_pointer contains a release barrier
-		 * so the mark clearing will appear to happen before the
-		 * entry is set to NULL.
-		 */
 		rcu_assign_pointer(*slot, entry);
 		if (xa_is_node(next) && (!node || node->shift))
 			xas_free_nodes(xas, xa_to_node(next));