diff mbox series

mm, swap: Remove unnecessary smp_rmb() in swap_type_to_swap_info()

Message ID 20210513064837.3949064-1-ying.huang@intel.com (mailing list archive)
State New, archived
Headers show
Series mm, swap: Remove unnecessary smp_rmb() in swap_type_to_swap_info() | expand

Commit Message

Huang, Ying May 13, 2021, 6:48 a.m. UTC
Before commit c10d38cc8d3e ("mm, swap: bounds check swap_info array
accesses to avoid NULL derefs"), the typical code to reference the
swap_info[] is as follows,

  type = swp_type(swp_entry);
  if (type >= nr_swapfiles)
          /* handle invalid swp_entry */;
  p = swap_info[type];
  /* access fields of *p.  OOPS! p may be NULL! */

Because the ordering isn't guaranteed, it's possible that "p" is read
before checking "type".  And that may result in NULL pointer
dereference.

So in commit c10d38cc8d3e, the code becomes,

  struct swap_info_struct *swap_type_to_swap_info(int type)
  {
	  if (type >= READ_ONCE(nr_swapfiles))
		  return NULL;
	  smp_rmb();
	  return READ_ONCE(swap_info[type]);
  }

  /* users */
  type = swp_type(swp_entry);
  p = swap_type_to_swap_info(type);
  if (!p)
	  /* handle invalid swp_entry */;
  /* access fields of *p */

Because "p" is checked to be non-zero before dereference, smp_rmb()
isn't needed anymore.

We still need to guarantee swap_info[type] is read before dereference.
That can be satisfied via the data dependency ordering of
READ_ONCE(swap_info[type]).  The corresponding smp_wmb() is adjusted
in alloc_swap_info() too.

And, we don't need to read "nr_swapfiles" too.  Because if
"type >= nr_swapfiles", swap_info[type] will be NULL.  We just need
to make sure we will not access out of the boundary of the array.
With that change, nr_swapfiles will only be accessed with swap_lock
held, except in swapcache_free_entries().  Where the absolute
correctness of the value isn't needed, as described in the comments.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Paul McKenney <paulmck@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/swapfile.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

Comments

Miaohe Lin May 13, 2021, 8:49 a.m. UTC | #1
On 2021/5/13 14:48, Huang Ying wrote:
> Before commit c10d38cc8d3e ("mm, swap: bounds check swap_info array
> accesses to avoid NULL derefs"), the typical code to reference the
> swap_info[] is as follows,
> 
>   type = swp_type(swp_entry);
>   if (type >= nr_swapfiles)
>           /* handle invalid swp_entry */;
>   p = swap_info[type];
>   /* access fields of *p.  OOPS! p may be NULL! */
> 
> Because the ordering isn't guaranteed, it's possible that "p" is read
> before checking "type".  And that may result in NULL pointer
> dereference.
> 
> So in commit c10d38cc8d3e, the code becomes,
> 
>   struct swap_info_struct *swap_type_to_swap_info(int type)
>   {
> 	  if (type >= READ_ONCE(nr_swapfiles))
> 		  return NULL;
> 	  smp_rmb();
> 	  return READ_ONCE(swap_info[type]);
>   }
> 
>   /* users */
>   type = swp_type(swp_entry);
>   p = swap_type_to_swap_info(type);
>   if (!p)
> 	  /* handle invalid swp_entry */;
>   /* access fields of *p */
> 
> Because "p" is checked to be non-zero before dereference, smp_rmb()
> isn't needed anymore.
> 
> We still need to guarantee swap_info[type] is read before dereference.
> That can be satisfied via the data dependency ordering of
> READ_ONCE(swap_info[type]).  The corresponding smp_wmb() is adjusted
> in alloc_swap_info() too.
> 
> And, we don't need to read "nr_swapfiles" too.  Because if
> "type >= nr_swapfiles", swap_info[type] will be NULL.  We just need
> to make sure we will not access out of the boundary of the array.
> With that change, nr_swapfiles will only be accessed with swap_lock
> held, except in swapcache_free_entries().  Where the absolute
> correctness of the value isn't needed, as described in the comments.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
> Cc: Dan Carpenter <dan.carpenter@oracle.com>
> Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Paul McKenney <paulmck@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/swapfile.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 2aad85751991..4c1fb28bbe0e 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -100,10 +100,14 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0);
>  
>  static struct swap_info_struct *swap_type_to_swap_info(int type)
>  {
> -	if (type >= READ_ONCE(nr_swapfiles))
> +	if (type >= MAX_SWAPFILES)
>  		return NULL;
>  
> -	smp_rmb();	/* Pairs with smp_wmb in alloc_swap_info. */
> +	/*
> +	 * The data dependency ordering from the READ_ONCE() pairs
> +	 * with smp_wmb() in alloc_swap_info() to guarantee the
> +	 * swap_info_struct fields are read after swap_info[type].
> +	 */
>  	return READ_ONCE(swap_info[type]);
>  }
>  
> @@ -2884,14 +2888,10 @@ static struct swap_info_struct *alloc_swap_info(void)
>  	}
>  	if (type >= nr_swapfiles) {
>  		p->type = type;
> -		WRITE_ONCE(swap_info[type], p);
> -		/*
> -		 * Write swap_info[type] before nr_swapfiles, in case a
> -		 * racing procfs swap_start() or swap_next() is reading them.
> -		 * (We never shrink nr_swapfiles, we never free this entry.)
> -		 */
> +		/* Paired with READ_ONCE() in swap_type_to_swap_info() */
>  		smp_wmb();

Many thank for your patch. The patch looks fine to me. There is one question:

There is no smp_rmb() paired with above smp_wmb(). What is this smp_wmb() used for ?
Could you please have a explanation ?

Thanks again!

> -		WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
> +		WRITE_ONCE(swap_info[type], p);
> +		nr_swapfiles++;
>  	} else {
>  		defer = p;
>  		p = swap_info[type];
>
Muchun Song May 13, 2021, 9:54 a.m. UTC | #2
On Thu, May 13, 2021 at 5:11 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2021/5/13 14:48, Huang Ying wrote:
> > Before commit c10d38cc8d3e ("mm, swap: bounds check swap_info array
> > accesses to avoid NULL derefs"), the typical code to reference the
> > swap_info[] is as follows,
> >
> >   type = swp_type(swp_entry);
> >   if (type >= nr_swapfiles)
> >           /* handle invalid swp_entry */;
> >   p = swap_info[type];
> >   /* access fields of *p.  OOPS! p may be NULL! */
> >
> > Because the ordering isn't guaranteed, it's possible that "p" is read
> > before checking "type".  And that may result in NULL pointer
> > dereference.
> >
> > So in commit c10d38cc8d3e, the code becomes,
> >
> >   struct swap_info_struct *swap_type_to_swap_info(int type)
> >   {
> >         if (type >= READ_ONCE(nr_swapfiles))
> >                 return NULL;
> >         smp_rmb();
> >         return READ_ONCE(swap_info[type]);
> >   }
> >
> >   /* users */
> >   type = swp_type(swp_entry);
> >   p = swap_type_to_swap_info(type);
> >   if (!p)
> >         /* handle invalid swp_entry */;
> >   /* access fields of *p */
> >
> > Because "p" is checked to be non-zero before dereference, smp_rmb()
> > isn't needed anymore.
> >
> > We still need to guarantee swap_info[type] is read before dereference.
> > That can be satisfied via the data dependency ordering of
> > READ_ONCE(swap_info[type]).  The corresponding smp_wmb() is adjusted
> > in alloc_swap_info() too.
> >
> > And, we don't need to read "nr_swapfiles" too.  Because if
> > "type >= nr_swapfiles", swap_info[type] will be NULL.  We just need
> > to make sure we will not access out of the boundary of the array.
> > With that change, nr_swapfiles will only be accessed with swap_lock
> > held, except in swapcache_free_entries().  Where the absolute
> > correctness of the value isn't needed, as described in the comments.
> >
> > Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> > Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
> > Cc: Dan Carpenter <dan.carpenter@oracle.com>
> > Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
> > Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Cc: Andi Kleen <ak@linux.intel.com>
> > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > Cc: Omar Sandoval <osandov@fb.com>
> > Cc: Paul McKenney <paulmck@kernel.org>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: Will Deacon <will.deacon@arm.com>
> > Cc: Miaohe Lin <linmiaohe@huawei.com>
> > ---
> >  mm/swapfile.c | 18 +++++++++---------
> >  1 file changed, 9 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 2aad85751991..4c1fb28bbe0e 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -100,10 +100,14 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0);
> >
> >  static struct swap_info_struct *swap_type_to_swap_info(int type)
> >  {
> > -     if (type >= READ_ONCE(nr_swapfiles))
> > +     if (type >= MAX_SWAPFILES)
> >               return NULL;
> >
> > -     smp_rmb();      /* Pairs with smp_wmb in alloc_swap_info. */
> > +     /*
> > +      * The data dependency ordering from the READ_ONCE() pairs
> > +      * with smp_wmb() in alloc_swap_info() to guarantee the
> > +      * swap_info_struct fields are read after swap_info[type].
> > +      */
> >       return READ_ONCE(swap_info[type]);
> >  }
> >
> > @@ -2884,14 +2888,10 @@ static struct swap_info_struct *alloc_swap_info(void)
> >       }
> >       if (type >= nr_swapfiles) {
> >               p->type = type;
> > -             WRITE_ONCE(swap_info[type], p);
> > -             /*
> > -              * Write swap_info[type] before nr_swapfiles, in case a
> > -              * racing procfs swap_start() or swap_next() is reading them.
> > -              * (We never shrink nr_swapfiles, we never free this entry.)
> > -              */
> > +             /* Paired with READ_ONCE() in swap_type_to_swap_info() */
> >               smp_wmb();
>
> Many thank for your patch. The patch looks fine to me. There is one question:
>
> There is no smp_rmb() paired with above smp_wmb(). What is this smp_wmb() used for ?
> Could you please have a explanation ?

The comment is very clear, it matches READ_ONCE() which implies a
data dependence barrier on some archs.

Thanks.

>
> Thanks again!
>
> > -             WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
> > +             WRITE_ONCE(swap_info[type], p);
> > +             nr_swapfiles++;
> >       } else {
> >               defer = p;
> >               p = swap_info[type];
> >
>
Miaohe Lin May 13, 2021, 11:27 a.m. UTC | #3
On 2021/5/13 17:54, Muchun Song wrote:
> On Thu, May 13, 2021 at 5:11 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> On 2021/5/13 14:48, Huang Ying wrote:
>>> Before commit c10d38cc8d3e ("mm, swap: bounds check swap_info array
>>> accesses to avoid NULL derefs"), the typical code to reference the
>>> swap_info[] is as follows,
>>>
>>>   type = swp_type(swp_entry);
>>>   if (type >= nr_swapfiles)
>>>           /* handle invalid swp_entry */;
>>>   p = swap_info[type];
>>>   /* access fields of *p.  OOPS! p may be NULL! */
>>>
>>> Because the ordering isn't guaranteed, it's possible that "p" is read
>>> before checking "type".  And that may result in NULL pointer
>>> dereference.
>>>
>>> So in commit c10d38cc8d3e, the code becomes,
>>>
>>>   struct swap_info_struct *swap_type_to_swap_info(int type)
>>>   {
>>>         if (type >= READ_ONCE(nr_swapfiles))
>>>                 return NULL;
>>>         smp_rmb();
>>>         return READ_ONCE(swap_info[type]);
>>>   }
>>>
>>>   /* users */
>>>   type = swp_type(swp_entry);
>>>   p = swap_type_to_swap_info(type);
>>>   if (!p)
>>>         /* handle invalid swp_entry */;
>>>   /* access fields of *p */
>>>
>>> Because "p" is checked to be non-zero before dereference, smp_rmb()
>>> isn't needed anymore.
>>>
>>> We still need to guarantee swap_info[type] is read before dereference.
>>> That can be satisfied via the data dependency ordering of
>>> READ_ONCE(swap_info[type]).  The corresponding smp_wmb() is adjusted
>>> in alloc_swap_info() too.
>>>
>>> And, we don't need to read "nr_swapfiles" too.  Because if
>>> "type >= nr_swapfiles", swap_info[type] will be NULL.  We just need
>>> to make sure we will not access out of the boundary of the array.
>>> With that change, nr_swapfiles will only be accessed with swap_lock
>>> held, except in swapcache_free_entries().  Where the absolute
>>> correctness of the value isn't needed, as described in the comments.
>>>
>>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>>> Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
>>> Cc: Dan Carpenter <dan.carpenter@oracle.com>
>>> Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
>>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>>> Cc: Andi Kleen <ak@linux.intel.com>
>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>>> Cc: Omar Sandoval <osandov@fb.com>
>>> Cc: Paul McKenney <paulmck@kernel.org>
>>> Cc: Tejun Heo <tj@kernel.org>
>>> Cc: Will Deacon <will.deacon@arm.com>
>>> Cc: Miaohe Lin <linmiaohe@huawei.com>
>>> ---
>>>  mm/swapfile.c | 18 +++++++++---------
>>>  1 file changed, 9 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index 2aad85751991..4c1fb28bbe0e 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -100,10 +100,14 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0);
>>>
>>>  static struct swap_info_struct *swap_type_to_swap_info(int type)
>>>  {
>>> -     if (type >= READ_ONCE(nr_swapfiles))
>>> +     if (type >= MAX_SWAPFILES)
>>>               return NULL;
>>>
>>> -     smp_rmb();      /* Pairs with smp_wmb in alloc_swap_info. */
>>> +     /*
>>> +      * The data dependency ordering from the READ_ONCE() pairs
>>> +      * with smp_wmb() in alloc_swap_info() to guarantee the
>>> +      * swap_info_struct fields are read after swap_info[type].
>>> +      */
>>>       return READ_ONCE(swap_info[type]);
>>>  }
>>>
>>> @@ -2884,14 +2888,10 @@ static struct swap_info_struct *alloc_swap_info(void)
>>>       }
>>>       if (type >= nr_swapfiles) {
>>>               p->type = type;
>>> -             WRITE_ONCE(swap_info[type], p);
>>> -             /*
>>> -              * Write swap_info[type] before nr_swapfiles, in case a
>>> -              * racing procfs swap_start() or swap_next() is reading them.
>>> -              * (We never shrink nr_swapfiles, we never free this entry.)
>>> -              */
>>> +             /* Paired with READ_ONCE() in swap_type_to_swap_info() */
>>>               smp_wmb();
>>
>> Many thank for your patch. The patch looks fine to me. There is one question:
>>
>> There is no smp_rmb() paired with above smp_wmb(). What is this smp_wmb() used for ?
>> Could you please have a explanation ?
> 
> The comment is very clear, it matches READ_ONCE() which implies a
> data dependence barrier on some archs.
> 
> Thanks.

Got it. I misunderstood it... Thanks!

> 
>>
>> Thanks again!
>>
>>> -             WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
>>> +             WRITE_ONCE(swap_info[type], p);
>>> +             nr_swapfiles++;
>>>       } else {
>>>               defer = p;
>>>               p = swap_info[type];
>>>
>>
> .
>
Peter Zijlstra May 13, 2021, 12:34 p.m. UTC | #4
On Thu, May 13, 2021 at 05:54:42PM +0800, Muchun Song wrote:
> On Thu, May 13, 2021 at 5:11 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
> > On 2021/5/13 14:48, Huang Ying wrote:

> > >  mm/swapfile.c | 18 +++++++++---------
> > >  1 file changed, 9 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 2aad85751991..4c1fb28bbe0e 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -100,10 +100,14 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0);
> > >
> > >  static struct swap_info_struct *swap_type_to_swap_info(int type)
> > >  {
> > > -     if (type >= READ_ONCE(nr_swapfiles))
> > > +     if (type >= MAX_SWAPFILES)
> > >               return NULL;
> > >
> > > -     smp_rmb();      /* Pairs with smp_wmb in alloc_swap_info. */
> > > +     /*
> > > +      * The data dependency ordering from the READ_ONCE() pairs
> > > +      * with smp_wmb() in alloc_swap_info() to guarantee the
> > > +      * swap_info_struct fields are read after swap_info[type].
> > > +      */
> > >       return READ_ONCE(swap_info[type]);
> > >  }
> > >
> > > @@ -2884,14 +2888,10 @@ static struct swap_info_struct *alloc_swap_info(void)
> > >       }
> > >       if (type >= nr_swapfiles) {
> > >               p->type = type;
> > > -             WRITE_ONCE(swap_info[type], p);
> > > -             /*
> > > -              * Write swap_info[type] before nr_swapfiles, in case a
> > > -              * racing procfs swap_start() or swap_next() is reading them.
> > > -              * (We never shrink nr_swapfiles, we never free this entry.)
> > > -              */
> > > +             /* Paired with READ_ONCE() in swap_type_to_swap_info() */
> > >               smp_wmb();
> >
> > Many thank for your patch. The patch looks fine to me. There is one question:
> >
> > There is no smp_rmb() paired with above smp_wmb(). What is this smp_wmb() used for ?
> > Could you please have a explanation ?
> 
> The comment is very clear, it matches READ_ONCE() which implies a
> data dependence barrier on some archs.

This statement doesn't make sense; this isn't code that needs to be
correct on 'some' archs, it needs to be unconditionally correct.

Also, you cannot pair with a single memop, there is no order in a set of
one element.

And if you depend on a data dependency, you need a store order; but you
just removed the store order. in which case the data dependency is also
moot.

All of this is utter confusion. Possibly correct, but a complete
trainwreck non-the-less.

Either you say ordering is irrelevant, because we only ever increase the
number of swapfiles and therefore any load is either NULL or the correct
pointer, as guaranteed by WRITE_ONCE()/READ_ONCE() avoiding load/store
tearing.

Or you need the data dependency, but then you also need the store order
like:

	CPU0					CPU1

	if (type >= READ_ONCE(nr_swapfiles))	WRITE_ONCE(swap_info[type], p);
		return NULL;
	/* data-dependency on type */		smp_wmb();
	return READ_ONCE(swap_info[type]);	WRITE_ONCE(nr_swapfiles, nr_swapfiles+1);

But you cannot have half of both and expect any of it to make sense.
Peter Zijlstra May 13, 2021, 12:46 p.m. UTC | #5
On Thu, May 13, 2021 at 02:48:37PM +0800, Huang Ying wrote:
>  mm/swapfile.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 2aad85751991..4c1fb28bbe0e 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -100,10 +100,14 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0);
>  
>  static struct swap_info_struct *swap_type_to_swap_info(int type)
>  {
> -	if (type >= READ_ONCE(nr_swapfiles))
> +	if (type >= MAX_SWAPFILES)
>  		return NULL;
>  
> -	smp_rmb();	/* Pairs with smp_wmb in alloc_swap_info. */
> +	/*
> +	 * The data dependency ordering from the READ_ONCE() pairs
> +	 * with smp_wmb() in alloc_swap_info() to guarantee the
> +	 * swap_info_struct fields are read after swap_info[type].
> +	 */
>  	return READ_ONCE(swap_info[type]);
>  }
>  
> @@ -2884,14 +2888,10 @@ static struct swap_info_struct *alloc_swap_info(void)
>  	}
>  	if (type >= nr_swapfiles) {
>  		p->type = type;
> -		WRITE_ONCE(swap_info[type], p);
> -		/*
> -		 * Write swap_info[type] before nr_swapfiles, in case a
> -		 * racing procfs swap_start() or swap_next() is reading them.
> -		 * (We never shrink nr_swapfiles, we never free this entry.)
> -		 */
> +		/* Paired with READ_ONCE() in swap_type_to_swap_info() */
>  		smp_wmb();
> -		WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
> +		WRITE_ONCE(swap_info[type], p);
> +		nr_swapfiles++;

Ah, I think I see what you meant to say, it would perhaps help if you
write it like so:


diff --git a/mm/swapfile.c b/mm/swapfile.c
index 149e77454e3c..94735248dcd2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -99,11 +99,10 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0);
 
 static struct swap_info_struct *swap_type_to_swap_info(int type)
 {
-	if (type >= READ_ONCE(nr_swapfiles))
+	if (type >= MAX_SWAPFILES)
 		return NULL;
 
-	smp_rmb();	/* Pairs with smp_wmb in alloc_swap_info. */
-	return READ_ONCE(swap_info[type]);
+	return READ_ONCE(swap_info[type]); /* rcu_dereference() */
 }
 
 static inline unsigned char swap_count(unsigned char ent)
@@ -2869,14 +2868,11 @@ static struct swap_info_struct *alloc_swap_info(void)
 	}
 	if (type >= nr_swapfiles) {
 		p->type = type;
-		WRITE_ONCE(swap_info[type], p);
 		/*
-		 * Write swap_info[type] before nr_swapfiles, in case a
-		 * racing procfs swap_start() or swap_next() is reading them.
-		 * (We never shrink nr_swapfiles, we never free this entry.)
+		 * Publish the swap_info_struct.
 		 */
-		smp_wmb();
-		WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
+		smp_store_release(&swap_info[type], p); /* rcu_assign_pointer() */
+		nr_swapfiles++;
 	} else {
 		defer = p;
 		p = swap_info[type];
Daniel Jordan May 14, 2021, 1:59 a.m. UTC | #6
On Thu, May 13, 2021 at 02:46:10PM +0200, Peter Zijlstra wrote:
> Ah, I think I see what you meant to say, it would perhaps help if you
> write it like so:
> 
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 149e77454e3c..94735248dcd2 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -99,11 +99,10 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0);
>  
>  static struct swap_info_struct *swap_type_to_swap_info(int type)
>  {
> -	if (type >= READ_ONCE(nr_swapfiles))
> +	if (type >= MAX_SWAPFILES)
>  		return NULL;
>  
> -	smp_rmb();	/* Pairs with smp_wmb in alloc_swap_info. */
> -	return READ_ONCE(swap_info[type]);
> +	return READ_ONCE(swap_info[type]); /* rcu_dereference() */
>  }
>  
>  static inline unsigned char swap_count(unsigned char ent)
> @@ -2869,14 +2868,11 @@ static struct swap_info_struct *alloc_swap_info(void)
>  	}
>  	if (type >= nr_swapfiles) {
>  		p->type = type;
> -		WRITE_ONCE(swap_info[type], p);
>  		/*
> -		 * Write swap_info[type] before nr_swapfiles, in case a
> -		 * racing procfs swap_start() or swap_next() is reading them.
> -		 * (We never shrink nr_swapfiles, we never free this entry.)
> +		 * Publish the swap_info_struct.
>  		 */
> -		smp_wmb();
> -		WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
> +		smp_store_release(&swap_info[type], p); /* rcu_assign_pointer() */
> +		nr_swapfiles++;

Yes, this does help, I didn't understand why smp_wmb stayed around in
the original post.

I think the only access smp_store_release() orders is p->type.  Wouldn't
it be kinda inconsistent to only initialize that one field before
publishing when many others would be done at the end of
alloc_swap_info() after the fact?  p->type doesn't seem special.  For
instance, get_swap_page_of_type() touches si->lock soon after it calls
swap_type_to_swap_info(), so there could be a small window where there's
a non-NULL si with an uninitialized lock.

It's not as if this is likely to be a problem in practice, it would just
make it harder to understand why smp_store_release is there.  Maybe all
we need is a WRITE_ONCE, or if it's really necessary for certain fields
to be set before publication then move them up and explain?
Huang, Ying May 14, 2021, 3:27 a.m. UTC | #7
Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, May 13, 2021 at 02:48:37PM +0800, Huang Ying wrote:
>>  mm/swapfile.c | 18 +++++++++---------
>>  1 file changed, 9 insertions(+), 9 deletions(-)
>> 
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 2aad85751991..4c1fb28bbe0e 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -100,10 +100,14 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0);
>>  
>>  static struct swap_info_struct *swap_type_to_swap_info(int type)
>>  {
>> -	if (type >= READ_ONCE(nr_swapfiles))
>> +	if (type >= MAX_SWAPFILES)
>>  		return NULL;
>>  
>> -	smp_rmb();	/* Pairs with smp_wmb in alloc_swap_info. */
>> +	/*
>> +	 * The data dependency ordering from the READ_ONCE() pairs
>> +	 * with smp_wmb() in alloc_swap_info() to guarantee the
>> +	 * swap_info_struct fields are read after swap_info[type].
>> +	 */
>>  	return READ_ONCE(swap_info[type]);
>>  }
>>  
>> @@ -2884,14 +2888,10 @@ static struct swap_info_struct *alloc_swap_info(void)
>>  	}
>>  	if (type >= nr_swapfiles) {
>>  		p->type = type;
>> -		WRITE_ONCE(swap_info[type], p);
>> -		/*
>> -		 * Write swap_info[type] before nr_swapfiles, in case a
>> -		 * racing procfs swap_start() or swap_next() is reading them.
>> -		 * (We never shrink nr_swapfiles, we never free this entry.)
>> -		 */
>> +		/* Paired with READ_ONCE() in swap_type_to_swap_info() */
>>  		smp_wmb();
>> -		WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
>> +		WRITE_ONCE(swap_info[type], p);
>> +		nr_swapfiles++;
>
> Ah, I think I see what you meant to say, it would perhaps help if you
> write it like so:
>
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 149e77454e3c..94735248dcd2 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -99,11 +99,10 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0);
>  
>  static struct swap_info_struct *swap_type_to_swap_info(int type)
>  {
> -	if (type >= READ_ONCE(nr_swapfiles))
> +	if (type >= MAX_SWAPFILES)
>  		return NULL;
>  
> -	smp_rmb();	/* Pairs with smp_wmb in alloc_swap_info. */
> -	return READ_ONCE(swap_info[type]);
> +	return READ_ONCE(swap_info[type]); /* rcu_dereference() */
>  }
>  
>  static inline unsigned char swap_count(unsigned char ent)
> @@ -2869,14 +2868,11 @@ static struct swap_info_struct *alloc_swap_info(void)
>  	}
>  	if (type >= nr_swapfiles) {
>  		p->type = type;
> -		WRITE_ONCE(swap_info[type], p);
>  		/*
> -		 * Write swap_info[type] before nr_swapfiles, in case a
> -		 * racing procfs swap_start() or swap_next() is reading them.
> -		 * (We never shrink nr_swapfiles, we never free this entry.)
> +		 * Publish the swap_info_struct.
>  		 */
> -		smp_wmb();
> -		WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
> +		smp_store_release(&swap_info[type], p); /* rcu_assign_pointer() */
> +		nr_swapfiles++;
>  	} else {
>  		defer = p;
>  		p = swap_info[type];

OK.  It seems that this helps people to understand.  I will use this in
the next version.

Best Regards,
Huang, Ying
Huang, Ying May 14, 2021, 4:02 a.m. UTC | #8
Daniel Jordan <daniel.m.jordan@oracle.com> writes:

> On Thu, May 13, 2021 at 02:46:10PM +0200, Peter Zijlstra wrote:
>> Ah, I think I see what you meant to say, it would perhaps help if you
>> write it like so:
>> 
>> 
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 149e77454e3c..94735248dcd2 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -99,11 +99,10 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0);
>>  
>>  static struct swap_info_struct *swap_type_to_swap_info(int type)
>>  {
>> -	if (type >= READ_ONCE(nr_swapfiles))
>> +	if (type >= MAX_SWAPFILES)
>>  		return NULL;
>>  
>> -	smp_rmb();	/* Pairs with smp_wmb in alloc_swap_info. */
>> -	return READ_ONCE(swap_info[type]);
>> +	return READ_ONCE(swap_info[type]); /* rcu_dereference() */
>>  }
>>  
>>  static inline unsigned char swap_count(unsigned char ent)
>> @@ -2869,14 +2868,11 @@ static struct swap_info_struct *alloc_swap_info(void)
>>  	}
>>  	if (type >= nr_swapfiles) {
>>  		p->type = type;
>> -		WRITE_ONCE(swap_info[type], p);
>>  		/*
>> -		 * Write swap_info[type] before nr_swapfiles, in case a
>> -		 * racing procfs swap_start() or swap_next() is reading them.
>> -		 * (We never shrink nr_swapfiles, we never free this entry.)
>> +		 * Publish the swap_info_struct.
>>  		 */
>> -		smp_wmb();
>> -		WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
>> +		smp_store_release(&swap_info[type], p); /* rcu_assign_pointer() */
>> +		nr_swapfiles++;
>
> Yes, this does help, I didn't understand why smp_wmb stayed around in
> the original post.
>
> I think the only access smp_store_release() orders is p->type.  Wouldn't
> it be kinda inconsistent to only initialize that one field before
> publishing when many others would be done at the end of
> alloc_swap_info() after the fact?

In addition to p->type, *p is zeroed via kvzalloc().

> p->type doesn't seem special.  For
> instance, get_swap_page_of_type() touches si->lock soon after it calls
> swap_type_to_swap_info(), so there could be a small window where there's
> a non-NULL si with an uninitialized lock.

We usually check the state of swap_info_struct before other operations.
For example, we check si->swap_map in swap_start().

> It's not as if this is likely to be a problem in practice, it would just
> make it harder to understand why smp_store_release is there.  Maybe all
> we need is a WRITE_ONCE, or if it's really necessary for certain fields
> to be set before publication then move them up and explain?

I think we have initialized all fields before publication :-).

Best Regards,
Huang, Ying
Peter Zijlstra May 14, 2021, 12:04 p.m. UTC | #9
On Thu, May 13, 2021 at 09:59:46PM -0400, Daniel Jordan wrote:
> On Thu, May 13, 2021 at 02:46:10PM +0200, Peter Zijlstra wrote:
> > Ah, I think I see what you meant to say, it would perhaps help if you
> > write it like so:
> > 
> > 
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 149e77454e3c..94735248dcd2 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -99,11 +99,10 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0);
> >  
> >  static struct swap_info_struct *swap_type_to_swap_info(int type)
> >  {
> > -	if (type >= READ_ONCE(nr_swapfiles))
> > +	if (type >= MAX_SWAPFILES)
> >  		return NULL;
> >  
> > -	smp_rmb();	/* Pairs with smp_wmb in alloc_swap_info. */
> > -	return READ_ONCE(swap_info[type]);
> > +	return READ_ONCE(swap_info[type]); /* rcu_dereference() */
> >  }
> >  
> >  static inline unsigned char swap_count(unsigned char ent)
> > @@ -2869,14 +2868,11 @@ static struct swap_info_struct *alloc_swap_info(void)
> >  	}
> >  	if (type >= nr_swapfiles) {
> >  		p->type = type;
> > -		WRITE_ONCE(swap_info[type], p);
> >  		/*
> > -		 * Write swap_info[type] before nr_swapfiles, in case a
> > -		 * racing procfs swap_start() or swap_next() is reading them.
> > -		 * (We never shrink nr_swapfiles, we never free this entry.)
> > +		 * Publish the swap_info_struct.
> >  		 */
> > -		smp_wmb();
> > -		WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
> > +		smp_store_release(&swap_info[type], p); /* rcu_assign_pointer() */
> > +		nr_swapfiles++;
> 
> Yes, this does help, I didn't understand why smp_wmb stayed around in
> the original post.
> 
> I think the only access smp_store_release() orders is p->type.  Wouldn't
> it be kinda inconsistent to only initialize that one field before
> publishing when many others would be done at the end of
> alloc_swap_info() after the fact?  p->type doesn't seem special.  For
> instance, get_swap_page_of_type() touches si->lock soon after it calls
> swap_type_to_swap_info(), so there could be a small window where there's
> a non-NULL si with an uninitialized lock.
> 
> It's not as if this is likely to be a problem in practice, it would just
> make it harder to understand why smp_store_release is there.  Maybe all
> we need is a WRITE_ONCE, or if it's really necessary for certain fields
> to be set before publication then move them up and explain?

You also care about the zero fill from kvzalloc(). Without the
smp_store_release() the zero-fill from the memset() might only be
visible 'late'.

Unless that also isn't a problem?
Daniel Jordan May 14, 2021, 8:49 p.m. UTC | #10
On Fri, May 14, 2021 at 12:02:05PM +0800, Huang, Ying wrote:
> Daniel Jordan <daniel.m.jordan@oracle.com> writes:
> > Yes, this does help, I didn't understand why smp_wmb stayed around in
> > the original post.
> >
> > I think the only access smp_store_release() orders is p->type.  Wouldn't
> > it be kinda inconsistent to only initialize that one field before
> > publishing when many others would be done at the end of
> > alloc_swap_info() after the fact?
> 
> In addition to p->type, *p is zeroed via kvzalloc().

So it is, good point.

> > p->type doesn't seem special.  For
> > instance, get_swap_page_of_type() touches si->lock soon after it calls
> > swap_type_to_swap_info(), so there could be a small window where there's
> > a non-NULL si with an uninitialized lock.
> 
> We usually check the state of swap_info_struct before other operations.
> For example, we check si->swap_map in swap_start().

Yes, we usually do.

> > It's not as if this is likely to be a problem in practice, it would just
> > make it harder to understand why smp_store_release is there.  Maybe all
> > we need is a WRITE_ONCE, or if it's really necessary for certain fields
> > to be set before publication then move them up and explain?
> 
> I think we have initialized all fields before publication :-).

Probably all the ones that matter in practice, yes :-)

Still feeling slightly uneasy about the theoretical p->lock, but that
was possible before this change too so it's out of scope.

A comment explaining the pairing and that we care mostly about the zero
init would be nice.
Daniel Jordan May 14, 2021, 8:51 p.m. UTC | #11
On Fri, May 14, 2021 at 02:04:14PM +0200, Peter Zijlstra wrote:
> On Thu, May 13, 2021 at 09:59:46PM -0400, Daniel Jordan wrote:
> > Yes, this does help, I didn't understand why smp_wmb stayed around in
> > the original post.
> > 
> > I think the only access smp_store_release() orders is p->type.  Wouldn't
> > it be kinda inconsistent to only initialize that one field before
> > publishing when many others would be done at the end of
> > alloc_swap_info() after the fact?  p->type doesn't seem special.  For
> > instance, get_swap_page_of_type() touches si->lock soon after it calls
> > swap_type_to_swap_info(), so there could be a small window where there's
> > a non-NULL si with an uninitialized lock.
> > 
> > It's not as if this is likely to be a problem in practice, it would just
> > make it harder to understand why smp_store_release is there.  Maybe all
> > we need is a WRITE_ONCE, or if it's really necessary for certain fields
> > to be set before publication then move them up and explain?
> 
> You also care about the zero fill from kvzalloc(). Without the
> smp_store_release() the zero-fill from the memset() might only be
> visible 'late'.

Aha, yes, didn't consider that!

> Unless that also isn't a problem?

No, you're right, we need that for p->flags at least.
diff mbox series

Patch

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2aad85751991..4c1fb28bbe0e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -100,10 +100,14 @@  atomic_t nr_rotate_swap = ATOMIC_INIT(0);
 
 static struct swap_info_struct *swap_type_to_swap_info(int type)
 {
-	if (type >= READ_ONCE(nr_swapfiles))
+	if (type >= MAX_SWAPFILES)
 		return NULL;
 
-	smp_rmb();	/* Pairs with smp_wmb in alloc_swap_info. */
+	/*
+	 * The data dependency ordering from the READ_ONCE() pairs
+	 * with smp_wmb() in alloc_swap_info() to guarantee the
+	 * swap_info_struct fields are read after swap_info[type].
+	 */
 	return READ_ONCE(swap_info[type]);
 }
 
@@ -2884,14 +2888,10 @@  static struct swap_info_struct *alloc_swap_info(void)
 	}
 	if (type >= nr_swapfiles) {
 		p->type = type;
-		WRITE_ONCE(swap_info[type], p);
-		/*
-		 * Write swap_info[type] before nr_swapfiles, in case a
-		 * racing procfs swap_start() or swap_next() is reading them.
-		 * (We never shrink nr_swapfiles, we never free this entry.)
-		 */
+		/* Paired with READ_ONCE() in swap_type_to_swap_info() */
 		smp_wmb();
-		WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
+		WRITE_ONCE(swap_info[type], p);
+		nr_swapfiles++;
 	} else {
 		defer = p;
 		p = swap_info[type];