Convert struct pid count to refcount_t

Message ID 20190327145331.215360-1-joel@joelfernandes.org

Commit Message

Joel Fernandes March 27, 2019, 2:53 p.m. UTC
struct pid's count is an atomic_t field used as a refcount. Use
refcount_t for it which is basically atomic_t but does additional
checking to prevent use-after-free bugs. No change in behavior if
CONFIG_REFCOUNT_FULL=n.

Cc: keescook@chromium.org
Cc: kernel-team@android.com
Cc: kernel-hardening@lists.openwall.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

---
 include/linux/pid.h | 5 +++--
 kernel/pid.c        | 8 ++++----
 2 files changed, 7 insertions(+), 6 deletions(-)

Comments

Kees Cook March 28, 2019, 12:06 a.m. UTC | #1
On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> struct pid's count is an atomic_t field used as a refcount. Use
> refcount_t for it which is basically atomic_t but does additional
> checking to prevent use-after-free bugs. No change in behavior if
> CONFIG_REFCOUNT_FULL=n.
>
> Cc: keescook@chromium.org
> Cc: kernel-team@android.com
> Cc: kernel-hardening@lists.openwall.com
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> [...]
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 20881598bdfa..2095c7da644d 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -37,7 +37,7 @@
>  #include <linux/init_task.h>
>  #include <linux/syscalls.h>
>  #include <linux/proc_ns.h>
> -#include <linux/proc_fs.h>
> +#include <linux/refcount.h>
>  #include <linux/sched/task.h>
>  #include <linux/idr.h>
>
> @@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
>                 return;
>
>         ns = pid->numbers[pid->level].ns;
> -       if ((atomic_read(&pid->count) == 1) ||
> -            atomic_dec_and_test(&pid->count)) {
> +       if ((refcount_read(&pid->count) == 1) ||
> +            refcount_dec_and_test(&pid->count)) {

Why is this (and the original code) safe in the face of a race against
get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I
don't see this code pattern anywhere else in the kernel.

Beyond that, yes please. :)

-Kees


>                 kmem_cache_free(ns->pid_cachep, pid);
>                 put_pid_ns(ns);
>         }
> @@ -210,7 +210,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>         }
>
>         get_pid_ns(ns);
> -       atomic_set(&pid->count, 1);
> +       refcount_set(&pid->count, 1);
>         for (type = 0; type < PIDTYPE_MAX; ++type)
>                 INIT_HLIST_HEAD(&pid->tasks[type]);
>
> --
> 2.21.0.392.gf8f6787159e-goog
>


--
Kees Cook
Jann Horn March 28, 2019, 12:59 a.m. UTC | #2
On Thu, Mar 28, 2019 at 1:06 AM Kees Cook <keescook@chromium.org> wrote:
> On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> >
> > struct pid's count is an atomic_t field used as a refcount. Use
> > refcount_t for it which is basically atomic_t but does additional
> > checking to prevent use-after-free bugs. No change in behavior if
> > CONFIG_REFCOUNT_FULL=n.
> >
> > Cc: keescook@chromium.org
> > Cc: kernel-team@android.com
> > Cc: kernel-hardening@lists.openwall.com
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > [...]
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 20881598bdfa..2095c7da644d 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -37,7 +37,7 @@
> >  #include <linux/init_task.h>
> >  #include <linux/syscalls.h>
> >  #include <linux/proc_ns.h>
> > -#include <linux/proc_fs.h>
> > +#include <linux/refcount.h>
> >  #include <linux/sched/task.h>
> >  #include <linux/idr.h>
> >
> > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
> >                 return;
> >
> >         ns = pid->numbers[pid->level].ns;
> > -       if ((atomic_read(&pid->count) == 1) ||
> > -            atomic_dec_and_test(&pid->count)) {
> > +       if ((refcount_read(&pid->count) == 1) ||
> > +            refcount_dec_and_test(&pid->count)) {
>
> Why is this (and the original code) safe in the face of a race against
> get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I
> don't see this code pattern anywhere else in the kernel.

Semantically, it doesn't make a difference whether you do this or
leave out the "refcount_read(&pid->count) == 1". If you read a 1 from
refcount_read(), then you have the only reference to "struct pid", and
therefore you want to free it. If you don't get a 1, you have to
atomically drop a reference, which, if someone else is concurrently
also dropping a reference, may leave you with the last reference (in
the case where refcount_dec_and_test() returns true), in which case
you still have to take care of freeing it.

My guess is that the goal of this is to make the "drop last reference"
case a little bit faster by avoiding the cacheline dirtying and the
atomic op, at the expense of an extra memory op and branch every time
we drop a non-final reference. But that's a pretty low-level
optimization, and forking by itself isn't exactly fast... I think the
clean thing to do would be to either move this detail into the
refcount implementation (if it turns out to actually be valuable in at
least a microbenchmark), or just get rid of it. Given the overhead of
fork()/clone(), I would be surprised if you could actually measure
this effect here.

Eric, can you remember the rationale for doing it that way in commit
92476d7fc0326a409ab1d3864a04093a6be9aca7? Am I guessing correctly?
Joel Fernandes March 28, 2019, 2:34 a.m. UTC | #3
On Thu, Mar 28, 2019 at 01:59:45AM +0100, Jann Horn wrote:
> On Thu, Mar 28, 2019 at 1:06 AM Kees Cook <keescook@chromium.org> wrote:
> > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google)
> > <joel@joelfernandes.org> wrote:
> > >
> > > struct pid's count is an atomic_t field used as a refcount. Use
> > > refcount_t for it which is basically atomic_t but does additional
> > > checking to prevent use-after-free bugs. No change in behavior if
> > > CONFIG_REFCOUNT_FULL=n.
> > >
> > > Cc: keescook@chromium.org
> > > Cc: kernel-team@android.com
> > > Cc: kernel-hardening@lists.openwall.com
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > [...]
> > > diff --git a/kernel/pid.c b/kernel/pid.c
> > > index 20881598bdfa..2095c7da644d 100644
> > > --- a/kernel/pid.c
> > > +++ b/kernel/pid.c
> > > @@ -37,7 +37,7 @@
> > >  #include <linux/init_task.h>
> > >  #include <linux/syscalls.h>
> > >  #include <linux/proc_ns.h>
> > > -#include <linux/proc_fs.h>
> > > +#include <linux/refcount.h>
> > >  #include <linux/sched/task.h>
> > >  #include <linux/idr.h>
> > >
> > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
> > >                 return;
> > >
> > >         ns = pid->numbers[pid->level].ns;
> > > -       if ((atomic_read(&pid->count) == 1) ||
> > > -            atomic_dec_and_test(&pid->count)) {
> > > +       if ((refcount_read(&pid->count) == 1) ||
> > > +            refcount_dec_and_test(&pid->count)) {
> >
> > Why is this (and the original code) safe in the face of a race against
> > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I
> > don't see this code pattern anywhere else in the kernel.
> 
> Semantically, it doesn't make a difference whether you do this or
> leave out the "refcount_read(&pid->count) == 1". If you read a 1 from
> refcount_read(), then you have the only reference to "struct pid", and
> therefore you want to free it. If you don't get a 1, you have to
> atomically drop a reference, which, if someone else is concurrently
> also dropping a reference, may leave you with the last reference (in
> the case where refcount_dec_and_test() returns true), in which case
> you still have to take care of freeing it.

Also, based on Kees' comment, it appears to me that get_pid and
put_pid can race in this way in the original code, right?

get_pid			put_pid

			atomic_dec_and_test returns 1
atomic_inc
			kfree

deref pid /* boom */
-------------------------------------------------

I think get_pid needs to call atomic_inc_not_zero() and put_pid should
not test for pid->count == 1 as condition for freeing, but rather just do
atomic_dec_and_test. So something like the following diff. (And I see a
similar pattern used in drivers/net/mac.c)

Is the above scenario valid? I didn't see any locking around get_pid or
put_pid to avoid such a race.

---8<-----------------------

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 8cb86d377ff5..3d79834e3180 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -69,8 +69,8 @@ extern struct pid init_struct_pid;
 
 static inline struct pid *get_pid(struct pid *pid)
 {
-	if (pid)
-		refcount_inc(&pid->count);
+	if (!pid || !refcount_inc_not_zero(&pid->count))
+		return NULL;
 	return pid;
 }
 
diff --git a/kernel/pid.c b/kernel/pid.c
index 2095c7da644d..89c4849fab5d 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -106,8 +106,7 @@ void put_pid(struct pid *pid)
 		return;
 
 	ns = pid->numbers[pid->level].ns;
-	if ((refcount_read(&pid->count) == 1) ||
-	     refcount_dec_and_test(&pid->count)) {
+	if (refcount_dec_and_test(&pid->count)) {
 		kmem_cache_free(ns->pid_cachep, pid);
 		put_pid_ns(ns);
 	}
Jann Horn March 28, 2019, 2:57 a.m. UTC | #4
On Thu, Mar 28, 2019 at 3:34 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> On Thu, Mar 28, 2019 at 01:59:45AM +0100, Jann Horn wrote:
> > On Thu, Mar 28, 2019 at 1:06 AM Kees Cook <keescook@chromium.org> wrote:
> > > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google)
> > > <joel@joelfernandes.org> wrote:
> > > >
> > > > struct pid's count is an atomic_t field used as a refcount. Use
> > > > refcount_t for it which is basically atomic_t but does additional
> > > > checking to prevent use-after-free bugs. No change in behavior if
> > > > CONFIG_REFCOUNT_FULL=n.
> > > >
> > > > Cc: keescook@chromium.org
> > > > Cc: kernel-team@android.com
> > > > Cc: kernel-hardening@lists.openwall.com
> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > [...]
> > > > diff --git a/kernel/pid.c b/kernel/pid.c
> > > > index 20881598bdfa..2095c7da644d 100644
> > > > --- a/kernel/pid.c
> > > > +++ b/kernel/pid.c
> > > > @@ -37,7 +37,7 @@
> > > >  #include <linux/init_task.h>
> > > >  #include <linux/syscalls.h>
> > > >  #include <linux/proc_ns.h>
> > > > -#include <linux/proc_fs.h>
> > > > +#include <linux/refcount.h>
> > > >  #include <linux/sched/task.h>
> > > >  #include <linux/idr.h>
> > > >
> > > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
> > > >                 return;
> > > >
> > > >         ns = pid->numbers[pid->level].ns;
> > > > -       if ((atomic_read(&pid->count) == 1) ||
> > > > -            atomic_dec_and_test(&pid->count)) {
> > > > +       if ((refcount_read(&pid->count) == 1) ||
> > > > +            refcount_dec_and_test(&pid->count)) {
> > >
> > > Why is this (and the original code) safe in the face of a race against
> > > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I
> > > don't see this code pattern anywhere else in the kernel.
> >
> > Semantically, it doesn't make a difference whether you do this or
> > leave out the "refcount_read(&pid->count) == 1". If you read a 1 from
> > refcount_read(), then you have the only reference to "struct pid", and
> > therefore you want to free it. If you don't get a 1, you have to
> > atomically drop a reference, which, if someone else is concurrently
> > also dropping a reference, may leave you with the last reference (in
> > the case where refcount_dec_and_test() returns true), in which case
> > you still have to take care of freeing it.
>
> Also, based on Kees comment, I think it appears to me that get_pid and
> put_pid can race in this way in the original code right?
>
> get_pid                 put_pid
>
>                         atomic_dec_and_test returns 1

This can't happen. get_pid() can only be called on an existing
reference. If you are calling get_pid() on an existing reference, and
someone else is dropping another reference with put_pid(), then when
both functions start running, the refcount must be at least 2.

> atomic_inc
>                         kfree
>
> deref pid /* boom */
> -------------------------------------------------
>
> I think get_pid needs to call atomic_inc_not_zero() and put_pid should
> not test for pid->count == 1 as condition for freeing, but rather just do
> atomic_dec_and_test. So something like the following diff. (And I see a
> similar pattern used in drivers/net/mac.c)

get_pid() can only be called when you already have a refcounted
reference; in other words, when the reference count is at least one.
The lifetime management of struct pid differs from the lifetime
management of most other objects in the kernel; the usual patterns
don't quite apply here.

Look at put_pid(): When the refcount has reached zero, there is no RCU
grace period (unlike most other objects with RCU-managed lifetimes).
Instead, free_pid() has an RCU grace period *before* it invokes
delayed_put_pid() to drop a reference; and free_pid() is also the
function that removes a PID from the namespace's IDR, and it is used
by __change_pid() when a task loses its reference on a PID.

In other words: Most refcounted objects with RCU guarantee that the
object waits for a grace period after its refcount has reached zero;
and during the grace period, the refcount is zero and you're not
allowed to increment it again. But for struct pid, the guarantee is
instead that there is an RCU grace period after it has been removed
from the IDRs and the task, and during the grace period, refcounting
is guaranteed to still work normally.

> Is the above scenario valid? I didn't see any locking around get_pid or
> pud_pid to avoid such a race.
>
> ---8<-----------------------
>
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index 8cb86d377ff5..3d79834e3180 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -69,8 +69,8 @@ extern struct pid init_struct_pid;
>
>  static inline struct pid *get_pid(struct pid *pid)
>  {
> -       if (pid)
> -               refcount_inc(&pid->count);
> +       if (!pid || !refcount_inc_not_zero(&pid->count))
> +               return NULL;
>         return pid;
>  }

Nope, this is wrong. Once the refcount is zero, the object goes away,
refcount_inc_not_zero() makes no sense here.

>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 2095c7da644d..89c4849fab5d 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -106,8 +106,7 @@ void put_pid(struct pid *pid)
>                 return;
>
>         ns = pid->numbers[pid->level].ns;
> -       if ((refcount_read(&pid->count) == 1) ||
> -            refcount_dec_and_test(&pid->count)) {
> +       if (refcount_dec_and_test(&pid->count)) {
>                 kmem_cache_free(ns->pid_cachep, pid);
>                 put_pid_ns(ns);
>         }
>
>
Oleg Nesterov March 28, 2019, 2:26 p.m. UTC | #5
On 03/27, Joel Fernandes wrote:
>
> Also, based on Kees comment, I think it appears to me that get_pid and
> put_pid can race in this way in the original code right?
>
> get_pid			put_pid
>
> 			atomic_dec_and_test returns 1
> atomic_inc
> 			kfree
>
> deref pid /* boom */
> -------------------------------------------------
>
> I think get_pid needs to call atomic_inc_not_zero()

No.

get_pid() should only be used if you already have a reference or you do
something like

	rcu_read_lock();
	pid = find_vpid();
	get_pid();
	rcu_read_unlock();

in this case we rely on call_rcu(delayed_put_pid) which drops the initial
reference.

If put_pid() sees pid->count == 1, then a) nobody else has a reference and
b) nobody else can find this pid on rcu-protected lists, so it is safe to
free it.

Oleg.
Joel Fernandes March 28, 2019, 2:37 p.m. UTC | #6
On Thu, Mar 28, 2019 at 03:57:44AM +0100, Jann Horn wrote:
> On Thu, Mar 28, 2019 at 3:34 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > On Thu, Mar 28, 2019 at 01:59:45AM +0100, Jann Horn wrote:
> > > On Thu, Mar 28, 2019 at 1:06 AM Kees Cook <keescook@chromium.org> wrote:
> > > > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google)
> > > > <joel@joelfernandes.org> wrote:
> > > > >
> > > > > struct pid's count is an atomic_t field used as a refcount. Use
> > > > > refcount_t for it which is basically atomic_t but does additional
> > > > > checking to prevent use-after-free bugs. No change in behavior if
> > > > > CONFIG_REFCOUNT_FULL=n.
> > > > >
> > > > > Cc: keescook@chromium.org
> > > > > Cc: kernel-team@android.com
> > > > > Cc: kernel-hardening@lists.openwall.com
> > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > [...]
> > > > > diff --git a/kernel/pid.c b/kernel/pid.c
> > > > > index 20881598bdfa..2095c7da644d 100644
> > > > > --- a/kernel/pid.c
> > > > > +++ b/kernel/pid.c
> > > > > @@ -37,7 +37,7 @@
> > > > >  #include <linux/init_task.h>
> > > > >  #include <linux/syscalls.h>
> > > > >  #include <linux/proc_ns.h>
> > > > > -#include <linux/proc_fs.h>
> > > > > +#include <linux/refcount.h>
> > > > >  #include <linux/sched/task.h>
> > > > >  #include <linux/idr.h>
> > > > >
> > > > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
> > > > >                 return;
> > > > >
> > > > >         ns = pid->numbers[pid->level].ns;
> > > > > -       if ((atomic_read(&pid->count) == 1) ||
> > > > > -            atomic_dec_and_test(&pid->count)) {
> > > > > +       if ((refcount_read(&pid->count) == 1) ||
> > > > > +            refcount_dec_and_test(&pid->count)) {
> > > >
> > > > Why is this (and the original code) safe in the face of a race against
> > > > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I
> > > > don't see this code pattern anywhere else in the kernel.
> > >
> > > Semantically, it doesn't make a difference whether you do this or
> > > leave out the "refcount_read(&pid->count) == 1". If you read a 1 from
> > > refcount_read(), then you have the only reference to "struct pid", and
> > > therefore you want to free it. If you don't get a 1, you have to
> > > atomically drop a reference, which, if someone else is concurrently
> > > also dropping a reference, may leave you with the last reference (in
> > > the case where refcount_dec_and_test() returns true), in which case
> > > you still have to take care of freeing it.
> >
> > Also, based on Kees comment, I think it appears to me that get_pid and
> > put_pid can race in this way in the original code right?
> >
> > get_pid                 put_pid
> >
> >                         atomic_dec_and_test returns 1
> 
> This can't happen. get_pid() can only be called on an existing
> reference. If you are calling get_pid() on an existing reference, and
> someone else is dropping another reference with put_pid(), then when
> both functions start running, the refcount must be at least 2.

Sigh, you are right. Ok. I was quite tired last night when I wrote this.
Obviously, I should have waited a bit and thought it through.

Kees can you describe more the race you had in mind?

> > atomic_inc
> >                         kfree
> >
> > deref pid /* boom */
> > -------------------------------------------------
> >
> > I think get_pid needs to call atomic_inc_not_zero() and put_pid should
> > not test for pid->count == 1 as condition for freeing, but rather just do
> > atomic_dec_and_test. So something like the following diff. (And I see a
> > similar pattern used in drivers/net/mac.c)
> 
> get_pid() can only be called when you already have a refcounted
> reference; in other words, when the reference count is at least one.
> The lifetime management of struct pid differs from the lifetime
> management of most other objects in the kernel; the usual patterns
> don't quite apply here.
> 
> Look at put_pid(): When the refcount has reached zero, there is no RCU
> grace period (unlike most other objects with RCU-managed lifetimes).
> Instead, free_pid() has an RCU grace period *before* it invokes
> delayed_put_pid() to drop a reference; and free_pid() is also the
> function that removes a PID from the namespace's IDR, and it is used
> by __change_pid() when a task loses its reference on a PID.
> 
> In other words: Most refcounted objects with RCU guarantee that the
> object waits for a grace period after its refcount has reached zero;
> and during the grace period, the refcount is zero and you're not
> allowed to increment it again.

Can you give an example of this "most refcounted objects with RCU" use case?
I could not find any good examples of it. I want to document this pattern
and possibly submit it to Documentation/RCU.

> But for struct pid, the guarantee is
> instead that there is an RCU grace period after it has been removed
> from the IDRs and the task, and during the grace period, refcounting
> is guaranteed to still work normally.

Ok, thanks. Here, in scrappy but simple pseudocode form, the struct
pid flow is something like this (with "pid" replaced by "data"):

get_data:
	atomic_inc(data->refcount);

some_user_of_data:
	rcu_read_lock();
	From X, obtain a ptr to data using rcu_dereference.
	get_data(data);
	rcu_read_unlock();

free_data:
	remove all references to data in all places in X
	call_rcu(put_data)

put_data:
	if (atomic_dec_and_test(data->refcount)) {
		free(data);
	}

create_data:
	data = alloc(..)
	atomic_set(data->refcount, 1);
	set pointers to data in X.

> > pud_pid to avoid such a race.
> >
> > ---8<-----------------------
> >
> > diff --git a/include/linux/pid.h b/include/linux/pid.h
> > index 8cb86d377ff5..3d79834e3180 100644
> > --- a/include/linux/pid.h
> > +++ b/include/linux/pid.h
> > @@ -69,8 +69,8 @@ extern struct pid init_struct_pid;
> >
> >  static inline struct pid *get_pid(struct pid *pid)
> >  {
> > -       if (pid)
> > -               refcount_inc(&pid->count);
> > +       if (!pid || !refcount_inc_not_zero(&pid->count))
> > +               return NULL;
> >         return pid;
> >  }
> 
> Nope, this is wrong. Once the refcount is zero, the object goes away,
> refcount_inc_not_zero() makes no sense here.

Yeah ok, I think what you meant here is that references to the object from
all places go away before the grace period starts, so a get_pid on an object
with refcount of zero is impossible since there's no way to *get* to that
object after the grace-period ends.

So, yes you are right that refcount_inc is all that's needed.

Also, a note to the onlooker: the original patch I sent is not wrong; it
still applies and is correct. We are just discussing here possible issues
with the *existing* code.

thanks!

 - Joel
Joel Fernandes March 28, 2019, 2:39 p.m. UTC | #7
On Thu, Mar 28, 2019 at 03:26:19PM +0100, Oleg Nesterov wrote:
> On 03/27, Joel Fernandes wrote:
> >
> > Also, based on Kees comment, I think it appears to me that get_pid and
> > put_pid can race in this way in the original code right?
> >
> > get_pid			put_pid
> >
> > 			atomic_dec_and_test returns 1
> > atomic_inc
> > 			kfree
> >
> > deref pid /* boom */
> > -------------------------------------------------
> >
> > I think get_pid needs to call atomic_inc_not_zero()
> 
> No.
> 
> get_pid() should only be used if you already have a reference or you do
> something like
> 
> 	rcu_read_lock();
> 	pid = find_vpid();
> 	get_pid();
> 	rcu_read_unlock();
> 
> in this case we rely on call_rcu(delayed_put_pid) which drops the initial
> reference.
> 
> If put_pid() sees pid->count == 1, then a) nobody else has a reference and
> b) nobody else can find this pid on rcu-protected lists, so it is safe to
> free it.

I agree. Please see my reply to Jann; I already responded to him about this. Thanks!
Jann Horn March 28, 2019, 3:17 p.m. UTC | #8
Since we're just talking about RCU stuff now, adding Paul McKenney to
the thread.

On Thu, Mar 28, 2019 at 3:37 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> On Thu, Mar 28, 2019 at 03:57:44AM +0100, Jann Horn wrote:
> > On Thu, Mar 28, 2019 at 3:34 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > On Thu, Mar 28, 2019 at 01:59:45AM +0100, Jann Horn wrote:
> > > > On Thu, Mar 28, 2019 at 1:06 AM Kees Cook <keescook@chromium.org> wrote:
> > > > > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google)
> > > > > <joel@joelfernandes.org> wrote:
> > > > > >
> > > > > > struct pid's count is an atomic_t field used as a refcount. Use
> > > > > > refcount_t for it which is basically atomic_t but does additional
> > > > > > checking to prevent use-after-free bugs. No change in behavior if
> > > > > > CONFIG_REFCOUNT_FULL=n.
> > > > > >
> > > > > > Cc: keescook@chromium.org
> > > > > > Cc: kernel-team@android.com
> > > > > > Cc: kernel-hardening@lists.openwall.com
> > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > > [...]
> > > > > > diff --git a/kernel/pid.c b/kernel/pid.c
> > > > > > index 20881598bdfa..2095c7da644d 100644
> > > > > > --- a/kernel/pid.c
> > > > > > +++ b/kernel/pid.c
> > > > > > @@ -37,7 +37,7 @@
> > > > > >  #include <linux/init_task.h>
> > > > > >  #include <linux/syscalls.h>
> > > > > >  #include <linux/proc_ns.h>
> > > > > > -#include <linux/proc_fs.h>
> > > > > > +#include <linux/refcount.h>
> > > > > >  #include <linux/sched/task.h>
> > > > > >  #include <linux/idr.h>
> > > > > >
> > > > > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
> > > > > >                 return;
> > > > > >
> > > > > >         ns = pid->numbers[pid->level].ns;
> > > > > > -       if ((atomic_read(&pid->count) == 1) ||
> > > > > > -            atomic_dec_and_test(&pid->count)) {
> > > > > > +       if ((refcount_read(&pid->count) == 1) ||
> > > > > > +            refcount_dec_and_test(&pid->count)) {
> > > > >
> > > > > Why is this (and the original code) safe in the face of a race against
> > > > > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I
> > > > > don't see this code pattern anywhere else in the kernel.
> > > >
> > > > Semantically, it doesn't make a difference whether you do this or
> > > > leave out the "refcount_read(&pid->count) == 1". If you read a 1 from
> > > > refcount_read(), then you have the only reference to "struct pid", and
> > > > therefore you want to free it. If you don't get a 1, you have to
> > > > atomically drop a reference, which, if someone else is concurrently
> > > > also dropping a reference, may leave you with the last reference (in
> > > > the case where refcount_dec_and_test() returns true), in which case
> > > > you still have to take care of freeing it.
> > >
> > > Also, based on Kees comment, I think it appears to me that get_pid and
> > > put_pid can race in this way in the original code right?
> > >
> > > get_pid                 put_pid
> > >
> > >                         atomic_dec_and_test returns 1
> >
> > This can't happen. get_pid() can only be called on an existing
> > reference. If you are calling get_pid() on an existing reference, and
> > someone else is dropping another reference with put_pid(), then when
> > both functions start running, the refcount must be at least 2.
>
> Sigh, you are right. Ok. I was quite tired last night when I wrote this.
> Obviously, I should have waited a bit and thought it through.
>
> Kees can you describe more the race you had in mind?
>
> > > atomic_inc
> > >                         kfree
> > >
> > > deref pid /* boom */
> > > -------------------------------------------------
> > >
> > > I think get_pid needs to call atomic_inc_not_zero() and put_pid should
> > > not test for pid->count == 1 as condition for freeing, but rather just do
> > > atomic_dec_and_test. So something like the following diff. (And I see a
> > > similar pattern used in drivers/net/mac.c)
> >
> > get_pid() can only be called when you already have a refcounted
> > reference; in other words, when the reference count is at least one.
> > The lifetime management of struct pid differs from the lifetime
> > management of most other objects in the kernel; the usual patterns
> > don't quite apply here.
> >
> > Look at put_pid(): When the refcount has reached zero, there is no RCU
> > grace period (unlike most other objects with RCU-managed lifetimes).
> > Instead, free_pid() has an RCU grace period *before* it invokes
> > delayed_put_pid() to drop a reference; and free_pid() is also the
> > function that removes a PID from the namespace's IDR, and it is used
> > by __change_pid() when a task loses its reference on a PID.
> >
> > In other words: Most refcounted objects with RCU guarantee that the
> > object waits for a grace period after its refcount has reached zero;
> > and during the grace period, the refcount is zero and you're not
> > allowed to increment it again.
>
> Can you give an example of this "most refcounted objects with RCU" usecase?
> I could not find any good examples of such. I want to document this pattern
> and possibly submit to Documentation/RCU.

E.g. struct posix_acl is a relatively straightforward example:
posix_acl_release() is a wrapper around refcount_dec_and_test(); if
the refcount has dropped to zero, the object is released after an RCU
grace period using kfree_rcu().
get_cached_acl() takes an RCU read lock, does rcu_dereference() [with
a missing __rcu annotation, grmbl], and attempts to take a reference
with refcount_inc_not_zero().

> > But for struct pid, the guarantee is
> > instead that there is an RCU grace period after it has been removed
> > from the IDRs and the task, and during the grace period, refcounting
> > is guaranteed to still work normally.
>
> Ok, thanks. Here I think in scrappy but simple pseudo code form, the struct
> pid flow is something like (replaced "pid" with data");
>
> get_data:
>         atomic_inc(data->refcount);
>
> some_user_of_data:
>         rcu_read_lock();
>         From X, obtain a ptr to data using rcu_dereference.
>         get_data(data);
>         rcu_read_unlock();
>
> free_data:
>         remove all references to data in all places in X
>         call_rcu(put_data)
>
> put_data:
>         if (atomic_dec_and_test(data->refcount)) {
>                 free(data);
>         }
>
> create_data:
>         data = alloc(..)
>         atomic_set(data->refcount, 1);
>         set pointers to data in X.
>
> > > pud_pid to avoid such a race.
> > >
> > > ---8<-----------------------
> > >
> > > diff --git a/include/linux/pid.h b/include/linux/pid.h
> > > index 8cb86d377ff5..3d79834e3180 100644
> > > --- a/include/linux/pid.h
> > > +++ b/include/linux/pid.h
> > > @@ -69,8 +69,8 @@ extern struct pid init_struct_pid;
> > >
> > >  static inline struct pid *get_pid(struct pid *pid)
> > >  {
> > > -       if (pid)
> > > -               refcount_inc(&pid->count);
> > > +       if (!pid || !refcount_inc_not_zero(&pid->count))
> > > +               return NULL;
> > >         return pid;
> > >  }
> >
> > Nope, this is wrong. Once the refcount is zero, the object goes away,
> > refcount_inc_not_zero() makes no sense here.
>
> Yeah ok, I think what you meant here is that references to the object from
> all places go away before the grace period starts, so a get_pid on an object
> with refcount of zero is impossible since there's no way to *get* to that
> object after the grace-period ends.
>
> So, yes you are right that refcount_inc is all that's needed.
>
> Also note to the on looker, the original patch I sent is not wrong, that
> still applies and is correct. We are just discussing here any possible issues
> with the *existing* code.
>
> thanks!
>
>  - Joel
>
Oleg Nesterov March 28, 2019, 4:26 p.m. UTC | #9
On 03/28, Jann Horn wrote:
>
> Since we're just talking about RCU stuff now, adding Paul McKenney to
> the thread.

Since you added Paul let me add more confusion to this thread ;)

There were some concerns about the lack of barriers in put_pid(), but I can't
find that old discussion and I forgot the result of that discussion...

Paul, could you confirm that this code

	CPU_0		CPU_1

	X = 1;		if (READ_ONCE(Y))
	mb();			X = 2;
	Y = 1;		BUG_ON(X != 2);


is correct? I think it is, control dependency pairs with mb(), right?

If not, then put_pid() needs atomic_read_acquire() as it was proposed in that
discussion.

Oleg.
Kees Cook March 28, 2019, 4:52 p.m. UTC | #10
On Thu, Mar 28, 2019 at 7:37 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> Kees can you describe more the race you had in mind?

I just didn't see any barriers, so it seemed racey to me.

> Also note to the onlooker, the original patch I sent is not wrong, that
> still applies and is correct. We are just discussing here any possible issues
> with the *existing* code.

Correct. And akpm has (correctly) taken this for -mm.
Paul E. McKenney March 28, 2019, 5:37 p.m. UTC | #11
On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> On 03/28, Jann Horn wrote:
> >
> > Since we're just talking about RCU stuff now, adding Paul McKenney to
> > the thread.
> 
> Since you added Paul let me add more confusion to this thread ;)

Woo-hoo!!!  More confusion!  Bring it on!!!  ;-)

> There were some concerns about the lack of barriers in put_pid(), but I can't
> find that old discussion and I forgot the result of that discussion...
> 
> Paul, could you confirm that this code
> 
> 	CPU_0		CPU_1
> 
> 	X = 1;		if (READ_ONCE(Y))
> 	mb();			X = 2;
> 	Y = 1;		BUG_ON(X != 2);
> 
> 
> is correct? I think it is, control dependency pairs with mb(), right?

The BUG_ON() is supposed to happen at the end of time, correct?
As written, there is (in the strict sense) a data race between the load
of X in the BUG_ON() and CPU_0's store to X.  In a less strict sense,
you could of course argue that this data race is harmless, especially
if X is a single byte.  But the more I talk to compiler writers, the
less comfortable I become with data races in general.  :-/

So I would also feel better if the "Y = 1" was WRITE_ONCE().

On the other hand, this is a great opportunity to try out Alan Stern's
prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!

https://lkml.kernel.org/r/Pine.LNX.4.44L0.1903191459270.1593-200000@iolanthe.rowland.org

Also adding Alan on CC.

Here is what I believe is the litmus test that your are interested in:

------------------------------------------------------------------------
C OlegNesterov-put_pid

{}

P0(int *x, int *y)
{
	*x = 1;
	smp_mb();
	*y = 1;
}

P1(int *x, int *y)
{
	int r1;

	r1 = READ_ONCE(*y);
	if (r1)
		*x = 2;
}

exists (1:r1=1 /\ ~x=2)
------------------------------------------------------------------------

Running this through herd with Alan's patch detects the data race
and says that the undesired outcome is allowed:

	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus 
	Test OlegNesterov-put_pid Allowed
	States 3
	1:r1=0; x=1;
	1:r1=1; x=1;
	1:r1=1; x=2;
	Ok
	Witnesses
	Positive: 1 Negative: 2
	Flag data-race
	Condition exists (1:r1=1 /\ not (x=2))
	Observation OlegNesterov-put_pid Sometimes 1 2
	Time OlegNesterov-put_pid 0.00
	Hash=a3e0043ad753effa860fea37eeba0a76

Using WRITE_ONCE() for P0()'s store to y still allows this outcome,
although it does remove the "Flag data-race".

Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x
gets rid of both the "Flag data-race" and the undesired outcome:

	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus 
	Test OlegNesterov-put_pid-WO-WO Allowed
	States 2
	1:r1=0; x=1;
	1:r1=1; x=2;
	No
	Witnesses
	Positive: 0 Negative: 2
	Condition exists (1:r1=1 /\ not (x=2))
	Observation OlegNesterov-put_pid-WO-WO Never 0 2
	Time OlegNesterov-put_pid-WO-WO 0.01
	Hash=6e1643e3c5e4739b590bde0a8e8a918e

Here is the corresponding litmus test, in case I messed something up:

------------------------------------------------------------------------
C OlegNesterov-put_pid-WO-WO

{}

P0(int *x, int *y)
{
	*x = 1;
	smp_mb();
	WRITE_ONCE(*y, 1);
}

P1(int *x, int *y)
{
	int r1;

	r1 = READ_ONCE(*y);
	if (r1)
		WRITE_ONCE(*x, 2);
}

exists (1:r1=1 /\ ~x=2)
------------------------------------------------------------------------

> If not, then put_pid() needs atomic_read_acquire() as it was proposed in that
> discussion.

Good point, let's try with smp_load_acquire() in P1():

	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-sla.litmus 
	Test OlegNesterov-put_pid-WO-sla Allowed
	States 2
	1:r1=0; x=1;
	1:r1=1; x=2;
	No
	Witnesses
	Positive: 0 Negative: 2
	Condition exists (1:r1=1 /\ not (x=2))
	Observation OlegNesterov-put_pid-WO-sla Never 0 2
	Time OlegNesterov-put_pid-WO-sla 0.01
	Hash=4fb0276eabf924793dec1970199db3a6

This also works.  Here is the litmus test:

------------------------------------------------------------------------
C OlegNesterov-put_pid-WO-sla

{}

P0(int *x, int *y)
{
	*x = 1;
	smp_mb();
	WRITE_ONCE(*y, 1);
}

P1(int *x, int *y)
{
	int r1;

	r1 = smp_load_acquire(y);
	if (r1)
		*x = 2;
}

exists (1:r1=1 /\ ~x=2)
------------------------------------------------------------------------

Demoting P0()'s WRITE_ONCE() to a plain write while leaving P1()'s
smp_load_acquire() gets us a data race and allows the undesired
outcome:

	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-sla.litmus 
	Test OlegNesterov-put_pid-sla Allowed
	States 3
	1:r1=0; x=1;
	1:r1=1; x=1;
	1:r1=1; x=2;
	Ok
	Witnesses
	Positive: 1 Negative: 2
	Flag data-race
	Condition exists (1:r1=1 /\ not (x=2))
	Observation OlegNesterov-put_pid-sla Sometimes 1 2
	Time OlegNesterov-put_pid-sla 0.01
	Hash=ec6f71f3d9f7cd6e45a874c872e3d946

But what if you are certain that the compiler cannot mess up your use
of plain C-language loads and stores?  Then simply tell LKMM that they
are READ_ONCE() and WRITE_ONCE(), respectively.  LKMM is admittedly
somewhat paranoid, but real C compilers really do tear stores of certain
constants on systems (like x86) that have store-immediate instructions,
so a bit of paranoia is not misplaced here.  ;-)

Plus please note that this patch to LKMM is prototype and thus subject
to change.

							Thanx, Paul
Joel Fernandes March 28, 2019, 8 p.m. UTC | #12
On Thu, Mar 28, 2019 at 04:17:50PM +0100, Jann Horn wrote:
> Since we're just talking about RCU stuff now, adding Paul McKenney to
> the thread.
> 
> On Thu, Mar 28, 2019 at 3:37 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > On Thu, Mar 28, 2019 at 03:57:44AM +0100, Jann Horn wrote:
> > > On Thu, Mar 28, 2019 at 3:34 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > On Thu, Mar 28, 2019 at 01:59:45AM +0100, Jann Horn wrote:
> > > > > On Thu, Mar 28, 2019 at 1:06 AM Kees Cook <keescook@chromium.org> wrote:
> > > > > > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google)
> > > > > > <joel@joelfernandes.org> wrote:
> > > > > > >
> > > > > > > struct pid's count is an atomic_t field used as a refcount. Use
> > > > > > > refcount_t for it which is basically atomic_t but does additional
> > > > > > > checking to prevent use-after-free bugs. No change in behavior if
> > > > > > > CONFIG_REFCOUNT_FULL=n.
> > > > > > >
> > > > > > > Cc: keescook@chromium.org
> > > > > > > Cc: kernel-team@android.com
> > > > > > > Cc: kernel-hardening@lists.openwall.com
> > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > > > [...]
> > > > > > > diff --git a/kernel/pid.c b/kernel/pid.c
> > > > > > > index 20881598bdfa..2095c7da644d 100644
> > > > > > > --- a/kernel/pid.c
> > > > > > > +++ b/kernel/pid.c
> > > > > > > @@ -37,7 +37,7 @@
> > > > > > >  #include <linux/init_task.h>
> > > > > > >  #include <linux/syscalls.h>
> > > > > > >  #include <linux/proc_ns.h>
> > > > > > > -#include <linux/proc_fs.h>
> > > > > > > +#include <linux/refcount.h>
> > > > > > >  #include <linux/sched/task.h>
> > > > > > >  #include <linux/idr.h>
> > > > > > >
> > > > > > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
> > > > > > >                 return;
> > > > > > >
> > > > > > >         ns = pid->numbers[pid->level].ns;
> > > > > > > -       if ((atomic_read(&pid->count) == 1) ||
> > > > > > > -            atomic_dec_and_test(&pid->count)) {
> > > > > > > +       if ((refcount_read(&pid->count) == 1) ||
> > > > > > > +            refcount_dec_and_test(&pid->count)) {
> > > > > >
> > > > > > Why is this (and the original code) safe in the face of a race against
> > > > > > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I
> > > > > > don't see this code pattern anywhere else in the kernel.
> > > > >
> > > > > Semantically, it doesn't make a difference whether you do this or
> > > > > leave out the "refcount_read(&pid->count) == 1". If you read a 1 from
> > > > > refcount_read(), then you have the only reference to "struct pid", and
> > > > > therefore you want to free it. If you don't get a 1, you have to
> > > > > atomically drop a reference, which, if someone else is concurrently
> > > > > also dropping a reference, may leave you with the last reference (in
> > > > > the case where refcount_dec_and_test() returns true), in which case
> > > > > you still have to take care of freeing it.
> > > >
> > > > Also, based on Kees comment, I think it appears to me that get_pid and
> > > > put_pid can race in this way in the original code right?
> > > >
> > > > get_pid                 put_pid
> > > >
> > > >                         atomic_dec_and_test returns 1
> > >
> > > This can't happen. get_pid() can only be called on an existing
> > > reference. If you are calling get_pid() on an existing reference, and
> > > someone else is dropping another reference with put_pid(), then when
> > > both functions start running, the refcount must be at least 2.
> >
> > Sigh, you are right. Ok. I was quite tired last night when I wrote this.
> > Obviously, I should have waited a bit and thought it through.
> >
> > Kees can you describe more the race you had in mind?
> >
> > > > atomic_inc
> > > >                         kfree
> > > >
> > > > deref pid /* boom */
> > > > -------------------------------------------------
> > > >
> > > > I think get_pid needs to call atomic_inc_not_zero() and put_pid should
> > > > not test for pid->count == 1 as condition for freeing, but rather just do
> > > > atomic_dec_and_test. So something like the following diff. (And I see a
> > > > similar pattern used in drivers/net/mac.c)
> > >
> > > get_pid() can only be called when you already have a refcounted
> > > reference; in other words, when the reference count is at least one.
> > > The lifetime management of struct pid differs from the lifetime
> > > management of most other objects in the kernel; the usual patterns
> > > don't quite apply here.
> > >
> > > Look at put_pid(): When the refcount has reached zero, there is no RCU
> > > grace period (unlike most other objects with RCU-managed lifetimes).
> > > Instead, free_pid() has an RCU grace period *before* it invokes
> > > delayed_put_pid() to drop a reference; and free_pid() is also the
> > > function that removes a PID from the namespace's IDR, and it is used
> > > by __change_pid() when a task loses its reference on a PID.
> > >
> > > In other words: Most refcounted objects with RCU guarantee that the
> > > object waits for a grace period after its refcount has reached zero;
> > > and during the grace period, the refcount is zero and you're not
> > > allowed to increment it again.
> >
> > Can you give an example of this "most refcounted objects with RCU" usecase?
> > I could not find any good examples of such. I want to document this pattern
> > and possibly submit to Documentation/RCU.
> 
> E.g. struct posix_acl is a relatively straightforward example:
> posix_acl_release() is a wrapper around refcount_dec_and_test(); if
> the refcount has dropped to zero, the object is released after an RCU
> grace period using kfree_rcu().
> get_cached_acl() takes an RCU read lock, does rcu_dereference() [with
> a missing __rcu annotation, grmbl], and attempts to take a reference
> with refcount_inc_not_zero().

Ok I get it now. It is quite a subtle difference in usage, I have noted both
these usecases in my private notes for my own sanity ;-). I wonder if Paul
thinks this is too silly to document into Documentation/RCU/, or if I should
write-up something.

One thing I wonder is if one usage pattern is faster than the other.
Certainly in the {get,put}_pid case, it seems nice to be able to do a
> get_pid even though free_pid's grace period has still not completed. Whereas
> in the posix_acl case, once the grace period starts it is no longer
> possible to get a reference as you pointed out, and it's basically game over for
that object.

thank you!

 - Joel
Joel Fernandes March 29, 2019, 2:24 a.m. UTC | #13
On Thu, Mar 28, 2019 at 04:00:52PM -0400, Joel Fernandes wrote:
> On Thu, Mar 28, 2019 at 04:17:50PM +0100, Jann Horn wrote:
> > Since we're just talking about RCU stuff now, adding Paul McKenney to
> > the thread.
> > 
> > On Thu, Mar 28, 2019 at 3:37 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > On Thu, Mar 28, 2019 at 03:57:44AM +0100, Jann Horn wrote:
> > > > On Thu, Mar 28, 2019 at 3:34 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > > On Thu, Mar 28, 2019 at 01:59:45AM +0100, Jann Horn wrote:
> > > > > > On Thu, Mar 28, 2019 at 1:06 AM Kees Cook <keescook@chromium.org> wrote:
> > > > > > > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google)
> > > > > > > <joel@joelfernandes.org> wrote:
> > > > > > > >
> > > > > > > > struct pid's count is an atomic_t field used as a refcount. Use
> > > > > > > > refcount_t for it which is basically atomic_t but does additional
> > > > > > > > checking to prevent use-after-free bugs. No change in behavior if
> > > > > > > > CONFIG_REFCOUNT_FULL=n.
> > > > > > > >
> > > > > > > > Cc: keescook@chromium.org
> > > > > > > > Cc: kernel-team@android.com
> > > > > > > > Cc: kernel-hardening@lists.openwall.com
> > > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > > > > [...]
> > > > > > > > diff --git a/kernel/pid.c b/kernel/pid.c
> > > > > > > > index 20881598bdfa..2095c7da644d 100644
> > > > > > > > --- a/kernel/pid.c
> > > > > > > > +++ b/kernel/pid.c
> > > > > > > > @@ -37,7 +37,7 @@
> > > > > > > >  #include <linux/init_task.h>
> > > > > > > >  #include <linux/syscalls.h>
> > > > > > > >  #include <linux/proc_ns.h>
> > > > > > > > -#include <linux/proc_fs.h>
> > > > > > > > +#include <linux/refcount.h>
> > > > > > > >  #include <linux/sched/task.h>
> > > > > > > >  #include <linux/idr.h>
> > > > > > > >
> > > > > > > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
> > > > > > > >                 return;
> > > > > > > >
> > > > > > > >         ns = pid->numbers[pid->level].ns;
> > > > > > > > -       if ((atomic_read(&pid->count) == 1) ||
> > > > > > > > -            atomic_dec_and_test(&pid->count)) {
> > > > > > > > +       if ((refcount_read(&pid->count) == 1) ||
> > > > > > > > +            refcount_dec_and_test(&pid->count)) {
> > > > > > >
> > > > > > > Why is this (and the original code) safe in the face of a race against
> > > > > > > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I
> > > > > > > don't see this code pattern anywhere else in the kernel.
> > > > > >
> > > > > > Semantically, it doesn't make a difference whether you do this or
> > > > > > leave out the "refcount_read(&pid->count) == 1". If you read a 1 from
> > > > > > refcount_read(), then you have the only reference to "struct pid", and
> > > > > > therefore you want to free it. If you don't get a 1, you have to
> > > > > > atomically drop a reference, which, if someone else is concurrently
> > > > > > also dropping a reference, may leave you with the last reference (in
> > > > > > the case where refcount_dec_and_test() returns true), in which case
> > > > > > you still have to take care of freeing it.
> > > > >
> > > > > Also, based on Kees comment, I think it appears to me that get_pid and
> > > > > put_pid can race in this way in the original code right?
> > > > >
> > > > > get_pid                 put_pid
> > > > >
> > > > >                         atomic_dec_and_test returns 1
> > > >
> > > > This can't happen. get_pid() can only be called on an existing
> > > > reference. If you are calling get_pid() on an existing reference, and
> > > > someone else is dropping another reference with put_pid(), then when
> > > > both functions start running, the refcount must be at least 2.
> > >
> > > Sigh, you are right. Ok. I was quite tired last night when I wrote this.
> > > Obviously, I should have waited a bit and thought it through.
> > >
> > > Kees can you describe more the race you had in mind?
> > >
> > > > > atomic_inc
> > > > >                         kfree
> > > > >
> > > > > deref pid /* boom */
> > > > > -------------------------------------------------
> > > > >
> > > > > I think get_pid needs to call atomic_inc_not_zero() and put_pid should
> > > > > not test for pid->count == 1 as condition for freeing, but rather just do
> > > > > atomic_dec_and_test. So something like the following diff. (And I see a
> > > > > similar pattern used in drivers/net/mac.c)
> > > >
> > > > get_pid() can only be called when you already have a refcounted
> > > > reference; in other words, when the reference count is at least one.
> > > > The lifetime management of struct pid differs from the lifetime
> > > > management of most other objects in the kernel; the usual patterns
> > > > don't quite apply here.
> > > >
> > > > Look at put_pid(): When the refcount has reached zero, there is no RCU
> > > > grace period (unlike most other objects with RCU-managed lifetimes).
> > > > Instead, free_pid() has an RCU grace period *before* it invokes
> > > > delayed_put_pid() to drop a reference; and free_pid() is also the
> > > > function that removes a PID from the namespace's IDR, and it is used
> > > > by __change_pid() when a task loses its reference on a PID.
> > > >
> > > > In other words: Most refcounted objects with RCU guarantee that the
> > > > object waits for a grace period after its refcount has reached zero;
> > > > and during the grace period, the refcount is zero and you're not
> > > > allowed to increment it again.
> > >
> > > Can you give an example of this "most refcounted objects with RCU" usecase?
> > > I could not find any good examples of such. I want to document this pattern
> > > and possibly submit to Documentation/RCU.
> > 
> > E.g. struct posix_acl is a relatively straightforward example:
> > posix_acl_release() is a wrapper around refcount_dec_and_test(); if
> > the refcount has dropped to zero, the object is released after an RCU
> > grace period using kfree_rcu().
> > get_cached_acl() takes an RCU read lock, does rcu_dereference() [with
> > a missing __rcu annotation, grmbl], and attempts to take a reference
> > with refcount_inc_not_zero().
> 
> Ok I get it now. It is quite a subtle difference in usage, I have noted both
> these usecases in my private notes for my own sanity ;-). I wonder if Paul
> thinks this is too silly to document into Documentation/RCU/, or if I should
> write-up something.
> 
> One thing I wonder is if one usage pattern is faster than the other.
> Certainly in the {get,put}_pid case, it seems nice to be able to do a
> > get_pid even though free_pid's grace period has still not completed. Whereas
> > in the posix_acl case, once the grace period starts it is no longer
> > possible to get a reference as you pointed out, and it's basically game over for
> that object.

Wow, so I just found that Documentation/RCU/rcuref.txt already beautifully
talks about specifically both these RCU refcounting patterns - in the second
and third example.

I also found some issues with that document, so I will be submitting those
and CC you guys.

thanks!

 - Joel
Joel Fernandes March 29, 2019, 2:34 a.m. UTC | #14
On Thu, Mar 28, 2019 at 10:39:58AM -0400, Joel Fernandes wrote:
> On Thu, Mar 28, 2019 at 03:26:19PM +0100, Oleg Nesterov wrote:
> > On 03/27, Joel Fernandes wrote:
> > >
> > > Also, based on Kees comment, I think it appears to me that get_pid and
> > > put_pid can race in this way in the original code right?
> > >
> > > get_pid			put_pid
> > >
> > > 			atomic_dec_and_test returns 1
> > > atomic_inc
> > > 			kfree
> > >
> > > deref pid /* boom */
> > > -------------------------------------------------
> > >
> > > I think get_pid needs to call atomic_inc_not_zero()
> > 
> > No.
> > 
> > get_pid() should only be used if you already have a reference or you do
> > something like
> > 
> > 	rcu_read_lock();
> > 	pid = find_vpid();
> > 	get_pid();
> > 	rcu_read_unlock();
> > 
> > in this case we rely on call_rcu(delayed_put_pid) which drops the initial
> > reference.
> > 
> > If put_pid() sees pid->count == 1, then a) nobody else has a reference and
> > b) nobody else can find this pid on rcu-protected lists, so it is safe to
> > free it.
> 
> I agree. Check my reply to Jann, I already replied to him about this. thanks!
> 

Also Oleg, why not just call refcount_dec_and_test like below? If count is 1,
then it will decrement to 0 and return true anyway. Is this because we want
to avoid writes at the cost of more reads? Did I miss something? Thank you.

I don't remember very clearly, but I think Kees also asked about the same thing.

diff --git a/kernel/pid.c b/kernel/pid.c
index 2095c7da644d..89c4849fab5d 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -106,8 +106,7 @@ void put_pid(struct pid *pid)
 		return;
 
 	ns = pid->numbers[pid->level].ns;
-	if ((refcount_read(&pid->count) == 1) ||
-	     refcount_dec_and_test(&pid->count)) {
+	if (refcount_dec_and_test(&pid->count)) {
 		kmem_cache_free(ns->pid_cachep, pid);
 		put_pid_ns(ns);
 	}
Oleg Nesterov March 29, 2019, 5:32 p.m. UTC | #15
On 03/28, Paul E. McKenney wrote:
>
> On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> >
> > Since you added Paul let me add more confusion to this thread ;)
>
> Woo-hoo!!!  More confusion!  Bring it on!!!  ;-)

OK, thanks, you certainly managed to confuse me much more than I expected!

> > There were some concerns about the lack of barriers in put_pid(), but I can't
> > find that old discussion and I forgot the result of that discussion...
> >
> > Paul, could you confirm that this code
> >
> > 	CPU_0		CPU_1
> >
> > 	X = 1;		if (READ_ONCE(Y))
> > 	mb();			X = 2;
> > 	Y = 1;		BUG_ON(X != 2);
> >
> >
> > is correct? I think it is, control dependency pairs with mb(), right?
>
> The BUG_ON() is supposed to happen at the end of time, correct?

Yes,

> As written, there is (in the strict sense) a data race between the load
> of X in the BUG_ON() and CPU_0's store to X.

Well, this pseudo code is simply wrong, I meant that "X = 1" on CPU 0
must not happen after "X = 2" on CPU 1 or we have a problem.


> But the more I talk to compiler writers, the
> less comfortable I become with data races in general.  :-/
>
> So I would also feel better if the "Y = 1" was WRITE_ONCE().

If we forget about potential compiler bugs, then it is not clear to me how
WRITE_ONCE() can help in this case. mb() implies the compiler barrier, and
we do not really need the "once" semantics?

But as for put_pid(), it actually does atomic_dec_and_test(&Y), so this is
probably not relevant.

> On the other hand, this is a great opportunity to try out Alan Stern's
> prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!
>
> https://lkml.kernel.org/r/Pine.LNX.4.44L0.1903191459270.1593-200000@iolanthe.rowland.org

Heh. Will do, but only after I buy more brains.

> Here is what I believe is the litmus test that your are interested in:
>
> ------------------------------------------------------------------------
> C OlegNesterov-put_pid
>
> {}
>
> P0(int *x, int *y)
> {
> 	*x = 1;
> 	smp_mb();
> 	*y = 1;
> }
>
> P1(int *x, int *y)
> {
> 	int r1;
>
> 	r1 = READ_ONCE(*y);
> 	if (r1)
> 		*x = 2;
> }
>
> exists (1:r1=1 /\ ~x=2)

I am not familiar with litmus, and I do not really understand what (and why)
it reports.

> Running this through herd with Alan's patch detects the data race
> and says that the undesired outcome is allowed:

OK, so it says that "*x = 2" can happen before "*x = 1" even if P1() observes
*y == 1.

Still can't understand how this can happen... Nevermind ;)

> Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x
> gets rid of both the "Flag data-race" and the undesired outcome:

...

> P1(int *x, int *y)
> {
> 	int r1;
>
> 	r1 = READ_ONCE(*y);
> 	if (r1)
> 		WRITE_ONCE(*x, 2);
> }

And this is what Documentation/memory-barriers.txt says in the "CONTROL
DEPENDENCIES" section:

		q = READ_ONCE(a);
		if (q) {
			WRITE_ONCE(b, 1);
		}

	Control dependencies pair normally with other types of barriers.
	That said, please note that neither READ_ONCE() nor WRITE_ONCE()
	are optional!

but again, I fail to really understand why WRITE_ONCE() is not optional
in this particular case.

Thanks!

Oleg.
Oleg Nesterov March 29, 2019, 5:37 p.m. UTC | #16
On 03/28, Joel Fernandes wrote:
>
> Also Oleg, why not just call refcount_dec_and_test like below?

Of course we can, this is just optimization to avoid the atomic op if possible.

Does this optimization really help? I have no idea ;)

But as you can see, it surely helps to provoke the interesting discussions!

Oleg.
Alan Stern March 29, 2019, 7:45 p.m. UTC | #17
On Fri, 29 Mar 2019, Oleg Nesterov wrote:

> On 03/28, Paul E. McKenney wrote:
> >
> > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> > >
> > > Since you added Paul let me add more confusion to this thread ;)
> >
> > Woo-hoo!!!  More confusion!  Bring it on!!!  ;-)
> 
> OK, thanks, you certainly managed to confuse me much more than I expected!
> 
> > > There were some concerns about the lack of barriers in put_pid(), but I can't
> > > find that old discussion and I forgot the result of that discussion...
> > >
> > > Paul, could you confirm that this code
> > >
> > > 	CPU_0		CPU_1
> > >
> > > 	X = 1;		if (READ_ONCE(Y))
> > > 	mb();			X = 2;
> > > 	Y = 1;		BUG_ON(X != 2);
> > >
> > >
> > > is correct? I think it is, control dependency pairs with mb(), right?
> >
> > The BUG_ON() is supposed to happen at the end of time, correct?
> 
> Yes,
> 
> > As written, there is (in the strict sense) a data race between the load
> > of X in the BUG_ON() and CPU_0's store to X.
> 
> Well, this pseudo code is simply wrong, I meant that "X = 1" on CPU 0
> must not happen after "X = 2" on CPU 1 or we have a problem.

The situation is worse than that.  Strictly speaking, any time there is
a data race you have a problem.  In particular, the C11 Standard
permits a data race to cause undefined behavior.  For example, the
Standard wouldn't be violated if a data race caused your computer to
catch on fire.

In real life the situation isn't that bad, and you can do things the
Standard doesn't cover if you are very familiar with the properties of
the compiler.  But compilers change over time, so relying on special 
knowledge about it might not be a good idea in the long run.

> > But the more I talk to compiler writers, the
> > less comfortable I become with data races in general.  :-/
> >
> > So I would also feel better if the "Y = 1" was WRITE_ONCE().
> 
> If we forget about potential compiler bugs, then it is not clear to me how
> WRITE_ONCE() can help in this case. mb() implies the compiler barrier, and
> we do not really need the "once" semantics?

There is a big difference between WRITE_ONCE() and plain assignment.  
Given "WRITE_ONCE(X, 2)", the compiler will emit a simple store
instruction.  But given "X = 2", the compiler is allowed to emit
instructions equivalent to:

	if (X != 2)
		X = 2;

This is not a simple store; it also contains a load.  (And in some 
situations, a transformation like this might even be beneficial, 
because it can avoid a cache line transfer.)

CPUs are allowed to execute loads out of order with respect to control
dependencies.  So for example, if your pseudo-code for CPU_1 got
compiled to object code equivalent to:

	if (READ_ONCE(Y)) {
		if (X != 2)
			X = 2;
	}

then the CPU might read the value of X before reading the value of Y.  
This might not look like a problem, but it would be an example of a
race.

> But as for put_pid(), it actually does atomic_dec_and_test(&Y), so this is
> probably not relevant.
> 
> > On the other hand, this is a great opportunity to try out Alan Stern's
> > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!
> >
> > https://lkml.kernel.org/r/Pine.LNX.4.44L0.1903191459270.1593-200000@iolanthe.rowland.org
> 
> Heh. Will do, but only after I buy more brains.
> 
> > Here is what I believe is the litmus test that your are interested in:
> >
> > ------------------------------------------------------------------------
> > C OlegNesterov-put_pid
> >
> > {}
> >
> > P0(int *x, int *y)
> > {
> > 	*x = 1;
> > 	smp_mb();
> > 	*y = 1;
> > }
> >
> > P1(int *x, int *y)
> > {
> > 	int r1;
> >
> > 	r1 = READ_ONCE(*y);
> > 	if (r1)
> > 		*x = 2;
> > }
> >
> > exists (1:r1=1 /\ ~x=2)
> 
> I am not familiar with litmus, and I do not really understand what (and why)
> it reports.
> 
> > Running this through herd with Alan's patch detects the data race
> > and says that the undesired outcome is allowed:
> 
> OK, so it says that "*x = 2" can happen before "*x = 1" even if P1() observes
> *y == 1.
> 
> Still can't understand how this can happen... Nevermind ;)

You might imagine that the compiler could generate object code for P1() 
equivalent to:

	tmp = *x;
	*x = 2;
	if (!READ_ONCE(*y))
		*x = tmp;

Given this object code, P1 might indeed set *x to 2 before P0 sets *x
to 1.  I believe the Standard allows the compiler to do this sort of 
thing, although of course it would be a stupid thing to do.

Using plain accesses gives the compiler a tremendous amount of freedom
-- a lot more than many people think.

Alan

> > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x
> > gets rid of both the "Flag data-race" and the undesired outcome:
> 
> ...
> 
> > P1(int *x, int *y)
> > {
> > 	int r1;
> >
> > 	r1 = READ_ONCE(*y);
> > 	if (r1)
> > 		WRITE_ONCE(*x, 2);
> > }
> 
> And this is what Documentation/memory-barriers.txt says in the "CONTROL
> DEPENDENCIES" section:
> 
> 		q = READ_ONCE(a);
> 		if (q) {
> 			WRITE_ONCE(b, 1);
> 		}
> 
> 	Control dependencies pair normally with other types of barriers.
> 	That said, please note that neither READ_ONCE() nor WRITE_ONCE()
> 	are optional!
> 
> but again, I fail to really understand why WRITE_ONCE() is not optional
> in this particular case.
> 
> Thanks!
> 
> Oleg.
Joel Fernandes March 30, 2019, 2:36 a.m. UTC | #18
On Thu, Mar 28, 2019 at 10:37:07AM -0700, Paul E. McKenney wrote:
> On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> > On 03/28, Jann Horn wrote:
> > >
> > > Since we're just talking about RCU stuff now, adding Paul McKenney to
> > > the thread.
> > 
> > Since you added Paul let me add more confusion to this thread ;)
> 
> Woo-hoo!!!  More confusion!  Bring it on!!!  ;-)

Nice to take part in the confusion fun too!!! ;-)

> > There were some concerns about the lack of barriers in put_pid(), but I can't
> > find that old discussion and I forgot the result of that discussion...
> > 
> > Paul, could you confirm that this code
> > 
> > 	CPU_0		CPU_1
> > 
> > 	X = 1;		if (READ_ONCE(Y))
> > 	mb();			X = 2;
> > 	Y = 1;		BUG_ON(X != 2);
> > 
> > 
> > is correct? I think it is, control dependency pairs with mb(), right?
> 
> The BUG_ON() is supposed to happen at the end of time, correct?
> As written, there is (in the strict sense) a data race between the load
> of X in the BUG_ON() and CPU_0's store to X.  In a less strict sense,
> you could of course argue that this data race is harmless, especially
> if X is a single byte.  But the more I talk to compiler writers, the
> less comfortable I become with data races in general.  :-/
> 
> So I would also feel better if the "Y = 1" was WRITE_ONCE().
> 
> On the other hand, this is a great opportunity to try out Alan Stern's
> prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!
> 
> https://lkml.kernel.org/r/Pine.LNX.4.44L0.1903191459270.1593-200000@iolanthe.rowland.org
> 
> Also adding Alan on CC.
> 
> Here is what I believe is the litmus test that your are interested in:
> 
> ------------------------------------------------------------------------
> C OlegNesterov-put_pid
> 
> {}
> 
> P0(int *x, int *y)
> {
> 	*x = 1;
> 	smp_mb();
> 	*y = 1;
> }
> 
> P1(int *x, int *y)
> {
> 	int r1;
> 
> 	r1 = READ_ONCE(*y);
> 	if (r1)
> 		*x = 2;
> }
> 
> exists (1:r1=1 /\ ~x=2)
> ------------------------------------------------------------------------
> 
> Running this through herd with Alan's patch detects the data race
> and says that the undesired outcome is allowed:
> 
> 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus 
> 	Test OlegNesterov-put_pid Allowed
> 	States 3
> 	1:r1=0; x=1;
> 	1:r1=1; x=1;
> 	1:r1=1; x=2;
> 	Ok
> 	Witnesses
> 	Positive: 1 Negative: 2
> 	Flag data-race
> 	Condition exists (1:r1=1 /\ not (x=2))
> 	Observation OlegNesterov-put_pid Sometimes 1 2
> 	Time OlegNesterov-put_pid 0.00
> 	Hash=a3e0043ad753effa860fea37eeba0a76
> 
> Using WRITE_ONCE() for P0()'s store to y still allows this outcome,
> although it does remove the "Flag data-race".
> 
> Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x
> gets rid of both the "Flag data-race" and the undesired outcome:
> 
> 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus 
> 	Test OlegNesterov-put_pid-WO-WO Allowed
> 	States 2
> 	1:r1=0; x=1;
> 	1:r1=1; x=2;
> 	No
> 	Witnesses
> 	Positive: 0 Negative: 2
> 	Condition exists (1:r1=1 /\ not (x=2))
> 	Observation OlegNesterov-put_pid-WO-WO Never 0 2
> 	Time OlegNesterov-put_pid-WO-WO 0.01
> 	Hash=6e1643e3c5e4739b590bde0a8e8a918e
> 
> Here is the corresponding litmus test, in case I messed something up:
> 
> ------------------------------------------------------------------------
> C OlegNesterov-put_pid-WO-WO
> 
> {}
> 
> P0(int *x, int *y)
> {
> 	*x = 1;
> 	smp_mb();
> 	WRITE_ONCE(*y, 1);
> }
> 
> P1(int *x, int *y)
> {
> 	int r1;
> 
> 	r1 = READ_ONCE(*y);
> 	if (r1)
> 		WRITE_ONCE(*x, 2);
> }
> 
> exists (1:r1=1 /\ ~x=2)

I ran the above examples too. It's a bit confusing to me why the WRITE_ONCE in
P0() is required, and why the READ_ONCE / WRITE_ONCE in P1() would not be
sufficient to prevent the exists condition. Shouldn't the compiler know that,
in P0(), it should not reorder the store to y=1 before the x=1 because there
is an explicit barrier between the 2 stores? Looks to me like a broken
compiler :-|.

So I would have expected the following litmus to result in Never, but it
doesn't with Alan's patch:

P0(int *x, int *y)
{
	*x = 1;
	smp_mb();
	*y = 1;
}

P1(int *x, int *y)
{
	int r1;

	r1 = READ_ONCE(*y);
	if (r1)
		WRITE_ONCE(*x, 2);
}

exists (1:r1=1 /\ ~x=2)

> ------------------------------------------------------------------------
> 
> > If not, then put_pid() needs atomic_read_acquire() as it was proposed in that
> > discussion.
> 
> Good point, let's try with smp_load_acquire() in P1():
> 
> 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-sla.litmus 
> 	Test OlegNesterov-put_pid-WO-sla Allowed
> 	States 2
> 	1:r1=0; x=1;
> 	1:r1=1; x=2;
> 	No
> 	Witnesses
> 	Positive: 0 Negative: 2
> 	Condition exists (1:r1=1 /\ not (x=2))
> 	Observation OlegNesterov-put_pid-WO-sla Never 0 2
> 	Time OlegNesterov-put_pid-WO-sla 0.01
> 	Hash=4fb0276eabf924793dec1970199db3a6
> 
> This also works.  Here is the litmus test:
> 
> ------------------------------------------------------------------------
> C OlegNesterov-put_pid-WO-sla
> 
> {}
> 
> P0(int *x, int *y)
> {
> 	*x = 1;
> 	smp_mb();
> 	WRITE_ONCE(*y, 1);
> }
> 
> P1(int *x, int *y)
> {
> 	int r1;
> 
> 	r1 = smp_load_acquire(y);
> 	if (r1)
> 		*x = 2;
> }
> 
> exists (1:r1=1 /\ ~x=2)
> ------------------------------------------------------------------------
> 
> Demoting P0()'s WRITE_ONCE() to a plain write while leaving P1()'s
> smp_load_acquire() gets us a data race and allows the undesired
> outcome:

Yeah, I think this is also what I was confused about above: why is that
WRITE_ONCE required in P0() when there's already an smp_mb there? Surely
I'm missing something. ;-)

> 
> 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-sla.litmus 
> 	Test OlegNesterov-put_pid-sla Allowed
> 	States 3
> 	1:r1=0; x=1;
> 	1:r1=1; x=1;
> 	1:r1=1; x=2;
> 	Ok
> 	Witnesses
> 	Positive: 1 Negative: 2
> 	Flag data-race
> 	Condition exists (1:r1=1 /\ not (x=2))
> 	Observation OlegNesterov-put_pid-sla Sometimes 1 2
> 	Time OlegNesterov-put_pid-sla 0.01
> 	Hash=ec6f71f3d9f7cd6e45a874c872e3d946
> 
> But what if you are certain that the compiler cannot mess up your use
> of plain C-language loads and stores?  Then simply tell LKMM that they
> are READ_ONCE() and WRITE_ONCE(), respectively.  LKMM is admittedly
> somewhat paranoid, but real C compilers really do tear stores of certain
> constants on systems (like x86) that have store-immediate instructions,
> so a bit of paranoia is not misplaced here.  ;-)
> 
> Plus please note that this patch to LKMM is prototype and thus subject
> to change.

Ah I see. I'd appreciate it if Alan could also CC me on future postings of
this since I'm quite interested. ;-)

thanks,

 - Joel

> 							Thanx, Paul
>
Alan Stern March 30, 2019, 3:16 p.m. UTC | #19
On Fri, 29 Mar 2019, Joel Fernandes wrote:

> On Thu, Mar 28, 2019 at 10:37:07AM -0700, Paul E. McKenney wrote:
> > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> > > On 03/28, Jann Horn wrote:
> > > >
> > > > Since we're just talking about RCU stuff now, adding Paul McKenney to
> > > > the thread.
> > > 
> > > Since you added Paul let me add more confusion to this thread ;)
> > 
> > Woo-hoo!!!  More confusion!  Bring it on!!!  ;-)
> 
> Nice to take part in the confusion fun too!!! ;-)
> 
> > > There were some concerns about the lack of barriers in put_pid(), but I can't
> > > find that old discussion and I forgot the result of that discussion...
> > > 
> > > Paul, could you confirm that this code
> > > 
> > > 	CPU_0		CPU_1
> > > 
> > > 	X = 1;		if (READ_ONCE(Y))
> > > 	mb();			X = 2;
> > > 	Y = 1;		BUG_ON(X != 2);
> > > 
> > > 
> > > is correct? I think it is, control dependency pairs with mb(), right?
> > 
> > The BUG_ON() is supposed to happen at the end of time, correct?
> > As written, there is (in the strict sense) a data race between the load
> > of X in the BUG_ON() and CPU_0's store to X.  In a less strict sense,
> > you could of course argue that this data race is harmless, especially
> > if X is a single byte.  But the more I talk to compiler writers, the
> > less comfortable I become with data races in general.  :-/
> > 
> > So I would also feel better if the "Y = 1" was WRITE_ONCE().
> > 
> > On the other hand, this is a great opportunity to try out Alan Stern's
> > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!
> > 
> > https://lkml.kernel.org/r/Pine.LNX.4.44L0.1903191459270.1593-200000@iolanthe.rowland.org
> > 
> > Also adding Alan on CC.
> > 
> > Here is what I believe is the litmus test that your are interested in:
> > 
> > ------------------------------------------------------------------------
> > C OlegNesterov-put_pid
> > 
> > {}
> > 
> > P0(int *x, int *y)
> > {
> > 	*x = 1;
> > 	smp_mb();
> > 	*y = 1;
> > }
> > 
> > P1(int *x, int *y)
> > {
> > 	int r1;
> > 
> > 	r1 = READ_ONCE(*y);
> > 	if (r1)
> > 		*x = 2;
> > }
> > 
> > exists (1:r1=1 /\ ~x=2)
> > ------------------------------------------------------------------------
> > 
> > Running this through herd with Alan's patch detects the data race
> > and says that the undesired outcome is allowed:
> > 
> > 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus 
> > 	Test OlegNesterov-put_pid Allowed
> > 	States 3
> > 	1:r1=0; x=1;
> > 	1:r1=1; x=1;
> > 	1:r1=1; x=2;
> > 	Ok
> > 	Witnesses
> > 	Positive: 1 Negative: 2
> > 	Flag data-race
> > 	Condition exists (1:r1=1 /\ not (x=2))
> > 	Observation OlegNesterov-put_pid Sometimes 1 2
> > 	Time OlegNesterov-put_pid 0.00
> > 	Hash=a3e0043ad753effa860fea37eeba0a76
> > 
> > Using WRITE_ONCE() for P0()'s store to y still allows this outcome,
> > although it does remove the "Flag data-race".
> > 
> > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x
> > gets rid of both the "Flag data-race" and the undesired outcome:
> > 
> > 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus 
> > 	Test OlegNesterov-put_pid-WO-WO Allowed
> > 	States 2
> > 	1:r1=0; x=1;
> > 	1:r1=1; x=2;
> > 	No
> > 	Witnesses
> > 	Positive: 0 Negative: 2
> > 	Condition exists (1:r1=1 /\ not (x=2))
> > 	Observation OlegNesterov-put_pid-WO-WO Never 0 2
> > 	Time OlegNesterov-put_pid-WO-WO 0.01
> > 	Hash=6e1643e3c5e4739b590bde0a8e8a918e
> > 
> > Here is the corresponding litmus test, in case I messed something up:
> > 
> > ------------------------------------------------------------------------
> > C OlegNesterov-put_pid-WO-WO
> > 
> > {}
> > 
> > P0(int *x, int *y)
> > {
> > 	*x = 1;
> > 	smp_mb();
> > 	WRITE_ONCE(*y, 1);
> > }
> > 
> > P1(int *x, int *y)
> > {
> > 	int r1;
> > 
> > 	r1 = READ_ONCE(*y);
> > 	if (r1)
> > 		WRITE_ONCE(*x, 2);
> > }
> > 
> > exists (1:r1=1 /\ ~x=2)
> 
> I ran the above examples too. Its a bit confusing to me why the WRITE_ONCE in
> P0() is required,

If the "WRITE_ONCE(*y, 1)" in P0 were written instead as "*y = 1", it
would race with P1's "READ_ONCE(*y)".

>  and why would the READ_ONCE / WRITE_ONCE in P1() not be
> sufficient to prevent the exists condition. Shouldn't the compiler know that,
> in P0(), it should not reorder the store to y=1 before the x=1 because there
> is an explicit barrier between the 2 stores? Looks me to me like a broken
> compiler :-|. 
> 
> So I would have expected the following litmus to result in Never, but it
> doesn't with Alan's patch:
> 
> P0(int *x, int *y)
> {
> 	*x = 1;
> 	smp_mb();
> 	*y = 1;
> }
> 
> P1(int *x, int *y)
> {
> 	int r1;
> 
> 	r1 = READ_ONCE(*y);
> 	if (r1)
> 		WRITE_ONCE(*x, 2);
> }
> 
> exists (1:r1=1 /\ ~x=2)

You have to realize that in the presence of a data race, all bets are
off.  The memory model will still output a prediction, but there is no
guarantee that the prediction will be correct.

In this case P0's write to y races with P1's READ_ONCE.  Therefore the 
memory model may very well give an incorrect result.

> > ------------------------------------------------------------------------
> > 
> > > If not, then put_pid() needs atomic_read_acquire() as it was proposed in that
> > > discussion.
> > 
> > Good point, let's try with smp_load_acquire() in P1():
> > 
> > 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-sla.litmus 
> > 	Test OlegNesterov-put_pid-WO-sla Allowed
> > 	States 2
> > 	1:r1=0; x=1;
> > 	1:r1=1; x=2;
> > 	No
> > 	Witnesses
> > 	Positive: 0 Negative: 2
> > 	Condition exists (1:r1=1 /\ not (x=2))
> > 	Observation OlegNesterov-put_pid-WO-sla Never 0 2
> > 	Time OlegNesterov-put_pid-WO-sla 0.01
> > 	Hash=4fb0276eabf924793dec1970199db3a6
> > 
> > This also works.  Here is the litmus test:
> > 
> > ------------------------------------------------------------------------
> > C OlegNesterov-put_pid-WO-sla
> > 
> > {}
> > 
> > P0(int *x, int *y)
> > {
> > 	*x = 1;
> > 	smp_mb();
> > 	WRITE_ONCE(*y, 1);
> > }
> > 
> > P1(int *x, int *y)
> > {
> > 	int r1;
> > 
> > 	r1 = smp_load_acquire(y);
> > 	if (r1)
> > 		*x = 2;
> > }
> > 
> > exists (1:r1=1 /\ ~x=2)
> > ------------------------------------------------------------------------
> > 
> > Demoting P0()'s WRITE_ONCE() to a plain write while leaving P1()'s
> > smp_load_acquire() gets us a data race and allows the undesired
> > outcome:
> 
> Yeah, I think this is also what I was confused about above, is why is that
> WRITE_ONCE required in P0() because there's already an smp_mb there. Surely
> I'm missing something. ;-)

A plain write to *y in P0 races with the smp_load_acquire in P1.  
That's all -- it's not very deep or subtle.  Remember, the definition
of a race is two concurrent accesses to the same variable from
different CPUs, where at least one of the accesses is plain and at
least one of them is a write.

I've heard that people on the C++ Standards committee have proposed
that plain writes should not race with marked reads.  That is, when
such concurrent accesses occur the outcome should be an undefined
result for the marked read rather than undefined behavior.  If this
change gets adopted and we put it into the memory model, then your
expectation would be correct.  But as things stand, it isn't.

Alan

> > 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-sla.litmus 
> > 	Test OlegNesterov-put_pid-sla Allowed
> > 	States 3
> > 	1:r1=0; x=1;
> > 	1:r1=1; x=1;
> > 	1:r1=1; x=2;
> > 	Ok
> > 	Witnesses
> > 	Positive: 1 Negative: 2
> > 	Flag data-race
> > 	Condition exists (1:r1=1 /\ not (x=2))
> > 	Observation OlegNesterov-put_pid-sla Sometimes 1 2
> > 	Time OlegNesterov-put_pid-sla 0.01
> > 	Hash=ec6f71f3d9f7cd6e45a874c872e3d946
> > 
> > But what if you are certain that the compiler cannot mess up your use
> > of plain C-language loads and stores?  Then simply tell LKMM that they
> > are READ_ONCE() and WRITE_ONCE(), respectively.  LKMM is admittedly
> > somewhat paranoid, but real C compilers really do tear stores of certain
> > constants on systems (like x86) that have store-immediate instructions,
> > so a bit of paranoia is not misplaced here.  ;-)
> > 
> > Plus please note that this patch to LKMM is prototype and thus subject
> > to change.
> 
> Ah I see. Appreciate if Alan can also CC me on future posting of this since
> I'm quite interested. ;-)
> 
> thanks,
> 
>  - Joel
Paul E. McKenney March 31, 2019, 9:55 p.m. UTC | #20
On Fri, Mar 29, 2019 at 10:36:39PM -0400, Joel Fernandes wrote:
> On Thu, Mar 28, 2019 at 10:37:07AM -0700, Paul E. McKenney wrote:
> > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> > > On 03/28, Jann Horn wrote:
> > > >
> > > > Since we're just talking about RCU stuff now, adding Paul McKenney to
> > > > the thread.
> > > 
> > > Since you added Paul let me add more confusion to this thread ;)
> > 
> > Woo-hoo!!!  More confusion!  Bring it on!!!  ;-)
> 
> Nice to take part in the confusion fun too!!! ;-)
> 
> > > There were some concerns about the lack of barriers in put_pid(), but I can't
> > > find that old discussion and I forgot the result of that discussion...
> > > 
> > > Paul, could you confirm that this code
> > > 
> > > 	CPU_0		CPU_1
> > > 
> > > 	X = 1;		if (READ_ONCE(Y))
> > > 	mb();			X = 2;
> > > 	Y = 1;		BUG_ON(X != 2);
> > > 
> > > 
> > > is correct? I think it is, control dependency pairs with mb(), right?
> > 
> > The BUG_ON() is supposed to happen at the end of time, correct?
> > As written, there is (in the strict sense) a data race between the load
> > of X in the BUG_ON() and CPU_0's store to X.  In a less strict sense,
> > you could of course argue that this data race is harmless, especially
> > if X is a single byte.  But the more I talk to compiler writers, the
> > less comfortable I become with data races in general.  :-/
> > 
> > So I would also feel better if the "Y = 1" was WRITE_ONCE().
> > 
> > On the other hand, this is a great opportunity to try out Alan Stern's
> > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!
> > 
> > https://lkml.kernel.org/r/Pine.LNX.4.44L0.1903191459270.1593-200000@iolanthe.rowland.org
> > 
> > Also adding Alan on CC.
> > 
> > Here is what I believe is the litmus test that your are interested in:
> > 
> > ------------------------------------------------------------------------
> > C OlegNesterov-put_pid
> > 
> > {}
> > 
> > P0(int *x, int *y)
> > {
> > 	*x = 1;
> > 	smp_mb();
> > 	*y = 1;
> > }
> > 
> > P1(int *x, int *y)
> > {
> > 	int r1;
> > 
> > 	r1 = READ_ONCE(*y);
> > 	if (r1)
> > 		*x = 2;
> > }
> > 
> > exists (1:r1=1 /\ ~x=2)
> > ------------------------------------------------------------------------
> > 
> > Running this through herd with Alan's patch detects the data race
> > and says that the undesired outcome is allowed:
> > 
> > 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus 
> > 	Test OlegNesterov-put_pid Allowed
> > 	States 3
> > 	1:r1=0; x=1;
> > 	1:r1=1; x=1;
> > 	1:r1=1; x=2;
> > 	Ok
> > 	Witnesses
> > 	Positive: 1 Negative: 2
> > 	Flag data-race
> > 	Condition exists (1:r1=1 /\ not (x=2))
> > 	Observation OlegNesterov-put_pid Sometimes 1 2
> > 	Time OlegNesterov-put_pid 0.00
> > 	Hash=a3e0043ad753effa860fea37eeba0a76
> > 
> > Using WRITE_ONCE() for P0()'s store to y still allows this outcome,
> > although it does remove the "Flag data-race".
> > 
> > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x
> > gets rid of both the "Flag data-race" and the undesired outcome:
> > 
> > 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus 
> > 	Test OlegNesterov-put_pid-WO-WO Allowed
> > 	States 2
> > 	1:r1=0; x=1;
> > 	1:r1=1; x=2;
> > 	No
> > 	Witnesses
> > 	Positive: 0 Negative: 2
> > 	Condition exists (1:r1=1 /\ not (x=2))
> > 	Observation OlegNesterov-put_pid-WO-WO Never 0 2
> > 	Time OlegNesterov-put_pid-WO-WO 0.01
> > 	Hash=6e1643e3c5e4739b590bde0a8e8a918e
> > 
> > Here is the corresponding litmus test, in case I messed something up:
> > 
> > ------------------------------------------------------------------------
> > C OlegNesterov-put_pid-WO-WO
> > 
> > {}
> > 
> > P0(int *x, int *y)
> > {
> > 	*x = 1;
> > 	smp_mb();
> > 	WRITE_ONCE(*y, 1);
> > }
> > 
> > P1(int *x, int *y)
> > {
> > 	int r1;
> > 
> > 	r1 = READ_ONCE(*y);
> > 	if (r1)
> > 		WRITE_ONCE(*x, 2);
> > }
> > 
> > exists (1:r1=1 /\ ~x=2)
> 
> I ran the above examples too. Its a bit confusing to me why the WRITE_ONCE in
> P0() is required, and why would the READ_ONCE / WRITE_ONCE in P1() not be
> sufficient to prevent the exists condition. Shouldn't the compiler know that,
> in P0(), it should not reorder the store to y=1 before the x=1 because there
> is an explicit barrier between the 2 stores? Looks me to me like a broken
> compiler :-|. 
> 
> So I would have expected the following litmus to result in Never, but it
> doesn't with Alan's patch:
> 
> P0(int *x, int *y)
> {
> 	*x = 1;
> 	smp_mb();
> 	*y = 1;
> }
> 
> P1(int *x, int *y)
> {
> 	int r1;
> 
> 	r1 = READ_ONCE(*y);
> 	if (r1)
> 		WRITE_ONCE(*x, 2);
> }
> 
> exists (1:r1=1 /\ ~x=2)

The problem is that the compiler can turn both of P0()'s writes into reads:

P0(int *x, int *y)
{
	if (*x != 1)
		*x = 1;
	smp_mb();
	if (*y != 1)
		*y = 1;
}

These reads will not be ordered by smp_wmb(), so you have to do WRITE_ONCE()
to get an iron-clad ordering guarantee.

> > ------------------------------------------------------------------------
> > 
> > > If not, then put_pid() needs atomic_read_acquire() as it was proposed in that
> > > discussion.
> > 
> > Good point, let's try with smp_load_acquire() in P1():
> > 
> > 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-sla.litmus 
> > 	Test OlegNesterov-put_pid-WO-sla Allowed
> > 	States 2
> > 	1:r1=0; x=1;
> > 	1:r1=1; x=2;
> > 	No
> > 	Witnesses
> > 	Positive: 0 Negative: 2
> > 	Condition exists (1:r1=1 /\ not (x=2))
> > 	Observation OlegNesterov-put_pid-WO-sla Never 0 2
> > 	Time OlegNesterov-put_pid-WO-sla 0.01
> > 	Hash=4fb0276eabf924793dec1970199db3a6
> > 
> > This also works.  Here is the litmus test:
> > 
> > ------------------------------------------------------------------------
> > C OlegNesterov-put_pid-WO-sla
> > 
> > {}
> > 
> > P0(int *x, int *y)
> > {
> > 	*x = 1;
> > 	smp_mb();
> > 	WRITE_ONCE(*y, 1);
> > }
> > 
> > P1(int *x, int *y)
> > {
> > 	int r1;
> > 
> > 	r1 = smp_load_acquire(y);
> > 	if (r1)
> > 		*x = 2;
> > }
> > 
> > exists (1:r1=1 /\ ~x=2)
> > ------------------------------------------------------------------------
> > 
> > Demoting P0()'s WRITE_ONCE() to a plain write while leaving P1()'s
> > smp_load_acquire() gets us a data race and allows the undesired
> > outcome:
> 
> Yeah, I think this is also what I was confused about above, is why is that
> WRITE_ONCE required in P0() because there's already an smp_mb there. Surely
> I'm missing something. ;-)

The first of P0()'s writes can be a plain write, at least assuming
sufficient synchronization to avoid the data race.  But turning the second
of P0()'s writes into a plain write is a bit riskier: That is a write of
a constant, and those really are torn in some cases on some architectures.
Like x86, for example.

> > 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-sla.litmus 
> > 	Test OlegNesterov-put_pid-sla Allowed
> > 	States 3
> > 	1:r1=0; x=1;
> > 	1:r1=1; x=1;
> > 	1:r1=1; x=2;
> > 	Ok
> > 	Witnesses
> > 	Positive: 1 Negative: 2
> > 	Flag data-race
> > 	Condition exists (1:r1=1 /\ not (x=2))
> > 	Observation OlegNesterov-put_pid-sla Sometimes 1 2
> > 	Time OlegNesterov-put_pid-sla 0.01
> > 	Hash=ec6f71f3d9f7cd6e45a874c872e3d946
> > 
> > But what if you are certain that the compiler cannot mess up your use
> > of plain C-language loads and stores?  Then simply tell LKMM that they
> > are READ_ONCE() and WRITE_ONCE(), respectively.  LKMM is admittedly
> > somewhat paranoid, but real C compilers really do tear stores of certain
> > constants on systems (like x86) that have store-immediate instructions,
> > so a bit of paranoia is not misplaced here.  ;-)
> > 
> > Plus please note that this patch to LKMM is prototype and thus subject
> > to change.
> 
> Ah I see. Appreciate if Alan can also CC me on future posting of this since
> I'm quite interested. ;-)

His last posting should be easy to find.  But please let me know if not,
as I would be happy to send it along.

							Thanx, Paul
Paul E. McKenney March 31, 2019, 9:57 p.m. UTC | #21
On Sat, Mar 30, 2019 at 11:16:01AM -0400, Alan Stern wrote:
> On Fri, 29 Mar 2019, Joel Fernandes wrote:
> > On Thu, Mar 28, 2019 at 10:37:07AM -0700, Paul E. McKenney wrote:
> > > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> > > > On 03/28, Jann Horn wrote:
> > > > >
> > > > > Since we're just talking about RCU stuff now, adding Paul McKenney to
> > > > > the thread.
> > > > 
> > > > Since you added Paul let me add more confusion to this thread ;)
> > > 
> > > Woo-hoo!!!  More confusion!  Bring it on!!!  ;-)
> > 
> > Nice to take part in the confusion fun too!!! ;-)
> > 
> > > > There were some concerns about the lack of barriers in put_pid(), but I can't
> > > > find that old discussion and I forgot the result of that discussion...
> > > > 
> > > > Paul, could you confirm that this code
> > > > 
> > > > 	CPU_0		CPU_1
> > > > 
> > > > 	X = 1;		if (READ_ONCE(Y))
> > > > 	mb();			X = 2;
> > > > 	Y = 1;		BUG_ON(X != 2);
> > > > 
> > > > 
> > > > is correct? I think it is, control dependency pairs with mb(), right?
> > > 
> > > The BUG_ON() is supposed to happen at the end of time, correct?
> > > As written, there is (in the strict sense) a data race between the load
> > > of X in the BUG_ON() and CPU_0's store to X.  In a less strict sense,
> > > you could of course argue that this data race is harmless, especially
> > > if X is a single byte.  But the more I talk to compiler writers, the
> > > less comfortable I become with data races in general.  :-/
> > > 
> > > So I would also feel better if the "Y = 1" was WRITE_ONCE().
> > > 
> > > On the other hand, this is a great opportunity to try out Alan Stern's
> > > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!
> > > 
> > > https://lkml.kernel.org/r/Pine.LNX.4.44L0.1903191459270.1593-200000@iolanthe.rowland.org
> > > 
> > > Also adding Alan on CC.
> > > 
> > > Here is what I believe is the litmus test that your are interested in:
> > > 
> > > ------------------------------------------------------------------------
> > > C OlegNesterov-put_pid
> > > 
> > > {}
> > > 
> > > P0(int *x, int *y)
> > > {
> > > 	*x = 1;
> > > 	smp_mb();
> > > 	*y = 1;
> > > }
> > > 
> > > P1(int *x, int *y)
> > > {
> > > 	int r1;
> > > 
> > > 	r1 = READ_ONCE(*y);
> > > 	if (r1)
> > > 		*x = 2;
> > > }
> > > 
> > > exists (1:r1=1 /\ ~x=2)
> > > ------------------------------------------------------------------------
> > > 
> > > Running this through herd with Alan's patch detects the data race
> > > and says that the undesired outcome is allowed:
> > > 
> > > 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus 
> > > 	Test OlegNesterov-put_pid Allowed
> > > 	States 3
> > > 	1:r1=0; x=1;
> > > 	1:r1=1; x=1;
> > > 	1:r1=1; x=2;
> > > 	Ok
> > > 	Witnesses
> > > 	Positive: 1 Negative: 2
> > > 	Flag data-race
> > > 	Condition exists (1:r1=1 /\ not (x=2))
> > > 	Observation OlegNesterov-put_pid Sometimes 1 2
> > > 	Time OlegNesterov-put_pid 0.00
> > > 	Hash=a3e0043ad753effa860fea37eeba0a76
> > > 
> > > Using WRITE_ONCE() for P0()'s store to y still allows this outcome,
> > > although it does remove the "Flag data-race".
> > > 
> > > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x
> > > gets rid of both the "Flag data-race" and the undesired outcome:
> > > 
> > > 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus 
> > > 	Test OlegNesterov-put_pid-WO-WO Allowed
> > > 	States 2
> > > 	1:r1=0; x=1;
> > > 	1:r1=1; x=2;
> > > 	No
> > > 	Witnesses
> > > 	Positive: 0 Negative: 2
> > > 	Condition exists (1:r1=1 /\ not (x=2))
> > > 	Observation OlegNesterov-put_pid-WO-WO Never 0 2
> > > 	Time OlegNesterov-put_pid-WO-WO 0.01
> > > 	Hash=6e1643e3c5e4739b590bde0a8e8a918e
> > > 
> > > Here is the corresponding litmus test, in case I messed something up:
> > > 
> > > ------------------------------------------------------------------------
> > > C OlegNesterov-put_pid-WO-WO
> > > 
> > > {}
> > > 
> > > P0(int *x, int *y)
> > > {
> > > 	*x = 1;
> > > 	smp_mb();
> > > 	WRITE_ONCE(*y, 1);
> > > }
> > > 
> > > P1(int *x, int *y)
> > > {
> > > 	int r1;
> > > 
> > > 	r1 = READ_ONCE(*y);
> > > 	if (r1)
> > > 		WRITE_ONCE(*x, 2);
> > > }
> > > 
> > > exists (1:r1=1 /\ ~x=2)
> > 
> > I ran the above examples too. Its a bit confusing to me why the WRITE_ONCE in
> > P0() is required,
> 
> If the "WRITE_ONCE(*y, 1)" in P0 were written instead as "*y = 1", it
> would race with P1's "READ_ONCE(*y)".
> 
> >  and why would the READ_ONCE / WRITE_ONCE in P1() not be
> > sufficient to prevent the exists condition. Shouldn't the compiler know that,
> > in P0(), it should not reorder the store to y=1 before the x=1 because there
> > is an explicit barrier between the 2 stores? Looks me to me like a broken
> > compiler :-|. 
> > 
> > So I would have expected the following litmus to result in Never, but it
> > doesn't with Alan's patch:
> > 
> > P0(int *x, int *y)
> > {
> > 	*x = 1;
> > 	smp_mb();
> > 	*y = 1;
> > }
> > 
> > P1(int *x, int *y)
> > {
> > 	int r1;
> > 
> > 	r1 = READ_ONCE(*y);
> > 	if (r1)
> > 		WRITE_ONCE(*x, 2);
> > }
> > 
> > exists (1:r1=1 /\ ~x=2)
> 
> You have to realize that in the presence of a data race, all bets are
> off.  The memory model will still output a prediction, but there is no
> guarantee that the prediction will be correct.
> 
> In this case P0's write to y races with P1's READ_ONCE.  Therefore the 
> memory model may very will give an incorrect result.
> 
> > > ------------------------------------------------------------------------
> > > 
> > > > If not, then put_pid() needs atomic_read_acquire() as it was proposed in that
> > > > discussion.
> > > 
> > > Good point, let's try with smp_load_acquire() in P1():
> > > 
> > > 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-sla.litmus 
> > > 	Test OlegNesterov-put_pid-WO-sla Allowed
> > > 	States 2
> > > 	1:r1=0; x=1;
> > > 	1:r1=1; x=2;
> > > 	No
> > > 	Witnesses
> > > 	Positive: 0 Negative: 2
> > > 	Condition exists (1:r1=1 /\ not (x=2))
> > > 	Observation OlegNesterov-put_pid-WO-sla Never 0 2
> > > 	Time OlegNesterov-put_pid-WO-sla 0.01
> > > 	Hash=4fb0276eabf924793dec1970199db3a6
> > > 
> > > This also works.  Here is the litmus test:
> > > 
> > > ------------------------------------------------------------------------
> > > C OlegNesterov-put_pid-WO-sla
> > > 
> > > {}
> > > 
> > > P0(int *x, int *y)
> > > {
> > > 	*x = 1;
> > > 	smp_mb();
> > > 	WRITE_ONCE(*y, 1);
> > > }
> > > 
> > > P1(int *x, int *y)
> > > {
> > > 	int r1;
> > > 
> > > 	r1 = smp_load_acquire(y);
> > > 	if (r1)
> > > 		*x = 2;
> > > }
> > > 
> > > exists (1:r1=1 /\ ~x=2)
> > > ------------------------------------------------------------------------
> > > 
> > > Demoting P0()'s WRITE_ONCE() to a plain write while leaving P1()'s
> > > smp_load_acquire() gets us a data race and allows the undesired
> > > outcome:
> > 
> > Yeah, I think this is also what I was confused about above, is why is that
> > WRITE_ONCE required in P0() because there's already an smp_mb there. Surely
> > I'm missing something. ;-)
> 
> A plain write to *y in P0 races with the smp_load_acquire in P1.  
> That's all -- it's not very deep or subtle.  Remember, the definition
> of a race is two concurrent accesses to the same variable from
> different CPUs, where at least one of the accesses is plain and at
> least one of them is a write.
> 
> I've heard that people on the C++ Standards committee have proposed
> that plain writes should not race with marked reads.  That is, when
> such concurrent accesses occur the outcome should be an undefined
> result for the marked read rather than undefined behavior.  If this
> change gets adopted and we put it into the memory model, then your
> expectation would be correct.  But as things stand, it isn't.

At least in the case where the marking is a volatile, yes.  But there
might be some distance between a proposal and an actual change to the
standard.  ;-)

There is also a proposal for a memcpy-like thing that acts as plain
loads and stores, but is defined not to data-race with marked accesses.
However, tearing and so on is possible, so if you do race with marked
accesses, you might see surprising results.

							Thanx, Paul
David Laight April 1, 2019, 3:28 p.m. UTC | #22
From: Alan Stern
> Sent: 29 March 2019 19:45
...
> There is a big difference between WRITE_ONCE() and plain assignment.
> Given "WRITE_ONCE(X, 2)", the compiler will emit a simple store
> instruction.  But given "X = 2", the compiler is allowed to emit
> instructions equivalent to:
> 
> 	if (X != 2)
> 		X = 2;

Worse for you, it can also emit:
	X = 0;
	X = 2;

Many years ago I fell foul of a compiler (not C) that implemented
a write to a 2 bit wide bitfield as:
	X &= ~3
	X |= value
even when 'value' was a compile time constant of 3.
Took a while to find out why the linked list got f*cked.

	David

Joel Fernandes April 1, 2019, 9:11 p.m. UTC | #23
Thanks a lot Alan and Paul for the replies. I replied inline.

On Sun, Mar 31, 2019 at 02:55:31PM -0700, Paul E. McKenney wrote:
> On Fri, Mar 29, 2019 at 10:36:39PM -0400, Joel Fernandes wrote:
> > On Thu, Mar 28, 2019 at 10:37:07AM -0700, Paul E. McKenney wrote:
> > > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> > > > On 03/28, Jann Horn wrote:
> > > > >
> > > > > Since we're just talking about RCU stuff now, adding Paul McKenney to
> > > > > the thread.
> > > > 
> > > > Since you added Paul let me add more confusion to this thread ;)
> > > 
> > > Woo-hoo!!!  More confusion!  Bring it on!!!  ;-)
> > 
> > Nice to take part in the confusion fun too!!! ;-)
> > 
> > > > There were some concerns about the lack of barriers in put_pid(), but I can't
> > > > find that old discussion and I forgot the result of that discussion...
> > > > 
> > > > Paul, could you confirm that this code
> > > > 
> > > > 	CPU_0		CPU_1
> > > > 
> > > > 	X = 1;		if (READ_ONCE(Y))
> > > > 	mb();			X = 2;
> > > > 	Y = 1;		BUG_ON(X != 2);
> > > > 
> > > > 
> > > > is correct? I think it is, control dependency pairs with mb(), right?
> > > 
> > > The BUG_ON() is supposed to happen at the end of time, correct?
> > > As written, there is (in the strict sense) a data race between the load
> > > of X in the BUG_ON() and CPU_0's store to X.  In a less strict sense,
> > > you could of course argue that this data race is harmless, especially
> > > if X is a single byte.  But the more I talk to compiler writers, the
> > > less comfortable I become with data races in general.  :-/
> > > 
> > > So I would also feel better if the "Y = 1" was WRITE_ONCE().
> > > 
> > > On the other hand, this is a great opportunity to try out Alan Stern's
> > > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!
> > > 
> > > https://lkml.kernel.org/r/Pine.LNX.4.44L0.1903191459270.1593-200000@iolanthe.rowland.org
> > > 
> > > Also adding Alan on CC.
> > > 
> > > Here is what I believe is the litmus test that you are interested in:
> > > 
> > > ------------------------------------------------------------------------
> > > C OlegNesterov-put_pid
> > > 
> > > {}
> > > 
> > > P0(int *x, int *y)
> > > {
> > > 	*x = 1;
> > > 	smp_mb();
> > > 	*y = 1;
> > > }
> > > 
> > > P1(int *x, int *y)
> > > {
> > > 	int r1;
> > > 
> > > 	r1 = READ_ONCE(*y);
> > > 	if (r1)
> > > 		*x = 2;
> > > }
> > > 
> > > exists (1:r1=1 /\ ~x=2)
> > > ------------------------------------------------------------------------
> > > 
> > > Running this through herd with Alan's patch detects the data race
> > > and says that the undesired outcome is allowed:
> > > 
> > > 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus 
> > > 	Test OlegNesterov-put_pid Allowed
> > > 	States 3
> > > 	1:r1=0; x=1;
> > > 	1:r1=1; x=1;
> > > 	1:r1=1; x=2;
> > > 	Ok
> > > 	Witnesses
> > > 	Positive: 1 Negative: 2
> > > 	Flag data-race
> > > 	Condition exists (1:r1=1 /\ not (x=2))
> > > 	Observation OlegNesterov-put_pid Sometimes 1 2
> > > 	Time OlegNesterov-put_pid 0.00
> > > 	Hash=a3e0043ad753effa860fea37eeba0a76
> > > 
> > > Using WRITE_ONCE() for P0()'s store to y still allows this outcome,
> > > although it does remove the "Flag data-race".
> > > 
> > > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x
> > > gets rid of both the "Flag data-race" and the undesired outcome:
> > > 
> > > 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus 
> > > 	Test OlegNesterov-put_pid-WO-WO Allowed
> > > 	States 2
> > > 	1:r1=0; x=1;
> > > 	1:r1=1; x=2;
> > > 	No
> > > 	Witnesses
> > > 	Positive: 0 Negative: 2
> > > 	Condition exists (1:r1=1 /\ not (x=2))
> > > 	Observation OlegNesterov-put_pid-WO-WO Never 0 2
> > > 	Time OlegNesterov-put_pid-WO-WO 0.01
> > > 	Hash=6e1643e3c5e4739b590bde0a8e8a918e
> > > 
> > > Here is the corresponding litmus test, in case I messed something up:
> > > 
> > > ------------------------------------------------------------------------
> > > C OlegNesterov-put_pid-WO-WO
> > > 
> > > {}
> > > 
> > > P0(int *x, int *y)
> > > {
> > > 	*x = 1;
> > > 	smp_mb();
> > > 	WRITE_ONCE(*y, 1);
> > > }
> > > 
> > > P1(int *x, int *y)
> > > {
> > > 	int r1;
> > > 
> > > 	r1 = READ_ONCE(*y);
> > > 	if (r1)
> > > 		WRITE_ONCE(*x, 2);
> > > }
> > > 
> > > exists (1:r1=1 /\ ~x=2)
> > 
> > I ran the above examples too. It's a bit confusing to me why the WRITE_ONCE in
> > P0() is required, and why would the READ_ONCE / WRITE_ONCE in P1() not be
> > sufficient to prevent the exists condition. Shouldn't the compiler know that,
> > in P0(), it should not reorder the store to y=1 before the x=1 because there
> > is an explicit barrier between the 2 stores? Looks to me like a broken
> > compiler :-|. 
> > 
> > So I would have expected the following litmus to result in Never, but it
> > doesn't with Alan's patch:
> > 
> > P0(int *x, int *y)
> > {
> > 	*x = 1;
> > 	smp_mb();
> > 	*y = 1;
> > }
> > 
> > P1(int *x, int *y)
> > {
> > 	int r1;
> > 
> > 	r1 = READ_ONCE(*y);
> > 	if (r1)
> > 		WRITE_ONCE(*x, 2);
> > }
> > 
> > exists (1:r1=1 /\ ~x=2)
> 
> The problem is that the compiler can turn both of P0()'s writes into reads:
> 
> P0(int *x, int *y)
> {
> 	if (*x != 1)
> 		*x = 1;
> 	smp_mb();
> 	if (*y != 1)
> 		*y = 1;
> }
> 
> These reads will not be ordered by smp_wmb(), so you have to do WRITE_ONCE()
> to get an iron-clad ordering guarantee.

But the snippet above has smp_mb() which does order writes, even for the
plain accesses.

> > > ------------------------------------------------------------------------
> > > 
> > > > If not, then put_pid() needs atomic_read_acquire() as it was proposed in that
> > > > discussion.
> > > 
> > > Good point, let's try with smp_load_acquire() in P1():
> > > 
> > > 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-sla.litmus 
> > > 	Test OlegNesterov-put_pid-WO-sla Allowed
> > > 	States 2
> > > 	1:r1=0; x=1;
> > > 	1:r1=1; x=2;
> > > 	No
> > > 	Witnesses
> > > 	Positive: 0 Negative: 2
> > > 	Condition exists (1:r1=1 /\ not (x=2))
> > > 	Observation OlegNesterov-put_pid-WO-sla Never 0 2
> > > 	Time OlegNesterov-put_pid-WO-sla 0.01
> > > 	Hash=4fb0276eabf924793dec1970199db3a6
> > > 
> > > This also works.  Here is the litmus test:
> > > 
> > > ------------------------------------------------------------------------
> > > C OlegNesterov-put_pid-WO-sla
> > > 
> > > {}
> > > 
> > > P0(int *x, int *y)
> > > {
> > > 	*x = 1;
> > > 	smp_mb();
> > > 	WRITE_ONCE(*y, 1);
> > > }
> > > 
> > > P1(int *x, int *y)
> > > {
> > > 	int r1;
> > > 
> > > 	r1 = smp_load_acquire(y);
> > > 	if (r1)
> > > 		*x = 2;
> > > }
> > > 
> > > exists (1:r1=1 /\ ~x=2)
> > > ------------------------------------------------------------------------
> > > 
> > > Demoting P0()'s WRITE_ONCE() to a plain write while leaving P1()'s
> > > smp_load_acquire() gets us a data race and allows the undesired
> > > outcome:
> > 
> > Yeah, I think this is also what I was confused about above, is why is that
> > WRITE_ONCE required in P0() because there's already an smp_mb there. Surely
> > I'm missing something. ;-)
> 
> The first of P0()'s writes can be a plain write, at least assuming
> sufficient synchronization to avoid the data race.  But turning the second
> of P0()'s writes into a plain write is a bit riskier: That is a write of
> a constant, and those really are torn in some cases on some architectures.
> Like x86, for example.

I understand. Are we talking about load/store tearing being the issue here?
Even under load/store tearing, I feel the program will produce correct
results because r1 is either 0 or 1 (a single bit cannot be torn).

Further, from the herd simulator output (below), according to the "States",
r1==1 means P1() AFAICS would have already finished the read and set the
r1 register to 1.  Then I am wondering why it couldn't take the branch to set
*x to 2.  According to herd, r1 == 1 AND x == 1 is a perfectly valid state
for the below program. I still couldn't see how this is possible for the
below program, in terms of compiler optimizations or other kinds of
ordering, because there is an smp_mb() between the 2 plain writes in P0()
and P1() did establish that r1 is 1 in the positive case. :-/.  I am surely
missing something :-)

---8<-----------------------
C Joel-put_pid

{}

P0(int *x, int *y)
{
	*x = 1;
	smp_mb();
	*y = 1;
}

P1(int *x, int *y)
{
	int r1;

	r1 = READ_ONCE(*y);
	if (r1)
		WRITE_ONCE(*x, 2);
}

exists (1:r1=1 /\ ~x=2)

---8<-----------------------
Output:

Test Joel-put_pid Allowed
States 3
1:r1=0; x=1;
1:r1=1; x=1;	<-- Can't figure out why r1=1 and x != 2 here.
1:r1=1; x=2;
Ok
Witnesses
Positive: 1 Negative: 2
Flag data-race
Condition exists (1:r1=1 /\ not (x=2))
Observation Joel-put_pid Sometimes 1 2
Time Joel-put_pid 0.01
Hash=c7bdd50418d42779b7c10ba9128369df

> > > 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-sla.litmus 
> > > 	Test OlegNesterov-put_pid-sla Allowed
> > > 	States 3
> > > 	1:r1=0; x=1;
> > > 	1:r1=1; x=1;
> > > 	1:r1=1; x=2;
> > > 	Ok
> > > 	Witnesses
> > > 	Positive: 1 Negative: 2
> > > 	Flag data-race
> > > 	Condition exists (1:r1=1 /\ not (x=2))
> > > 	Observation OlegNesterov-put_pid-sla Sometimes 1 2
> > > 	Time OlegNesterov-put_pid-sla 0.01
> > > 	Hash=ec6f71f3d9f7cd6e45a874c872e3d946
> > > 
> > > But what if you are certain that the compiler cannot mess up your use
> > > of plain C-language loads and stores?  Then simply tell LKMM that they
> > > are READ_ONCE() and WRITE_ONCE(), respectively.  LKMM is admittedly
> > > somewhat paranoid, but real C compilers really do tear stores of certain
> > > constants on systems (like x86) that have store-immediate instructions,
> > > so a bit of paranoia is not misplaced here.  ;-)
> > > 
> > > Plus please note that this patch to LKMM is a prototype and thus subject
> > > to change.
> > 
> > Ah I see. Appreciate if Alan can also CC me on future posting of this since
> > I'm quite interested. ;-)
> 
> His last posting should be easy to find.  But please let me know if not,
> as I would be happy to send it along.

I found it and I'm going through it. Thanks!
 
 - Joel
Paul E. McKenney April 4, 2019, 3:23 p.m. UTC | #24
On Mon, Apr 01, 2019 at 05:11:39PM -0400, Joel Fernandes wrote:
> Thanks a lot Alan and Paul for the replies. I replied inline.
> 
> On Sun, Mar 31, 2019 at 02:55:31PM -0700, Paul E. McKenney wrote:
> > On Fri, Mar 29, 2019 at 10:36:39PM -0400, Joel Fernandes wrote:
> > > On Thu, Mar 28, 2019 at 10:37:07AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> > > > > On 03/28, Jann Horn wrote:
> > > > > >
> > > > > > Since we're just talking about RCU stuff now, adding Paul McKenney to
> > > > > > the thread.
> > > > > 
> > > > > Since you added Paul let me add more confusion to this thread ;)
> > > > 
> > > > Woo-hoo!!!  More confusion!  Bring it on!!!  ;-)
> > > 
> > > Nice to take part in the confusion fun too!!! ;-)
> > > 
> > > > > There were some concerns about the lack of barriers in put_pid(), but I can't
> > > > > find that old discussion and I forgot the result of that discussion...
> > > > > 
> > > > > Paul, could you confirm that this code
> > > > > 
> > > > > 	CPU_0		CPU_1
> > > > > 
> > > > > 	X = 1;		if (READ_ONCE(Y))
> > > > > 	mb();			X = 2;
> > > > > 	Y = 1;		BUG_ON(X != 2);
> > > > > 
> > > > > 
> > > > > is correct? I think it is, control dependency pairs with mb(), right?
> > > > 
> > > > The BUG_ON() is supposed to happen at the end of time, correct?
> > > > As written, there is (in the strict sense) a data race between the load
> > > > of X in the BUG_ON() and CPU_0's store to X.  In a less strict sense,
> > > > you could of course argue that this data race is harmless, especially
> > > > if X is a single byte.  But the more I talk to compiler writers, the
> > > > less comfortable I become with data races in general.  :-/
> > > > 
> > > > So I would also feel better if the "Y = 1" was WRITE_ONCE().
> > > > 
> > > > On the other hand, this is a great opportunity to try out Alan Stern's
> > > > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!
> > > > 
> > > > https://lkml.kernel.org/r/Pine.LNX.4.44L0.1903191459270.1593-200000@iolanthe.rowland.org
> > > > 
> > > > Also adding Alan on CC.
> > > > 
> > > > Here is what I believe is the litmus test that you are interested in:
> > > > 
> > > > ------------------------------------------------------------------------
> > > > C OlegNesterov-put_pid
> > > > 
> > > > {}
> > > > 
> > > > P0(int *x, int *y)
> > > > {
> > > > 	*x = 1;
> > > > 	smp_mb();
> > > > 	*y = 1;
> > > > }
> > > > 
> > > > P1(int *x, int *y)
> > > > {
> > > > 	int r1;
> > > > 
> > > > 	r1 = READ_ONCE(*y);
> > > > 	if (r1)
> > > > 		*x = 2;
> > > > }
> > > > 
> > > > exists (1:r1=1 /\ ~x=2)
> > > > ------------------------------------------------------------------------
> > > > 
> > > > Running this through herd with Alan's patch detects the data race
> > > > and says that the undesired outcome is allowed:
> > > > 
> > > > 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus 
> > > > 	Test OlegNesterov-put_pid Allowed
> > > > 	States 3
> > > > 	1:r1=0; x=1;
> > > > 	1:r1=1; x=1;
> > > > 	1:r1=1; x=2;
> > > > 	Ok
> > > > 	Witnesses
> > > > 	Positive: 1 Negative: 2
> > > > 	Flag data-race
> > > > 	Condition exists (1:r1=1 /\ not (x=2))
> > > > 	Observation OlegNesterov-put_pid Sometimes 1 2
> > > > 	Time OlegNesterov-put_pid 0.00
> > > > 	Hash=a3e0043ad753effa860fea37eeba0a76
> > > > 
> > > > Using WRITE_ONCE() for P0()'s store to y still allows this outcome,
> > > > although it does remove the "Flag data-race".
> > > > 
> > > > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x
> > > > gets rid of both the "Flag data-race" and the undesired outcome:
> > > > 
> > > > 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus 
> > > > 	Test OlegNesterov-put_pid-WO-WO Allowed
> > > > 	States 2
> > > > 	1:r1=0; x=1;
> > > > 	1:r1=1; x=2;
> > > > 	No
> > > > 	Witnesses
> > > > 	Positive: 0 Negative: 2
> > > > 	Condition exists (1:r1=1 /\ not (x=2))
> > > > 	Observation OlegNesterov-put_pid-WO-WO Never 0 2
> > > > 	Time OlegNesterov-put_pid-WO-WO 0.01
> > > > 	Hash=6e1643e3c5e4739b590bde0a8e8a918e
> > > > 
> > > > Here is the corresponding litmus test, in case I messed something up:
> > > > 
> > > > ------------------------------------------------------------------------
> > > > C OlegNesterov-put_pid-WO-WO
> > > > 
> > > > {}
> > > > 
> > > > P0(int *x, int *y)
> > > > {
> > > > 	*x = 1;
> > > > 	smp_mb();
> > > > 	WRITE_ONCE(*y, 1);
> > > > }
> > > > 
> > > > P1(int *x, int *y)
> > > > {
> > > > 	int r1;
> > > > 
> > > > 	r1 = READ_ONCE(*y);
> > > > 	if (r1)
> > > > 		WRITE_ONCE(*x, 2);
> > > > }
> > > > 
> > > > exists (1:r1=1 /\ ~x=2)
> > > 
> > > I ran the above examples too. It's a bit confusing to me why the WRITE_ONCE in
> > > P0() is required, and why would the READ_ONCE / WRITE_ONCE in P1() not be
> > > sufficient to prevent the exists condition. Shouldn't the compiler know that,
> > > in P0(), it should not reorder the store to y=1 before the x=1 because there
> > > is an explicit barrier between the 2 stores? Looks to me like a broken
> > > compiler :-|. 
> > > 
> > > So I would have expected the following litmus to result in Never, but it
> > > doesn't with Alan's patch:
> > > 
> > > P0(int *x, int *y)
> > > {
> > > 	*x = 1;
> > > 	smp_mb();
> > > 	*y = 1;
> > > }
> > > 
> > > P1(int *x, int *y)
> > > {
> > > 	int r1;
> > > 
> > > 	r1 = READ_ONCE(*y);
> > > 	if (r1)
> > > 		WRITE_ONCE(*x, 2);
> > > }
> > > 
> > > exists (1:r1=1 /\ ~x=2)
> > 
> > The problem is that the compiler can turn both of P0()'s writes into reads:
> > 
> > P0(int *x, int *y)
> > {
> > 	if (*x != 1)
> > 		*x = 1;
> > 	smp_mb();
> > 	if (*y != 1)
> > 		*y = 1;
> > }
> > 
> > These reads will not be ordered by smp_wmb(), so you have to do WRITE_ONCE()
> > to get an iron-clad ordering guarantee.
> 
> But the snippet above has smp_mb() which does order writes, even for the
> plain accesses.

True, but that code has a data race, namely P0()'s plain write to y
races with P1()'s READ_ONCE().  Data races give the compiler absolutely
astonishing levels of freedom to rearrange your code.  So, if you
as a developer or maintainer choose to have data races, it is your
responsibility to ensure that the compiler cannot mess you up.  So what
you should do in that case is to list the transformed code the compiler's
optimizations can produce and feed the corresponding litmus tests to LKMM,
using READ_ONCE() and WRITE_ONCE() for the post-optimization accesses.

> > > > ------------------------------------------------------------------------
> > > > 
> > > > > If not, then put_pid() needs atomic_read_acquire() as it was proposed in that
> > > > > discussion.
> > > > 
> > > > Good point, let's try with smp_load_acquire() in P1():
> > > > 
> > > > 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-sla.litmus 
> > > > 	Test OlegNesterov-put_pid-WO-sla Allowed
> > > > 	States 2
> > > > 	1:r1=0; x=1;
> > > > 	1:r1=1; x=2;
> > > > 	No
> > > > 	Witnesses
> > > > 	Positive: 0 Negative: 2
> > > > 	Condition exists (1:r1=1 /\ not (x=2))
> > > > 	Observation OlegNesterov-put_pid-WO-sla Never 0 2
> > > > 	Time OlegNesterov-put_pid-WO-sla 0.01
> > > > 	Hash=4fb0276eabf924793dec1970199db3a6
> > > > 
> > > > This also works.  Here is the litmus test:
> > > > 
> > > > ------------------------------------------------------------------------
> > > > C OlegNesterov-put_pid-WO-sla
> > > > 
> > > > {}
> > > > 
> > > > P0(int *x, int *y)
> > > > {
> > > > 	*x = 1;
> > > > 	smp_mb();
> > > > 	WRITE_ONCE(*y, 1);
> > > > }
> > > > 
> > > > P1(int *x, int *y)
> > > > {
> > > > 	int r1;
> > > > 
> > > > 	r1 = smp_load_acquire(y);
> > > > 	if (r1)
> > > > 		*x = 2;
> > > > }
> > > > 
> > > > exists (1:r1=1 /\ ~x=2)
> > > > ------------------------------------------------------------------------
> > > > 
> > > > Demoting P0()'s WRITE_ONCE() to a plain write while leaving P1()'s
> > > > smp_load_acquire() gets us a data race and allows the undesired
> > > > outcome:
> > > 
> > > Yeah, I think this is also what I was confused about above, is why is that
> > > WRITE_ONCE required in P0() because there's already an smp_mb there. Surely
> > > I'm missing something. ;-)
> > 
> > The first of P0()'s writes can be a plain write, at least assuming
> > sufficient synchronization to avoid the data race.  But turning the second
> > of P0()'s writes into a plain write is a bit riskier: That is a write of
> > a constant, and those really are torn in some cases on some architectures.
> > Like x86, for example.
> 
> I understand. Are we talking about load/store tearing being the issue here?
> Even under load/store tearing, I feel the program will produce correct
> results because r1 is either 0 or 1 (a single bit cannot be torn).

The compiler can (at least in theory) also transform *y = 1 as follows:

	*y = r1; /* Use *y as temporary storage. */
	do_something();
	r1 = *y;
	*y = 1;

Here r1 is some register and do_something() is an inline function visible
to the compiler (hence not implying barrier() upon call and return).

I don't know of any compilers that actually do this, but for me use
of WRITE_ONCE() is a small price to pay to prevent all past, present,
and future compilers from ever destroying^W optimizing my code in this way.

In your own code, if you dislike WRITE_ONCE() so much that you are willing
to commit to (now and forever!!!) checking all applicable versions of
compilers to make sure that they don't do this, well and good, knock
yourself out.  But it is your responsibility to do that checking, and you
can attest to LKMM that you have done so by giving LKMM litmus tests that
substitute WRITE_ONCE() for that plain C-language assignment statement.

> Further, from the herd simulator output (below), according to the "States",
> r1==1 means P1() AFAICS would have already finished the read and set the
> r1 register to 1.  Then I am wondering why it couldn't take the branch to set
> *x to 2.  According to herd, r1 == 1 AND x == 1 is a perfectly valid state
> for the below program. I still couldn't see in my mind how for the below
> program, this is possible - in terms of compiler optimizations or other kinds
> of ordering. Because there is a smp_mb() between the 2 plain writes in P0()
> and P1() did establish that r1 is 1 in the positive case. :-/.  I am surely
> missing something :-)
> 
> ---8<-----------------------
> C Joel-put_pid
> 
> {}
> 
> P0(int *x, int *y)
> {
> 	*x = 1;
> 	smp_mb();
> 	*y = 1;
> }
> 
> P1(int *x, int *y)
> {
> 	int r1;
> 
> 	r1 = READ_ONCE(*y);
> 	if (r1)
> 		WRITE_ONCE(*x, 2);
> }
> 
> exists (1:r1=1 /\ ~x=2)
> 
> ---8<-----------------------
> Output:
> 
> Test Joel-put_pid Allowed
> States 3
> 1:r1=0; x=1;
> 1:r1=1; x=1;	<-- Can't figure out why r1=1 and x != 2 here.

I must defer to Alan on this, but my guess is that this is due to
the fact that there is a data race.

							Thanx, Paul

> 1:r1=1; x=2;
> Ok
> Witnesses
> Positive: 1 Negative: 2
> Flag data-race
> Condition exists (1:r1=1 /\ not (x=2))
> Observation Joel-put_pid Sometimes 1 2
> Time Joel-put_pid 0.01
> Hash=c7bdd50418d42779b7c10ba9128369df
> 
> > > > 	$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-sla.litmus 
> > > > 	Test OlegNesterov-put_pid-sla Allowed
> > > > 	States 3
> > > > 	1:r1=0; x=1;
> > > > 	1:r1=1; x=1;
> > > > 	1:r1=1; x=2;
> > > > 	Ok
> > > > 	Witnesses
> > > > 	Positive: 1 Negative: 2
> > > > 	Flag data-race
> > > > 	Condition exists (1:r1=1 /\ not (x=2))
> > > > 	Observation OlegNesterov-put_pid-sla Sometimes 1 2
> > > > 	Time OlegNesterov-put_pid-sla 0.01
> > > > 	Hash=ec6f71f3d9f7cd6e45a874c872e3d946
> > > > 
> > > > But what if you are certain that the compiler cannot mess up your use
> > > > of plain C-language loads and stores?  Then simply tell LKMM that they
> > > > are READ_ONCE() and WRITE_ONCE(), respectively.  LKMM is admittedly
> > > > somewhat paranoid, but real C compilers really do tear stores of certain
> > > > constants on systems (like x86) that have store-immediate instructions,
> > > > so a bit of paranoia is not misplaced here.  ;-)
> > > > 
> > > > Plus please note that this patch to LKMM is a prototype and thus subject
> > > > to change.
> > > 
> > > Ah I see. Appreciate if Alan can also CC me on future posting of this since
> > > I'm quite interested. ;-)
> > 
> > His last posting should be easy to find.  But please let me know if not,
> > as I would be happy to send it along.
> 
> I found it and I'm going through it. Thanks!
>  
>  - Joel
>
Alan Stern April 4, 2019, 4:01 p.m. UTC | #25
On Thu, 4 Apr 2019, Paul E. McKenney wrote:

> On Mon, Apr 01, 2019 at 05:11:39PM -0400, Joel Fernandes wrote:

> > > > So I would have expected the following litmus to result in Never, but it
> > > > doesn't with Alan's patch:
> > > > 
> > > > P0(int *x, int *y)
> > > > {
> > > > 	*x = 1;
> > > > 	smp_mb();
> > > > 	*y = 1;
> > > > }
> > > > 
> > > > P1(int *x, int *y)
> > > > {
> > > > 	int r1;
> > > > 
> > > > 	r1 = READ_ONCE(*y);
> > > > 	if (r1)
> > > > 		WRITE_ONCE(*x, 2);
> > > > }
> > > > 
> > > > exists (1:r1=1 /\ ~x=2)
> > > 
> > > The problem is that the compiler can turn both of P0()'s writes into reads:
> > > 
> > > P0(int *x, int *y)
> > > {
> > > 	if (*x != 1)
> > > 		*x = 1;
> > > 	smp_mb();
> > > 	if (*y != 1)
> > > 		*y = 1;
> > > }
> > > 
> > > These reads will not be ordered by smp_wmb(), so you have to do WRITE_ONCE()
> > > to get an iron-clad ordering guarantee.
> > 
> > But the snippet above has smp_mb() which does order writes, even for the
> > plain accesses.
> 
> True, but that code has a data race, namely P0()'s plain write to y
> races with P1()'s READ_ONCE().  Data races give the compiler absolutely
> astonishing levels of freedom to rearrange your code.  So, if you
> as a developer or maintainer choose to have data races, it is your
> responsibility to ensure that the compiler cannot mess you up.  So what
> you should do in that case is to list the transformed code the compiler's
> optimizations can produce and feed the corresponding litmus tests to LKMM,
> using READ_ONCE() and WRITE_ONCE() for the post-optimization accesses.

In this case, I strongly suspect the compiler would not mess up the
code badly enough to allow the unwanted end result.  But the LKMM tries
to avoid making strong assumptions about what compilers will or will
not do.


> > I understand. Are we talking about load/store tearing being the issue here?
> > Even under load/store tearing, I feel the program will produce correct
> > results because r1 is either 0 or 1 (a single bit cannot be torn).
> 
> The compiler can (at least in theory) also transform *y = 1 as follows:
> 
> 	*y = r1; /* Use *y as temporary storage. */
> 	do_something();
> 	r1 = *y;
> 	*y = 1;
> 
> Here r1 is some register and do_something() is an inline function visible
> to the compiler (hence not implying barrier() upon call and return).
> 
> I don't know of any compilers that actually do this, but for me use
> of WRITE_ONCE() is a small price to pay to prevent all past, present,
> and future compilers from ever destroying^W optimizing my code in this way.
> 
> In your own code, if you dislike WRITE_ONCE() so much that you are willing
> to commit to (now and forever!!!) checking all applicable versions of
> compilers to make sure that they don't do this, well and good, knock
> yourself out.  But it is your responsibility to do that checking, and you
> can attest to LKMM that you have done so by giving LKMM litmus tests that
> substitute WRITE_ONCE() for that plain C-language assignment statement.

Remember that the LKMM does not produce strict bounds.  That is, the
LKMM will say that some outcomes are allowed even though no existing
compiler/CPU combination would ever actually produce them.  This litmus
test is an example.

> > Further, from the herd simulator output (below), according to the "States",
> > r1==1 means P1() AFAICS would have already finished the read and set the
> > r1 register to 1.  Then I am wondering why it couldn't take the branch to set
> > *x to 2.  According to herd, r1 == 1 AND x == 1 is a perfectly valid state
> > for the below program. I still couldn't see in my mind how for the below
> > program, this is possible - in terms of compiler optimizations or other kinds
> > of ordering. Because there is a smp_mb() between the 2 plain writes in P0()
> > and P1() did establish that r1 is 1 in the positive case. :-/.  I am surely
> > missing something :-)
> > 
> > ---8<-----------------------
> > C Joel-put_pid
> > 
> > {}
> > 
> > P0(int *x, int *y)
> > {
> > 	*x = 1;
> > 	smp_mb();
> > 	*y = 1;
> > }
> > 
> > P1(int *x, int *y)
> > {
> > 	int r1;
> > 
> > 	r1 = READ_ONCE(*y);
> > 	if (r1)
> > 		WRITE_ONCE(*x, 2);
> > }
> > 
> > exists (1:r1=1 /\ ~x=2)
> > 
> > ---8<-----------------------
> > Output:
> > 
> > Test Joel-put_pid Allowed
> > States 3
> > 1:r1=0; x=1;
> > 1:r1=1; x=1;	<-- Can't figure out why r1=1 and x != 2 here.
> 
> I must defer to Alan on this, but my guess is that this is due to
> the fact that there is a data race.

Yes, and because the plain-access/data-race patch for the LKMM was
intended only to detect data races, not to be aware of all the kinds of
ordering that plain accesses can induce.  I said as much at the start
of the patch's cover letter and it bears repeating.

In this case it is certainly true that the "*x = 1" write must be
complete before the value of *y is changed from 0.  Hence P1's
WRITE_ONCE will certainly overwrite the 1 with 2, and the final value
of *x will be 2 if r1=1.

But the notion of "visibility" of a write that I put into the LKMM
patch only allows for chains of marked accesses.  Since "*y = 1" is a
plain access, the model does not recognize that it can enforce the
ordering between the two writes to *x.

Also, remember that the LKMM's predictions about whether an outcome
will or will not occur are meaningless if a data race is present.
Therefore the most fundamental answer to why the "1:r1=1; x=1;"
line is there is basically what Paul said: It's there because the
herd model gets completely messed up by data races.

Alan Stern
Joel Fernandes April 4, 2019, 6:08 p.m. UTC | #26
Thanks a lot, Paul and Alan. I replied below.

On Thu, Apr 04, 2019 at 12:01:32PM -0400, Alan Stern wrote:
> On Thu, 4 Apr 2019, Paul E. McKenney wrote:
> 
> > On Mon, Apr 01, 2019 at 05:11:39PM -0400, Joel Fernandes wrote:
> 
> > > > > So I would have expected the following litmus to result in Never, but it
> > > > > doesn't with Alan's patch:
> > > > > 
> > > > > P0(int *x, int *y)
> > > > > {
> > > > > 	*x = 1;
> > > > > 	smp_mb();
> > > > > 	*y = 1;
> > > > > }
> > > > > 
> > > > > P1(int *x, int *y)
> > > > > {
> > > > > 	int r1;
> > > > > 
> > > > > 	r1 = READ_ONCE(*y);
> > > > > 	if (r1)
> > > > > 		WRITE_ONCE(*x, 2);
> > > > > }
> > > > > 
> > > > > exists (1:r1=1 /\ ~x=2)
> > > > 
> > > > The problem is that the compiler can turn both of P0()'s writes into reads:
> > > > 
> > > > P0(int *x, int *y)
> > > > {
> > > > 	if (*x != 1)
> > > > 		*x = 1;
> > > > 	smp_mb();
> > > > 	if (*y != 1)
> > > > 		*y = 1;
> > > > }
> > > > 
> > > > These reads will not be ordered by smp_wmb(), so you have to do WRITE_ONCE()
> > > > to get an iron-clad ordering guarantee.
> > > 
> > > But the snippet above has smp_mb() which does order writes, even for the
> > > plain accesses.
> > 
> > True, but that code has a data race, namely P0()'s plain write to y
> > races with P1()'s READ_ONCE().  Data races give the compiler absolutely
> > astonishing levels of freedom to rearrange your code.  So, if you
> > as a developer or maintainer choose to have data races, it is your
> > responsibility to ensure that the compiler cannot mess you up.  So what
> > you should do in that case is to list the transformed code the compiler's
> > optimizations can produce and feed the corresponding litmus tests to LKMM,
> > using READ_ONCE() and WRITE_ONCE() for the post-optimization accesses.
> 
> In this case, I strongly suspect the compiler would not mess up the
> code badly enough to allow the unwanted end result.  But the LKMM tries
> to avoid making strong assumptions about what compilers will or will
> not do.

Here I was just trying to understand better how any kind of code
transformation can cause an issue. Certainly I can see an issue if the
compiler uses the memory location as a temporary variable, as Paul pointed
out, in which case I agree that WRITE_ONCE() is better. I am in favor of
being careful; I was simply trying to understand the mechanism.

> > > I understand. Are we talking about load/store tearing being the issue here?
> > > Even under load/store tearing, I feel the program will produce correct
> > > results because r1 is either 0 or 1 (a single bit cannot be torn).
> > 
> > The compiler can (at least in theory) also transform *y = 1 as follows:
> > 
> > 	*y = r1; /* Use *y as temporary storage. */
> > 	do_something();
> > 	r1 = *y;
> > 	*y = 1;
> > 
> > Here r1 is some register and do_something() is an inline function visible
> > to the compiler (hence not implying barrier() upon call and return).
> > 
> > I don't know of any compilers that actually do this, but for me use
> > of WRITE_ONCE() is a small price to pay to prevent all past, present,
> > and future compilers from ever destroying^W optimizing my code in this way.
> > 
> > In your own code, if you dislike WRITE_ONCE() so much that you are willing
> > to commit to (now and forever!!!) checking all applicable versions of
> > compilers to make sure that they don't do this, well and good, knock
> > yourself out.  But it is your responsibility to do that checking, and you
> > can attest to LKMM that you have done so by giving LKMM litmus tests that
> > substitute WRITE_ONCE() for that plain C-language assignment statement.
> 
> Remember that the LKMM does not produce strict bounds.  That is, the
> LKMM will say that some outcomes are allowed even though no existing
> compiler/CPU combination would ever actually produce them.  This litmus
> test is an example.

Ok.

> > > Further, from the herd simulator output (below), according to the "States",
> > > r1==1 means P1() AFAICS would have already finished the read and set the
> > > r1 register to 1.  Then I am wondering why it couldn't take the branch to set
> > > *x to 2.  According to herd, r1 == 1 AND x == 1 is a perfectly valid state
> > > for the below program. I still couldn't see in my mind how for the below
> > > program, this is possible - in terms of compiler optimizations or other kinds
> > > of ordering. Because there is a smp_mb() between the 2 plain writes in P0()
> > > and P1() did establish that r1 is 1 in the positive case. :-/.  I am surely
> > > missing something :-)
> > > 
> > > ---8<-----------------------
> > > C Joel-put_pid
> > > 
> > > {}
> > > 
> > > P0(int *x, int *y)
> > > {
> > > 	*x = 1;
> > > 	smp_mb();
> > > 	*y = 1;
> > > }
> > > 
> > > P1(int *x, int *y)
> > > {
> > > 	int r1;
> > > 
> > > 	r1 = READ_ONCE(*y);
> > > 	if (r1)
> > > 		WRITE_ONCE(*x, 2);
> > > }
> > > 
> > > exists (1:r1=1 /\ ~x=2)
> > > 
> > > ---8<-----------------------
> > > Output:
> > > 
> > > Test Joel-put_pid Allowed
> > > States 3
> > > 1:r1=0; x=1;
> > > 1:r1=1; x=1;	<-- Can't figure out why r1=1 and x != 2 here.
> > 
> > I must defer to Alan on this, but my guess is that this is due to
> > the fact that there is a data race.
> 
> Yes, and because the plain-access/data-race patch for the LKMM was
> intended only to detect data races, not to be aware of all the kinds of
> ordering that plain accesses can induce.  I said as much at the start
> of the patch's cover letter and it bears repeating.
> 
> In this case it is certainly true that the "*x = 1" write must be
> complete before the value of *y is changed from 0.  Hence P1's
> WRITE_ONCE will certainly overwrite the 1 with 2, and the final value
> of *x will be 2 if r1=1.
> 
> But the notion of "visibility" of a write that I put into the LKMM
> patch only allows for chains of marked accesses.  Since "*y = 1" is a
> plain access, the model does not recognize that it can enforce the
> ordering between the two writes to *x.
> 
> Also, you must remember, the LKMM's predictions about whether an outcome
> will or will not occur are meaningless if a data race is present.
> Therefore the most fundamental answer to why the "1:r1=1; x=1;"
> line is there is basically what Paul said: It's there because the
> herd model gets completely messed up by data races.

Makes sense to me. Thanks for the good work on this.

FWIW, thought to mention (feel free to ignore the suggestion if it's
meaningless): if there is any chance that the outcome can be output more
clearly, like r1=X; x=1; where X stands for the result of a data race, that
would be lovely.  I don't know much about herd internals (yet) to say if the
suggestion makes sense but as a user, it would certainly help reduce
confusion.

thanks,

 - Joel
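[Editorial note: the race-free variant discussed above, with the plain "*y = 1" store marked, can be sketched as a userspace C11 program. This is an illustrative analogue only, not herd or kernel code: the pthread harness, the run_trials() helper, and the acquire load (standing in for READ_ONCE() plus the control dependency, and slightly stronger than relaxed so the C11 ordering argument is airtight) are assumptions of this sketch.]

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

static atomic_int x, y;

/* P0: both stores marked, separated by a full fence (smp_mb() analogue). */
static void *p0(void *arg)
{
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&y, 1, memory_order_relaxed); /* no longer plain */
	return NULL;
}

/* P1: acquire load pairs with P0's fence, so reading y == 1 orders
 * P0's x = 1 before this thread's x = 2 in x's modification order. */
static void *p1(void *arg)
{
	int r1 = atomic_load_explicit(&y, memory_order_acquire);
	if (r1)
		atomic_store_explicit(&x, 2, memory_order_relaxed);
	return (void *)(intptr_t)r1;
}

/* Run the test n times; count occurrences of the "exists" clause
 * (r1 == 1 && x != 2), which should never be observed. */
static int run_trials(int n)
{
	int forbidden = 0;

	for (int i = 0; i < n; i++) {
		pthread_t t0, t1;
		void *ret;

		atomic_store(&x, 0);
		atomic_store(&y, 0);
		pthread_create(&t0, NULL, p0, NULL);
		pthread_create(&t1, NULL, p1, NULL);
		pthread_join(t0, NULL);
		pthread_join(t1, &ret);

		if ((int)(intptr_t)ret == 1 && atomic_load(&x) != 2)
			forbidden++;
	}
	return forbidden;
}
```

With the store to y marked, the data race is gone and the forbidden outcome cannot occur in the C11 model either, matching the intuition discussed in the thread.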
Paul E. McKenney April 4, 2019, 6:19 p.m. UTC | #27
On Thu, Apr 04, 2019 at 02:08:42PM -0400, Joel Fernandes wrote:
> Thanks a lot Paul and Alan, I replied below.
> 
> On Thu, Apr 04, 2019 at 12:01:32PM -0400, Alan Stern wrote:
> > On Thu, 4 Apr 2019, Paul E. McKenney wrote:
> > 
> > > On Mon, Apr 01, 2019 at 05:11:39PM -0400, Joel Fernandes wrote:
> > 
> > > > > > So I would have expected the following litmus to result in Never, but it
> > > > > > doesn't with Alan's patch:
> > > > > > 
> > > > > > P0(int *x, int *y)
> > > > > > {
> > > > > > 	*x = 1;
> > > > > > 	smp_mb();
> > > > > > 	*y = 1;
> > > > > > }
> > > > > > 
> > > > > > P1(int *x, int *y)
> > > > > > {
> > > > > > 	int r1;
> > > > > > 
> > > > > > 	r1 = READ_ONCE(*y);
> > > > > > 	if (r1)
> > > > > > 		WRITE_ONCE(*x, 2);
> > > > > > }
> > > > > > 
> > > > > > exists (1:r1=1 /\ ~x=2)
> > > > > 
> > > > > The problem is that the compiler can turn both of P0()'s writes into reads:
> > > > > 
> > > > > P0(int *x, int *y)
> > > > > {
> > > > > 	if (*x != 1)
> > > > > 		*x = 1;
> > > > > 	smp_mb();
> > > > > 	if (*y != 1)
> > > > > 		*y = 1;
> > > > > }
> > > > > 
> > > > > These reads will not be ordered by smp_wmb(), so you have to do WRITE_ONCE()
> > > > > to get an iron-clad ordering guarantee.
> > > > 
> > > > But the snippet above has smp_mb() which does order writes, even for the
> > > > plain accesses.
> > > 
> > > True, but that code has a data race, namely P0()'s plain write to y
> > > races with P1()'s READ_ONCE().  Data races give the compiler absolutely
> > > astonishing levels of freedom to rearrange your code.  So, if you
> > > as a developer or maintainer choose to have data races, it is your
> > > responsibility to ensure that the compiler cannot mess you up.  So what
> > > you should do in that case is to list the transformed code the compiler's
> > > optimizations can produce and feed the corresponding litmus tests to LKMM,
> > > using READ_ONCE() and WRITE_ONCE() for the post-optimization accesses.
> > 
> > In this case, I strongly suspect the compiler would not mess up the
> > code badly enough to allow the unwanted end result.  But the LKMM tries
> > to avoid making strong assumptions about what compilers will or will
> > not do.
> 
> Here I was just trying to understand better how any kind of code
> transformation can cause an issue. Certainly I can see an issue if the
> compiler uses the memory location as a temporary variable, as Paul pointed
> out, in which case I agree that WRITE_ONCE() is better. I am in favor of being careful,
> but here I was just trying to understand this better.
> 
> > > > I understand. Are we talking about load/store tearing being the issue here?
> > > > Even under load/store tearing, I feel the program will produce correct
> > > > results because r1 is either 0 or 1 (a single bit cannot be torn).
> > > 
> > > The compiler can (at least in theory) also transform *y = 1 as follows:
> > > 
> > > 	*y = r1; /* Use *y as temporary storage. */
> > > 	do_something();
> > > 	r1 = *y;
> > > 	*y = 1;
> > > 
> > > Here r1 is some register and do_something() is an inline function visible
> > > to the compiler (hence not implying barrier() upon call and return).
> > > 
> > > I don't know of any compilers that actually do this, but for me use
> > > of WRITE_ONCE() is a small price to pay to prevent all past, present,
> > > and future compilers from ever destroying^W optimizing my code in this way.
> > > 
> > > In your own code, if you dislike WRITE_ONCE() so much that you are willing
> > > to commit to (now and forever!!!) checking all applicable versions of
> > > compilers to make sure that they don't do this, well and good, knock
> > > yourself out.  But it is your responsibility to do that checking, and you
> > > can attest to LKMM that you have done so by giving LKMM litmus tests that
> > > substitute WRITE_ONCE() for that plain C-language assignment statement.
> > 
> > Remember that the LKMM does not produce strict bounds.  That is, the
> > LKMM will say that some outcomes are allowed even though no existing
> > compiler/CPU combination would ever actually produce them.  This litmus
> > test is an example.
> 
> Ok.
> 
> > > > Further, from the herd simulator output (below), according to the "States",
> > > > r1==1 means P1() AFAICS would have already finished the read and set the
> > > > r1 register to 1.  Then I am wondering why it couldn't take the branch to set
> > > > *x to 2.  According to herd, r1 == 1 AND x == 1 is a perfectly valid state
> > > > for the below program. I still couldn't see in my mind how for the below
> > > > program, this is possible - in terms of compiler optimizations or other kinds
> > > > of ordering. Because there is a smp_mb() between the 2 plain writes in P0()
> > > > and P1() did establish that r1 is 1 in the positive case. :-/.  I am surely
> > > > missing something :-)
> > > > 
> > > > ---8<-----------------------
> > > > C Joel-put_pid
> > > > 
> > > > {}
> > > > 
> > > > P0(int *x, int *y)
> > > > {
> > > > 	*x = 1;
> > > > 	smp_mb();
> > > > 	*y = 1;
> > > > }
> > > > 
> > > > P1(int *x, int *y)
> > > > {
> > > > 	int r1;
> > > > 
> > > > 	r1 = READ_ONCE(*y);
> > > > 	if (r1)
> > > > 		WRITE_ONCE(*x, 2);
> > > > }
> > > > 
> > > > exists (1:r1=1 /\ ~x=2)
> > > > 
> > > > ---8<-----------------------
> > > > Output:
> > > > 
> > > > Test Joel-put_pid Allowed
> > > > States 3
> > > > 1:r1=0; x=1;
> > > > 1:r1=1; x=1;	<-- Can't figure out why r1=1 and x != 2 here.
> > > 
> > > I must defer to Alan on this, but my guess is that this is due to
> > > the fact that there is a data race.
> > 
> > Yes, and because the plain-access/data-race patch for the LKMM was
> > intended only to detect data races, not to be aware of all the kinds of
> > ordering that plain accesses can induce.  I said as much at the start
> > of the patch's cover letter and it bears repeating.
> > 
> > In this case it is certainly true that the "*x = 1" write must be
> > complete before the value of *y is changed from 0.  Hence P1's
> > WRITE_ONCE will certainly overwrite the 1 with 2, and the final value
> > of *x will be 2 if r1=1.
> > 
> > But the notion of "visibility" of a write that I put into the LKMM
> > patch only allows for chains of marked accesses.  Since "*y = 1" is a
> > plain access, the model does not recognize that it can enforce the
> > ordering between the two writes to *x.
> > 
> > Also, you must remember, the LKMM's predictions about whether an outcome
> > will or will not occur are meaningless if a data race is present.
> > Therefore the most fundamental answer to why the "1:r1=1; x=1;"
> > line is there is basically what Paul said: It's there because the
> > herd model gets completely messed up by data races.
> 
> Makes sense to me. Thanks for the good work on this.
> 
> FWIW, thought to mention (feel free to ignore the suggestion if it's
> meaningless): if there is any chance that the outcome can be output more
> clearly, like r1=X; x=1; where X stands for the result of a data race, that
> would be lovely.  I don't know much about herd internals (yet) to say if the
> suggestion makes sense but as a user, it would certainly help reduce
> confusion.

The "Flag data-race" that appeared in the herd output is your friend in
this case.  If you see that in the output, that means that herd detected
a data race, and the states output might or might not be reliable.

							Thanx, Paul
Alan Stern April 4, 2019, 7:09 p.m. UTC | #28
On Thu, 4 Apr 2019, Joel Fernandes wrote:

> FWIW, thought to mention (feel free to ignore the suggestion if it's
> meaningless): if there is any chance that the outcome can be output more
> clearly, like r1=X; x=1; where X stands for the result of a data race, that
> would be lovely.  I don't know much about herd internals (yet) to say if the
> suggestion makes sense but as a user, it would certainly help reduce
> confusion.

I don't think that is feasible.  For one thing, according to the C 
Standard, in the presence of a data race a program is allowed to do 
anything at all.  There's no point trying to display all possible 
outcomes of an execution!

If we adopted the proposal that certain kinds of data races only result
in undefined values, things would be a little better.  It's possible
that the herd program could be changed to support undefined values.  
At present, however, it does not.

Alan Stern
Joel Fernandes April 4, 2019, 8:31 p.m. UTC | #29
On Thu, Apr 04, 2019 at 11:19:46AM -0700, Paul E. McKenney wrote:
[snip]
> > > > > Further, from the herd simulator output (below), according to the "States",
> > > > > r1==1 means P1() AFAICS would have already finished the read and set the
> > > > > r1 register to 1.  Then I am wondering why it couldn't take the branch to set
> > > > > *x to 2.  According to herd, r1 == 1 AND x == 1 is a perfectly valid state
> > > > > for the below program. I still couldn't see in my mind how for the below
> > > > > program, this is possible - in terms of compiler optimizations or other kinds
> > > > > of ordering. Because there is a smp_mb() between the 2 plain writes in P0()
> > > > > and P1() did establish that r1 is 1 in the positive case. :-/.  I am surely
> > > > > missing something :-)
> > > > > 
> > > > > ---8<-----------------------
> > > > > C Joel-put_pid
> > > > > 
> > > > > {}
> > > > > 
> > > > > P0(int *x, int *y)
> > > > > {
> > > > > 	*x = 1;
> > > > > 	smp_mb();
> > > > > 	*y = 1;
> > > > > }
> > > > > 
> > > > > P1(int *x, int *y)
> > > > > {
> > > > > 	int r1;
> > > > > 
> > > > > 	r1 = READ_ONCE(*y);
> > > > > 	if (r1)
> > > > > 		WRITE_ONCE(*x, 2);
> > > > > }
> > > > > 
> > > > > exists (1:r1=1 /\ ~x=2)
> > > > > 
> > > > > ---8<-----------------------
> > > > > Output:
> > > > > 
> > > > > Test Joel-put_pid Allowed
> > > > > States 3
> > > > > 1:r1=0; x=1;
> > > > > 1:r1=1; x=1;	<-- Can't figure out why r1=1 and x != 2 here.
> > > > 
> > > > I must defer to Alan on this, but my guess is that this is due to
> > > > the fact that there is a data race.
> > > 
> > > Yes, and because the plain-access/data-race patch for the LKMM was
> > > intended only to detect data races, not to be aware of all the kinds of
> > > ordering that plain accesses can induce.  I said as much at the start
> > > of the patch's cover letter and it bears repeating.
> > > 
> > > In this case it is certainly true that the "*x = 1" write must be
> > > complete before the value of *y is changed from 0.  Hence P1's
> > > WRITE_ONCE will certainly overwrite the 1 with 2, and the final value
> > > of *x will be 2 if r1=1.
> > > 
> > > But the notion of "visibility" of a write that I put into the LKMM
> > > patch only allows for chains of marked accesses.  Since "*y = 1" is a
> > > plain access, the model does not recognize that it can enforce the
> > > ordering between the two writes to *x.
> > > 
> > > Also, you must remember, the LKMM's predictions about whether an outcome
> > > will or will not occur are meaningless if a data race is present.
> > > Therefore the most fundamental answer to why the "1:r1=1; x=1;"
> > > line is there is basically what Paul said: It's there because the
> > > herd model gets completely messed up by data races.
> > 
> > Makes sense to me. Thanks for the good work on this.
> > 
> > FWIW, thought to mention (feel free to ignore the suggestion if it's
> > meaningless): if there is any chance that the outcome can be output more
> > clearly, like r1=X; x=1; where X stands for the result of a data race, that
> > would be lovely.  I don't know much about herd internals (yet) to say if the
> > suggestion makes sense but as a user, it would certainly help reduce
> > confusion.
> 
> The "Flag data-race" that appeared in the herd output is your friend in
> this case.  If you see that in the output, that means that herd detected
> a data race, and the states output might or might not be reliable.

Thanks Paul and Alan, the "Flag data-race" indication sounds good to me. I
will watch out for that :)
 - Joel

Patch
diff mbox series

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 14a9a39da9c7..8cb86d377ff5 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -3,6 +3,7 @@ 
 #define _LINUX_PID_H
 
 #include <linux/rculist.h>
+#include <linux/refcount.h>
 
 enum pid_type
 {
@@ -56,7 +57,7 @@  struct upid {
 
 struct pid
 {
-	atomic_t count;
+	refcount_t count;
 	unsigned int level;
 	/* lists of tasks that use this pid */
 	struct hlist_head tasks[PIDTYPE_MAX];
@@ -69,7 +70,7 @@  extern struct pid init_struct_pid;
 static inline struct pid *get_pid(struct pid *pid)
 {
 	if (pid)
-		atomic_inc(&pid->count);
+		refcount_inc(&pid->count);
 	return pid;
 }
 
diff --git a/kernel/pid.c b/kernel/pid.c
index 20881598bdfa..2095c7da644d 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -37,7 +37,7 @@ 
 #include <linux/init_task.h>
 #include <linux/syscalls.h>
 #include <linux/proc_ns.h>
-#include <linux/proc_fs.h>
+#include <linux/refcount.h>
 #include <linux/sched/task.h>
 #include <linux/idr.h>
 
@@ -106,8 +106,8 @@  void put_pid(struct pid *pid)
 		return;
 
 	ns = pid->numbers[pid->level].ns;
-	if ((atomic_read(&pid->count) == 1) ||
-	     atomic_dec_and_test(&pid->count)) {
+	if ((refcount_read(&pid->count) == 1) ||
+	     refcount_dec_and_test(&pid->count)) {
 		kmem_cache_free(ns->pid_cachep, pid);
 		put_pid_ns(ns);
 	}
@@ -210,7 +210,7 @@  struct pid *alloc_pid(struct pid_namespace *ns)
 	}
 
 	get_pid_ns(ns);
-	atomic_set(&pid->count, 1);
+	refcount_set(&pid->count, 1);
 	for (type = 0; type < PIDTYPE_MAX; ++type)
 		INIT_HLIST_HEAD(&pid->tasks[type]);
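[Editorial note: as a rough illustration of the extra checking the patch buys, the refcount_t failure modes can be sketched as a single-threaded userspace toy model. All names here are made up for illustration; the real implementation lives in lib/refcount.c, is atomic, and WARNs where this model merely returns false.]

```c
#include <limits.h>
#include <stdbool.h>

/* Toy model of refcount_t's checks: not atomic, names are hypothetical. */
#define MODEL_REFCOUNT_SATURATED INT_MIN

struct model_refcount { int val; };

/* Increment: refuse to resurrect a freed object (inc-from-zero) and
 * saturate instead of overflowing, pinning the object forever. */
static bool model_refcount_inc(struct model_refcount *r)
{
	if (r->val == 0 || r->val == MODEL_REFCOUNT_SATURATED) {
		r->val = MODEL_REFCOUNT_SATURATED; /* poison: never freeable */
		return false;
	}
	if (r->val == INT_MAX) {
		r->val = MODEL_REFCOUNT_SATURATED; /* saturate on overflow */
		return false;
	}
	r->val++;
	return true;
}

/* Decrement and report whether the count hit zero (caller frees).
 * A saturated or already-zero counter never reports "free": leaking
 * memory is preferred to a use-after-free. */
static bool model_refcount_dec_and_test(struct model_refcount *r)
{
	if (r->val == MODEL_REFCOUNT_SATURATED || r->val == 0)
		return false;
	return --r->val == 0;
}
```

A plain atomic_t, by contrast, would happily wrap from INT_MAX to negative values or increment from zero, which is exactly the use-after-free pattern refcount_t is designed to catch.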