Message ID | 20190327145331.215360-1-joel@joelfernandes.org (mailing list archive) |
---|---|
State | New, archived |
Series | Convert struct pid count to refcount_t |
On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> struct pid's count is an atomic_t field used as a refcount. Use
> refcount_t for it which is basically atomic_t but does additional
> checking to prevent use-after-free bugs. No change in behavior if
> CONFIG_REFCOUNT_FULL=n.
>
> Cc: keescook@chromium.org
> Cc: kernel-team@android.com
> Cc: kernel-hardening@lists.openwall.com
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> [...]
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 20881598bdfa..2095c7da644d 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -37,7 +37,7 @@
>  #include <linux/init_task.h>
>  #include <linux/syscalls.h>
>  #include <linux/proc_ns.h>
> -#include <linux/proc_fs.h>
> +#include <linux/refcount.h>
>  #include <linux/sched/task.h>
>  #include <linux/idr.h>
>
> @@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
>  		return;
>
>  	ns = pid->numbers[pid->level].ns;
> -	if ((atomic_read(&pid->count) == 1) ||
> -	    atomic_dec_and_test(&pid->count)) {
> +	if ((refcount_read(&pid->count) == 1) ||
> +	    refcount_dec_and_test(&pid->count)) {

Why is this (and the original code) safe in the face of a race against
get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I
don't see this code pattern anywhere else in the kernel.

Beyond that, yes please. :)

-Kees

>  		kmem_cache_free(ns->pid_cachep, pid);
>  		put_pid_ns(ns);
>  	}
> @@ -210,7 +210,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>  	}
>
>  	get_pid_ns(ns);
> -	atomic_set(&pid->count, 1);
> +	refcount_set(&pid->count, 1);
>  	for (type = 0; type < PIDTYPE_MAX; ++type)
>  		INIT_HLIST_HEAD(&pid->tasks[type]);
>
> --
> 2.21.0.392.gf8f6787159e-goog

--
Kees Cook
On Thu, Mar 28, 2019 at 1:06 AM Kees Cook <keescook@chromium.org> wrote: > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google) > <joel@joelfernandes.org> wrote: > > > > struct pid's count is an atomic_t field used as a refcount. Use > > refcount_t for it which is basically atomic_t but does additional > > checking to prevent use-after-free bugs. No change in behavior if > > CONFIG_REFCOUNT_FULL=n. > > > > Cc: keescook@chromium.org > > Cc: kernel-team@android.com > > Cc: kernel-hardening@lists.openwall.com > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > [...] > > diff --git a/kernel/pid.c b/kernel/pid.c > > index 20881598bdfa..2095c7da644d 100644 > > --- a/kernel/pid.c > > +++ b/kernel/pid.c > > @@ -37,7 +37,7 @@ > > #include <linux/init_task.h> > > #include <linux/syscalls.h> > > #include <linux/proc_ns.h> > > -#include <linux/proc_fs.h> > > +#include <linux/refcount.h> > > #include <linux/sched/task.h> > > #include <linux/idr.h> > > > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid) > > return; > > > > ns = pid->numbers[pid->level].ns; > > - if ((atomic_read(&pid->count) == 1) || > > - atomic_dec_and_test(&pid->count)) { > > + if ((refcount_read(&pid->count) == 1) || > > + refcount_dec_and_test(&pid->count)) { > > Why is this (and the original code) safe in the face of a race against > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I > don't see this code pattern anywhere else in the kernel. Semantically, it doesn't make a difference whether you do this or leave out the "refcount_read(&pid->count) == 1". If you read a 1 from refcount_read(), then you have the only reference to "struct pid", and therefore you want to free it. If you don't get a 1, you have to atomically drop a reference, which, if someone else is concurrently also dropping a reference, may leave you with the last reference (in the case where refcount_dec_and_test() returns true), in which case you still have to take care of freeing it. 
My guess is that the goal of this is to make the "drop last reference" case a little bit faster by avoiding the cacheline dirtying and the atomic op, at the expense of an extra memory op and branch every time we drop a non-final reference. But that's a pretty low-level optimization, and forking by itself isn't exactly fast... I think the clean thing to do would be to either move this detail into the refcount implementation (if it turns out to actually be valuable in at least a microbenchmark), or just get rid of it. Given the overhead of fork()/clone(), I would be surprised if you could actually measure this effect here. Eric, can you remember the rationale for doing it that way in commit 92476d7fc0326a409ab1d3864a04093a6be9aca7? Am I guessing correctly?
On Thu, Mar 28, 2019 at 01:59:45AM +0100, Jann Horn wrote: > On Thu, Mar 28, 2019 at 1:06 AM Kees Cook <keescook@chromium.org> wrote: > > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google) > > <joel@joelfernandes.org> wrote: > > > > > > struct pid's count is an atomic_t field used as a refcount. Use > > > refcount_t for it which is basically atomic_t but does additional > > > checking to prevent use-after-free bugs. No change in behavior if > > > CONFIG_REFCOUNT_FULL=n. > > > > > > Cc: keescook@chromium.org > > > Cc: kernel-team@android.com > > > Cc: kernel-hardening@lists.openwall.com > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > [...] > > > diff --git a/kernel/pid.c b/kernel/pid.c > > > index 20881598bdfa..2095c7da644d 100644 > > > --- a/kernel/pid.c > > > +++ b/kernel/pid.c > > > @@ -37,7 +37,7 @@ > > > #include <linux/init_task.h> > > > #include <linux/syscalls.h> > > > #include <linux/proc_ns.h> > > > -#include <linux/proc_fs.h> > > > +#include <linux/refcount.h> > > > #include <linux/sched/task.h> > > > #include <linux/idr.h> > > > > > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid) > > > return; > > > > > > ns = pid->numbers[pid->level].ns; > > > - if ((atomic_read(&pid->count) == 1) || > > > - atomic_dec_and_test(&pid->count)) { > > > + if ((refcount_read(&pid->count) == 1) || > > > + refcount_dec_and_test(&pid->count)) { > > > > Why is this (and the original code) safe in the face of a race against > > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I > > don't see this code pattern anywhere else in the kernel. > > Semantically, it doesn't make a difference whether you do this or > leave out the "refcount_read(&pid->count) == 1". If you read a 1 from > refcount_read(), then you have the only reference to "struct pid", and > therefore you want to free it. 
> If you don't get a 1, you have to
> atomically drop a reference, which, if someone else is concurrently
> also dropping a reference, may leave you with the last reference (in
> the case where refcount_dec_and_test() returns true), in which case
> you still have to take care of freeing it.

Also, based on Kees comment, I think it appears to me that get_pid and
put_pid can race in this way in the original code right?

	get_pid				put_pid

					atomic_dec_and_test returns 1
	atomic_inc
					kfree

	deref pid /* boom */
	-------------------------------------------------

I think get_pid needs to call atomic_inc_not_zero() and put_pid should
not test for pid->count == 1 as condition for freeing, but rather just do
atomic_dec_and_test. So something like the following diff. (And I see a
similar pattern used in drivers/net/mac.c)

Is the above scenario valid? I didn't see any locking around get_pid or
put_pid to avoid such a race.

---8<-----------------------

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 8cb86d377ff5..3d79834e3180 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -69,8 +69,8 @@ extern struct pid init_struct_pid;
 
 static inline struct pid *get_pid(struct pid *pid)
 {
-	if (pid)
-		refcount_inc(&pid->count);
+	if (!pid || !refcount_inc_not_zero(&pid->count))
+		return NULL;
 	return pid;
 }
 
diff --git a/kernel/pid.c b/kernel/pid.c
index 2095c7da644d..89c4849fab5d 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -106,8 +106,7 @@ void put_pid(struct pid *pid)
 		return;
 
 	ns = pid->numbers[pid->level].ns;
-	if ((refcount_read(&pid->count) == 1) ||
-	    refcount_dec_and_test(&pid->count)) {
+	if (refcount_dec_and_test(&pid->count)) {
 		kmem_cache_free(ns->pid_cachep, pid);
 		put_pid_ns(ns);
 	}
On Thu, Mar 28, 2019 at 3:34 AM Joel Fernandes <joel@joelfernandes.org> wrote: > On Thu, Mar 28, 2019 at 01:59:45AM +0100, Jann Horn wrote: > > On Thu, Mar 28, 2019 at 1:06 AM Kees Cook <keescook@chromium.org> wrote: > > > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google) > > > <joel@joelfernandes.org> wrote: > > > > > > > > struct pid's count is an atomic_t field used as a refcount. Use > > > > refcount_t for it which is basically atomic_t but does additional > > > > checking to prevent use-after-free bugs. No change in behavior if > > > > CONFIG_REFCOUNT_FULL=n. > > > > > > > > Cc: keescook@chromium.org > > > > Cc: kernel-team@android.com > > > > Cc: kernel-hardening@lists.openwall.com > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > [...] > > > > diff --git a/kernel/pid.c b/kernel/pid.c > > > > index 20881598bdfa..2095c7da644d 100644 > > > > --- a/kernel/pid.c > > > > +++ b/kernel/pid.c > > > > @@ -37,7 +37,7 @@ > > > > #include <linux/init_task.h> > > > > #include <linux/syscalls.h> > > > > #include <linux/proc_ns.h> > > > > -#include <linux/proc_fs.h> > > > > +#include <linux/refcount.h> > > > > #include <linux/sched/task.h> > > > > #include <linux/idr.h> > > > > > > > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid) > > > > return; > > > > > > > > ns = pid->numbers[pid->level].ns; > > > > - if ((atomic_read(&pid->count) == 1) || > > > > - atomic_dec_and_test(&pid->count)) { > > > > + if ((refcount_read(&pid->count) == 1) || > > > > + refcount_dec_and_test(&pid->count)) { > > > > > > Why is this (and the original code) safe in the face of a race against > > > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I > > > don't see this code pattern anywhere else in the kernel. > > > > Semantically, it doesn't make a difference whether you do this or > > leave out the "refcount_read(&pid->count) == 1". 
If you read a 1 from > > refcount_read(), then you have the only reference to "struct pid", and > > therefore you want to free it. If you don't get a 1, you have to > > atomically drop a reference, which, if someone else is concurrently > > also dropping a reference, may leave you with the last reference (in > > the case where refcount_dec_and_test() returns true), in which case > > you still have to take care of freeing it. > > Also, based on Kees comment, I think it appears to me that get_pid and > put_pid can race in this way in the original code right? > > get_pid put_pid > > atomic_dec_and_test returns 1 This can't happen. get_pid() can only be called on an existing reference. If you are calling get_pid() on an existing reference, and someone else is dropping another reference with put_pid(), then when both functions start running, the refcount must be at least 2. > atomic_inc > kfree > > deref pid /* boom */ > ------------------------------------------------- > > I think get_pid needs to call atomic_inc_not_zero() and put_pid should > not test for pid->count == 1 as condition for freeing, but rather just do > atomic_dec_and_test. So something like the following diff. (And I see a > similar pattern used in drivers/net/mac.c) get_pid() can only be called when you already have a refcounted reference; in other words, when the reference count is at least one. The lifetime management of struct pid differs from the lifetime management of most other objects in the kernel; the usual patterns don't quite apply here. Look at put_pid(): When the refcount has reached zero, there is no RCU grace period (unlike most other objects with RCU-managed lifetimes). Instead, free_pid() has an RCU grace period *before* it invokes delayed_put_pid() to drop a reference; and free_pid() is also the function that removes a PID from the namespace's IDR, and it is used by __change_pid() when a task loses its reference on a PID. 
In other words: Most refcounted objects with RCU guarantee that the object waits for a grace period after its refcount has reached zero; and during the grace period, the refcount is zero and you're not allowed to increment it again. But for struct pid, the guarantee is instead that there is an RCU grace period after it has been removed from the IDRs and the task, and during the grace period, refcounting is guaranteed to still work normally. > Is the above scenario valid? I didn't see any locking around get_pid or > pud_pid to avoid such a race. > > ---8<----------------------- > > diff --git a/include/linux/pid.h b/include/linux/pid.h > index 8cb86d377ff5..3d79834e3180 100644 > --- a/include/linux/pid.h > +++ b/include/linux/pid.h > @@ -69,8 +69,8 @@ extern struct pid init_struct_pid; > > static inline struct pid *get_pid(struct pid *pid) > { > - if (pid) > - refcount_inc(&pid->count); > + if (!pid || !refcount_inc_not_zero(&pid->count)) > + return NULL; > return pid; > } Nope, this is wrong. Once the refcount is zero, the object goes away, refcount_inc_not_zero() makes no sense here. > > diff --git a/kernel/pid.c b/kernel/pid.c > index 2095c7da644d..89c4849fab5d 100644 > --- a/kernel/pid.c > +++ b/kernel/pid.c > @@ -106,8 +106,7 @@ void put_pid(struct pid *pid) > return; > > ns = pid->numbers[pid->level].ns; > - if ((refcount_read(&pid->count) == 1) || > - refcount_dec_and_test(&pid->count)) { > + if (refcount_dec_and_test(&pid->count)) { > kmem_cache_free(ns->pid_cachep, pid); > put_pid_ns(ns); > } > >
On 03/27, Joel Fernandes wrote:
>
> Also, based on Kees comment, I think it appears to me that get_pid and
> put_pid can race in this way in the original code right?
>
> get_pid				put_pid
>
>					atomic_dec_and_test returns 1
> atomic_inc
>					kfree
>
> deref pid /* boom */
> -------------------------------------------------
>
> I think get_pid needs to call atomic_inc_not_zero()

No.

get_pid() should only be used if you already have a reference or you do
something like

	rcu_read_lock();
	pid = find_vpid();
	get_pid();
	rcu_read_unlock();

in this case we rely on call_rcu(delayed_put_pid) which drops the initial
reference.

If put_pid() sees pid->count == 1, then a) nobody else has a reference and
b) nobody else can find this pid on rcu-protected lists, so it is safe to
free it.

Oleg.
On Thu, Mar 28, 2019 at 03:57:44AM +0100, Jann Horn wrote: > On Thu, Mar 28, 2019 at 3:34 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > On Thu, Mar 28, 2019 at 01:59:45AM +0100, Jann Horn wrote: > > > On Thu, Mar 28, 2019 at 1:06 AM Kees Cook <keescook@chromium.org> wrote: > > > > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google) > > > > <joel@joelfernandes.org> wrote: > > > > > > > > > > struct pid's count is an atomic_t field used as a refcount. Use > > > > > refcount_t for it which is basically atomic_t but does additional > > > > > checking to prevent use-after-free bugs. No change in behavior if > > > > > CONFIG_REFCOUNT_FULL=n. > > > > > > > > > > Cc: keescook@chromium.org > > > > > Cc: kernel-team@android.com > > > > > Cc: kernel-hardening@lists.openwall.com > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > > [...] > > > > > diff --git a/kernel/pid.c b/kernel/pid.c > > > > > index 20881598bdfa..2095c7da644d 100644 > > > > > --- a/kernel/pid.c > > > > > +++ b/kernel/pid.c > > > > > @@ -37,7 +37,7 @@ > > > > > #include <linux/init_task.h> > > > > > #include <linux/syscalls.h> > > > > > #include <linux/proc_ns.h> > > > > > -#include <linux/proc_fs.h> > > > > > +#include <linux/refcount.h> > > > > > #include <linux/sched/task.h> > > > > > #include <linux/idr.h> > > > > > > > > > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid) > > > > > return; > > > > > > > > > > ns = pid->numbers[pid->level].ns; > > > > > - if ((atomic_read(&pid->count) == 1) || > > > > > - atomic_dec_and_test(&pid->count)) { > > > > > + if ((refcount_read(&pid->count) == 1) || > > > > > + refcount_dec_and_test(&pid->count)) { > > > > > > > > Why is this (and the original code) safe in the face of a race against > > > > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I > > > > don't see this code pattern anywhere else in the kernel. 
> > > > > > Semantically, it doesn't make a difference whether you do this or > > > leave out the "refcount_read(&pid->count) == 1". If you read a 1 from > > > refcount_read(), then you have the only reference to "struct pid", and > > > therefore you want to free it. If you don't get a 1, you have to > > > atomically drop a reference, which, if someone else is concurrently > > > also dropping a reference, may leave you with the last reference (in > > > the case where refcount_dec_and_test() returns true), in which case > > > you still have to take care of freeing it. > > > > Also, based on Kees comment, I think it appears to me that get_pid and > > put_pid can race in this way in the original code right? > > > > get_pid put_pid > > > > atomic_dec_and_test returns 1 > > This can't happen. get_pid() can only be called on an existing > reference. If you are calling get_pid() on an existing reference, and > someone else is dropping another reference with put_pid(), then when > both functions start running, the refcount must be at least 2. Sigh, you are right. Ok. I was quite tired last night when I wrote this. Obviously, I should have waited a bit and thought it through. Kees can you describe more the race you had in mind? > > atomic_inc > > kfree > > > > deref pid /* boom */ > > ------------------------------------------------- > > > > I think get_pid needs to call atomic_inc_not_zero() and put_pid should > > not test for pid->count == 1 as condition for freeing, but rather just do > > atomic_dec_and_test. So something like the following diff. (And I see a > > similar pattern used in drivers/net/mac.c) > > get_pid() can only be called when you already have a refcounted > reference; in other words, when the reference count is at least one. > The lifetime management of struct pid differs from the lifetime > management of most other objects in the kernel; the usual patterns > don't quite apply here. 
>
> Look at put_pid(): When the refcount has reached zero, there is no RCU
> grace period (unlike most other objects with RCU-managed lifetimes).
> Instead, free_pid() has an RCU grace period *before* it invokes
> delayed_put_pid() to drop a reference; and free_pid() is also the
> function that removes a PID from the namespace's IDR, and it is used
> by __change_pid() when a task loses its reference on a PID.
>
> In other words: Most refcounted objects with RCU guarantee that the
> object waits for a grace period after its refcount has reached zero;
> and during the grace period, the refcount is zero and you're not
> allowed to increment it again.

Can you give an example of this "most refcounted objects with RCU" usecase?
I could not find any good examples of such. I want to document this pattern
and possibly submit to Documentation/RCU.

> But for struct pid, the guarantee is
> instead that there is an RCU grace period after it has been removed
> from the IDRs and the task, and during the grace period, refcounting
> is guaranteed to still work normally.

Ok, thanks. Here I think in scrappy but simple pseudo code form, the struct
pid flow is something like (replaced "pid" with "data"):

	get_data:
		atomic_inc(data->refcount);

	some_user_of_data:
		rcu_read_lock();
		From X, obtain a ptr to data using rcu_dereference.
		get_data(data);
		rcu_read_unlock();

	free_data:
		remove all references to data in all places in X
		call_rcu(put_data)

	put_data:
		if (atomic_dec_and_test(data->refcount)) {
			free(data);
		}

	create_data:
		data = alloc(..)
		atomic_set(data->refcount, 1);
		set pointers to data in X.

> > pud_pid to avoid such a race.
> >
> > ---8<-----------------------
> >
> > diff --git a/include/linux/pid.h b/include/linux/pid.h
> > index 8cb86d377ff5..3d79834e3180 100644
> > --- a/include/linux/pid.h
> > +++ b/include/linux/pid.h
> > @@ -69,8 +69,8 @@ extern struct pid init_struct_pid;
> >
> >  static inline struct pid *get_pid(struct pid *pid)
> >  {
> > -	if (pid)
> > -		refcount_inc(&pid->count);
> > +	if (!pid || !refcount_inc_not_zero(&pid->count))
> > +		return NULL;
> >  	return pid;
> >  }
>
> Nope, this is wrong. Once the refcount is zero, the object goes away,
> refcount_inc_not_zero() makes no sense here.

Yeah ok, I think what you meant here is that references to the object from
all places go away before the grace period starts, so a get_pid on an object
with refcount of zero is impossible since there's no way to *get* to that
object after the grace-period ends.

So, yes you are right that refcount_inc is all that's needed.

Also note to the onlooker, the original patch I sent is not wrong; it still
applies and is correct. We are just discussing here any possible issues
with the *existing* code.

thanks!

- Joel
On Thu, Mar 28, 2019 at 03:26:19PM +0100, Oleg Nesterov wrote: > On 03/27, Joel Fernandes wrote: > > > > Also, based on Kees comment, I think it appears to me that get_pid and > > put_pid can race in this way in the original code right? > > > > get_pid put_pid > > > > atomic_dec_and_test returns 1 > > atomic_inc > > kfree > > > > deref pid /* boom */ > > ------------------------------------------------- > > > > I think get_pid needs to call atomic_inc_not_zero() > > No. > > get_pid() should only be used if you already have a reference or you do > something like > > rcu_read_lock(); > pid = find_vpid(); > get_pid(); > rcu_read_lock(); > > in this case we rely on call_rcu(delayed_put_pid) which drops the initial > reference. > > If put_pid() sees pid->count == 1, then a) nobody else has a reference and > b) nobody else can find this pid on rcu-protected lists, so it is safe to > free it. I agree. Check my reply to Jann, I already replied to him about this. thanks!
Since we're just talking about RCU stuff now, adding Paul McKenney to the thread. On Thu, Mar 28, 2019 at 3:37 PM Joel Fernandes <joel@joelfernandes.org> wrote: > On Thu, Mar 28, 2019 at 03:57:44AM +0100, Jann Horn wrote: > > On Thu, Mar 28, 2019 at 3:34 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > > On Thu, Mar 28, 2019 at 01:59:45AM +0100, Jann Horn wrote: > > > > On Thu, Mar 28, 2019 at 1:06 AM Kees Cook <keescook@chromium.org> wrote: > > > > > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google) > > > > > <joel@joelfernandes.org> wrote: > > > > > > > > > > > > struct pid's count is an atomic_t field used as a refcount. Use > > > > > > refcount_t for it which is basically atomic_t but does additional > > > > > > checking to prevent use-after-free bugs. No change in behavior if > > > > > > CONFIG_REFCOUNT_FULL=n. > > > > > > > > > > > > Cc: keescook@chromium.org > > > > > > Cc: kernel-team@android.com > > > > > > Cc: kernel-hardening@lists.openwall.com > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > > > [...] 
> > > > > > diff --git a/kernel/pid.c b/kernel/pid.c > > > > > > index 20881598bdfa..2095c7da644d 100644 > > > > > > --- a/kernel/pid.c > > > > > > +++ b/kernel/pid.c > > > > > > @@ -37,7 +37,7 @@ > > > > > > #include <linux/init_task.h> > > > > > > #include <linux/syscalls.h> > > > > > > #include <linux/proc_ns.h> > > > > > > -#include <linux/proc_fs.h> > > > > > > +#include <linux/refcount.h> > > > > > > #include <linux/sched/task.h> > > > > > > #include <linux/idr.h> > > > > > > > > > > > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid) > > > > > > return; > > > > > > > > > > > > ns = pid->numbers[pid->level].ns; > > > > > > - if ((atomic_read(&pid->count) == 1) || > > > > > > - atomic_dec_and_test(&pid->count)) { > > > > > > + if ((refcount_read(&pid->count) == 1) || > > > > > > + refcount_dec_and_test(&pid->count)) { > > > > > > > > > > Why is this (and the original code) safe in the face of a race against > > > > > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I > > > > > don't see this code pattern anywhere else in the kernel. > > > > > > > > Semantically, it doesn't make a difference whether you do this or > > > > leave out the "refcount_read(&pid->count) == 1". If you read a 1 from > > > > refcount_read(), then you have the only reference to "struct pid", and > > > > therefore you want to free it. If you don't get a 1, you have to > > > > atomically drop a reference, which, if someone else is concurrently > > > > also dropping a reference, may leave you with the last reference (in > > > > the case where refcount_dec_and_test() returns true), in which case > > > > you still have to take care of freeing it. > > > > > > Also, based on Kees comment, I think it appears to me that get_pid and > > > put_pid can race in this way in the original code right? > > > > > > get_pid put_pid > > > > > > atomic_dec_and_test returns 1 > > > > This can't happen. get_pid() can only be called on an existing > > reference. 
If you are calling get_pid() on an existing reference, and > > someone else is dropping another reference with put_pid(), then when > > both functions start running, the refcount must be at least 2. > > Sigh, you are right. Ok. I was quite tired last night when I wrote this. > Obviously, I should have waited a bit and thought it through. > > Kees can you describe more the race you had in mind? > > > > atomic_inc > > > kfree > > > > > > deref pid /* boom */ > > > ------------------------------------------------- > > > > > > I think get_pid needs to call atomic_inc_not_zero() and put_pid should > > > not test for pid->count == 1 as condition for freeing, but rather just do > > > atomic_dec_and_test. So something like the following diff. (And I see a > > > similar pattern used in drivers/net/mac.c) > > > > get_pid() can only be called when you already have a refcounted > > reference; in other words, when the reference count is at least one. > > The lifetime management of struct pid differs from the lifetime > > management of most other objects in the kernel; the usual patterns > > don't quite apply here. > > > > Look at put_pid(): When the refcount has reached zero, there is no RCU > > grace period (unlike most other objects with RCU-managed lifetimes). > > Instead, free_pid() has an RCU grace period *before* it invokes > > delayed_put_pid() to drop a reference; and free_pid() is also the > > function that removes a PID from the namespace's IDR, and it is used > > by __change_pid() when a task loses its reference on a PID. > > > > In other words: Most refcounted objects with RCU guarantee that the > > object waits for a grace period after its refcount has reached zero; > > and during the grace period, the refcount is zero and you're not > > allowed to increment it again. > > Can you give an example of this "most refcounted objects with RCU" usecase? > I could not find any good examples of such. 
I want to document this pattern > and possibly submit to Documentation/RCU. E.g. struct posix_acl is a relatively straightforward example: posix_acl_release() is a wrapper around refcount_dec_and_test(); if the refcount has dropped to zero, the object is released after an RCU grace period using kfree_rcu(). get_cached_acl() takes an RCU read lock, does rcu_dereference() [with a missing __rcu annotation, grmbl], and attempts to take a reference with refcount_inc_not_zero(). > > But for struct pid, the guarantee is > > instead that there is an RCU grace period after it has been removed > > from the IDRs and the task, and during the grace period, refcounting > > is guaranteed to still work normally. > > Ok, thanks. Here I think in scrappy but simple pseudo code form, the struct > pid flow is something like (replaced "pid" with data"); > > get_data: > atomic_inc(data->refcount); > > some_user_of_data: > rcu_read_lock(); > From X, obtain a ptr to data using rcu_dereference. > get_data(data); > rcu_read_unlock(); > > free_data: > remove all references to data in all places in X > call_rcu(put_data) > > put_data: > if (atomic_dec_and_test(data->refcount)) { > free(data); > } > > create_data: > data = alloc(..) > atomic_set(data->refcount, 1); > set pointers to data in X. > > > > pud_pid to avoid such a race. > > > > > > ---8<----------------------- > > > > > > diff --git a/include/linux/pid.h b/include/linux/pid.h > > > index 8cb86d377ff5..3d79834e3180 100644 > > > --- a/include/linux/pid.h > > > +++ b/include/linux/pid.h > > > @@ -69,8 +69,8 @@ extern struct pid init_struct_pid; > > > > > > static inline struct pid *get_pid(struct pid *pid) > > > { > > > - if (pid) > > > - refcount_inc(&pid->count); > > > + if (!pid || !refcount_inc_not_zero(&pid->count)) > > > + return NULL; > > > return pid; > > > } > > > > Nope, this is wrong. Once the refcount is zero, the object goes away, > > refcount_inc_not_zero() makes no sense here. 
> > Yeah ok, I think what you meant here is that references to the object from > all places go away before the grace period starts, so a get_pid on an object > with refcount of zero is impossible since there's no way to *get* to that > object after the grace-period ends. > > So, yes you are right that refcount_inc is all that's needed. > > Also note to the on looker, the original patch I sent is not wrong, that > still applies and is correct. We are just discussing here any possible issues > with the *existing* code. > > thanks! > > - Joel >
On 03/28, Jann Horn wrote:
>
> Since we're just talking about RCU stuff now, adding Paul McKenney to
> the thread.

Since you added Paul let me add more confusion to this thread ;)

There were some concerns about the lack of barriers in put_pid(), but I can't
find that old discussion and I forgot the result of that discussion...

Paul, could you confirm that this code

	CPU_0				CPU_1

	X = 1;				if (READ_ONCE(Y))
	mb();					X = 2;
	Y = 1;				BUG_ON(X != 2);

is correct? I think it is, control dependency pairs with mb(), right?

If not, then put_pid() needs atomic_read_acquire() as it was proposed in that
discussion.

Oleg.
On Thu, Mar 28, 2019 at 7:37 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> Kees can you describe more the race you had in mind?

I just didn't see any barriers, so it seemed racy to me.

> Also note to the on looker, the original patch I sent is not wrong, that
> still applies and is correct. We are just discussing here any possible issues
> with the *existing* code.

Correct. And akpm has (correctly) taken this for -mm.
On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> On 03/28, Jann Horn wrote:
> >
> > Since we're just talking about RCU stuff now, adding Paul McKenney to
> > the thread.
>
> Since you added Paul let me add more confusion to this thread ;)

Woo-hoo!!! More confusion! Bring it on!!! ;-)

> There were some concerns about the lack of barriers in put_pid(), but I can't
> find that old discussion and I forgot the result of that discussion...
>
> Paul, could you confirm that this code
>
> 	CPU_0				CPU_1
>
> 	X = 1;				if (READ_ONCE(Y))
> 	mb();					X = 2;
> 	Y = 1;				BUG_ON(X != 2);
>
> is correct? I think it is, control dependency pairs with mb(), right?

The BUG_ON() is supposed to happen at the end of time, correct?

As written, there is (in the strict sense) a data race between the load
of X in the BUG_ON() and CPU_0's store to X. In a less strict sense, you
could of course argue that this data race is harmless, especially if X
is a single byte. But the more I talk to compiler writers, the less
comfortable I become with data races in general. :-/

So I would also feel better if the "Y = 1" was WRITE_ONCE().

On the other hand, this is a great opportunity to try out Alan Stern's
prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!

https://lkml.kernel.org/r/Pine.LNX.4.44L0.1903191459270.1593-200000@iolanthe.rowland.org

Also adding Alan on CC.
Here is what I believe is the litmus test that you are interested in:

------------------------------------------------------------------------
C OlegNesterov-put_pid

{}

P0(int *x, int *y)
{
	*x = 1;
	smp_mb();
	*y = 1;
}

P1(int *x, int *y)
{
	int r1;

	r1 = READ_ONCE(*y);
	if (r1)
		*x = 2;
}

exists (1:r1=1 /\ ~x=2)
------------------------------------------------------------------------

Running this through herd with Alan's patch detects the data race and
says that the undesired outcome is allowed:

$ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus
Test OlegNesterov-put_pid Allowed
States 3
1:r1=0; x=1;
1:r1=1; x=1;
1:r1=1; x=2;
Ok
Witnesses
Positive: 1 Negative: 2
Flag data-race
Condition exists (1:r1=1 /\ not (x=2))
Observation OlegNesterov-put_pid Sometimes 1 2
Time OlegNesterov-put_pid 0.00
Hash=a3e0043ad753effa860fea37eeba0a76

Using WRITE_ONCE() for P0()'s store to y still allows this outcome,
although it does remove the "Flag data-race". Using WRITE_ONCE() for
both P0()'s store to y and P1()'s store to x gets rid of both the
"Flag data-race" and the undesired outcome:

$ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus
Test OlegNesterov-put_pid-WO-WO Allowed
States 2
1:r1=0; x=1;
1:r1=1; x=2;
No
Witnesses
Positive: 0 Negative: 2
Condition exists (1:r1=1 /\ not (x=2))
Observation OlegNesterov-put_pid-WO-WO Never 0 2
Time OlegNesterov-put_pid-WO-WO 0.01
Hash=6e1643e3c5e4739b590bde0a8e8a918e

Here is the corresponding litmus test, in case I messed something up:

------------------------------------------------------------------------
C OlegNesterov-put_pid-WO-WO

{}

P0(int *x, int *y)
{
	*x = 1;
	smp_mb();
	WRITE_ONCE(*y, 1);
}

P1(int *x, int *y)
{
	int r1;

	r1 = READ_ONCE(*y);
	if (r1)
		WRITE_ONCE(*x, 2);
}

exists (1:r1=1 /\ ~x=2)
------------------------------------------------------------------------

> If not, then put_pid() needs atomic_read_acquire() as it was proposed
> in that discussion.
Good point, let's try with smp_load_acquire() in P1(): $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-sla.litmus Test OlegNesterov-put_pid-WO-sla Allowed States 2 1:r1=0; x=1; 1:r1=1; x=2; No Witnesses Positive: 0 Negative: 2 Condition exists (1:r1=1 /\ not (x=2)) Observation OlegNesterov-put_pid-WO-sla Never 0 2 Time OlegNesterov-put_pid-WO-sla 0.01 Hash=4fb0276eabf924793dec1970199db3a6 This also works. Here is the litmus test: ------------------------------------------------------------------------ C OlegNesterov-put_pid-WO-sla {} P0(int *x, int *y) { *x = 1; smp_mb(); WRITE_ONCE(*y, 1); } P1(int *x, int *y) { int r1; r1 = smp_load_acquire(y); if (r1) *x = 2; } exists (1:r1=1 /\ ~x=2) ------------------------------------------------------------------------ Demoting P0()'s WRITE_ONCE() to a plain write while leaving P1()'s smp_load_acquire() gets us a data race and allows the undesired outcome: $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-sla.litmus Test OlegNesterov-put_pid-sla Allowed States 3 1:r1=0; x=1; 1:r1=1; x=1; 1:r1=1; x=2; Ok Witnesses Positive: 1 Negative: 2 Flag data-race Condition exists (1:r1=1 /\ not (x=2)) Observation OlegNesterov-put_pid-sla Sometimes 1 2 Time OlegNesterov-put_pid-sla 0.01 Hash=ec6f71f3d9f7cd6e45a874c872e3d946 But what if you are certain that the compiler cannot mess up your use of plain C-language loads and stores? Then simply tell LKMM that they are READ_ONCE() and WRITE_ONCE(), respectively. LKMM is admittedly somewhat paranoid, but real C compilers really do tear stores of certain constants on systems (like x86) that have store-immediate instructions, so a bit of paranoia is not misplaced here. ;-) Plus please note that this patch to LKMM is prototype and thus subject to change. Thanx, Paul
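Paul's smp_load_acquire() variant maps naturally onto C11 atomics. The sketch below is a hypothetical user-space model of that pairing (the names p0/p1 mirror the litmus test and are not real kernel functions); note that a single-threaded run can only smoke-test the logic — the interesting guarantee, that P1's plain store to x cannot be ordered before its acquire load of y, is a cross-CPU property that herd rather than a unit test verifies:

```c
#include <stdatomic.h>

static int x;
static atomic_int y;

static void p0(void)                    /* models the litmus test's P0() */
{
        x = 1;
        /* smp_mb() followed by WRITE_ONCE() is at least as strong as
         * a C11 release store, which is what we model here. */
        atomic_store_explicit(&y, 1, memory_order_release);
}

static int p1(void)                     /* models the litmus test's P1() */
{
        /* Acquire load: later accesses cannot be hoisted above it,
         * so if we observe y == 1 we also observe x == 1 before we
         * overwrite x with 2. */
        if (atomic_load_explicit(&y, memory_order_acquire))
                x = 2;
        return x;
}
```

Run sequentially, the reader that sees y == 1 always ends with x == 2, matching the "Never" result for the exists clause above.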
On Thu, Mar 28, 2019 at 04:17:50PM +0100, Jann Horn wrote: > Since we're just talking about RCU stuff now, adding Paul McKenney to > the thread. > > On Thu, Mar 28, 2019 at 3:37 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > On Thu, Mar 28, 2019 at 03:57:44AM +0100, Jann Horn wrote: > > > On Thu, Mar 28, 2019 at 3:34 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > On Thu, Mar 28, 2019 at 01:59:45AM +0100, Jann Horn wrote: > > > > > On Thu, Mar 28, 2019 at 1:06 AM Kees Cook <keescook@chromium.org> wrote: > > > > > > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google) > > > > > > <joel@joelfernandes.org> wrote: > > > > > > > > > > > > > > struct pid's count is an atomic_t field used as a refcount. Use > > > > > > > refcount_t for it which is basically atomic_t but does additional > > > > > > > checking to prevent use-after-free bugs. No change in behavior if > > > > > > > CONFIG_REFCOUNT_FULL=n. > > > > > > > > > > > > > > Cc: keescook@chromium.org > > > > > > > Cc: kernel-team@android.com > > > > > > > Cc: kernel-hardening@lists.openwall.com > > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > > > > [...] 
> > > > > > > diff --git a/kernel/pid.c b/kernel/pid.c > > > > > > > index 20881598bdfa..2095c7da644d 100644 > > > > > > > --- a/kernel/pid.c > > > > > > > +++ b/kernel/pid.c > > > > > > > @@ -37,7 +37,7 @@ > > > > > > > #include <linux/init_task.h> > > > > > > > #include <linux/syscalls.h> > > > > > > > #include <linux/proc_ns.h> > > > > > > > -#include <linux/proc_fs.h> > > > > > > > +#include <linux/refcount.h> > > > > > > > #include <linux/sched/task.h> > > > > > > > #include <linux/idr.h> > > > > > > > > > > > > > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid) > > > > > > > return; > > > > > > > > > > > > > > ns = pid->numbers[pid->level].ns; > > > > > > > - if ((atomic_read(&pid->count) == 1) || > > > > > > > - atomic_dec_and_test(&pid->count)) { > > > > > > > + if ((refcount_read(&pid->count) == 1) || > > > > > > > + refcount_dec_and_test(&pid->count)) { > > > > > > > > > > > > Why is this (and the original code) safe in the face of a race against > > > > > > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I > > > > > > don't see this code pattern anywhere else in the kernel. > > > > > > > > > > Semantically, it doesn't make a difference whether you do this or > > > > > leave out the "refcount_read(&pid->count) == 1". If you read a 1 from > > > > > refcount_read(), then you have the only reference to "struct pid", and > > > > > therefore you want to free it. If you don't get a 1, you have to > > > > > atomically drop a reference, which, if someone else is concurrently > > > > > also dropping a reference, may leave you with the last reference (in > > > > > the case where refcount_dec_and_test() returns true), in which case > > > > > you still have to take care of freeing it. > > > > > > > > Also, based on Kees comment, I think it appears to me that get_pid and > > > > put_pid can race in this way in the original code right? > > > > > > > > get_pid put_pid > > > > > > > > atomic_dec_and_test returns 1 > > > > > > This can't happen. 
get_pid() can only be called on an existing > > > reference. If you are calling get_pid() on an existing reference, and > > > someone else is dropping another reference with put_pid(), then when > > > both functions start running, the refcount must be at least 2. > > > > Sigh, you are right. Ok. I was quite tired last night when I wrote this. > > Obviously, I should have waited a bit and thought it through. > > > > Kees can you describe more the race you had in mind? > > > > > > atomic_inc > > > > kfree > > > > > > > > deref pid /* boom */ > > > > ------------------------------------------------- > > > > > > > > I think get_pid needs to call atomic_inc_not_zero() and put_pid should > > > > not test for pid->count == 1 as condition for freeing, but rather just do > > > > atomic_dec_and_test. So something like the following diff. (And I see a > > > > similar pattern used in drivers/net/mac.c) > > > > > > get_pid() can only be called when you already have a refcounted > > > reference; in other words, when the reference count is at least one. > > > The lifetime management of struct pid differs from the lifetime > > > management of most other objects in the kernel; the usual patterns > > > don't quite apply here. > > > > > > Look at put_pid(): When the refcount has reached zero, there is no RCU > > > grace period (unlike most other objects with RCU-managed lifetimes). > > > Instead, free_pid() has an RCU grace period *before* it invokes > > > delayed_put_pid() to drop a reference; and free_pid() is also the > > > function that removes a PID from the namespace's IDR, and it is used > > > by __change_pid() when a task loses its reference on a PID. > > > > > > In other words: Most refcounted objects with RCU guarantee that the > > > object waits for a grace period after its refcount has reached zero; > > > and during the grace period, the refcount is zero and you're not > > > allowed to increment it again. 
> > > > Can you give an example of this "most refcounted objects with RCU" usecase? > > I could not find any good examples of such. I want to document this pattern > > and possibly submit to Documentation/RCU. > > E.g. struct posix_acl is a relatively straightforward example: > posix_acl_release() is a wrapper around refcount_dec_and_test(); if > the refcount has dropped to zero, the object is released after an RCU > grace period using kfree_rcu(). > get_cached_acl() takes an RCU read lock, does rcu_dereference() [with > a missing __rcu annotation, grmbl], and attempts to take a reference > with refcount_inc_not_zero(). Ok I get it now. It is quite a subtle difference in usage, I have noted both these usecases in my private notes for my own sanity ;-). I wonder if Paul thinks this is too silly to document into Documentation/RCU/, or if I should write up something. One thing I wonder is if one usage pattern is faster than the other. Certainly in the {get,put}_pid case, it seems nice to be able to do a get_pid even though free_pid's grace period has still not completed. Whereas in the posix_acl case, once the grace period starts then it is no longer possible to get a reference as you pointed out, and it's basically game-over for that object. thank you! - Joel
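The posix_acl-style pattern Jann describes can be sketched in user space with C11 atomics. These functions are hypothetical stand-ins for refcount_inc_not_zero()/refcount_dec_and_test(), not the kernel implementation (which additionally saturates and warns on overflow and underflow):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A reader that finds the object under RCU may only take a reference
 * if the count has not already dropped to zero: once it hits zero the
 * object is on its way to being freed and must not be resurrected. */
static bool ref_inc_not_zero(atomic_int *count)
{
        int old = atomic_load(count);

        while (old != 0) {
                /* On failure the CAS reloads the current value into
                 * 'old' and we retry (weak CAS may fail spuriously). */
                if (atomic_compare_exchange_weak(count, &old, old + 1))
                        return true;
        }
        return false;
}

/* Returns true when the caller dropped the last reference. */
static bool ref_dec_and_test(atomic_int *count)
{
        return atomic_fetch_sub(count, 1) == 1;
}
```

In this model, exactly one thread sees ref_dec_and_test() return true and becomes responsible for freeing; any concurrent reader either wins a reference before that point or sees zero and backs off.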
On Thu, Mar 28, 2019 at 04:00:52PM -0400, Joel Fernandes wrote: > On Thu, Mar 28, 2019 at 04:17:50PM +0100, Jann Horn wrote: > > Since we're just talking about RCU stuff now, adding Paul McKenney to > > the thread. > > > > On Thu, Mar 28, 2019 at 3:37 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > On Thu, Mar 28, 2019 at 03:57:44AM +0100, Jann Horn wrote: > > > > On Thu, Mar 28, 2019 at 3:34 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > On Thu, Mar 28, 2019 at 01:59:45AM +0100, Jann Horn wrote: > > > > > > On Thu, Mar 28, 2019 at 1:06 AM Kees Cook <keescook@chromium.org> wrote: > > > > > > > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google) > > > > > > > <joel@joelfernandes.org> wrote: > > > > > > > > > > > > > > > > struct pid's count is an atomic_t field used as a refcount. Use > > > > > > > > refcount_t for it which is basically atomic_t but does additional > > > > > > > > checking to prevent use-after-free bugs. No change in behavior if > > > > > > > > CONFIG_REFCOUNT_FULL=n. > > > > > > > > > > > > > > > > Cc: keescook@chromium.org > > > > > > > > Cc: kernel-team@android.com > > > > > > > > Cc: kernel-hardening@lists.openwall.com > > > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > > > > > [...] 
> > > > > > > > diff --git a/kernel/pid.c b/kernel/pid.c > > > > > > > > index 20881598bdfa..2095c7da644d 100644 > > > > > > > > --- a/kernel/pid.c > > > > > > > > +++ b/kernel/pid.c > > > > > > > > @@ -37,7 +37,7 @@ > > > > > > > > #include <linux/init_task.h> > > > > > > > > #include <linux/syscalls.h> > > > > > > > > #include <linux/proc_ns.h> > > > > > > > > -#include <linux/proc_fs.h> > > > > > > > > +#include <linux/refcount.h> > > > > > > > > #include <linux/sched/task.h> > > > > > > > > #include <linux/idr.h> > > > > > > > > > > > > > > > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid) > > > > > > > > return; > > > > > > > > > > > > > > > > ns = pid->numbers[pid->level].ns; > > > > > > > > - if ((atomic_read(&pid->count) == 1) || > > > > > > > > - atomic_dec_and_test(&pid->count)) { > > > > > > > > + if ((refcount_read(&pid->count) == 1) || > > > > > > > > + refcount_dec_and_test(&pid->count)) { > > > > > > > > > > > > > > Why is this (and the original code) safe in the face of a race against > > > > > > > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I > > > > > > > don't see this code pattern anywhere else in the kernel. > > > > > > > > > > > > Semantically, it doesn't make a difference whether you do this or > > > > > > leave out the "refcount_read(&pid->count) == 1". If you read a 1 from > > > > > > refcount_read(), then you have the only reference to "struct pid", and > > > > > > therefore you want to free it. If you don't get a 1, you have to > > > > > > atomically drop a reference, which, if someone else is concurrently > > > > > > also dropping a reference, may leave you with the last reference (in > > > > > > the case where refcount_dec_and_test() returns true), in which case > > > > > > you still have to take care of freeing it. > > > > > > > > > > Also, based on Kees comment, I think it appears to me that get_pid and > > > > > put_pid can race in this way in the original code right? 
> > > > > > > > > > get_pid put_pid > > > > > > > > > > atomic_dec_and_test returns 1 > > > > > > > > This can't happen. get_pid() can only be called on an existing > > > > reference. If you are calling get_pid() on an existing reference, and > > > > someone else is dropping another reference with put_pid(), then when > > > > both functions start running, the refcount must be at least 2. > > > > > > Sigh, you are right. Ok. I was quite tired last night when I wrote this. > > > Obviously, I should have waited a bit and thought it through. > > > > > > Kees can you describe more the race you had in mind? > > > > > > > > atomic_inc > > > > > kfree > > > > > > > > > > deref pid /* boom */ > > > > > ------------------------------------------------- > > > > > > > > > > I think get_pid needs to call atomic_inc_not_zero() and put_pid should > > > > > not test for pid->count == 1 as condition for freeing, but rather just do > > > > > atomic_dec_and_test. So something like the following diff. (And I see a > > > > > similar pattern used in drivers/net/mac.c) > > > > > > > > get_pid() can only be called when you already have a refcounted > > > > reference; in other words, when the reference count is at least one. > > > > The lifetime management of struct pid differs from the lifetime > > > > management of most other objects in the kernel; the usual patterns > > > > don't quite apply here. > > > > > > > > Look at put_pid(): When the refcount has reached zero, there is no RCU > > > > grace period (unlike most other objects with RCU-managed lifetimes). > > > > Instead, free_pid() has an RCU grace period *before* it invokes > > > > delayed_put_pid() to drop a reference; and free_pid() is also the > > > > function that removes a PID from the namespace's IDR, and it is used > > > > by __change_pid() when a task loses its reference on a PID. 
> > > > > > > > In other words: Most refcounted objects with RCU guarantee that the > > > > object waits for a grace period after its refcount has reached zero; > > > > and during the grace period, the refcount is zero and you're not > > > > allowed to increment it again. > > > > > > Can you give an example of this "most refcounted objects with RCU" usecase? > > > I could not find any good examples of such. I want to document this pattern > > > and possibly submit to Documentation/RCU. > > > > E.g. struct posix_acl is a relatively straightforward example: > > posix_acl_release() is a wrapper around refcount_dec_and_test(); if > > the refcount has dropped to zero, the object is released after an RCU > > grace period using kfree_rcu(). > > get_cached_acl() takes an RCU read lock, does rcu_dereference() [with > > a missing __rcu annotation, grmbl], and attempts to take a reference > > with refcount_inc_not_zero(). > > Ok I get it now. It is quite a subtle difference in usage, I have noted both > these usecases in my private notes for my own sanity ;-). I wonder if Paul > thinks this is too silly to document into Documentation/RCU/, or if I should > write-up something. > > One thing I wonder is if one usage pattern is faster than the other. > Certainly in the {get,put}_pid case, it seems nice to be able to do a > get_pid even though free_pid's grace period has still not completed. Where as > in the posix_acl case, once the grace period starts then it is no longer > possible to get a reference as you pointed and its basically game-over for > that object. Wow, so I just found that Documentation/RCU/rcuref.txt already beautifully talks about specifically both these RCU refcounting patterns - in the second and third example. I also found some issues with that document, so I will be submitting those and CC you guys. thanks! - Joel
On Thu, Mar 28, 2019 at 10:39:58AM -0400, Joel Fernandes wrote: > On Thu, Mar 28, 2019 at 03:26:19PM +0100, Oleg Nesterov wrote: > > On 03/27, Joel Fernandes wrote: > > > > > > Also, based on Kees comment, I think it appears to me that get_pid and > > > put_pid can race in this way in the original code right? > > > > > > get_pid put_pid > > > > > > atomic_dec_and_test returns 1 > > > atomic_inc > > > kfree > > > > > > deref pid /* boom */ > > > ------------------------------------------------- > > > > > > I think get_pid needs to call atomic_inc_not_zero() > > > > No. > > > > get_pid() should only be used if you already have a reference or you do > > something like > > > > rcu_read_lock(); > > pid = find_vpid(); > > get_pid(); > > rcu_read_lock(); > > > > in this case we rely on call_rcu(delayed_put_pid) which drops the initial > > reference. > > > > If put_pid() sees pid->count == 1, then a) nobody else has a reference and > > b) nobody else can find this pid on rcu-protected lists, so it is safe to > > free it. > > I agree. Check my reply to Jann, I already replied to him about this. thanks! > Also Oleg, why not just call refcount_dec_and_test like below? If count is 1, then it will decrement to 0 and return true anyway. Is this because we want to avoid writes at the cost of more reads? Did I miss something? Thank you. I don't remember very clearly, but I think Kees also asked about the same thing. diff --git a/kernel/pid.c b/kernel/pid.c index 2095c7da644d..89c4849fab5d 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -106,8 +106,7 @@ void put_pid(struct pid *pid) return; ns = pid->numbers[pid->level].ns; - if ((refcount_read(&pid->count) == 1) || - refcount_dec_and_test(&pid->count)) { + if (refcount_dec_and_test(&pid->count)) { kmem_cache_free(ns->pid_cachep, pid); put_pid_ns(ns); }
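Joel's point — that the refcount_read() fast path is redundant for correctness and only avoids an atomic RMW — can be checked mechanically. Below is a user-space model of the two put-side variants (illustrative only, built on C11 atomics rather than the kernel's refcount_t):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Mirrors the existing put_pid() test: if a plain read shows we hold
 * the only reference, report "free" without touching the cache line
 * with an atomic RMW; otherwise fall back to dec-and-test. */
static bool should_free_fastpath(atomic_int *count)
{
        return atomic_load(count) == 1 ||
               atomic_fetch_sub(count, 1) == 1;
}

/* Mirrors the simplified diff above: dec-and-test alone. A count of 1
 * decrements to 0 and returns true anyway, so the outcome is the same. */
static bool should_free_plain(atomic_int *count)
{
        return atomic_fetch_sub(count, 1) == 1;
}
```

Both variants report "free" exactly when the caller holds the last reference; the only behavioral difference is whether the final transition is a plain load or an atomic decrement.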
On 03/28, Paul E. McKenney wrote: > > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote: > > > > Since you added Paul let me add more confusion to this thread ;) > > Woo-hoo!!! More confusion! Bring it on!!! ;-) OK, thanks, you certainly managed to confuse me much more than I expected! > > There were some concerns about the lack of barriers in put_pid(), but I can't > > find that old discussion and I forgot the result of that discussion... > > > > Paul, could you confirm that this code > > > > CPU_0 CPU_1 > > > > X = 1; if (READ_ONCE(Y)) > > mb(); X = 2; > > Y = 1; BUG_ON(X != 2); > > > > > > is correct? I think it is, control dependency pairs with mb(), right? > > The BUG_ON() is supposed to happen at the end of time, correct? Yes, > As written, there is (in the strict sense) a data race between the load > of X in the BUG_ON() and CPU_0's store to X. Well, this pseudo code is simply wrong, I meant that "X = 1" on CPU 0 must not happen after "X = 2" on CPU 1 or we have a problem. > But the more I talk to compiler writers, the > less comfortable I become with data races in general. :-/ > > So I would also feel better if the "Y = 1" was WRITE_ONCE(). If we forget about potential compiler bugs, then it is not clear to me how WRITE_ONCE() can help in this case. mb() implies the compiler barrier, and we do not really need the "once" semantics? But as for put_pid(), it actually does atomic_dec_and_test(&Y), so this is probably not relevant. > On the other hand, this is a great opportunity to try out Alan Stern's > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)! > > https://lkml.kernel.org/r/Pine.LNX.4.44L0.1903191459270.1593-200000@iolanthe.rowland.org Heh. Will do, but only after I buy more brains. 
> Here is what I believe is the litmus test that your are interested in: > > ------------------------------------------------------------------------ > C OlegNesterov-put_pid > > {} > > P0(int *x, int *y) > { > *x = 1; > smp_mb(); > *y = 1; > } > > P1(int *x, int *y) > { > int r1; > > r1 = READ_ONCE(*y); > if (r1) > *x = 2; > } > > exists (1:r1=1 /\ ~x=2) I am not familiar with litmus, and I do not really understand what (and why) it reports. > Running this through herd with Alan's patch detects the data race > and says that the undesired outcome is allowed: OK, so it says that "*x = 2" can happen before "*x = 1" even if P1() observes *y == 1. Still can't understand how this can happen... Nevermind ;) > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x > gets rid of both the "Flag data-race" and the undesired outcome: ... > P1(int *x, int *y) > { > int r1; > > r1 = READ_ONCE(*y); > if (r1) > WRITE_ONCE(*x, 2); > } And this is what Documentation/memory-barriers.txt says in the "CONTROL DEPENDENCIES" section: q = READ_ONCE(a); if (q) { WRITE_ONCE(b, 1); } Control dependencies pair normally with other types of barriers. That said, please note that neither READ_ONCE() nor WRITE_ONCE() are optional! but again, I fail to really understand why WRITE_ONCE() is not optional in this particular case. Thanks! Oleg.
On 03/28, Joel Fernandes wrote: > > Also Oleg, why not just call refcount_dec_and_test like below? Of course we can, this is just an optimization to avoid the atomic op if possible. Does this optimization really help? I have no idea ;) But as you can see, it surely helps to provoke interesting discussions! Oleg.
On Fri, 29 Mar 2019, Oleg Nesterov wrote: > On 03/28, Paul E. McKenney wrote: > > > > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote: > > > > > > Since you added Paul let me add more confusion to this thread ;) > > > > Woo-hoo!!! More confusion! Bring it on!!! ;-) > > OK, thanks, you certainly managed to confused me much more than I expected! > > > > There were some concerns about the lack of barriers in put_pid(), but I can't > > > find that old discussion and I forgot the result of that discussion... > > > > > > Paul, could you confirm that this code > > > > > > CPU_0 CPU_1 > > > > > > X = 1; if (READ_ONCE(Y)) > > > mb(); X = 2; > > > Y = 1; BUG_ON(X != 2); > > > > > > > > > is correct? I think it is, control dependency pairs with mb(), right? > > > > The BUG_ON() is supposed to happen at the end of time, correct? > > Yes, > > > As written, there is (in the strict sense) a data race between the load > > of X in the BUG_ON() and CPU_0's store to X. > > Well, this pseudo code is simply wrong, I meant that "X = 1" on CPU 0 > must not happen after "X = 2" on CPU 1 or we have a problem. The situation is worse than that. Strictly speaking, any time there is a data race you have a problem. In particular, the C11 Standard permits a data race to cause undefined behavior. For example, the Standard wouldn't be violated if a data race caused your computer to catch on fire. In real life the situation isn't that bad, and you can do things the Standard doesn't cover if you are very familiar with the properties of the compiler. But compilers change over time, so relying on special knowledge about it might not be a good idea in the long run. > > But the more I talk to compiler writers, the > > less comfortable I become with data races in general. :-/ > > > > So I would also feel better if the "Y = 1" was WRITE_ONCE(). > > If we forget about potential compiler bugs, then it is not clear to me how > WRITE_ONCE() can help in this case. 
mb() implies the compiler barrier, and > we do not really need the "once" semantics? There is a big difference between WRITE_ONCE() and plain assignment. Given "WRITE_ONCE(X, 2)", the compiler will emit a simple store instruction. But given "X = 2", the compiler is allowed to emit instructions equivalent to: if (X != 2) X = 2; This is not a simple store; it also contains a load. (And in some situations, a transformation like this might even be beneficial, because it can avoid a cache line transfer.) CPUs are allowed to execute loads out of order with respect to control dependencies. So for example, if your pseudo-code for CPU_1 got compiled to object code equivalent to: if (READ_ONCE(Y)) { if (X != 2) X = 2; } then the CPU might read the value of X before reading the value of Y. This might not look like a problem, but it would be an example of a race. > But as for put_pid(), it actually does atomic_dec_and_test(&Y), so this is > probably not relevant. > > > On the other hand, this is a great opportunity to try out Alan Stern's > > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)! > > > > https://lkml.kernel.org/r/Pine.LNX.4.44L0.1903191459270.1593-200000@iolanthe.rowland.org > > Heh. Will do, but only after I buy more brains. > > > Here is what I believe is the litmus test that your are interested in: > > > > ------------------------------------------------------------------------ > > C OlegNesterov-put_pid > > > > {} > > > > P0(int *x, int *y) > > { > > *x = 1; > > smp_mb(); > > *y = 1; > > } > > > > P1(int *x, int *y) > > { > > int r1; > > > > r1 = READ_ONCE(*y); > > if (r1) > > *x = 2; > > } > > > > exists (1:r1=1 /\ ~x=2) > > I am not familiar with litmus, and I do not really understand what (and why) > it reports. > > > Running this through herd with Alan's patch detects the data race > > and says that the undesired outcome is allowed: > > OK, so it says that "*x = 2" can happen before "*x = 1" even if P1() observes > *y == 1. 
> > Still can't understand how this can happen... Nevermind ;) You might imagine that the compiler could generate object code for P1() equivalent to: tmp = *x; *x = 2; if (!READ_ONCE(*y)) *x = tmp; Given this object code, P1 might indeed set *x to 2 before P0 sets *x to 1. I believe the Standard allows the compiler to do this sort of thing, although of course it would be a stupid thing to do. Using plain accesses gives the compiler a tremendous amount of freedom -- a lot more than many people think. Alan > > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x > > gets rid of both the "Flag data-race" and the undesired outcome: > > ... > > > P1(int *x, int *y) > > { > > int r1; > > > > r1 = READ_ONCE(*y); > > if (r1) > > WRITE_ONCE(*x, 2); > > } > > And this is what Documentation/memory-barriers.txt says in the "CONTROL > DEPENDENCIES" section: > > q = READ_ONCE(a); > if (q) { > WRITE_ONCE(b, 1); > } > > Control dependencies pair normally with other types of barriers. > That said, please note that neither READ_ONCE() nor WRITE_ONCE() > are optional! > > but again, I fail to really understand why WRITE_ONCE() is not optional > in this particular case. > > Thanks! > > Oleg.
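Alan's point is easiest to see from how the kernel builds these primitives: an access through a volatile-qualified lvalue forbids the compiler from inventing, duplicating, or eliding the memory operation, which rules out rewrites like the speculative-store example above. The macros below are deliberately cut-down sketches of the idea, not the real include/linux/compiler.h definitions (which also deal with access sizes, tearing, and more):

```c
/* Simplified illustrations of the volatile-cast technique behind
 * READ_ONCE()/WRITE_ONCE(): each expansion compiles to exactly one
 * load or one store of the named object, so the compiler cannot emit
 * "tmp = *x; *x = 2; if (!cond) *x = tmp;" style transformations. */
#define MY_READ_ONCE(x)      (*(const volatile __typeof__(x) *)&(x))
#define MY_WRITE_ONCE(x, v)  (*(volatile __typeof__(x) *)&(x) = (v))
```

This is why marking both sides of the race with these macros makes LKMM's "Flag data-race" disappear: the accesses become well-defined single memory operations instead of plain accesses the compiler is free to transform.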
On Thu, Mar 28, 2019 at 10:37:07AM -0700, Paul E. McKenney wrote: > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote: > > On 03/28, Jann Horn wrote: > > > > > > Since we're just talking about RCU stuff now, adding Paul McKenney to > > > the thread. > > > > Since you added Paul let me add more confusion to this thread ;) > > Woo-hoo!!! More confusion! Bring it on!!! ;-) Nice to take part in the confusion fun too!!! ;-) > > There were some concerns about the lack of barriers in put_pid(), but I can't > > find that old discussion and I forgot the result of that discussion... > > > > Paul, could you confirm that this code > > > > CPU_0 CPU_1 > > > > X = 1; if (READ_ONCE(Y)) > > mb(); X = 2; > > Y = 1; BUG_ON(X != 2); > > > > > > is correct? I think it is, control dependency pairs with mb(), right? > > The BUG_ON() is supposed to happen at the end of time, correct? > As written, there is (in the strict sense) a data race between the load > of X in the BUG_ON() and CPU_0's store to X. In a less strict sense, > you could of course argue that this data race is harmless, especially > if X is a single byte. But the more I talk to compiler writers, the > less comfortable I become with data races in general. :-/ > > So I would also feel better if the "Y = 1" was WRITE_ONCE(). > > On the other hand, this is a great opportunity to try out Alan Stern's > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)! > > https://lkml.kernel.org/r/Pine.LNX.4.44L0.1903191459270.1593-200000@iolanthe.rowland.org > > Also adding Alan on CC. 
> > Here is what I believe is the litmus test that your are interested in: > > ------------------------------------------------------------------------ > C OlegNesterov-put_pid > > {} > > P0(int *x, int *y) > { > *x = 1; > smp_mb(); > *y = 1; > } > > P1(int *x, int *y) > { > int r1; > > r1 = READ_ONCE(*y); > if (r1) > *x = 2; > } > > exists (1:r1=1 /\ ~x=2) > ------------------------------------------------------------------------ > > Running this through herd with Alan's patch detects the data race > and says that the undesired outcome is allowed: > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus > Test OlegNesterov-put_pid Allowed > States 3 > 1:r1=0; x=1; > 1:r1=1; x=1; > 1:r1=1; x=2; > Ok > Witnesses > Positive: 1 Negative: 2 > Flag data-race > Condition exists (1:r1=1 /\ not (x=2)) > Observation OlegNesterov-put_pid Sometimes 1 2 > Time OlegNesterov-put_pid 0.00 > Hash=a3e0043ad753effa860fea37eeba0a76 > > Using WRITE_ONCE() for P0()'s store to y still allows this outcome, > although it does remove the "Flag data-race". > > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x > gets rid of both the "Flag data-race" and the undesired outcome: > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus > Test OlegNesterov-put_pid-WO-WO Allowed > States 2 > 1:r1=0; x=1; > 1:r1=1; x=2; > No > Witnesses > Positive: 0 Negative: 2 > Condition exists (1:r1=1 /\ not (x=2)) > Observation OlegNesterov-put_pid-WO-WO Never 0 2 > Time OlegNesterov-put_pid-WO-WO 0.01 > Hash=6e1643e3c5e4739b590bde0a8e8a918e > > Here is the corresponding litmus test, in case I messed something up: > > ------------------------------------------------------------------------ > C OlegNesterov-put_pid-WO-WO > > {} > > P0(int *x, int *y) > { > *x = 1; > smp_mb(); > WRITE_ONCE(*y, 1); > } > > P1(int *x, int *y) > { > int r1; > > r1 = READ_ONCE(*y); > if (r1) > WRITE_ONCE(*x, 2); > } > > exists (1:r1=1 /\ ~x=2) I ran the above examples too. 
It's a bit confusing to me why the WRITE_ONCE in P0() is required, and why the READ_ONCE / WRITE_ONCE in P1() would not be sufficient to prevent the exists condition. Shouldn't the compiler know that, in P0(), it should not reorder the store to y=1 before the x=1 because there is an explicit barrier between the 2 stores? Looks to me like a broken compiler :-|. So I would have expected the following litmus to result in Never, but it doesn't with Alan's patch: P0(int *x, int *y) { *x = 1; smp_mb(); *y = 1; } P1(int *x, int *y) { int r1; r1 = READ_ONCE(*y); if (r1) WRITE_ONCE(*x, 2); } exists (1:r1=1 /\ ~x=2) > ------------------------------------------------------------------------ > > > If not, then put_pid() needs atomic_read_acquire() as it was proposed in that > > discussion. > > Good point, let's try with smp_load_acquire() in P1(): > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-sla.litmus > Test OlegNesterov-put_pid-WO-sla Allowed > States 2 > 1:r1=0; x=1; > 1:r1=1; x=2; > No > Witnesses > Positive: 0 Negative: 2 > Condition exists (1:r1=1 /\ not (x=2)) > Observation OlegNesterov-put_pid-WO-sla Never 0 2 > Time OlegNesterov-put_pid-WO-sla 0.01 > Hash=4fb0276eabf924793dec1970199db3a6 > > This also works. Here is the litmus test: > > ------------------------------------------------------------------------ > C OlegNesterov-put_pid-WO-sla > > {} > > P0(int *x, int *y) > { > *x = 1; > smp_mb(); > WRITE_ONCE(*y, 1); > } > > P1(int *x, int *y) > { > int r1; > > r1 = smp_load_acquire(y); > if (r1) > *x = 2; > } > > exists (1:r1=1 /\ ~x=2) > ------------------------------------------------------------------------ > > Demoting P0()'s WRITE_ONCE() to a plain write while leaving P1()'s > smp_load_acquire() gets us a data race and allows the undesired > outcome: Yeah, I think this is also what I was confused about above: why is that WRITE_ONCE required in P0() when there's already an smp_mb there? Surely I'm missing something. 
;-) > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-sla.litmus > Test OlegNesterov-put_pid-sla Allowed > States 3 > 1:r1=0; x=1; > 1:r1=1; x=1; > 1:r1=1; x=2; > Ok > Witnesses > Positive: 1 Negative: 2 > Flag data-race > Condition exists (1:r1=1 /\ not (x=2)) > Observation OlegNesterov-put_pid-sla Sometimes 1 2 > Time OlegNesterov-put_pid-sla 0.01 > Hash=ec6f71f3d9f7cd6e45a874c872e3d946 > > But what if you are certain that the compiler cannot mess up your use > of plain C-language loads and stores? Then simply tell LKMM that they > are READ_ONCE() and WRITE_ONCE(), respectively. LKMM is admittedly > somewhat paranoid, but real C compilers really do tear stores of certain > constants on systems (like x86) that have store-immediate instructions, > so a bit of paranoia is not misplaced here. ;-) > > Plus please note that this patch to LKMM is prototype and thus subject > to change. Ah I see. Appreciate if Alan can also CC me on future posting of this since I'm quite interested. ;-) thanks, - Joel > Thanx, Paul >
On Fri, 29 Mar 2019, Joel Fernandes wrote: > On Thu, Mar 28, 2019 at 10:37:07AM -0700, Paul E. McKenney wrote: > > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote: > > > On 03/28, Jann Horn wrote: > > > > > > > > Since we're just talking about RCU stuff now, adding Paul McKenney to > > > > the thread. > > > > > > Since you added Paul let me add more confusion to this thread ;) > > > > Woo-hoo!!! More confusion! Bring it on!!! ;-) > > Nice to take part in the confusion fun too!!! ;-) > > > > There were some concerns about the lack of barriers in put_pid(), but I can't > > > find that old discussion and I forgot the result of that discussion... > > > > > > Paul, could you confirm that this code > > > > > > CPU_0 CPU_1 > > > > > > X = 1; if (READ_ONCE(Y)) > > > mb(); X = 2; > > > Y = 1; BUG_ON(X != 2); > > > > > > > > > is correct? I think it is, control dependency pairs with mb(), right? > > > > The BUG_ON() is supposed to happen at the end of time, correct? > > As written, there is (in the strict sense) a data race between the load > > of X in the BUG_ON() and CPU_0's store to X. In a less strict sense, > > you could of course argue that this data race is harmless, especially > > if X is a single byte. But the more I talk to compiler writers, the > > less comfortable I become with data races in general. :-/ > > > > So I would also feel better if the "Y = 1" was WRITE_ONCE(). > > > > On the other hand, this is a great opportunity to try out Alan Stern's > > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)! > > > > https://lkml.kernel.org/r/Pine.LNX.4.44L0.1903191459270.1593-200000@iolanthe.rowland.org > > > > Also adding Alan on CC. 
> > > > Here is what I believe is the litmus test that your are interested in: > > > > ------------------------------------------------------------------------ > > C OlegNesterov-put_pid > > > > {} > > > > P0(int *x, int *y) > > { > > *x = 1; > > smp_mb(); > > *y = 1; > > } > > > > P1(int *x, int *y) > > { > > int r1; > > > > r1 = READ_ONCE(*y); > > if (r1) > > *x = 2; > > } > > > > exists (1:r1=1 /\ ~x=2) > > ------------------------------------------------------------------------ > > > > Running this through herd with Alan's patch detects the data race > > and says that the undesired outcome is allowed: > > > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus > > Test OlegNesterov-put_pid Allowed > > States 3 > > 1:r1=0; x=1; > > 1:r1=1; x=1; > > 1:r1=1; x=2; > > Ok > > Witnesses > > Positive: 1 Negative: 2 > > Flag data-race > > Condition exists (1:r1=1 /\ not (x=2)) > > Observation OlegNesterov-put_pid Sometimes 1 2 > > Time OlegNesterov-put_pid 0.00 > > Hash=a3e0043ad753effa860fea37eeba0a76 > > > > Using WRITE_ONCE() for P0()'s store to y still allows this outcome, > > although it does remove the "Flag data-race". 
> > > > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x > > gets rid of both the "Flag data-race" and the undesired outcome: > > > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus > > Test OlegNesterov-put_pid-WO-WO Allowed > > States 2 > > 1:r1=0; x=1; > > 1:r1=1; x=2; > > No > > Witnesses > > Positive: 0 Negative: 2 > > Condition exists (1:r1=1 /\ not (x=2)) > > Observation OlegNesterov-put_pid-WO-WO Never 0 2 > > Time OlegNesterov-put_pid-WO-WO 0.01 > > Hash=6e1643e3c5e4739b590bde0a8e8a918e > > > > Here is the corresponding litmus test, in case I messed something up: > > > > ------------------------------------------------------------------------ > > C OlegNesterov-put_pid-WO-WO > > > > {} > > > > P0(int *x, int *y) > > { > > *x = 1; > > smp_mb(); > > WRITE_ONCE(*y, 1); > > } > > > > P1(int *x, int *y) > > { > > int r1; > > > > r1 = READ_ONCE(*y); > > if (r1) > > WRITE_ONCE(*x, 2); > > } > > > > exists (1:r1=1 /\ ~x=2) > > I ran the above examples too. Its a bit confusing to me why the WRITE_ONCE in > P0() is required, If the "WRITE_ONCE(*y, 1)" in P0 were written instead as "*y = 1", it would race with P1's "READ_ONCE(*y)". > and why would the READ_ONCE / WRITE_ONCE in P1() not be > sufficient to prevent the exists condition. Shouldn't the compiler know that, > in P0(), it should not reorder the store to y=1 before the x=1 because there > is an explicit barrier between the 2 stores? Looks me to me like a broken > compiler :-|. > > So I would have expected the following litmus to result in Never, but it > doesn't with Alan's patch: > > P0(int *x, int *y) > { > *x = 1; > smp_mb(); > *y = 1; > } > > P1(int *x, int *y) > { > int r1; > > r1 = READ_ONCE(*y); > if (r1) > WRITE_ONCE(*x, 2); > } > > exists (1:r1=1 /\ ~x=2) You have to realize that in the presence of a data race, all bets are off. The memory model will still output a prediction, but there is no guarantee that the prediction will be correct. 
In this case P0's write to y races with P1's READ_ONCE. Therefore the memory model may very will give an incorrect result. > > ------------------------------------------------------------------------ > > > > > If not, then put_pid() needs atomic_read_acquire() as it was proposed in that > > > discussion. > > > > Good point, let's try with smp_load_acquire() in P1(): > > > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-sla.litmus > > Test OlegNesterov-put_pid-WO-sla Allowed > > States 2 > > 1:r1=0; x=1; > > 1:r1=1; x=2; > > No > > Witnesses > > Positive: 0 Negative: 2 > > Condition exists (1:r1=1 /\ not (x=2)) > > Observation OlegNesterov-put_pid-WO-sla Never 0 2 > > Time OlegNesterov-put_pid-WO-sla 0.01 > > Hash=4fb0276eabf924793dec1970199db3a6 > > > > This also works. Here is the litmus test: > > > > ------------------------------------------------------------------------ > > C OlegNesterov-put_pid-WO-sla > > > > {} > > > > P0(int *x, int *y) > > { > > *x = 1; > > smp_mb(); > > WRITE_ONCE(*y, 1); > > } > > > > P1(int *x, int *y) > > { > > int r1; > > > > r1 = smp_load_acquire(y); > > if (r1) > > *x = 2; > > } > > > > exists (1:r1=1 /\ ~x=2) > > ------------------------------------------------------------------------ > > > > Demoting P0()'s WRITE_ONCE() to a plain write while leaving P1()'s > > smp_load_acquire() gets us a data race and allows the undesired > > outcome: > > Yeah, I think this is also what I was confused about above, is why is that > WRITE_ONCE required in P0() because there's already an smp_mb there. Surely > I'm missing something. ;-) A plain write to *y in P0 races with the smp_load_acquire in P1. That's all -- it's not very deep or subtle. Remember, the definition of a race is two concurrent accesses to the same variable from different CPUs, where at least one of the accesses is plain and at least one of them is a write. 
I've heard that people on the C++ Standards committee have proposed that plain writes should not race with marked reads. That is, when such concurrent accesses occur the outcome should be an undefined result for the marked read rather than undefined behavior. If this change gets adopted and we put it into the memory model, then your expectation would be correct. But as things stand, it isn't. Alan > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-sla.litmus > > Test OlegNesterov-put_pid-sla Allowed > > States 3 > > 1:r1=0; x=1; > > 1:r1=1; x=1; > > 1:r1=1; x=2; > > Ok > > Witnesses > > Positive: 1 Negative: 2 > > Flag data-race > > Condition exists (1:r1=1 /\ not (x=2)) > > Observation OlegNesterov-put_pid-sla Sometimes 1 2 > > Time OlegNesterov-put_pid-sla 0.01 > > Hash=ec6f71f3d9f7cd6e45a874c872e3d946 > > > > But what if you are certain that the compiler cannot mess up your use > > of plain C-language loads and stores? Then simply tell LKMM that they > > are READ_ONCE() and WRITE_ONCE(), respectively. LKMM is admittedly > > somewhat paranoid, but real C compilers really do tear stores of certain > > constants on systems (like x86) that have store-immediate instructions, > > so a bit of paranoia is not misplaced here. ;-) > > > > Plus please note that this patch to LKMM is prototype and thus subject > > to change. > > Ah I see. Appreciate if Alan can also CC me on future posting of this since > I'm quite interested. ;-) > > thanks, > > - Joel
On Fri, Mar 29, 2019 at 10:36:39PM -0400, Joel Fernandes wrote: > On Thu, Mar 28, 2019 at 10:37:07AM -0700, Paul E. McKenney wrote: > > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote: > > > On 03/28, Jann Horn wrote: > > > > > > > > Since we're just talking about RCU stuff now, adding Paul McKenney to > > > > the thread. > > > > > > Since you added Paul let me add more confusion to this thread ;) > > > > Woo-hoo!!! More confusion! Bring it on!!! ;-) > > Nice to take part in the confusion fun too!!! ;-) > > > > There were some concerns about the lack of barriers in put_pid(), but I can't > > > find that old discussion and I forgot the result of that discussion... > > > > > > Paul, could you confirm that this code > > > > > > CPU_0 CPU_1 > > > > > > X = 1; if (READ_ONCE(Y)) > > > mb(); X = 2; > > > Y = 1; BUG_ON(X != 2); > > > > > > > > > is correct? I think it is, control dependency pairs with mb(), right? > > > > The BUG_ON() is supposed to happen at the end of time, correct? > > As written, there is (in the strict sense) a data race between the load > > of X in the BUG_ON() and CPU_0's store to X. In a less strict sense, > > you could of course argue that this data race is harmless, especially > > if X is a single byte. But the more I talk to compiler writers, the > > less comfortable I become with data races in general. :-/ > > > > So I would also feel better if the "Y = 1" was WRITE_ONCE(). > > > > On the other hand, this is a great opportunity to try out Alan Stern's > > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)! > > > > https://lkml.kernel.org/r/Pine.LNX.4.44L0.1903191459270.1593-200000@iolanthe.rowland.org > > > > Also adding Alan on CC. 
> > > > Here is what I believe is the litmus test that your are interested in: > > > > ------------------------------------------------------------------------ > > C OlegNesterov-put_pid > > > > {} > > > > P0(int *x, int *y) > > { > > *x = 1; > > smp_mb(); > > *y = 1; > > } > > > > P1(int *x, int *y) > > { > > int r1; > > > > r1 = READ_ONCE(*y); > > if (r1) > > *x = 2; > > } > > > > exists (1:r1=1 /\ ~x=2) > > ------------------------------------------------------------------------ > > > > Running this through herd with Alan's patch detects the data race > > and says that the undesired outcome is allowed: > > > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus > > Test OlegNesterov-put_pid Allowed > > States 3 > > 1:r1=0; x=1; > > 1:r1=1; x=1; > > 1:r1=1; x=2; > > Ok > > Witnesses > > Positive: 1 Negative: 2 > > Flag data-race > > Condition exists (1:r1=1 /\ not (x=2)) > > Observation OlegNesterov-put_pid Sometimes 1 2 > > Time OlegNesterov-put_pid 0.00 > > Hash=a3e0043ad753effa860fea37eeba0a76 > > > > Using WRITE_ONCE() for P0()'s store to y still allows this outcome, > > although it does remove the "Flag data-race". 
> > > > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x > > gets rid of both the "Flag data-race" and the undesired outcome: > > > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus > > Test OlegNesterov-put_pid-WO-WO Allowed > > States 2 > > 1:r1=0; x=1; > > 1:r1=1; x=2; > > No > > Witnesses > > Positive: 0 Negative: 2 > > Condition exists (1:r1=1 /\ not (x=2)) > > Observation OlegNesterov-put_pid-WO-WO Never 0 2 > > Time OlegNesterov-put_pid-WO-WO 0.01 > > Hash=6e1643e3c5e4739b590bde0a8e8a918e > > > > Here is the corresponding litmus test, in case I messed something up: > > > > ------------------------------------------------------------------------ > > C OlegNesterov-put_pid-WO-WO > > > > {} > > > > P0(int *x, int *y) > > { > > *x = 1; > > smp_mb(); > > WRITE_ONCE(*y, 1); > > } > > > > P1(int *x, int *y) > > { > > int r1; > > > > r1 = READ_ONCE(*y); > > if (r1) > > WRITE_ONCE(*x, 2); > > } > > > > exists (1:r1=1 /\ ~x=2) > > I ran the above examples too. Its a bit confusing to me why the WRITE_ONCE in > P0() is required, and why would the READ_ONCE / WRITE_ONCE in P1() not be > sufficient to prevent the exists condition. Shouldn't the compiler know that, > in P0(), it should not reorder the store to y=1 before the x=1 because there > is an explicit barrier between the 2 stores? Looks me to me like a broken > compiler :-|. > > So I would have expected the following litmus to result in Never, but it > doesn't with Alan's patch: > > P0(int *x, int *y) > { > *x = 1; > smp_mb(); > *y = 1; > } > > P1(int *x, int *y) > { > int r1; > > r1 = READ_ONCE(*y); > if (r1) > WRITE_ONCE(*x, 2); > } > > exists (1:r1=1 /\ ~x=2) The problem is that the compiler can turn both of P0()'s writes into reads: P0(int *x, int *y) { if (*x != 1) *x = 1; smp_mb(); if (*y != 1) *y = 1; } These reads will not be ordered by smp_wmb(), so you have to do WRITE_ONCE() to get an iron-clad ordering guarantee. 
> > ------------------------------------------------------------------------ > > > > > If not, then put_pid() needs atomic_read_acquire() as it was proposed in that > > > discussion. > > > > Good point, let's try with smp_load_acquire() in P1(): > > > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-sla.litmus > > Test OlegNesterov-put_pid-WO-sla Allowed > > States 2 > > 1:r1=0; x=1; > > 1:r1=1; x=2; > > No > > Witnesses > > Positive: 0 Negative: 2 > > Condition exists (1:r1=1 /\ not (x=2)) > > Observation OlegNesterov-put_pid-WO-sla Never 0 2 > > Time OlegNesterov-put_pid-WO-sla 0.01 > > Hash=4fb0276eabf924793dec1970199db3a6 > > > > This also works. Here is the litmus test: > > > > ------------------------------------------------------------------------ > > C OlegNesterov-put_pid-WO-sla > > > > {} > > > > P0(int *x, int *y) > > { > > *x = 1; > > smp_mb(); > > WRITE_ONCE(*y, 1); > > } > > > > P1(int *x, int *y) > > { > > int r1; > > > > r1 = smp_load_acquire(y); > > if (r1) > > *x = 2; > > } > > > > exists (1:r1=1 /\ ~x=2) > > ------------------------------------------------------------------------ > > > > Demoting P0()'s WRITE_ONCE() to a plain write while leaving P1()'s > > smp_load_acquire() gets us a data race and allows the undesired > > outcome: > > Yeah, I think this is also what I was confused about above, is why is that > WRITE_ONCE required in P0() because there's already an smp_mb there. Surely > I'm missing something. ;-) The first of P0()'s writes can be a plain write, at least assuming sufficient synchronization to avoid the data race. But turning the second of P0()'s writes into a plain write is a bit riskier: That is a write of a constant, and those really are torn in some cases on some architectures. Like x86, for example. 
> > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-sla.litmus
> > Test OlegNesterov-put_pid-sla Allowed
> > States 3
> > 1:r1=0; x=1;
> > 1:r1=1; x=1;
> > 1:r1=1; x=2;
> > Ok
> > Witnesses
> > Positive: 1 Negative: 2
> > Flag data-race
> > Condition exists (1:r1=1 /\ not (x=2))
> > Observation OlegNesterov-put_pid-sla Sometimes 1 2
> > Time OlegNesterov-put_pid-sla 0.01
> > Hash=ec6f71f3d9f7cd6e45a874c872e3d946
> >
> > But what if you are certain that the compiler cannot mess up your use
> > of plain C-language loads and stores? Then simply tell LKMM that they
> > are READ_ONCE() and WRITE_ONCE(), respectively. LKMM is admittedly
> > somewhat paranoid, but real C compilers really do tear stores of certain
> > constants on systems (like x86) that have store-immediate instructions,
> > so a bit of paranoia is not misplaced here. ;-)
> >
> > Plus please note that this patch to LKMM is prototype and thus subject
> > to change.
>
> Ah I see. Appreciate if Alan can also CC me on future posting of this since
> I'm quite interested. ;-)

His last posting should be easy to find. But please let me know if not,
as I would be happy to send it along.

							Thanx, Paul
On Sat, Mar 30, 2019 at 11:16:01AM -0400, Alan Stern wrote: > On Fri, 29 Mar 2019, Joel Fernandes wrote: > > On Thu, Mar 28, 2019 at 10:37:07AM -0700, Paul E. McKenney wrote: > > > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote: > > > > On 03/28, Jann Horn wrote: > > > > > > > > > > Since we're just talking about RCU stuff now, adding Paul McKenney to > > > > > the thread. > > > > > > > > Since you added Paul let me add more confusion to this thread ;) > > > > > > Woo-hoo!!! More confusion! Bring it on!!! ;-) > > > > Nice to take part in the confusion fun too!!! ;-) > > > > > > There were some concerns about the lack of barriers in put_pid(), but I can't > > > > find that old discussion and I forgot the result of that discussion... > > > > > > > > Paul, could you confirm that this code > > > > > > > > CPU_0 CPU_1 > > > > > > > > X = 1; if (READ_ONCE(Y)) > > > > mb(); X = 2; > > > > Y = 1; BUG_ON(X != 2); > > > > > > > > > > > > is correct? I think it is, control dependency pairs with mb(), right? > > > > > > The BUG_ON() is supposed to happen at the end of time, correct? > > > As written, there is (in the strict sense) a data race between the load > > > of X in the BUG_ON() and CPU_0's store to X. In a less strict sense, > > > you could of course argue that this data race is harmless, especially > > > if X is a single byte. But the more I talk to compiler writers, the > > > less comfortable I become with data races in general. :-/ > > > > > > So I would also feel better if the "Y = 1" was WRITE_ONCE(). > > > > > > On the other hand, this is a great opportunity to try out Alan Stern's > > > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)! > > > > > > https://lkml.kernel.org/r/Pine.LNX.4.44L0.1903191459270.1593-200000@iolanthe.rowland.org > > > > > > Also adding Alan on CC. 
> > > > > > Here is what I believe is the litmus test that your are interested in: > > > > > > ------------------------------------------------------------------------ > > > C OlegNesterov-put_pid > > > > > > {} > > > > > > P0(int *x, int *y) > > > { > > > *x = 1; > > > smp_mb(); > > > *y = 1; > > > } > > > > > > P1(int *x, int *y) > > > { > > > int r1; > > > > > > r1 = READ_ONCE(*y); > > > if (r1) > > > *x = 2; > > > } > > > > > > exists (1:r1=1 /\ ~x=2) > > > ------------------------------------------------------------------------ > > > > > > Running this through herd with Alan's patch detects the data race > > > and says that the undesired outcome is allowed: > > > > > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus > > > Test OlegNesterov-put_pid Allowed > > > States 3 > > > 1:r1=0; x=1; > > > 1:r1=1; x=1; > > > 1:r1=1; x=2; > > > Ok > > > Witnesses > > > Positive: 1 Negative: 2 > > > Flag data-race > > > Condition exists (1:r1=1 /\ not (x=2)) > > > Observation OlegNesterov-put_pid Sometimes 1 2 > > > Time OlegNesterov-put_pid 0.00 > > > Hash=a3e0043ad753effa860fea37eeba0a76 > > > > > > Using WRITE_ONCE() for P0()'s store to y still allows this outcome, > > > although it does remove the "Flag data-race". 
> > > > > > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x > > > gets rid of both the "Flag data-race" and the undesired outcome: > > > > > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus > > > Test OlegNesterov-put_pid-WO-WO Allowed > > > States 2 > > > 1:r1=0; x=1; > > > 1:r1=1; x=2; > > > No > > > Witnesses > > > Positive: 0 Negative: 2 > > > Condition exists (1:r1=1 /\ not (x=2)) > > > Observation OlegNesterov-put_pid-WO-WO Never 0 2 > > > Time OlegNesterov-put_pid-WO-WO 0.01 > > > Hash=6e1643e3c5e4739b590bde0a8e8a918e > > > > > > Here is the corresponding litmus test, in case I messed something up: > > > > > > ------------------------------------------------------------------------ > > > C OlegNesterov-put_pid-WO-WO > > > > > > {} > > > > > > P0(int *x, int *y) > > > { > > > *x = 1; > > > smp_mb(); > > > WRITE_ONCE(*y, 1); > > > } > > > > > > P1(int *x, int *y) > > > { > > > int r1; > > > > > > r1 = READ_ONCE(*y); > > > if (r1) > > > WRITE_ONCE(*x, 2); > > > } > > > > > > exists (1:r1=1 /\ ~x=2) > > > > I ran the above examples too. Its a bit confusing to me why the WRITE_ONCE in > > P0() is required, > > If the "WRITE_ONCE(*y, 1)" in P0 were written instead as "*y = 1", it > would race with P1's "READ_ONCE(*y)". > > > and why would the READ_ONCE / WRITE_ONCE in P1() not be > > sufficient to prevent the exists condition. Shouldn't the compiler know that, > > in P0(), it should not reorder the store to y=1 before the x=1 because there > > is an explicit barrier between the 2 stores? Looks me to me like a broken > > compiler :-|. 
> > > > So I would have expected the following litmus to result in Never, but it > > doesn't with Alan's patch: > > > > P0(int *x, int *y) > > { > > *x = 1; > > smp_mb(); > > *y = 1; > > } > > > > P1(int *x, int *y) > > { > > int r1; > > > > r1 = READ_ONCE(*y); > > if (r1) > > WRITE_ONCE(*x, 2); > > } > > > > exists (1:r1=1 /\ ~x=2) > > You have to realize that in the presence of a data race, all bets are > off. The memory model will still output a prediction, but there is no > guarantee that the prediction will be correct. > > In this case P0's write to y races with P1's READ_ONCE. Therefore the > memory model may very will give an incorrect result. > > > > ------------------------------------------------------------------------ > > > > > > > If not, then put_pid() needs atomic_read_acquire() as it was proposed in that > > > > discussion. > > > > > > Good point, let's try with smp_load_acquire() in P1(): > > > > > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-sla.litmus > > > Test OlegNesterov-put_pid-WO-sla Allowed > > > States 2 > > > 1:r1=0; x=1; > > > 1:r1=1; x=2; > > > No > > > Witnesses > > > Positive: 0 Negative: 2 > > > Condition exists (1:r1=1 /\ not (x=2)) > > > Observation OlegNesterov-put_pid-WO-sla Never 0 2 > > > Time OlegNesterov-put_pid-WO-sla 0.01 > > > Hash=4fb0276eabf924793dec1970199db3a6 > > > > > > This also works. 
> > > Here is the litmus test:
> > >
> > > ------------------------------------------------------------------------
> > > C OlegNesterov-put_pid-WO-sla
> > >
> > > {}
> > >
> > > P0(int *x, int *y)
> > > {
> > > 	*x = 1;
> > > 	smp_mb();
> > > 	WRITE_ONCE(*y, 1);
> > > }
> > >
> > > P1(int *x, int *y)
> > > {
> > > 	int r1;
> > >
> > > 	r1 = smp_load_acquire(y);
> > > 	if (r1)
> > > 		*x = 2;
> > > }
> > >
> > > exists (1:r1=1 /\ ~x=2)
> > > ------------------------------------------------------------------------
> > >
> > > Demoting P0()'s WRITE_ONCE() to a plain write while leaving P1()'s
> > > smp_load_acquire() gets us a data race and allows the undesired
> > > outcome:
> >
> > Yeah, I think this is also what I was confused about above, is why is that
> > WRITE_ONCE required in P0() because there's already an smp_mb there. Surely
> > I'm missing something. ;-)
>
> A plain write to *y in P0 races with the smp_load_acquire in P1.
> That's all -- it's not very deep or subtle. Remember, the definition
> of a race is two concurrent accesses to the same variable from
> different CPUs, where at least one of the accesses is plain and at
> least one of them is a write.
>
> I've heard that people on the C++ Standards committee have proposed
> that plain writes should not race with marked reads. That is, when
> such concurrent accesses occur the outcome should be an undefined
> result for the marked read rather than undefined behavior. If this
> change gets adopted and we put it into the memory model, then your
> expectation would be correct. But as things stand, it isn't.

At least in the case where the marking is a volatile, yes. But there
might be some distance between a proposal and an actual change to the
standard. ;-)

There is also a proposal for a memcpy-like thing that acts as plain
loads and stores, but is defined not to data-race with marked accesses.
However, tearing and so on is possible, so if you do race with marked
accesses, you might see surprising results.

							Thanx, Paul
From: Alan Stern
> Sent: 29 March 2019 19:45
...
> There is a big difference between WRITE_ONCE() and plain assignment.
> Given "WRITE_ONCE(X, 2)", the compiler will emit a simple store
> instruction. But given "X = 2", the compiler is allowed to emit
> instructions equivalent to:
>
> 	if (X != 2)
> 		X = 2;

Worse for you, it can also emit:

	X = 0;
	X = 2;

Many years ago I fell foul of a compiler (not C) that implemented a write
to a 2-bit-wide bitfield as:

	X &= ~3
	X |= value

even when 'value' was a compile-time constant of 3.
Took a while to find out why the linked list got f*cked.

	David
Thanks a lot Alan and Paul for the replies. I replied inline. On Sun, Mar 31, 2019 at 02:55:31PM -0700, Paul E. McKenney wrote: > On Fri, Mar 29, 2019 at 10:36:39PM -0400, Joel Fernandes wrote: > > On Thu, Mar 28, 2019 at 10:37:07AM -0700, Paul E. McKenney wrote: > > > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote: > > > > On 03/28, Jann Horn wrote: > > > > > > > > > > Since we're just talking about RCU stuff now, adding Paul McKenney to > > > > > the thread. > > > > > > > > Since you added Paul let me add more confusion to this thread ;) > > > > > > Woo-hoo!!! More confusion! Bring it on!!! ;-) > > > > Nice to take part in the confusion fun too!!! ;-) > > > > > > There were some concerns about the lack of barriers in put_pid(), but I can't > > > > find that old discussion and I forgot the result of that discussion... > > > > > > > > Paul, could you confirm that this code > > > > > > > > CPU_0 CPU_1 > > > > > > > > X = 1; if (READ_ONCE(Y)) > > > > mb(); X = 2; > > > > Y = 1; BUG_ON(X != 2); > > > > > > > > > > > > is correct? I think it is, control dependency pairs with mb(), right? > > > > > > The BUG_ON() is supposed to happen at the end of time, correct? > > > As written, there is (in the strict sense) a data race between the load > > > of X in the BUG_ON() and CPU_0's store to X. In a less strict sense, > > > you could of course argue that this data race is harmless, especially > > > if X is a single byte. But the more I talk to compiler writers, the > > > less comfortable I become with data races in general. :-/ > > > > > > So I would also feel better if the "Y = 1" was WRITE_ONCE(). > > > > > > On the other hand, this is a great opportunity to try out Alan Stern's > > > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)! > > > > > > https://lkml.kernel.org/r/Pine.LNX.4.44L0.1903191459270.1593-200000@iolanthe.rowland.org > > > > > > Also adding Alan on CC. 
> > > > > > Here is what I believe is the litmus test that your are interested in: > > > > > > ------------------------------------------------------------------------ > > > C OlegNesterov-put_pid > > > > > > {} > > > > > > P0(int *x, int *y) > > > { > > > *x = 1; > > > smp_mb(); > > > *y = 1; > > > } > > > > > > P1(int *x, int *y) > > > { > > > int r1; > > > > > > r1 = READ_ONCE(*y); > > > if (r1) > > > *x = 2; > > > } > > > > > > exists (1:r1=1 /\ ~x=2) > > > ------------------------------------------------------------------------ > > > > > > Running this through herd with Alan's patch detects the data race > > > and says that the undesired outcome is allowed: > > > > > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus > > > Test OlegNesterov-put_pid Allowed > > > States 3 > > > 1:r1=0; x=1; > > > 1:r1=1; x=1; > > > 1:r1=1; x=2; > > > Ok > > > Witnesses > > > Positive: 1 Negative: 2 > > > Flag data-race > > > Condition exists (1:r1=1 /\ not (x=2)) > > > Observation OlegNesterov-put_pid Sometimes 1 2 > > > Time OlegNesterov-put_pid 0.00 > > > Hash=a3e0043ad753effa860fea37eeba0a76 > > > > > > Using WRITE_ONCE() for P0()'s store to y still allows this outcome, > > > although it does remove the "Flag data-race". 
> > > > > > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x > > > gets rid of both the "Flag data-race" and the undesired outcome: > > > > > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus > > > Test OlegNesterov-put_pid-WO-WO Allowed > > > States 2 > > > 1:r1=0; x=1; > > > 1:r1=1; x=2; > > > No > > > Witnesses > > > Positive: 0 Negative: 2 > > > Condition exists (1:r1=1 /\ not (x=2)) > > > Observation OlegNesterov-put_pid-WO-WO Never 0 2 > > > Time OlegNesterov-put_pid-WO-WO 0.01 > > > Hash=6e1643e3c5e4739b590bde0a8e8a918e > > > > > > Here is the corresponding litmus test, in case I messed something up: > > > > > > ------------------------------------------------------------------------ > > > C OlegNesterov-put_pid-WO-WO > > > > > > {} > > > > > > P0(int *x, int *y) > > > { > > > *x = 1; > > > smp_mb(); > > > WRITE_ONCE(*y, 1); > > > } > > > > > > P1(int *x, int *y) > > > { > > > int r1; > > > > > > r1 = READ_ONCE(*y); > > > if (r1) > > > WRITE_ONCE(*x, 2); > > > } > > > > > > exists (1:r1=1 /\ ~x=2) > > > > I ran the above examples too. Its a bit confusing to me why the WRITE_ONCE in > > P0() is required, and why would the READ_ONCE / WRITE_ONCE in P1() not be > > sufficient to prevent the exists condition. Shouldn't the compiler know that, > > in P0(), it should not reorder the store to y=1 before the x=1 because there > > is an explicit barrier between the 2 stores? Looks me to me like a broken > > compiler :-|. 
> >
> > So I would have expected the following litmus to result in Never, but it
> > doesn't with Alan's patch:
> >
> > P0(int *x, int *y)
> > {
> >         *x = 1;
> >         smp_mb();
> >         *y = 1;
> > }
> >
> > P1(int *x, int *y)
> > {
> >         int r1;
> >
> >         r1 = READ_ONCE(*y);
> >         if (r1)
> >                 WRITE_ONCE(*x, 2);
> > }
> >
> > exists (1:r1=1 /\ ~x=2)
>
> The problem is that the compiler can turn both of P0()'s writes into reads:
>
> P0(int *x, int *y)
> {
>         if (*x != 1)
>                 *x = 1;
>         smp_mb();
>         if (*y != 1)
>                 *y = 1;
> }
>
> These reads will not be ordered by smp_wmb(), so you have to do WRITE_ONCE()
> to get an iron-clad ordering guarantee.

But the snippet above has smp_mb() which does order writes, even for the
plain accesses.

> > > ------------------------------------------------------------------------
> > >
> > > > If not, then put_pid() needs atomic_read_acquire() as it was proposed in that
> > > > discussion.
> > >
> > > Good point, let's try with smp_load_acquire() in P1():
> > >
> > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-sla.litmus
> > > Test OlegNesterov-put_pid-WO-sla Allowed
> > > States 2
> > > 1:r1=0; x=1;
> > > 1:r1=1; x=2;
> > > No
> > > Witnesses
> > > Positive: 0 Negative: 2
> > > Condition exists (1:r1=1 /\ not (x=2))
> > > Observation OlegNesterov-put_pid-WO-sla Never 0 2
> > > Time OlegNesterov-put_pid-WO-sla 0.01
> > > Hash=4fb0276eabf924793dec1970199db3a6
> > >
> > > This also works. Here is the litmus test:
> > >
> > > ------------------------------------------------------------------------
> > > C OlegNesterov-put_pid-WO-sla
> > >
> > > {}
> > >
> > > P0(int *x, int *y)
> > > {
> > >         *x = 1;
> > >         smp_mb();
> > >         WRITE_ONCE(*y, 1);
> > > }
> > >
> > > P1(int *x, int *y)
> > > {
> > >         int r1;
> > >
> > >         r1 = smp_load_acquire(y);
> > >         if (r1)
> > >                 *x = 2;
> > > }
> > >
> > > exists (1:r1=1 /\ ~x=2)
> > > ------------------------------------------------------------------------
> > >
> > > Demoting P0()'s WRITE_ONCE() to a plain write while leaving P1()'s
> > > smp_load_acquire() gets us a data race and allows the undesired
> > > outcome:
> >
> > Yeah, I think this is also what I was confused about above, is why is that
> > WRITE_ONCE required in P0() because there's already an smp_mb there. Surely
> > I'm missing something. ;-)
>
> The first of P0()'s writes can be a plain write, at least assuming
> sufficient synchronization to avoid the data race. But turning the second
> of P0()'s writes into a plain write is a bit riskier: That is a write of
> a constant, and those really are torn in some cases on some architectures.
> Like x86, for example.

I understand. Are we talking about load/store tearing being the issue here?
Even under load/store tearing, I feel the program will produce correct
results because r1 is either 0 or 1 (a single bit cannot be torn).

Further, from the herd simulator output (below), according to the "States",
r1==1 means P1() AFAICS would have already finished the read and set the
r1 register to 1. Then I am wondering why it couldn't take the branch to set
*x to 2. According to herd, r1 == 1 AND x == 1 is a perfectly valid state
for the below program. I still couldn't see in my mind how for the below
program, this is possible - in terms of compiler optimizations or other kinds
of ordering. Because there is a smp_mb() between the 2 plain writes in P0()
and P1() did establish that r1 is 1 in the positive case. :-/. I am surely
missing something :-)

---8<-----------------------
C Joel-put_pid

{}

P0(int *x, int *y)
{
        *x = 1;
        smp_mb();
        *y = 1;
}

P1(int *x, int *y)
{
        int r1;

        r1 = READ_ONCE(*y);
        if (r1)
                WRITE_ONCE(*x, 2);
}

exists (1:r1=1 /\ ~x=2)

---8<-----------------------
Output:

Test Joel-put_pid Allowed
States 3
1:r1=0; x=1;
1:r1=1; x=1;  <-- Can't figure out why r1=1 and x != 2 here.
1:r1=1; x=2;
Ok
Witnesses
Positive: 1 Negative: 2
Flag data-race
Condition exists (1:r1=1 /\ not (x=2))
Observation Joel-put_pid Sometimes 1 2
Time OlegNesterov-put_pid-WO-WO 0.01
Hash=c7bdd50418d42779b7c10ba9128369df

> > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-sla.litmus
> > > Test OlegNesterov-put_pid-sla Allowed
> > > States 3
> > > 1:r1=0; x=1;
> > > 1:r1=1; x=1;
> > > 1:r1=1; x=2;
> > > Ok
> > > Witnesses
> > > Positive: 1 Negative: 2
> > > Flag data-race
> > > Condition exists (1:r1=1 /\ not (x=2))
> > > Observation OlegNesterov-put_pid-sla Sometimes 1 2
> > > Time OlegNesterov-put_pid-sla 0.01
> > > Hash=ec6f71f3d9f7cd6e45a874c872e3d946
> > >
> > > But what if you are certain that the compiler cannot mess up your use
> > > of plain C-language loads and stores? Then simply tell LKMM that they
> > > are READ_ONCE() and WRITE_ONCE(), respectively. LKMM is admittedly
> > > somewhat paranoid, but real C compilers really do tear stores of certain
> > > constants on systems (like x86) that have store-immediate instructions,
> > > so a bit of paranoia is not misplaced here. ;-)
> > >
> > > Plus please note that this patch to LKMM is prototype and thus subject
> > > to change.
> >
> > Ah I see. Appreciate if Alan can also CC me on future posting of this since
> > I'm quite interested. ;-)
>
> His last posting should be easy to find. But please let me know if not,
> as I would be happy to send it along.

I found it and I'm going through it. Thanks!

 - Joel
On Mon, Apr 01, 2019 at 05:11:39PM -0400, Joel Fernandes wrote:
> Thanks a lot Alan and Paul for the replies. I replied inline.
>
> On Sun, Mar 31, 2019 at 02:55:31PM -0700, Paul E. McKenney wrote:
> > On Fri, Mar 29, 2019 at 10:36:39PM -0400, Joel Fernandes wrote:
> > > On Thu, Mar 28, 2019 at 10:37:07AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> > > > > On 03/28, Jann Horn wrote:
> > > > > >
> > > > > > Since we're just talking about RCU stuff now, adding Paul McKenney to
> > > > > > the thread.
> > > > >
> > > > > Since you added Paul let me add more confusion to this thread ;)
> > > >
> > > > Woo-hoo!!! More confusion! Bring it on!!! ;-)
> > >
> > > Nice to take part in the confusion fun too!!! ;-)
> > >
> > > > > There were some concerns about the lack of barriers in put_pid(), but I can't
> > > > > find that old discussion and I forgot the result of that discussion...
> > > > >
> > > > > Paul, could you confirm that this code
> > > > >
> > > > >         CPU_0                   CPU_1
> > > > >
> > > > >         X = 1;                  if (READ_ONCE(Y))
> > > > >         mb();                           X = 2;
> > > > >         Y = 1;                  BUG_ON(X != 2);
> > > > >
> > > > >
> > > > > is correct? I think it is, control dependency pairs with mb(), right?
> > > >
> > > > The BUG_ON() is supposed to happen at the end of time, correct?
> > > > As written, there is (in the strict sense) a data race between the load
> > > > of X in the BUG_ON() and CPU_0's store to X. In a less strict sense,
> > > > you could of course argue that this data race is harmless, especially
> > > > if X is a single byte. But the more I talk to compiler writers, the
> > > > less comfortable I become with data races in general. :-/
> > > >
> > > > So I would also feel better if the "Y = 1" was WRITE_ONCE().
> > > >
> > > > On the other hand, this is a great opportunity to try out Alan Stern's
> > > > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!
> > > >
> > > > https://lkml.kernel.org/r/Pine.LNX.4.44L0.1903191459270.1593-200000@iolanthe.rowland.org
> > > >
> > > > Also adding Alan on CC.
> > > >
> > > > Here is what I believe is the litmus test that you are interested in:
> > > >
> > > > ------------------------------------------------------------------------
> > > > C OlegNesterov-put_pid
> > > >
> > > > {}
> > > >
> > > > P0(int *x, int *y)
> > > > {
> > > >         *x = 1;
> > > >         smp_mb();
> > > >         *y = 1;
> > > > }
> > > >
> > > > P1(int *x, int *y)
> > > > {
> > > >         int r1;
> > > >
> > > >         r1 = READ_ONCE(*y);
> > > >         if (r1)
> > > >                 *x = 2;
> > > > }
> > > >
> > > > exists (1:r1=1 /\ ~x=2)
> > > > ------------------------------------------------------------------------
> > > >
> > > > Running this through herd with Alan's patch detects the data race
> > > > and says that the undesired outcome is allowed:
> > > >
> > > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus
> > > > Test OlegNesterov-put_pid Allowed
> > > > States 3
> > > > 1:r1=0; x=1;
> > > > 1:r1=1; x=1;
> > > > 1:r1=1; x=2;
> > > > Ok
> > > > Witnesses
> > > > Positive: 1 Negative: 2
> > > > Flag data-race
> > > > Condition exists (1:r1=1 /\ not (x=2))
> > > > Observation OlegNesterov-put_pid Sometimes 1 2
> > > > Time OlegNesterov-put_pid 0.00
> > > > Hash=a3e0043ad753effa860fea37eeba0a76
> > > >
> > > > Using WRITE_ONCE() for P0()'s store to y still allows this outcome,
> > > > although it does remove the "Flag data-race".
> > > >
> > > > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x
> > > > gets rid of both the "Flag data-race" and the undesired outcome:
> > > >
> > > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus
> > > > Test OlegNesterov-put_pid-WO-WO Allowed
> > > > States 2
> > > > 1:r1=0; x=1;
> > > > 1:r1=1; x=2;
> > > > No
> > > > Witnesses
> > > > Positive: 0 Negative: 2
> > > > Condition exists (1:r1=1 /\ not (x=2))
> > > > Observation OlegNesterov-put_pid-WO-WO Never 0 2
> > > > Time OlegNesterov-put_pid-WO-WO 0.01
> > > > Hash=6e1643e3c5e4739b590bde0a8e8a918e
> > > >
> > > > Here is the corresponding litmus test, in case I messed something up:
> > > >
> > > > ------------------------------------------------------------------------
> > > > C OlegNesterov-put_pid-WO-WO
> > > >
> > > > {}
> > > >
> > > > P0(int *x, int *y)
> > > > {
> > > >         *x = 1;
> > > >         smp_mb();
> > > >         WRITE_ONCE(*y, 1);
> > > > }
> > > >
> > > > P1(int *x, int *y)
> > > > {
> > > >         int r1;
> > > >
> > > >         r1 = READ_ONCE(*y);
> > > >         if (r1)
> > > >                 WRITE_ONCE(*x, 2);
> > > > }
> > > >
> > > > exists (1:r1=1 /\ ~x=2)
> > >
> > > I ran the above examples too. It's a bit confusing to me why the WRITE_ONCE in
> > > P0() is required, and why would the READ_ONCE / WRITE_ONCE in P1() not be
> > > sufficient to prevent the exists condition. Shouldn't the compiler know that,
> > > in P0(), it should not reorder the store to y=1 before the x=1 because there
> > > is an explicit barrier between the 2 stores? Looks to me like a broken
> > > compiler :-|.
> > >
> > > So I would have expected the following litmus to result in Never, but it
> > > doesn't with Alan's patch:
> > >
> > > P0(int *x, int *y)
> > > {
> > >         *x = 1;
> > >         smp_mb();
> > >         *y = 1;
> > > }
> > >
> > > P1(int *x, int *y)
> > > {
> > >         int r1;
> > >
> > >         r1 = READ_ONCE(*y);
> > >         if (r1)
> > >                 WRITE_ONCE(*x, 2);
> > > }
> > >
> > > exists (1:r1=1 /\ ~x=2)
> >
> > The problem is that the compiler can turn both of P0()'s writes into reads:
> >
> > P0(int *x, int *y)
> > {
> >         if (*x != 1)
> >                 *x = 1;
> >         smp_mb();
> >         if (*y != 1)
> >                 *y = 1;
> > }
> >
> > These reads will not be ordered by smp_wmb(), so you have to do WRITE_ONCE()
> > to get an iron-clad ordering guarantee.
>
> But the snippet above has smp_mb() which does order writes, even for the
> plain accesses.

True, but that code has a data race, namely P0()'s plain write to y
races with P1()'s READ_ONCE(). Data races give the compiler absolutely
astonishing levels of freedom to rearrange your code. So, if you
as a developer or maintainer choose to have data races, it is your
responsibility to ensure that the compiler cannot mess you up. So what
you should do in that case is to list the transformed code the compiler's
optimizations can produce and feed the corresponding litmus tests to LKMM,
using READ_ONCE() and WRITE_ONCE() for the post-optimization accesses.

> > > > ------------------------------------------------------------------------
> > > >
> > > > > If not, then put_pid() needs atomic_read_acquire() as it was proposed in that
> > > > > discussion.
> > > >
> > > > Good point, let's try with smp_load_acquire() in P1():
> > > >
> > > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-sla.litmus
> > > > Test OlegNesterov-put_pid-WO-sla Allowed
> > > > States 2
> > > > 1:r1=0; x=1;
> > > > 1:r1=1; x=2;
> > > > No
> > > > Witnesses
> > > > Positive: 0 Negative: 2
> > > > Condition exists (1:r1=1 /\ not (x=2))
> > > > Observation OlegNesterov-put_pid-WO-sla Never 0 2
> > > > Time OlegNesterov-put_pid-WO-sla 0.01
> > > > Hash=4fb0276eabf924793dec1970199db3a6
> > > >
> > > > This also works. Here is the litmus test:
> > > >
> > > > ------------------------------------------------------------------------
> > > > C OlegNesterov-put_pid-WO-sla
> > > >
> > > > {}
> > > >
> > > > P0(int *x, int *y)
> > > > {
> > > >         *x = 1;
> > > >         smp_mb();
> > > >         WRITE_ONCE(*y, 1);
> > > > }
> > > >
> > > > P1(int *x, int *y)
> > > > {
> > > >         int r1;
> > > >
> > > >         r1 = smp_load_acquire(y);
> > > >         if (r1)
> > > >                 *x = 2;
> > > > }
> > > >
> > > > exists (1:r1=1 /\ ~x=2)
> > > > ------------------------------------------------------------------------
> > > >
> > > > Demoting P0()'s WRITE_ONCE() to a plain write while leaving P1()'s
> > > > smp_load_acquire() gets us a data race and allows the undesired
> > > > outcome:
> > >
> > > Yeah, I think this is also what I was confused about above, is why is that
> > > WRITE_ONCE required in P0() because there's already an smp_mb there. Surely
> > > I'm missing something. ;-)
> >
> > The first of P0()'s writes can be a plain write, at least assuming
> > sufficient synchronization to avoid the data race. But turning the second
> > of P0()'s writes into a plain write is a bit riskier: That is a write of
> > a constant, and those really are torn in some cases on some architectures.
> > Like x86, for example.
>
> I understand. Are we talking about load/store tearing being the issue here?
> Even under load/store tearing, I feel the program will produce correct
> results because r1 is either 0 or 1 (a single bit cannot be torn).

The compiler can (at least in theory) also transform *y = 1 as follows:

        *y = r1;  /* Use *y as temporary storage. */
        do_something();
        r1 = *y;
        *y = 1;

Here r1 is some register and do_something() is an inline function visible
to the compiler (hence not implying barrier() upon call and return).

I don't know of any compilers that actually do this, but for me use
of WRITE_ONCE() is a small price to pay to prevent all past, present,
and future compilers from ever destroying^W optimizing my code in this way.

In your own code, if you dislike WRITE_ONCE() so much that you are willing
to commit to (now and forever!!!) checking all applicable versions of
compilers to make sure that they don't do this, well and good, knock
yourself out. But it is your responsibility to do that checking, and you
can attest to LKMM that you have done so by giving LKMM litmus tests that
substitute WRITE_ONCE() for that plain C-language assignment statement.

> Further, from the herd simulator output (below), according to the "States",
> r1==1 means P1() AFAICS would have already finished the read and set the
> r1 register to 1. Then I am wondering why it couldn't take the branch to set
> *x to 2. According to herd, r1 == 1 AND x == 1 is a perfectly valid state
> for the below program. I still couldn't see in my mind how for the below
> program, this is possible - in terms of compiler optimizations or other kinds
> of ordering. Because there is a smp_mb() between the 2 plain writes in P0()
> and P1() did establish that r1 is 1 in the positive case. :-/. I am surely
> missing something :-)
>
> ---8<-----------------------
> C Joel-put_pid
>
> {}
>
> P0(int *x, int *y)
> {
>         *x = 1;
>         smp_mb();
>         *y = 1;
> }
>
> P1(int *x, int *y)
> {
>         int r1;
>
>         r1 = READ_ONCE(*y);
>         if (r1)
>                 WRITE_ONCE(*x, 2);
> }
>
> exists (1:r1=1 /\ ~x=2)
>
> ---8<-----------------------
> Output:
>
> Test Joel-put_pid Allowed
> States 3
> 1:r1=0; x=1;
> 1:r1=1; x=1;  <-- Can't figure out why r1=1 and x != 2 here.

I must defer to Alan on this, but my guess is that this is due to
the fact that there is a data race.

                                                        Thanx, Paul

> 1:r1=1; x=2;
> Ok
> Witnesses
> Positive: 1 Negative: 2
> Flag data-race
> Condition exists (1:r1=1 /\ not (x=2))
> Observation Joel-put_pid Sometimes 1 2
> Time OlegNesterov-put_pid-WO-WO 0.01
> Hash=c7bdd50418d42779b7c10ba9128369df
>
> > > > $ herd7 -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-sla.litmus
> > > > Test OlegNesterov-put_pid-sla Allowed
> > > > States 3
> > > > 1:r1=0; x=1;
> > > > 1:r1=1; x=1;
> > > > 1:r1=1; x=2;
> > > > Ok
> > > > Witnesses
> > > > Positive: 1 Negative: 2
> > > > Flag data-race
> > > > Condition exists (1:r1=1 /\ not (x=2))
> > > > Observation OlegNesterov-put_pid-sla Sometimes 1 2
> > > > Time OlegNesterov-put_pid-sla 0.01
> > > > Hash=ec6f71f3d9f7cd6e45a874c872e3d946
> > > >
> > > > But what if you are certain that the compiler cannot mess up your use
> > > > of plain C-language loads and stores? Then simply tell LKMM that they
> > > > are READ_ONCE() and WRITE_ONCE(), respectively. LKMM is admittedly
> > > > somewhat paranoid, but real C compilers really do tear stores of certain
> > > > constants on systems (like x86) that have store-immediate instructions,
> > > > so a bit of paranoia is not misplaced here. ;-)
> > > >
> > > > Plus please note that this patch to LKMM is prototype and thus subject
> > > > to change.
> > >
> > > Ah I see. Appreciate if Alan can also CC me on future posting of this since
> > > I'm quite interested. ;-)
> >
> > His last posting should be easy to find. But please let me know if not,
> > as I would be happy to send it along.
>
> I found it and I'm going through it. Thanks!
>
> - Joel
>
On Thu, 4 Apr 2019, Paul E. McKenney wrote:

> On Mon, Apr 01, 2019 at 05:11:39PM -0400, Joel Fernandes wrote:
> > > > So I would have expected the following litmus to result in Never, but it
> > > > doesn't with Alan's patch:
> > > >
> > > > P0(int *x, int *y)
> > > > {
> > > >         *x = 1;
> > > >         smp_mb();
> > > >         *y = 1;
> > > > }
> > > >
> > > > P1(int *x, int *y)
> > > > {
> > > >         int r1;
> > > >
> > > >         r1 = READ_ONCE(*y);
> > > >         if (r1)
> > > >                 WRITE_ONCE(*x, 2);
> > > > }
> > > >
> > > > exists (1:r1=1 /\ ~x=2)
> > >
> > > The problem is that the compiler can turn both of P0()'s writes into reads:
> > >
> > > P0(int *x, int *y)
> > > {
> > >         if (*x != 1)
> > >                 *x = 1;
> > >         smp_mb();
> > >         if (*y != 1)
> > >                 *y = 1;
> > > }
> > >
> > > These reads will not be ordered by smp_wmb(), so you have to do WRITE_ONCE()
> > > to get an iron-clad ordering guarantee.
> >
> > But the snippet above has smp_mb() which does order writes, even for the
> > plain accesses.
>
> True, but that code has a data race, namely P0()'s plain write to y
> races with P1()'s READ_ONCE(). Data races give the compiler absolutely
> astonishing levels of freedom to rearrange your code. So, if you
> as a developer or maintainer choose to have data races, it is your
> responsibility to ensure that the compiler cannot mess you up. So what
> you should do in that case is to list the transformed code the compiler's
> optimizations can produce and feed the corresponding litmus tests to LKMM,
> using READ_ONCE() and WRITE_ONCE() for the post-optimization accesses.

In this case, I strongly suspect the compiler would not mess up the
code badly enough to allow the unwanted end result. But the LKMM tries
to avoid making strong assumptions about what compilers will or will
not do.

> > I understand. Are we talking about load/store tearing being the issue here?
> > Even under load/store tearing, I feel the program will produce correct
> > results because r1 is either 0 or 1 (a single bit cannot be torn).
>
> The compiler can (at least in theory) also transform *y = 1 as follows:
>
>         *y = r1;  /* Use *y as temporary storage. */
>         do_something();
>         r1 = *y;
>         *y = 1;
>
> Here r1 is some register and do_something() is an inline function visible
> to the compiler (hence not implying barrier() upon call and return).
>
> I don't know of any compilers that actually do this, but for me use
> of WRITE_ONCE() is a small price to pay to prevent all past, present,
> and future compilers from ever destroying^W optimizing my code in this way.
>
> In your own code, if you dislike WRITE_ONCE() so much that you are willing
> to commit to (now and forever!!!) checking all applicable versions of
> compilers to make sure that they don't do this, well and good, knock
> yourself out. But it is your responsibility to do that checking, and you
> can attest to LKMM that you have done so by giving LKMM litmus tests that
> substitute WRITE_ONCE() for that plain C-language assignment statement.

Remember that the LKMM does not produce strict bounds. That is, the
LKMM will say that some outcomes are allowed even though no existing
compiler/CPU combination would ever actually produce them. This litmus
test is an example.

> > Further, from the herd simulator output (below), according to the "States",
> > r1==1 means P1() AFAICS would have already finished the read and set the
> > r1 register to 1. Then I am wondering why it couldn't take the branch to set
> > *x to 2. According to herd, r1 == 1 AND x == 1 is a perfectly valid state
> > for the below program. I still couldn't see in my mind how for the below
> > program, this is possible - in terms of compiler optimizations or other kinds
> > of ordering. Because there is a smp_mb() between the 2 plain writes in P0()
> > and P1() did establish that r1 is 1 in the positive case. :-/. I am surely
> > missing something :-)
> >
> > ---8<-----------------------
> > C Joel-put_pid
> >
> > {}
> >
> > P0(int *x, int *y)
> > {
> >         *x = 1;
> >         smp_mb();
> >         *y = 1;
> > }
> >
> > P1(int *x, int *y)
> > {
> >         int r1;
> >
> >         r1 = READ_ONCE(*y);
> >         if (r1)
> >                 WRITE_ONCE(*x, 2);
> > }
> >
> > exists (1:r1=1 /\ ~x=2)
> >
> > ---8<-----------------------
> > Output:
> >
> > Test Joel-put_pid Allowed
> > States 3
> > 1:r1=0; x=1;
> > 1:r1=1; x=1;  <-- Can't figure out why r1=1 and x != 2 here.
>
> I must defer to Alan on this, but my guess is that this is due to
> the fact that there is a data race.

Yes, and because the plain-access/data-race patch for the LKMM was
intended only to detect data races, not to be aware of all the kinds of
ordering that plain accesses can induce. I said as much at the start
of the patch's cover letter and it bears repeating.

In this case it is certainly true that the "*x = 1" write must be
complete before the value of *y is changed from 0. Hence P1's
WRITE_ONCE will certainly overwrite the 1 with 2, and the final value
of *x will be 2 if r1=1.

But the notion of "visibility" of a write that I put into the LKMM
patch only allows for chains of marked accesses. Since "*y = 1" is a
plain access, the model does not recognize that it can enforce the
ordering between the two writes to *x.

Also, you must remember, the LKMM's predictions about whether an outcome
will or will not occur are meaningless if a data race is present.
Therefore the most fundamental answer to why the "1:r1=1; x=1;" line is
there is basically what Paul said: It's there because the herd model
gets completely messed up by data races.

Alan Stern
Thanks a lot Paul and Alan, I replied below.

On Thu, Apr 04, 2019 at 12:01:32PM -0400, Alan Stern wrote:
> On Thu, 4 Apr 2019, Paul E. McKenney wrote:
>
> > On Mon, Apr 01, 2019 at 05:11:39PM -0400, Joel Fernandes wrote:
> > > > > So I would have expected the following litmus to result in Never, but it
> > > > > doesn't with Alan's patch:
> > > > >
> > > > > P0(int *x, int *y)
> > > > > {
> > > > >         *x = 1;
> > > > >         smp_mb();
> > > > >         *y = 1;
> > > > > }
> > > > >
> > > > > P1(int *x, int *y)
> > > > > {
> > > > >         int r1;
> > > > >
> > > > >         r1 = READ_ONCE(*y);
> > > > >         if (r1)
> > > > >                 WRITE_ONCE(*x, 2);
> > > > > }
> > > > >
> > > > > exists (1:r1=1 /\ ~x=2)
> > > >
> > > > The problem is that the compiler can turn both of P0()'s writes into reads:
> > > >
> > > > P0(int *x, int *y)
> > > > {
> > > >         if (*x != 1)
> > > >                 *x = 1;
> > > >         smp_mb();
> > > >         if (*y != 1)
> > > >                 *y = 1;
> > > > }
> > > >
> > > > These reads will not be ordered by smp_wmb(), so you have to do WRITE_ONCE()
> > > > to get an iron-clad ordering guarantee.
> > >
> > > But the snippet above has smp_mb() which does order writes, even for the
> > > plain accesses.
> >
> > True, but that code has a data race, namely P0()'s plain write to y
> > races with P1()'s READ_ONCE(). Data races give the compiler absolutely
> > astonishing levels of freedom to rearrange your code. So, if you
> > as a developer or maintainer choose to have data races, it is your
> > responsibility to ensure that the compiler cannot mess you up. So what
> > you should do in that case is to list the transformed code the compiler's
> > optimizations can produce and feed the corresponding litmus tests to LKMM,
> > using READ_ONCE() and WRITE_ONCE() for the post-optimization accesses.
>
> In this case, I strongly suspect the compiler would not mess up the
> code badly enough to allow the unwanted end result. But the LKMM tries
> to avoid making strong assumptions about what compilers will or will
> not do.

Here I was just trying to understand better how any kind of code
transformation can cause an issue. Certainly I can see an issue if the
compiler uses the memory location as a temporary variable as Paul pointed
out, to which I agree that a WRITE_ONCE is better. I am in favor of being
careful, but here I was just trying to understand this better.

> > > I understand. Are we talking about load/store tearing being the issue here?
> > > Even under load/store tearing, I feel the program will produce correct
> > > results because r1 is either 0 or 1 (a single bit cannot be torn).
> >
> > The compiler can (at least in theory) also transform *y = 1 as follows:
> >
> >         *y = r1;  /* Use *y as temporary storage. */
> >         do_something();
> >         r1 = *y;
> >         *y = 1;
> >
> > Here r1 is some register and do_something() is an inline function visible
> > to the compiler (hence not implying barrier() upon call and return).
> >
> > I don't know of any compilers that actually do this, but for me use
> > of WRITE_ONCE() is a small price to pay to prevent all past, present,
> > and future compilers from ever destroying^W optimizing my code in this way.
> >
> > In your own code, if you dislike WRITE_ONCE() so much that you are willing
> > to commit to (now and forever!!!) checking all applicable versions of
> > compilers to make sure that they don't do this, well and good, knock
> > yourself out. But it is your responsibility to do that checking, and you
> > can attest to LKMM that you have done so by giving LKMM litmus tests that
> > substitute WRITE_ONCE() for that plain C-language assignment statement.
>
> Remember that the LKMM does not produce strict bounds. That is, the
> LKMM will say that some outcomes are allowed even though no existing
> compiler/CPU combination would ever actually produce them. This litmus
> test is an example.

Ok.

> > > Further, from the herd simulator output (below), according to the "States",
> > > r1==1 means P1() AFAICS would have already finished the read and set the
> > > r1 register to 1. Then I am wondering why it couldn't take the branch to set
> > > *x to 2. According to herd, r1 == 1 AND x == 1 is a perfectly valid state
> > > for the below program. I still couldn't see in my mind how for the below
> > > program, this is possible - in terms of compiler optimizations or other kinds
> > > of ordering. Because there is a smp_mb() between the 2 plain writes in P0()
> > > and P1() did establish that r1 is 1 in the positive case. :-/. I am surely
> > > missing something :-)
> > >
> > > ---8<-----------------------
> > > C Joel-put_pid
> > >
> > > {}
> > >
> > > P0(int *x, int *y)
> > > {
> > >         *x = 1;
> > >         smp_mb();
> > >         *y = 1;
> > > }
> > >
> > > P1(int *x, int *y)
> > > {
> > >         int r1;
> > >
> > >         r1 = READ_ONCE(*y);
> > >         if (r1)
> > >                 WRITE_ONCE(*x, 2);
> > > }
> > >
> > > exists (1:r1=1 /\ ~x=2)
> > >
> > > ---8<-----------------------
> > > Output:
> > >
> > > Test Joel-put_pid Allowed
> > > States 3
> > > 1:r1=0; x=1;
> > > 1:r1=1; x=1;  <-- Can't figure out why r1=1 and x != 2 here.
> >
> > I must defer to Alan on this, but my guess is that this is due to
> > the fact that there is a data race.
>
> Yes, and because the plain-access/data-race patch for the LKMM was
> intended only to detect data races, not to be aware of all the kinds of
> ordering that plain accesses can induce. I said as much at the start
> of the patch's cover letter and it bears repeating.
>
> In this case it is certainly true that the "*x = 1" write must be
> complete before the value of *y is changed from 0. Hence P1's
> WRITE_ONCE will certainly overwrite the 1 with 2, and the final value
> of *x will be 2 if r1=1.
>
> But the notion of "visibility" of a write that I put into the LKMM
> patch only allows for chains of marked accesses. Since "*y = 1" is a
> plain access, the model does not recognize that it can enforce the
> ordering between the two writes to *x.
>
> Also, you must remember, the LKMM's predictions about whether an outcome
> will or will not occur are meaningless if a data race is present.
> Therefore the most fundamental answer to why the "1:r1=1; x=1;" line is
> there is basically what Paul said: It's there because the herd model
> gets completely messed up by data races.

Makes sense to me. Thanks for the good work on this.

FWIW, thought to mention (feel free to ignore the suggestion if it's
meaningless): If there is any chance that the outcome can be better
outputted, like r1=X; x=1; where X stands for the result of a data race,
that would be lovely. I don't know much about herd internals (yet) to say
if the suggestion makes sense but as a user, it would certainly help
reduce confusion.

thanks,

 - Joel
On Thu, Apr 04, 2019 at 02:08:42PM -0400, Joel Fernandes wrote:
> Thanks a lot Paul and Alan, I replied below.
>
> On Thu, Apr 04, 2019 at 12:01:32PM -0400, Alan Stern wrote:
> > On Thu, 4 Apr 2019, Paul E. McKenney wrote:
> >
> > > On Mon, Apr 01, 2019 at 05:11:39PM -0400, Joel Fernandes wrote:
> > > > > > So I would have expected the following litmus to result in Never, but it
> > > > > > doesn't with Alan's patch:
> > > > > >
> > > > > > P0(int *x, int *y)
> > > > > > {
> > > > > >         *x = 1;
> > > > > >         smp_mb();
> > > > > >         *y = 1;
> > > > > > }
> > > > > >
> > > > > > P1(int *x, int *y)
> > > > > > {
> > > > > >         int r1;
> > > > > >
> > > > > >         r1 = READ_ONCE(*y);
> > > > > >         if (r1)
> > > > > >                 WRITE_ONCE(*x, 2);
> > > > > > }
> > > > > >
> > > > > > exists (1:r1=1 /\ ~x=2)
> > > > >
> > > > > The problem is that the compiler can turn both of P0()'s writes into reads:
> > > > >
> > > > > P0(int *x, int *y)
> > > > > {
> > > > >         if (*x != 1)
> > > > >                 *x = 1;
> > > > >         smp_mb();
> > > > >         if (*y != 1)
> > > > >                 *y = 1;
> > > > > }
> > > > >
> > > > > These reads will not be ordered by smp_wmb(), so you have to do WRITE_ONCE()
> > > > > to get an iron-clad ordering guarantee.
> > > >
> > > > But the snippet above has smp_mb() which does order writes, even for the
> > > > plain accesses.
> > >
> > > True, but that code has a data race, namely P0()'s plain write to y
> > > races with P1()'s READ_ONCE(). Data races give the compiler absolutely
> > > astonishing levels of freedom to rearrange your code. So, if you
> > > as a developer or maintainer choose to have data races, it is your
> > > responsibility to ensure that the compiler cannot mess you up. So what
> > > you should do in that case is to list the transformed code the compiler's
> > > optimizations can produce and feed the corresponding litmus tests to LKMM,
> > > using READ_ONCE() and WRITE_ONCE() for the post-optimization accesses.
> >
> > In this case, I strongly suspect the compiler would not mess up the
> > code badly enough to allow the unwanted end result. But the LKMM tries
> > to avoid making strong assumptions about what compilers will or will
> > not do.
>
> Here I was just trying to understand better how any kind of code
> transformation can cause an issue. Certainly I can see an issue if the
> compiler uses the memory location as a temporary variable as Paul pointed
> out, to which I agree that a WRITE_ONCE is better. I am in favor of being
> careful, but here I was just trying to understand this better.
>
> > > > I understand. Are we talking about load/store tearing being the issue here?
> > > > Even under load/store tearing, I feel the program will produce correct
> > > > results because r1 is either 0 or 1 (a single bit cannot be torn).
> > >
> > > The compiler can (at least in theory) also transform *y = 1 as follows:
> > >
> > >         *y = r1;  /* Use *y as temporary storage. */
> > >         do_something();
> > >         r1 = *y;
> > >         *y = 1;
> > >
> > > Here r1 is some register and do_something() is an inline function visible
> > > to the compiler (hence not implying barrier() upon call and return).
> > >
> > > I don't know of any compilers that actually do this, but for me use
> > > of WRITE_ONCE() is a small price to pay to prevent all past, present,
> > > and future compilers from ever destroying^W optimizing my code in this way.
> > >
> > > In your own code, if you dislike WRITE_ONCE() so much that you are willing
> > > to commit to (now and forever!!!) checking all applicable versions of
> > > compilers to make sure that they don't do this, well and good, knock
> > > yourself out. But it is your responsibility to do that checking, and you
> > > can attest to LKMM that you have done so by giving LKMM litmus tests that
> > > substitute WRITE_ONCE() for that plain C-language assignment statement.
> >
> > Remember that the LKMM does not produce strict bounds. That is, the
> > LKMM will say that some outcomes are allowed even though no existing
> > compiler/CPU combination would ever actually produce them. This litmus
> > test is an example.
>
> Ok.
>
> > > > Further, from the herd simulator output (below), according to the "States",
> > > > r1==1 means P1() AFAICS would have already finished the read and set the
> > > > r1 register to 1. Then I am wondering why it couldn't take the branch to set
> > > > *x to 2. According to herd, r1 == 1 AND x == 1 is a perfectly valid state
> > > > for the below program. I still couldn't see in my mind how for the below
> > > > program, this is possible - in terms of compiler optimizations or other kinds
> > > > of ordering. Because there is a smp_mb() between the 2 plain writes in P0()
> > > > and P1() did establish that r1 is 1 in the positive case. :-/. I am surely
> > > > missing something :-)
> > > >
> > > > ---8<-----------------------
> > > > C Joel-put_pid
> > > >
> > > > {}
> > > >
> > > > P0(int *x, int *y)
> > > > {
> > > >         *x = 1;
> > > >         smp_mb();
> > > >         *y = 1;
> > > > }
> > > >
> > > > P1(int *x, int *y)
> > > > {
> > > >         int r1;
> > > >
> > > >         r1 = READ_ONCE(*y);
> > > >         if (r1)
> > > >                 WRITE_ONCE(*x, 2);
> > > > }
> > > >
> > > > exists (1:r1=1 /\ ~x=2)
> > > >
> > > > ---8<-----------------------
> > > > Output:
> > > >
> > > > Test Joel-put_pid Allowed
> > > > States 3
> > > > 1:r1=0; x=1;
> > > > 1:r1=1; x=1;  <-- Can't figure out why r1=1 and x != 2 here.
> > >
> > > I must defer to Alan on this, but my guess is that this is due to
> > > the fact that there is a data race.
> >
> > Yes, and because the plain-access/data-race patch for the LKMM was
> > intended only to detect data races, not to be aware of all the kinds of
> > ordering that plain accesses can induce. I said as much at the start
> > of the patch's cover letter and it bears repeating.
> >
> > In this case it is certainly true that the "*x = 1" write must be
> > complete before the value of *y is changed from 0. Hence P1's
> > WRITE_ONCE will certainly overwrite the 1 with 2, and the final value
> > of *x will be 2 if r1=1.
> >
> > But the notion of "visibility" of a write that I put into the LKMM
> > patch only allows for chains of marked accesses. Since "*y = 1" is a
> > plain access, the model does not recognize that it can enforce the
> > ordering between the two writes to *x.
> >
> > Also, you must remember, the LKMM's predictions about whether an outcome
> > will or will not occur are meaningless if a data race is present.
> > Therefore the most fundamental answer to why the "1:r1=1; x=1;" line is
> > there is basically what Paul said: It's there because the herd model
> > gets completely messed up by data races.
>
> Makes sense to me. Thanks for the good work on this.
>
> FWIW, thought to mention (feel free to ignore the suggestion if it's
> meaningless): If there is any chance that the outcome can be better
> outputted, like r1=X; x=1; where X stands for the result of a data race,
> that would be lovely. I don't know much about herd internals (yet) to say
> if the suggestion makes sense but as a user, it would certainly help
> reduce confusion.

The "Flag data-race" that appeared in the herd output is your friend in
this case. If you see that in the output, that means that herd detected
a data race, and the states output might or might not be reliable.

                                                        Thanx, Paul
On Thu, 4 Apr 2019, Joel Fernandes wrote:

> FWIW, thought to mention (feel free to ignore the suggestion if it's
> meaningless): If there is any chance that the outcome can be better
> outputted, like r1=X; x=1; where X stands for the result of a data race, that
> would be lovely. I don't know much about herd internals (yet) to say if the
> suggestion makes sense but as a user, it would certainly help reduce
> confusion.

I don't think that is feasible.  For one thing, according to the C
Standard, in the presence of a data race a program is allowed to do
anything at all.  There's no point trying to display all possible
outcomes of an execution!

If we adopted the proposal that certain kinds of data races only result
in undefined values, things would be a little better.  It's possible
that the herd program could be changed to support undefined values.  At
present, however, it does not.

Alan Stern
On Thu, Apr 04, 2019 at 11:19:46AM -0700, Paul E. McKenney wrote:
[snip]
> > > > > Further, from the herd simulator output (below), according to the "States",
> > > > > r1==1 means P1() AFAICS would have already finished the read and set the
> > > > > r1 register to 1. Then I am wondering why it couldn't take the branch to set
> > > > > *x to 2. According to herd, r1 == 1 AND x == 1 is a perfectly valid state
> > > > > for the below program. I still couldn't see in my mind how for the below
> > > > > program, this is possible - in terms of compiler optimizations or other kinds
> > > > > of ordering. Because there is a smp_mb() between the 2 plain writes in P0()
> > > > > and P1() did establish that r1 is 1 in the positive case. :-/. I am surely
> > > > > missing something :-)
> > > > >
> > > > > ---8<-----------------------
> > > > > C Joel-put_pid
> > > > >
> > > > > {}
> > > > >
> > > > > P0(int *x, int *y)
> > > > > {
> > > > > 	*x = 1;
> > > > > 	smp_mb();
> > > > > 	*y = 1;
> > > > > }
> > > > >
> > > > > P1(int *x, int *y)
> > > > > {
> > > > > 	int r1;
> > > > >
> > > > > 	r1 = READ_ONCE(*y);
> > > > > 	if (r1)
> > > > > 		WRITE_ONCE(*x, 2);
> > > > > }
> > > > >
> > > > > exists (1:r1=1 /\ ~x=2)
> > > > >
> > > > > ---8<-----------------------
> > > > > Output:
> > > > >
> > > > > Test Joel-put_pid Allowed
> > > > > States 3
> > > > > 1:r1=0; x=1;
> > > > > 1:r1=1; x=1;  <-- Can't figure out why r1=1 and x != 2 here.
> > > >
> > > > I must defer to Alan on this, but my guess is that this is due to
> > > > the fact that there is a data race.
> > >
> > > Yes, and because the plain-access/data-race patch for the LKMM was
> > > intended only to detect data races, not to be aware of all the kinds of
> > > ordering that plain accesses can induce.  I said as much at the start
> > > of the patch's cover letter and it bears repeating.
> > >
> > > In this case it is certainly true that the "*x = 1" write must be
> > > complete before the value of *y is changed from 0.  Hence P1's
> > > WRITE_ONCE will certainly overwrite the 1 with 2, and the final value
> > > of *x will be 2 if r1=1.
> > >
> > > But the notion of "visibility" of a write that I put into the LKMM
> > > patch only allows for chains of marked accesses.  Since "*y = 1" is a
> > > plain access, the model does not recognize that it can enforce the
> > > ordering between the two writes to *x.
> > >
> > > Also, you must remember, the LKMM's predictions about whether an outcome
> > > will or will not occur are meaningless if a data race is present.
> > > Therefore the most fundamental answer to why the "1:r1=1; x=1;"
> > > line is there is basically what Paul said: It's there because the
> > > herd model gets completely messed up by data races.
> >
> > Makes sense to me. Thanks for the good work on this.
> >
> > FWIW, thought to mention (feel free to ignore the suggestion if it's
> > meaningless): If there is any chance that the outcome can be better
> > outputted, like r1=X; x=1; where X stands for the result of a data race, that
> > would be lovely. I don't know much about herd internals (yet) to say if the
> > suggestion makes sense but as a user, it would certainly help reduce
> > confusion.
>
> The "Flag data-race" that appeared in the herd output is your friend in
> this case.  If you see that in the output, that means that herd detected
> a data race, and the states output might or might not be reliable.

Thanks Paul and Alan, the "Flag data-race" indication sounds good to me. I
will watch out for that :)

- Joel
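[Editorial aside: following Alan's explanation above, the diagnosis can be checked by marking the plain *y store and re-running herd. The variant below is an untested sketch in the thread's litmus format (it needs herd7 with the LKMM from tools/memory-model); with the store marked, the expectation, not verified here, is that the data-race flag disappears and the puzzling 1:r1=1; x=1; state is no longer listed.]

```
C Joel-put_pid-marked

{}

P0(int *x, int *y)
{
	*x = 1;
	smp_mb();
	WRITE_ONCE(*y, 1);
}

P1(int *x, int *y)
{
	int r1;

	r1 = READ_ONCE(*y);
	if (r1)
		WRITE_ONCE(*x, 2);
}

exists (1:r1=1 /\ ~x=2)
```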
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 14a9a39da9c7..8cb86d377ff5 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -3,6 +3,7 @@
 #define _LINUX_PID_H
 
 #include <linux/rculist.h>
+#include <linux/refcount.h>
 
 enum pid_type
 {
@@ -56,7 +57,7 @@ struct upid {
 
 struct pid
 {
-	atomic_t count;
+	refcount_t count;
 	unsigned int level;
 	/* lists of tasks that use this pid */
 	struct hlist_head tasks[PIDTYPE_MAX];
@@ -69,7 +70,7 @@ extern struct pid init_struct_pid;
 static inline struct pid *get_pid(struct pid *pid)
 {
 	if (pid)
-		atomic_inc(&pid->count);
+		refcount_inc(&pid->count);
 	return pid;
 }
 
diff --git a/kernel/pid.c b/kernel/pid.c
index 20881598bdfa..2095c7da644d 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -37,7 +37,7 @@
 #include <linux/init_task.h>
 #include <linux/syscalls.h>
 #include <linux/proc_ns.h>
-#include <linux/proc_fs.h>
+#include <linux/refcount.h>
 #include <linux/sched/task.h>
 #include <linux/idr.h>
 
@@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
 		return;
 
 	ns = pid->numbers[pid->level].ns;
-	if ((atomic_read(&pid->count) == 1) ||
-	    atomic_dec_and_test(&pid->count)) {
+	if ((refcount_read(&pid->count) == 1) ||
+	    refcount_dec_and_test(&pid->count)) {
 		kmem_cache_free(ns->pid_cachep, pid);
 		put_pid_ns(ns);
 	}
@@ -210,7 +210,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	}
 
 	get_pid_ns(ns);
-	atomic_set(&pid->count, 1);
+	refcount_set(&pid->count, 1);
 	for (type = 0; type < PIDTYPE_MAX; ++type)
 		INIT_HLIST_HEAD(&pid->tasks[type]);
struct pid's count is an atomic_t field used as a refcount. Use
refcount_t for it which is basically atomic_t but does additional
checking to prevent use-after-free bugs. No change in behavior if
CONFIG_REFCOUNT_FULL=n.

Cc: keescook@chromium.org
Cc: kernel-team@android.com
Cc: kernel-hardening@lists.openwall.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/pid.h | 5 +++--
 kernel/pid.c        | 8 ++++----
 2 files changed, 7 insertions(+), 6 deletions(-)