
[125/166] lib/list: prevent compiler reloads inside 'safe' list iteration

Message ID 20200407031042.8o-fYMox-%akpm@linux-foundation.org (mailing list archive)
State New, archived
Series [001/166] mm, memcg: bypass high reclaim iteration for cgroup hierarchy root

Commit Message

Andrew Morton April 7, 2020, 3:10 a.m. UTC
From: Chris Wilson <chris@chris-wilson.co.uk>
Subject: lib/list: prevent compiler reloads inside 'safe' list iteration

Instruct the compiler to read the next element in the list iteration
once, and that it is not allowed to reload the value from the stale
element later. This is important as during the course of the safe
iteration, the stale element may be poisoned (unbeknownst to the
compiler).

This helps prevent KCSAN warnings over 'unsafe' conduct in releasing the
list elements during list_for_each_entry_safe() and friends.
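
For illustration, the pattern being guarded against looks roughly like
this (a minimal sketch; the struct and function names are made up for the
example, not taken from any real caller):

struct item {
	struct list_head link;
	int payload;
};

static void release_all(struct list_head *head)
{
	struct item *it, *next;

	list_for_each_entry_safe(it, next, head, link) {
		list_del(&it->link);	/* poisons it->link.next/.prev */
		kfree(it);		/* 'it' is now stale */
		/*
		 * 'next' must already have been loaded from it->link.next;
		 * reloading it from the stale element here would read
		 * poisoned (or freed) memory.
		 */
	}
}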

Link: http://lkml.kernel.org/r/20200310092119.14965-1-chris@chris-wilson.co.uk
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/list.h |   50 +++++++++++++++++++++++++++++------------
 1 file changed, 36 insertions(+), 14 deletions(-)

Comments

Linus Torvalds April 7, 2020, 3:44 p.m. UTC | #1
On Mon, Apr 6, 2020 at 8:10 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> From: Chris Wilson <chris@chris-wilson.co.uk>
> Subject: lib/list: prevent compiler reloads inside 'safe' list iteration
>
> Instruct the compiler to read the next element in the list iteration
> once, and that it is not allowed to reload the value from the stale
> element later. This is important as during the course of the safe
> iteration, the stale element may be poisoned (unbeknownst to the
> compiler).

Andrew, Chris, this one looks rather questionable to me.

How the heck would the ->next pointer be changed without the compiler
being aware of it? That implies a bug to begin with - possibly an
inline asm that changes kernel memory without having a memory clobber.

Quite fundamentally, the READ_ONCE() doesn't seem to fix anything. If
something else can change the list _concurrently_, it's still
completely broken, and hiding the KASAN report is just hiding a bug.

What and where was the actual KASAN issue? The commit log doesn't say...

            Linus
Chris Wilson April 7, 2020, 4:03 p.m. UTC | #2
Quoting Linus Torvalds (2020-04-07 16:44:35)
> On Mon, Apr 6, 2020 at 8:10 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > From: Chris Wilson <chris@chris-wilson.co.uk>
> > Subject: lib/list: prevent compiler reloads inside 'safe' list iteration
> >
> > Instruct the compiler to read the next element in the list iteration
> > once, and that it is not allowed to reload the value from the stale
> > element later. This is important as during the course of the safe
> > iteration, the stale element may be poisoned (unbeknownst to the
> > compiler).
> 
> Andrew, Chris, this one looks rather questionable to me.
> 
> How the heck would the ->next pointer be changed without the compiler
> being aware of it? That implies a bug to begin with - possibly an
> inline asm that changes kernel memory without having a memory clobber.
> 
> Quite fundamentally, the READ_ONCE() doesn't seem to fix anything. If
> something else can change the list _concurrently_, it's still
> completely broken, and hiding the KASAN report is just hiding a bug.
> 
> What and where was the actual KASAN issue? The commit log doesn't say...

It'll take some time to reconstruct the original report, but the case in
question was removing the last element of the last such list, then
switching to a global lock over all such lists to park the HW, which in
doing so added one more element to the original list. [If we happen to
be retiring along the kernel timeline in the first place.]

list->next changed from pointing to list_head, to point to the new
element instead. However, we don't have to check the next element yet
and want to terminate the list iteration.

For reference,
drivers/gpu/drm/i915/gt/intel_engine_pm.c::__engine_park()

Global activity is serialised by engine->wakeref.mutex; every active
timeline is required to hold an engine wakeref, but retiring is local to
timelines and serialised by their own timeline->mutex.

lock(&timeline->lock)
list_for_each_safe(&timeline->requests)
  \-> i915_request_retire [list_del(&timeline->requests)]
   \-> intel_timeline_exit
    \-> lock(&engine->wakeref.mutex)
        engine_park [list_add_tail(&engine->kernel_context->timeline->requests)]
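
In code form, the shape is roughly (simplified; 'tl', 'rq' and 'rn' are
illustrative, not the exact identifiers in the driver):

struct i915_request *rq, *rn;

mutex_lock(&tl->mutex);
list_for_each_entry_safe(rq, rn, &tl->requests, link) {
	i915_request_retire(rq);
	/*
	 * Retiring the last request can drop the last engine wakeref, at
	 * which point __engine_park() takes engine->wakeref.mutex and
	 * list_add_tail()s a parking request onto
	 * engine->kernel_context->timeline->requests.  If 'tl' is that
	 * kernel timeline, the list being iterated grows again just as the
	 * cursor reaches what used to be its end.
	 */
}
mutex_unlock(&tl->mutex);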

-Chris
Linus Torvalds April 7, 2020, 5:28 p.m. UTC | #3
On Tue, Apr 7, 2020 at 9:04 AM Chris Wilson <chris@chris-wilson.co.uk> wrote:
>
> It'll take some time to reconstruct the original report, but the case in
> question was removing the last element of the last such list, then
> switching to a global lock over all such lists to park the HW, which in
> doing so added one more element to the original list. [If we happen to
> be retiring along the kernel timeline in the first place.]

Please point to the real code and the list.

Honestly, what you describe sounds complex enough that I think your
locking is simply just buggy.

IOW, this patch seems to really just paper over a locking bug, and the
KASAN report tied to it.

Because the fundamental issue is that READ_ONCE() can not fix a bug
here. Reading the next pointer once fundamentally cannot matter if it
can change concurrently: the code is buggy, and the READ_ONCE() just
means that it gets one or the other value randomly, and that the list
walking is fundamentally racy.

On the other hand, if the next pointer _cannot_ change concurrently,
then READ_ONCE() cannot make a difference.

So as far as I can tell, we have two possibilities, and in both cases
changing the code to use READ_ONCE() is not the right thing to do. In
one case it hides a bug, and in another case it's just pointless.

> list->next changed from pointing to list_head, to point to the new
> element instead. However, we don't have to check the next element yet
> and want to terminate the list iteration.

I'd really like to see the actual code that has that list walking. You say:

> For reference,
> drivers/gpu/drm/i915/gt/intel_engine_pm.c::__engine_park()

.. but that function doesn't have any locking or list-walking. Is it
the "call_idle_barriers()" loop? What is it?

I'd really like to see the KASAN report and the discussion about this change.

And if there was no discussion, then the patch just seems like "I
changed code randomly and the KASAN report went away, so it's all
good".

> Global activity is serialised by engine->wakeref.mutex; every active
> timeline is required to hold an engine wakeref, but retiring is local to
> timelines and serialised by their own timeline->mutex.
>
> lock(&timeline->lock)
> list_for_each_safe(&timeline->requests)
>   \-> i915_request_retire [list_del(&timeline->requests)]
>    \-> intel_timeline_exit
>     \-> lock(&engine->wakeref.mutex)
>         engine_park [list_add_tail(&engine->kernel_context->timeline->requests)]

in that particular list_for_each_safe() thing, there's no possibility
that the 'next' field would be reloaded, since the list_del() in the
above will be something the compiler is aware of.

So yes, the beginning list_for_each_safe() might load it twice (or a
hundred times), but by the time that list_del() in
i915_request_retire() has been called, if the compiler then reloads it
afterwards, that would be a major compiler bug, since by then that
value could have been written by the local thread.
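
For reference, list_for_each_safe() in include/linux/list.h is simply:

#define list_for_each_safe(pos, n, head) \
	for (pos = (head)->next, n = pos->next; pos != (head); \
		pos = n, n = pos->next)

i.e. 'n' is evaluated in this thread before the loop body that ends up
doing the list_del() runs.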

So this doesn't explain it to me.

What it *sounds* like is that the "engine" lock that you do *not* hold
initially, is not protecting some accessor to that list, so you have a
race on the list at the time of that list_del().

And that race may be what KASAN is reporting, and what that patch is
_hiding_ from KASAN - but not fixing.

See what I am saying and why I find this patch questionable?

There may be something really subtle going on, but it really smells
like "two threads are modifying the same list at the same time".

And there's no way that the READ_ONCE() will fix that bug, it will
only make KASAN shut up about it.

                  Linus
Linus Torvalds April 7, 2020, 5:39 p.m. UTC | #4
On Tue, Apr 7, 2020 at 10:28 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> There may be something really subtle going on, but it really smells
> like "two threads are modifying the same list at the same time".
>
> And there's no way that the READ_ONCE() will fix that bug, it will
> only make KASAN shut up about it.

[ And by KASAN I obviously meant KCSAN every time ]

An alternative is that it's just a KCSAN bug or false positive for
some other reason. Which is possible. But the more I look at this, the
more I think you really have a bug in your list handling.

I'm just not convinced the whole "we have a race where we randomly add
a tail object" in another thread is a valid reason for making those
"safe" list accessors use READ_ONCE.

The thing is, they were never designed to be safe wrt concurrent
accesses. They are safe from the _same_ thread removing the current
entry, not from other threads changing the entries - whether it be the
last one or not.

Because if _another_ thread removes (or adds) an entry, you have a
whole new set of issues. READ_ONCE() isn't sufficient. You need to
have memory ordering guarantees etc.

For example, the thread that adds another entry that might - or might
not - be visible without locking, would have to fully initialize the
entry, and set the ->next pointer on it, before it adds it to the
list.

I suspect you could use the RCU list walkers for a use-case like this.
So __list_add_rcu() for example uses the proper rcu_assign_pointer()
(which uses smp_store_release()) to make sure that the newly added
entry is actually _ordered_ wrt the stores to the entry.
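
A sketch of that pattern (illustrative only; 'req', 'head', init_request()
and consume() are placeholders, not a real API):

/* writer, under whatever lock serialises the adders */
init_request(req);			/* fully initialise first ... */
list_add_tail_rcu(&req->link, head);	/* ... then publish: __list_add_rcu()
					 * stores ->next via
					 * rcu_assign_pointer(), i.e.
					 * smp_store_release() */

/* reader, running concurrently */
rcu_read_lock();
list_for_each_entry_rcu(req, head, link)
	consume(req);			/* ->next is loaded with READ_ONCE()
					 * plus dependency ordering */
rcu_read_unlock();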

But the "safe" list functions simply do not have those kinds of
ordering guarantees, and doing concurrent list_add() simply *CANNOT*
be right. Adding a READ_ONCE() does not in any way make it right
(although it might work in practice on x86).

               Linus
Chris Wilson April 7, 2020, 6:04 p.m. UTC | #5
Quoting Linus Torvalds (2020-04-07 18:28:34)
> On Tue, Apr 7, 2020 at 9:04 AM Chris Wilson <chris@chris-wilson.co.uk> wrote:
> >
> > It'll take some time to reconstruct the original report, but the case in
> > question was removing the last element of the last such list, then
> > switching to a global lock over all such lists to park the HW, which in
> > doing so added one more element to the original list. [If we happen to
> > be retiring along the kernel timeline in the first place.]
> 
> Please point to the real code and the list.
> 
> Honestly, what you describe sounds complex enough that I think your
> locking is simply just buggy.
> 
> IOW, this patch seems to really just paper over a locking bug, and the
> KASAN report tied to it.
> 
> Because the fundamental issue is that READ_ONCE() can not fix a bug
> here. Reading the next pointer once fundamentally cannot matter if it
> can change concurrently: the code is buggy, and the READ_ONCE() just
> means that it gets one or the other value randomly, and that the list
> walking is fundamentally racy.
> 
> On the other hand, if the next pointer _cannot_ change concurrently,
> then READ_ONCE() cannot make a difference.
> 
> So as far as I can tell, we have two possibilities, and in both cases
> changing the code to use READ_ONCE() is not the right thing to do. In
> one case it hides a bug, and in another case it's just pointless.
> 
> > list->next changed from pointing to list_head, to point to the new
> > element instead. However, we don't have to check the next element yet
> > and want to terminate the list iteration.
> 
> I'd really like to see the actual code that has that list walking. You say:
> 
> > For reference,
> > drivers/gpu/drm/i915/gt/intel_engine_pm.c::__engine_park()
> 
> .. but that function doesn't have any locking or list-walking. Is it
> the "call_idle_barriers()" loop? What is it?

list_for_each_safe @ intel_gt_requests.c::retire_requests
list_del @ i915_request.c::i915_request_retire
list_add @ i915_request.c::__i915_request_create

> I'd really like to see the KASAN report and the discussion about this change.
> 
> And if there was no discussion, then the patch just seems like "I
> changed code randomly and the KASAN report went away, so it's all
> good".
> 
> > Global activity is serialised by engine->wakeref.mutex; every active
> > timeline is required to hold an engine wakeref, but retiring is local to
> > timelines and serialised by their own timeline->mutex.
> >
> > lock(&timeline->lock)
> > list_for_each_safe(&timeline->requests)
> >   \-> i915_request_retire [list_del(&timeline->requests)]
> >    \-> intel_timeline_exit
> >     \-> lock(&engine->wakeref.mutex)
> >         engine_park [list_add_tail(&engine->kernel_context->timeline->requests)]
> 
> in that particular list_for_each_safe() thing, there's no possibility
> that the 'next' field would be reloaded, since the list_del() in the
> above will be something the compiler is aware of.
> 
> So yes, the beginning list_for_each_safe() might load it twice (or a
> hundred times), but by the time that list_del() in
> i915_request_retire() has been called, if the compiler then reloads it
> afterwards, that would be a major compiler bug, since by then that
> value could have been written by the local thread.

My understanding of how KCSAN works is that it installs a watchpoint
after a read and complains if the value changes within X us.
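
Very roughly, and purely as pseudocode (insert_watchpoint(),
watchpoint_fired() and report_data_race() are made-up names to show the
idea, not KCSAN's real internals):

	prev = *addr;				/* instrumented plain read */
	wp = insert_watchpoint(addr);
	udelay(watch_window_us);		/* short watch window */
	if (*addr != prev || watchpoint_fired(wp))
		report_data_race(addr);		/* value changed under us */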

> So this doesn't explain it to me.
> 
> What it *sounds* like is that the "engine" lock that you do *not* hold
> initially, is not protecting some accessor to that list, so you have a
> race on the list at the time of that list_del().

We take an engine wakeref prior to creating a request, and release it
upon retiring the request [through the respective context/timelines,
across all the engines that might be used for that context]. When the
engine wakeref hits zero, that mutex is taken to prevent any other new
request being created, and under that mutex while we know that the
engine->kernel_context->timeline is empty, we submit a request to switch
the GPU to point into our scratch context.

That submission can run concurrently to the list iteration, but only
_after_ the final list_del.

> And that race may be what KASAN is reporting, and what that patch is
> _hiding_ from KASAN - but not fixing.

Yes, it was sent as a means to silence KCSAN with a query as to whether
or not it could happen in practice with an aggressive compiler.

> See what I am saying and why I find this patch questionable?
> 
> There may be something really subtle going on, but it really smells
> like "two threads are modifying the same list at the same time".

In strict succession.
 
> And there's no way that the READ_ONCE() will fix that bug, it will
> only make KASAN shut up about it.

There's some more shutting up required for KCSAN to bring the noise down
to usable levels which I hope has been done so I don't have to argue for
it, such as

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 1e6650ed066d..c7c8dd89f279 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -164,7 +164,7 @@ static inline void destroy_timer_on_stack(struct timer_list *timer) { }
  */
 static inline int timer_pending(const struct timer_list * timer)
 {
-	return timer->entry.pprev != NULL;
+	return READ_ONCE(timer->entry.pprev) != NULL;
 }

 extern void add_timer_on(struct timer_list *timer, int cpu);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 5352ce50a97e..7461b3f33629 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -565,8 +565,9 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner,
 		/*
 		 * Use vcpu_is_preempted to detect lock holder preemption issue.
 		 */
-		if (!owner->on_cpu || need_resched() ||
-				vcpu_is_preempted(task_cpu(owner))) {
+		if (!READ_ONCE(owner->on_cpu) ||
+		    need_resched() ||
+		    vcpu_is_preempted(task_cpu(owner))) {
 			ret = false;
 			break;
 		}
@@ -602,7 +603,7 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
 	 * on cpu or its cpu is preempted
 	 */
 	if (owner)
-		retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
+		retval = READ_ONCE(owner->on_cpu) && !vcpu_is_preempted(task_cpu(owner));
 	rcu_read_unlock();

 	/*
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 1f7734949ac8..4a81fba4cf70 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -75,7 +75,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
 		 * wait for either @lock to point to us, through its Step-B, or
 		 * wait for a new @node->next from its Step-C.
 		 */
-		if (node->next) {
+		if (READ_ONCE(node->next)) {
 			next = xchg(&node->next, NULL);
 			if (next)
 				break;
@@ -154,7 +154,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	 */

 	for (;;) {
-		if (prev->next == node &&
+		if (READ_ONCE(prev->next) == node &&
 		    cmpxchg(&prev->next, node, NULL) == node)
 			break;

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 0d9b6be9ecc8..eef4835cecf2 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -650,7 +650,7 @@ static inline bool owner_on_cpu(struct task_struct *owner)
 	 * As lock holder preemption issue, we both skip spinning if
 	 * task is not on cpu or its cpu is preempted
 	 */
-	return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
+	return READ_ONCE(owner->on_cpu) && !vcpu_is_preempted(task_cpu(owner));
 }

 static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
Linus Torvalds April 7, 2020, 8:04 p.m. UTC | #6
On Tue, Apr 7, 2020 at 11:04 AM Chris Wilson <chris@chris-wilson.co.uk> wrote:
>
> That submission can run concurrently to the list iteration, but only
> _after_ the final list_del.

There is no such thing, Chris.

Not without locks and memory ordering.

The "list_del()" is already ordered on the CPU it happens.

And _there_ it's already ordered with the list_for_each_entry_safe()
by the compiler.

> > There may be something really subtle going on, but it really smells
> > like "two threads are modifying the same list at the same time".
>
> In strict succession.

See above.

There is no such thing as strict succession across two CPUs unless
you have memory barriers, locks, or things like release/acquire
operations.

So a "strict succession from list_del()" only makes sense on the
_local_ CPU. Not across threads.

You may be relying on some very subtle consistency guarantee that is
true on x86. For example, x86 does guarantee "causality".

Not everybody else does that.

> There's some more shutting up required for KCSAN to bring the noise down
> to usable levels which I hope has been done so I don't have to argue for
> it, such as

Stop using KCSAN if you can't deal with the problems it introduces.

I do NOT want to see bogus patches in the kernel that are introduced
by bad infrastructure.

That READ_ONCE() is _wrong_.

File a bug with the KCSAN people, don't send garbage upstream.

               Linus
Linus Torvalds April 7, 2020, 8:06 p.m. UTC | #7
On Tue, Apr 7, 2020 at 1:04 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> You may be relying on some very subtle consistency guarantee that is
> true on x86. For example, x86 does guarantee "causality".
>
> Not everybody else does that.

It's worth noting that maybe for the i915 driver it makes sense to
just assume x86 memory ordering.

But even if that's the case, then it doesn't make sense to change
list_prev_entry_safe() anywhere else.

                 Linus
Paul E. McKenney April 8, 2020, 8:26 p.m. UTC | #8
On Tue, Apr 07, 2020 at 08:44:35AM -0700, Linus Torvalds wrote:
> On Mon, Apr 6, 2020 at 8:10 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > From: Chris Wilson <chris@chris-wilson.co.uk>
> > Subject: lib/list: prevent compiler reloads inside 'safe' list iteration
> >
> > Instruct the compiler to read the next element in the list iteration
> > once, and that it is not allowed to reload the value from the stale
> > element later. This is important as during the course of the safe
> > iteration, the stale element may be poisoned (unbeknownst to the
> > compiler).
> 
> Andrew, Chris, this one looks rather questionable to me.
> 
> How the heck would the ->next pointer be changed without the compiler
> being aware of it? That implies a bug to begin with - possibly an
> inline asm that changes kernel memory without having a memory clobber.
> 
> Quite fundamentally, the READ_ONCE() doesn't seem to fix anything. If
> something else can change the list _concurrently_, it's still
> completely broken, and hiding the KASAN report is just hiding a bug.
> 
> What and where was the actual KASAN issue? The commit log doesn't say...

OK, it sounds like we need to specify what needs to be present in this
sort of commit log.  Please accept my apologies for this specification
not already being in place.

How about the following?

o	The KCSAN splat, optionally including file/line numbers.

o	Any non-default Kconfig options that are required to reproduce
	the problem, along with any other repeat-by information.

o	Detailed description of the problem this splat identifies, for
	example, the code might fail to acquire a necessary lock, a plain
	load might be vulnerable to compiler optimizations, and so on.

o	If available, references to or excerpts from the comments
	and documentation defining the design rules that the old code
	violates.

o	If the commit's effect is to silence the warning with no other
	algorithmic change, an explanation as to why this is the right
	thing to do.  Continuing the plain-load example above, that
	load might be controlling a loop and the compiler might choose
	to hoist the load out of the loop, potentially resulting in an
	infinite loop (see the sketch after this list).

	Another example might be diagnostic code where the accesses are
	not really part of the concurrency design, but where we need
	KCSAN checking the other code that implements that design.
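
For the plain-load-controlling-a-loop case above, the classic shape is
(illustrative):

static bool done;	/* written by another thread */

static void wait_plain(void)
{
	while (!done)			/* plain load: the compiler may hoist it */
		cpu_relax();		/* out of the loop and spin forever */
}

static void wait_once(void)
{
	while (!READ_ONCE(done))	/* forces a fresh load on each pass */
		cpu_relax();
}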

Thoughts?  Additions?  Subtractions?

							Thanx, Paul
Linus Torvalds April 8, 2020, 9:14 p.m. UTC | #9
On Wed, Apr 8, 2020 at 1:26 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> OK, it sounds like we need to specify what needs to be present in this
> sort of commit log.

That would have helped.

But in this case, after having looked more at it, it wasn't that the
commit log was not sufficient, it was that the change was wrong.

With a better commit log and a pointer to the KCSAN report, I would
have been able to make a better judgment call earlier, and not have
had to ask for more information.

But once I figured out what the problem was, it was clear that

 (a) what the i915 driver does is at a minimum questionable, and quite
likely actively buggy

 (b) even if it's not buggy in the i915 driver, changing the list
traversal macros to use READ_ONCE() would hide other places where it
most definitely is buggy.

> o       The KCSAN splat, optionally including file/line numbers.

I do think that the KCSAN splat - very preferably simplified and with
explanations - should basically always be included if KCSAN was the
reason.

Of course, the "simplified and with explanations" may be simplified to
the point where none of the original splat even remains. Those splats
aren't so legible that anybody should feel like they should be
included.

Put another way: an analysis that is so thorough that it makes the raw
splat data pointless is a *good* thing, and if that means that no
splat is needed, all the better.

> o       Detailed description of the problem this splat identifies, for
>         example, the code might fail to acquire a necessary lock, a plain
>         load might be vulnerable to compiler optimizations, and so on.

See above: if the description is detailed enough, then the splat
itself becomes much less interesting.

But in this case, for example, it's worth pointing out that the "safe"
list accessors are used all over the kernel, and they are very much
_not_ thread-safe. The "safe" in them is purely about the current
thread removing an entry, not some kind of general thread-safeness.

Which is why I decided I really hated that patch - it basically would
make KCSAN unable to find _real_ races elsewhere, because it hid a
very special race in the i915 driver.

So I started out with the "that can't be right" kind of general
feeling of uneasiness about the patch. I ended up with a much more
specific "no, that's very much not right" and no amount of commit log
should have saved it.

As mentioned, I suspect that the i915 driver could actually use the
RCU-safe list walkers. Their particular use is not about RCU per se,
but it has some very similar rules about walking a list concurrently
with somebody adding an entry to it.

And the RCU list walkers do end up doing special things to load the
next pointer.

Whether we want to say - and document - that "ok, the RCU list walkers
are also useful for this non-RCU use case" in general might be worth
discussing.

It may be that the i915 use case is so special that it's only worth
documenting there.

              Linus
Paul E. McKenney April 8, 2020, 10:36 p.m. UTC | #10
On Tue, Apr 07, 2020 at 07:04:10PM +0100, Chris Wilson wrote:
> Quoting Linus Torvalds (2020-04-07 18:28:34)
> > On Tue, Apr 7, 2020 at 9:04 AM Chris Wilson <chris@chris-wilson.co.uk> wrote:

[ . . . ]

> There's some more shutting up required for KCSAN to bring the noise down
> to usable levels which I hope has been done so I don't have to argue for
> it, such as
> 
> diff --git a/include/linux/timer.h b/include/linux/timer.h
> index 1e6650ed066d..c7c8dd89f279 100644
> --- a/include/linux/timer.h
> +++ b/include/linux/timer.h
> @@ -164,7 +164,7 @@ static inline void destroy_timer_on_stack(struct timer_list *timer) { }
>   */
>  static inline int timer_pending(const struct timer_list * timer)
>  {
> -	return timer->entry.pprev != NULL;
> +	return READ_ONCE(timer->entry.pprev) != NULL;

This one is in mainline, courtesy of Eric Dumazet, though in a different
form.

The rest are still TBD.

							Thanx, Paul

>  }
> 
>  extern void add_timer_on(struct timer_list *timer, int cpu);
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 5352ce50a97e..7461b3f33629 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -565,8 +565,9 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner,
>  		/*
>  		 * Use vcpu_is_preempted to detect lock holder preemption issue.
>  		 */
> -		if (!owner->on_cpu || need_resched() ||
> -				vcpu_is_preempted(task_cpu(owner))) {
> +		if (!READ_ONCE(owner->on_cpu) ||
> +		    need_resched() ||
> +		    vcpu_is_preempted(task_cpu(owner))) {
>  			ret = false;
>  			break;
>  		}
> @@ -602,7 +603,7 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
>  	 * on cpu or its cpu is preempted
>  	 */
>  	if (owner)
> -		retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
> +		retval = READ_ONCE(owner->on_cpu) && !vcpu_is_preempted(task_cpu(owner));
>  	rcu_read_unlock();
> 
>  	/*
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 1f7734949ac8..4a81fba4cf70 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -75,7 +75,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>  		 * wait for either @lock to point to us, through its Step-B, or
>  		 * wait for a new @node->next from its Step-C.
>  		 */
> -		if (node->next) {
> +		if (READ_ONCE(node->next)) {
>  			next = xchg(&node->next, NULL);
>  			if (next)
>  				break;
> @@ -154,7 +154,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	 */
> 
>  	for (;;) {
> -		if (prev->next == node &&
> +		if (READ_ONCE(prev->next) == node &&
>  		    cmpxchg(&prev->next, node, NULL) == node)
>  			break;
> 
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 0d9b6be9ecc8..eef4835cecf2 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -650,7 +650,7 @@ static inline bool owner_on_cpu(struct task_struct *owner)
>  	 * As lock holder preemption issue, we both skip spinning if
>  	 * task is not on cpu or its cpu is preempted
>  	 */
> -	return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
> +	return READ_ONCE(owner->on_cpu) && !vcpu_is_preempted(task_cpu(owner));
>  }
> 
>  static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
>
Paul E. McKenney April 8, 2020, 10:37 p.m. UTC | #11
On Wed, Apr 08, 2020 at 02:14:38PM -0700, Linus Torvalds wrote:
> On Wed, Apr 8, 2020 at 1:26 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > OK, it sounds like we need to specify what needs to be present in this
> > sort of commit log.
> 
> That would have helped.
> 
> But in this case, after having looked more at it, it wasn't that the
> commit log was not sufficient, it was that the change was wrong.
> 
> With a better commit log and a pointer to the KCSAN report, I would
> have been able to make a better judgment call earlier, and not have
> had to ask for more information.
> 
> But once I figured out what the problem was, it was clear that
> 
>  (a) what the i915 driver does is at a minimum questionable, and quite
> likely actively buggy
> 
>  (b) even if it's not buggy in the i915 driver, changing the list
> traversal macros to use READ_ONCE() would hide other places where it
> most definitely is buggy.

Yeah, and I should especially have spotted this last one, given how many
times I have run into it while using KCSAN within RCU.  ;-/

In any case, we have review comments, discussion, and the occasional
controversy for non-KCSAN bug fixes, so it is only reasonable to expect
a similar process for KCSAN bug reports, right?  My main goal here is
to make things easier for people reviewing future KCSAN-inspired fixes.

> > o       The KCSAN splat, optionally including file/line numbers.
> 
> I do think that the KCSAN splat - very preferably simplified and with
> explanations - should basically always be included if KCSAN was the
> reason.
> 
> Of course, the "simplified and with explanations" may be simplified to
> the point where none of the original splat even remains. Those splats
> aren't so legible that anybody should feel like they should be
> included.
> 
> Put another way: an analysis that is so thorough that it makes the raw
> splat data pointless is a *good* thing, and if that means that no
> splat is needed, all the better.

Fair point.  The KCSAN splat has the added advantage of immediately
answering the "but does this really happen?" question, but then again
that question could also be answered by just stating that the issue was
found using KCSAN.

> > o       Detailed description of the problem this splat identifies, for
> >         example, the code might fail to acquire a necessary lock, a plain
> >         load might be vulnerable to compiler optimizations, and so on.
> 
> See above: if the description is detailed enough, then the splat
> itself becomes much less interesting.
> 
> But in this case, for example, it's worth pointing out that the "safe"
> list accessors are used all over the kernel, and they are very much
> _not_ thread-safe. The "safe" in them is purely about the current
> thread removing an entry, not some kind of general thread-safeness.
> 
> Which is why I decided I really hated that patch - it basically would
> make KCSAN unable to find _real_ races elsewhere, because it hid a
> very special race in the i915 driver.
> 
> So I started out with the "that can't be right" kind of general
> feeling of uneasiness about the patch. I ended up with a much more
> specific "no, that's very much not right" and no amount of commit log
> should have saved it.
> 
> As mentioned, I suspect that the i915 driver could actually use the
> RCU-safe list walkers. Their particular use is not about RCU per se,
> but it has some very similar rules about walking a list concurrently
> with somebody adding an entry to it.
> 
> And the RCU list walkers do end up doing special things to load the
> next pointer.
> 
> Whether we want to say - and document - that "ok, the RCU list walkers
> are also useful for this non-RCU use case" in general might be worth
> discussing.
> 
> It may be that the i915 use case is so special that it's only worth
> documenting there.

If documenting is the right approach, KCSAN's data_race() could be
thought of as KCSAN-visible documentation.

I will touch base with Chris separately.

						Thanx, Paul
Paul E. McKenney April 8, 2020, 10:44 p.m. UTC | #12
On Wed, Apr 08, 2020 at 03:37:57PM -0700, Paul E. McKenney wrote:
> On Wed, Apr 08, 2020 at 02:14:38PM -0700, Linus Torvalds wrote:
> > On Wed, Apr 8, 2020 at 1:26 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > OK, it sounds like we need to specify what needs to be present in this
> > > sort of commit log.
> > 
> > That would have helped.
> > 
> > But in this case, after having looked more at it, it wasn't that the
> > commit log was not sufficient, it was that the change was wrong.
> > 
> > With a better commit log and a pointer to the KCSAN report, I would
> > have been able to make a better judgment call earlier, and not have
> > had to ask for more information.
> > 
> > But once I figured out what the problem was, it was clear that
> > 
> >  (a) what the i915 driver does is at a minimum questionable, and quite
> > likely actively buggy
> > 
> >  (b) even if it's not buggy in the i915 driver, changing the list
> > traversal macros to use READ_ONCE() would hide other places where it
> > most definitely is buggy.
> 
> Yeah, and I should especially have spotted this last one, given how many
> times I have run into it while using KCSAN within RCU.  ;-/
> 
> In any case, we have review comments, discussion, and the occasional
> controversy for non-KCSAN bug fixes, so it is only reasonable to expect
> a similar process for KCSAN bug reports, right?  My main goal here is
> to make things easier for people reviewing future KCSAN-inspired fixes.
> 
> > > o       The KCSAN splat, optionally including file/line numbers.
> > 
> > I do think that the KCSAN splat - very preferably simplified and with
> > explanations - should basically always be included if KCSAN was the
> > reason.
> > 
> > Of course, the "simplified and with explanations" may be simplified to
> > the point where none of the original splat even remains. Those splats
> > aren't so legible that anybody should feel like they should be
> > included.
> > 
> > Put another way: an analysis that is so thorough that it makes the raw
> > splat data pointless is a *good* thing, and if that means that no
> > splat is needed, all the better.
> 
> Fair point.  The KCSAN splat has the added advantage of immediately
> answering the "but does this really happen?" question, but then again
> that question could also be answered by just stating that the issue was
> found using KCSAN.
> 
> > > o       Detailed description of the problem this splat identifies, for
> > >         example, the code might fail to acquire a necessary lock, a plain
> > >         load might be vulnerable to compiler optimizations, and so on.
> > 
> > See above: if the description is detailed enough, then the splat
> > itself becomes much less interesting.
> > 
> > But in this case, for example, it's worth pointing out that the "safe"
> > list accessors are used all over the kernel, and they are very much
> > _not_ thread-safe. The "safe" in them is purely about the current
> > thread removing an entry, not some kind of general thread-safeness.
> > 
> > Which is why I decided I really hated that patch - it basically would
> > make KCSAN unable to find _real_ races elsewhere, because it hid a
> > very special race in the i915 driver.
> > 
> > So I started out with the "that can't be right" kind of general
> > feeling of uneasiness about the patch. I ended up with a much more
> > specific "no, that's very much not right" and no amount of commit log
> > should have saved it.
> > 
> > As mentioned, I suspect that the i915 driver could actually use the
> > RCU-safe list walkers. Their particular use is not about RCU per se,
> > but it has some very similar rules about walking a list concurrently
> > with somebody adding an entry to it.
> > 
> > And the RCU list walkers do end up doing special things to load the
> > next pointer.
> > 
> > Whether we want to say - and document - that "ok, the RCU list walkers
> > are also useful for this non-RCU use case" in general might be worth
> > discussing.

This might work quite well, given the lockdep annotations that Joel
added to list_for_each_entry_rcu().  For example, lockdep_assert_held()
would tell it that it didn't need to be in an RCU reader as long as the
lock passed to lockdep_assert_held() was held at that point.
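
In sketch form (illustrative names; 'tl' and process() are placeholders),
using lockdep_is_held() as the optional condition argument:

struct i915_request *rq;

list_for_each_entry_rcu(rq, &tl->requests, link,
			lockdep_is_held(&tl->mutex))
	process(rq);

The condition tells the lockdep check to accept either an RCU read-side
critical section or tl->mutex being held, which is the case described
above.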

> > It may be that the i915 use case is so special that it's only worth
> > documenting there.
> 
> If documenting is the right approach, KCSAN's data_race() could be
> thought of as KCSAN-visible documentation.
> 
> I will touch base with Chris separately.

Of these options, what looks best to you?  Or would something else work
better?

							Thanx, Paul
Sultan Alsawaf April 8, 2020, 11:14 p.m. UTC | #13
On Wed, Apr 08, 2020 at 02:14:38PM -0700, Linus Torvalds wrote:
>  (a) what the i915 driver does is at a minimum questionable, and quite
> likely actively buggy

By the way, this doesn't even seem like the only place in i915 where locking and
object lifetime semantics are... special. I found a combined deadlock and
use-after-free last week, inside of interwoven recursive call chains, and posted
a patch here [1]. Unfortunately, it's been ignored thus far. But the interesting
thing is how the bizarre object lifetime semantics there required even more
weird things to enact a proper fix. That patch really should be merged (Chris?),
but moreover, it'd be nice to see a whole release cycle devoted to just cleaning
some of this stuff up. Showstopper bugs and hangs in i915 have really multiplied
in recent times.

Sultan

[1] https://lore.kernel.org/lkml/20200407064007.7599-1-sultan@kerneltoast.com

Patch

--- a/include/linux/list.h~list-prevent-compiler-reloads-inside-safe-list-iteration
+++ a/include/linux/list.h
@@ -537,6 +537,17 @@  static inline void list_splice_tail_init
 	list_entry((pos)->member.next, typeof(*(pos)), member)
 
 /**
+ * list_next_entry_safe - get the next element in list [once]
+ * @pos:	the type * to cursor
+ * @member:	the name of the list_head within the struct.
+ *
+ * Like list_next_entry() but prevents the compiler from reloading the
+ * next element.
+ */
+#define list_next_entry_safe(pos, member) \
+	list_entry(READ_ONCE((pos)->member.next), typeof(*(pos)), member)
+
+/**
  * list_prev_entry - get the prev element in list
  * @pos:	the type * to cursor
  * @member:	the name of the list_head within the struct.
@@ -545,6 +556,17 @@  static inline void list_splice_tail_init
 	list_entry((pos)->member.prev, typeof(*(pos)), member)
 
 /**
+ * list_prev_entry_safe - get the prev element in list [once]
+ * @pos:	the type * to cursor
+ * @member:	the name of the list_head within the struct.
+ *
+ * Like list_prev_entry() but prevents the compiler from reloading the
+ * previous element.
+ */
+#define list_prev_entry_safe(pos, member) \
+	list_entry(READ_ONCE((pos)->member.prev), typeof(*(pos)), member)
+
+/**
  * list_for_each	-	iterate over a list
  * @pos:	the &struct list_head to use as a loop cursor.
  * @head:	the head for your list.
@@ -686,9 +708,9 @@  static inline void list_splice_tail_init
  */
 #define list_for_each_entry_safe(pos, n, head, member)			\
 	for (pos = list_first_entry(head, typeof(*pos), member),	\
-		n = list_next_entry(pos, member);			\
+		n = list_next_entry_safe(pos, member);			\
 	     &pos->member != (head); 					\
-	     pos = n, n = list_next_entry(n, member))
+	     pos = n, n = list_next_entry_safe(n, member))
 
 /**
  * list_for_each_entry_safe_continue - continue list iteration safe against removal
@@ -700,11 +722,11 @@  static inline void list_splice_tail_init
  * Iterate over list of given type, continuing after current point,
  * safe against removal of list entry.
  */
-#define list_for_each_entry_safe_continue(pos, n, head, member) 		\
-	for (pos = list_next_entry(pos, member), 				\
-		n = list_next_entry(pos, member);				\
-	     &pos->member != (head);						\
-	     pos = n, n = list_next_entry(n, member))
+#define list_for_each_entry_safe_continue(pos, n, head, member) 	\
+	for (pos = list_next_entry(pos, member), 			\
+		n = list_next_entry_safe(pos, member);			\
+	     &pos->member != (head);					\
+	     pos = n, n = list_next_entry_safe(n, member))
 
 /**
  * list_for_each_entry_safe_from - iterate over list from current point safe against removal
@@ -716,10 +738,10 @@  static inline void list_splice_tail_init
  * Iterate over list of given type from current point, safe against
  * removal of list entry.
  */
-#define list_for_each_entry_safe_from(pos, n, head, member) 			\
-	for (n = list_next_entry(pos, member);					\
-	     &pos->member != (head);						\
-	     pos = n, n = list_next_entry(n, member))
+#define list_for_each_entry_safe_from(pos, n, head, member) 		\
+	for (n = list_next_entry_safe(pos, member);			\
+	     &pos->member != (head);					\
+	     pos = n, n = list_next_entry_safe(n, member))
 
 /**
  * list_for_each_entry_safe_reverse - iterate backwards over list safe against removal
@@ -733,9 +755,9 @@  static inline void list_splice_tail_init
  */
 #define list_for_each_entry_safe_reverse(pos, n, head, member)		\
 	for (pos = list_last_entry(head, typeof(*pos), member),		\
-		n = list_prev_entry(pos, member);			\
+		n = list_prev_entry_safe(pos, member);			\
 	     &pos->member != (head); 					\
-	     pos = n, n = list_prev_entry(n, member))
+	     pos = n, n = list_prev_entry_safe(n, member))
 
 /**
  * list_safe_reset_next - reset a stale list_for_each_entry_safe loop
@@ -750,7 +772,7 @@  static inline void list_splice_tail_init
  * completing the current iteration of the loop body.
  */
 #define list_safe_reset_next(pos, n, member)				\
-	n = list_next_entry(pos, member)
+	n = list_next_entry_safe(pos, member)
 
 /*
  * Double linked lists with a single pointer list head.