[RFC,1/2] sched: Fix exit_mm vs membarrier

Message ID 20200728160010.3314-1-mathieu.desnoyers@efficios.com
State New, archived

Commit Message

Mathieu Desnoyers July 28, 2020, 4 p.m. UTC
exit_mm should issue memory barriers after user-space memory accesses
and before clearing current->mm, so that user-space memory accesses
performed prior to exit_mm are ordered before the clearing of tsk->mm;
observing a cleared tsk->mm is what causes the membarrier private
expedited IPIs to be skipped.
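
For context, the IPIs get skipped because the CPU-selection loop in
kernel/sched/membarrier.c only targets CPUs whose current task runs the
caller's mm. The following is a simplified sketch of that loop, not the
verbatim kernel code (locking and cpumask setup elided):

static void membarrier_select_cpus(struct mm_struct *mm,
				   struct cpumask *tmpmask)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct task_struct *p;

		p = rcu_dereference(cpu_rq(cpu)->curr);
		/*
		 * A CPU whose current task has already cleared its mm
		 * (as exit_mm() does) is skipped: no IPI is sent, hence
		 * no ordering for the caller against that CPU.
		 */
		if (p && p->mm == mm)
			__cpumask_set_cpu(cpu, tmpmask);
	}
	/* The selected CPUs then receive an IPI which executes smp_mb(). */
}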

The membarrier system call can be issued concurrently with do_exit
if we have thread groups created with CLONE_VM but not CLONE_THREAD.
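
For illustration, such a sibling thread group can be created from user
space along the following lines (a minimal sketch, error handling
omitted):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>

#define STACK_SIZE	(1024 * 1024)

static int thread_b(void *arg)
{
	/* Runs in thread group B: shares the mm, not the thread group. */
	return 0;
}

int main(void)
{
	char *stack = malloc(STACK_SIZE);

	/*
	 * CLONE_VM without CLONE_THREAD: the child shares our address
	 * space but is its own thread group, so it can issue
	 * membarrier() concurrently with our exit path.
	 */
	clone(thread_b, stack + STACK_SIZE,	/* stack grows down */
	      CLONE_VM | SIGCHLD, NULL);
	return 0;
}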

The following comment in exit_mm() seems rather puzzling though:

        /* more a memory barrier than a real lock */
        task_lock(current);

So considering that spinlocks are not full memory barriers nowadays,
some digging into the origins of this comment led me back to 2002, to
the earliest commit tracked by Thomas Gleixner's bitkeeper-era history
at https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/

At that point, this comment was followed by:

+               /* more a memory barrier than a real lock */
+               task_lock(tsk);
+               tsk->mm = NULL;
+               task_unlock(tsk);

This seems to indicate that grabbing the lock is really about acting as
a memory barrier, but I wonder whether it is meant to be a full barrier
or just an acquire.
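
If a full barrier combined with the lock acquisition turns out to be
the intent, one way to express it would be the existing
smp_mb__after_spinlock() primitive, e.g. (a sketch only, not what this
patch proposes):

	task_lock(current);
	/*
	 * Sketch: smp_mb__after_spinlock() upgrades the acquire
	 * implied by the spinlock to a full barrier, avoiding a
	 * separate smp_mb().
	 */
	smp_mb__after_spinlock();
	current->mm = NULL;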

If a full memory barrier happens to be needed even without membarrier,
perhaps this fix also corrects other issues? It is unclear from the
comment what this memory barrier orders, though. Is it the chain
exit -> waitpid, or is it related to entering lazy TLB?

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: linux-mm@kvack.org
---
 kernel/exit.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Peter Zijlstra Aug. 4, 2020, 2:34 p.m. UTC | #1
On Tue, Jul 28, 2020 at 12:00:09PM -0400, Mathieu Desnoyers wrote:
> exit_mm should issue memory barriers after user-space memory accesses
> and before clearing current->mm, so that user-space memory accesses
> performed prior to exit_mm are ordered before the clearing of tsk->mm;
> observing a cleared tsk->mm is what causes the membarrier private
> expedited IPIs to be skipped.
> 
> The membarrier system call can be issued concurrently with do_exit
> if we have thread groups created with CLONE_VM but not CLONE_THREAD.

I'm still wondering what the exact failure case is though; exit_mm() is on
the exit path (as the name very much implies) and the thread is about to
die. The context switch that follows guarantees a full barrier before we
run anything else again.

> The following comment in exit_mm() seems rather puzzling though:
> 
>         /* more a memory barrier than a real lock */
>         task_lock(current);
> 
> So considering that spinlocks are not full memory barriers nowadays,
> some digging into the origins of this comment led me back to 2002, to
> the earliest commit tracked by Thomas Gleixner's bitkeeper-era history
> at https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/
> 
> At that point, this comment was followed by:
> 
> +               /* more a memory barrier than a real lock */
> +               task_lock(tsk);
> +               tsk->mm = NULL;
> +               task_unlock(tsk);
> 
> This seems to indicate that grabbing the lock is really about acting as
> a memory barrier, but I wonder whether it is meant to be a full barrier
> or just an acquire.
> 
> If a full memory barrier happens to be needed even without membarrier,
> perhaps this fix also corrects other issues? It is unclear from the
> comment what this memory barrier orders, though. Is it the chain
> exit -> waitpid, or is it related to entering lazy TLB?

I'm as puzzled by that comment as are you.

It does seem to be required for glorious stuff like get_task_mm()
though.
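
For reference, get_task_mm() (kernel/fork.c) relies on task_lock() to
keep ->mm stable while taking a reference; it looks roughly like this:

struct mm_struct *get_task_mm(struct task_struct *task)
{
	struct mm_struct *mm;

	task_lock(task);
	mm = task->mm;
	if (mm) {
		if (task->flags & PF_KTHREAD)
			mm = NULL;	/* kernel threads only borrow mms */
		else
			mmget(mm);	/* pin the mm before dropping the lock */
	}
	task_unlock(task);
	return mm;
}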

> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: linux-mm@kvack.org
> ---
>  kernel/exit.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 727150f28103..ce272ec55cdc 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -474,6 +474,14 @@ static void exit_mm(void)
>  	BUG_ON(mm != current->active_mm);
>  	/* more a memory barrier than a real lock */
>  	task_lock(current);
> +	/*
> +	 * When a thread stops operating on an address space, the loop
> +	 * in membarrier_{private,global}_expedited() may observe a
> +	 * NULL tsk->mm and skip the IPI. Membarrier requires a
> +	 * memory barrier after accessing user-space memory, before
> +	 * clearing tsk->mm.
> +	 */
> +	smp_mb();

So like stated above, I'm not sure how missing that IPI is going to be a
problem, we're dying...

>  	current->mm = NULL;
>  	mmap_read_unlock(mm);
>  	enter_lazy_tlb(mm, current);
> -- 
> 2.11.0
>
Mathieu Desnoyers Aug. 4, 2020, 2:48 p.m. UTC | #2
----- On Aug 4, 2020, at 10:34 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Jul 28, 2020 at 12:00:09PM -0400, Mathieu Desnoyers wrote:
>> exit_mm should issue memory barriers after user-space memory accesses
>> and before clearing current->mm, so that user-space memory accesses
>> performed prior to exit_mm are ordered before the clearing of tsk->mm;
>> observing a cleared tsk->mm is what causes the membarrier private
>> expedited IPIs to be skipped.
>> 
>> The membarrier system call can be issued concurrently with do_exit
>> if we have thread groups created with CLONE_VM but not CLONE_THREAD.
> 
> I'm still wondering what the exact failure case is though; exit_mm() is on
> the exit path (as the name very much implies) and the thread is about to
> die. The context switch that follows guarantees a full barrier before we
> run anything else again.

Here is the scenario I have in mind:

Two thread groups are created, A and B. Thread group B is created by
issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
Let's assume we have a single thread within each thread group (Thread A
and Thread B).

Then, AFAIU, we can have:

Userspace variables:

int x = 0, y = 0;

CPU 0                   CPU 1
Thread A                Thread B
(in thread group A)     (in thread group B)

x = 1
barrier()
y = 1
exit()
exit_mm()
current->mm = NULL;
                        r1 = load y
                        membarrier()
                          skips CPU 0 (no IPI) because its current mm is NULL
                        r2 = load x
                        BUG_ON(r1 == 1 && r2 == 0)
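
Expressed as a user-space sketch (hypothetical and untested; it assumes
Thread B was created with CLONE_VM as described above and has already
issued MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED):

#define _GNU_SOURCE
#include <assert.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

volatile int x, y;

static void thread_a(void)	/* CPU 0, thread group A */
{
	x = 1;
	__asm__ __volatile__("" ::: "memory");	/* barrier() */
	y = 1;
	_exit(0);		/* exit_mm() clears current->mm */
}

static void thread_b(void)	/* CPU 1, thread group B, shared mm */
{
	int r1, r2;

	r1 = y;
	/*
	 * Should IPI every CPU running our mm; with the bug, CPU 0 is
	 * skipped once its current->mm is cleared, even though its
	 * store to x may not be visible here yet.
	 */
	syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
	r2 = x;
	assert(!(r1 == 1 && r2 == 0));
}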

Thanks,

Mathieu
Peter Zijlstra Aug. 4, 2020, 4:51 p.m. UTC | #3
On Tue, Aug 04, 2020 at 10:48:41AM -0400, Mathieu Desnoyers wrote:
> Here is the scenario I have in mind:

> Userspace variables:
> 
> int x = 0, y = 0;
> 
> CPU 0                   CPU 1
> Thread A                Thread B
> (in thread group A)     (in thread group B)
> 
> x = 1
> barrier()
> y = 1
> exit()
> exit_mm()
> current->mm = NULL;
>                         r1 = load y
>                         membarrier()
>                           skips CPU 0 (no IPI) because its current mm is NULL
>                         r2 = load x
>                         BUG_ON(r1 == 1 && r2 == 0)
> 

Ah, yes of course.

We really should have a bunch of these scenarios in membarrier.c.



Now, the above cannot happen because we have an unconditional
atomic_dec_and_test() in do_exit() before exit_mm(), but I'm sure
relying on that is a wee bit dodgy.
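
The sequence in question is roughly the following (kernel/exit.c,
surrounding code elided); value-returning atomic RMW operations are
fully ordered, which is why the dec-and-test happens to provide the
barrier:

	group_dead = atomic_dec_and_test(&tsk->signal->live);
	/* ... other exit work ... */
	exit_mm();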
Mathieu Desnoyers Aug. 4, 2020, 5:25 p.m. UTC | #4
----- On Aug 4, 2020, at 12:51 PM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Aug 04, 2020 at 10:48:41AM -0400, Mathieu Desnoyers wrote:
>> Here is the scenario I have in mind:
> 
>> Userspace variables:
>> 
>> int x = 0, y = 0;
>> 
>> CPU 0                   CPU 1
>> Thread A                Thread B
>> (in thread group A)     (in thread group B)
>> 
>> x = 1
>> barrier()
>> y = 1
>> exit()
>> exit_mm()
>> current->mm = NULL;
>>                         r1 = load y
>>                         membarrier()
>>                           skips CPU 0 (no IPI) because its current mm is NULL
>>                         r2 = load x
>>                         BUG_ON(r1 == 1 && r2 == 0)
>> 
> 
> Ah, yes of course.
> 
> We really should have a bunch of these scenarios in membarrier.c.

Good point.

> 
> 
> 
> Now, the above cannot happen because we have an unconditional
> atomic_dec_and_test() in do_exit() before exit_mm(), but I'm sure
> relying on that is a wee bit dodgy.

I am not against using this already existing barrier to provide the
guarantee we need, but it would have to be documented in the code.

Thanks,

Mathieu

Patch

diff --git a/kernel/exit.c b/kernel/exit.c
index 727150f28103..ce272ec55cdc 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -474,6 +474,14 @@ static void exit_mm(void)
 	BUG_ON(mm != current->active_mm);
 	/* more a memory barrier than a real lock */
 	task_lock(current);
+	/*
+	 * When a thread stops operating on an address space, the loop
+	 * in membarrier_{private,global}_expedited() may observe a
+	 * NULL tsk->mm and skip the IPI. Membarrier requires a
+	 * memory barrier after accessing user-space memory, before
+	 * clearing tsk->mm.
+	 */
+	smp_mb();
 	current->mm = NULL;
 	mmap_read_unlock(mm);
 	enter_lazy_tlb(mm, current);