[v2,10/10] kernel: might_fault does not imply might_sleep

Message ID	1f85dc8e6a0149677563a2dfb4cef9a9c7eaa391.1368702323.git.mst@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> Date: Thu, 16 May 2013 14:16:10 +0300 From: "Michael S. Tsirkin" <mst@redhat.com> To: linux-kernel@vger.kernel.org Subject: [PATCH v2 10/10] kernel: might_fault does not imply might_sleep Message-ID: <1f85dc8e6a0149677563a2dfb4cef9a9c7eaa391.1368702323.git.mst@redhat.com> References: <cover.1368702323.git.mst@redhat.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <cover.1368702323.git.mst@redhat.com> summary: Content analysis details: (-7.5 points) pts rule name description ---- ---------------------- -------------------------------------------------- -5.0 RCVD_IN_DNSWL_HI RBL: Sender listed at http://www.dnswl.org/, high trust [209.132.183.28 listed in list.dnswl.org] -0.0 SPF_HELO_PASS SPF: HELO matches SPF record -0.0 SPF_PASS SPF: sender matches SPF record -0.6 RP_MATCHES_RCVD Envelope sender domain matches handover relay domain -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1% [score: 0.0000] Cc: linux-m32r-ja@ml.linux-m32r.org, kvm@vger.kernel.org, Peter Zijlstra <peterz@infradead.org>, Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will.deacon@arm.com>, David Howells <dhowells@redhat.com>, linux-mm@kvack.org, Paul Mackerras <paulus@samba.org>, "H. Peter Anvin" <hpa@zytor.com>, linux-arch@vger.kernel.org, linux-am33-list@redhat.com, Hirokazu Takata <takata@linux-m32r.org>, x86@kernel.org, Ingo Molnar <mingo@redhat.com>, Benjamin Herrenschmidt <benh@kernel.crashing.org>, Arnd Bergmann <arnd@arndb.de>, microblaze-uclinux@itee.uq.edu.au, Chris Metcalf <cmetcalf@tilera.com>, Thomas Gleixner <tglx@linutronix.de>, linux-arm-kernel@lists.infradead.org, Michal Simek <monstr@monstr.eu>, linux-m32r@ml.linux-m32r.org, Koichi Yasutake <yasutake.koichi@jp.panasonic.com>, linuxppc-dev@lists.ozlabs.org Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org

Michael S. Tsirkin May 16, 2013, 11:16 a.m. UTC

There are several ways to make sure might_fault
calling function does not sleep.
One is to use it on kernel or otherwise locked memory - apparently
nfs/sunrpc does this. As noted by Ingo, this is handled by the
migh_fault() implementation in mm/memory.c but not the one in
linux/kernel.h so in the current code might_fault() schedules
differently depending on CONFIG_PROVE_LOCKING, which is an undesired
semantical side effect.

Another is to call pagefault_disable: in this case the page fault
handler will go to fixups processing and we get an error instead of
sleeping, so the might_sleep annotation is a false positive.
vhost driver wants to do this now in order to reuse socket ops
under a spinlock (and fall back on slower thread handler
on error).

Address both issues by:
	- dropping the unconditional call to might_sleep
	  from the fast might_fault code in linux/kernel.h
	- checking for pagefault_disable() in the
	  CONFIG_PROVE_LOCKING implementation

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/kernel.h |  1 -
 mm/memory.c            | 14 +++++++++-----
 2 files changed, 9 insertions(+), 6 deletions(-)

Peter Zijlstra May 16, 2013, 6:40 p.m. UTC | #1

On Thu, May 16, 2013 at 02:16:10PM +0300, Michael S. Tsirkin wrote:
> There are several ways to make sure might_fault
> calling function does not sleep.
> One is to use it on kernel or otherwise locked memory - apparently
> nfs/sunrpc does this. As noted by Ingo, this is handled by the
> migh_fault() implementation in mm/memory.c but not the one in
> linux/kernel.h so in the current code might_fault() schedules
> differently depending on CONFIG_PROVE_LOCKING, which is an undesired
> semantical side effect.
> 
> Another is to call pagefault_disable: in this case the page fault
> handler will go to fixups processing and we get an error instead of
> sleeping, so the might_sleep annotation is a false positive.
> vhost driver wants to do this now in order to reuse socket ops
> under a spinlock (and fall back on slower thread handler
> on error).

Are you using the assumption that spin_lock() implies preempt_disable() implies
pagefault_disable()? Note that this assumption isn't valid for -rt where the
spinlock becomes preemptible but we'll not disable pagefaults.

> Address both issues by:
> 	- dropping the unconditional call to might_sleep
> 	  from the fast might_fault code in linux/kernel.h
> 	- checking for pagefault_disable() in the
> 	  CONFIG_PROVE_LOCKING implementation
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  include/linux/kernel.h |  1 -
>  mm/memory.c            | 14 +++++++++-----
>  2 files changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index e96329c..322b065 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -198,7 +198,6 @@ void might_fault(void);
>  #else
>  static inline void might_fault(void)
>  {
> -	might_sleep();

This removes potential resched points for PREEMPT_VOLUNTARY -- was that
intentional?

>  }
>  #endif
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 6dc1882..1b8327b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4222,13 +4222,17 @@ void might_fault(void)
>  	if (segment_eq(get_fs(), KERNEL_DS))
>  		return;
>  
> -	might_sleep();
>  	/*
> -	 * it would be nicer only to annotate paths which are not under
> -	 * pagefault_disable, however that requires a larger audit and
> -	 * providing helpers like get_user_atomic.
> +	 * It would be nicer to annotate paths which are under preempt_disable
> +	 * but not under pagefault_disable, however that requires a new flag
> +	 * for differentiating between the two.

-rt has this, pagefault_disable() doesn't change the preempt count but pokes
at task_struct::pagefault_disable.

>  	 */
> -	if (!in_atomic() && current->mm)
> +	if (in_atomic())
> +		return;
> +
> +	might_sleep();
> +
> +	if (current->mm)
>  		might_lock_read(&current->mm->mmap_sem);
>  }
>  EXPORT_SYMBOL(might_fault);
> -- 
> MST

Michael S. Tsirkin May 19, 2013, 9:35 a.m. UTC | #2

On Thu, May 16, 2013 at 08:40:41PM +0200, Peter Zijlstra wrote:
> On Thu, May 16, 2013 at 02:16:10PM +0300, Michael S. Tsirkin wrote:
> > There are several ways to make sure might_fault
> > calling function does not sleep.
> > One is to use it on kernel or otherwise locked memory - apparently
> > nfs/sunrpc does this. As noted by Ingo, this is handled by the
> > migh_fault() implementation in mm/memory.c but not the one in
> > linux/kernel.h so in the current code might_fault() schedules
> > differently depending on CONFIG_PROVE_LOCKING, which is an undesired
> > semantical side effect.
> > 
> > Another is to call pagefault_disable: in this case the page fault
> > handler will go to fixups processing and we get an error instead of
> > sleeping, so the might_sleep annotation is a false positive.
> > vhost driver wants to do this now in order to reuse socket ops
> > under a spinlock (and fall back on slower thread handler
> > on error).
> 
> Are you using the assumption that spin_lock() implies preempt_disable() implies
> pagefault_disable()? Note that this assumption isn't valid for -rt where the
> spinlock becomes preemptible but we'll not disable pagefaults.

No, I was not assuming that. What I'm trying to say is that a caller
that does something like this under a spinlock:
	preempt_disable
	pagefault_disable
	error = copy_to_user
	pagefault_enable
	preempt_enable_no_resched

is not doing anything wrong and should not get a warning,
as long as error is handled correctly later.
Right?

> > Address both issues by:
> > 	- dropping the unconditional call to might_sleep
> > 	  from the fast might_fault code in linux/kernel.h
> > 	- checking for pagefault_disable() in the
> > 	  CONFIG_PROVE_LOCKING implementation
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  include/linux/kernel.h |  1 -
> >  mm/memory.c            | 14 +++++++++-----
> >  2 files changed, 9 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > index e96329c..322b065 100644
> > --- a/include/linux/kernel.h
> > +++ b/include/linux/kernel.h
> > @@ -198,7 +198,6 @@ void might_fault(void);
> >  #else
> >  static inline void might_fault(void)
> >  {
> > -	might_sleep();
> 
> This removes potential resched points for PREEMPT_VOLUNTARY -- was that
> intentional?

No it's a bug. Thanks for pointing this out.
OK so I guess it should be might_sleep_if(!in_atomic())
and this means might_fault would have to move from linux/kernel.h to
linux/uaccess.h, since in_atomic() is in linux/hardirq.h

Makes sense?

> >  }
> >  #endif
> >  
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6dc1882..1b8327b 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4222,13 +4222,17 @@ void might_fault(void)
> >  	if (segment_eq(get_fs(), KERNEL_DS))
> >  		return;
> >  
> > -	might_sleep();
> >  	/*
> > -	 * it would be nicer only to annotate paths which are not under
> > -	 * pagefault_disable, however that requires a larger audit and
> > -	 * providing helpers like get_user_atomic.
> > +	 * It would be nicer to annotate paths which are under preempt_disable
> > +	 * but not under pagefault_disable, however that requires a new flag
> > +	 * for differentiating between the two.
> 
> -rt has this, pagefault_disable() doesn't change the preempt count but pokes
> at task_struct::pagefault_disable.

Good to know.

So maybe we can import this at least for CONFIG_PROVE_LOCKING?
To make the patch smaller I'd prefer doing both for now,
this way this patchset does not have to poke in too many
mm internals.
I can try doing that - unless
someone else has plans to merge this part soon anyway?

> >  	 */
> > -	if (!in_atomic() && current->mm)
> > +	if (in_atomic())
> > +		return;
> > +
> > +	might_sleep();
> > +
> > +	if (current->mm)
> >  		might_lock_read(&current->mm->mmap_sem);
> >  }
> >  EXPORT_SYMBOL(might_fault);
> > -- 
> > MST

Steven Rostedt May 19, 2013, 12:34 p.m. UTC | #3

On Sun, 2013-05-19 at 12:35 +0300, Michael S. Tsirkin wrote:

> No, I was not assuming that. What I'm trying to say is that a caller
> that does something like this under a spinlock:
> 	preempt_disable
> 	pagefault_disable
> 	error = copy_to_user
> 	pagefault_enable
> 	preempt_enable_no_resched
> 
> is not doing anything wrong and should not get a warning,
> as long as error is handled correctly later.
> Right?
> 

What I see wrong with the above is the preempt_enable_no_resched(). The
only place that should be ever used is right before a schedule(), as you
don't want to schedule twice (once for the preempt_enable() and then
again for the schedule itself).

Remember, in -rt, a spin lock does not disable preemption.

-- Steve

Michael S. Tsirkin May 19, 2013, 1:34 p.m. UTC | #4

On Sun, May 19, 2013 at 08:34:04AM -0400, Steven Rostedt wrote:
> On Sun, 2013-05-19 at 12:35 +0300, Michael S. Tsirkin wrote:
> 
> > No, I was not assuming that. What I'm trying to say is that a caller
> > that does something like this under a spinlock:
> > 	preempt_disable
> > 	pagefault_disable
> > 	error = copy_to_user
> > 	pagefault_enable
> > 	preempt_enable_no_resched
> > 
> > is not doing anything wrong and should not get a warning,
> > as long as error is handled correctly later.
> > Right?
> > 
> 
> What I see wrong with the above is the preempt_enable_no_resched(). The
> only place that should be ever used is right before a schedule(), as you
> don't want to schedule twice (once for the preempt_enable() and then
> again for the schedule itself).
> 
> Remember, in -rt, a spin lock does not disable preemption.
> 
> -- Steve

Right but we need to keep it working on upstream as well.
If I do preempt_enable under a spinlock upstream won't it
try to sleep under spinlock?

Steven Rostedt May 19, 2013, 4:06 p.m. UTC | #5

On Sun, 2013-05-19 at 16:34 +0300, Michael S. Tsirkin wrote:

> Right but we need to keep it working on upstream as well.
> If I do preempt_enable under a spinlock upstream won't it
> try to sleep under spinlock?

No it wont. A spinlock calls preempt_disable implicitly, and a
preempt_enable() will not schedule unless preempt_count is zero, which
it wont be under a spinlock.

If it did, there would be lots of bugs all over the place because this
is done throughout the kernel (a preempt_enable() under a spinlock).

In other words, don't ever use preempt_enable_no_resched().

-- Steve

Michael S. Tsirkin May 19, 2013, 4:40 p.m. UTC | #6

On Sun, May 19, 2013 at 12:06:19PM -0400, Steven Rostedt wrote:
> On Sun, 2013-05-19 at 16:34 +0300, Michael S. Tsirkin wrote:
> 
> > Right but we need to keep it working on upstream as well.
> > If I do preempt_enable under a spinlock upstream won't it
> > try to sleep under spinlock?
> 
> No it wont. A spinlock calls preempt_disable implicitly, and a
> preempt_enable() will not schedule unless preempt_count is zero, which
> it wont be under a spinlock.
> 
> If it did, there would be lots of bugs all over the place because this
> is done throughout the kernel (a preempt_enable() under a spinlock).
> 
> In other words, don't ever use preempt_enable_no_resched().
> 
> -- Steve
> 


OK I get it. So let me correct myself. The simple code
that does something like this under a spinlock:
>       preempt_disable
>       pagefault_disable
>       error = copy_to_user
>       pagefault_enable
>       preempt_enable
>
is not doing anything wrong and should not get a warning,
as long as error is handled correctly later.
Right?

Steven Rostedt May 19, 2013, 8:23 p.m. UTC | #7

On Sun, 2013-05-19 at 19:40 +0300, Michael S. Tsirkin wrote:

> OK I get it. So let me correct myself. The simple code
> that does something like this under a spinlock:
> >       preempt_disable
> >       pagefault_disable
> >       error = copy_to_user
> >       pagefault_enable
> >       preempt_enable
> >
> is not doing anything wrong and should not get a warning,
> as long as error is handled correctly later.
> Right?

I came in mid thread and I don't know the context. Anyway, the above
looks to me as you just don't want to sleep. If you try to copy data to
user space that happens not to be currently mapped for any reason, you
will get an error. Even if the address space is completely valid. Is
that what you want?

-- Steve

Michael S. Tsirkin May 19, 2013, 8:35 p.m. UTC | #8

On Sun, May 19, 2013 at 04:23:22PM -0400, Steven Rostedt wrote:
> On Sun, 2013-05-19 at 19:40 +0300, Michael S. Tsirkin wrote:
> 
> > OK I get it. So let me correct myself. The simple code
> > that does something like this under a spinlock:
> > >       preempt_disable
> > >       pagefault_disable
> > >       error = copy_to_user
> > >       pagefault_enable
> > >       preempt_enable
> > >
> > is not doing anything wrong and should not get a warning,
> > as long as error is handled correctly later.
> > Right?
> 
> I came in mid thread and I don't know the context.

The context is that I want to change might_fault
from might_sleep to
might_sleep_if(!in_atomic())
so that above does not trigger warnings even with
CONFIG_DEBUG_ATOMIC_SLEEP enabled.

> Anyway, the above
> looks to me as you just don't want to sleep.

Exactly. upstream we can just do pagefault_disable but to make
this code -rt ready it's best to do preempt_disable as well.

> If you try to copy data to
> user space that happens not to be currently mapped for any reason, you
> will get an error. Even if the address space is completely valid. Is
> that what you want?
> 
> -- Steve
> 

Yes, this is by design.
We detect that and bounce the work to a thread outside
any locks.

Thanks,

Peter Zijlstra May 21, 2013, 11:18 a.m. UTC | #9

On Sun, May 19, 2013 at 07:40:09PM +0300, Michael S. Tsirkin wrote:
> OK I get it. So let me correct myself. The simple code
> that does something like this under a spinlock:
> >       preempt_disable
> >       pagefault_disable
> >       error = copy_to_user
> >       pagefault_enable
> >       preempt_enable
> >
> is not doing anything wrong and should not get a warning,
> as long as error is handled correctly later.
> Right?

Indeed, but I don't get the point of the preempt_{disable,enable}()
here. Why does it have to disable preemption explicitly here? I thought
all you wanted was to avoid the pagefault handler and make it do the
exception table thing; for that pagefault_disable() is sufficient.

Peter Zijlstra May 21, 2013, 11:21 a.m. UTC | #10

On Sun, May 19, 2013 at 12:35:26PM +0300, Michael S. Tsirkin wrote:
> On Thu, May 16, 2013 at 08:40:41PM +0200, Peter Zijlstra wrote:
> > On Thu, May 16, 2013 at 02:16:10PM +0300, Michael S. Tsirkin wrote:
> > > There are several ways to make sure might_fault
> > > calling function does not sleep.
> > > One is to use it on kernel or otherwise locked memory - apparently
> > > nfs/sunrpc does this. As noted by Ingo, this is handled by the
> > > migh_fault() implementation in mm/memory.c but not the one in
> > > linux/kernel.h so in the current code might_fault() schedules
> > > differently depending on CONFIG_PROVE_LOCKING, which is an undesired
> > > semantical side effect.
> > > 
> > > Another is to call pagefault_disable: in this case the page fault
> > > handler will go to fixups processing and we get an error instead of
> > > sleeping, so the might_sleep annotation is a false positive.
> > > vhost driver wants to do this now in order to reuse socket ops
> > > under a spinlock (and fall back on slower thread handler
> > > on error).
> > 
> > Are you using the assumption that spin_lock() implies preempt_disable() implies
> > pagefault_disable()? Note that this assumption isn't valid for -rt where the
> > spinlock becomes preemptible but we'll not disable pagefaults.
> 
> No, I was not assuming that. What I'm trying to say is that a caller
> that does something like this under a spinlock:
> 	preempt_disable
> 	pagefault_disable
> 	error = copy_to_user
> 	pagefault_enable
> 	preempt_enable_no_resched
> 
> is not doing anything wrong and should not get a warning,
> as long as error is handled correctly later.
> Right?

Aside from the no_resched() thing which Steven already explained and my
previous email asking why you need the preempt_disable() at all, that
should indeed work.

The reason I was asking was that I wasn't sure you weren't doing:

  spin_lock(&my_lock);
  error = copy_to_user();
  spin_unlock(&my_lock);

and expecting the copy_to_user() to always take the exception table
route. This works on mainline (since spin_lock implies a preempt disable
and preempt_disable is the same as pagefault_disable). However as should
be clear by now, it doesn't quite work that way for -rt.

Michael S. Tsirkin May 22, 2013, 8:36 p.m. UTC | #11

On Thu, May 16, 2013 at 08:40:41PM +0200, Peter Zijlstra wrote:
> On Thu, May 16, 2013 at 02:16:10PM +0300, Michael S. Tsirkin wrote:
> > There are several ways to make sure might_fault
> > calling function does not sleep.
> > One is to use it on kernel or otherwise locked memory - apparently
> > nfs/sunrpc does this. As noted by Ingo, this is handled by the
> > migh_fault() implementation in mm/memory.c but not the one in
> > linux/kernel.h so in the current code might_fault() schedules
> > differently depending on CONFIG_PROVE_LOCKING, which is an undesired
> > semantical side effect.
> > 
> > Another is to call pagefault_disable: in this case the page fault
> > handler will go to fixups processing and we get an error instead of
> > sleeping, so the might_sleep annotation is a false positive.
> > vhost driver wants to do this now in order to reuse socket ops
> > under a spinlock (and fall back on slower thread handler
> > on error).
> 
> Are you using the assumption that spin_lock() implies preempt_disable() implies
> pagefault_disable()? Note that this assumption isn't valid for -rt where the
> spinlock becomes preemptible but we'll not disable pagefaults.
> 
> > Address both issues by:
> > 	- dropping the unconditional call to might_sleep
> > 	  from the fast might_fault code in linux/kernel.h
> > 	- checking for pagefault_disable() in the
> > 	  CONFIG_PROVE_LOCKING implementation
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  include/linux/kernel.h |  1 -
> >  mm/memory.c            | 14 +++++++++-----
> >  2 files changed, 9 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > index e96329c..322b065 100644
> > --- a/include/linux/kernel.h
> > +++ b/include/linux/kernel.h
> > @@ -198,7 +198,6 @@ void might_fault(void);
> >  #else
> >  static inline void might_fault(void)
> >  {
> > -	might_sleep();
> 
> This removes potential resched points for PREEMPT_VOLUNTARY -- was that
> intentional?

OK so I'm thinking of going back to this idea:
it has the advantage of being very simple,
and just might make some workloads faster
if they do lots of copy_XX_user in a loop.

Will have to be tested of course - anyone
has objections?

> >  }
> >  #endif
> >  
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6dc1882..1b8327b 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4222,13 +4222,17 @@ void might_fault(void)
> >  	if (segment_eq(get_fs(), KERNEL_DS))
> >  		return;
> >  
> > -	might_sleep();
> >  	/*
> > -	 * it would be nicer only to annotate paths which are not under
> > -	 * pagefault_disable, however that requires a larger audit and
> > -	 * providing helpers like get_user_atomic.
> > +	 * It would be nicer to annotate paths which are under preempt_disable
> > +	 * but not under pagefault_disable, however that requires a new flag
> > +	 * for differentiating between the two.
> 
> -rt has this, pagefault_disable() doesn't change the preempt count but pokes
> at task_struct::pagefault_disable.
> 
> >  	 */
> > -	if (!in_atomic() && current->mm)
> > +	if (in_atomic())
> > +		return;
> > +
> > +	might_sleep();
> > +
> > +	if (current->mm)
> >  		might_lock_read(&current->mm->mmap_sem);
> >  }
> >  EXPORT_SYMBOL(might_fault);
> > -- 
> > MST

[v2,10/10] kernel: might_fault does not imply might_sleep

Commit Message

Comments

Patch