
[v3,3/4] uaccess: Check no rescheduling function is called in unsafe region

Message ID 1547560709-56207-4-git-send-email-julien.thierry@arm.com (mailing list archive)
State New, archived
Series uaccess: Add unsafe accessors for arm64

Commit Message

Julien Thierry Jan. 15, 2019, 1:58 p.m. UTC
While running a user_access region, it is not supported to reschedule.
Add an overridable primitive to indicate whether a user_access region is
active and check that this is not the case when calling rescheduling
functions.

These checks are only performed when DEBUG_UACCESS_SLEEP is selected.

Also, add a comment clarifying the behaviour of user_access regions.

Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>

---
 include/linux/kernel.h  | 11 +++++++++--
 include/linux/uaccess.h | 13 +++++++++++++
 kernel/sched/core.c     | 22 ++++++++++++++++++++++
 lib/Kconfig.debug       |  8 ++++++++
 4 files changed, 52 insertions(+), 2 deletions(-)

--
1.9.1
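
For context, here is a minimal sketch of the kind of region the new check is
meant to police (illustrative only, not taken from this series; the helper and
its arguments are made up). No function that may reschedule is allowed between
user_access_begin() and user_access_end():

	/* Sketch: copy two words from userspace with the unsafe accessors. */
	static int read_pair(u32 __user *uptr, u32 *a, u32 *b)
	{
		if (!user_access_begin(uptr, 2 * sizeof(u32)))
			return -EFAULT;

		/* No call that may sleep or reschedule in here. */
		unsafe_get_user(*a, &uptr[0], efault);
		unsafe_get_user(*b, &uptr[1], efault);

		user_access_end();
		return 0;

	efault:
		user_access_end();
		return -EFAULT;
	}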

Comments

Valentin Schneider Jan. 30, 2019, 4:58 p.m. UTC | #1
On 15/01/2019 13:58, Julien Thierry wrote:
[...]> @@ -6151,6 +6159,20 @@ void ___might_sleep(const char *file, int line, int preempt_offset)
>  EXPORT_SYMBOL(___might_sleep);
>  #endif
> 
> +#ifdef CONFIG_DEBUG_UACCESS_SLEEP
> +void __might_resched(const char *file, int line)
> +{
> +	if (!unsafe_user_region_active())
> +		return;
> +
> +	printk(KERN_ERR
> +		"BUG: rescheduling function called from user access context at %s:%d\n",
> +			file, line);
> +	dump_stack();

Since I've been staring intensely at ___might_sleep() lately, I was thinking
we could "copy" it a bit more closely (sorry for going back on what I said
earlier).

Coming back to the double warnings (__might_resched() + schedule_debug()),
it might be better to drop the schedule_debug() warning and just have the
one in __might_resched() - if something goes wrong, there'll already be a
"BUG" in the log.

> +}
> +EXPORT_SYMBOL(__might_resched);
> +#endif
> +
>  #ifdef CONFIG_MAGIC_SYSRQ
>  void normalize_rt_tasks(void)
>  {
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index d4df5b2..d030e31 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -2069,6 +2069,14 @@ config IO_STRICT_DEVMEM
> 
>  	  If in doubt, say Y.
> 
> +config DEBUG_UACCESS_SLEEP
> +	bool "Check sleep inside a user access region"
> +	depends on DEBUG_KERNEL
> +	help
> +	  If you say Y here, various routines which may sleep will become very
> +	  noisy if they are called inside a user access region (i.e. between
> +	  a user_access_begin() and a user_access_end())

If it does get noisy, we should go for some ratelimiting - it's probably
good practice even if it is not actually noisy.

___might_sleep() has this:

  if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
  	return;
  prev_jiffy = jiffies;
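
Folding that in, the helper from this patch might then look roughly like the
following (a sketch only, combining the posted __might_resched() with the
prev_jiffy ratelimiting above):

	#ifdef CONFIG_DEBUG_UACCESS_SLEEP
	void __might_resched(const char *file, int line)
	{
		/* Ratelimit like ___might_sleep(): at most one report per second. */
		static unsigned long prev_jiffy;

		if (!unsafe_user_region_active())
			return;

		if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
			return;
		prev_jiffy = jiffies;

		printk(KERN_ERR
			"BUG: rescheduling function called from user access context at %s:%d\n",
			file, line);
		dump_stack();
	}
	EXPORT_SYMBOL(__might_resched);
	#endif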

> +
>  source "arch/$(SRCARCH)/Kconfig.debug"
> 
>  endmenu # Kernel hacking
> --
> 1.9.1
>
Julien Thierry Feb. 4, 2019, 1:27 p.m. UTC | #2
On 30/01/2019 16:58, Valentin Schneider wrote:
> On 15/01/2019 13:58, Julien Thierry wrote:
> [...]> @@ -6151,6 +6159,20 @@ void ___might_sleep(const char *file, int line, int preempt_offset)
>>  EXPORT_SYMBOL(___might_sleep);
>>  #endif
>>
>> +#ifdef CONFIG_DEBUG_UACCESS_SLEEP
>> +void __might_resched(const char *file, int line)
>> +{
>> +	if (!unsafe_user_region_active())
>> +		return;
>> +
>> +	printk(KERN_ERR
>> +		"BUG: rescheduling function called from user access context at %s:%d\n",
>> +			file, line);
>> +	dump_stack();
> 
> Since I've been staring intensely at ___might_sleep() lately, I was thinking
> we could "copy" it a bit more closely (sorry for going back on what I said
> earlier).
> 
> Coming back to the double warnings (__might_resched() + schedule_debug()),
> it might be better to drop the schedule_debug() warning and just have the
> one in __might_resched() - if something goes wrong, there'll already be a
> "BUG" in the log.
> 

My only worry with that approach is that if someone has a function that
does reschedule but does not include the __might_resched() annotation,
we'd miss the fact that something went wrong.

But that might be a lot of "if"s since that assumes that something does
go wrong.

Could I have a maintainer's opinion on this to know whether I should respin it?

>> +}
>> +EXPORT_SYMBOL(__might_resched);
>> +#endif
>> +
>>  #ifdef CONFIG_MAGIC_SYSRQ
>>  void normalize_rt_tasks(void)
>>  {
>> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
>> index d4df5b2..d030e31 100644
>> --- a/lib/Kconfig.debug
>> +++ b/lib/Kconfig.debug
>> @@ -2069,6 +2069,14 @@ config IO_STRICT_DEVMEM
>>
>>  	  If in doubt, say Y.
>>
>> +config DEBUG_UACCESS_SLEEP
>> +	bool "Check sleep inside a user access region"
>> +	depends on DEBUG_KERNEL
>> +	help
>> +	  If you say Y here, various routines which may sleep will become very
>> +	  noisy if they are called inside a user access region (i.e. between
>> +	  a user_access_begin() and a user_access_end())
> 
> If it does get noisy, we should go for some ratelimiting - it's probably
> good practice even if it is not noisy actually.
> 
> ___might_sleep() has this:
> 
>   if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
>   	return;
>   prev_jiffy = jiffies;
> 

I guess the noisiness could depend on the arch-specific handling of user
accesses. So yes, I guess it might be a good idea to add this.

Thanks,
Ingo Molnar Feb. 11, 2019, 1:45 p.m. UTC | #3
* Julien Thierry <julien.thierry@arm.com> wrote:

> While running a user_access region, it is not supported to reschedule.
> Add an overridable primitive to indicate whether a user_access region is
> active and check that this is not the case when calling rescheduling
> functions.
> 
> These checks are only performed when DEBUG_UACCESS_SLEEP is selected.
> 
> Also, add a comment clarifying the behaviour of user_access regions.
> 
> Signed-off-by: Julien Thierry <julien.thierry@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> 
> ---
>  include/linux/kernel.h  | 11 +++++++++--
>  include/linux/uaccess.h | 13 +++++++++++++
>  kernel/sched/core.c     | 22 ++++++++++++++++++++++
>  lib/Kconfig.debug       |  8 ++++++++
>  4 files changed, 52 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 8f0e68e..73f1f82 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -237,11 +237,18 @@
>  struct pt_regs;
>  struct user;
> 
> +#ifdef CONFIG_DEBUG_UACCESS_SLEEP
> +extern void __might_resched(const char *file, int line);
> +#else
> +#define __might_resched(file, line) do { } while (0)
> +#endif
> +
>  #ifdef CONFIG_PREEMPT_VOLUNTARY
>  extern int _cond_resched(void);
> -# define might_resched() _cond_resched()
> +# define might_resched() \
> +	do { __might_resched(__FILE__, __LINE__); _cond_resched(); } while (0)
>  #else
> -# define might_resched() do { } while (0)
> +# define might_resched() __might_resched(__FILE__, __LINE__)
>  #endif
> 
>  #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 37b226e..2c0c39e 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -263,6 +263,15 @@ static inline unsigned long __copy_from_user_inatomic_nocache(void *to,
>  #define probe_kernel_address(addr, retval)		\
>  	probe_kernel_read(&retval, addr, sizeof(retval))
> 
> +/*
> + * user_access_begin() and user_access_end() define a region where
> + * unsafe user accessors can be used. Exceptions and interrupts shall exit the
> + * user_access region and re-enter it when returning to the interrupted context.
> + *
> + * No sleeping function should get called during a user_access region - we rely
> + * on exception handling to take care of the user_access status for us, but that
> + * doesn't happen when directly calling schedule().
> + */
>  #ifndef user_access_begin
>  #define user_access_begin(ptr,len) access_ok(ptr, len)
>  #define user_access_end() do { } while (0)
> @@ -270,6 +279,10 @@ static inline unsigned long __copy_from_user_inatomic_nocache(void *to,
>  #define unsafe_put_user(x, ptr, err) do { if (unlikely(__put_user(x, ptr))) goto err; } while (0)
>  #endif
> 
> +#ifndef unsafe_user_region_active
> +#define unsafe_user_region_active()	false
> +#endif
> +
>  #ifdef CONFIG_HARDENED_USERCOPY
>  void usercopy_warn(const char *name, const char *detail, bool to_user,
>  		   unsigned long offset, unsigned long len);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a674c7db..b1bb7e9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3289,6 +3289,14 @@ static inline void schedule_debug(struct task_struct *prev)
>  		__schedule_bug(prev);
>  		preempt_count_set(PREEMPT_DISABLED);
>  	}
> +
> +	if (IS_ENABLED(CONFIG_DEBUG_UACCESS_SLEEP) &&
> +	    unlikely(unsafe_user_region_active())) {
> +		printk(KERN_ERR "BUG: scheduling while user_access enabled: %s/%d/0x%08x\n",
> +		       prev->comm, prev->pid, preempt_count());
> +		dump_stack();
> +	}
> +
>  	rcu_sleep_check();
> 
>  	profile_hit(SCHED_PROFILING, __builtin_return_address(0));
> @@ -6151,6 +6159,20 @@ void ___might_sleep(const char *file, int line, int preempt_offset)
>  EXPORT_SYMBOL(___might_sleep);
>  #endif
> 
> +#ifdef CONFIG_DEBUG_UACCESS_SLEEP
> +void __might_resched(const char *file, int line)
> +{
> +	if (!unsafe_user_region_active())
> +		return;

Could you please more clearly explain why you want/need an exception from 
the __might_resched() debug warning?

Thanks,

	Ingo
Peter Zijlstra Feb. 11, 2019, 1:51 p.m. UTC | #4
On Mon, Feb 11, 2019 at 02:45:27PM +0100, Ingo Molnar wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index a674c7db..b1bb7e9 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3289,6 +3289,14 @@ static inline void schedule_debug(struct task_struct *prev)
> >  		__schedule_bug(prev);
> >  		preempt_count_set(PREEMPT_DISABLED);
> >  	}
> > +
> > +	if (IS_ENABLED(CONFIG_DEBUG_UACCESS_SLEEP) &&
> > +	    unlikely(unsafe_user_region_active())) {
> > +		printk(KERN_ERR "BUG: scheduling while user_access enabled: %s/%d/0x%08x\n",
> > +		       prev->comm, prev->pid, preempt_count());
> > +		dump_stack();
> > +	}
> > +
> >  	rcu_sleep_check();
> > 
> >  	profile_hit(SCHED_PROFILING, __builtin_return_address(0));
> > @@ -6151,6 +6159,20 @@ void ___might_sleep(const char *file, int line, int preempt_offset)
> >  EXPORT_SYMBOL(___might_sleep);
> >  #endif
> > 
> > +#ifdef CONFIG_DEBUG_UACCESS_SLEEP
> > +void __might_resched(const char *file, int line)
> > +{
> > +	if (!unsafe_user_region_active())
> > +		return;
> 
> Could you please more clearly explain why you want/need an exception from 
> the __might_resched() debug warning?

In specific; how is the addition in schedule_debug() not triggering on
PREEMPT=y kernels?

If code is preemptible, you can (get) schedule(d). If it is not
preemptible; you do not need these additional tests.
Julien Thierry Feb. 12, 2019, 9:15 a.m. UTC | #5
On 11/02/2019 13:51, Peter Zijlstra wrote:
> On Mon, Feb 11, 2019 at 02:45:27PM +0100, Ingo Molnar wrote:
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index a674c7db..b1bb7e9 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -3289,6 +3289,14 @@ static inline void schedule_debug(struct task_struct *prev)
>>>  		__schedule_bug(prev);
>>>  		preempt_count_set(PREEMPT_DISABLED);
>>>  	}
>>> +
>>> +	if (IS_ENABLED(CONFIG_DEBUG_UACCESS_SLEEP) &&
>>> +	    unlikely(unsafe_user_region_active())) {
>>> +		printk(KERN_ERR "BUG: scheduling while user_access enabled: %s/%d/0x%08x\n",
>>> +		       prev->comm, prev->pid, preempt_count());
>>> +		dump_stack();
>>> +	}
>>> +
>>>  	rcu_sleep_check();
>>>
>>>  	profile_hit(SCHED_PROFILING, __builtin_return_address(0));
>>> @@ -6151,6 +6159,20 @@ void ___might_sleep(const char *file, int line, int preempt_offset)
>>>  EXPORT_SYMBOL(___might_sleep);
>>>  #endif
>>>
>>> +#ifdef CONFIG_DEBUG_UACCESS_SLEEP
>>> +void __might_resched(const char *file, int line)
>>> +{
>>> +	if (!unsafe_user_region_active())
>>> +		return;
>>
>> Could you please more clearly explain why you want/need an exception from 
>> the __might_resched() debug warning?

So, the scenarios I'm trying to avoid are of the following flavour:

	if (user_access_begin(ptr, size)) {

		[...]

		// Calling a function that might call schedule()

		[...]
		user_access_end();
	}


The thing is, as I understand it, not all functions that call schedule() are
annotated with might_resched(), and on the other hand, not every call to a
function that might_resched() will end up calling schedule().

Now with Peter's remark I think I might have been overzealous.

> 
> In specific; how is the addition in schedule_debug() not triggering on
> PREEMPT=y kernels?
> 
> If code is preemptible, you can (get) schedule(d). If it is not
> preemptible; you do not need these additional tests.
> 

Yes, that sounds right: might_resched() only potentially reschedules if
in a suitable context, so best case I issue two warnings, and worst case I
actually warn even though the caller took care to disable preemption or
interrupts before calling a might_resched().

I guess I got a bit confused with might_sleep(), which is "if you call
this in the wrong context, I warn", whereas might_resched() is just "if
you call this in a preemptible context, let's resched".
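
For reference, the pre-existing definitions (roughly, as in
include/linux/kernel.h at this series' base) do make that distinction
visible: might_sleep() warns about the calling context, while
might_resched() only expands to a conditional reschedule:

	/* With CONFIG_PREEMPT_VOLUNTARY: */
	# define might_resched() _cond_resched()

	/* With CONFIG_DEBUG_ATOMIC_SLEEP: */
	# define might_sleep() \
		do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0)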

I guess I'll drop the might_resched() part of this patch if that sounds
alright.

Thanks,
Ingo Molnar Feb. 13, 2019, 8:21 a.m. UTC | #6
* Julien Thierry <julien.thierry@arm.com> wrote:

> I guess I'll drop the might_resched() part of this patch if that sounds
> alright.

That sounds perfect - that bit was pretty much the only problem I had 
with the series.

Thanks,

	Ingo
Peter Zijlstra Feb. 13, 2019, 10:35 a.m. UTC | #7
On Tue, Feb 12, 2019 at 09:15:13AM +0000, Julien Thierry wrote:

> >>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >>> index a674c7db..b1bb7e9 100644
> >>> --- a/kernel/sched/core.c
> >>> +++ b/kernel/sched/core.c
> >>> @@ -3289,6 +3289,14 @@ static inline void schedule_debug(struct task_struct *prev)
> >>>  		__schedule_bug(prev);
> >>>  		preempt_count_set(PREEMPT_DISABLED);
> >>>  	}
> >>> +
> >>> +	if (IS_ENABLED(CONFIG_DEBUG_UACCESS_SLEEP) &&
> >>> +	    unlikely(unsafe_user_region_active())) {
> >>> +		printk(KERN_ERR "BUG: scheduling while user_access enabled: %s/%d/0x%08x\n",
> >>> +		       prev->comm, prev->pid, preempt_count());
> >>> +		dump_stack();
> >>> +	}
> >>> +
> >>>  	rcu_sleep_check();
> >>>
> >>>  	profile_hit(SCHED_PROFILING, __builtin_return_address(0));

> I guess I'll drop the might_resched() part of this patch if that sounds
> alright.

I'm still confused by the schedule_debug() part. How is that not broken?
Julien Thierry Feb. 13, 2019, 10:50 a.m. UTC | #8
On 13/02/2019 10:35, Peter Zijlstra wrote:
> On Tue, Feb 12, 2019 at 09:15:13AM +0000, Julien Thierry wrote:
> 
>>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>>> index a674c7db..b1bb7e9 100644
>>>>> --- a/kernel/sched/core.c
>>>>> +++ b/kernel/sched/core.c
>>>>> @@ -3289,6 +3289,14 @@ static inline void schedule_debug(struct task_struct *prev)
>>>>>  		__schedule_bug(prev);
>>>>>  		preempt_count_set(PREEMPT_DISABLED);
>>>>>  	}
>>>>> +
>>>>> +	if (IS_ENABLED(CONFIG_DEBUG_UACCESS_SLEEP) &&
>>>>> +	    unlikely(unsafe_user_region_active())) {
>>>>> +		printk(KERN_ERR "BUG: scheduling while user_access enabled: %s/%d/0x%08x\n",
>>>>> +		       prev->comm, prev->pid, preempt_count());
>>>>> +		dump_stack();
>>>>> +	}
>>>>> +
>>>>>  	rcu_sleep_check();
>>>>>
>>>>>  	profile_hit(SCHED_PROFILING, __builtin_return_address(0));
> 
>> I guess I'll drop the might_resched() part of this patch if that sounds
>> alright.
> 
> I'm still confused by the schedule_debug() part. How is that not broken?

Hmmm, I am not exactly sure which part you expect to be broken; I guess
it's because of the nature of the unsafe uaccess accessor usage.

Basically, the following is a definite no:
	if (user_access_begin(ptr, size)) {

		[...]

		//something that calls schedule

		[...]

		user_access_end();
	}
	

However the following is fine:

- user_access_begin(ptr, size)
- taking irq/exception
- get preempted
- get resumed at some point in time
- restore state + eret
- user_access_end()

That's because exceptions/irqs implicitly "suspend" the user access
region (that's what I'm trying to clarify with the comment). So,
unsafe_user_region_active() should return false in an irq/exception
context.
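
As an illustration only (not something from this series): on an x86/SMAP-style
implementation, where the user access state lives in EFLAGS.AC and the
exception entry code clears AC on entry, the arch override could be a simple
flags check along these lines, which naturally reads as "inactive" from
irq/exception context:

	/* Hypothetical sketch: a user_access region is active while EFLAGS.AC is set. */
	static inline bool unsafe_user_region_active(void)
	{
		return !!(native_save_fl() & X86_EFLAGS_AC);
	}
	#define unsafe_user_region_active unsafe_user_region_active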

Is this what you were concerned about? Or is there still something that
might be broken?

Thanks,
Peter Zijlstra Feb. 13, 2019, 1:17 p.m. UTC | #9
On Wed, Feb 13, 2019 at 10:50:21AM +0000, Julien Thierry wrote:
> On 13/02/2019 10:35, Peter Zijlstra wrote:
> > On Tue, Feb 12, 2019 at 09:15:13AM +0000, Julien Thierry wrote:
> > 
> >>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >>>>> index a674c7db..b1bb7e9 100644
> >>>>> --- a/kernel/sched/core.c
> >>>>> +++ b/kernel/sched/core.c
> >>>>> @@ -3289,6 +3289,14 @@ static inline void schedule_debug(struct task_struct *prev)
> >>>>>  		__schedule_bug(prev);
> >>>>>  		preempt_count_set(PREEMPT_DISABLED);
> >>>>>  	}
> >>>>> +
> >>>>> +	if (IS_ENABLED(CONFIG_DEBUG_UACCESS_SLEEP) &&
> >>>>> +	    unlikely(unsafe_user_region_active())) {
> >>>>> +		printk(KERN_ERR "BUG: scheduling while user_access enabled: %s/%d/0x%08x\n",
> >>>>> +		       prev->comm, prev->pid, preempt_count());
> >>>>> +		dump_stack();
> >>>>> +	}
> >>>>> +
> >>>>>  	rcu_sleep_check();
> >>>>>
> >>>>>  	profile_hit(SCHED_PROFILING, __builtin_return_address(0));
> > 
> >> I guess I'll drop the might_resched() part of this patch if that sounds
> >> alright.
> > 
> > I'm still confused by the schedule_debug() part. How is that not broken?
> 
> Hmmm, I am not exactly sure which part you expect to be broken, I guess
> it's because of the nature of the uaccess unsafe accessor usage.
> 
> Basically, the following is a definite no:
> 	if (user_access_begin(ptr, size)) {
> 
> 		[...]
> 
> 		//something that calls schedule
> 
> 		[...]
> 
> 		user_access_end();
> 	}
> 	
> 
> However the following is fine:
> 
> - user_access_begin(ptr, size)
> - taking irq/exception
> - get preempted

This; how is getting preempted fundamentally different from scheduling
ourselves?

> - get resumed at some point in time
> - restore state + eret
> - user_access_end()
> 
> That's because exceptions/irq implicitly "suspend" the user access
> region. (That's what I'm trying to clarify with the comment)
> So, unsafe_user_region_active() should return false in a irq/exception
> context.
> 
> Is this what you were concerned about? Or there still something that
> might be broken?

I really hate the asymmetry introduced between preemptible and being able
to schedule. Both end up calling __schedule() and there really should
not be a difference.
Peter Zijlstra Feb. 13, 2019, 1:20 p.m. UTC | #10
On Wed, Feb 13, 2019 at 02:17:20PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 10:50:21AM +0000, Julien Thierry wrote:
> > On 13/02/2019 10:35, Peter Zijlstra wrote:
> > > On Tue, Feb 12, 2019 at 09:15:13AM +0000, Julien Thierry wrote:
> > > 
> > >>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > >>>>> index a674c7db..b1bb7e9 100644
> > >>>>> --- a/kernel/sched/core.c
> > >>>>> +++ b/kernel/sched/core.c
> > >>>>> @@ -3289,6 +3289,14 @@ static inline void schedule_debug(struct task_struct *prev)
> > >>>>>  		__schedule_bug(prev);
> > >>>>>  		preempt_count_set(PREEMPT_DISABLED);
> > >>>>>  	}
> > >>>>> +
> > >>>>> +	if (IS_ENABLED(CONFIG_DEBUG_UACCESS_SLEEP) &&
> > >>>>> +	    unlikely(unsafe_user_region_active())) {
> > >>>>> +		printk(KERN_ERR "BUG: scheduling while user_access enabled: %s/%d/0x%08x\n",
> > >>>>> +		       prev->comm, prev->pid, preempt_count());
> > >>>>> +		dump_stack();
> > >>>>> +	}
> > >>>>> +
> > >>>>>  	rcu_sleep_check();
> > >>>>>
> > >>>>>  	profile_hit(SCHED_PROFILING, __builtin_return_address(0));
> > > 
> > >> I guess I'll drop the might_resched() part of this patch if that sounds
> > >> alright.
> > > 
> > > I'm still confused by the schedule_debug() part. How is that not broken?
> > 
> > Hmmm, I am not exactly sure which part you expect to be broken, I guess
> > it's because of the nature of the uaccess unsafe accessor usage.
> > 
> > Basically, the following is a definite no:
> > 	if (user_access_begin(ptr, size)) {
> > 
> > 		[...]
> > 
> > 		//something that calls schedule
> > 
> > 		[...]
> > 
> > 		user_access_end();
> > 	}
> > 	
> > 
> > However the following is fine:
> > 
> > - user_access_begin(ptr, size)
> > - taking irq/exception
> > - get preempted
> 
> This; how is getting preempted fundamentally different from scheduling
> ourselves?

This is also the thing that PREEMPT_VOLUNTARY hinges on; it inserts
'random' reschedule points through might_sleep() and cond_resched().

If you're preemptible; you must be able to schedule and vice-versa.

You're breaking that.

> > - get resumed at some point in time
> > - restore state + eret
> > - user_access_end()
> > 
> > That's because exceptions/irq implicitly "suspend" the user access
> > region. (That's what I'm trying to clarify with the comment)
> > So, unsafe_user_region_active() should return false in a irq/exception
> > context.
> > 
> > Is this what you were concerned about? Or there still something that
> > might be broken?
> 
> I really hate the asymetry introduced between preemptible and being able
> to schedule. Both end up calling __schedule() and there really should
> not be a difference.
Will Deacon Feb. 13, 2019, 2 p.m. UTC | #11
Hi Peter,

On Wed, Feb 13, 2019 at 02:17:20PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 10:50:21AM +0000, Julien Thierry wrote:
> > On 13/02/2019 10:35, Peter Zijlstra wrote:
> > > On Tue, Feb 12, 2019 at 09:15:13AM +0000, Julien Thierry wrote:
> > > 
> > >>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > >>>>> index a674c7db..b1bb7e9 100644
> > >>>>> --- a/kernel/sched/core.c
> > >>>>> +++ b/kernel/sched/core.c
> > >>>>> @@ -3289,6 +3289,14 @@ static inline void schedule_debug(struct task_struct *prev)
> > >>>>>  		__schedule_bug(prev);
> > >>>>>  		preempt_count_set(PREEMPT_DISABLED);
> > >>>>>  	}
> > >>>>> +
> > >>>>> +	if (IS_ENABLED(CONFIG_DEBUG_UACCESS_SLEEP) &&
> > >>>>> +	    unlikely(unsafe_user_region_active())) {
> > >>>>> +		printk(KERN_ERR "BUG: scheduling while user_access enabled: %s/%d/0x%08x\n",
> > >>>>> +		       prev->comm, prev->pid, preempt_count());
> > >>>>> +		dump_stack();
> > >>>>> +	}
> > >>>>> +
> > >>>>>  	rcu_sleep_check();
> > >>>>>
> > >>>>>  	profile_hit(SCHED_PROFILING, __builtin_return_address(0));
> > > 
> > >> I guess I'll drop the might_resched() part of this patch if that sounds
> > >> alright.
> > > 
> > > I'm still confused by the schedule_debug() part. How is that not broken?
> > 
> > Hmmm, I am not exactly sure which part you expect to be broken, I guess
> > it's because of the nature of the uaccess unsafe accessor usage.
> > 
> > Basically, the following is a definite no:
> > 	if (user_access_begin(ptr, size)) {
> > 
> > 		[...]
> > 
> > 		//something that calls schedule
> > 
> > 		[...]
> > 
> > 		user_access_end();
> > 	}
> > 	
> > 
> > However the following is fine:
> > 
> > - user_access_begin(ptr, size)
> > - taking irq/exception
> > - get preempted
> 
> This; how is getting preempted fundamentally different from scheduling
> ourselves?

The difference is because getting preempted in the sequence above is
triggered off the back of an interrupt. On arm64, and I think also on x86,
the user access state (SMAP or PAN) is saved and restored across exceptions
but not across context switch. Consequently, taking an irq in a
user_access_{begin,end} section and then scheduling is fine, but calling
schedule directly within such a section is not.

Julien -- please yell if I've missed some crucial detail, but I think that's
the gist of what we're trying to describe here.

Will
Julien Thierry Feb. 13, 2019, 2:07 p.m. UTC | #12
On 13/02/2019 14:00, Will Deacon wrote:
> Hi Peter,
> 
> On Wed, Feb 13, 2019 at 02:17:20PM +0100, Peter Zijlstra wrote:
>> On Wed, Feb 13, 2019 at 10:50:21AM +0000, Julien Thierry wrote:
>>> On 13/02/2019 10:35, Peter Zijlstra wrote:
>>>> On Tue, Feb 12, 2019 at 09:15:13AM +0000, Julien Thierry wrote:
>>>>
>>>>>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>>>>>> index a674c7db..b1bb7e9 100644
>>>>>>>> --- a/kernel/sched/core.c
>>>>>>>> +++ b/kernel/sched/core.c
>>>>>>>> @@ -3289,6 +3289,14 @@ static inline void schedule_debug(struct task_struct *prev)
>>>>>>>>  		__schedule_bug(prev);
>>>>>>>>  		preempt_count_set(PREEMPT_DISABLED);
>>>>>>>>  	}
>>>>>>>> +
>>>>>>>> +	if (IS_ENABLED(CONFIG_DEBUG_UACCESS_SLEEP) &&
>>>>>>>> +	    unlikely(unsafe_user_region_active())) {
>>>>>>>> +		printk(KERN_ERR "BUG: scheduling while user_access enabled: %s/%d/0x%08x\n",
>>>>>>>> +		       prev->comm, prev->pid, preempt_count());
>>>>>>>> +		dump_stack();
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>>  	rcu_sleep_check();
>>>>>>>>
>>>>>>>>  	profile_hit(SCHED_PROFILING, __builtin_return_address(0));
>>>>
>>>>> I guess I'll drop the might_resched() part of this patch if that sounds
>>>>> alright.
>>>>
>>>> I'm still confused by the schedule_debug() part. How is that not broken?
>>>
>>> Hmmm, I am not exactly sure which part you expect to be broken, I guess
>>> it's because of the nature of the uaccess unsafe accessor usage.
>>>
>>> Basically, the following is a definite no:
>>> 	if (user_access_begin(ptr, size)) {
>>>
>>> 		[...]
>>>
>>> 		//something that calls schedule
>>>
>>> 		[...]
>>>
>>> 		user_access_end();
>>> 	}
>>> 	
>>>
>>> However the following is fine:
>>>
>>> - user_access_begin(ptr, size)
>>> - taking irq/exception
>>> - get preempted
>>
>> This; how is getting preempted fundamentally different from scheduling
>> ourselves?
> 
> The difference is because getting preempted in the sequence above is
> triggered off the back of an interrupt. On arm64, and I think also on x86,
> the user access state (SMAP or PAN) is saved and restored across exceptions
> but not across context switch. Consequently, taking an irq in a
> user_access_{begin,end} section and then scheduling is fine, but calling
> schedule directly within such a section is not.
> 
> Julien -- please yell if I've missed some crucial detail, but I think that's
> the gist of what we're trying to describe here.
> 

Yes, this summarizes things correctly. Thanks!

I might also stress that this limitation already exists for x86
(and is in the arm64 patches picked by Catalin for 5.1), as was
discussed here:

https://lkml.org/lkml/2018/11/23/430

So this patch is not introducing new semantics, it is only making
existing ones explicit.

If the current state is not good, we need to re-discuss the semantics of
user_access regions.

Thanks,
Peter Zijlstra Feb. 13, 2019, 2:17 p.m. UTC | #13
On Wed, Feb 13, 2019 at 02:00:26PM +0000, Will Deacon wrote:
> > This; how is getting preempted fundamentally different from scheduling
> > ourselves?
> 
> The difference is because getting preempted in the sequence above is
> triggered off the back of an interrupt. On arm64, and I think also on x86,
> the user access state (SMAP or PAN) is saved and restored across exceptions
> but not across context switch. Consequently, taking an irq in a
> user_access_{begin,end} section and then scheduling is fine, but calling
> schedule directly within such a section is not.

So how's this then:

	if (user_access_begin()) {

		preempt_disable();

		<IRQ>
			set_need_resched();
		</IRQ no preempt>

		preempt_enable()
		  __schedule();

		user_access_end();
	}

That _should_ work just fine but explodes with the proposed nonsense.
Julien Thierry Feb. 13, 2019, 2:24 p.m. UTC | #14
On 13/02/2019 14:17, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 02:00:26PM +0000, Will Deacon wrote:
>>> This; how is getting preempted fundamentally different from scheduling
>>> ourselves?
>>
>> The difference is because getting preempted in the sequence above is
>> triggered off the back of an interrupt. On arm64, and I think also on x86,
>> the user access state (SMAP or PAN) is saved and restored across exceptions
>> but not across context switch. Consequently, taking an irq in a
>> user_access_{begin,end} section and then scheduling is fine, but calling
>> schedule directly within such a section is not.
> 
> So how's this then:
> 
> 	if (user_access_begin()) {
> 
> 		preempt_disable();
> 
> 		<IRQ>
> 			set_need_resched();
> 		</IRQ no preempt>
> 
> 		preempt_enable()
> 		  __schedule();
> 
> 		user_access_end();
> 	}
> 
> That _should_ work just fine but explodes with the proposed nonsense.

AFAICT, This does not work properly because when you schedule out this
task, you won't be saving the EFLAGS.AF/PSTATE.PAN bit on the stack, and
next time you schedule the task back in, it might no longer have the
correct flag value (so an unsafe_get/put_user() will fail even though
you haven't reached user_access_end()).

One solution is to deal with this in task switching code, but so far
I've been told that calling schedule() in such a context is not expected
to be supported.

Cheers,
Peter Zijlstra Feb. 13, 2019, 2:25 p.m. UTC | #15
On Wed, Feb 13, 2019 at 02:00:26PM +0000, Will Deacon wrote:
> The difference is because getting preempted in the sequence above is
> triggered off the back of an interrupt. On arm64, and I think also on x86,
> the user access state (SMAP or PAN) is saved and restored across exceptions
> but not across context switch.

A quick reading of the SDM seems to suggest the SMAP state is part of
EFLAGS, which is context switched just fine AFAIK.

SMAP {ab,re}uses the EFLAGS.AC bit.

> Consequently, taking an irq in a
> user_access_{begin,end} section and then scheduling is fine, but calling
> schedule directly within such a section is not.
Julien Thierry Feb. 13, 2019, 2:39 p.m. UTC | #16
Hi Peter,

On 13/02/2019 14:25, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 02:00:26PM +0000, Will Deacon wrote:
>> The difference is because getting preempted in the sequence above is
>> triggered off the back of an interrupt. On arm64, and I think also on x86,
>> the user access state (SMAP or PAN) is saved and restored across exceptions
>> but not across context switch.
> 
> A quick reading of the SDM seems to suggest the SMAP state is part of
> EFLAGS, which is context switched just fine AFAIK.
> 
I fail to see where this is happening when looking at the switch_to()
logic in x86_64.

And Peter A. didn't seem to suggest that this transfer of the eflags was
happening without them being saved on the stack through exception handling.

What am I missing?

Thanks,
Peter Zijlstra Feb. 13, 2019, 2:40 p.m. UTC | #17
On Wed, Feb 13, 2019 at 02:24:34PM +0000, Julien Thierry wrote:
> On 13/02/2019 14:17, Peter Zijlstra wrote:
> > On Wed, Feb 13, 2019 at 02:00:26PM +0000, Will Deacon wrote:
> >>> This; how is getting preempted fundamentally different from scheduling
> >>> ourselves?
> >>
> >> The difference is because getting preempted in the sequence above is
> >> triggered off the back of an interrupt. On arm64, and I think also on x86,
> >> the user access state (SMAP or PAN) is saved and restored across exceptions
> >> but not across context switch. Consequently, taking an irq in a
> >> user_access_{begin,end} section and then scheduling is fine, but calling
> >> schedule directly within such a section is not.
> > 
> > So how's this then:
> > 
> > 	if (user_access_begin()) {
> > 
> > 		preempt_disable();
> > 
> > 		<IRQ>
> > 			set_need_resched();
> > 		</IRQ no preempt>
> > 
> > 		preempt_enable()
> > 		  __schedule();
> > 
> > 		user_access_end();
> > 	}
> > 
> > That _should_ work just fine but explodes with the proposed nonsense.
> 
> AFAICT, This does not work properly because when you schedule out this
> task, you won't be saving the EFLAGS.AF/PSTATE.PAN bit on the stack, and

EFLAGS.AC, but yes.

> next time you schedule the task back in, it might no longer have the
> correct flag value (so an unsafe_get/put_user() will fail even though
> you haven't reached user_access_end()).

/me looks at __switch_to_asm() and there is indeed a distinct lack of
pushing and popping EFLAGS :/

> One solution is to deal with this in task switching code, but so far
> I've been told that calling schedule() in such a context is not expected
> to be supported.

Well, per the above it breaks the preemption model. And I hates that.

And the added WARN doesn't even begin to cover it, since you'd have to
actually hit the preempt_enable() reschedule for it to trigger.

So far, all 6 in-tree users are indeed free of dodgy code, but *groan*.
Peter Zijlstra Feb. 13, 2019, 2:41 p.m. UTC | #18
On Wed, Feb 13, 2019 at 02:39:22PM +0000, Julien Thierry wrote:
> Hi Peter,
> 
> On 13/02/2019 14:25, Peter Zijlstra wrote:
> > On Wed, Feb 13, 2019 at 02:00:26PM +0000, Will Deacon wrote:
> >> The difference is because getting preempted in the sequence above is
> >> triggered off the back of an interrupt. On arm64, and I think also on x86,
> >> the user access state (SMAP or PAN) is saved and restored across exceptions
> >> but not across context switch.
> > 
> > A quick reading of the SDM seems to suggest the SMAP state is part of
> > EFLAGS, which is context switched just fine AFAIK.
> > 
> I fail to see where this is happening when looking at the switch_to()
> logic in x86_64.

Yeah, me too.. we obviously preserve EFLAGS for user context, but for
kernel-kernel switches we do not seem to preserve it :-(
Peter Zijlstra Feb. 13, 2019, 3:08 p.m. UTC | #19
On Wed, Feb 13, 2019 at 03:40:00PM +0100, Peter Zijlstra wrote:

> So far, all 6 in-tree users are indeed free of dodgy code, but *groan*.

because of this, there must also not be tracepoints (even implicit ones
like function-trace) between user_access_{begin,end}().

And while that is unlikely with the current code; it is not guaranteed
afaict.

What a ff'ing mess.
Peter Zijlstra Feb. 13, 2019, 3:45 p.m. UTC | #20
On Wed, Feb 13, 2019 at 03:41:45PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 02:39:22PM +0000, Julien Thierry wrote:
> > Hi Peter,
> > 
> > On 13/02/2019 14:25, Peter Zijlstra wrote:
> > > On Wed, Feb 13, 2019 at 02:00:26PM +0000, Will Deacon wrote:
> > >> The difference is because getting preempted in the sequence above is
> > >> triggered off the back of an interrupt. On arm64, and I think also on x86,
> > >> the user access state (SMAP or PAN) is saved and restored across exceptions
> > >> but not across context switch.
> > > 
> > > A quick reading of the SDM seems to suggest the SMAP state is part of
> > > EFLAGS, which is context switched just fine AFAIK.
> > > 
> > I fail to see where this is happening when looking at the switch_to()
> > logic in x86_64.
> 
> Yeah, me too.. we obviously preserve EFLAGS for user context, but for
> kernel-kernel switches we do not seem to preserve it :-(

So I dug around the context switch code a little, and I think we lost it
here:

  0100301bfdf5 ("sched/x86: Rewrite the switch_to() code")

Before that, x86_64 switch_to() read like (much simplified):

	asm volatile ( /* do RSP twiddle */
	  : /* output */
	  : /* input */
	  : "memory", "cc", .... "flags");

(see __EXTRA_CLOBBER)

Which I suppose means that GCC generates the PUSHF/POPF to preserve the
EFLAGS, since we mark those explicitly clobbered.

Before that:

  f05e798ad4c0 ("Disintegrate asm/system.h for X86")

We had explicit PUSHF / POPF in SAVE_CONTEXT / RESTORE_CONTEXT resp.

Now I cannot see how the current code preserves EFLAGS (if indeed it
does), and the changelog doesn't mention this change _AT_ALL_.


For a little bit of context; it turns out that user_access_begin() /
user_access_end() sets EFLAGS.AC and scheduling in between there wrecks
that because we're apparently not saving that anymore.

Now, I'm tempted to add the PUSHF / POPF right back because of this, but
first I suppose we need to figure out if that change was on purpose and
why that went missing from the Changelog.
Peter Zijlstra Feb. 13, 2019, 6:54 p.m. UTC | #21
On Wed, Feb 13, 2019 at 04:45:32PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 03:41:45PM +0100, Peter Zijlstra wrote:
> > On Wed, Feb 13, 2019 at 02:39:22PM +0000, Julien Thierry wrote:
> > > Hi Peter,
> > > 
> > > On 13/02/2019 14:25, Peter Zijlstra wrote:
> > > > On Wed, Feb 13, 2019 at 02:00:26PM +0000, Will Deacon wrote:
> > > >> The difference is because getting preempted in the sequence above is
> > > >> triggered off the back of an interrupt. On arm64, and I think also on x86,
> > > >> the user access state (SMAP or PAN) is saved and restored across exceptions
> > > >> but not across context switch.
> > > > 
> > > > A quick reading of the SDM seems to suggest the SMAP state is part of
> > > > EFLAGS, which is context switched just fine AFAIK.
> > > > 
> > > I fail to see where this is happening when looking at the switch_to()
> > > logic in x86_64.
> > 
> > Yeah, me too.. we obviously preserve EFLAGS for user context, but for
> > kernel-kernel switches we do not seem to preserve it :-(
> 
> So I dug around the context switch code a little, and I think we lost it
> here:
> 
>   0100301bfdf5 ("sched/x86: Rewrite the switch_to() code")
> 
> Before that, x86_64 switch_to() read like (much simplified):
> 
> 	asm volatile ( /* do RSP twiddle */
> 	  : /* output */
> 	  : /* input */
> 	  : "memory", "cc", .... "flags");
> 
> (see __EXTRA_CLOBBER)
> 
> Which I suppose means that GCC generates the PUSHF/POPF to preserve the
> EFLAGS, since we mark those explicitly clobbered.
> 
> Before that:
> 
>   f05e798ad4c0 ("Disintegrate asm/system.h for X86")
> 
> We had explicit PUSHF / POPF in SAVE_CONTEXT / RESTORE_CONTEXT resp.
> 
> Now I cannot see how the current code preserves EFLAGS (if indeed it
> does), and the changelog doesn't mention this change _AT_ALL_.
> 
> 
> For a little bit of context; it turns out that user_access_begin() /
> user_access_end() sets EFLAGS.AC and scheduling in between there wrecks
> that because we're apparently not saving that anymore.
> 
> Now, I'm tempted to add the PUSHF / POPF right back because of this, but
> first I suppose we need to figure out if that change was on purpose and
> why that went missing from the Changelog.

Just for giggles; the below patch seems to boot.

---
 arch/x86/entry/entry_32.S        | 2 ++
 arch/x86/entry/entry_64.S        | 2 ++
 arch/x86/include/asm/switch_to.h | 1 +
 3 files changed, 5 insertions(+)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d309f30cf7af..5fc76b755510 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -650,6 +650,7 @@ ENTRY(__switch_to_asm)
 	pushl	%ebx
 	pushl	%edi
 	pushl	%esi
+	pushfl
 
 	/* switch stack */
 	movl	%esp, TASK_threadsp(%eax)
@@ -672,6 +673,7 @@ ENTRY(__switch_to_asm)
 #endif
 
 	/* restore callee-saved registers */
+	popfl
 	popl	%esi
 	popl	%edi
 	popl	%ebx
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..4fe27b67d7e2 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -291,6 +291,7 @@ ENTRY(__switch_to_asm)
 	pushq	%r13
 	pushq	%r14
 	pushq	%r15
+	pushfq
 
 	/* switch stack */
 	movq	%rsp, TASK_threadsp(%rdi)
@@ -313,6 +314,7 @@ ENTRY(__switch_to_asm)
 #endif
 
 	/* restore callee-saved registers */
+	popfq
 	popq	%r15
 	popq	%r14
 	popq	%r13
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 7cf1a270d891..157149d4129c 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -40,6 +40,7 @@ asmlinkage void ret_from_fork(void);
  * order of the fields must match the code in __switch_to_asm().
  */
 struct inactive_task_frame {
+	unsigned long flags;
 #ifdef CONFIG_X86_64
 	unsigned long r15;
 	unsigned long r14;
Peter Zijlstra Feb. 13, 2019, 10:21 p.m. UTC | #22
On Wed, Feb 13, 2019 at 10:51:24AM -0800, Andy Lutomirski wrote:
> > On Feb 13, 2019, at 7:45 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> > Which I suppose means that GCC generates the PUSHF/POPF to preserve the
> > EFLAGS, since we mark those explicitly clobbered.
> > 
> 
> Not quite.  A flags clobber doesn’t save the control bits like AC
> except on certain rather buggy llvm compilers. The change you’re
> looking for is:
> 
> http://git.kernel.org/tip/2c7577a7583747c9b71f26dced7f696b739da745

Indeed, failed to find that.

> > For a little bit of context; it turns out that user_access_begin() /
> > user_access_end() sets EFLAGS.AC and scheduling in between there wrecks
> > that because we're apparently not saving that anymore.
> 
> But only explicit scheduling — preemption and sleepy page faults are
> fine because the interrupt frame saves flags.

No, like pointed out elsewhere in this thread, anything that does
preempt_disable() is utterly broken with this.

Because at that point the IRQ return path doesn't reschedule but
preempt_enable() will, and that doesn't preserve EFLAGS again.

> > Now, I'm tempted to add the PUSHF / POPF right back because of this, but
> > first I suppose we need to figure out if that change was on purpose and
> > why that went missing from the Changelog.
> 
> That’s certainly the easy solution. Or we could teach the might_sleep
> checks about this, but that could be a mess.

That's not enough, we'd have to teach preempt_disable(), but worse,
preempt_disable_notrace().

Anything that lands in ftrace, which _will_ use
preempt_disable_notrace(), will screw this thing up.
Andy Lutomirski Feb. 13, 2019, 10:49 p.m. UTC | #23
> On Feb 13, 2019, at 2:21 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Feb 13, 2019 at 10:51:24AM -0800, Andy Lutomirski wrote:
>>> On Feb 13, 2019, at 7:45 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>>> Which I suppose means that GCC generates the PUSHF/POPF to preserve the
>>> EFLAGS, since we mark those explicitly clobbered.
>>> 
>> 
>> Not quite.  A flags clobber doesn’t save the control bits like AC
>> except on certain rather buggy llvm compilers. The change you’re
>> looking for is:
>> 
>> http://git.kernel.org/tip/2c7577a7583747c9b71f26dced7f696b739da745
> 
> Indeed, failed to find that.
> 
>>> For a little bit of context; it turns out that user_access_begin() /
>>> user_access_end() sets EFLAGS.AC and scheduling in between there wrecks
>>> that because we're apparently not saving that anymore.
>> 
>> But only explicit scheduling — preemption and sleepy page faults are
>> fine because the interrupt frame saves flags.
> 
> No, like pointed out elsewhere in this thread, anything that does
> preempt_disable() is utterly broken with this.
> 
> Because at that point the IRQ return path doesn't reschedule but
> preempt_enable() will, and that doesn't preserve EFLAGS again.
> 
>>> Now, I'm tempted to add the PUSHF / POPF right back because of this, but
>>> first I suppose we need to figure out if that change was on purpose and
>>> why that went missing from the Changelog.
>> 
>> That’s certainly the easy solution. Or we could teach the might_sleep
>> checks about this, but that could be a mess.
> 
> That's not enough, we'd have to teach preempt_disable(), but worse,
> preempt_disable_notrace().
> 
> Anything that lands in ftrace, which _will_ use
> preempt_disable_notrace(), will screw this thing up.

Ugh.  Consider your patch acked.  Do we need to backport this thing?  The problem can’t be too widespread or we would have heard of it before.
Linus Torvalds Feb. 13, 2019, 11:19 p.m. UTC | #24
On Wed, Feb 13, 2019 at 7:45 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Before that, x86_64 switch_to() read like (much simplified):
>
>         asm volatile ( /* do RSP twiddle */
>           : /* output */
>           : /* input */
>           : "memory", "cc", .... "flags");
>
> (see __EXTRA_CLOBBER)
>
> Which I suppose means that GCC generates the PUSHF/POPF to preserve the
> EFLAGS, since we mark those explicitly clobbered.

No, it only means that gcc won't keep conditionals in the flags over
the asm. It doesn't make gcc save anything.

The push/pop got removed elsewhere as Andy says.

That said, I do agree that it's probably a good idea to save/restore
flags anyway when scheduling. It's not just AC, actually, now that I
look at it again I worry a bit about DF too.

We have the rule that we run with DF clear in the kernel, and all the
kernel entry points do clear it properly (so that memcpy etc don't
need to). But there are a few places that set DF temporarily because
they do something odd (backwards memmove), and those actually have the
*exact* same issue as stac/clac has: it's ok to take a trap or
interrupt, and schedule due to that (because the trap/irq will clear
DF), but it would be a horrible bug to have a synchronous scheduling
point there.

Arguably the DF issue really isn't even remotely likely to actually be
a real issue (the code that sets DF really is very special and should
never do any kind of preemption), but it's conceptually quite
similar..

Of course, if DF is ever set, and we end up calling any C code at all,
I guess it would already be a huge problem. If the C code then does
memcpy or something, it would corrupt things quite badly.

So I guess save/restore isn't going to save us wrt DF. If we get
anywhere close to the scheduler with the DF bit set, we've already
lost.

But I still do kind of prefer saving flags. We have other sticky state
in there too, although none of it matters in the kernel currently (eg
iopl etc - only matters in user space, and user space will always
reload eflags on return).

                 Linus

Patch

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 8f0e68e..73f1f82 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -237,11 +237,18 @@ 
 struct pt_regs;
 struct user;

+#ifdef CONFIG_DEBUG_UACCESS_SLEEP
+extern void __might_resched(const char *file, int line);
+#else
+#define __might_resched(file, line) do { } while (0)
+#endif
+
 #ifdef CONFIG_PREEMPT_VOLUNTARY
 extern int _cond_resched(void);
-# define might_resched() _cond_resched()
+# define might_resched() \
+	do { __might_resched(__FILE__, __LINE__); _cond_resched(); } while (0)
 #else
-# define might_resched() do { } while (0)
+# define might_resched() __might_resched(__FILE__, __LINE__)
 #endif

 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 37b226e..2c0c39e 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -263,6 +263,15 @@  static inline unsigned long __copy_from_user_inatomic_nocache(void *to,
 #define probe_kernel_address(addr, retval)		\
 	probe_kernel_read(&retval, addr, sizeof(retval))

+/*
+ * user_access_begin() and user_access_end() define a region where
+ * unsafe user accessors can be used. Exceptions and interrupts shall exit the
+ * user_access region and re-enter it when returning to the interrupted context.
+ *
+ * No sleeping function should get called during a user_access region - we rely
+ * on exception handling to take care of the user_access status for us, but that
+ * doesn't happen when directly calling schedule().
+ */
 #ifndef user_access_begin
 #define user_access_begin(ptr,len) access_ok(ptr, len)
 #define user_access_end() do { } while (0)
@@ -270,6 +279,10 @@  static inline unsigned long __copy_from_user_inatomic_nocache(void *to,
 #define unsafe_put_user(x, ptr, err) do { if (unlikely(__put_user(x, ptr))) goto err; } while (0)
 #endif

+#ifndef unsafe_user_region_active
+#define unsafe_user_region_active()	false
+#endif
+
 #ifdef CONFIG_HARDENED_USERCOPY
 void usercopy_warn(const char *name, const char *detail, bool to_user,
 		   unsigned long offset, unsigned long len);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a674c7db..b1bb7e9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3289,6 +3289,14 @@  static inline void schedule_debug(struct task_struct *prev)
 		__schedule_bug(prev);
 		preempt_count_set(PREEMPT_DISABLED);
 	}
+
+	if (IS_ENABLED(CONFIG_DEBUG_UACCESS_SLEEP) &&
+	    unlikely(unsafe_user_region_active())) {
+		printk(KERN_ERR "BUG: scheduling while user_access enabled: %s/%d/0x%08x\n",
+		       prev->comm, prev->pid, preempt_count());
+		dump_stack();
+	}
+
 	rcu_sleep_check();

 	profile_hit(SCHED_PROFILING, __builtin_return_address(0));
@@ -6151,6 +6159,20 @@  void ___might_sleep(const char *file, int line, int preempt_offset)
 EXPORT_SYMBOL(___might_sleep);
 #endif

+#ifdef CONFIG_DEBUG_UACCESS_SLEEP
+void __might_resched(const char *file, int line)
+{
+	if (!unsafe_user_region_active())
+		return;
+
+	printk(KERN_ERR
+		"BUG: rescheduling function called from user access context at %s:%d\n",
+			file, line);
+	dump_stack();
+}
+EXPORT_SYMBOL(__might_resched);
+#endif
+
 #ifdef CONFIG_MAGIC_SYSRQ
 void normalize_rt_tasks(void)
 {
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index d4df5b2..d030e31 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2069,6 +2069,14 @@  config IO_STRICT_DEVMEM

 	  If in doubt, say Y.

+config DEBUG_UACCESS_SLEEP
+	bool "Check sleep inside a user access region"
+	depends on DEBUG_KERNEL
+	help
+	  If you say Y here, various routines which may sleep will become very
+	  noisy if they are called inside a user access region (i.e. between
+	  a user_access_begin() and a user_access_end())
+
 source "arch/$(SRCARCH)/Kconfig.debug"

 endmenu # Kernel hacking