kernel.h: Add non_block_start/end()

Message ID 20190521100611.10089-1-daniel.vetter@ffwll.ch (mailing list archive)
State New, archived

Commit Message

Daniel Vetter May 21, 2019, 10:06 a.m. UTC
In some special cases we must not block, but there's not a
spinlock, preempt-off, irqs-off or similar critical section already
that arms the might_sleep() debug checks. Add a non_block_start/end()
pair to annotate these.

This will be used in the oom paths of mmu-notifiers, where blocking is
not allowed to make sure there's forward progress. Quoting Michal:

"The notifier is called from quite a restricted context - oom_reaper -
which shouldn't depend on any locks or sleepable conditionals. The code
should be swift as well but we mostly do care about it to make a forward
progress. Checking for sleepable context is the best thing we could come
up with that would describe these demands at least partially."

Peter also asked whether we want to catch spinlocks on top, but Michal
said those are less of a problem because spinlocks can't have an
indirect dependency upon the page allocator and hence close the loop
with the oom reaper.

Suggested by Michal Hocko.

v2:
- Improve commit message (Michal)
- Also check in schedule, not just might_sleep (Peter)

v3: It works better when I actually squash in the fixup I had lying
around :-/

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: linux-mm@kvack.org
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Wei Wang <wvw@google.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-kernel@vger.kernel.org
Acked-by: Christian König <christian.koenig@amd.com> (v1)
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 include/linux/kernel.h | 10 +++++++++-
 include/linux/sched.h  |  4 ++++
 kernel/sched/core.c    | 19 ++++++++++++++-----
 3 files changed, 27 insertions(+), 6 deletions(-)
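
For readers who want to see the counter semantics in isolation, here is a
minimal userspace sketch of the non_block_start()/non_block_end() pair. It is
not the kernel code: struct model_task, the model_* helpers and the warnings
field are hypothetical stand-ins for current->non_block_count and WARN_ON().
It only illustrates that sections nest and that an unbalanced end is reported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-task field the patch adds. */
struct model_task {
	int non_block_count;
	int warnings;	/* how often the WARN_ON() stand-in fired */
};

/* Mirrors non_block_start(): entering a section only bumps the counter. */
static void model_non_block_start(struct model_task *t)
{
	assert(t != NULL);
	t->non_block_count++;
}

/*
 * Mirrors non_block_end(): complain if the counter was already zero,
 * then decrement regardless, like the post-decrement in the patch.
 */
static void model_non_block_end(struct model_task *t)
{
	assert(t != NULL);
	if (t->non_block_count-- == 0) {
		t->warnings++;
		fprintf(stderr, "WARNING: unbalanced non_block_end()\n");
	}
}

/* True while at least one annotated section is open (or the count is off). */
static bool model_in_non_block_section(const struct model_task *t)
{
	assert(t != NULL);
	return t->non_block_count != 0;
}
```

Because the macro in the patch uses a post-decrement, an unbalanced end in this
model leaves the counter at -1, so the task still looks like it is inside a
non-blocking section afterwards.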

Comments

Michal Hocko May 21, 2019, 10:46 a.m. UTC | #1
On Tue 21-05-19 12:06:11, Daniel Vetter wrote:
> In some special cases we must not block, but there's not a
> spinlock, preempt-off, irqs-off or similar critical section already
> that arms the might_sleep() debug checks. Add a non_block_start/end()
> pair to annotate these.
> 
> This will be used in the oom paths of mmu-notifiers, where blocking is
> not allowed to make sure there's forward progress. Quoting Michal:
> 
> "The notifier is called from quite a restricted context - oom_reaper -
> which shouldn't depend on any locks or sleepable conditionals. The code
> should be swift as well but we mostly do care about it to make a forward
> progress. Checking for sleepable context is the best thing we could come
> up with that would describe these demands at least partially."
> 
> Peter also asked whether we want to catch spinlocks on top, but Michal
> said those are less of a problem because spinlocks can't have an
> indirect dependency upon the page allocator and hence close the loop
> with the oom reaper.
> 
> Suggested by Michal Hocko.
> 
> v2:
> - Improve commit message (Michal)
> - Also check in schedule, not just might_sleep (Peter)
> 
> v3: It works better when I actually squash in the fixup I had lying
> around :-/
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: linux-mm@kvack.org
> Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> Cc: Wei Wang <wvw@google.com>
> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Jann Horn <jannh@google.com>
> Cc: Feng Tang <feng.tang@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: linux-kernel@vger.kernel.org
> Acked-by: Christian König <christian.koenig@amd.com> (v1)
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

I like this in general. The implementation looks reasonable to me but I
didn't check deeply enough to give my R-by or A-by.

Daniel Vetter May 21, 2019, 2:44 p.m. UTC | #2
On Tue, May 21, 2019 at 08:24:53PM +0900, Tetsuo Handa wrote:
> On 2019/05/21 20:11, Michal Hocko wrote:
> > On Tue 21-05-19 20:04:34, Tetsuo Handa wrote:
> >> On 2019/05/21 19:51, Michal Hocko wrote:
> >>> On Tue 21-05-19 19:44:01, Tetsuo Handa wrote:
> >>>> On 2019/05/21 19:06, Daniel Vetter wrote:
> >>>>> In some special cases we must not block, but there's not a
> >>>>> spinlock, preempt-off, irqs-off or similar critical section already
> >>>>> that arms the might_sleep() debug checks. Add a non_block_start/end()
> >>>>> pair to annotate these.
> >>>>>
> >>>>> This will be used in the oom paths of mmu-notifiers, where blocking is
> >>>>> not allowed to make sure there's forward progress. Quoting Michal:
> >>>>>
> >>>>> "The notifier is called from quite a restricted context - oom_reaper -
> >>>>> which shouldn't depend on any locks or sleepable conditionals. The code
> >>>>> should be swift as well but we mostly do care about it to make a forward
> >>>>> progress. Checking for sleepable context is the best thing we could come
> >>>>> up with that would describe these demands at least partially."
> >>>>>
> >>>>
> >>>> Can this be checked for OOM notifier as well?
> >>>>
> >>>>  	if (!is_memcg_oom(oc)) {
> >>>> +		non_block_start();
> >>>>  		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> >>>> +		non_block_end();
> >>>>  		if (freed > 0)
> >>>>  			/* Got some memory back in the last second. */
> >>>>  			return true;
> >>>>  	}
> >>>>
> >>>> It is not clear whether i915's oom_notifier function has such dependency.
> >>>
> >>> It is not but then we should be using the non-blocking API if this is
> >>> a real problem. The above code just doesn't make any sense. We have a
> >>> blocking API called and wrapped by non-blocking one.
> >>
> >> OOM notifiers should not depend on any locks or sleepable conditionals.
> >> If some lock directly or indirectly depended on __GFP_DIRECT_RECLAIM,
> >> it will deadlock. Thus, despite blocking API, this should effectively be
> >> non-blocking. All OOM notifier users except i915 seem to be atomic, but
> >> I can't evaluate i915 part...
> > 
> > Read again what I've written, please
> > 
> 
> Question to Daniel: Is i915's oom_notifier function atomic?

It's supposed to not block too much at least, I don't think it's entirely
atomic. Waking up the device (which we need to write some of the ptes)
will take some time and I think acquires a few mutexes, but not 100% sure.

If you want to see, send a patch to intel-gfx m-l and CI will pick it up
and test with our farm of machines.
-Daniel
Michal Hocko May 21, 2019, 2:47 p.m. UTC | #3
On Tue 21-05-19 14:43:38, Cristopher Lameter wrote:
> On Tue, 21 May 2019, Daniel Vetter wrote:
> 
> > In some special cases we must not block, but there's not a
> > spinlock, preempt-off, irqs-off or similar critical section already
> > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > pair to annotate these.
> 
> Just putting preempt on/off around these is not sufficient?

It is not a critical section. It is a _debugging_ facility to help
discover blocking contexts.
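
A small userspace sketch of that distinction, assuming a simplified version of
the ___might_sleep() condition from the patch (the system_state and
oops_in_progress escapes are left out). struct model_ctx and the model_*
functions are hypothetical names; the point is only that the check reports a
problem and the blocking call still goes ahead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical snapshot of the state that ___might_sleep() inspects. */
struct model_ctx {
	bool preempt_disabled;	/* stands in for !preempt_count_equals(offset) */
	bool irqs_disabled;
	bool is_idle_task;
	int non_block_count;	/* the counter added by this patch */
	int completed_sleeps;	/* modelled blocking calls that still ran */
};

/* True when the model would print the "invalid context" report. */
static bool model_might_sleep_reports(const struct model_ctx *c)
{
	bool sleep_allowed;

	assert(c != NULL);
	sleep_allowed = !c->preempt_disabled && !c->irqs_disabled &&
			!c->is_idle_task && !c->non_block_count;
	return !sleep_allowed;
}

/*
 * A modelled blocking call: the check may complain, but nothing stops
 * the call from proceeding. The annotation is a debug aid, not a lock.
 */
static bool model_blocking_call(struct model_ctx *c)
{
	bool reported = model_might_sleep_reports(c);

	if (reported)
		fprintf(stderr, "BUG: sleeping function called from invalid context\n");
	c->completed_sleeps++;
	return reported;
}
```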
Daniel Vetter May 21, 2019, 2:48 p.m. UTC | #4
On Tue, May 21, 2019 at 12:46:38PM +0200, Michal Hocko wrote:
> On Tue 21-05-19 12:06:11, Daniel Vetter wrote:
> > In some special cases we must not block, but there's not a
> > spinlock, preempt-off, irqs-off or similar critical section already
> > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > pair to annotate these.
> > 
> > This will be used in the oom paths of mmu-notifiers, where blocking is
> > not allowed to make sure there's forward progress. Quoting Michal:
> > 
> > "The notifier is called from quite a restricted context - oom_reaper -
> > which shouldn't depend on any locks or sleepable conditionals. The code
> > should be swift as well but we mostly do care about it to make a forward
> > progress. Checking for sleepable context is the best thing we could come
> > up with that would describe these demands at least partially."
> > 
> > Peter also asked whether we want to catch spinlocks on top, but Michal
> > said those are less of a problem because spinlocks can't have an
> > indirect dependency upon the page allocator and hence close the loop
> > with the oom reaper.
> > 
> > Suggested by Michal Hocko.
> > 
> > v2:
> > - Improve commit message (Michal)
> > - Also check in schedule, not just might_sleep (Peter)
> > 
> > v3: It works better when I actually squash in the fixup I had lying
> > around :-/
> > 
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: "Christian König" <christian.koenig@amd.com>
> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > Cc: "Jérôme Glisse" <jglisse@redhat.com>
> > Cc: linux-mm@kvack.org
> > Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> > Cc: Wei Wang <wvw@google.com>
> > Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Jann Horn <jannh@google.com>
> > Cc: Feng Tang <feng.tang@intel.com>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Randy Dunlap <rdunlap@infradead.org>
> > Cc: linux-kernel@vger.kernel.org
> > Acked-by: Christian König <christian.koenig@amd.com> (v1)
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> 
> I like this in general. The implementation looks reasonable to me but I
> didn't check deeply enough to give my R-by or A-by.

Thanks for all your comments. I'll ask Jerome Glisse to look into this, I
think it could be useful for all the HMM work too.

And I sent this out without reply-to the patch it's supposed to replace,
will need to do that again so patchwork and 0day pick up the correct
series. Sry about that noise :-/
-Daniel

Daniel Vetter May 22, 2019, 7:04 a.m. UTC | #5
On Tue, May 21, 2019 at 11:06 PM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> On 2019/05/21 23:44, Daniel Vetter wrote:
> >>>> OOM notifiers should not depend on any locks or sleepable conditionals.
> >>>> If some lock directly or indirectly depended on __GFP_DIRECT_RECLAIM,
> >>>> it will deadlock. Thus, despite blocking API, this should effectively be
> >>>> non-blocking. All OOM notifier users except i915 seem to be atomic, but
> >>>> I can't evaluate i915 part...
> >>>
> >>> Read again what I've written, please
> >>>
> >>
> >> Question to Daniel: Is i915's oom_notifier function atomic?
> >
> > It's supposed to not block too much at least, I don't think it's entirely
> > atomic. Waking up the device (which we need to write some of the ptes)
> > will take some time and I think acquires a few mutexes, but not 100% sure.
> >
> > If you want to see, send a patch to intel-gfx m-l and CI will pick it up
> > and test with our farm of machines.
>
> As soon as a mutex is held, we can't expect it is atomic. We need to
> manually inspect whether there is __GFP_DIRECT_RECLAIM dependency...
>
> Since OOM notifier will be called after shrinkers are attempted,
> can i915 move from OOM notifier to shrinker?

We also have a shrinker. The trouble is a bit that locking design in
i915 is still not great (it's a lot better than it was), and iirc
that's why we had the oom fallback. It unconditionally throws out a
bunch of things we can do with less locking. Maybe we could stuff that
into the shrinker now. Adding Chris.
-Daniel
Chris Wilson May 22, 2019, 7:13 a.m. UTC | #6
Quoting Michal Hocko (2019-05-22 07:34:42)
> On Wed 22-05-19 06:06:31, Tetsuo Handa wrote:
> [...]
> > Since OOM notifier will be called after shrinkers are attempted,
> > can i915 move from OOM notifier to shrinker?
> 
> That would be indeed preferable. OOM notifier is an API from hell.

We were^W are still trying to make the shrinker nonblocking to avoid
incurring horrible latencies for light direct reclaim. The consequence
of avoiding heavy work in the shrinker is that we moved it to the oom
notifier as being the last chance we have to return all (can be literally
all) the system memory.

The alternative to using a separate oom notifier would be more
reclaim/shrinker phases?
-Chris

Patch

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 74b1ee9027f5..b5f2c2ff0eab 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -214,7 +214,9 @@  extern void __cant_sleep(const char *file, int line, int preempt_offset);
  * might_sleep - annotation for functions that can sleep
  *
  * this macro will print a stack trace if it is executed in an atomic
- * context (spinlock, irq-handler, ...).
+ * context (spinlock, irq-handler, ...). Additional sections where blocking is
+ * not allowed can be annotated with non_block_start() and non_block_end()
+ * pairs.
  *
  * This is a useful debugging help to be able to catch problems early and not
  * be bitten later when the calling function happens to sleep when it is not
@@ -230,6 +232,10 @@  extern void __cant_sleep(const char *file, int line, int preempt_offset);
 # define cant_sleep() \
 	do { __cant_sleep(__FILE__, __LINE__, 0); } while (0)
 # define sched_annotate_sleep()	(current->task_state_change = 0)
+# define non_block_start() \
+	do { current->non_block_count++; } while (0)
+# define non_block_end() \
+	do { WARN_ON(current->non_block_count-- == 0); } while (0)
 #else
   static inline void ___might_sleep(const char *file, int line,
 				   int preempt_offset) { }
@@ -238,6 +244,8 @@  extern void __cant_sleep(const char *file, int line, int preempt_offset);
 # define might_sleep() do { might_resched(); } while (0)
 # define cant_sleep() do { } while (0)
 # define sched_annotate_sleep() do { } while (0)
+# define non_block_start() do { } while (0)
+# define non_block_end() do { } while (0)
 #endif
 
 #define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 11837410690f..7f5b293e72df 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -908,6 +908,10 @@  struct task_struct {
 	struct mutex_waiter		*blocked_on;
 #endif
 
+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	int				non_block_count;
+#endif
+
 #ifdef CONFIG_TRACE_IRQFLAGS
 	unsigned int			irq_events;
 	unsigned long			hardirq_enable_ip;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 102dfcf0a29a..ed7755a28465 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3264,13 +3264,22 @@  static noinline void __schedule_bug(struct task_struct *prev)
 /*
  * Various schedule()-time debugging checks and statistics:
  */
-static inline void schedule_debug(struct task_struct *prev)
+static inline void schedule_debug(struct task_struct *prev, bool preempt)
 {
 #ifdef CONFIG_SCHED_STACK_END_CHECK
 	if (task_stack_end_corrupted(prev))
 		panic("corrupted stack end detected inside scheduler\n");
 #endif
 
+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	if (!preempt && prev->state && prev->non_block_count) {
+		printk(KERN_ERR "BUG: scheduling in a non-blocking section: %s/%d/%i\n",
+			prev->comm, prev->pid, prev->non_block_count);
+		dump_stack();
+		add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
+	}
+#endif
+
 	if (unlikely(in_atomic_preempt_off())) {
 		__schedule_bug(prev);
 		preempt_count_set(PREEMPT_DISABLED);
@@ -3377,7 +3386,7 @@  static void __sched notrace __schedule(bool preempt)
 	rq = cpu_rq(cpu);
 	prev = rq->curr;
 
-	schedule_debug(prev);
+	schedule_debug(prev, preempt);
 
 	if (sched_feat(HRTICK))
 		hrtick_clear(rq);
@@ -6102,7 +6111,7 @@  void ___might_sleep(const char *file, int line, int preempt_offset)
 	rcu_sleep_check();
 
 	if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
-	     !is_idle_task(current)) ||
+	     !is_idle_task(current) && !current->non_block_count) ||
 	    system_state == SYSTEM_BOOTING || system_state > SYSTEM_RUNNING ||
 	    oops_in_progress)
 		return;
@@ -6118,8 +6127,8 @@  void ___might_sleep(const char *file, int line, int preempt_offset)
 		"BUG: sleeping function called from invalid context at %s:%d\n",
 			file, line);
 	printk(KERN_ERR
-		"in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",
-			in_atomic(), irqs_disabled(),
+		"in_atomic(): %d, irqs_disabled(): %d, non_block: %d, pid: %d, name: %s\n",
+			in_atomic(), irqs_disabled(), current->non_block_count,
 			current->pid, current->comm);
 
 	if (task_stack_end_corrupted(current))