
mm,oom: Bring OOM notifier callbacks outside of the OOM killer.

Message ID 1529493638-6389-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp (mailing list archive)
State New, archived

Commit Message

Tetsuo Handa June 20, 2018, 11:20 a.m. UTC
Sleeping with oom_lock held can cause an AB-BA lockup bug because
__alloc_pages_may_oom() does not wait for oom_lock. Since
blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
with oom_lock held is currently an unavoidable problem.

In preparation for not sleeping with oom_lock held, this patch moves
OOM notifier callbacks outside of the OOM killer, with two small
behavior changes explained below.

One is that this patch makes it impossible for SysRq-f and PF-OOM to
reclaim via the OOM notifier. Such a change should be tolerable because
we are unlikely to use SysRq-f for reclaiming memory via OOM notifier
callbacks, and because pagefault_out_of_memory() is called only after
the OOM killer has selected the current thread as an OOM victim, i.e.
after the OOM notifier callbacks have already failed to reclaim memory.

The other is that this patch makes it possible to reclaim memory via
the OOM notifier after the OOM killer is disabled (that is, while
suspend/hibernate is in progress). Such a change should be safe because
of the pm_suspended_storage() check.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h |  1 +
 mm/oom_kill.c       | 35 ++++++++++++++++++------
 mm/page_alloc.c     | 76 +++++++++++++++++++++++++++++++----------------------
 3 files changed, 73 insertions(+), 39 deletions(-)

Comments

Michal Hocko June 20, 2018, 11:55 a.m. UTC | #1
On Wed 20-06-18 20:20:38, Tetsuo Handa wrote:
> Sleeping with oom_lock held can cause an AB-BA lockup bug because
> __alloc_pages_may_oom() does not wait for oom_lock. Since
> blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
> with oom_lock held is currently an unavoidable problem.

Could you be more specific about the potential deadlock? Sleeping while
holding oom lock is certainly not nice but I do not see how that would
result in a deadlock assuming that the sleeping context doesn't sleep on
the memory allocation obviously.

> In preparation for not sleeping with oom_lock held, this patch moves
> OOM notifier callbacks outside of the OOM killer, with two small
> behavior changes explained below.

Can we just eliminate this ugliness and remove it altogether? We do not
have that many notifiers. Is there anything fundamental that would
prevent us from moving them to shrinkers instead?
Tetsuo Handa June 20, 2018, 12:21 p.m. UTC | #2
On 2018/06/20 20:55, Michal Hocko wrote:
> On Wed 20-06-18 20:20:38, Tetsuo Handa wrote:
>> Sleeping with oom_lock held can cause an AB-BA lockup bug because
>> __alloc_pages_may_oom() does not wait for oom_lock. Since
>> blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
>> with oom_lock held is currently an unavoidable problem.
> 
> Could you be more specific about the potential deadlock? Sleeping while
> holding oom lock is certainly not nice but I do not see how that would
> result in a deadlock assuming that the sleeping context doesn't sleep on
> the memory allocation obviously.

"A" is "owns oom_lock" and "B" is "owns CPU resources". It was demonstrated
at "mm,oom: Don't call schedule_timeout_killable() with oom_lock held." proposal.

But since you don't accept preserving the short sleep, which is a heuristic
for reducing the possibility of the AB-BA lockup, the only approach left is
to wait for the owner of oom_lock (e.g. by s/mutex_trylock/mutex_lock/ or
whatever), which is free of heuristics and free of the AB-BA lockup.
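
For illustration, a minimal sketch of that conversion (hypothetical; this
hunk is not part of the posted patch):

/*
 * Hypothetical sketch: wait for the oom_lock owner instead of backing
 * off with mutex_trylock(). If we are killed while waiting, the OOM
 * killer has likely chosen us, so report progress and let the retried
 * allocation dip into memory reserves.
 */
static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
		      const struct alloc_context *ac,
		      unsigned long *did_some_progress)
{
	struct page *page = NULL;

	*did_some_progress = 0;

	if (mutex_lock_killable(&oom_lock)) {
		*did_some_progress = 1;
		return NULL;
	}
	/* ... the rest of __alloc_pages_may_oom() unchanged ... */
	mutex_unlock(&oom_lock);
	return page;
}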

> 
>> In preparation for not sleeping with oom_lock held, this patch moves
>> OOM notifier callbacks outside of the OOM killer, with two small
>> behavior changes explained below.
> 
> Can we just eliminate this ugliness and remove it altogether? We do not
> have that many notifiers. Is there anything fundamental that would
> prevent us from moving them to shrinkers instead?
> 

In the long term, it would be possible, but not within this patch. For
example, I think that virtio_balloon wants to release memory only when we
have no choice but to OOM-kill. If virtio_balloon trivially releases memory,
it will increase the risk of the entire guest being killed by the OOM killer
on the host side.
Michal Hocko June 20, 2018, 1:07 p.m. UTC | #3
On Wed 20-06-18 21:21:21, Tetsuo Handa wrote:
> On 2018/06/20 20:55, Michal Hocko wrote:
> > On Wed 20-06-18 20:20:38, Tetsuo Handa wrote:
> >> Sleeping with oom_lock held can cause an AB-BA lockup bug because
> >> __alloc_pages_may_oom() does not wait for oom_lock. Since
> >> blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
> >> with oom_lock held is currently an unavoidable problem.
> > 
> > Could you be more specific about the potential deadlock? Sleeping while
> > holding oom lock is certainly not nice but I do not see how that would
> > result in a deadlock assuming that the sleeping context doesn't sleep on
> > the memory allocation obviously.
> 
> "A" is "owns oom_lock" and "B" is "owns CPU resources". It was demonstrated
> at "mm,oom: Don't call schedule_timeout_killable() with oom_lock held." proposal.

This is not a deadlock but merely resource starvation AFAIU.

> But since you don't accept preserving the short sleep, which is a heuristic
> for reducing the possibility of the AB-BA lockup, the only approach left is
> to wait for the owner of oom_lock (e.g. by s/mutex_trylock/mutex_lock/ or
> whatever), which is free of heuristics and free of the AB-BA lockup.
> 
> > 
> >> In preparation for not sleeping with oom_lock held, this patch moves
> >> OOM notifier callbacks outside of the OOM killer, with two small
> >> behavior changes explained below.
> > 
> > Can we just eliminate this ugliness and remove it altogether? We do not
> > have that many notifiers. Is there anything fundamental that would
> > prevent us from moving them to shrinkers instead?
> > 
> 
> In the long term, it would be possible, but not within this patch. For
> example, I think that virtio_balloon wants to release memory only when we
> have no choice but to OOM-kill. If virtio_balloon trivially releases memory,
> it will increase the risk of the entire guest being killed by the OOM killer
> on the host side.

I would _prefer_ to think long term here. The sleep inside the oom lock is
not something real workloads are seeing out there AFAICS. Adding quite
some code to address such a case doesn't justify the inclusion IMHO.
David Rientjes June 20, 2018, 10:36 p.m. UTC | #4
On Wed, 20 Jun 2018, Tetsuo Handa wrote:

> Sleeping with oom_lock held can cause an AB-BA lockup bug because
> __alloc_pages_may_oom() does not wait for oom_lock. Since
> blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
> with oom_lock held is currently an unavoidable problem.
> 
> In preparation for not sleeping with oom_lock held, this patch moves
> OOM notifier callbacks outside of the OOM killer, with two small
> behavior changes explained below.
> 
> One is that this patch makes it impossible for SysRq-f and PF-OOM to
> reclaim via the OOM notifier. Such a change should be tolerable because
> we are unlikely to use SysRq-f for reclaiming memory via OOM notifier
> callbacks, and because pagefault_out_of_memory() is called only after
> the OOM killer has selected the current thread as an OOM victim, i.e.
> after the OOM notifier callbacks have already failed to reclaim memory.
> 
> The other is that this patch makes it possible to reclaim memory via
> the OOM notifier after the OOM killer is disabled (that is, while
> suspend/hibernate is in progress). Such a change should be safe because
> of the pm_suspended_storage() check.
> 

Makes sense in general and I don't think that getting around these two 
caveats is problematic.  We should be handling everything that can free 
memory to preempt an oom kill in the page allocator and get rid of the 
notion that this is "oom notification" when in reality it is intended as 
the last form of reclaim prior to oom kill.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  include/linux/oom.h |  1 +
>  mm/oom_kill.c       | 35 ++++++++++++++++++------
>  mm/page_alloc.c     | 76 +++++++++++++++++++++++++++++++----------------------
>  3 files changed, 73 insertions(+), 39 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 6adac11..085b033 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -101,6 +101,7 @@ extern unsigned long oom_badness(struct task_struct *p,
>  		struct mem_cgroup *memcg, const nodemask_t *nodemask,
>  		unsigned long totalpages);
>  
> +extern unsigned long try_oom_notifier(void);
>  extern bool out_of_memory(struct oom_control *oc);
>  
>  extern void exit_oom_victim(void);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 84081e7..2ff5db2 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1010,6 +1010,33 @@ int unregister_oom_notifier(struct notifier_block *nb)
>  EXPORT_SYMBOL_GPL(unregister_oom_notifier);
>  
>  /**
> + * try_oom_notifier - Try to reclaim memory from OOM notifier list.
> + *
> + * Returns non-zero if notifier callbacks released something, zero otherwise.
> + */
> +unsigned long try_oom_notifier(void)

It certainly is tried, but based on its usage it would probably be better 
to describe what is being returned (it's going to set *did_some_progress, 
which is a page count).

That makes me think that "oom_notify_list" isn't very intuitive: it can 
free memory as a last step prior to oom kill.  OOM notify, to me, sounds 
like it's only notifying some callbacks about the condition.  Maybe 
oom_reclaim_list and then rename this to oom_reclaim_pages()?

> +{
> +	static DEFINE_MUTEX(oom_notifier_lock);
> +	unsigned long freed = 0;
> +
> +	/*
> +	 * Since OOM notifier callbacks must not depend on __GFP_DIRECT_RECLAIM
> +	 * && !__GFP_NORETRY memory allocation, waiting for mutex here is safe.
> +	 * If lockdep reports possible deadlock dependency, it will be a bug in
> +	 * OOM notifier callbacks.
> +	 *
> +	 * If SIGKILL is pending, it is likely that current thread was selected
> +	 * as an OOM victim. In that case, current thread should return as soon
> +	 * as possible using memory reserves.
> +	 */
> +	if (mutex_lock_killable(&oom_notifier_lock))
> +		return 0;
> +	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> +	mutex_unlock(&oom_notifier_lock);
> +	return freed;
> +}

If __blocking_notifier_call_chain() used down_read_killable(), could we 
eliminate oom_notifier_lock?
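
A killable call chain along those lines might look like the following
sketch (hypothetical; blocking_notifier_call_chain_killable() does not
exist in the tree):

/*
 * Hypothetical killable variant of blocking_notifier_call_chain():
 * fail with -EINTR when killed while waiting for the chain's rwsem,
 * which would make an extra serialization mutex unnecessary.
 * notifier_call_chain() is the static helper in kernel/notifier.c.
 */
int blocking_notifier_call_chain_killable(struct blocking_notifier_head *nh,
					  unsigned long val, void *v)
{
	int ret;

	if (down_read_killable(&nh->rwsem))
		return -EINTR;
	ret = notifier_call_chain(&nh->head, val, v, -1, NULL);
	up_read(&nh->rwsem);
	return ret;
}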

> +
> +/**
>   * out_of_memory - kill the "best" process when we run out of memory
>   * @oc: pointer to struct oom_control
>   *
> @@ -1020,19 +1047,11 @@ int unregister_oom_notifier(struct notifier_block *nb)
>   */
>  bool out_of_memory(struct oom_control *oc)
>  {
> -	unsigned long freed = 0;
>  	enum oom_constraint constraint = CONSTRAINT_NONE;
>  
>  	if (oom_killer_disabled)
>  		return false;
>  
> -	if (!is_memcg_oom(oc)) {
> -		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> -		if (freed > 0)
> -			/* Got some memory back in the last second. */
> -			return true;
> -	}
> -
>  	/*
>  	 * If current has a pending SIGKILL or is exiting, then automatically
>  	 * select it.  The goal is to allow it to allocate so that it may
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1521100..c72ef1e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3447,10 +3447,50 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  	return page;
>  }
>  
> +static inline bool can_oomkill(gfp_t gfp_mask, unsigned int order,
> +			       const struct alloc_context *ac)

I'd suggest a more mm/page_alloc.c friendly name, something like 
oom_kill_allowed().

> +{
> +	/* Coredumps can quickly deplete all memory reserves */
> +	if (current->flags & PF_DUMPCORE)
> +		return false;
> +	/* The OOM killer will not help higher order allocs */
> +	if (order > PAGE_ALLOC_COSTLY_ORDER)
> +		return false;
> +	/*
> +	 * We have already exhausted all our reclaim opportunities without any
> +	 * success so it is time to admit defeat. We will skip the OOM killer
> +	 * because it is very likely that the caller has a more reasonable
> +	 * fallback than shooting a random task.
> +	 */
> +	if (gfp_mask & __GFP_RETRY_MAYFAIL)
> +		return false;
> +	/* The OOM killer does not needlessly kill tasks for lowmem */
> +	if (ac->high_zoneidx < ZONE_NORMAL)
> +		return false;
> +	if (pm_suspended_storage())
> +		return false;
> +	/*
> +	 * XXX: GFP_NOFS allocations should rather fail than rely on
> +	 * other request to make a forward progress.
> +	 * We are in an unfortunate situation where out_of_memory cannot
> +	 * do much for this context but let's try it to at least get
> +	 * access to memory reserved if the current task is killed (see
> +	 * out_of_memory). Once filesystems are ready to handle allocation
> +	 * failures more gracefully we should just bail out here.
> +	 */
> +
> +	/* The OOM killer may not free memory on a specific node */
> +	if (gfp_mask & __GFP_THISNODE)
> +		return false;
> +
> +	return true;
> +}
> +
>  static inline struct page *
>  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	const struct alloc_context *ac, unsigned long *did_some_progress)
>  {
> +	const bool oomkill = can_oomkill(gfp_mask, order, ac);
>  	struct oom_control oc = {
>  		.zonelist = ac->zonelist,
>  		.nodemask = ac->nodemask,
> @@ -3462,6 +3502,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  
>  	*did_some_progress = 0;
>  
> +	/* Try to reclaim via OOM notifier callback. */
> +	if (oomkill)
> +		*did_some_progress = try_oom_notifier();
> +
>  	/*
>  	 * Acquire the oom lock.  If that fails, somebody else is
>  	 * making progress for us.

*did_some_progress = oom_kill_allowed ? oom_reclaim_pages() : 0;

This patch is certainly an improvement because it does the last 
get_page_from_freelist() call after invoking the oom notifiers that can 
free memory and we've otherwise pointlessly redirected it elsewhere.

> @@ -3485,37 +3529,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  	if (page)
>  		goto out;
>  
> -	/* Coredumps can quickly deplete all memory reserves */
> -	if (current->flags & PF_DUMPCORE)
> -		goto out;
> -	/* The OOM killer will not help higher order allocs */
> -	if (order > PAGE_ALLOC_COSTLY_ORDER)
> -		goto out;
> -	/*
> -	 * We have already exhausted all our reclaim opportunities without any
> -	 * success so it is time to admit defeat. We will skip the OOM killer
> -	 * because it is very likely that the caller has a more reasonable
> -	 * fallback than shooting a random task.
> -	 */
> -	if (gfp_mask & __GFP_RETRY_MAYFAIL)
> -		goto out;
> -	/* The OOM killer does not needlessly kill tasks for lowmem */
> -	if (ac->high_zoneidx < ZONE_NORMAL)
> -		goto out;
> -	if (pm_suspended_storage())
> -		goto out;
> -	/*
> -	 * XXX: GFP_NOFS allocations should rather fail than rely on
> -	 * other request to make a forward progress.
> -	 * We are in an unfortunate situation where out_of_memory cannot
> -	 * do much for this context but let's try it to at least get
> -	 * access to memory reserved if the current task is killed (see
> -	 * out_of_memory). Once filesystems are ready to handle allocation
> -	 * failures more gracefully we should just bail out here.
> -	 */
> -
> -	/* The OOM killer may not free memory on a specific node */
> -	if (gfp_mask & __GFP_THISNODE)
> +	if (!oomkill)
>  		goto out;
>  
>  	/* Exhausted what can be done so it's blame time */
Michal Hocko June 21, 2018, 7:31 a.m. UTC | #5
On Wed 20-06-18 15:36:45, David Rientjes wrote:
[...]
> That makes me think that "oom_notify_list" isn't very intuitive: it can 
> free memory as a last step prior to oom kill.  OOM notify, to me, sounds 
>> like its only notifying some callbacks about the condition.  Maybe 
> oom_reclaim_list and then rename this to oom_reclaim_pages()?

Yes agreed and that is the reason I keep saying we want to get rid of
this yet-another-reclaim mechanism. We already have shrinkers which are
the main source of non-lru pages reclaim. Why do we even need
oom_reclaim_pages? What is fundamentally different here? Sure those
pages should be reclaimed as the last resort but we already do have
priority for slab shrinking so we know that the system is struggling
when reaching the lowest priority. Isn't that enough to express the need
for current oom notifier implementations?
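
To make that concrete, such a conversion would register a shrinker instead
of an OOM notifier. A minimal sketch, where the balloon_* helpers are
hypothetical stand-ins for a driver's real bookkeeping:

static unsigned long balloon_shrink_count(struct shrinker *s,
					  struct shrink_control *sc)
{
	/* How many pages could we give back right now? (hypothetical) */
	return balloon_reclaimable_pages();
}

static unsigned long balloon_shrink_scan(struct shrinker *s,
					 struct shrink_control *sc)
{
	/*
	 * Release at most what reclaim asks for; nr_to_scan grows as the
	 * reclaim priority drops, so heavy pressure asks for more.
	 */
	return balloon_release_pages(sc->nr_to_scan);	/* hypothetical */
}

static struct shrinker balloon_shrinker = {
	.count_objects	= balloon_shrink_count,
	.scan_objects	= balloon_shrink_scan,
	.seeks		= DEFAULT_SEEKS,
};
/* register_shrinker(&balloon_shrinker) in the driver's init path. */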
Tetsuo Handa June 21, 2018, 11:27 a.m. UTC | #6
On 2018/06/21 7:36, David Rientjes wrote:
>> @@ -1010,6 +1010,33 @@ int unregister_oom_notifier(struct notifier_block *nb)
>>  EXPORT_SYMBOL_GPL(unregister_oom_notifier);
>>  
>>  /**
>> + * try_oom_notifier - Try to reclaim memory from OOM notifier list.
>> + *
>> + * Returns non-zero if notifier callbacks released something, zero otherwise.
>> + */
>> +unsigned long try_oom_notifier(void)
> 
> It certainly is tried, but based on its usage it would probably be better 
> to describe what is being returned (it's going to set *did_some_progress, 
> which is a page count).

Well, it depends on what the callbacks are doing. Currently, we have 5 users.

  arch/powerpc/platforms/pseries/cmm.c
  arch/s390/mm/cmm.c
  drivers/gpu/drm/i915/i915_gem_shrinker.c
  drivers/virtio/virtio_balloon.c
  kernel/rcu/tree_plugin.h

Speaking of rcu_oom_notify() in kernel/rcu/tree_plugin.h, we can't tell whether
the callback helped release memory, for it does not update the "freed" argument.

>> +{
>> +	static DEFINE_MUTEX(oom_notifier_lock);
>> +	unsigned long freed = 0;
>> +
>> +	/*
>> +	 * Since OOM notifier callbacks must not depend on __GFP_DIRECT_RECLAIM
>> +	 * && !__GFP_NORETRY memory allocation, waiting for mutex here is safe.
>> +	 * If lockdep reports possible deadlock dependency, it will be a bug in
>> +	 * OOM notifier callbacks.
>> +	 *
>> +	 * If SIGKILL is pending, it is likely that current thread was selected
>> +	 * as an OOM victim. In that case, current thread should return as soon
>> +	 * as possible using memory reserves.
>> +	 */
>> +	if (mutex_lock_killable(&oom_notifier_lock))
>> +		return 0;
>> +	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
>> +	mutex_unlock(&oom_notifier_lock);
>> +	return freed;
>> +}
> 
> If __blocking_notifier_call_chain() used down_read_killable(), could we 
> eliminate oom_notifier_lock?

I don't think we can eliminate it now, for it serves as a serialization lock
(while still responding to SIGKILL as soon as possible), a role currently
played by mutex_trylock(&oom_lock).

(1) rcu_oom_notify() in kernel/rcu/tree_plugin.h is not prepared for being
    called concurrently.

----------
static int rcu_oom_notify(struct notifier_block *self,
			  unsigned long notused, void *nfreed)
{
	int cpu;

	/* Wait for callbacks from earlier instance to complete. */
	wait_event(oom_callback_wq, atomic_read(&oom_callback_count) == 0); // <= Multiple threads can pass this line at the same time.
	smp_mb(); /* Ensure callback reuse happens after callback invocation. */

	/*
	 * Prevent premature wakeup: ensure that all increments happen
	 * before there is a chance of the counter reaching zero.
	 */
	atomic_set(&oom_callback_count, 1); // <= Multiple threads can execute this line at the same time.

	for_each_online_cpu(cpu) {
		smp_call_function_single(cpu, rcu_oom_notify_cpu, NULL, 1);
		cond_resched_tasks_rcu_qs();
	}

	/* Unconditionally decrement: no need to wake ourselves up. */
	atomic_dec(&oom_callback_count); // <= Multiple threads can execute this line at the same time, making oom_callback_count < 0 ?

	return NOTIFY_OK;
}
----------

    The counter inconsistency problem could be fixed by

-	atomic_set(&oom_callback_count, 1);
+	atomic_inc(&oom_callback_count);

    but who would benefit if rcu_oom_notify() became safe to be called
    concurrently? We want to wait for the callback to complete before
    proceeding to the OOM killer. I think that we should save CPU resources
    by serializing concurrent callers.

(2) i915_gem_shrinker_oom() in drivers/gpu/drm/i915/i915_gem_shrinker.c depends
    on mutex_trylock() from shrinker_lock() from i915_gem_shrink() from
    i915_gem_shrink_all() to return 1 (i.e. succeed) before need_resched()
    becomes true in order to avoid returning without reclaiming memory.

> This patch is certainly an improvement because it does the last 
> get_page_from_freelist() call after invoking the oom notifiers that can 
> free memory and we've otherwise pointlessly redirected it elsewhere.

Thanks, but this patch might break the subtle balance which is currently
achieved by mutex_trylock(&oom_lock) serialization/exclusion.

(3) virtballoon_oom_notify() in drivers/virtio/virtio_balloon.c by default
    tries to release 256 pages. Since this value is configurable, one might
    set 1048576 pages. If virtballoon_oom_notify() is concurrently called by
    many threads, it might needlessly deflate the memory balloon.

We might want to remember and reuse the last result among serialized callers
(feedback mechanism) like

{
	static DEFINE_MUTEX(oom_notifier_lock);
	static unsigned long last_freed;
	unsigned long freed = 0;
	if (mutex_trylock(&oom_notifier_lock)) {
		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
		last_freed = freed;
	} else {
		mutex_lock(&oom_notifier_lock);
		freed = last_freed;
	}
	mutex_unlock(&oom_notifier_lock);
	return freed;

}

or

{
	static DEFINE_MUTEX(oom_notifier_lock);
	static unsigned long last_freed;
	unsigned long freed = 0;
	if (mutex_lock_killable(&oom_notifier_lock)) {
		freed = last_freed;
		last_freed >>= 1;
		return freed;
	} else if (last_freed) {
		freed = last_freed;
		last_freed >>= 1;
	} else {
		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
		last_freed = freed;
	}
	mutex_unlock(&oom_notifier_lock);
	return freed;
}

. Without a feedback mechanism, mutex_lock_killable(&oom_notifier_lock) serialization
could still needlessly deflate the memory balloon compared to mutex_trylock(&oom_lock)
serialization/exclusion. Maybe virtballoon_oom_notify() (and the two CMM users) would
implement a feedback mechanism themselves, by examining watermarks from their OOM
notifier hooks.



On 2018/06/21 16:31, Michal Hocko wrote:
> On Wed 20-06-18 15:36:45, David Rientjes wrote:
> [...]
>> That makes me think that "oom_notify_list" isn't very intuitive: it can 
>> free memory as a last step prior to oom kill.  OOM notify, to me, sounds 
>> like its only notifying some callbacks about the condition.  Maybe 
>> oom_reclaim_list and then rename this to oom_reclaim_pages()?
> 
> Yes agreed and that is the reason I keep saying we want to get rid of
> this yet-another-reclaim mechanism. We already have shrinkers which are
> the main source of non-lru pages reclaim. Why do we even need
> oom_reclaim_pages? What is fundamentally different here? Sure those
> pages should be reclaimed as the last resort but we already do have
> priority for slab shrinking so we know that the system is struggling
> when reaching the lowest priority. Isn't that enough to express the need
> for current oom notifier implementations?
> 

Even if we update OOM notifier users to use shrinker hooks, they will need a
subtle balance which is currently achieved by mutex_trylock(&oom_lock).

Removing the OOM notifier is not doable right now. It is not suitable as a fix
for the regression introduced by commit 27ae357fa82be5ab ("mm, oom: fix
concurrent munlock and oom reaper unmap, v3"). What we could afford for this
regression is https://patchwork.kernel.org/patch/9842889/ which is exactly what
you suggested in a thread at https://www.spinics.net/lists/linux-mm/msg117896.html .
Michal Hocko June 21, 2018, 12:05 p.m. UTC | #7
On Thu 21-06-18 20:27:41, Tetsuo Handa wrote:
[....]
> On 2018/06/21 16:31, Michal Hocko wrote:
> > On Wed 20-06-18 15:36:45, David Rientjes wrote:
> > [...]
> >> That makes me think that "oom_notify_list" isn't very intuitive: it can 
> >> free memory as a last step prior to oom kill.  OOM notify, to me, sounds 
> >> like its only notifying some callbacks about the condition.  Maybe 
> >> oom_reclaim_list and then rename this to oom_reclaim_pages()?
> > 
> > Yes agreed and that is the reason I keep saying we want to get rid of
> > this yet-another-reclaim mechanism. We already have shrinkers which are
> > the main source of non-lru pages reclaim. Why do we even need
> > oom_reclaim_pages? What is fundamentally different here? Sure those
> > pages should be reclaimed as the last resort but we already do have
> > priority for slab shrinking so we know that the system is struggling
> > when reaching the lowest priority. Isn't that enough to express the need
> > for current oom notifier implementations?
> > 
> 
> Even if we update OOM notifier users to use shrinker hooks, they will need a
> subtle balance which is currently achieved by mutex_trylock(&oom_lock).

No they do not. They do not want to rely on unrelated locks to work
properly. That is a completely wrong design. We do want locks to protect
specific data rather than code.

> Removing the OOM notifier is not doable right now.

I haven't heard any technical argument why.

> It is not suitable as a fix
> for the regression introduced by commit 27ae357fa82be5ab ("mm, oom: fix
> concurrent munlock and oom reaper unmap, v3").

What is the regression?

> What we could afford for this regression is
> https://patchwork.kernel.org/patch/9842889/ which is exactly what you suggested
> in a thread at https://www.spinics.net/lists/linux-mm/msg117896.html .
Peter Enderborg June 25, 2018, 1:03 p.m. UTC | #8
On 06/20/2018 01:55 PM, Michal Hocko wrote:
> On Wed 20-06-18 20:20:38, Tetsuo Handa wrote:
>> Sleeping with oom_lock held can cause an AB-BA lockup bug because
>> __alloc_pages_may_oom() does not wait for oom_lock. Since
>> blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
>> with oom_lock held is currently an unavoidable problem.
> Could you be more specific about the potential deadlock? Sleeping while
> holding oom lock is certainly not nice but I do not see how that would
> result in a deadlock assuming that the sleeping context doesn't sleep on
> the memory allocation obviously.

It is a mutex; you are supposed to be able to sleep while holding it.  It's even exported.

>> In preparation for not sleeping with oom_lock held, this patch moves
>> OOM notifier callbacks outside of the OOM killer, with two small
>> behavior changes explained below.
> Can we just eliminate this ugliness and remove it altogether? We do not
> have that many notifiers. Is there anything fundamental that would
> prevent us from moving them to shrinkers instead?


@Hocko Do you remember the lowmemorykiller from android? Some things might not be the right thing for shrinkers.
Michal Hocko June 25, 2018, 1:07 p.m. UTC | #9
On Mon 25-06-18 15:03:40, peter enderborg wrote:
> On 06/20/2018 01:55 PM, Michal Hocko wrote:
> > On Wed 20-06-18 20:20:38, Tetsuo Handa wrote:
> >> Sleeping with oom_lock held can cause an AB-BA lockup bug because
> >> __alloc_pages_may_oom() does not wait for oom_lock. Since
> >> blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
> >> with oom_lock held is currently an unavoidable problem.
> > Could you be more specific about the potential deadlock? Sleeping while
> > holding oom lock is certainly not nice but I do not see how that would
> > result in a deadlock assuming that the sleeping context doesn't sleep on
> > the memory allocation obviously.
> 
> It is a mutex; you are supposed to be able to sleep while holding it.  It's even exported.

What do you mean? oom_lock is certainly not exported for general use. It
is not local to oom_kill.c just because it is needed in other _mm_
code.
 
> >> In preparation for not sleeping with oom_lock held, this patch moves
> >> OOM notifier callbacks outside of the OOM killer, with two small
> >> behavior changes explained below.
> > Can we just eliminate this ugliness and remove it altogether? We do not
> > have that many notifiers. Is there anything fundamental that would
> > prevent us from moving them to shrinkers instead?
> 
> 
> @Hocko Do you remember the lowmemorykiller from android? Some things
> might not be the right thing for shrinkers.

Just that lmk did it wrong doesn't mean others have to follow.
Peter Enderborg June 25, 2018, 2:04 p.m. UTC | #11
On 06/25/2018 03:07 PM, Michal Hocko wrote:

> On Mon 25-06-18 15:03:40, peter enderborg wrote:
>> On 06/20/2018 01:55 PM, Michal Hocko wrote:
>>> On Wed 20-06-18 20:20:38, Tetsuo Handa wrote:
>>>> Sleeping with oom_lock held can cause an AB-BA lockup bug because
>>>> __alloc_pages_may_oom() does not wait for oom_lock. Since
>>>> blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
>>>> with oom_lock held is currently an unavoidable problem.
>>> Could you be more specific about the potential deadlock? Sleeping while
>>> holding oom lock is certainly not nice but I do not see how that would
>>> result in a deadlock assuming that the sleeping context doesn't sleep on
>>> the memory allocation obviously.
>> It is a mutex; you are supposed to be able to sleep while holding it.  It's even exported.
> What do you mean? oom_lock is certainly not exported for general use. It
> is not local to oom_kill.c just because it is needed in other _mm_
> code.
>  

It is declared in include/linux/oom.h; if it is that sensitive, it should
be in mm/ with a documented note about the special rules. Outside mm it is
only used in drivers/tty/sysrq.c, and that could be replaced by a helper
function in mm that does the OOM work.


>>>> In preparation for not sleeping with oom_lock held, this patch moves
>>>> OOM notifier callbacks outside of the OOM killer, with two small
>>>> behavior changes explained below.
>>> Can we just eliminate this ugliness and remove it altogether? We do not
>>> have that many notifiers. Is there anything fundamental that would
>>> prevent us from moving them to shrinkers instead?
>> @Hocko Do you remember the lowmemorykiller from android? Some things
>> might not be the right thing for shrinkers.
> Just that lmk did it wrong doesn't mean others have to follow.
>
If all you have is a hammer, everything looks like a nail. (I don't argue that it was right.)
But if you don't have a way to interact with the memory system, we will get attempts like lmk.
For this task, OOM notifiers and vmpressure are better than shrinkers.
Michal Hocko June 25, 2018, 2:12 p.m. UTC | #12
On Mon 25-06-18 16:04:04, peter enderborg wrote:
> On 06/25/2018 03:07 PM, Michal Hocko wrote:
> 
> > On Mon 25-06-18 15:03:40, peter enderborg wrote:
> >> On 06/20/2018 01:55 PM, Michal Hocko wrote:
> >>> On Wed 20-06-18 20:20:38, Tetsuo Handa wrote:
> >>>> Sleeping with oom_lock held can cause an AB-BA lockup bug because
> >>>> __alloc_pages_may_oom() does not wait for oom_lock. Since
> >>>> blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
> >>>> with oom_lock held is currently an unavoidable problem.
> >>> Could you be more specific about the potential deadlock? Sleeping while
> >>> holding oom lock is certainly not nice but I do not see how that would
> >>> result in a deadlock assuming that the sleeping context doesn't sleep on
> >>> the memory allocation obviously.
> >> It is a mutex; you are supposed to be able to sleep while holding it.  It's even exported.
> > What do you mean? oom_lock is certainly not exported for general use. It
> > is not local to oom_kill.c just because it is needed in other _mm_
> > code.
> >  
> 
> It is declared in include/linux/oom.h; if it is that sensitive, it should
> be in mm/ with a documented note about the special rules. Outside mm it is
> only used in drivers/tty/sysrq.c, and that could be replaced by a helper
> function in mm that does the OOM work.

Well, there are many things defined in kernel header files and not meant
for wider use. Using random locks is generally discouraged I would say
unless you are sure you know what you are doing. We could do some more
work to hide internals for sure, though.
 
> >>>> In preparation for not sleeping with oom_lock held, this patch moves
> >>>> OOM notifier callbacks outside of the OOM killer, with two small
> >>>> behavior changes explained below.
> >>> Can we just eliminate this ugliness and remove it altogether? We do not
> >>> have that many notifiers. Is there anything fundamental that would
> >>> prevent us from moving them to shrinkers instead?
> >> @Hocko Do you remember the lowmemorykiller from android? Some things
> >> might not be the right thing for shrinkers.
> > Just that lmk did it wrong doesn't mean others have to follow.
> >
> If all you have is a hammer, everything looks like a nail. (I don't argue that it was right.)
> But if you don't have a way to interact with the memory system, we will get attempts like lmk.
> For this task, OOM notifiers and vmpressure are better than shrinkers.

A lack of a feature should be a trigger for a discussion rather than for a
quick hack that seems to work for a particular usecase, lives out of tree,
then gets to staging in the hope that it will fix itself. Seriously, kernel
development is not nail hammering.
Paul E. McKenney June 26, 2018, 5:03 p.m. UTC | #13
On Thu, Jun 21, 2018 at 08:27:41PM +0900, Tetsuo Handa wrote:
> On 2018/06/21 7:36, David Rientjes wrote:
> >> @@ -1010,6 +1010,33 @@ int unregister_oom_notifier(struct notifier_block *nb)
> >>  EXPORT_SYMBOL_GPL(unregister_oom_notifier);
> >>  
> >>  /**
> >> + * try_oom_notifier - Try to reclaim memory from OOM notifier list.
> >> + *
> >> + * Returns non-zero if notifier callbacks released something, zero otherwise.
> >> + */
> >> +unsigned long try_oom_notifier(void)
> > 
> > It certainly is tried, but based on its usage it would probably be better 
> > to describe what is being returned (it's going to set *did_some_progress, 
> > which is a page count).
> 
> Well, it depends on what the callbacks are doing. Currently, we have 5 users.
> 
>   arch/powerpc/platforms/pseries/cmm.c
>   arch/s390/mm/cmm.c
>   drivers/gpu/drm/i915/i915_gem_shrinker.c
>   drivers/virtio/virtio_balloon.c
>   kernel/rcu/tree_plugin.h
> 
> Speaking of rcu_oom_notify() in kernel/rcu/tree_plugin.h, we can't tell whether
> the callback helped release memory, for it does not update the "freed" argument.

It does not immediately release memory, but instead takes steps to
encourage memory to be released more quickly.

> >> +{
> >> +	static DEFINE_MUTEX(oom_notifier_lock);
> >> +	unsigned long freed = 0;
> >> +
> >> +	/*
> >> +	 * Since OOM notifier callbacks must not depend on __GFP_DIRECT_RECLAIM
> >> +	 * && !__GFP_NORETRY memory allocation, waiting for mutex here is safe.
> >> +	 * If lockdep reports possible deadlock dependency, it will be a bug in
> >> +	 * OOM notifier callbacks.
> >> +	 *
> >> +	 * If SIGKILL is pending, it is likely that current thread was selected
> >> +	 * as an OOM victim. In that case, current thread should return as soon
> >> +	 * as possible using memory reserves.
> >> +	 */
> >> +	if (mutex_lock_killable(&oom_notifier_lock))
> >> +		return 0;
> >> +	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> >> +	mutex_unlock(&oom_notifier_lock);
> >> +	return freed;
> >> +}
> > 
> > If __blocking_notifier_call_chain() used down_read_killable(), could we 
> > eliminate oom_notifier_lock?
> 
> I don't think we can eliminate it now, for it serves as a serialization lock
> (while still responding to SIGKILL as soon as possible), a role currently
> played by mutex_trylock(&oom_lock).
> 
> (1) rcu_oom_notify() in kernel/rcu/tree_plugin.h is not prepared for being
>     called concurrently.
> 
> ----------
> static int rcu_oom_notify(struct notifier_block *self,
> 			  unsigned long notused, void *nfreed)
> {
> 	int cpu;
> 
> 	/* Wait for callbacks from earlier instance to complete. */
> 	wait_event(oom_callback_wq, atomic_read(&oom_callback_count) == 0); // <= Multiple threads can pass this line at the same time.
> 	smp_mb(); /* Ensure callback reuse happens after callback invocation. */
> 
> 	/*
> 	 * Prevent premature wakeup: ensure that all increments happen
> 	 * before there is a chance of the counter reaching zero.
> 	 */
> 	atomic_set(&oom_callback_count, 1); // <= Multiple threads can execute this line at the same time.
> 
> 	for_each_online_cpu(cpu) {
> 		smp_call_function_single(cpu, rcu_oom_notify_cpu, NULL, 1);
> 		cond_resched_tasks_rcu_qs();
> 	}
> 
> 	/* Unconditionally decrement: no need to wake ourselves up. */
> 	atomic_dec(&oom_callback_count); // <= Multiple threads can execute this line at the same time, making oom_callback_count < 0 ?
> 
> 	return NOTIFY_OK;
> }
> ----------
> 
>     The counter inconsistency problem could be fixed by
> 
> -	atomic_set(&oom_callback_count, 1);
> +	atomic_inc(&oom_callback_count);
> 
>     but who would benefit if rcu_oom_notify() became safe to be called
>     concurrently? We want to wait for the callback to complete before
>     proceeding to the OOM killer. I think that we should save CPU resources
>     by serializing concurrent callers.

There are a lot of ways it could be made concurrency safe.  If you need
me to do this, please do let me know.

That said, the way it is now written, if you invoke rcu_oom_notify()
twice in a row, the second invocation will wait until the memory from
the first invocation is freed.  What do you want me to do if you invoke
me concurrently?

1.	One invocation "wins", waits for the earlier callbacks to
	complete, then encourages any subsequent callbacks to be
	processed more quickly.  The other invocations return
	immediately without doing anything.

2.	The invocations serialize, with each invocation waiting for
	the callbacks from previous invocation (in mutex_lock() order
	or some such), and then starting a new round.

3.	Something else?

							Thanx, Paul

> (2) i915_gem_shrinker_oom() in drivers/gpu/drm/i915/i915_gem_shrinker.c depends
>     on mutex_trylock() from shrinker_lock() from i915_gem_shrink() from
>     i915_gem_shrink_all() to return 1 (i.e. succeed) before need_resched()
>     becomes true in order to avoid returning without reclaiming memory.
> 
> > This patch is certainly an improvement because it does the last 
> > get_page_from_freelist() call after invoking the oom notifiers that can 
> > free memory and we've otherwise pointlessly redirected it elsewhere.
> 
> Thanks, but this patch might break the subtle balance which is currently
> achieved by mutex_trylock(&oom_lock) serialization/exclusion.
> 
> (3) virtballoon_oom_notify() in drivers/virtio/virtio_balloon.c by default
>     tries to release 256 pages. Since this value is configurable, one might
>     set 1048576 pages. If virtballoon_oom_notify() is concurrently called by
>     many threads, it might needlessly deflate the memory balloon.
> 
> We might want to remember and reuse the last result among serialized callers
> (feedback mechanism) like
> 
> {
> 	static DEFINE_MUTEX(oom_notifier_lock);
> 	static unsigned long last_freed;
> 	unsigned long freed = 0;
> 	if (mutex_trylock(&oom_notifier_lock)) {
> 		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> 		last_freed = freed;
> 	} else {
> 		mutex_lock(&oom_notifier_lock);
> 		freed = last_freed;
> 	}
> 	mutex_unlock(&oom_notifier_lock);
> 	return freed;
> 
> }
> 
> or
> 
> {
> 	static DEFINE_MUTEX(oom_notifier_lock);
> 	static unsigned long last_freed;
> 	unsigned long freed = 0;
> 	if (mutex_lock_killable(&oom_notifier_lock)) {
> 		freed = last_freed;
> 		last_freed >>= 1;
> 		return freed;
> 	} else if (last_freed) {
> 		freed = last_freed;
> 		last_freed >>= 1;
> 	} else {
> 		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> 		last_freed = freed;
> 	}
> 	mutex_unlock(&oom_notifier_lock);
> 	return freed;
> }
> 
> . Without a feedback mechanism, mutex_lock_killable(&oom_notifier_lock) serialization
> could still needlessly deflate the memory balloon compared to mutex_trylock(&oom_lock)
> serialization/exclusion. Maybe virtballoon_oom_notify() (and the two CMM users) would
> implement a feedback mechanism themselves, by examining watermarks from their OOM
> notifier hooks.
> 
> 
> 
> On 2018/06/21 16:31, Michal Hocko wrote:
> > On Wed 20-06-18 15:36:45, David Rientjes wrote:
> > [...]
> >> That makes me think that "oom_notify_list" isn't very intuitive: it can 
> >> free memory as a last step prior to oom kill.  OOM notify, to me, sounds 
> >> like its only notifying some callbacks about the condition.  Maybe 
> >> oom_reclaim_list and then rename this to oom_reclaim_pages()?
> > 
> > Yes agreed and that is the reason I keep saying we want to get rid of
> > this yet-another-reclaim mechanism. We already have shrinkers which are
> > the main source of non-lru pages reclaim. Why do we even need
> > oom_reclaim_pages? What is fundamentally different here? Sure those
> > pages should be reclaimed as the last resort but we already do have
> > priority for slab shrinking so we know that the system is struggling
> > when reaching the lowest priority. Isn't that enough to express the need
> > for current oom notifier implementations?
> > 
> 
> Even if we update OOM notifier users to use shrinker hooks, they will need a
> subtle balance which is currently achieved by mutex_trylock(&oom_lock).
> 
> Removing the OOM notifier is not doable right now. It is not suitable as a fix
> for the regression introduced by commit 27ae357fa82be5ab ("mm, oom: fix
> concurrent munlock and oom reaper unmap, v3"). What we could afford for this
> regression is https://patchwork.kernel.org/patch/9842889/ which is exactly what
> you suggested in a thread at https://www.spinics.net/lists/linux-mm/msg117896.html .
>
Tetsuo Handa June 26, 2018, 8:10 p.m. UTC | #14
On 2018/06/27 2:03, Paul E. McKenney wrote:
> There are a lot of ways it could be made concurrency safe.  If you need
> me to do this, please do let me know.
> 
> That said, the way it is now written, if you invoke rcu_oom_notify()
> twice in a row, the second invocation will wait until the memory from
> the first invocation is freed.  What do you want me to do if you invoke
> me concurrently?
> 
> 1.	One invocation "wins", waits for the earlier callbacks to
> 	complete, then encourages any subsequent callbacks to be
> 	processed more quickly.  The other invocations return
> 	immediately without doing anything.
> 
> 2.	The invocations serialize, with each invocation waiting for
> 	the callbacks from previous invocation (in mutex_lock() order
> 	or some such), and then starting a new round.
> 
> 3.	Something else?
> 
> 							Thanx, Paul

As far as I can see,

-	atomic_set(&oom_callback_count, 1);
+	atomic_inc(&oom_callback_count);

should be sufficient.
Paul E. McKenney June 26, 2018, 11:50 p.m. UTC | #15
On Wed, Jun 27, 2018 at 05:10:48AM +0900, Tetsuo Handa wrote:
> On 2018/06/27 2:03, Paul E. McKenney wrote:
> > There are a lot of ways it could be made concurrency safe.  If you need
> > me to do this, please do let me know.
> > 
> > That said, the way it is now written, if you invoke rcu_oom_notify()
> > twice in a row, the second invocation will wait until the memory from
> > the first invocation is freed.  What do you want me to do if you invoke
> > me concurrently?
> > 
> > 1.	One invocation "wins", waits for the earlier callbacks to
> > 	complete, then encourages any subsequent callbacks to be
> > 	processed more quickly.  The other invocations return
> > 	immediately without doing anything.
> > 
> > 2.	The invocations serialize, with each invocation waiting for
> > 	the callbacks from previous invocation (in mutex_lock() order
> > 	or some such), and then starting a new round.
> > 
> > 3.	Something else?
> > 
> > 							Thanx, Paul
> 
> As far as I can see,
> 
> -	atomic_set(&oom_callback_count, 1);
> +	atomic_inc(&oom_callback_count);
> 
> should be sufficient.

I don't see how that helps.  For example, suppose that two tasks
invoked rcu_oom_notify() at about the same time.  Then they could
both see oom_callback_count equal to zero, both atomically increment
oom_callback_count, then both do the IPI invoking rcu_oom_notify_cpu()
on each online CPU.

So far, so good.  But rcu_oom_notify_cpu() enqueues a per-CPU RCU
callback, and enqueuing the same callback twice in quick succession
would fatally tangle RCU's callback lists.

What am I missing here?

							Thanx, Paul
Michal Hocko June 27, 2018, 7:22 a.m. UTC | #16
On Tue 26-06-18 10:03:45, Paul E. McKenney wrote:
[...]
> 3.	Something else?

How hard would it be to use a different API than oom notifiers? E.g. a
shrinker which just kicks all the pending callbacks if the reclaim
priority reaches low values (e.g. 0)?
Tetsuo Handa June 27, 2018, 10:52 a.m. UTC | #17
On 2018/06/27 8:50, Paul E. McKenney wrote:
> On Wed, Jun 27, 2018 at 05:10:48AM +0900, Tetsuo Handa wrote:
>> As far as I can see,
>>
>> -	atomic_set(&oom_callback_count, 1);
>> +	atomic_inc(&oom_callback_count);
>>
>> should be sufficient.
> 
> I don't see how that helps.  For example, suppose that two tasks
> invoked rcu_oom_notify() at about the same time.  Then they could
> both see oom_callback_count equal to zero, both atomically increment
> oom_callback_count, then both do the IPI invoking rcu_oom_notify_cpu()
> on each online CPU.
> 
> So far, so good.  But rcu_oom_notify_cpu() enqueues a per-CPU RCU
> callback, and enqueuing the same callback twice in quick succession
> would fatally tangle RCU's callback lists.
> 
> What am I missing here?
> 
> 							Thanx, Paul

You are pointing out that "the number of times rsp->call() is called" can
exceed "the number of times rcu_oom_callback() is called" when invoked
concurrently, aren't you?
Then, you are not missing anything. You will need to use something equivalent
to oom_lock even if you can convert rcu_oom_notify() to use shrinkers.
Paul E. McKenney June 27, 2018, 2:28 p.m. UTC | #18
On Wed, Jun 27, 2018 at 07:52:23PM +0900, Tetsuo Handa wrote:
> On 2018/06/27 8:50, Paul E. McKenney wrote:
> > On Wed, Jun 27, 2018 at 05:10:48AM +0900, Tetsuo Handa wrote:
> >> As far as I can see,
> >>
> >> -	atomic_set(&oom_callback_count, 1);
> >> +	atomic_inc(&oom_callback_count);
> >>
> >> should be sufficient.
> > 
> > I don't see how that helps.  For example, suppose that two tasks
> > invoked rcu_oom_notify() at about the same time.  Then they could
> > both see oom_callback_count equal to zero, both atomically increment
> > oom_callback_count, then both do the IPI invoking rcu_oom_notify_cpu()
> > on each online CPU.
> > 
> > So far, so good.  But rcu_oom_notify_cpu() enqueues a per-CPU RCU
> > callback, and enqueuing the same callback twice in quick succession
> > would fatally tangle RCU's callback lists.
> > 
> > What am I missing here?
> 
> You are pointing out that "the number of times rsp->call() is called" can
> exceed "the number of times rcu_oom_callback() is called" when invoked
> concurrently, aren't you?

Yes.  Reusing an rcu_head before invocation of the earlier use is
very bad indeed.  ;-)

> Then, you are not missing anything. You will need to use something equivalent
> to oom_lock even if you can convert rcu_oom_notify() to use shrinkers.

What should I look at to work out whether it makes sense to convert
rcu_oom_notify() to shrinkers, and if so, how to go about it?

Or are you simply asking me to serialize rcu_oom_notify()?  (Which is
of course not difficult, so please just let me know.)

							Thanx, Paul
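
For reference, the serialized variant (option 2) could be as small as a
dedicated mutex around the existing body; a minimal sketch:

/*
 * Sketch: a dedicated mutex makes the wait/set/IPI/decrement sequence
 * single-threaded, so oom_callback_count can never be set to 1 by two
 * callers at once and no rcu_head is ever enqueued twice.
 */
static DEFINE_MUTEX(rcu_oom_mutex);

static int rcu_oom_notify(struct notifier_block *self,
			  unsigned long notused, void *nfreed)
{
	int cpu;

	mutex_lock(&rcu_oom_mutex);

	/* Wait for callbacks from the earlier instance to complete. */
	wait_event(oom_callback_wq, atomic_read(&oom_callback_count) == 0);
	smp_mb(); /* Ensure callback reuse happens after callback invocation. */

	/* Only one caller can be here, so atomic_set() is now safe. */
	atomic_set(&oom_callback_count, 1);

	for_each_online_cpu(cpu) {
		smp_call_function_single(cpu, rcu_oom_notify_cpu, NULL, 1);
		cond_resched_tasks_rcu_qs();
	}

	/* Unconditionally decrement: no need to wake ourselves up. */
	atomic_dec(&oom_callback_count);

	mutex_unlock(&rcu_oom_mutex);
	return NOTIFY_OK;
}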
Paul E. McKenney June 27, 2018, 2:31 p.m. UTC | #19
On Wed, Jun 27, 2018 at 09:22:07AM +0200, Michal Hocko wrote:
> On Tue 26-06-18 10:03:45, Paul E. McKenney wrote:
> [...]
> > 3.	Something else?
> 
> How hard would it be to use a different API than oom notifiers? E.g. a
> shrinker which just kicks all the pending callbacks if the reclaim
> priority reaches low values (e.g. 0)?

Beats me.  What is a shrinker?  ;-)

More seriously, could you please point me at an exemplary shrinker
use case so I can see what is involved?

							Thanx, Paul
Michal Hocko June 28, 2018, 11:39 a.m. UTC | #20
On Wed 27-06-18 07:31:25, Paul E. McKenney wrote:
> On Wed, Jun 27, 2018 at 09:22:07AM +0200, Michal Hocko wrote:
> > On Tue 26-06-18 10:03:45, Paul E. McKenney wrote:
> > [...]
> > > 3.	Something else?
> > 
> > How hard would it be to use a different API than oom notifiers? E.g. a
> > shrinker which just kicks all the pending callbacks if the reclaim
> > priority reaches low values (e.g. 0)?
> 
> Beats me.  What is a shrinker?  ;-)

This is a generic mechanism to reclaim memory that is not on standard
LRU lists. LWN.net surely has some nice coverage (e.g.
https://lwn.net/Articles/548092/).

> More seriously, could you please point me at an exemplary shrinker
> use case so I can see what is involved?

Well, I am not really sure what the objective of the oom notifier is, so
I cannot point you in the right direction. IIUC you just want to kick
callbacks to be handled sooner under heavy memory pressure, right? How is
that achieved? Kick a worker?
Paul E. McKenney June 28, 2018, 9:31 p.m. UTC | #21
On Thu, Jun 28, 2018 at 01:39:42PM +0200, Michal Hocko wrote:
> On Wed 27-06-18 07:31:25, Paul E. McKenney wrote:
> > On Wed, Jun 27, 2018 at 09:22:07AM +0200, Michal Hocko wrote:
> > > On Tue 26-06-18 10:03:45, Paul E. McKenney wrote:
> > > [...]
> > > > 3.	Something else?
> > > 
> > > How hard would it be to use a different API than oom notifiers? E.g. a
> > > shrinker which just kicks all the pending callbacks if the reclaim
> > > priority reaches low values (e.g. 0)?
> > 
> > Beats me.  What is a shrinker?  ;-)
> 
> This is a generic mechanism to reclaim memory that is not on standard
> LRU lists. LWN.net surely has some nice coverage (e.g.
> https://lwn.net/Articles/548092/).

"In addition, there is little agreement over what a call to a shrinker
really means or how the called subsystem should respond."  ;-)

Is this set up using register_shrinker() in mm/vmscan.c?  I am guessing
that the many mentions of shrinker in DRM are irrelevant.

If my guess is correct, the API seems a poor fit for RCU.  I can
produce an approximate number of RCU callbacks for ->count_objects(),
but a given callback might free a lot of memory or none at all.  Plus,
to actually have ->scan_objects() free them before returning, I would
need to use something like rcu_barrier(), which might involve longer
delays than desired.

Or am I missing something here?
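
A sketch of what such an RCU shrinker might look like makes the mismatch
concrete (rcu_pending_callback_estimate() is a hypothetical helper):

static unsigned long rcu_shrink_count(struct shrinker *s,
				      struct shrink_control *sc)
{
	/* A callback count, not a page count: only a rough signal. */
	return rcu_pending_callback_estimate();	/* hypothetical */
}

static unsigned long rcu_shrink_scan(struct shrinker *s,
				     struct shrink_control *sc)
{
	/*
	 * Guaranteeing that the callbacks have freed memory before
	 * returning would require rcu_barrier(), which may take longer
	 * than reclaim wants; and we still cannot say how many pages
	 * were actually freed.
	 */
	rcu_barrier();
	return 0;
}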

> > More seriously, could you please point me at an exemplary shrinker
> > use case so I can see what is involved?
> 
> Well, I am not really sure what the objective of the oom notifier is, so
> I cannot point you in the right direction. IIUC you just want to kick
> callbacks to be handled sooner under heavy memory pressure, right? How is
> that achieved? Kick a worker?

That is achieved by enqueuing a non-lazy callback on each CPU's callback
list, but only for those CPUs having non-empty lists.  This causes
CPUs with lists containing only lazy callbacks to be more aggressive,
in particular, it prevents such CPUs from hanging out idle for seconds
at a time while they have callbacks on their lists.

The enqueuing happens via an IPI to the CPU in question.

						Thanx, Paul
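
(The per-CPU side described above is roughly the following, modeled on the
kernel/rcu/tree_plugin.h code of that era:)

static void rcu_oom_notify_cpu(void *unused)
{
	struct rcu_state *rsp;
	struct rcu_data *rdp;

	/*
	 * Enqueue one non-lazy callback per RCU flavor, but only if this
	 * CPU already has callbacks pending. Runs in IPI context via
	 * smp_call_function_single().
	 */
	for_each_rcu_flavor(rsp) {
		rdp = raw_cpu_ptr(rsp->rda);
		if (rcu_segcblist_pend_cbs(&rdp->cblist)) {
			atomic_inc(&oom_callback_count);
			rsp->call(&rdp->oom_head, rcu_oom_callback);
		}
	}
}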
Michal Hocko June 29, 2018, 9:04 a.m. UTC | #22
On Thu 28-06-18 14:31:05, Paul E. McKenney wrote:
> On Thu, Jun 28, 2018 at 01:39:42PM +0200, Michal Hocko wrote:
> > On Wed 27-06-18 07:31:25, Paul E. McKenney wrote:
> > > On Wed, Jun 27, 2018 at 09:22:07AM +0200, Michal Hocko wrote:
> > > > On Tue 26-06-18 10:03:45, Paul E. McKenney wrote:
> > > > [...]
> > > > > 3.	Something else?
> > > > 
> > > > How hard would it be to use a different API than oom notifiers? E.g. a
> > > > shrinker which just kicks all the pending callbacks if the reclaim
> > > > priority reaches low values (e.g. 0)?
> > > 
> > > Beats me.  What is a shrinker?  ;-)
> > 
> > This is a generic mechanism to reclaim memory that is not on standard
> > LRU lists. LWN.net surely has some nice coverage (e.g.
> > https://lwn.net/Articles/548092/).
> 
> "In addition, there is little agreement over what a call to a shrinker
> really means or how the called subsystem should respond."  ;-)
> 
> Is this set up using register_shrinker() in mm/vmscan.c?  I am guessing

Yes, exactly. You are supposed to implement the two methods in struct
shrinker (count_objects and scan_objects); each takes a struct shrink_control.

> that the many mentions of shrinker in DRM are irrelevant.
> 
> If my guess is correct, the API seems a poor fit for RCU.  I can
> produce an approximate number of RCU callbacks for ->count_objects(),
> but a given callback might free a lot of memory or none at all.  Plus,
> to actually have ->scan_objects() free them before returning, I would
> need to use something like rcu_barrier(), which might involve longer
> delays than desired.

Well, I am not yet sure how good a fit this is because I still do not
understand the underlying problem your notifier is trying to solve. So I
will get back to this once that is settled.
> 
> Or am I missing something here?
> 
> > > More seriously, could you please point me at an exemplary shrinker
> > > use case so I can see what is involved?
> > 
> > Well, I am not really sure what the objective of the oom notifier is, so
> > I cannot point you in the right direction. IIUC you just want to kick
> > callbacks to be handled sooner under heavy memory pressure, right? How is
> > that achieved? Kick a worker?
> 
> That is achieved by enqueuing a non-lazy callback on each CPU's callback
> list, but only for those CPUs having non-empty lists.  This causes
> CPUs with lists containing only lazy callbacks to be more aggressive,
> in particular, it prevents such CPUs from hanging out idle for seconds
> at a time while they have callbacks on their lists.
> 
> The enqueuing happens via an IPI to the CPU in question.

I am afraid this is too low level for me to understand what is going on
here. What are lazy callbacks and why do they need any specific action
when we are getting close to OOM? I mean, I do understand that we might
have many callers of call_rcu and free memory lazily. But there is quite
a long way from the time we start reclaim until we reach the OOM killer path.
So why don't those callbacks get called during that time period? How are
they triggered when we are not hitting the OOM path? They surely cannot
sit there forever, right? Can we trigger them sooner? Maybe the
shrinker is not the best fit but we have a retry feedback loop in the page
allocator; maybe we can kick this processing from there.
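
A sketch of that last idea, where rcu_kick_callbacks() is a made-up hook
standing in for whatever RCU would export, and MAX_RECLAIM_RETRIES is the
allocator's existing retry cap:

/*
 * Hypothetical: called from the page allocator's retry feedback loop
 * (should_reclaim_retry()). Kick deferred RCU callbacks once half of
 * the retry budget is spent, i.e. well before the OOM killer would be
 * invoked.
 */
static void reclaim_retry_feedback(int no_progress_loops)
{
	if (no_progress_loops > MAX_RECLAIM_RETRIES / 2)
		rcu_kick_callbacks();	/* hypothetical RCU export */
}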
Paul E. McKenney June 29, 2018, 12:52 p.m. UTC | #23
On Fri, Jun 29, 2018 at 11:04:19AM +0200, Michal Hocko wrote:
> On Thu 28-06-18 14:31:05, Paul E. McKenney wrote:
> > On Thu, Jun 28, 2018 at 01:39:42PM +0200, Michal Hocko wrote:
> > > On Wed 27-06-18 07:31:25, Paul E. McKenney wrote:
> > > > On Wed, Jun 27, 2018 at 09:22:07AM +0200, Michal Hocko wrote:
> > > > > On Tue 26-06-18 10:03:45, Paul E. McKenney wrote:
> > > > > [...]
> > > > > > 3.	Something else?
> > > > > 
> > > > > How hard would it be to use a different API than oom notifiers? E.g. a
> > > > > shrinker which just kicks all the pending callbacks if the reclaim
> > > > > priority reaches low values (e.g. 0)?
> > > > 
> > > > Beats me.  What is a shrinker?  ;-)
> > > 
> > > This is a generic mechanism to reclaim memory that is not on standard
> > > LRU lists. Lwn.net surely has some nice coverage (e.g.
> > > https://lwn.net/Articles/548092/).
> > 
> > "In addition, there is little agreement over what a call to a shrinker
> > really means or how the called subsystem should respond."  ;-)
> > 
> > Is this set up using register_shrinker() in mm/vmscan.c?  I am guessing
> 
> Yes, exactly. You are supposed to implement the two methods in struct
> shrinker.
> 
> > that the many mentions of shrinker in DRM are irrelevant.
> > 
> > If my guess is correct, the API seems a poor fit for RCU.  I can
> > produce an approximate number of RCU callbacks for ->count_objects(),
> > but a given callback might free a lot of memory or none at all.  Plus,
> > to actually have ->scan_objects() free them before returning, I would
> > need to use something like rcu_barrier(), which might involve longer
> > delays than desired.
> 
> Well, I am not yet sure how good a fit this is because I still do not
> understand the underlying problem your notifier is trying to solve. So I
> will get back to this once that is settled.
> > 
> > Or am I missing something here?
> > 
> > > > More seriously, could you please point me at an exemplary shrinker
> > > > use case so I can see what is involved?
> > > 
> > > Well, I am not really sure what the objective of the oom notifier is, so
> > > I cannot point you in the right direction. IIUC you just want to kick
> > > callbacks to be handled sooner under heavy memory pressure, right? How is
> > > that achieved? Kick a worker?
> > 
> > That is achieved by enqueuing a non-lazy callback on each CPU's callback
> > list, but only for those CPUs having non-empty lists.  This causes
> > CPUs with lists containing only lazy callbacks to be more aggressive,
> > in particular, it prevents such CPUs from hanging out idle for seconds
> > at a time while they have callbacks on their lists.
> > 
> > The enqueuing happens via an IPI to the CPU in question.
> 
> I am afraid this is too low level for me to understand what is going on
> here. What are lazy callbacks and why do they need any specific action
> when we are getting close to OOM? I mean, I do understand that we might
> have many callers of call_rcu and free memory lazily. But there is quite
> a long way from the start of reclaim until we reach the OOM killer path.
> So why don't those callbacks get called during that time period? How are
> they triggered when we are not hitting the OOM path? They surely cannot
> sit there for ever, right? Can we trigger them sooner? Maybe the
> shrinker is not the best fit, but we have a retry feedback loop in the page
> allocator; maybe we can kick this processing from there.

The effect of RCU's current OOM code is to speed up callback invocation
by at most a few seconds (assuming no stalled CPUs; with stalled CPUs,
it is not possible to speed up callback invocation).

Given that, I should just remove RCU's OOM code entirely?

							Thanx, Paul
Michal Hocko June 29, 2018, 1:26 p.m. UTC | #24
On Fri 29-06-18 05:52:18, Paul E. McKenney wrote:
> On Fri, Jun 29, 2018 at 11:04:19AM +0200, Michal Hocko wrote:
> > On Thu 28-06-18 14:31:05, Paul E. McKenney wrote:
> > > On Thu, Jun 28, 2018 at 01:39:42PM +0200, Michal Hocko wrote:
[...]
> > > > Well, I am not really sure what the objective of the oom notifier is, so
> > > > I cannot point you in the right direction. IIUC you just want to kick
> > > > callbacks to be handled sooner under heavy memory pressure, right? How is
> > > > that achieved? Kick a worker?
> > > 
> > > That is achieved by enqueuing a non-lazy callback on each CPU's callback
> > > list, but only for those CPUs having non-empty lists.  This causes
> > > CPUs with lists containing only lazy callbacks to be more aggressive,
> > > in particular, it prevents such CPUs from hanging out idle for seconds
> > > at a time while they have callbacks on their lists.
> > > 
> > > The enqueuing happens via an IPI to the CPU in question.
> > 
> > I am afraid this is too low level for me to understand what is going on
> > here. What are lazy callbacks and why do they need any specific action
> > when we are getting close to OOM? I mean, I do understand that we might
> > have many callers of call_rcu and free memory lazily. But there is quite
> > a long way from the start of reclaim until we reach the OOM killer path.
> > So why don't those callbacks get called during that time period? How are
> > they triggered when we are not hitting the OOM path? They surely cannot
> > sit there for ever, right? Can we trigger them sooner? Maybe the
> > shrinker is not the best fit, but we have a retry feedback loop in the page
> > allocator; maybe we can kick this processing from there.
> 
> The effect of RCU's current OOM code is to speed up callback invocation
> by at most a few seconds (assuming no stalled CPUs; with stalled CPUs,
> it is not possible to speed up callback invocation).
> 
> Given that, I should just remove RCU's OOM code entirely?

Yeah, it seems so. I do not see how this would really help much. If we
really need some way to kick callbacks then we should do so much earlier
in the reclaim process - e.g. when we start struggling to reclaim any
memory.
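
(Concretely, such a kick could be driven from the allocator's retry loop;
rcu_kick_callbacks() is hypothetical and the threshold is arbitrary:)

	/* Sketch: in should_reclaim_retry(), once progress stalls, ask RCU
	 * (and similar lazy-freeing subsystems) to stop batching. */
	if (*no_progress_loops > MAX_RECLAIM_RETRIES / 2)
		rcu_kick_callbacks();		/* does not exist today */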

I am curious. Has the notifier been motivated by a real-world use case,
or was it a "nice thing to do"?
Tetsuo Handa June 29, 2018, 2:35 p.m. UTC | #25
On 2018/06/29 21:52, Paul E. McKenney wrote:
> The effect of RCU's current OOM code is to speed up callback invocation
> by at most a few seconds (assuming no stalled CPUs; with stalled CPUs,
> it is not possible to speed up callback invocation).
> 
> Given that, I should just remove RCU's OOM code entirely?

out_of_memory() will start selecting an OOM victim without calling
get_page_from_freelist(), since rcu_oom_notify() does not set a non-zero
value in the "freed" field.

I think that rcu_oom_notify() needs to wait for completion of the callback
invocations (possibly with a timeout in case there are stalling CPUs) and
set a non-zero value in the "freed" field if the pending callbacks did
release memory.

However, what will be difficult to tell is whether invoking the pending callbacks
did release memory. The lack of a last-second get_page_from_freelist() call after
blocking_notifier_call_chain(&oom_notify_list, 0, &freed) forces rcu_oom_notify()
to set an appropriate value (i.e. zero or non-zero) in the "freed" field.

We have tried to move the really-last-second get_page_from_freelist() call to inside
out_of_memory(), after blocking_notifier_call_chain(&oom_notify_list, 0, &freed).
But that proposal was not accepted...

We could move blocking_notifier_call_chain(&oom_notify_list, 0, &freed) to
before the last-second get_page_from_freelist() call (which is what this patch
is trying to do), which would allow rcu_oom_notify() to always return 0...
or update rcu_oom_notify() to use the shrinker API...
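
(For reference, an OOM notifier reports progress through the "freed"
argument; a minimal sketch, with hypothetical foo_* names:)

	static int foo_oom_notify(struct notifier_block *nb,
				  unsigned long unused, void *parm)
	{
		unsigned long *freed = parm;

		*freed += foo_reclaim_pages();	/* pages actually freed */
		return NOTIFY_OK;
	}

	static struct notifier_block foo_oom_nb = {
		.notifier_call = foo_oom_notify,
	};

	static int __init foo_init(void)
	{
		return register_oom_notifier(&foo_oom_nb);
	}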
Paul E. McKenney June 30, 2018, 5:05 p.m. UTC | #26
On Fri, Jun 29, 2018 at 03:26:38PM +0200, Michal Hocko wrote:
> On Fri 29-06-18 05:52:18, Paul E. McKenney wrote:
> > On Fri, Jun 29, 2018 at 11:04:19AM +0200, Michal Hocko wrote:
> > > On Thu 28-06-18 14:31:05, Paul E. McKenney wrote:
> > > > On Thu, Jun 28, 2018 at 01:39:42PM +0200, Michal Hocko wrote:
> [...]
> > > > > Well, I am not really sure what the objective of the oom notifier is, so
> > > > > I cannot point you in the right direction. IIUC you just want to kick
> > > > > callbacks to be handled sooner under heavy memory pressure, right? How is
> > > > > that achieved? Kick a worker?
> > > > 
> > > > That is achieved by enqueuing a non-lazy callback on each CPU's callback
> > > > list, but only for those CPUs having non-empty lists.  This causes
> > > > CPUs with lists containing only lazy callbacks to be more aggressive,
> > > > in particular, it prevents such CPUs from hanging out idle for seconds
> > > > at a time while they have callbacks on their lists.
> > > > 
> > > > The enqueuing happens via an IPI to the CPU in question.
> > > 
> > > I am afraid this is too low level for me to understand what is going on
> > > here. What are lazy callbacks and why do they need any specific action
> > > when we are getting close to OOM? I mean, I do understand that we might
> > > have many callers of call_rcu and free memory lazily. But there is quite
> > > a long way from the start of reclaim until we reach the OOM killer path.
> > > So why don't those callbacks get called during that time period? How are
> > > they triggered when we are not hitting the OOM path? They surely cannot
> > > sit there for ever, right? Can we trigger them sooner? Maybe the
> > > shrinker is not the best fit, but we have a retry feedback loop in the page
> > > allocator; maybe we can kick this processing from there.
> > 
> > The effect of RCU's current OOM code is to speed up callback invocation
> > by at most a few seconds (assuming no stalled CPUs; with stalled CPUs,
> > it is not possible to speed up callback invocation).
> > 
> > Given that, I should just remove RCU's OOM code entirely?
> 
> Yeah, it seems so. I do not see how this would really help much. If we
> really need some way to kick callbacks then we should do so much earlier
> in the reclaim process - e.g. when we start struggling to reclaim any
> memory.

One approach would be to tell RCU "It is time to trade CPU for memory"
at the beginning of that struggle and then tell RCU "Go back to optimizing
for CPU" at the end of that struggle.  Is there already a way to do this?
If so, RCU should probably just switch to it.

But what is the typical duration of such a struggle?  Does this duration
change with workload?  (I suspect that the answers are "who knows?" and
"yes", but you tell me!)  Are there other oom handlers that would prefer
the approach of the previous paragraph?
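
(Such an interface might be as small as a pair of hooks; these names are
hypothetical, nothing like them exists in the tree:)

	void rcu_prefer_memory(void);	/* trade CPU for memory: run callbacks promptly */
	void rcu_prefer_cpu(void);	/* back to energy-efficient batching */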

> I am curious. Has the notifier been motivated by a real-world use case,
> or was it a "nice thing to do"?

It was introduced by b626c1b689364 ("rcu: Provide OOM handler to motivate
lazy RCU callbacks").  The motivation for this commit was a set of changes
that improved energy efficiency by making CPUs sleep for longer when all
of their pending callbacks were known to only free memory (as opposed
to doing a wakeup or some such).  Prior to this set of changes, a CPU
with callbacks would invoke those callbacks (thus freeing the memory)
within a jiffy or so of the end of a grace period.  After this set of
changes, a CPU might wait several seconds.  This was a concern to people
with small-memory systems, hence commit b626c1b689364.

							Thanx, Paul
Paul E. McKenney June 30, 2018, 5:19 p.m. UTC | #27
On Fri, Jun 29, 2018 at 11:35:48PM +0900, Tetsuo Handa wrote:
> On 2018/06/29 21:52, Paul E. McKenney wrote:
> > The effect of RCU's current OOM code is to speed up callback invocation
> > by at most a few seconds (assuming no stalled CPUs; with stalled CPUs,
> > it is not possible to speed up callback invocation).
> > 
> > Given that, I should just remove RCU's OOM code entirely?
> 
> out_of_memory() will start selecting an OOM victim without calling
> get_page_from_freelist(), since rcu_oom_notify() does not set a non-zero
> value in the "freed" field.
> 
> I think that rcu_oom_notify() needs to wait for completion of the callback
> invocations (possibly with a timeout in case there are stalling CPUs) and
> set a non-zero value in the "freed" field if the pending callbacks did
> release memory.

Waiting for the callbacks is easy.  Timeouts would be a bit harder, but
still doable.  I have no idea how to tell which callbacks freed memory
and how much -- all RCU does is invoke a function, and that function
can do whatever its developer wants.

> However, what will be difficult to tell is whether invoking the pending callbacks
> did release memory. The lack of a last-second get_page_from_freelist() call after
> blocking_notifier_call_chain(&oom_notify_list, 0, &freed) forces rcu_oom_notify()
> to set an appropriate value (i.e. zero or non-zero) in the "freed" field.
> 
> We have tried to move the really-last-second get_page_from_freelist() call to inside
> out_of_memory(), after blocking_notifier_call_chain(&oom_notify_list, 0, &freed).
> But that proposal was not accepted...
> 
> We could move blocking_notifier_call_chain(&oom_notify_list, 0, &freed) to
> before the last-second get_page_from_freelist() call (which is what this patch
> is trying to do), which would allow rcu_oom_notify() to always return 0...
> or update rcu_oom_notify() to use the shrinker API...

Would it be possible to tell RCU that memory was starting to get tight
with one call, and then tell it that things are OK with another call?
That would make much more sense from an RCU perspective.

							Thanx, Paul
Michal Hocko July 2, 2018, noon UTC | #28
On Sat 30-06-18 10:05:22, Paul E. McKenney wrote:
> On Fri, Jun 29, 2018 at 03:26:38PM +0200, Michal Hocko wrote:
> > On Fri 29-06-18 05:52:18, Paul E. McKenney wrote:
> > > On Fri, Jun 29, 2018 at 11:04:19AM +0200, Michal Hocko wrote:
> > > > On Thu 28-06-18 14:31:05, Paul E. McKenney wrote:
> > > > > On Thu, Jun 28, 2018 at 01:39:42PM +0200, Michal Hocko wrote:
> > [...]
> > > > > > Well, I am not really sure what the objective of the oom notifier is, so
> > > > > > I cannot point you in the right direction. IIUC you just want to kick
> > > > > > callbacks to be handled sooner under heavy memory pressure, right? How is
> > > > > > that achieved? Kick a worker?
> > > > > 
> > > > > That is achieved by enqueuing a non-lazy callback on each CPU's callback
> > > > > list, but only for those CPUs having non-empty lists.  This causes
> > > > > CPUs with lists containing only lazy callbacks to be more aggressive,
> > > > > in particular, it prevents such CPUs from hanging out idle for seconds
> > > > > at a time while they have callbacks on their lists.
> > > > > 
> > > > > The enqueuing happens via an IPI to the CPU in question.
> > > > 
> > > > I am afraid this is too low level for me to understand what is going on
> > > > here. What are lazy callbacks and why do they need any specific action
> > > > when we are getting close to OOM? I mean, I do understand that we might
> > > > have many callers of call_rcu and free memory lazily. But there is quite
> > > > a long way from the start of reclaim until we reach the OOM killer path.
> > > > So why don't those callbacks get called during that time period? How are
> > > > they triggered when we are not hitting the OOM path? They surely cannot
> > > > sit there for ever, right? Can we trigger them sooner? Maybe the
> > > > shrinker is not the best fit, but we have a retry feedback loop in the page
> > > > allocator; maybe we can kick this processing from there.
> > > 
> > > The effect of RCU's current OOM code is to speed up callback invocation
> > > by at most a few seconds (assuming no stalled CPUs; with stalled CPUs,
> > > it is not possible to speed up callback invocation).
> > > 
> > > Given that, I should just remove RCU's OOM code entirely?
> > 
> > Yeah, it seems so. I do not see how this would really help much. If we
> > really need some way to kick callbacks then we should do so much earlier
> > in the reclaim process - e.g. when we start struggling to reclaim any
> > memory.
> 
> One approach would be to tell RCU "It is time to trade CPU for memory"
> at the beginning of that struggle and then tell RCU "Go back to optimizing
> for CPU" at the end of that struggle.  Is there already a way to do this?
> If so, RCU should probably just switch to it.

Well, I can think of the first "we are struggling" part. This would be
anytime we are struggling to reclaim any memory. If we knew how much memory
can be sitting in callbacks then it would certainly help to make an
educated decision, but I am worried we do not have any number to look at.
Or maybe we do have the number of pending callbacks and could use it to
decide when to kick the processing?

The other part, going back to optimizing for CPU, is harder. We are
usually not running the code when we do not struggle ;)

> But what is the typical duration of such a struggle?  Does this duration
> change with workload?  (I suspect that the answers are "who knows?" and
> "yes", but you tell me!)  Are there other oom handlers that would prefer
> the approach of the previous paragraph?
> 
> > I am curious. Has the notifier been motivated by a real world use case
> > or it was "nice thing to do"?
> 
> It was introduced by b626c1b689364 ("rcu: Provide OOM handler to motivate
> lazy RCU callbacks").  The motivation for this commit was a set of changes
> that improved energy efficiency by making CPUs sleep for longer when all
> of their pending callbacks were known to only free memory (as opposed
> to doing a wakeup or some such).  Prior to this set of changes, a CPU
> with callbacks would invoke those callbacks (thus freeing the memory)
> within a jiffy or so of the end of a grace period.  After this set of
> changes, a CPU might wait several seconds.  This was a concern to people
> with small-memory systems, hence commit b626c1b689364.

Do we have any real-life examples of OOMs caused by the lazy execution
of rcu callbacks? If not, then I would simply drop the notifier and get
back to _some_ implementation if somebody sees a real problem.

Patch

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 6adac11..085b033 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -101,6 +101,7 @@  extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
 
+extern unsigned long try_oom_notifier(void);
 extern bool out_of_memory(struct oom_control *oc);
 
 extern void exit_oom_victim(void);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 84081e7..2ff5db2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1010,6 +1010,33 @@  int unregister_oom_notifier(struct notifier_block *nb)
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
 /**
+ * try_oom_notifier - Try to reclaim memory from OOM notifier list.
+ *
+ * Returns non-zero if notifier callbacks released something, zero otherwise.
+ */
+unsigned long try_oom_notifier(void)
+{
+	static DEFINE_MUTEX(oom_notifier_lock);
+	unsigned long freed = 0;
+
+	/*
+	 * Since OOM notifier callbacks must not depend on __GFP_DIRECT_RECLAIM
+	 * && !__GFP_NORETRY memory allocation, waiting for mutex here is safe.
+	 * If lockdep reports possible deadlock dependency, it will be a bug in
+	 * OOM notifier callbacks.
+	 *
+	 * If SIGKILL is pending, it is likely that current thread was selected
+	 * as an OOM victim. In that case, current thread should return as soon
+	 * as possible using memory reserves.
+	 */
+	if (mutex_lock_killable(&oom_notifier_lock))
+		return 0;
+	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
+	mutex_unlock(&oom_notifier_lock);
+	return freed;
+}
+
+/**
  * out_of_memory - kill the "best" process when we run out of memory
  * @oc: pointer to struct oom_control
  *
@@ -1020,19 +1047,11 @@  int unregister_oom_notifier(struct notifier_block *nb)
  */
 bool out_of_memory(struct oom_control *oc)
 {
-	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 
 	if (oom_killer_disabled)
 		return false;
 
-	if (!is_memcg_oom(oc)) {
-		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
-		if (freed > 0)
-			/* Got some memory back in the last second. */
-			return true;
-	}
-
 	/*
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1521100..c72ef1e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3447,10 +3447,50 @@  void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	return page;
 }
 
+static inline bool can_oomkill(gfp_t gfp_mask, unsigned int order,
+			       const struct alloc_context *ac)
+{
+	/* Coredumps can quickly deplete all memory reserves */
+	if (current->flags & PF_DUMPCORE)
+		return false;
+	/* The OOM killer will not help higher order allocs */
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		return false;
+	/*
+	 * We have already exhausted all our reclaim opportunities without any
+	 * success so it is time to admit defeat. We will skip the OOM killer
+	 * because it is very likely that the caller has a more reasonable
+	 * fallback than shooting a random task.
+	 */
+	if (gfp_mask & __GFP_RETRY_MAYFAIL)
+		return false;
+	/* The OOM killer does not needlessly kill tasks for lowmem */
+	if (ac->high_zoneidx < ZONE_NORMAL)
+		return false;
+	if (pm_suspended_storage())
+		return false;
+	/*
+	 * XXX: GFP_NOFS allocations should rather fail than rely on
+	 * other request to make a forward progress.
+	 * We are in an unfortunate situation where out_of_memory cannot
+	 * do much for this context but let's try it to at least get
+	 * access to memory reserved if the current task is killed (see
+	 * out_of_memory). Once filesystems are ready to handle allocation
+	 * failures more gracefully we should just bail out here.
+	 */
+
+	/* The OOM killer may not free memory on a specific node */
+	if (gfp_mask & __GFP_THISNODE)
+		return false;
+
+	return true;
+}
+
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
 {
+	const bool oomkill = can_oomkill(gfp_mask, order, ac);
 	struct oom_control oc = {
 		.zonelist = ac->zonelist,
 		.nodemask = ac->nodemask,
@@ -3462,6 +3502,10 @@  void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 
 	*did_some_progress = 0;
 
+	/* Try to reclaim via OOM notifier callback. */
+	if (oomkill)
+		*did_some_progress = try_oom_notifier();
+
 	/*
 	 * Acquire the oom lock.  If that fails, somebody else is
 	 * making progress for us.
@@ -3485,37 +3529,7 @@  void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	if (page)
 		goto out;
 
-	/* Coredumps can quickly deplete all memory reserves */
-	if (current->flags & PF_DUMPCORE)
-		goto out;
-	/* The OOM killer will not help higher order allocs */
-	if (order > PAGE_ALLOC_COSTLY_ORDER)
-		goto out;
-	/*
-	 * We have already exhausted all our reclaim opportunities without any
-	 * success so it is time to admit defeat. We will skip the OOM killer
-	 * because it is very likely that the caller has a more reasonable
-	 * fallback than shooting a random task.
-	 */
-	if (gfp_mask & __GFP_RETRY_MAYFAIL)
-		goto out;
-	/* The OOM killer does not needlessly kill tasks for lowmem */
-	if (ac->high_zoneidx < ZONE_NORMAL)
-		goto out;
-	if (pm_suspended_storage())
-		goto out;
-	/*
-	 * XXX: GFP_NOFS allocations should rather fail than rely on
-	 * other request to make a forward progress.
-	 * We are in an unfortunate situation where out_of_memory cannot
-	 * do much for this context but let's try it to at least get
-	 * access to memory reserved if the current task is killed (see
-	 * out_of_memory). Once filesystems are ready to handle allocation
-	 * failures more gracefully we should just bail out here.
-	 */
-
-	/* The OOM killer may not free memory on a specific node */
-	if (gfp_mask & __GFP_THISNODE)
+	if (!oomkill)
 		goto out;
 
 	/* Exhausted what can be done so it's blame time */