mm: mempolicy: don't select exited threads as OOM victims
diff mbox series

Message ID 1561807474-10317-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
State New
Headers show
Series
  • mm: mempolicy: don't select exited threads as OOM victims
Related show

Commit Message

Tetsuo Handa June 29, 2019, 11:24 a.m. UTC
Since mpol_put_task_policy() in do_exit() sets mempolicy = NULL,
mempolicy_nodemask_intersects() considers exited threads (e.g. a process
with dying leader and live threads) as eligible. But it is possible that
all of live threads are still ineligible.

Since has_intersects_mems_allowed() returns true as soon as one of threads
is considered eligible, mempolicy_nodemask_intersects() needs to consider
exited threads as ineligible. Since exit_mm() in do_exit() sets mm = NULL
before mpol_put_task_policy() sets mempolicy = NULL, we can exclude exited
threads by checking whether mm is NULL.

While at it, since mempolicy_nodemask_intersects() is called by only
has_intersects_mems_allowed(), it is guaranteed that mask != NULL.

BTW, are there processes where some of threads use MPOL_{BIND,INTERLEAVE}
and the rest do not use MPOL_{BIND,INTERLEAVE} ? If no, we can use
find_lock_task_mm() instead of for_each_thread() for mask != NULL case
in has_intersects_mems_allowed().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/mempolicy.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Comments

Michal Hocko July 1, 2019, 11:17 a.m. UTC | #1
On Sat 29-06-19 20:24:34, Tetsuo Handa wrote:
> Since mpol_put_task_policy() in do_exit() sets mempolicy = NULL,
> mempolicy_nodemask_intersects() considers exited threads (e.g. a process
> with dying leader and live threads) as eligible. But it is possible that
> all of live threads are still ineligible.
> 
> Since has_intersects_mems_allowed() returns true as soon as one of threads
> is considered eligible, mempolicy_nodemask_intersects() needs to consider
> exited threads as ineligible. Since exit_mm() in do_exit() sets mm = NULL
> before mpol_put_task_policy() sets mempolicy = NULL, we can exclude exited
> threads by checking whether mm is NULL.

Ok, this makes sense. For this change
Acked-by: Michal Hocko <mhocko@suse.com>

> While at it, since mempolicy_nodemask_intersects() is called by only
> has_intersects_mems_allowed(), it is guaranteed that mask != NULL.

Well, I am not really sure. It's true that mempolicy_nodemask_intersects
is currently only used by OOM path and never with mask == NULL but
the function seems to be more generic and hadnling NULL mask seems
reasonable to me. This is not a hot path that an additional check would
be harmful, right?
 
> BTW, are there processes where some of threads use MPOL_{BIND,INTERLEAVE}
> and the rest do not use MPOL_{BIND,INTERLEAVE} ? If no, we can use
> find_lock_task_mm() instead of for_each_thread() for mask != NULL case
> in has_intersects_mems_allowed().

I am afraid that mempolicy is allowed to be per thread which is quite
ugly and I am afraid we cannot change that right now.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/mempolicy.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 01600d8..938f0a0 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1974,11 +1974,10 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
>  					const nodemask_t *mask)
>  {
>  	struct mempolicy *mempolicy;
> -	bool ret = true;
> +	bool ret;
>  
> -	if (!mask)
> -		return ret;
>  	task_lock(tsk);
> +	ret = tsk->mm;
>  	mempolicy = tsk->mempolicy;
>  	if (!mempolicy)
>  		goto out;
> -- 
> 1.8.3.1
Tetsuo Handa July 1, 2019, 1:04 p.m. UTC | #2
On 2019/07/01 20:17, Michal Hocko wrote:
> On Sat 29-06-19 20:24:34, Tetsuo Handa wrote:
>> Since mpol_put_task_policy() in do_exit() sets mempolicy = NULL,
>> mempolicy_nodemask_intersects() considers exited threads (e.g. a process
>> with dying leader and live threads) as eligible. But it is possible that
>> all of live threads are still ineligible.
>>
>> Since has_intersects_mems_allowed() returns true as soon as one of threads
>> is considered eligible, mempolicy_nodemask_intersects() needs to consider
>> exited threads as ineligible. Since exit_mm() in do_exit() sets mm = NULL
>> before mpol_put_task_policy() sets mempolicy = NULL, we can exclude exited
>> threads by checking whether mm is NULL.
> 
> Ok, this makes sense. For this change
> Acked-by: Michal Hocko <mhocko@suse.com>
> 

But I realized that this patch was too optimistic. We need to wait for mm-less
threads until MMF_OOM_SKIP is set if the process was already an OOM victim. If
we fail to allow the process to reach MMF_OOM_SKIP test, the process will be
ignored by the OOM killer as soon as all threads pass mm = NULL at exit_mm(), for
has_intersects_mems_allowed() returns false unless MPOL_{BIND,INTERLEAVE} is used.

Well, the problem is that exited threads prematurely set mempolicy = NULL.
Since bitmap memory for cpuset_mems_allowed_intersects() path is freed when
__put_task_struct() is called, mempolicy memory for mempolicy_nodemask_intersects()
path should be freed as well when __put_task_struct() is called?

diff --git a/kernel/exit.c b/kernel/exit.c
index a75b6a7..02a60ea 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -897,7 +897,6 @@ void __noreturn do_exit(long code)
 	exit_tasks_rcu_start();
 	exit_notify(tsk, group_dead);
 	proc_exit_connector(tsk);
-	mpol_put_task_policy(tsk);
 #ifdef CONFIG_FUTEX
 	if (unlikely(current->pi_state_cache))
 		kfree(current->pi_state_cache);
diff --git a/kernel/fork.c b/kernel/fork.c
index 6166790..c17e436 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -726,6 +726,7 @@ void __put_task_struct(struct task_struct *tsk)
 	WARN_ON(refcount_read(&tsk->usage));
 	WARN_ON(tsk == current);
 
+	mpol_put_task_policy(tsk);
 	cgroup_free(tsk);
 	task_numa_free(tsk);
 	security_task_free(tsk);
Michal Hocko July 1, 2019, 1:17 p.m. UTC | #3
On Mon 01-07-19 22:04:22, Tetsuo Handa wrote:
> On 2019/07/01 20:17, Michal Hocko wrote:
> > On Sat 29-06-19 20:24:34, Tetsuo Handa wrote:
> >> Since mpol_put_task_policy() in do_exit() sets mempolicy = NULL,
> >> mempolicy_nodemask_intersects() considers exited threads (e.g. a process
> >> with dying leader and live threads) as eligible. But it is possible that
> >> all of live threads are still ineligible.
> >>
> >> Since has_intersects_mems_allowed() returns true as soon as one of threads
> >> is considered eligible, mempolicy_nodemask_intersects() needs to consider
> >> exited threads as ineligible. Since exit_mm() in do_exit() sets mm = NULL
> >> before mpol_put_task_policy() sets mempolicy = NULL, we can exclude exited
> >> threads by checking whether mm is NULL.
> > 
> > Ok, this makes sense. For this change
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > 
> 
> But I realized that this patch was too optimistic. We need to wait for mm-less
> threads until MMF_OOM_SKIP is set if the process was already an OOM victim.

If the process is an oom victim then _all_ threads are so as well
because that is the address space property. And we already do check that
before reaching oom_badness IIRC. So what is the actual problem you are
trying to solve here?

> If
> we fail to allow the process to reach MMF_OOM_SKIP test, the process will be
> ignored by the OOM killer as soon as all threads pass mm = NULL at exit_mm(), for
> has_intersects_mems_allowed() returns false unless MPOL_{BIND,INTERLEAVE} is used.
> 
> Well, the problem is that exited threads prematurely set mempolicy = NULL.
> Since bitmap memory for cpuset_mems_allowed_intersects() path is freed when
> __put_task_struct() is called, mempolicy memory for mempolicy_nodemask_intersects()
> path should be freed as well when __put_task_struct() is called?

I am sorry but I have hard time understanding what is the actual user
visible problem here.
Tetsuo Handa July 1, 2019, 1:38 p.m. UTC | #4
On 2019/07/01 22:17, Michal Hocko wrote:
> On Mon 01-07-19 22:04:22, Tetsuo Handa wrote:
>> But I realized that this patch was too optimistic. We need to wait for mm-less
>> threads until MMF_OOM_SKIP is set if the process was already an OOM victim.
> 
> If the process is an oom victim then _all_ threads are so as well
> because that is the address space property. And we already do check that
> before reaching oom_badness IIRC. So what is the actual problem you are
> trying to solve here?

I'm talking about behavioral change after tsk became an OOM victim.

If tsk->signal->oom_mm != NULL, we have to wait for MMF_OOM_SKIP even if
tsk->mm == NULL. Otherwise, the OOM killer selects next OOM victim as soon as
oom_unkillable_task() returned true because has_intersects_mems_allowed() returned
false because mempolicy_nodemask_intersects() returned false because all thread's
mm became NULL (despite tsk->signal->oom_mm != NULL).

static int oom_evaluate_task(struct task_struct *task, void *arg)
{
  if (oom_unkillable_task(task, NULL, oc->nodemask))
    goto next;
  if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
    if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
      goto next;
    goto abort;
  }
}
Michal Hocko July 1, 2019, 1:48 p.m. UTC | #5
On Mon 01-07-19 22:38:58, Tetsuo Handa wrote:
> On 2019/07/01 22:17, Michal Hocko wrote:
> > On Mon 01-07-19 22:04:22, Tetsuo Handa wrote:
> >> But I realized that this patch was too optimistic. We need to wait for mm-less
> >> threads until MMF_OOM_SKIP is set if the process was already an OOM victim.
> > 
> > If the process is an oom victim then _all_ threads are so as well
> > because that is the address space property. And we already do check that
> > before reaching oom_badness IIRC. So what is the actual problem you are
> > trying to solve here?
> 
> I'm talking about behavioral change after tsk became an OOM victim.
> 
> If tsk->signal->oom_mm != NULL, we have to wait for MMF_OOM_SKIP even if
> tsk->mm == NULL. Otherwise, the OOM killer selects next OOM victim as soon as
> oom_unkillable_task() returned true because has_intersects_mems_allowed() returned
> false because mempolicy_nodemask_intersects() returned false because all thread's
> mm became NULL (despite tsk->signal->oom_mm != NULL).

OK, I finally got your point. It was not clear that you are referring to
the code _after_ the patch you are proposing. You are indeed right that
this would have a side effect that an additional victim could be
selected even though the current process hasn't terminated yet. Sigh,
another example how the whole thing is subtle so I retract my Ack and
request a real life example of where this matters before we think about
a proper fix and make the code even more complex.
Tetsuo Handa July 1, 2019, 1:56 p.m. UTC | #6
On 2019/07/01 22:48, Michal Hocko wrote:
> On Mon 01-07-19 22:38:58, Tetsuo Handa wrote:
>> On 2019/07/01 22:17, Michal Hocko wrote:
>>> On Mon 01-07-19 22:04:22, Tetsuo Handa wrote:
>>>> But I realized that this patch was too optimistic. We need to wait for mm-less
>>>> threads until MMF_OOM_SKIP is set if the process was already an OOM victim.
>>>
>>> If the process is an oom victim then _all_ threads are so as well
>>> because that is the address space property. And we already do check that
>>> before reaching oom_badness IIRC. So what is the actual problem you are
>>> trying to solve here?
>>
>> I'm talking about behavioral change after tsk became an OOM victim.
>>
>> If tsk->signal->oom_mm != NULL, we have to wait for MMF_OOM_SKIP even if
>> tsk->mm == NULL. Otherwise, the OOM killer selects next OOM victim as soon as
>> oom_unkillable_task() returned true because has_intersects_mems_allowed() returned
>> false because mempolicy_nodemask_intersects() returned false because all thread's
>> mm became NULL (despite tsk->signal->oom_mm != NULL).
> 
> OK, I finally got your point. It was not clear that you are referring to
> the code _after_ the patch you are proposing. You are indeed right that
> this would have a side effect that an additional victim could be
> selected even though the current process hasn't terminated yet. Sigh,
> another example how the whole thing is subtle so I retract my Ack and
> request a real life example of where this matters before we think about
> a proper fix and make the code even more complex.
> 

Instead of checking for mm != NULL, can we move mpol_put_task_policy() from
do_exit() to __put_task_struct() ? That change will (if it is safe to do)
prevent exited threads from setting mempolicy = NULL (and confusing
mempolicy_nodemask_intersects() due to mempolicy == NULL).
Michal Hocko July 1, 2019, 2:04 p.m. UTC | #7
On Mon 01-07-19 22:56:12, Tetsuo Handa wrote:
> On 2019/07/01 22:48, Michal Hocko wrote:
> > On Mon 01-07-19 22:38:58, Tetsuo Handa wrote:
> >> On 2019/07/01 22:17, Michal Hocko wrote:
> >>> On Mon 01-07-19 22:04:22, Tetsuo Handa wrote:
> >>>> But I realized that this patch was too optimistic. We need to wait for mm-less
> >>>> threads until MMF_OOM_SKIP is set if the process was already an OOM victim.
> >>>
> >>> If the process is an oom victim then _all_ threads are so as well
> >>> because that is the address space property. And we already do check that
> >>> before reaching oom_badness IIRC. So what is the actual problem you are
> >>> trying to solve here?
> >>
> >> I'm talking about behavioral change after tsk became an OOM victim.
> >>
> >> If tsk->signal->oom_mm != NULL, we have to wait for MMF_OOM_SKIP even if
> >> tsk->mm == NULL. Otherwise, the OOM killer selects next OOM victim as soon as
> >> oom_unkillable_task() returned true because has_intersects_mems_allowed() returned
> >> false because mempolicy_nodemask_intersects() returned false because all thread's
> >> mm became NULL (despite tsk->signal->oom_mm != NULL).
> > 
> > OK, I finally got your point. It was not clear that you are referring to
> > the code _after_ the patch you are proposing. You are indeed right that
> > this would have a side effect that an additional victim could be
> > selected even though the current process hasn't terminated yet. Sigh,
> > another example how the whole thing is subtle so I retract my Ack and
> > request a real life example of where this matters before we think about
> > a proper fix and make the code even more complex.
> > 
> 
> Instead of checking for mm != NULL, can we move mpol_put_task_policy() from
> do_exit() to __put_task_struct() ? That change will (if it is safe to do)
> prevent exited threads from setting mempolicy = NULL (and confusing
> mempolicy_nodemask_intersects() due to mempolicy == NULL).

I am sorry but I would have to study it much more and I am not convinced
the time spent on it would be well spent.
Michal Hocko July 1, 2019, 2:16 p.m. UTC | #8
On Mon 01-07-19 16:04:34, Michal Hocko wrote:
> On Mon 01-07-19 22:56:12, Tetsuo Handa wrote:
> > On 2019/07/01 22:48, Michal Hocko wrote:
> > > On Mon 01-07-19 22:38:58, Tetsuo Handa wrote:
> > >> On 2019/07/01 22:17, Michal Hocko wrote:
> > >>> On Mon 01-07-19 22:04:22, Tetsuo Handa wrote:
> > >>>> But I realized that this patch was too optimistic. We need to wait for mm-less
> > >>>> threads until MMF_OOM_SKIP is set if the process was already an OOM victim.
> > >>>
> > >>> If the process is an oom victim then _all_ threads are so as well
> > >>> because that is the address space property. And we already do check that
> > >>> before reaching oom_badness IIRC. So what is the actual problem you are
> > >>> trying to solve here?
> > >>
> > >> I'm talking about behavioral change after tsk became an OOM victim.
> > >>
> > >> If tsk->signal->oom_mm != NULL, we have to wait for MMF_OOM_SKIP even if
> > >> tsk->mm == NULL. Otherwise, the OOM killer selects next OOM victim as soon as
> > >> oom_unkillable_task() returned true because has_intersects_mems_allowed() returned
> > >> false because mempolicy_nodemask_intersects() returned false because all thread's
> > >> mm became NULL (despite tsk->signal->oom_mm != NULL).
> > > 
> > > OK, I finally got your point. It was not clear that you are referring to
> > > the code _after_ the patch you are proposing. You are indeed right that
> > > this would have a side effect that an additional victim could be
> > > selected even though the current process hasn't terminated yet. Sigh,
> > > another example how the whole thing is subtle so I retract my Ack and
> > > request a real life example of where this matters before we think about
> > > a proper fix and make the code even more complex.
> > > 
> > 
> > Instead of checking for mm != NULL, can we move mpol_put_task_policy() from
> > do_exit() to __put_task_struct() ? That change will (if it is safe to do)
> > prevent exited threads from setting mempolicy = NULL (and confusing
> > mempolicy_nodemask_intersects() due to mempolicy == NULL).
> 
> I am sorry but I would have to study it much more and I am not convinced
> the time spent on it would be well spent.

Thinking about it some more it seems that we can go with your original
fix if we also reorder oom_evaluate_task
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f719b64741d6..e5feb0f72e3b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -318,9 +318,6 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	struct oom_control *oc = arg;
 	unsigned long points;
 
-	if (oom_unkillable_task(task, NULL, oc->nodemask))
-		goto next;
-
 	/*
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves unless
@@ -333,6 +330,9 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto abort;
 	}
 
+	if (oom_unkillable_task(task, NULL, oc->nodemask))
+		goto next;
+
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
 	 * killed first if it triggers an oom, then select it.

I do not see any strong reason to keep the current ordering. OOM victim
check is trivial so it shouldn't add a visible overhead for few
unkillable tasks that we might encounter.
Tetsuo Handa July 2, 2019, 1:19 p.m. UTC | #9
On 2019/07/01 23:16, Michal Hocko wrote:
> Thinking about it some more it seems that we can go with your original
> fix if we also reorder oom_evaluate_task
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f719b64741d6..e5feb0f72e3b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -318,9 +318,6 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>  	struct oom_control *oc = arg;
>  	unsigned long points;
>  
> -	if (oom_unkillable_task(task, NULL, oc->nodemask))
> -		goto next;
> -
>  	/*
>  	 * This task already has access to memory reserves and is being killed.
>  	 * Don't allow any other task to have access to the reserves unless
> @@ -333,6 +330,9 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>  		goto abort;
>  	}
>  
> +	if (oom_unkillable_task(task, NULL, oc->nodemask))
> +		goto next;
> +
>  	/*
>  	 * If task is allocating a lot of memory and has been marked to be
>  	 * killed first if it triggers an oom, then select it.
> 
> I do not see any strong reason to keep the current ordering. OOM victim
> check is trivial so it shouldn't add a visible overhead for few
> unkillable tasks that we might encounter.
> 

Yes if we can tolerate that there can be only one OOM victim for !memcg OOM events
(because an OOM victim in a different OOM context will hit "goto abort;" path).



Thinking again, I think that the same problem exists for mask == NULL path
as long as "a process with dying leader and live threads" is possible. Then,
fixing up after has_intersects_mems_allowed()/cpuset_mems_allowed_intersects()
judged that some thread is eligible is better.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d1c9c4e..43e499e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -109,8 +109,23 @@ static bool oom_cpuset_eligible(struct task_struct *start,
 			 */
 			ret = cpuset_mems_allowed_intersects(current, tsk);
 		}
-		if (ret)
-			break;
+		if (ret) {
+			/*
+			 * Exclude dead threads as ineligible when selecting
+			 * an OOM victim. But include dead threads as eligible
+			 * when waiting for OOM victims to get MMF_OOM_SKIP.
+			 *
+			 * Strictly speaking, tsk->mm should be checked under
+			 * task lock because cpuset_mems_allowed_intersects()
+			 * does not take task lock. But racing with exit_mm()
+			 * is not fatal. Thus, use cheaper barrier rather than
+			 * strict task lock.
+			 */
+			smp_rmb();
+			if (tsk->mm || tsk_is_oom_victim(tsk))
+				break;
+			ret = false;
+		}
 	}
 	rcu_read_unlock();
Michal Hocko July 2, 2019, 1:51 p.m. UTC | #10
On Tue 02-07-19 22:19:27, Tetsuo Handa wrote:
> On 2019/07/01 23:16, Michal Hocko wrote:
> > Thinking about it some more it seems that we can go with your original
> > fix if we also reorder oom_evaluate_task
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index f719b64741d6..e5feb0f72e3b 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -318,9 +318,6 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
> >  	struct oom_control *oc = arg;
> >  	unsigned long points;
> >  
> > -	if (oom_unkillable_task(task, NULL, oc->nodemask))
> > -		goto next;
> > -
> >  	/*
> >  	 * This task already has access to memory reserves and is being killed.
> >  	 * Don't allow any other task to have access to the reserves unless
> > @@ -333,6 +330,9 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
> >  		goto abort;
> >  	}
> >  
> > +	if (oom_unkillable_task(task, NULL, oc->nodemask))
> > +		goto next;
> > +
> >  	/*
> >  	 * If task is allocating a lot of memory and has been marked to be
> >  	 * killed first if it triggers an oom, then select it.
> > 
> > I do not see any strong reason to keep the current ordering. OOM victim
> > check is trivial so it shouldn't add a visible overhead for few
> > unkillable tasks that we might encounter.
> > 
> 
> Yes if we can tolerate that there can be only one OOM victim for !memcg OOM events
> (because an OOM victim in a different OOM context will hit "goto abort;" path).

You are right. Considering that we now have a guarantee of a forward
progress then this should be tolerateable (a victim in a disjoint
numaset will go away and other one can go ahead and trigger its own
OOM).
 
> Thinking again, I think that the same problem exists for mask == NULL path
> as long as "a process with dying leader and live threads" is possible. Then,
> fixing up after has_intersects_mems_allowed()/cpuset_mems_allowed_intersects()
> judged that some thread is eligible is better.

This is getting more and more hair for something that is not really
clear to be an actual problem. Don't you think?

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index d1c9c4e..43e499e 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -109,8 +109,23 @@ static bool oom_cpuset_eligible(struct task_struct *start,
>  			 */
>  			ret = cpuset_mems_allowed_intersects(current, tsk);
>  		}
> -		if (ret)
> -			break;
> +		if (ret) {
> +			/*
> +			 * Exclude dead threads as ineligible when selecting
> +			 * an OOM victim. But include dead threads as eligible
> +			 * when waiting for OOM victims to get MMF_OOM_SKIP.
> +			 *
> +			 * Strictly speaking, tsk->mm should be checked under
> +			 * task lock because cpuset_mems_allowed_intersects()
> +			 * does not take task lock. But racing with exit_mm()
> +			 * is not fatal. Thus, use cheaper barrier rather than
> +			 * strict task lock.
> +			 */
> +			smp_rmb();
> +			if (tsk->mm || tsk_is_oom_victim(tsk))
> +				break;
> +			ret = false;
> +		}
>  	}
>  	rcu_read_unlock();
>
Tetsuo Handa July 2, 2019, 9:26 p.m. UTC | #11
On 2019/07/02 22:51, Michal Hocko wrote:
>>> I do not see any strong reason to keep the current ordering. OOM victim
>>> check is trivial so it shouldn't add a visible overhead for few
>>> unkillable tasks that we might encounter.
>>>
>>
>> Yes if we can tolerate that there can be only one OOM victim for !memcg OOM events
>> (because an OOM victim in a different OOM context will hit "goto abort;" path).
> 
> You are right. Considering that we now have a guarantee of a forward
> progress then this should be tolerateable (a victim in a disjoint
> numaset will go away and other one can go ahead and trigger its own
> OOM).

But it might take very long period before MMF_OOM_SKIP is set by the OOM reaper
or exit_mmap(). Until MMF_OOM_SKIP is set, OOM events from disjoint numaset can't
make forward progress.
Michal Hocko July 3, 2019, 7:06 a.m. UTC | #12
On Wed 03-07-19 06:26:55, Tetsuo Handa wrote:
> On 2019/07/02 22:51, Michal Hocko wrote:
> >>> I do not see any strong reason to keep the current ordering. OOM victim
> >>> check is trivial so it shouldn't add a visible overhead for few
> >>> unkillable tasks that we might encounter.
> >>>
> >>
> >> Yes if we can tolerate that there can be only one OOM victim for !memcg OOM events
> >> (because an OOM victim in a different OOM context will hit "goto abort;" path).
> > 
> > You are right. Considering that we now have a guarantee of a forward
> > progress then this should be tolerateable (a victim in a disjoint
> > numaset will go away and other one can go ahead and trigger its own
> > OOM).
> 
> But it might take very long period before MMF_OOM_SKIP is set by the OOM reaper
> or exit_mmap(). Until MMF_OOM_SKIP is set, OOM events from disjoint numaset can't
> make forward progress.

If that is a concern then I would stick with the current status quo
until we see the issue to be reported by real workloads.

Patch
diff mbox series

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 01600d8..938f0a0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1974,11 +1974,10 @@  bool mempolicy_nodemask_intersects(struct task_struct *tsk,
 					const nodemask_t *mask)
 {
 	struct mempolicy *mempolicy;
-	bool ret = true;
+	bool ret;
 
-	if (!mask)
-		return ret;
 	task_lock(tsk);
+	ret = tsk->mm;
 	mempolicy = tsk->mempolicy;
 	if (!mempolicy)
 		goto out;