[1/1] mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessary

Message ID 20200820002053.1424000-1-surenb@google.com (mailing list archive)
State New, archived
Series: [1/1] mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessary

Commit Message

Suren Baghdasaryan Aug. 20, 2020, 12:20 a.m. UTC
Currently __set_oom_adj loops through all processes in the system to
keep oom_score_adj and oom_score_adj_min in sync between processes
sharing their mm. This is done for any task whose mm_users count
exceeds 1, which includes processes with multiple threads (sharing mm
and signals). However for such processes the loop is unnecessary
because their signal structure is shared as well.

Android updates oom_score_adj whenever a task changes its role
(background/foreground/...) or binds to/unbinds from a service, making
it more/less important. Such operations can happen frequently.
We noticed that updates to oom_score_adj became more expensive and,
after further investigation, found that the patch mentioned in "Fixes"
introduced a regression. Using a Pixel 4 with a typical Android
workload, write time to oom_score_adj increased from ~3.57us to ~362us.
Moreover, the regression grows linearly with the number of
multi-threaded processes running on the system.

Mark the mm with a new MMF_PROC_SHARED flag bit when a task is created
with CLONE_VM and !CLONE_SIGHAND. Change __set_oom_adj to use
MMF_PROC_SHARED instead of mm_users to decide whether the oom_score_adj
update should be synchronized between multiple processes. To prevent
races between clone() and __set_oom_adj(), when the oom_score_adj of
the process being cloned might be modified from userspace, we use
oom_adj_mutex. Its scope is changed to global and it is renamed to
oom_adj_lock for naming consistency with oom_lock. Since the
combination of CLONE_VM and !CLONE_SIGHAND is rarely used, the
additional mutex lock in that path of the clone() syscall should not
affect its overall performance. Clearing the MMF_PROC_SHARED flag (when
the last process sharing the mm exits) is left out of this patch to
keep it simple and because this threading model is believed to be rare.
Should there ever be a need for optimizing that case as well, it can be
done by hooking into the exit path, likely following the
mm_update_next_owner pattern.

With the combination of CLONE_VM and !CLONE_SIGHAND being quite rare,
the regression is gone after the change is applied.
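
For reference, the numbers above are the latency of a plain write to
/proc/<pid>/oom_score_adj. A minimal userspace sketch of such a
measurement (the loop count, the value written, and the timing scheme
are illustrative only, not part of the patch):

/* Time repeated writes to our own oom_score_adj. Writing "0" keeps the
 * default adjustment, so running this is harmless; error handling is
 * minimal for brevity. */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	struct timespec start, end;
	int i, fd = open("/proc/self/oom_score_adj", O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < 1000; i++) {
		if (pwrite(fd, "0", 1, 0) != 1) {
			perror("pwrite");
			return 1;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	/* total ns / 1000 iterations / 1000 = us per write */
	printf("avg write latency: %.2f us\n",
	       ((end.tv_sec - start.tv_sec) * 1e9 +
		(end.tv_nsec - start.tv_nsec)) / 1000.0 / 1000.0);
	close(fd);
	return 0;
}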

Fixes: 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
Reported-by: Tim Murray <timmurray@google.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 fs/proc/base.c                 | 7 +++----
 include/linux/oom.h            | 1 +
 include/linux/sched/coredump.h | 1 +
 kernel/fork.c                  | 9 +++++++++
 mm/oom_kill.c                  | 2 ++
 5 files changed, 16 insertions(+), 4 deletions(-)
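
For context, the cost being measured comes from the for_each_process()
walk inside __set_oom_adj(), which this patch makes conditional on
MMF_PROC_SHARED instead of mm_users > 1. Abbreviated and paraphrased
from fs/proc/base.c of this era (details may differ by kernel version):

	if (mm) {
		struct task_struct *p;

		rcu_read_lock();
		for_each_process(p) {
			if (same_thread_group(task, p))
				continue;

			/* do not touch kernel threads or the global init */
			if (p->flags & PF_KTHREAD || is_global_init(p))
				continue;

			task_lock(p);
			if (!p->vfork_done && process_shares_mm(p, mm)) {
				p->signal->oom_score_adj = oom_adj;
				if (!legacy && has_capability_noaudit(current,
							CAP_SYS_RESOURCE))
					p->signal->oom_score_adj_min =
						(short)oom_adj;
			}
			task_unlock(p);
		}
		rcu_read_unlock();
		mmdrop(mm);
	}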

Comments

Michal Hocko Aug. 20, 2020, 5:56 a.m. UTC | #1
On Wed 19-08-20 17:20:53, Suren Baghdasaryan wrote:
> [...]

Yes, this seems like a reasonable approach. Should we learn about a
usecase which relies on this threading model, we might need more
sophisticated synchronization, but for now this sounds good enough.

> Fixes: 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
> Reported-by: Tim Murray <timmurray@google.com>
> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  fs/proc/base.c                 | 7 +++----
>  include/linux/oom.h            | 1 +
>  include/linux/sched/coredump.h | 1 +
>  kernel/fork.c                  | 9 +++++++++
>  mm/oom_kill.c                  | 2 ++
>  5 files changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 617db4e0faa0..cff1a58a236c 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1055,7 +1055,6 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
>  
>  static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>  {
> -	static DEFINE_MUTEX(oom_adj_mutex);
>  	struct mm_struct *mm = NULL;
>  	struct task_struct *task;
>  	int err = 0;
> @@ -1064,7 +1063,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>  	if (!task)
>  		return -ESRCH;
>  
> -	mutex_lock(&oom_adj_mutex);
> +	mutex_lock(&oom_adj_lock);
>  	if (legacy) {
>  		if (oom_adj < task->signal->oom_score_adj &&
>  				!capable(CAP_SYS_RESOURCE)) {
> @@ -1095,7 +1094,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>  		struct task_struct *p = find_lock_task_mm(task);
>  
>  		if (p) {
> -			if (atomic_read(&p->mm->mm_users) > 1) {
> +			if (test_bit(MMF_PROC_SHARED, &p->mm->flags)) {
>  				mm = p->mm;
>  				mmgrab(mm);
>  			}
> @@ -1132,7 +1131,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>  		mmdrop(mm);
>  	}
>  err_unlock:
> -	mutex_unlock(&oom_adj_mutex);
> +	mutex_unlock(&oom_adj_lock);
>  	put_task_struct(task);
>  	return err;
>  }
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index f022f581ac29..861f22bd4706 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -55,6 +55,7 @@ struct oom_control {
>  };
>  
>  extern struct mutex oom_lock;
> +extern struct mutex oom_adj_lock;
>  
>  static inline void set_current_oom_origin(void)
>  {
> diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> index ecdc6542070f..070629b722df 100644
> --- a/include/linux/sched/coredump.h
> +++ b/include/linux/sched/coredump.h
> @@ -72,6 +72,7 @@ static inline int get_dumpable(struct mm_struct *mm)
>  #define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
>  #define MMF_OOM_VICTIM		25	/* mm is the oom victim */
>  #define MMF_OOM_REAP_QUEUED	26	/* mm was queued for oom_reaper */
> +#define MMF_PROC_SHARED	27	/* mm is shared while sighand is not */
>  #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
>  
>  #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 4d32190861bd..9177a76bf840 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1403,6 +1403,15 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
>  	if (clone_flags & CLONE_VM) {
>  		mmget(oldmm);
>  		mm = oldmm;
> +		if (!(clone_flags & CLONE_SIGHAND)) {
> +			/* We need to synchronize with __set_oom_adj */
> +			mutex_lock(&oom_adj_lock);
> +			set_bit(MMF_PROC_SHARED, &mm->flags);
> +			/* Update the values in case they were changed after copy_signal */
> +			tsk->signal->oom_score_adj = current->signal->oom_score_adj;
> +			tsk->signal->oom_score_adj_min = current->signal->oom_score_adj_min;
> +			mutex_unlock(&oom_adj_lock);
> +		}
>  		goto good_mm;
>  	}
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index e90f25d6385d..c22f07c986cb 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -64,6 +64,8 @@ int sysctl_oom_dump_tasks = 1;
>   * and mark_oom_victim
>   */
>  DEFINE_MUTEX(oom_lock);
> +/* Serializes oom_score_adj and oom_score_adj_min updates */
> +DEFINE_MUTEX(oom_adj_lock);
>  
>  static inline bool is_memcg_oom(struct oom_control *oc)
>  {
> -- 
> 2.28.0.297.g1956fa8f8d-goog
Christian Brauner Aug. 20, 2020, 8:46 a.m. UTC | #2
On Wed, Aug 19, 2020 at 05:20:53PM -0700, Suren Baghdasaryan wrote:
> [...]
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 4d32190861bd..9177a76bf840 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1403,6 +1403,15 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
>  	if (clone_flags & CLONE_VM) {
>  		mmget(oldmm);
>  		mm = oldmm;
> +		if (!(clone_flags & CLONE_SIGHAND)) {
> +			/* We need to synchronize with __set_oom_adj */
> +			mutex_lock(&oom_adj_lock);
> +			set_bit(MMF_PROC_SHARED, &mm->flags);

This seems fine.

> +			/* Update the values in case they were changed after copy_signal */
> +			tsk->signal->oom_score_adj = current->signal->oom_score_adj;
> +			tsk->signal->oom_score_adj_min = current->signal->oom_score_adj_min;

But this seems wrong to me.
copy_signal() should be the only place where ->signal is set. Just from
a pure conceptual perspective. The copy_*() should be as self-contained
as possible imho.
Also, now I have to remember/look for two different locations where
oom_score_adj{_min} is initialized during fork. And this also creates a
dependency between copy_signal() and copy_mm() that doesn't need to be
there imho. I'm not a fan.

Also, you're in a branch where you're not sharing signal struct. So
after copy_signal() why would the parent changing their
oom_score_adj{_min} value matter? If that "race" is really important to
handle then you need to verify that the child has the expected oom settings
after process creation anyway and should probably set it again from
userspace to make sure. This seems like an unnecessary optimization that
just makes the code that tiny bit more complex than it needs to be.

I'd really just do:

if (!(clone_flags & CLONE_SIGHAND)) {
	/* We need to synchronize with __set_oom_adj */
	mutex_lock(&oom_adj_lock);
	set_bit(MMF_PROC_SHARED, &mm->flags);
	mutex_unlock(&oom_adj_lock);
}

makes this look way cleaner too imho.

Christian
Michal Hocko Aug. 20, 2020, 9:09 a.m. UTC | #3
On Thu 20-08-20 10:46:54, Christian Brauner wrote:
> On Wed, Aug 19, 2020 at 05:20:53PM -0700, Suren Baghdasaryan wrote:
> > [...]
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 4d32190861bd..9177a76bf840 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1403,6 +1403,15 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
> >  	if (clone_flags & CLONE_VM) {
> >  		mmget(oldmm);
> >  		mm = oldmm;
> > +		if (!(clone_flags & CLONE_SIGHAND)) {
> > +			/* We need to synchronize with __set_oom_adj */
> > +			mutex_lock(&oom_adj_lock);
> > +			set_bit(MMF_PROC_SHARED, &mm->flags);
> 
> This seems fine.
> 
> > +			/* Update the values in case they were changed after copy_signal */
> > +			tsk->signal->oom_score_adj = current->signal->oom_score_adj;
> > +			tsk->signal->oom_score_adj_min = current->signal->oom_score_adj_min;
> 
> But this seems wrong to me.
> copy_signal() should be the only place where ->signal is set. Just from
> a pure conceptual perspective. The copy_*() should be as self-contained
> as possible imho.
> Also, now I have to remember/look for two different locations where
> oom_score_adj{_min} is initialized during fork. And this also creates a
> dependency between copy_signal() and copy_mm() that doesn't need to be
> there imho. I'm not a fan.

Yes, this is not great but we will need it because the __set_oom_adj
might happen between copy_signal and copy_mm. If that happens then
__set_oom_adj misses this newly created task and so it will have a
disagreeing oom_score_adj.
Christian Brauner Aug. 20, 2020, 10:32 a.m. UTC | #4
On Thu, Aug 20, 2020 at 11:09:01AM +0200, Michal Hocko wrote:
> On Thu 20-08-20 10:46:54, Christian Brauner wrote:
> > On Wed, Aug 19, 2020 at 05:20:53PM -0700, Suren Baghdasaryan wrote:
> > > [...]
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 4d32190861bd..9177a76bf840 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -1403,6 +1403,15 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
> > >  	if (clone_flags & CLONE_VM) {
> > >  		mmget(oldmm);
> > >  		mm = oldmm;
> > > +		if (!(clone_flags & CLONE_SIGHAND)) {
> > > +			/* We need to synchronize with __set_oom_adj */
> > > +			mutex_lock(&oom_adj_lock);
> > > +			set_bit(MMF_PROC_SHARED, &mm->flags);
> > 
> > This seems fine.
> > 
> > > +			/* Update the values in case they were changed after copy_signal */
> > > +			tsk->signal->oom_score_adj = current->signal->oom_score_adj;
> > > +			tsk->signal->oom_score_adj_min = current->signal->oom_score_adj_min;
> > 
> > But this seems wrong to me.
> > copy_signal() should be the only place where ->signal is set. Just from
> > a pure conceptual perspective. The copy_*() should be as self-contained
> > as possible imho.
> > Also, now I have to remember/look for two different locations where
> > oom_score_adj{_min} is initialized during fork. And this also creates a
> > dependency between copy_signal() and copy_mm() that doesn't need to be
> > there imho. I'm not a fan.
> 
> Yes, this is not great but we will need it because the __set_oom_adj
> might happen between copy_signal and copy_mm. If that happens then
> __set_oom_adj misses this newly created task and so it will have a
> disagreeing oom_score_adj.

One more clarification. The commit message states that

> > > which includes processes with multiple threads (sharing mm and signals).
> > > However for such processes the loop is unnecessary because their signal
> > > structure is shared as well.

and it seems you want to exclude threads, i.e. only update an mm that is
shared beyond a single thread-group.
But struct signal and struct sighand_struct are different things, i.e.
they can be shared or not independently of each other. A non-shared
signal_struct, where oom_score_adj{_min} live, is only implied by
!CLONE_THREAD. So shouldn't this be:

if (!(clone_flags & CLONE_THREAD)) rather than CLONE_SIGHAND?

I might be missing something obvious though...
Christian
Oleg Nesterov Aug. 20, 2020, 10:55 a.m. UTC | #5
On 08/19, Suren Baghdasaryan wrote:
>
> Since the combination of CLONE_VM and !CLONE_SIGHAND is rarely
> used the additional mutex lock in that path of the clone() syscall should
> not affect its overall performance. Clearing the MMF_PROC_SHARED flag
> (when the last process sharing the mm exits) is left out of this patch to
> keep it simple and because it is believed that this threading model is
> rare.

vfork() ?

> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1403,6 +1403,15 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
>  	if (clone_flags & CLONE_VM) {
>  		mmget(oldmm);
>  		mm = oldmm;
> +		if (!(clone_flags & CLONE_SIGHAND)) {

I agree with Christian, you need CLONE_THREAD

> +			/* We need to synchronize with __set_oom_adj */
> +			mutex_lock(&oom_adj_lock);
> +			set_bit(MMF_PROC_SHARED, &mm->flags);
> +			/* Update the values in case they were changed after copy_signal */
> +			tsk->signal->oom_score_adj = current->signal->oom_score_adj;
> +			tsk->signal->oom_score_adj_min = current->signal->oom_score_adj_min;
> +			mutex_unlock(&oom_adj_lock);

I don't understand how this can close the race with __set_oom_adj...

What if __set_oom_adj() is called right after mutex_unlock() ? It will see
MMF_PROC_SHARED, but for_each_process() won't find the new child until
copy_process() does list_add_tail_rcu(&p->tasks, &init_task.tasks) ?
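
Spelled out, the problematic interleaving looks roughly like this (a
sketch; the copy_mm() steps are the ones added by this patch):

clone() path                            __set_oom_adj() path
------------                            --------------------
copy_mm():
  mutex_lock(&oom_adj_lock)
  set_bit(MMF_PROC_SHARED, &mm->flags)
  sync child's oom_score_adj{_min}
  mutex_unlock(&oom_adj_lock)
                                        mutex_lock(&oom_adj_lock)
                                        sees MMF_PROC_SHARED
                                        for_each_process(p): updates all
                                          tasks sharing the mm, but the
                                          child is not on the task list yet
                                        mutex_unlock(&oom_adj_lock)
copy_process():
  list_add_tail_rcu(&p->tasks, ...)
  child becomes visible with a stale
  oom_score_adj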

Oleg.
Michal Hocko Aug. 20, 2020, 11:13 a.m. UTC | #6
On Thu 20-08-20 12:55:56, Oleg Nesterov wrote:
> On 08/19, Suren Baghdasaryan wrote:
> >
> > Since the combination of CLONE_VM and !CLONE_SIGHAND is rarely
> > used the additional mutex lock in that path of the clone() syscall should
> > not affect its overall performance. Clearing the MMF_PROC_SHARED flag
> > (when the last process sharing the mm exits) is left out of this patch to
> > keep it simple and because it is believed that this threading model is
> > rare.
> 
> vfork() ?

Could you be more specific?

> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1403,6 +1403,15 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
> >  	if (clone_flags & CLONE_VM) {
> >  		mmget(oldmm);
> >  		mm = oldmm;
> > +		if (!(clone_flags & CLONE_SIGHAND)) {
> 
> I agree with Christian, you need CLONE_THREAD

This was my suggestion to Suren, likely because I've misremembered which
clone flag is responsible for the signal delivery. But now, after double
checking we do explicitly disallow CLONE_SIGHAND && !CLONE_VM. So
CLONE_THREAD is the right thing to check.

> > +			/* We need to synchronize with __set_oom_adj */
> > +			mutex_lock(&oom_adj_lock);
> > +			set_bit(MMF_PROC_SHARED, &mm->flags);
> > +			/* Update the values in case they were changed after copy_signal */
> > +			tsk->signal->oom_score_adj = current->signal->oom_score_adj;
> > +			tsk->signal->oom_score_adj_min = current->signal->oom_score_adj_min;
> > +			mutex_unlock(&oom_adj_lock);
> 
> I don't understand how this can close the race with __set_oom_adj...
> 
> What if __set_oom_adj() is called right after mutex_unlock() ? It will see
> MMF_PROC_SHARED, but for_each_process() won't find the new child until
> copy_process() does list_add_tail_rcu(&p->tasks, &init_task.tasks) ?

Good point. Then we will have to move this thing there.

Thanks!
Michal Hocko Aug. 20, 2020, 11:14 a.m. UTC | #7
On Thu 20-08-20 12:32:48, Christian Brauner wrote:
> On Thu, Aug 20, 2020 at 11:09:01AM +0200, Michal Hocko wrote:
> > On Thu 20-08-20 10:46:54, Christian Brauner wrote:
[...]
> > > > which includes processes with multiple threads (sharing mm and signals).
> > > > However for such processes the loop is unnecessary because their signal
> > > > structure is shared as well.
> 
> and it seems you want to exclude threads, i.e. only update mm that is
> shared not among threads in the same thread-group.
> But struct signal and struct sighand_struct are different things, i.e.
> they can be shared or not independent of each other. A non-shared
> signal_struct where oom_score_adj{_min} live is only implied by
> !CLONE_THREAD. So shouldn't this be:
> 
> if (!(clone_flags & CLONE_THREAD)) rather than CLONE_SIGHAND?

You are right, as I have already replied to Oleg.
Michal Hocko Aug. 20, 2020, 11:29 a.m. UTC | #8
On Thu 20-08-20 13:13:55, Michal Hocko wrote:
> On Thu 20-08-20 12:55:56, Oleg Nesterov wrote:
> > On 08/19, Suren Baghdasaryan wrote:
> > >
> > > Since the combination of CLONE_VM and !CLONE_SIGHAND is rarely
> > > used the additional mutex lock in that path of the clone() syscall should
> > > not affect its overall performance. Clearing the MMF_PROC_SHARED flag
> > > (when the last process sharing the mm exits) is left out of this patch to
> > > keep it simple and because it is believed that this threading model is
> > > rare.
> > 
> > vfork() ?
> 
> Could you be more specific?
> 
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -1403,6 +1403,15 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
> > >  	if (clone_flags & CLONE_VM) {
> > >  		mmget(oldmm);
> > >  		mm = oldmm;
> > > +		if (!(clone_flags & CLONE_SIGHAND)) {
> > 
> > I agree with Christian, you need CLONE_THREAD
> 
> This was my suggestion to Suren, likely because I've misremembered which
> clone flag is responsible for the signal delivery. But now, after double
> checking we do explicitly disallow CLONE_SIGHAND && !CLONE_VM. So
> CLONE_THREAD is the right thing to check.

I have tried to remember but I have to say that after reading the man page I
am still confused. So what is the actual difference between CLONE_THREAD
and CLONE_SIGHAND? Essentially all we care about from the OOM (and
oom_score_adj) POV is that signals are delivered to all entities and
that they share signal struct. copy_signal is checking for CLONE_THREAD
but CLONE_THREAD requires CLONE_SIGHAND AFAIU. So is there any case where
checking for CLONE_SIGHAND would be wrong for our purpose?

This is mostly an academic question because I do agree that checking for
CLONE_THREAD is likely more readable. And in fact the MMF_PROC_SHARED is
likely more suitable to be set in copy_signal.
Christian Brauner Aug. 20, 2020, 11:30 a.m. UTC | #9
On Thu, Aug 20, 2020 at 01:13:49PM +0200, Michal Hocko wrote:
> On Thu 20-08-20 12:55:56, Oleg Nesterov wrote:
> > On 08/19, Suren Baghdasaryan wrote:
> > >
> > > Since the combination of CLONE_VM and !CLONE_SIGHAND is rarely
> > > used the additional mutex lock in that path of the clone() syscall should
> > > not affect its overall performance. Clearing the MMF_PROC_SHARED flag
> > > (when the last process sharing the mm exits) is left out of this patch to
> > > keep it simple and because it is believed that this threading model is
> > > rare.
> > 
> > vfork() ?
> 
> Could you be more specific?

vfork() implies CLONE_VM but !CLONE_THREAD. The way this patch is
written, the mutex lock will be taken every time you do a vfork().

(It's honestly also debatable whether it's that rare. For one, userspace
stuff I maintain uses it too (see [1]).
[1]: https://github.com/lxc/lxc/blob/9d3b7c97f0443adc9f0b0438437657ab42f5a1c3/src/lxc/start.c#L1676
)
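
For context, vfork() boils down to a clone with CLONE_VFORK | CLONE_VM
and neither CLONE_THREAD nor CLONE_SIGHAND, so it takes exactly this
path. A sketch, approximately the kernel/fork.c entry point of this era:

SYSCALL_DEFINE0(vfork)
{
	struct kernel_clone_args args = {
		.flags		= CLONE_VFORK | CLONE_VM,
		.exit_signal	= SIGCHLD,
	};

	return _do_fork(&args);
}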

> 
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -1403,6 +1403,15 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
> > >  	if (clone_flags & CLONE_VM) {
> > >  		mmget(oldmm);
> > >  		mm = oldmm;
> > > +		if (!(clone_flags & CLONE_SIGHAND)) {
> > 
> > I agree with Christian, you need CLONE_THREAD
> 
> This was my suggestion to Suren, likely because I've misremembered which
> clone flag is responsible for the signal delivery. But now, after double
> checking we do explicitly disallow CLONE_SIGHAND && !CLONE_VM. So
> CLONE_THREAD is the right thing to check.
> 
> > > +			/* We need to synchronize with __set_oom_adj */
> > > +			mutex_lock(&oom_adj_lock);
> > > +			set_bit(MMF_PROC_SHARED, &mm->flags);
> > > +			/* Update the values in case they were changed after copy_signal */
> > > +			tsk->signal->oom_score_adj = current->signal->oom_score_adj;
> > > +			tsk->signal->oom_score_adj_min = current->signal->oom_score_adj_min;
> > > +			mutex_unlock(&oom_adj_lock);
> > 
> > I don't understand how this can close the race with __set_oom_adj...
> > 
> > What if __set_oom_adj() is called right after mutex_unlock() ? It will see
> > MMF_PROC_SHARED, but for_each_process() won't find the new child until
> > copy_process() does list_add_tail_rcu(&p->tasks, &init_task.tasks) ?
> 
> Good point. Then we will have to move this thing there.

I was toying with moving this into something like:

static inline copy_oom_score(unsigned long flags, struct task_struct *tsk)

trying to rely on set_bit() and test_bit() in copy_mm() being atomic and
then calling it where Oleg said, after the point of no return.

Christian
Oleg Nesterov Aug. 20, 2020, 11:41 a.m. UTC | #10
On 08/20, Michal Hocko wrote:
>
> On Thu 20-08-20 13:13:55, Michal Hocko wrote:
> > On Thu 20-08-20 12:55:56, Oleg Nesterov wrote:
> > > On 08/19, Suren Baghdasaryan wrote:
> > > >
> > > > Since the combination of CLONE_VM and !CLONE_SIGHAND is rarely
> > > > used the additional mutex lock in that path of the clone() syscall should
> > > > not affect its overall performance. Clearing the MMF_PROC_SHARED flag
> > > > (when the last process sharing the mm exits) is left out of this patch to
> > > > keep it simple and because it is believed that this threading model is
> > > > rare.
> > >
> > > vfork() ?
> >
> > Could you be more specific?

I meant, vfork() is not rare and iiuc MMF_PROC_SHARED will be set in this
case and never cleared.

> > > > --- a/kernel/fork.c
> > > > +++ b/kernel/fork.c
> > > > @@ -1403,6 +1403,15 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
> > > >  	if (clone_flags & CLONE_VM) {
> > > >  		mmget(oldmm);
> > > >  		mm = oldmm;
> > > > +		if (!(clone_flags & CLONE_SIGHAND)) {
> > >
> > > I agree with Christian, you need CLONE_THREAD
> >
> > This was my suggestion to Suren, likely because I've misrememberd which
> > clone flag is responsible for the signal delivery. But now, after double
> > checking we do explicitly disallow CLONE_SIGHAND && !CLONE_VM. So
> > CLONE_THREAD is the right thing to check.
>
> I have tried to remember but I have to say that after reading man page I
> am still confused. So what is the actual difference between CLONE_THREAD
> and CLONE_SIGHAND?

Well, CLONE_THREAD creates a sub-thread; it needs CLONE_SIGHAND/VM.

> Essentially all we care about from the OOM (and
> oom_score_adj) POV is that signals are delivered to all entities and
> that thay share signal struct.

Yes, but CLONE_SIGHAND doesn't share the signal struct.

CLONE_SIGHAND shares sighand_struct, iow it shares the signal handlers,
so that if (say) the child does signal(SIG, handler) this equally affects
the parent. This obviously means that CLONE_SIGHAND needs CLONE_VM.
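
As a rough sharing sketch (simplified; oom_score_adj{_min} live in
signal_struct, per include/linux/sched/signal.h):

task_struct
  ->mm       (mm_struct)       shared iff CLONE_VM
  ->sighand  (sighand_struct)  shared iff CLONE_SIGHAND (handlers only)
  ->signal   (signal_struct)   shared iff CLONE_THREAD; this is where
                               oom_score_adj and oom_score_adj_min live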

Oleg.
Michal Hocko Aug. 20, 2020, 11:42 a.m. UTC | #11
On Thu 20-08-20 13:30:23, Christian Brauner wrote:
> On Thu, Aug 20, 2020 at 01:13:49PM +0200, Michal Hocko wrote:
> > On Thu 20-08-20 12:55:56, Oleg Nesterov wrote:
> > > On 08/19, Suren Baghdasaryan wrote:
> > > >
> > > > Since the combination of CLONE_VM and !CLONE_SIGHAND is rarely
> > > > used the additional mutex lock in that path of the clone() syscall should
> > > > not affect its overall performance. Clearing the MMF_PROC_SHARED flag
> > > > (when the last process sharing the mm exits) is left out of this patch to
> > > > keep it simple and because it is believed that this threading model is
> > > > rare.
> > > 
> > > vfork() ?
> > 
> > Could you be more specific?
> 
> vfork() implies CLONE_VM but !CLONE_THREAD. The way this patch is
> written the mutex lock will be taken every time you do a vfork().

OK, I see. We definitely do not want to impact vfork, so we likely have
to check for CLONE_VFORK as well. Ohh well, our clone flags are really
clear as mud.

> (It's honestly also debatable whether it's that rare. For one, userspace
> stuff I maintain uses it too (see [1]).
> [1]: https://github.com/lxc/lxc/blob/9d3b7c97f0443adc9f0b0438437657ab42f5a1c3/src/lxc/start.c#L1676
> )
> 
> > 
> > > > --- a/kernel/fork.c
> > > > +++ b/kernel/fork.c
> > > > @@ -1403,6 +1403,15 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
> > > >  	if (clone_flags & CLONE_VM) {
> > > >  		mmget(oldmm);
> > > >  		mm = oldmm;
> > > > +		if (!(clone_flags & CLONE_SIGHAND)) {
> > > 
> > > I agree with Christian, you need CLONE_THREAD
> > 
> > This was my suggestion to Suren, likely because I've misremembered which
> > clone flag is responsible for the signal delivery. But now, after double
> > checking we do explicitly disallow CLONE_SIGHAND && !CLONE_VM. So
> > CLONE_THREAD is the right thing to check.
> > 
> > > > +			/* We need to synchronize with __set_oom_adj */
> > > > +			mutex_lock(&oom_adj_lock);
> > > > +			set_bit(MMF_PROC_SHARED, &mm->flags);
> > > > +			/* Update the values in case they were changed after copy_signal */
> > > > +			tsk->signal->oom_score_adj = current->signal->oom_score_adj;
> > > > +			tsk->signal->oom_score_adj_min = current->signal->oom_score_adj_min;
> > > > +			mutex_unlock(&oom_adj_lock);
> > > 
> > > I don't understand how this can close the race with __set_oom_adj...
> > > 
> > > What if __set_oom_adj() is called right after mutex_unlock() ? It will see
> > > MMF_PROC_SHARED, but for_each_process() won't find the new child until
> > > copy_process() does list_add_tail_rcu(&p->tasks, &init_task.tasks) ?
> > 
> > Good point. Then we will have to move this thing there.
> 
> I was toying with moving this into something like:
> 
> static inline copy_oom_score(unsigned long flags, struct task_struct *tsk)
> 
> trying to rely on set_bit() and test_bit() in copy_mm() being atomic and
> then calling it where Oleg said, after the point of no return.

No objections.
Christian Brauner Aug. 20, 2020, 11:47 a.m. UTC | #12
On Thu, Aug 20, 2020 at 01:29:32PM +0200, Michal Hocko wrote:
> On Thu 20-08-20 13:13:55, Michal Hocko wrote:
> > On Thu 20-08-20 12:55:56, Oleg Nesterov wrote:
> > > On 08/19, Suren Baghdasaryan wrote:
> > > >
> > > > Since the combination of CLONE_VM and !CLONE_SIGHAND is rarely
> > > > used the additional mutex lock in that path of the clone() syscall should
> > > > not affect its overall performance. Clearing the MMF_PROC_SHARED flag
> > > > (when the last process sharing the mm exits) is left out of this patch to
> > > > keep it simple and because it is believed that this threading model is
> > > > rare.
> > > 
> > > vfork() ?
> > 
> > Could you be more specific?
> > 
> > > > --- a/kernel/fork.c
> > > > +++ b/kernel/fork.c
> > > > @@ -1403,6 +1403,15 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
> > > >  	if (clone_flags & CLONE_VM) {
> > > >  		mmget(oldmm);
> > > >  		mm = oldmm;
> > > > +		if (!(clone_flags & CLONE_SIGHAND)) {
> > > 
> > > I agree with Christian, you need CLONE_THREAD
> > 
> > This was my suggestion to Suren, likely because I've misremembered which
> > clone flag is responsible for the signal delivery. But now, after double
> > checking we do explicitly disallow CLONE_SIGHAND && !CLONE_VM. So
> > CLONE_THREAD is the right thing to check.
> 
> I have tried to remember but I have to say that after reading the man page I
> am still confused. So what is the actual difference between CLONE_THREAD
> and CLONE_SIGHAND? Essentially all we care about from the OOM (and

CLONE_THREAD implies CLONE_SIGHAND
CLONE_SIGHAND implies CLONE_VM but CLONE_SIGHAND doesn't imply CLONE_THREAD.
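
Both implications are enforced by sanity checks near the top of
copy_process() (paraphrased from kernel/fork.c):

	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
		return ERR_PTR(-EINVAL);

	if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
		return ERR_PTR(-EINVAL);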

> oom_score_adj) POV is that signals are delivered to all entities and
> that they share signal struct. copy_signal is checking for CLONE_THREAD

If a thread has a separate sighand struct it can have separate handlers
(Oleg will correct me if I'm wrong.). But fatal signals will take the whole
thread-group down and can't be ignored, which is the only thing you care
about with OOM afair.
What you care about is that the oom_score_adj{_min} settings are shared,
they live in struct signal_struct, and whether that is shared or not
is basically guided by CLONE_THREAD.

> but CLONE_THREAD requires CLONE_SIGHAND AFAIU. So is there any case where
> checking for CLONE_SIGHAND would be wrong for our purpose?

Without having spent a long time thinking deeply about this, it likely
wouldn't be. But using CLONE_SIGHAND is very irritating since it doesn't
clearly express what you want this for. Especially since there's now a
difference between the checks in copy_signal() and copy_mm(), and a
disconnect from what is expressed in the commit message too, imho.

Christian
Eric W. Biederman Aug. 20, 2020, 12:34 p.m. UTC | #13
Suren Baghdasaryan <surenb@google.com> writes:

> [...]

So I am confused.

Is there any reason why we don't simply move signal->oom_score_adj to
mm->oom_score_adj and call it a day?

The problem in all of this appears to be that we want the score to be
per mm and we are setting it per ``process'' (aka signal_struct).

To maintain backwards compatibility I expect exec can be taught to
copy the oom_score_adj from one mm to another more simply than
we can synchronize oom_score_adj between all of the threads in the
thread group.

Eric

> [...]
Michal Hocko Aug. 20, 2020, 12:41 p.m. UTC | #14
On Thu 20-08-20 13:42:56, Michal Hocko wrote:
> On Thu 20-08-20 13:30:23, Christian Brauner wrote:
[...]
> > trying to rely on set_bit() and test_bit() in copy_mm() being atomic and
> > then calling it where Oleg said after the point of no return.
> 
> No objections.

Would something like the following work for you?

diff --git a/kernel/fork.c b/kernel/fork.c
index 9177a76bf840..25b83f0912a6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1403,15 +1403,6 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	if (clone_flags & CLONE_VM) {
 		mmget(oldmm);
 		mm = oldmm;
-		if (!(clone_flags & CLONE_SIGHAND)) {
-			/* We need to synchronize with __set_oom_adj */
-			mutex_lock(&oom_adj_lock);
-			set_bit(MMF_PROC_SHARED, &mm->flags);
-			/* Update the values in case they were changed after copy_signal */
-			tsk->signal->oom_score_adj = current->signal->oom_score_adj;
-			tsk->signal->oom_score_adj_min = current->signal->oom_score_adj_min;
-			mutex_unlock(&oom_adj_lock);
-		}
 		goto good_mm;
 	}
 
@@ -1818,6 +1809,19 @@ static __always_inline void delayed_free_task(struct task_struct *tsk)
 		free_task(tsk);
 }
 
+static void copy_oom_score_adj(u64 clone_flags, struct task_struct *tsk)
+{
+	/* Skip kernel threads, which have no mm */
+	if (!tsk->mm)
+		return;
+
+	if ((clone_flags & (CLONE_VM | CLONE_THREAD | CLONE_VFORK)) == CLONE_VM) {
+		/* We need to synchronize with __set_oom_adj */
+		mutex_lock(&oom_adj_lock);
+		set_bit(MMF_PROC_SHARED, &tsk->mm->flags);
+		/* Update the values in case they were changed after copy_signal */
+		tsk->signal->oom_score_adj = current->signal->oom_score_adj;
+		tsk->signal->oom_score_adj_min = current->signal->oom_score_adj_min;
+		mutex_unlock(&oom_adj_lock);
+	}
+}
+
 /*
  * This creates a new process as a copy of the old one,
  * but does not actually start it yet.
@@ -2290,6 +2294,8 @@ static __latent_entropy struct task_struct *copy_process(
 	trace_task_newtask(p, clone_flags);
 	uprobe_copy_process(p, clone_flags);
 
+	copy_oom_score_adj(clone_flags, p);
+
 	return p;
 
 bad_fork_cancel_cgroup:
Michal Hocko Aug. 20, 2020, 12:42 p.m. UTC | #15
On Thu 20-08-20 07:34:41, Eric W. Biederman wrote:
> Suren Baghdasaryan <surenb@google.com> writes:
> 
> > [...]
> 
> So I am confused.
> 
> Is there any reason why we don't simply move signal->oom_score_adj to
> mm->oom_score_adj and call it a day?

Yes. Please read through 44a70adec910 ("mm, oom_adj: make sure processes
sharing mm have same view of oom_score_adj")
Eric W. Biederman Aug. 20, 2020, 12:45 p.m. UTC | #16
Michal Hocko <mhocko@suse.com> writes:

> On Thu 20-08-20 07:34:41, Eric W. Biederman wrote:
>> Suren Baghdasaryan <surenb@google.com> writes:
>> 
>> > [...]
>> 
>> So I am confused.
>> 
>> Is there any reason why we don't simply move signal->oom_score_adj to
>> mm->oom_score_adj and call it a day?
>
> Yes. Please read through 44a70adec910 ("mm, oom_adj: make sure processes
> sharing mm have same view of oom_score_adj")

That explains why the scores are synchronized.

It doesn't explain why we don't do the much simpler thing and move
oom_score_adj from signal_struct to mm_struct. Which is my question.

Why not put the score where we need it to ensure that the oom score
is always synchronized?  AKA on the mm_struct, not the signal_struct.

Eric
Eric W. Biederman Aug. 20, 2020, 12:54 p.m. UTC | #17
ebiederm@xmission.com (Eric W. Biederman) writes:

> Michal Hocko <mhocko@suse.com> writes:
>
>> On Thu 20-08-20 07:34:41, Eric W. Biederman wrote:
>>> Suren Baghdasaryan <surenb@google.com> writes:
>>> 
>>> > [...]
>>> 
>>> So I am confused.
>>> 
>>> Is there any reason why we don't simply move signal->oom_score_adj to
>>> mm->oom_score_adj and call it a day?
>>
>> Yes. Please read through 44a70adec910 ("mm, oom_adj: make sure processes
>> sharing mm have same view of oom_score_adj")
>
> That explains why the scores are synchronized.
>
> It doesn't explain why we don't do the much simpler thing and move
> oom_score_adj from signal_struct to mm_struct. Which is my question.
>
> Why not put the score where we need it to ensure that the oom score
> is always synchronized?  AKA on the mm_struct, not the signal_struct.

Apologies.  That 44a70adec910 does describe that some people have seen
vfork users set oom_score.  No details unfortunately.

I will skip the part where posix calls this undefined behavior.  It
breaks userspace to change.

It still seems like the code should be able to buffer oom_adj during
vfork, and only move the value onto mm_struct during exec.

Eric
Michal Hocko Aug. 20, 2020, 1:26 p.m. UTC | #18
On Thu 20-08-20 07:54:44, Eric W. Biederman wrote:
> ebiederm@xmission.com (Eric W. Biederman) writes:
> 
> > Michal Hocko <mhocko@suse.com> writes:
> >
> >> On Thu 20-08-20 07:34:41, Eric W. Biederman wrote:
> >>> Suren Baghdasaryan <surenb@google.com> writes:
> >>> 
> >>> > [...]
> >>> 
> >>> So I am confused.
> >>> 
> >>> Is there any reason why we don't simply move signal->oom_score_adj to
> >>> mm->oom_score_adj and call it a day?
> >>
> >> Yes. Please read through 44a70adec910 ("mm, oom_adj: make sure processes
> >> sharing mm have same view of oom_score_adj")
> >
> > That explains why the scores are synchronized.
> >
> > It doesn't explain why we don't do the much simpler thing and move
> > oom_score_adj from signal_struct to mm_struct. Which is my question.
> >
> > Why not put the score where we need it to ensure that the oom score
> > is always synchronized?  AKA on the mm_struct, not the signal_struct.
> 
> Apologies.  That 44a70adec910 does describe that some people have seen
> vfork users set oom_score.  No details unfortunately.
> 
> I will skip the part where posix calls this undefined behavior.  It
> breaks userspace to change.
> 
> It still seems like the code should be able to buffer oom_adj during
> vfork, and only move the value onto mm_struct during exec.

If you can handle vfork by other means then I am all for it. There were
no patches in that regard proposed yet. Maybe it will turn out simpler
than the heavy lifting we have to do in the oom specific code.
Christian Brauner Aug. 20, 2020, 1:34 p.m. UTC | #19
On Thu, Aug 20, 2020 at 03:26:31PM +0200, Michal Hocko wrote:
> On Thu 20-08-20 07:54:44, Eric W. Biederman wrote:
> > ebiederm@xmission.com (Eric W. Biederman) writes:
> > 
> > > Michal Hocko <mhocko@suse.com> writes:
> > >
> > >> On Thu 20-08-20 07:34:41, Eric W. Biederman wrote:
> > >>> Suren Baghdasaryan <surenb@google.com> writes:
> > >>> 
> > >>> > [...]
> > >>> 
> > >>> So I am confused.
> > >>> 
> > >>> Is there any reason why we don't simply move signal->oom_score_adj to
> > >>> mm->oom_score_adj and call it a day?
> > >>
> > >> Yes. Please read through 44a70adec910 ("mm, oom_adj: make sure processes
> > >> sharing mm have same view of oom_score_adj")
> > >
> > > That explains why the scores are synchronized.
> > >
> > > It doesn't explain why we don't do the much simpler thing and move
> > > oom_score_adj from signal_struct to mm_struct. Which is my question.
> > >
> > > Why not put the score where we need it to ensure that the oom score
> > > is always synchronized?  AKA on the mm_struct, not the signal_struct.
> > 
> > Apologies.  That 44a70adec910 does describe that some people have seen
> > vfork users set oom_score.  No details unfortunately.
> > 
> > I will skip the part where posix calls this undefined behavior.  It
> > breaks userspace to change.
> > 
> > It still seems like the code should be able to buffer oom_adj during
> > vfork, and only move the value onto mm_struct during exec.
> 
> If you can handle vfork by other means then I am all for it. There were
> no patches in that regard proposed yet. Maybe it will turn out simpler
> than the heavy lifting we have to do in the oom specific code.

Eric's not wrong. I fiddled with this too this morning but since
oom_score_adj is fiddled with in a bunch of places this seemed way more
code churn than what's proposed here.

Christian
Eric W. Biederman Aug. 20, 2020, 1:41 p.m. UTC | #20
Michal Hocko <mhocko@suse.com> writes:

> On Thu 20-08-20 07:54:44, Eric W. Biederman wrote:
>> ebiederm@xmission.com (Eric W. Biederman) writes:
>> 
>> > Michal Hocko <mhocko@suse.com> writes:
>> >
>> >> On Thu 20-08-20 07:34:41, Eric W. Biederman wrote:
>> >>> Suren Baghdasaryan <surenb@google.com> writes:
>> >>> 
>> >>> > [...]
>> >>> 
>> >>> So I am confused.
>> >>> 
>> >>> Is there any reason why we don't simply move signal->oom_score_adj to
>> >>> mm->oom_score_adj and call it a day?
>> >>
>> >> Yes. Please read through 44a70adec910 ("mm, oom_adj: make sure processes
>> >> sharing mm have same view of oom_score_adj")
>> >
>> > That explains why the scores are synchronized.
>> >
>> > It doesn't explain why we don't do the much simpler thing and move
>> > oom_score_adj from signal_struct to mm_struct. Which is my question.
>> >
>> > Why not put the score where we need it to ensure that the oom score
>> > is always synchronized?  AKA on the mm_struct, not the signal_struct.
>> 
>> Apologies.  That 44a70adec910 does describe that some people have seen
>> vfork users set oom_score.  No details unfortunately.
>> 
>> I will skip the part where posix calls this undefined behavior.  It
>> breaks userspace to change.
>> 
>> It still seems like the code should be able to buffer oom_adj during
>> vfork, and only move the value onto mm_struct during exec.
>
> If you can handle vfork by other means then I am all for it. There were
> no patches in that regard proposed yet. Maybe it will turn out simpler
> than the heavy lifting we have to do in the oom specific code.

I expect something like this completely untested patch will work.

Eric


 fs/exec.c                    |    4 ++++
 fs/proc/base.c               |   30 ++++++------------------------
 include/linux/mm_types.h     |    4 ++++
 include/linux/sched/signal.h |    4 +---
 kernel/fork.c                |    3 +--
 mm/oom_kill.c                |   12 ++++++------
 6 files changed, 22 insertions(+), 35 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 9b723d2560d1..e7eed5212c6c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1139,6 +1139,10 @@ static int exec_mmap(struct mm_struct *mm)
 	vmacache_flush(tsk);
 	task_unlock(tsk);
 	if (old_mm) {
+		mm->oom_score_adj = old_mm->oom_score_adj;
+		mm->oom_score_adj_min = old_mm->oom_score_adj_min;
+		if (tsk->vfork_done)
+			mm->oom_score_adj = tsk->vfork_oom_score_adj;
 		mmap_read_unlock(old_mm);
 		BUG_ON(active_mm != old_mm);
 		setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 617db4e0faa0..795fa0a8db52 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1103,33 +1103,15 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 		}
 	}
 
-	task->signal->oom_score_adj = oom_adj;
-	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
-		task->signal->oom_score_adj_min = (short)oom_adj;
-	trace_oom_score_adj_update(task);
-
 	if (mm) {
 		struct task_struct *p;
 
-		rcu_read_lock();
-		for_each_process(p) {
-			if (same_thread_group(task, p))
-				continue;
-
-			/* do not touch kernel threads or the global init */
-			if (p->flags & PF_KTHREAD || is_global_init(p))
-				continue;
-
-			task_lock(p);
-			if (!p->vfork_done && process_shares_mm(p, mm)) {
-				p->signal->oom_score_adj = oom_adj;
-				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
-					p->signal->oom_score_adj_min = (short)oom_adj;
-			}
-			task_unlock(p);
-		}
-		rcu_read_unlock();
-		mmdrop(mm);
+		mm->oom_score_adj = oom_adj;
+		if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+			mm->oom_score_adj_min = (short)oom_adj;
+		trace_oom_score_adj_update(task);
+	} else {
+		task->signal->vfork_oom_score_adj = oom_adj;
 	}
 err_unlock:
 	mutex_unlock(&oom_adj_mutex);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 496c3ff97cce..b865048ab25a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -542,6 +542,10 @@ struct mm_struct {
 		atomic_long_t hugetlb_usage;
 #endif
 		struct work_struct async_put_work;
+
+		short oom_score_adj;		/* OOM kill score adjustment */
+		short oom_score_adj_min;	/* OOM kill score adjustment min value.
+					 * Only settable by CAP_SYS_RESOURCE. */
 	} __randomize_layout;
 
 	/*
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1bad18a1d8ba..a69eb9e0d247 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -218,9 +218,7 @@ struct signal_struct {
 	 * oom
 	 */
 	bool oom_flag_origin;
-	short oom_score_adj;		/* OOM kill score adjustment */
-	short oom_score_adj_min;	/* OOM kill score adjustment min value.
-					 * Only settable by CAP_SYS_RESOURCE. */
+	short vfork_oom_score_adj;		/* OOM kill score adjustment */
 	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
 					 * killed by the oom killer */
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 3049a41076f3..1ba4deaa2f98 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1584,8 +1584,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 	sched_autogroup_fork(sig);
 
-	sig->oom_score_adj = current->signal->oom_score_adj;
-	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
+	sig->vfork_oom_score_adj = current->mm->oom_score_adj;
 
 	mutex_init(&sig->cred_guard_mutex);
 	mutex_init(&sig->exec_update_mutex);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e90f25d6385d..0412f64e74c1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -213,7 +213,7 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
 	 * unkillable or have been already oom reaped or the are in
 	 * the middle of vfork
 	 */
-	adj = (long)p->signal->oom_score_adj;
+	adj = (long)p->mm->oom_score_adj;
 	if (adj == OOM_SCORE_ADJ_MIN ||
 			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
 			in_vfork(p)) {
@@ -403,7 +403,7 @@ static int dump_task(struct task_struct *p, void *arg)
 		task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
 		mm_pgtables_bytes(task->mm),
 		get_mm_counter(task->mm, MM_SWAPENTS),
-		task->signal->oom_score_adj, task->comm);
+		task->mm->oom_score_adj, task->comm);
 	task_unlock(task);
 
 	return 0;
@@ -452,7 +452,7 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
 {
 	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\n",
 		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order,
-			current->signal->oom_score_adj);
+			current->mm->oom_score_adj);
 	if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order)
 		pr_warn("COMPACTION is disabled!!!\n");
 
@@ -892,7 +892,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
 		K(get_mm_counter(mm, MM_FILEPAGES)),
 		K(get_mm_counter(mm, MM_SHMEMPAGES)),
 		from_kuid(&init_user_ns, task_uid(victim)),
-		mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj);
+		mm_pgtables_bytes(mm) >> 10, victim->mm->oom_score_adj);
 	task_unlock(victim);
 
 	/*
@@ -942,7 +942,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
  */
 static int oom_kill_memcg_member(struct task_struct *task, void *message)
 {
-	if (task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN &&
+	if (task->mm->oom_score_adj != OOM_SCORE_ADJ_MIN &&
 	    !is_global_init(task)) {
 		get_task_struct(task);
 		__oom_kill_process(task, message);
@@ -1089,7 +1089,7 @@ bool out_of_memory(struct oom_control *oc)
 	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
 	    current->mm && !oom_unkillable_task(current) &&
 	    oom_cpuset_eligible(current, oc) &&
-	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+	    current->mm->oom_score_adj != OOM_SCORE_ADJ_MIN) {
 		get_task_struct(current);
 		oc->chosen = current;
 		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
Christian Brauner Aug. 20, 2020, 1:43 p.m. UTC | #21
On Thu, Aug 20, 2020 at 02:41:09PM +0200, Michal Hocko wrote:
> On Thu 20-08-20 13:42:56, Michal Hocko wrote:
> > On Thu 20-08-20 13:30:23, Christian Brauner wrote:
> [...]
> > > trying to rely on set_bit() and test_bit() in copy_mm() being atomic and
> > > then calling it where Oleg said after the point of no return.
> > 
> > No objections.
> 
> Would something like the following work for you?
> 
> [...]

This should work, yes.
And keeps the patch reasonably small.

Christian
Tetsuo Handa Aug. 20, 2020, 1:48 p.m. UTC | #22
On 2020/08/20 22:34, Christian Brauner wrote:
> On Thu, Aug 20, 2020 at 03:26:31PM +0200, Michal Hocko wrote:
>> If you can handle vfork by other means then I am all for it. There were
>> no patches in that regard proposed yet. Maybe it will turn out simpler
>> than the heavy lifting we have to do in the oom specific code.
> 
> Eric's not wrong. I fiddled with this too this morning but since
> oom_score_adj is fiddled with in a bunch of places this seemed way more
> code churn than what's proposed here.

I prefer simply reverting commit 44a70adec910d692 ("mm, oom_adj: make sure
processes sharing mm have same view of oom_score_adj").

  https://lore.kernel.org/patchwork/patch/1037208/
Christian Brauner Aug. 20, 2020, 2 p.m. UTC | #23
On Thu, Aug 20, 2020 at 10:48:43PM +0900, Tetsuo Handa wrote:
> On 2020/08/20 22:34, Christian Brauner wrote:
> > On Thu, Aug 20, 2020 at 03:26:31PM +0200, Michal Hocko wrote:
> >> If you can handle vfork by other means then I am all for it. There were
> >> no patches in that regard proposed yet. Maybe it will turn out simpler
> >> than the heavy lifting we have to do in the oom specific code.
> > 
> > Eric's not wrong. I fiddled with this too this morning but since
> > oom_score_adj is fiddled with in a bunch of places this seemed way more
> > code churn than what's proposed here.
> 
> I prefer simply reverting commit 44a70adec910d692 ("mm, oom_adj: make sure
> processes sharing mm have same view of oom_score_adj").
> 
>   https://lore.kernel.org/patchwork/patch/1037208/

I guess this is a can of worms but just for the sake of getting more
background: the question seems to be whether the oom adj score is a
property of the task/thread-group or a property of the mm. I always
thought the oom score is a property of the task/thread-group and not the
mm which is also why it lives in struct signal_struct and not in struct
mm_struct. But

44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")

reads like it is supposed to be a property of the mm or at least the
change makes it so.

Christian
Oleg Nesterov Aug. 20, 2020, 2:04 p.m. UTC | #24
On 08/20, Eric W. Biederman wrote:
>
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1139,6 +1139,10 @@ static int exec_mmap(struct mm_struct *mm)
>  	vmacache_flush(tsk);
>  	task_unlock(tsk);
>  	if (old_mm) {
> +		mm->oom_score_adj = old_mm->oom_score_adj;
> +		mm->oom_score_adj_min = old_mm->oom_score_adj_min;
> +		if (tsk->vfork_done)
> +			mm->oom_score_adj = tsk->vfork_oom_score_adj;

too late, ->vfork_done is NULL after mm_release().

And this can race with __set_oom_adj(). Yes, the current code is racy too,
but this change adds another race, __set_oom_adj() could already observe
->mm != NULL and update mm->oom_score_adj.

Oleg.
Michal Hocko Aug. 20, 2020, 2:12 p.m. UTC | #25
On Thu 20-08-20 08:41:18, Eric W. Biederman wrote:
[...]
> I expect something like this completely untested patch will work.

Neat. I think you will need to duplicate oom_score_adj_min as well.
I am not familiar enough with vfork specifics to review this in depth but if
this is correct then it is much more effective than the existing
implementation which has focused more on not spreading the oom specifics
to the core code. So if the state duplication is deemed ok then no
objections from me.

Thanks!

> 
> Eric
> 
> [...]
Michal Hocko Aug. 20, 2020, 2:15 p.m. UTC | #26
On Thu 20-08-20 16:00:54, Christian Brauner wrote:
> On Thu, Aug 20, 2020 at 10:48:43PM +0900, Tetsuo Handa wrote:
> > [...]
> 
> I guess this is a can of worms but just for the sake of getting more
> background: the question seems to be whether the oom adj score is a
> property of the task/thread-group or a property of the mm. I always
> thought the oom score is a property of the task/thread-group and not the
> mm which is also why it lives in struct signal_struct and not in struct
> mm_struct. But

I would tend to agree that from the userspace POV it is nice to look at
oom tuning per process but fundamentally the oom killer operates on the
address space much more than other resources bound to a process because
it is usually the address space hogging the largest portion of the
memory footprint. This is the reason why the oom killer has been
evaluating tasks based on that aspect rather than other potential memory
consumers bound to a task. Mostly due to lack of means to evaluate
those.

> 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
> 
> reads like it is supposed to be a property of the mm or at least the
> change makes it so.

Yes, based on the current and historical behavior of the oom killer.
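The heuristic in question is visible in oom_badness() (parts of which appear
in Eric's patch above). Roughly, in kernels of this era:

	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
		 mm_pgtables_bytes(p->mm) / PAGE_SIZE;
	/* Normalize to oom_score_adj units */
	adj *= totalpages / 1000;
	points += adj;

That is, a task's badness is dominated by its mm's RSS, swap and page-table
footprint, with oom_score_adj acting as a bias of up to about +/- totalpages.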
Tetsuo Handa Aug. 20, 2020, 2:18 p.m. UTC | #27
On 2020/08/20 23:00, Christian Brauner wrote:
> On Thu, Aug 20, 2020 at 10:48:43PM +0900, Tetsuo Handa wrote:
>> [...]
> 
> I guess this is a can of worms but just for the sake of getting more
> background: the question seems to be whether the oom adj score is a
> property of the task/thread-group or a property of the mm. I always
> thought the oom score is a property of the task/thread-group and not the
> mm which is also why it lives in struct signal_struct and not in struct
> mm_struct. But
> 
> 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
> 
> reads like it is supposed to be a property of the mm or at least the
> change makes it so.

Yes, 44a70adec910 is trying to go towards changing from a property of the task/thread-group
to a property of mm. But I don't think we need to do it at the cost of "__set_oom_adj() latency
Yong-Taek Lee and Tim Murray have reported" and "complexity for supporting
vfork() => __set_oom_adj() => execve() sequence".
Tetsuo Handa Aug. 20, 2020, 2:26 p.m. UTC | #28
On 2020/08/20 23:15, Michal Hocko wrote:
> I would tend to agree that from the userspace POV it is nice to look at
> oom tuning per process but fundamentally the oom killer operates on the
> address space much more than other resources bound to a process because
> it is usually the address space hogging the largest portion of the
> memory footprint. This is the reason why the oom killer has been
> evaluating tasks based on that aspect rather than other potential memory
> consumers bound to a task. Mostly due to lack of means to evaluate
> those.

We already allow specifying potential memory consumers via oom_task_origin().

If we change from a property of the task/thread-group to a property of mm,
we won't be able to add means to adjust oom score based on other potential
memory consumers bound to a task (e.g. pipes) in the future.
Michal Hocko Aug. 20, 2020, 2:34 p.m. UTC | #29
On Thu 20-08-20 23:26:29, Tetsuo Handa wrote:
> On 2020/08/20 23:15, Michal Hocko wrote:
> > [...]
> 
> We already allow specifying potential memory consumers via oom_task_origin().

oom_task_origin is a single-purpose hack to handle the swapoff situation
more gracefully. It is by no means something to base the behavior on.

> If we change from a property of the task/thread-group to a property of mm,
> we won't be able to add means to adjust oom score based on other potential
> memory consumers bound to a task (e.g. pipes) in the future.

While that would be really nice to achieve, I am not really sure it is
feasible, mostly because accounting shared resources like pipes, and fd
based resources in general, is really hard to do right without any
surprises. Pipes are not really bound to a specific process; you are
free to hand the fd over to a different process, for example.
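To make the hand-over concrete, a minimal sketch (illustrative only,
unrelated to the patch): a pipe fd shipped to another process over a unix
socket with SCM_RIGHTS, after which "which task should the pipe be charged
to" has no clear answer.

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

/* pass fd to the peer of a unix socket */
static void send_fd(int sock, int fd)
{
	char dummy = 'x';
	struct iovec iov = { &dummy, 1 };
	union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
	sendmsg(sock, &msg, 0);
}

static int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { &dummy, 1 };
	union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd = -1;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);
	if (recvmsg(sock, &msg, 0) < 0)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg && cmsg->cmsg_type == SCM_RIGHTS)
		memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

int main(void)
{
	int sv[2], pfd[2];
	char buf[8] = "";

	socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
	pipe(pfd);

	if (fork() == 0) {
		/* the pipe now "belongs" to this process as much as
		 * to the one that created it */
		int fd = recv_fd(sv[1]);
		read(fd, buf, sizeof(buf) - 1);
		printf("got over the socket: %s\n", buf);
		_exit(0);
	}
	send_fd(sv[0], pfd[0]);
	write(pfd[1], "hello", 5);
	wait(NULL);
	return 0;
}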
Oleg Nesterov Aug. 20, 2020, 2:36 p.m. UTC | #30
On 08/20, Oleg Nesterov wrote:
>
> On 08/20, Eric W. Biederman wrote:
> >
> > --- a/fs/exec.c
> > +++ b/fs/exec.c
> > @@ -1139,6 +1139,10 @@ static int exec_mmap(struct mm_struct *mm)
> >  	vmacache_flush(tsk);
> >  	task_unlock(tsk);
> >  	if (old_mm) {
> > +		mm->oom_score_adj = old_mm->oom_score_adj;
> > +		mm->oom_score_adj_min = old_mm->oom_score_adj_min;
> > +		if (tsk->vfork_done)
> > +			mm->oom_score_adj = tsk->vfork_oom_score_adj;
>
> too late, ->vfork_done is NULL after mm_release().
>
> And this can race with __set_oom_adj(). Yes, the current code is racy too,
> but this change adds another race, __set_oom_adj() could already observe
> ->mm != NULL and update mm->oom_score_adj.
  ^^^^^^^^^^^^

I meant ->mm == new_mm.

And another problem. Suppose we have

	if (!vfork()) {
		change_oom_score();
		exec();
	}

the parent can be killed before the child execs, in this case vfork_oom_score_adj
will be lost.

Oleg.
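For completeness, a runnable version of that sequence (a hypothetical
reproducer, nothing more; POSIX only guarantees execve()/_exit() in a vfork()
child, which is the undefined behavior Eric mentioned, and lowering the value
needs CAP_SYS_RESOURCE):

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = vfork();

	if (pid == 0) {
		/* child: still sharing the parent's mm at this point */
		int fd = open("/proc/self/oom_score_adj", O_WRONLY);
		if (fd >= 0) {
			write(fd, "-1000", 5);
			close(fd);
		}
		execl("/bin/true", "true", (char *)NULL);
		_exit(1);
	}
	/* if the parent is killed before the child execs, the buffered
	 * value is lost, which is the scenario described above */
	return 0;
}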
Eric W. Biederman Aug. 20, 2020, 2:43 p.m. UTC | #31
Oleg Nesterov <oleg@redhat.com> writes:

> On 08/20, Eric W. Biederman wrote:
>>
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1139,6 +1139,10 @@ static int exec_mmap(struct mm_struct *mm)
>>  	vmacache_flush(tsk);
>>  	task_unlock(tsk);
>>  	if (old_mm) {
>> +		mm->oom_score_adj = old_mm->oom_score_adj;
>> +		mm->oom_score_adj_min = old_mm->oom_score_adj_min;
>> +		if (tsk->vfork_done)
>> +			mm->oom_score_adj = tsk->vfork_oom_score_adj;
>
> too late, ->vfork_done is NULL after mm_release().

Good point.  

> And this can race with __set_oom_adj(). Yes, the current code is racy too,
> but this change adds another race, __set_oom_adj() could already observe
> ->mm != NULL and update mm->oom_score_adj.

I am not certain about races but we should be able to do something like:

in exec_mmap:
	if (old_mm) {
		mm->oom_score_adj = old_mm->oom_score_adj;
		mm->oom_score_adj_min = old_mm->oom_score_adj_min;
		if (tsk->signal->vfork_oom_score_adj_set) {
			mm->oom_score_adj = tsk->signal->vfork_oom_score_adj;
			tsk->signal->vfork_oom_score_adj_set = false;
		}
	}

in __set_oom_adj:
	if (mm) {
		mm->oom_score_adj = oom_adj;
		tsk->signal->vfork_oom_score_adj_set = false;
	} else {
		/* vfork child: buffer the value until exec */
		tsk->signal->vfork_oom_score_adj = oom_adj;
		tsk->signal->vfork_oom_score_adj_set = true;
	}

There might even be a special oom_score_adj value we can use instead of
a separate flag.  I am just not familiar enough with oom_score_adj to know.

We should be able to do something like that where we know the value is
set and only use it if so.  And a subsequent __set_oom_adj without
observing vfork_done set will clear the value in signal_struct.

We have to be a bit careful to get the details right but it should be
straightforward.


Michal also has a point about oom_score_adj_min, and I really don't
understand the oom logic well enough to guess how that should
work.


Although to deal with some of the races it probably only makes sense
to call complete_vfork_done in exec after the new mm has been installed,
and while exec_update_mutex is held.  I don't think anyone ever
anticipated using vfork_done as a flag.

Eric
Eric W. Biederman Aug. 20, 2020, 2:49 p.m. UTC | #32
Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> writes:

> On 2020/08/20 23:00, Christian Brauner wrote:
>> [...]
>
> Yes, 44a70adec910 is trying to go towards changing from a property of the task/thread-group
> to a property of mm. But I don't think we need to do it at the cost of "__set_oom_adj() latency
> Yong-Taek Lee and Tim Murray have reported" and "complexity for supporting
> vfork() => __set_oom_adj() => execve() sequence".

The thing is commit 44a70adec910d692 ("mm, oom_adj: make sure processes
sharing mm have same view of oom_score_adj") has been in the tree for 4
years.

That someone is just now noticing a regression is their problem.  The
change in semantics is done and decided.  We cannot reasonably revert
at this point without risking other regressions.

Given that the decision has already been made to make oom_adj
effectively per mm, there is no point in having a debate about whether
we should do it.

Eric
Eric W. Biederman Aug. 20, 2020, 3:06 p.m. UTC | #33
Oleg Nesterov <oleg@redhat.com> writes:

> On 08/20, Oleg Nesterov wrote:
>>
>> On 08/20, Eric W. Biederman wrote:
>> >
>> > --- a/fs/exec.c
>> > +++ b/fs/exec.c
>> > @@ -1139,6 +1139,10 @@ static int exec_mmap(struct mm_struct *mm)
>> >  	vmacache_flush(tsk);
>> >  	task_unlock(tsk);
>> >  	if (old_mm) {
>> > +		mm->oom_score_adj = old_mm->oom_score_adj;
>> > +		mm->oom_score_adj_min = old_mm->oom_score_adj_min;
>> > +		if (tsk->vfork_done)
>> > +			mm->oom_score_adj = tsk->vfork_oom_score_adj;
>>
>> too late, ->vfork_done is NULL after mm_release().
>>
>> And this can race with __set_oom_adj(). Yes, the current code is racy too,
>> but this change adds another race, __set_oom_adj() could already observe
>> ->mm != NULL and update mm->oom_score_adj.
>   ^^^^^^^^^^^^
>
> I meant ->mm == new_mm.
>
> And another problem. Suppose we have
>
> 	if (!vfork()) {
> 		change_oom_score();
> 		exec();
> 	}
>
> the parent can be killed before the child execs, in this case vfork_oom_score_adj
> will be lost.

Yes.

Looking at include/uapi/linux/oom.h it appears that there are a lot of
oom_score_adj values that are reserved.  So it should be completely
possible to initialize vfork_oom_score_adj to -32768 aka SHRT_MIN, and
use that as a flag to see if it is active or not.

Likewise for vfork_oom_score_adj_min if we need to duplicate that one as
well.


That deals with that entire class of race.  We still have races during
exec about vfork_done being cleared before the new ->mm == new_mm.
While that is worth fixing, it is an independent issue.

Eric
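Since valid adjustments are confined to OOM_SCORE_ADJ_MIN..OOM_SCORE_ADJ_MAX
(-1000..1000), the sentinel idea could look like the following sketch (the
names here are assumptions, not an existing kernel API):

#include <limits.h>
#include <stdbool.h>

/* anything outside the valid -1000..1000 window works as "unset";
 * SHRT_MIN (-32768) can never be a legitimate buffered value */
#define VFORK_OOM_SCORE_ADJ_UNSET	SHRT_MIN

static inline bool vfork_oom_score_adj_is_set(short adj)
{
	return adj != VFORK_OOM_SCORE_ADJ_UNSET;
}

exec_mmap() would then transfer the buffered value only when this test
passes, making the separate _set flag unnecessary.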
Christian Brauner Aug. 20, 2020, 3:06 p.m. UTC | #34
On Thu, Aug 20, 2020 at 09:49:11AM -0500, Eric W. Biederman wrote:
> Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> writes:
> 
> > [...]
> >
> > Yes, 44a70adec910 is trying to go towards changing from a property of the task/thread-group
> > to a property of mm. But I don't think we need to do it at the cost of "__set_oom_adj() latency
> > Yong-Taek Lee and Tim Murray have reported" and "complexity for supporting
> > vfork() => __set_oom_adj() => execve() sequence".
> 
> The thing is commit 44a70adec910d692 ("mm, oom_adj: make sure processes
> sharing mm have same view of oom_score_adj") has been in the tree for 4
> years.
> 
> That someone is just now noticing a regression is their problem.  The
> change in semantics is done and decided.  We cannot reasonably revert
> at this point without risking other regressions.
> 
> Given that the decision has already been made to make oom_adj
> effectively per mm, there is no point in having a debate about whether
> we should do it.

I mean yeah, I think no-one really was going to jump on the revert-train.

At least for me the historical background was quite important to know
actually. The fact that by sharing the mm one can effectively be bound
to the oom score of another thread-group is kinda surprising and the
oom_score_adj section in man proc doesn't mention this.

And I really think we need to document this, either in the clone() man
page or in the oom_score_adj section of man proc...

Christian
Suren Baghdasaryan Aug. 20, 2020, 3:56 p.m. UTC | #35
On Thu, Aug 20, 2020 at 7:53 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> writes:
>
> > On 2020/08/20 23:00, Christian Brauner wrote:
> >> On Thu, Aug 20, 2020 at 10:48:43PM +0900, Tetsuo Handa wrote:
> >>> On 2020/08/20 22:34, Christian Brauner wrote:
> >>>> On Thu, Aug 20, 2020 at 03:26:31PM +0200, Michal Hocko wrote:
> >>>>> If you can handle vfork by other means then I am all for it. There were
> >>>>> no patches in that regard proposed yet. Maybe it will turn out simpler
> >>>>> than the heavy lifting we have to do in the oom-specific code.
> >>>>
> >>>> Eric's not wrong. I fiddled with this too this morning but since
> >>>> oom_score_adj is fiddled with in a bunch of places this seemed way more
> >>>> code churn than what's proposed here.
> >>>
> >>> I prefer simply reverting commit 44a70adec910d692 ("mm, oom_adj: make sure
> >>> processes sharing mm have same view of oom_score_adj").
> >>>
> >>>   https://lore.kernel.org/patchwork/patch/1037208/
> >>
> >> I guess this is a can of worms but just for the sake of getting more
> >> background: the question seems to be whether the oom adj score is a
> >> property of the task/thread-group or a property of the mm. I always
> >> thought the oom score is a property of the task/thread-group and not the
> >> mm which is also why it lives in struct signal_struct and not in struct
> >> mm_struct. But
> >>
> >> 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
> >>
> >> reads like it is supposed to be a property of the mm or at least the
> >> change makes it so.
> >
> > Yes, 44a70adec910 is trying to go towards changing from a property of the task/thread-group
> > to a property of mm. But I don't think we need to do it at the cost of "__set_oom_adj() latency
> > Yong-Taek Lee and Tim Murray have reported" and "complexity for supporting
> > vfork() => __set_oom_adj() => execve() sequence".
>
> The thing is commit 44a70adec910d692 ("mm, oom_adj: make sure processes
> sharing mm have same view of oom_score_adj") has been in the tree for 4
> years.
>
> That someone is just now noticing a regression is their problem.  The
> change in semantics is done and decided.  We cannot reasonably revert
> at this point without risking other regressions.
>
> Given that the decision has already been made to make oom_adj
> effectively per mm, there is no point in having a debate about whether
> we should do it.

Catching up on the discussion which was going on while I was asleep...
So it sounds like there is a consensus that oom_adj should be moved to
mm_struct rather than trying to synchronize it among tasks sharing mm.
That sounds reasonable to me too. Michal answered all the earlier
questions about this patch, so I won't be reiterating them, thanks
Michal. If any questions are still lingering about the original patch
I'll be glad to answer them.

>
> Eric
>
>
Michal Hocko Aug. 20, 2020, 4:26 p.m. UTC | #36
On Thu 20-08-20 08:56:53, Suren Baghdasaryan wrote:
[...]
> Catching up on the discussion which was going on while I was asleep...
> So it sounds like there is a consensus that oom_adj should be moved to
> mm_struct rather than trying to synchronize it among tasks sharing mm.
> That sounds reasonable to me too. Michal answered all the earlier
> questions about this patch, so I won't be reiterating them, thanks
> Michal. If any questions are still lingering about the original patch
> I'll be glad to answer them.

I think it still makes some sense to go with a simpler (aka less tricky)
solution which would be your original patch with an incremental fix for
vfork and the proper ordering (http://lkml.kernel.org/r/20200820124109.GI5033@dhcp22.suse.cz)
and then make a more complex shift to mm struct on top of that. The
former will be less tricky to backport to stable IMHO.
Christian Brauner Aug. 20, 2020, 4:29 p.m. UTC | #37
On Thu, Aug 20, 2020 at 06:26:45PM +0200, Michal Hocko wrote:
> On Thu 20-08-20 08:56:53, Suren Baghdasaryan wrote:
> [...]
> > Catching up on the discussion which was going on while I was asleep...
> > So it sounds like there is a consensus that oom_adj should be moved to
> > mm_struct rather than trying to synchronize it among tasks sharing mm.
> > That sounds reasonable to me too. Michal answered all the earlier
> > questions about this patch, so I won't be reiterating them, thanks
> > Michal. If any questions are still lingering about the original patch
> > I'll be glad to answer them.
> 
> I think it still makes some sense to go with a simpler (aka less tricky)
> solution which would be your original patch with an incremental fix for
> vfork and the proper ordering (http://lkml.kernel.org/r/20200820124109.GI5033@dhcp22.suse.cz)
> and then make a more complex shift to mm struct on top of that. The
> former will be less tricky to backport to stable IMHO.

/me nods

Christian
Suren Baghdasaryan Aug. 20, 2020, 4:47 p.m. UTC | #38
On Thu, Aug 20, 2020 at 9:29 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Thu, Aug 20, 2020 at 06:26:45PM +0200, Michal Hocko wrote:
> > On Thu 20-08-20 08:56:53, Suren Baghdasaryan wrote:
> > [...]
> > > Catching up on the discussion which was going on while I was asleep...
> > > So it sounds like there is a consensus that oom_adj should be moved to
> > > mm_struct rather than trying to synchronize it among tasks sharing mm.
> > > That sounds reasonable to me too. Michal answered all the earlier
> > > questions about this patch, so I won't be reiterating them, thanks
> > > Michal. If any questions are still lingering about the original patch
> > > I'll be glad to answer them.
> >
> > I think it still makes some sense to go with a simpler (aka less tricky)
> > solution which would be your original patch with an incremental fix for
> > vfork and the proper ordering (http://lkml.kernel.org/r/20200820124109.GI5033@dhcp22.suse.cz)
> > and then make a more complex shift to mm struct on top of that. The
> > former will be less tricky to backport to stable IMHO.
>
> /me nods

Ah, ok. Then I'll incorporate these changes, re-test and re-post as v2. Thanks!

>
> Christian
Eric W. Biederman Aug. 21, 2020, 4:39 a.m. UTC | #39
Michal Hocko <mhocko@suse.com> writes:

> On Thu 20-08-20 08:56:53, Suren Baghdasaryan wrote:
> [...]
>> Catching up on the discussion which was going on while I was asleep...
>> So it sounds like there is a consensus that oom_adj should be moved to
>> mm_struct rather than trying to synchronize it among tasks sharing mm.
>> That sounds reasonable to me too. Michal answered all the earlier
>> questions about this patch, so I won't be reiterating them, thanks
>> Michal. If any questions are still lingering about the original patch
>> I'll be glad to answer them.
>
> I think it still makes some sense to go with a simpler (aka less tricky)
> solution which would be your original patch with an incremental fix for
> vfork and the proper ordering (http://lkml.kernel.org/r/20200820124109.GI5033@dhcp22.suse.cz)
> and then make a more complex shift to mm struct on top of that. The
> former will be less tricky to backport to stable IMHO.

So I am confused.

I don't know how a subtle dependency on something in clone
is better than something flat-footed in exec.


That said if we are going for a small change why not:

	/*
	 * Make sure we will check other processes sharing the mm if this is
	 * not vfork which wants its own oom_score_adj.
	 * pin the mm so it doesn't go away and get reused after task_unlock
	 */
	if (!task->vfork_done) {
		struct task_struct *p = find_lock_task_mm(task);

		if (p) {
-			if (atomic_read(&p->mm->mm_users) > 1) {
+			if (atomic_read(&p->mm->mm_users) > p->signal->nr_threads) {
				mm = p->mm;
				mmgrab(mm);
			}
			task_unlock(p);
		}
	}

That would seem to be the minimal change to make this happen.  That has
the advantage that if a process does vfork it won't always have to
take the slow path.

Moving to the mm_struct is much less racy but this is simple.
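
Roughly, the condition could be wrapped like this (hypothetical helper,
name invented here, not part of any posted patch):

	/* More mm users than threads in this thread-group means some
	 * other process shares the mm, so an oom_score_adj update has
	 * to be propagated beyond the current signal struct. */
	static inline bool mm_shared_with_other_processes(struct task_struct *p)
	{
		return atomic_read(&p->mm->mm_users) > p->signal->nr_threads;
	}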

Eric
Michal Hocko Aug. 21, 2020, 7:17 a.m. UTC | #40
On Thu 20-08-20 23:39:25, Eric W. Biederman wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Thu 20-08-20 08:56:53, Suren Baghdasaryan wrote:
> > [...]
> >> Catching up on the discussion which was going on while I was asleep...
> >> So it sounds like there is a consensus that oom_adj should be moved to
> >> mm_struct rather than trying to synchronize it among tasks sharing mm.
> >> That sounds reasonable to me too. Michal answered all the earlier
> >> questions about this patch, so I won't be reiterating them, thanks
> >> Michal. If any questions are still lingering about the original patch
> >> I'll be glad to answer them.
> >
> > I think it still makes some sense to go with a simpler (aka less tricky)
> > solution which would be your original patch with an incremental fix for
> > vfork and the proper ordering (http://lkml.kernel.org/r/20200820124109.GI5033@dhcp22.suse.cz)
> > and then make a more complex shift to mm struct on top of that. The
> > former will be less tricky to backport to stable IMHO.
> 
> So I am confused.
> 
> I don't know how a subtle dependency on something in clone
> is better than something flat-footed in exec.

Well, I think that setting a flag is an easier approach than handling all
the special cases for the mm thing. But this is likely because this is
not my domain, so my judgment call might be misled. Anyway, if there is a
general consensus that doing the middle step is not worth it, I am not
going to object.

> That said if we are going for a small change why not:
> 
> 	/*
> 	 * Make sure we will check other processes sharing the mm if this is
> 	 * not vfork which wants its own oom_score_adj.
> 	 * pin the mm so it doesn't go away and get reused after task_unlock
> 	 */
> 	if (!task->vfork_done) {
> 		struct task_struct *p = find_lock_task_mm(task);
> 
> 		if (p) {
> -			if (atomic_read(&p->mm->mm_users) > 1) {
> +			if (atomic_read(&p->mm->mm_users) > p->signal->nr_threads) {
> 				mm = p->mm;
> 				mmgrab(mm);
> 			}
> 			task_unlock(p);
> 		}
> 	}

I remember playing with something like that but it had problems too. I
do not remember the details. Oleg would know better.
Oleg Nesterov Aug. 21, 2020, 11:15 a.m. UTC | #41
On 08/20, Eric W. Biederman wrote:
>
> That said if we are going for a small change why not:
>
> 	/*
> 	 * Make sure we will check other processes sharing the mm if this is
> 	 * not vfork which wants its own oom_score_adj.
> 	 * pin the mm so it doesn't go away and get reused after task_unlock
> 	 */
> 	if (!task->vfork_done) {
> 		struct task_struct *p = find_lock_task_mm(task);
>
> 		if (p) {
> -			if (atomic_read(&p->mm->mm_users) > 1) {
> +			if (atomic_read(&p->mm->mm_users) > p->signal->nr_threads) {

In theory this needs a barrier to avoid the race with do_exit(). And I'd
suggest using signal->live; I think signal->nr_threads should die...
Something like

	bool probably_has_other_mm_users(tsk)
	{
		return	atomic_read_acquire(&tsk->mm->mm_users) >
			atomic_read(&tsk->signal->live);
	}

The barrier implied by _acquire ensures that if we race with the exiting
task and see the result of exit_mm()->mmput(mm), then we must also see
the result of atomic_dec_and_test(signal->live).
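
Roughly the intended pairing (illustrative sketch; do_exit() decrements
signal->live before exit_mm() drops the mm_users reference):

	exiting task, do_exit():		reader:

	atomic_dec_and_test(&sig->live);
	...
	exit_mm();
	  mmput(mm);  /* mm_users-- */		users = atomic_read_acquire(&mm->mm_users);
						live  = atomic_read(&sig->live);

If the acquire load observes the mmput() decrement, the signal->live
decrement that preceded it must be visible as well, so an exiting thread
cannot make "users > live" appear true on its own.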

Either way, if we want to fix the race with clone(CLONE_VM) we need other
changes.

Oleg.
Suren Baghdasaryan Aug. 21, 2020, 3:28 p.m. UTC | #42
On Fri, Aug 21, 2020 at 4:16 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 08/20, Eric W. Biederman wrote:
> >
> > That said if we are going for a small change why not:
> >
> >       /*
> >        * Make sure we will check other processes sharing the mm if this is
> >        * not vfork which wants its own oom_score_adj.
> >        * pin the mm so it doesn't go away and get reused after task_unlock
> >        */
> >       if (!task->vfork_done) {
> >               struct task_struct *p = find_lock_task_mm(task);
> >
> >               if (p) {
> > -                     if (atomic_read(&p->mm->mm_users) > 1) {
> > +                     if (atomic_read(&p->mm->mm_users) > p->signal->nr_threads) {
>
> In theory this needs a barrier to avoid the race with do_exit(). And I'd
> suggest using signal->live; I think signal->nr_threads should die...
> Something like
>
>         bool probably_has_other_mm_users(tsk)
>         {
>                 return  atomic_read_acquire(&tsk->mm->mm_users) >
>                         atomic_read(&tsk->signal->live);
>         }
>
> The barrier implied by _acquire ensures that if we race with the exiting
> task and see the result of exit_mm()->mmput(mm), then we must also see
> the result of atomic_dec_and_test(signal->live).
>
> Either way, if we want to fix the race with clone(CLONE_VM) we need other
> changes.

The way I understand this condition in __set_oom_adj() sync logic is
that we would be ok with false positives (when we loop unnecessarily)
but we can't tolerate false negatives (when oom_score_adj gets out of
sync). With the clone(CLONE_VM) race not addressed we are allowing
false negatives and IMHO that's not acceptable because it creates a
possibility for userspace to get an inconsistent picture. When
developing the patch I did think about using (p->mm->mm_users >
p->signal->nr_threads) condition and had to reject it due to that
reason.

>
> Oleg.
>
Suren Baghdasaryan Aug. 21, 2020, 4:06 p.m. UTC | #43
On Fri, Aug 21, 2020 at 8:28 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Aug 21, 2020 at 4:16 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 08/20, Eric W. Biederman wrote:
> > >
> > > That said if we are going for a small change why not:
> > >
> > >       /*
> > >        * Make sure we will check other processes sharing the mm if this is
> > >        * not vfork which wants its own oom_score_adj.
> > >        * pin the mm so it doesn't go away and get reused after task_unlock
> > >        */
> > >       if (!task->vfork_done) {
> > >               struct task_struct *p = find_lock_task_mm(task);
> > >
> > >               if (p) {
> > > -                     if (atomic_read(&p->mm->mm_users) > 1) {
> > > +                     if (atomic_read(&p->mm->mm_users) > p->signal->nr_threads) {
> >
> > In theory this needs a barrier to avoid the race with do_exit(). And I'd
> > suggest using signal->live; I think signal->nr_threads should die...
> > Something like
> >
> >         bool probably_has_other_mm_users(tsk)
> >         {
> >                 return  atomic_read_acquire(&tsk->mm->mm_users) >
> >                         atomic_read(&tsk->signal->live);
> >         }
> >
> > The barrier implied by _acquire ensures that if we race with the exiting
> > task and see the result of exit_mm()->mmput(mm), then we must also see
> > the result of atomic_dec_and_test(signal->live).
> >
> > Either way, if we want to fix the race with clone(CLONE_VM) we need other
> > changes.
>
> The way I understand this condition in __set_oom_adj() sync logic is
> that we would be ok with false positives (when we loop unnecessarily)
> but we can't tolerate false negatives (when oom_score_adj gets out of
> sync). With the clone(CLONE_VM) race not addressed we are allowing
> false negatives and IMHO that's not acceptable because it creates a
> possibility for userspace to get an inconsistent picture. When
> developing the patch I did think about using (p->mm->mm_users >
> p->signal->nr_threads) condition and had to reject it due to that
> reason.
>

Actually, reviewing again and considering where list_add_tail_rcu is
happening, maybe the race with clone(CLONE_VM) does not introduce
false negatives. However a false negative I think will happen when a
task shares mm with another task and also has an additional thread.
Shared mm will increment mm_users without adding to signal->live and
the additional thread will advance signal->live without adding to
mm_users. As a result these increments will balance themselves and
(mm->mm_users > signal->live) condition will yield false negative.

> >
> > Oleg.
> >
Oleg Nesterov Aug. 21, 2020, 4:33 p.m. UTC | #44
On 08/21, Suren Baghdasaryan wrote:
>
> On Fri, Aug 21, 2020 at 4:16 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> >         bool probably_has_other_mm_users(tsk)
> >         {
> >                 return  atomic_read_acquire(&tsk->mm->mm_users) >
> >                         atomic_read(&tsk->signal->live);
> >         }
> >
> > The barrier implied by _acquire ensures that if we race with the exiting
> > task and see the result of exit_mm()->mmput(mm), then we must also see
> > the result of atomic_dec_and_test(signal->live).
> >
> > Either way, if we want to fix the race with clone(CLONE_VM) we need other
> > changes.
>
> The way I understand this condition in __set_oom_adj() sync logic is
> that we would be ok with false positives (when we loop unnecessarily)
> but we can't tolerate false negatives (when oom_score_adj gets out of
> sync).

Yes,

> With the clone(CLONE_VM) race not addressed we are allowing
> false negatives and IMHO that's not acceptable because it creates a
> possibility for userspace to get an inconsistent picture. When
> developing the patch I did think about using (p->mm->mm_users >
> p->signal->nr_threads) condition and had to reject it due to that
> reason.

Not sure I understand... I mean, the test_bit(MMF_PROC_SHARED) you propose
is equally racy and we need copy_oom_score() at the end of copy_process()
either way?

Oleg.
Oleg Nesterov Aug. 21, 2020, 4:37 p.m. UTC | #45
again, don't really understand...

On 08/21, Suren Baghdasaryan wrote:
>
> Actually, reviewing again and considering where list_add_tail_rcu is
> happening, maybe the race with clone(CLONE_VM) does not introduce
> false negatives.

I think it does... Whatever we check, mm_users or MMF_PROC_SHARED,
the task can do clone(CLONE_VM) right after the check.

> However a false negative I think will happen when a
> task shares mm with another task and also has an additional thread.
> Shared mm will increment mm_users without adding to signal->live

Yes,

> and
> the additional thread will advance signal->live without adding to
> mm_users.

No, please note that CLONE_THREAD requires CLONE_VM.
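
(copy_process() enforces this transitively; roughly, from kernel/fork.c:

	/* Thread groups must share signals as well. */
	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
		return ERR_PTR(-EINVAL);

	/* Shared signal handlers imply shared VM, so thread groups
	 * imply shared VM as well. */
	if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
		return ERR_PTR(-EINVAL);

so a new thread always bumps mm_users together with signal->live.)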

Oleg.
Suren Baghdasaryan Aug. 21, 2020, 5:22 p.m. UTC | #46
On Fri, Aug 21, 2020 at 9:37 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> again, don't really understand...
>
> On 08/21, Suren Baghdasaryan wrote:
> >
> > Actually, reviewing again and considering where list_add_tail_rcu is
> > happening, maybe the race with clone(CLONE_VM) does not introduce
> > false negatives.
>
> I think it does... Whatever we check, mm_users or MMF_PROC_SHARED,
> the task can do clone(CLONE_VM) right after the check.

Ah, yes of course. I missed this same race, just like in the original patch.

>
> > However a false negative I think will happen when a
> > task shares mm with another task and also has an additional thread.
> > Shared mm will increment mm_users without adding to signal->live
>
> Yes,
>
> > and
> > the additional thread will advance signal->live without adding to
> > mm_users.
>
> No, please note that CLONE_THREAD requires CLONE_VM.

My fault. Forgot that CLONE_VM means "share VM" and not "dup VM". Need
some coffee.
Thanks Oleg!

>
> Oleg.
>
Oleg Nesterov Aug. 21, 2020, 5:59 p.m. UTC | #47
On 08/21, Oleg Nesterov wrote:
>
> On 08/21, Suren Baghdasaryan wrote:
> >
> > On Fri, Aug 21, 2020 at 4:16 AM Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > >         bool probably_has_other_mm_users(tsk)
> > >         {
> > >                 return  atomic_read_acquire(&tsk->mm->mm_users) >
> > >                         atomic_read(&tsk->signal->live);
> > >         }
> > >
> > > The barrier implied by _acquire ensures that if we race with the exiting
> > > task and see the result of exit_mm()->mmput(mm), then we must also see
> > > the result of atomic_dec_and_test(signal->live).
> > >
> > > Either way, if we want to fix the race with clone(CLONE_VM) we need other
> > > changes.
> >
> > The way I understand this condition in __set_oom_adj() sync logic is
> > that we would be ok with false positives (when we loop unnecessarily)
> > but we can't tolerate false negatives (when oom_score_adj gets out of
> > sync).
>
> Yes,
>
> > With the clone(CLONE_VM) race not addressed we are allowing
> > false negatives and IMHO that's not acceptable because it creates a
> > possibility for userspace to get an inconsistent picture. When
> > developing the patch I did think about using (p->mm->mm_users >
> > p->signal->nr_threads) condition and had to reject it due to that
> > reason.
>
> Not sure I understand... I mean, the test_bit(MMF_PROC_SHARED) you propose
> is equally racy and we need copy_oom_score() at the end of copy_process()
> either way?

On second thought I agree that probably_has_other_mm_users() above can't
work ;) Compared to the test_bit(MMF_PROC_SHARED) check it is not _equally_
racy; it adds _another_ race with clone(CLONE_VM).

Suppose a single-threaded process P does

	clone(CLONE_VM); // creates the child C

	// mm_users == 2; P->signal->live == 1;

	clone(CLONE_THREAD | CLONE_VM);

	// mm_users == 3; P->signal->live == 2;

the problem is that in theory clone(CLONE_THREAD | CLONE_VM) can increment
_both_ counters between atomic_read_acquire(mm_users) and atomic_read(live)
in probably_has_other_mm_users() so it can observe mm_users == live == 2.
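
Spelled out as a timeline (illustrative; after the first clone we have
mm_users == 2, live == 1):

	__set_oom_adj() on P			clone(CLONE_THREAD | CLONE_VM)

	users = atomic_read_acquire(
		&mm->mm_users);		/* sees 2 */
						mm_users: 2 -> 3
						signal->live: 1 -> 2
	live = atomic_read(&sig->live);	/* sees 2 */

"users > live" is false, yet the child C from the first clone still
shares the mm: a false negative.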

Oleg.
Suren Baghdasaryan Aug. 21, 2020, 6:53 p.m. UTC | #48
On Fri, Aug 21, 2020 at 11:00 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 08/21, Oleg Nesterov wrote:
> >
> > On 08/21, Suren Baghdasaryan wrote:
> > >
> > > On Fri, Aug 21, 2020 at 4:16 AM Oleg Nesterov <oleg@redhat.com> wrote:
> > > >
> > > >         bool probably_has_other_mm_users(tsk)
> > > >         {
> > > >                 return  atomic_read_acquire(&tsk->mm->mm_users) >
> > > >                         atomic_read(&tsk->signal->live);
> > > >         }
> > > >
> > > > The barrier implied by _acquire ensures that if we race with the exiting
> > > > task and see the result of exit_mm()->mmput(mm), then we must also see
> > > > the result of atomic_dec_and_test(signal->live).
> > > >
> > > > Either way, if we want to fix the race with clone(CLONE_VM) we need other
> > > > changes.
> > >
> > > The way I understand this condition in __set_oom_adj() sync logic is
> > > that we would be ok with false positives (when we loop unnecessarily)
> > > but we can't tolerate false negatives (when oom_score_adj gets out of
> > > sync).
> >
> > Yes,
> >
> > > With the clone(CLONE_VM) race not addressed we are allowing
> > > false negatives and IMHO that's not acceptable because it creates a
> > > possibility for userspace to get an inconsistent picture. When
> > > developing the patch I did think about using (p->mm->mm_users >
> > > p->signal->nr_threads) condition and had to reject it due to that
> > > reason.
> >
> > Not sure I understand... I mean, the test_bit(MMF_PROC_SHARED) you propose
> > is equally racy and we need copy_oom_score() at the end of copy_process()
> > either way?
>
> On second thought I agree that probably_has_other_mm_users() above can't
> work ;) Compared to the test_bit(MMF_PROC_SHARED) check it is not _equally_
> racy; it adds _another_ race with clone(CLONE_VM).
>
> Suppose a single-threaded process P does
>
>         clone(CLONE_VM); // creates the child C
>
>         // mm_users == 2; P->signal->live == 1;
>
>         clone(CLONE_THREAD | CLONE_VM);
>
>         // mm_users == 3; P->signal->live == 2;
>
> the problem is that in theory clone(CLONE_THREAD | CLONE_VM) can increment
> _both_ counters between atomic_read_acquire(mm_users) and atomic_read(live)
> in probably_has_other_mm_users() so it can observe mm_users == live == 2.

I see. So even though live is incremented after mm_users, the observer
from __set_oom_adj still can see them becoming equal because it reads
mm_users first.

Do you see any such races if I incorporate the changes proposed by
Michal in http://lkml.kernel.org/r/20200820124109.GI5033@dhcp22.suse.cz
? I have the new patch and I'm testing it right now. So far it behaves
well but maybe I'm missing some rare race here that won't show up in
my testing?


>
> Oleg.
>
Suren Baghdasaryan Aug. 24, 2020, 8:03 p.m. UTC | #49
On Fri, Aug 21, 2020 at 11:53 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Aug 21, 2020 at 11:00 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 08/21, Oleg Nesterov wrote:
> > >
> > > On 08/21, Suren Baghdasaryan wrote:
> > > >
> > > > On Fri, Aug 21, 2020 at 4:16 AM Oleg Nesterov <oleg@redhat.com> wrote:
> > > > >
> > > > >         bool probably_has_other_mm_users(tsk)
> > > > >         {
> > > > >                 return  atomic_read_acquire(&tsk->mm->mm_users) >
> > > > >                         atomic_read(&tsk->signal->live);
> > > > >         }
> > > > >
> > > > > The barrier implied by _acquire ensures that if we race with the exiting
> > > > > task and see the result of exit_mm()->mmput(mm), then we must also see
> > > > > the result of atomic_dec_and_test(signal->live).
> > > > >
> > > > > Either way, if we want to fix the race with clone(CLONE_VM) we need other
> > > > > changes.
> > > >
> > > > The way I understand this condition in __set_oom_adj() sync logic is
> > > > that we would be ok with false positives (when we loop unnecessarily)
> > > > but we can't tolerate false negatives (when oom_score_adj gets out of
> > > > sync).
> > >
> > > Yes,
> > >
> > > > With the clone(CLONE_VM) race not addressed we are allowing
> > > > false negatives and IMHO that's not acceptable because it creates a
> > > > possibility for userspace to get an inconsistent picture. When
> > > > developing the patch I did think about using (p->mm->mm_users >
> > > > p->signal->nr_threads) condition and had to reject it due to that
> > > > reason.
> > >
> > > Not sure I understand... I mean, the test_bit(MMF_PROC_SHARED) you propose
> > > is equally racy and we need copy_oom_score() at the end of copy_process()
> > > either way?
> >
> > On second thought I agree that probably_has_other_mm_users() above can't
> > work ;) Compared to the test_bit(MMF_PROC_SHARED) check it is not _equally_
> > racy; it adds _another_ race with clone(CLONE_VM).
> >
> > Suppose a single-threaded process P does
> >
> >         clone(CLONE_VM); // creates the child C
> >
> >         // mm_users == 2; P->signal->live == 1;
> >
> >         clone(CLONE_THREAD | CLONE_VM);
> >
> >         // mm_users == 3; P->signal->live == 2;
> >
> > the problem is that in theory clone(CLONE_THREAD | CLONE_VM) can increment
> > _both_ counters between atomic_read_acquire(mm_users) and atomic_read(live)
> > in probably_has_other_mm_users() so it can observe mm_users == live == 2.
>
> I see. So even though live is incremented after mm_users, the observer
> from __set_oom_adj still can see them becoming equal because it reads
> mm_users first.
>
> Do you see any such races if I incorporate the changes proposed by
> Michal in http://lkml.kernel.org/r/20200820124109.GI5033@dhcp22.suse.cz
> ? I have the new patch and I'm testing it right now. So far it behaves
> well but maybe I'm missing some rare race here that won't show up in
> my testing?
>

FYI: v2 of this patch with proposed changes is posted at:
https://lore.kernel.org/patchwork/patch/1294520/

>
> >
> > Oleg.
> >
diff mbox series

Patch

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 617db4e0faa0..cff1a58a236c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1055,7 +1055,6 @@  static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 
 static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 {
-	static DEFINE_MUTEX(oom_adj_mutex);
 	struct mm_struct *mm = NULL;
 	struct task_struct *task;
 	int err = 0;
@@ -1064,7 +1063,7 @@  static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 	if (!task)
 		return -ESRCH;
 
-	mutex_lock(&oom_adj_mutex);
+	mutex_lock(&oom_adj_lock);
 	if (legacy) {
 		if (oom_adj < task->signal->oom_score_adj &&
 				!capable(CAP_SYS_RESOURCE)) {
@@ -1095,7 +1094,7 @@  static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 		struct task_struct *p = find_lock_task_mm(task);
 
 		if (p) {
-			if (atomic_read(&p->mm->mm_users) > 1) {
+			if (test_bit(MMF_PROC_SHARED, &p->mm->flags)) {
 				mm = p->mm;
 				mmgrab(mm);
 			}
@@ -1132,7 +1131,7 @@  static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 		mmdrop(mm);
 	}
 err_unlock:
-	mutex_unlock(&oom_adj_mutex);
+	mutex_unlock(&oom_adj_lock);
 	put_task_struct(task);
 	return err;
 }
diff --git a/include/linux/oom.h b/include/linux/oom.h
index f022f581ac29..861f22bd4706 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -55,6 +55,7 @@  struct oom_control {
 };
 
 extern struct mutex oom_lock;
+extern struct mutex oom_adj_lock;
 
 static inline void set_current_oom_origin(void)
 {
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index ecdc6542070f..070629b722df 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -72,6 +72,7 @@  static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
 #define MMF_OOM_VICTIM		25	/* mm is the oom victim */
 #define MMF_OOM_REAP_QUEUED	26	/* mm was queued for oom_reaper */
+#define MMF_PROC_SHARED	27	/* mm is shared while sighand is not */
 #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
diff --git a/kernel/fork.c b/kernel/fork.c
index 4d32190861bd..9177a76bf840 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1403,6 +1403,15 @@  static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	if (clone_flags & CLONE_VM) {
 		mmget(oldmm);
 		mm = oldmm;
+		if (!(clone_flags & CLONE_SIGHAND)) {
+			/* We need to synchronize with __set_oom_adj */
+			mutex_lock(&oom_adj_lock);
+			set_bit(MMF_PROC_SHARED, &mm->flags);
+			/* Update the values in case they were changed after copy_signal */
+			tsk->signal->oom_score_adj = current->signal->oom_score_adj;
+			tsk->signal->oom_score_adj_min = current->signal->oom_score_adj_min;
+			mutex_unlock(&oom_adj_lock);
+		}
 		goto good_mm;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e90f25d6385d..c22f07c986cb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -64,6 +64,8 @@  int sysctl_oom_dump_tasks = 1;
  * and mark_oom_victim
  */
 DEFINE_MUTEX(oom_lock);
+/* Serializes oom_score_adj and oom_score_adj_min updates */
+DEFINE_MUTEX(oom_adj_lock);
 
 static inline bool is_memcg_oom(struct oom_control *oc)
 {
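
A reader's note on the copy_mm() hunk (illustrative timeline, not part of
the patch): serializing both sides on oom_adj_lock closes the window
between the MMF_PROC_SHARED check and the flag being set:

	__set_oom_adj()				copy_mm(CLONE_VM, !CLONE_SIGHAND)

	mutex_lock(&oom_adj_lock);
	test_bit(MMF_PROC_SHARED) == false
	update only task->signal		mutex_lock(&oom_adj_lock);  /* blocks */
	mutex_unlock(&oom_adj_lock);
						set_bit(MMF_PROC_SHARED, &mm->flags);
						/* copies the already-updated values */
						tsk->signal->oom_score_adj = ...;
						mutex_unlock(&oom_adj_lock);

Either the clone copies the updated values, or the writer sees
MMF_PROC_SHARED and walks every process sharing the mm; no ordering loses
an update.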