diff mbox series

[mm-unstable] mm/madvise: remove CAP_SYS_ADMIN requirement for process_madvise(MADV_COLLAPSE)

Message ID 20220801210946.3069083-1-zokeefe@google.com (mailing list archive)
State New
Headers show
Series [mm-unstable] mm/madvise: remove CAP_SYS_ADMIN requirement for process_madvise(MADV_COLLAPSE) | expand

Commit Message

Zach O'Keefe Aug. 1, 2022, 9:09 p.m. UTC
process_madvise(MADV_COLLAPSE) currently requires CAP_SYS_ADMIN when not
acting on the caller's own mm.  This is maximally restrictive, and
perpetuates existing issues with CAP_SYS_ADMIN.  Remove this requirement.

When acting on an external process' memory, the biggest concerns for
process_madvise(MADV_COLLAPSE) are (1) being able to influence process
performance by moving memory, possibly between nodes, that is mapped
into the address space of external process(es), (2) defeat of
address-space-layout randomization, and (3), being able to increase
process RSS and memcg usage, possibly causing memcg OOM.

process_madvise(2) already enforces CAP_SYS_NICE and PTRACE_MODE_READ (in
PTRACE_MODE_FSCREDS mode).  A process with these credentials can already
accomplish (1) and (2) via move_pages(MPOL_MF_MOVE_ALL), and (3) via
process_madvise(MADV_WILLNEED).

process_madvise(MADV_COLLAPSE) may also circumvent sysfs THP settings.
When acting on one's own memory (which is equivalent to
madvise(MADV_COLLAPSE)), this is deemed acceptable, since aside from the
possibility of hoarding available hugepages (which is currently already
possible) no harm to the system can be done.  When acting on an external
process' memory, circumventing sysfs THP settings should provide no
additional threat compared to the ones listed.  As such, imposing
additional capabilities (such as CAP_SETUID, as a way to ensure the
caller could have just altered the sysfs THP settings themselves)
provides no extra protection.

Fixes: 7ec952341312 ("mm/madvise: add MADV_COLLAPSE to process_madvise()")
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/madvise.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

Comments

Michal Hocko Aug. 2, 2022, 9:09 a.m. UTC | #1
Hasn't this been discussed during the MADV_COLLAPSE submission? What has
changed? Does this need more time to settle with the consensus?

On Mon 01-08-22 14:09:46, Zach O'Keefe wrote:
> process_madvise(MADV_COLLAPSE) currently requires CAP_SYS_ADMIN when not
> acting on the caller's own mm.  This is maximally restrictive, and
> perpetuates existing issues with CAP_SYS_ADMIN.  Remove this requirement.
> 
> When acting on an external process' memory, the biggest concerns for
> process_madvise(MADV_COLLAPSE) are (1) being able to influence process
> performance by moving memory, possibly between nodes, that is mapped
> into the address space of external process(es), (2) defeat of
> address-space-layout randomization, and (3), being able to increase
> process RSS and memcg usage, possibly causing memcg OOM.
> 
> process_madvise(2) already enforces CAP_SYS_NICE and PTRACE_MODE_READ (in
> PTRACE_MODE_FSCREDS mode).  A process with these credentials can already
> accomplish (1) and (2) via move_pages(MPOL_MF_MOVE_ALL), and (3) via
> process_madvise(MADV_WILLNEED).
> 
> process_madvise(MADV_COLLAPSE) may also circumvent sysfs THP settings.
> When acting on one's own memory (which is equivalent to
> madvise(MADV_COLLAPSE)), this is deemed acceptable, since aside from the
> possibility of hoarding available hugepages (which is currently already
> possible) no harm to the system can be done.  When acting on an external
> process' memory, circumventing sysfs THP settings should provide no
> additional threat compared to the ones listed.  As such, imposing
> additional capabilities (such as CAP_SETUID, as a way to ensure the
> caller could have just altered the sysfs THP settings themselves)
> provides no extra protection.
> 
> Fixes: 7ec952341312 ("mm/madvise: add MADV_COLLAPSE to process_madvise()")
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  mm/madvise.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index f9e11b6c9916..af97100a0727 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1170,16 +1170,14 @@ madvise_behavior_valid(int behavior)
>  	}
>  }
>  
> -static bool
> -process_madvise_behavior_valid(int behavior, struct task_struct *task)
> +static bool process_madvise_behavior_valid(int behavior)
>  {
>  	switch (behavior) {
>  	case MADV_COLD:
>  	case MADV_PAGEOUT:
>  	case MADV_WILLNEED:
> -		return true;
>  	case MADV_COLLAPSE:
> -		return task == current || capable(CAP_SYS_ADMIN);
> +		return true;
>  	default:
>  		return false;
>  	}
> @@ -1457,7 +1455,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>  		goto free_iov;
>  	}
>  
> -	if (!process_madvise_behavior_valid(behavior, task)) {
> +	if (!process_madvise_behavior_valid(behavior)) {
>  		ret = -EINVAL;
>  		goto release_task;
>  	}
> -- 
> 2.37.1.455.g008518b4e5-goog
Zach O'Keefe Aug. 2, 2022, 9:48 a.m. UTC | #2
On Tue, Aug 2, 2022 at 2:09 AM Michal Hocko <mhocko@suse.com> wrote:
>
> Hasn't this been discussed during the MADV_COLLAPSE submission? What has
> changed? Does this need more time to settle with the consensus?
>
> On Mon 01-08-22 14:09:46, Zach O'Keefe wrote:
> > process_madvise(MADV_COLLAPSE) currently requires CAP_SYS_ADMIN when not
> > acting on the caller's own mm.  This is maximally restrictive, and
> > perpetuates existing issues with CAP_SYS_ADMIN.  Remove this requirement.
> >
> > When acting on an external process' memory, the biggest concerns for
> > process_madvise(MADV_COLLAPSE) are (1) being able to influence process
> > performance by moving memory, possibly between nodes, that is mapped
> > into the address space of external process(es), (2) defeat of
> > address-space-layout randomization, and (3), being able to increase
> > process RSS and memcg usage, possibly causing memcg OOM.
> >
> > process_madvise(2) already enforces CAP_SYS_NICE and PTRACE_MODE_READ (in
> > PTRACE_MODE_FSCREDS mode).  A process with these credentials can already
> > accomplish (1) and (2) via move_pages(MPOL_MF_MOVE_ALL), and (3) via
> > process_madvise(MADV_WILLNEED).
> >
> > process_madvise(MADV_COLLAPSE) may also circumvent sysfs THP settings.
> > When acting on one's own memory (which is equivalent to
> > madvise(MADV_COLLAPSE)), this is deemed acceptable, since aside from the
> > possibility of hoarding available hugepages (which is currently already
> > possible) no harm to the system can be done.  When acting on an external
> > process' memory, circumventing sysfs THP settings should provide no
> > additional threat compared to the ones listed.  As such, imposing
> > additional capabilities (such as CAP_SETUID, as a way to ensure the
> > caller could have just altered the sysfs THP settings themselves)
> > provides no extra protection.
> >
> > Fixes: 7ec952341312 ("mm/madvise: add MADV_COLLAPSE to process_madvise()")
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >  mm/madvise.c | 8 +++-----
> >  1 file changed, 3 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index f9e11b6c9916..af97100a0727 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1170,16 +1170,14 @@ madvise_behavior_valid(int behavior)
> >       }
> >  }
> >
> > -static bool
> > -process_madvise_behavior_valid(int behavior, struct task_struct *task)
> > +static bool process_madvise_behavior_valid(int behavior)
> >  {
> >       switch (behavior) {
> >       case MADV_COLD:
> >       case MADV_PAGEOUT:
> >       case MADV_WILLNEED:
> > -             return true;
> >       case MADV_COLLAPSE:
> > -             return task == current || capable(CAP_SYS_ADMIN);
> > +             return true;
> >       default:
> >               return false;
> >       }
> > @@ -1457,7 +1455,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> >               goto free_iov;
> >       }
> >
> > -     if (!process_madvise_behavior_valid(behavior, task)) {
> > +     if (!process_madvise_behavior_valid(behavior)) {
> >               ret = -EINVAL;
> >               goto release_task;
> >       }
> > --
> > 2.37.1.455.g008518b4e5-goog
>
> --
> Michal Hocko
> SUSE Labs

Hey Michal,

Thanks for taking the time to take a look at this.

"mm/madvise: add MADV_COLLAPSE to process_madvise()" in the v7 series
ended with me mentioning a couple options, but ultimately I didn't
present a solution, and no consensus was reached[1]. After taking a
closer look, this is my proposal for what I believe to be the best
path forward. It should be squashed into the original patch. What do you think?

Thanks again,
Zach

[1] https://lore.kernel.org/linux-mm/Ys4aTRqWIbjNs1mI@google.com/
Michal Hocko Aug. 2, 2022, 12:04 p.m. UTC | #3
On Tue 02-08-22 02:48:33, Zach O'Keefe wrote:
[...]
> "mm/madvise: add MADV_COLLAPSE to process_madvise()" in the v7 series
> ended with me mentioning a couple options, but ultimately I didn't
> present a solution, and no consensus was reached[1]. After taking a
> closer look, this is my proposal for what I believe to be the best
> path forward. It should be squashed into the original patch. What do you think?

If it is agreed that the CAP_SYS_ADMIN is too strict of a requirement
then yes, this should be squashed into the original patch. There is no
real reason to create a potential bisection headache by changing the
permission model in a later patch.

From my POV, I would agree that CAP_SYS_ADMIN is just too strict of a
requirement.

I didn't really have time to follow recent discussions but I would argue
that the operation is not really destructive or seriously harmful. All
applications can already have their memory (almost) equally THP
collapsed by khupaged with the proposed process_madvise semantic.

NOHUGEMEM and prctl opt out from THP are both honored AFAIU and the only
difference is the global THP killswitch behavior which I do not think
warrants the strongest CAP_SYS_ADMIN capability (especially because it
doesn't really control all kinds of THPs).

If there is a userspace agent collapsing memory and causing problems
then it can be easily fixed in the userspace. And I find that easier
to do than putting the bar so high that userspace agents would be
unfeasible because of CAP_SYS_ADMIN (which is nono in many cases as it
would allow essentially full control of other stuff). So from practical
POV, risking an extended RSS is really a negligible risk to lose a
potentially useful feature for all others.

Just my 2c

> Thanks again,
> Zach
> 
> [1] https://lore.kernel.org/linux-mm/Ys4aTRqWIbjNs1mI@google.com/
Zach O'Keefe Aug. 2, 2022, 7:42 p.m. UTC | #4
On Tue, Aug 2, 2022 at 5:04 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 02-08-22 02:48:33, Zach O'Keefe wrote:
> [...]
> > "mm/madvise: add MADV_COLLAPSE to process_madvise()" in the v7 series
> > ended with me mentioning a couple options, but ultimately I didn't
> > present a solution, and no consensus was reached[1]. After taking a
> > closer look, this is my proposal for what I believe to be the best
> > path forward. It should be squashed into the original patch. What do you think?
>
> If it is agreed that the CAP_SYS_ADMIN is too strict of a requirement
> then yes, this should be squashed into the original patch. There is no
> real reason to create a potential bisection headache by changing the
> permission model in a later patch.

Sorry about the confusion here. Assumed (incorrectly) that Andrew
would kindly squash this in mm-unstable since I added the Fixes: tag.
Next time I'll add some explicit verbiage saying it should be
squashed.

> From my POV, I would agree that CAP_SYS_ADMIN is just too strict of a
> requirement.
>
> I didn't really have time to follow recent discussions but I would argue
> that the operation is not really destructive or seriously harmful. All
> applications can already have their memory (almost) equally THP
> collapsed by khupaged with the proposed process_madvise semantic.
>
> NOHUGEMEM and prctl opt out from THP are both honored AFAIU and the only
> difference is the global THP killswitch behavior which I do not think
> warrants the strongest CAP_SYS_ADMIN capability (especially because it
> doesn't really control all kinds of THPs).

Ya. In fact, I don't think the ignoring the THP sysfs controls
warrants any additional capability (set alone CAPS_SYS_ADMIN), since a
malicious program can't really inflict any more damage than they would
with CAP_SYS_NICE and PTRACE_MODE_READ.

> If there is a userspace agent collapsing memory and causing problems
> then it can be easily fixed in the userspace. And I find that easier
> to do than putting the bar so high that userspace agents would be
> unfeasible because of CAP_SYS_ADMIN (which is nono in many cases as it
> would allow essentially full control of other stuff). So from practical
> POV, risking an extended RSS is really a negligible risk to lose a
> potentially useful feature for all others.
>

Agreed.

Thanks for taking the time, Michal!
Zach


> Just my 2c
>
> > Thanks again,
> > Zach
> >
> > [1] https://lore.kernel.org/linux-mm/Ys4aTRqWIbjNs1mI@google.com/
>
> --
> Michal Hocko
> SUSE Labs
Yang Shi Aug. 4, 2022, 5:46 p.m. UTC | #5
On Tue, Aug 2, 2022 at 12:43 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Tue, Aug 2, 2022 at 5:04 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 02-08-22 02:48:33, Zach O'Keefe wrote:
> > [...]
> > > "mm/madvise: add MADV_COLLAPSE to process_madvise()" in the v7 series
> > > ended with me mentioning a couple options, but ultimately I didn't
> > > present a solution, and no consensus was reached[1]. After taking a
> > > closer look, this is my proposal for what I believe to be the best
> > > path forward. It should be squashed into the original patch. What do you think?
> >
> > If it is agreed that the CAP_SYS_ADMIN is too strict of a requirement
> > then yes, this should be squashed into the original patch. There is no
> > real reason to create a potential bisection headache by changing the
> > permission model in a later patch.
>
> Sorry about the confusion here. Assumed (incorrectly) that Andrew
> would kindly squash this in mm-unstable since I added the Fixes: tag.
> Next time I'll add some explicit verbiage saying it should be
> squashed.
>
> > From my POV, I would agree that CAP_SYS_ADMIN is just too strict of a
> > requirement.
> >
> > I didn't really have time to follow recent discussions but I would argue
> > that the operation is not really destructive or seriously harmful. All
> > applications can already have their memory (almost) equally THP
> > collapsed by khupaged with the proposed process_madvise semantic.
> >
> > NOHUGEMEM and prctl opt out from THP are both honored AFAIU and the only
> > difference is the global THP killswitch behavior which I do not think
> > warrants the strongest CAP_SYS_ADMIN capability (especially because it
> > doesn't really control all kinds of THPs).
>
> Ya. In fact, I don't think the ignoring the THP sysfs controls
> warrants any additional capability (set alone CAPS_SYS_ADMIN), since a
> malicious program can't really inflict any more damage than they would
> with CAP_SYS_NICE and PTRACE_MODE_READ.
>
> > If there is a userspace agent collapsing memory and causing problems
> > then it can be easily fixed in the userspace. And I find that easier
> > to do than putting the bar so high that userspace agents would be
> > unfeasible because of CAP_SYS_ADMIN (which is nono in many cases as it
> > would allow essentially full control of other stuff). So from practical
> > POV, risking an extended RSS is really a negligible risk to lose a
> > potentially useful feature for all others.
> >
>
> Agreed.

+1

>
> Thanks for taking the time, Michal!
> Zach
>
>
> > Just my 2c
> >
> > > Thanks again,
> > > Zach
> > >
> > > [1] https://lore.kernel.org/linux-mm/Ys4aTRqWIbjNs1mI@google.com/
> >
> > --
> > Michal Hocko
> > SUSE Labs
diff mbox series

Patch

diff --git a/mm/madvise.c b/mm/madvise.c
index f9e11b6c9916..af97100a0727 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1170,16 +1170,14 @@  madvise_behavior_valid(int behavior)
 	}
 }
 
-static bool
-process_madvise_behavior_valid(int behavior, struct task_struct *task)
+static bool process_madvise_behavior_valid(int behavior)
 {
 	switch (behavior) {
 	case MADV_COLD:
 	case MADV_PAGEOUT:
 	case MADV_WILLNEED:
-		return true;
 	case MADV_COLLAPSE:
-		return task == current || capable(CAP_SYS_ADMIN);
+		return true;
 	default:
 		return false;
 	}
@@ -1457,7 +1455,7 @@  SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		goto free_iov;
 	}
 
-	if (!process_madvise_behavior_valid(behavior, task)) {
+	if (!process_madvise_behavior_valid(behavior)) {
 		ret = -EINVAL;
 		goto release_task;
 	}