Message ID | 20220801210946.3069083-1-zokeefe@google.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [mm-unstable] mm/madvise: remove CAP_SYS_ADMIN requirement for process_madvise(MADV_COLLAPSE) | expand |
Hasn't this been discussed during the MADV_COLLAPSE submission? What has changed? Does this need more time to settle with the consensus? On Mon 01-08-22 14:09:46, Zach O'Keefe wrote: > process_madvise(MADV_COLLAPSE) currently requires CAP_SYS_ADMIN when not > acting on the caller's own mm. This is maximally restrictive, and > perpetuates existing issues with CAP_SYS_ADMIN. Remove this requirement. > > When acting on an external process' memory, the biggest concerns for > process_madvise(MADV_COLLAPSE) are (1) being able to influence process > performance by moving memory, possibly between nodes, that is mapped > into the address space of external process(es), (2) defeat of > address-space-layout randomization, and (3), being able to increase > process RSS and memcg usage, possibly causing memcg OOM. > > process_madvise(2) already enforces CAP_SYS_NICE and PTRACE_MODE_READ (in > PTRACE_MODE_FSCREDS mode). A process with these credentials can already > accomplish (1) and (2) via move_pages(MPOL_MF_MOVE_ALL), and (3) via > process_madvise(MADV_WILLNEED). > > process_madvise(MADV_COLLAPSE) may also circumvent sysfs THP settings. > When acting on one's own memory (which is equivalent to > madvise(MADV_COLLAPSE)), this is deemed acceptable, since aside from the > possibility of hoarding available hugepages (which is currently already > possible) no harm to the system can be done. When acting on an external > process' memory, circumventing sysfs THP settings should provide no > additional threat compared to the ones listed. As such, imposing > additional capabilities (such as CAP_SETUID, as a way to ensure the > caller could have just altered the sysfs THP settings themselves) > provides no extra protection. > > Fixes: 7ec952341312 ("mm/madvise: add MADV_COLLAPSE to process_madvise()") > Signed-off-by: Zach O'Keefe <zokeefe@google.com> > --- > mm/madvise.c | 8 +++----- > 1 file changed, 3 insertions(+), 5 deletions(-) > > diff --git a/mm/madvise.c b/mm/madvise.c > index f9e11b6c9916..af97100a0727 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -1170,16 +1170,14 @@ madvise_behavior_valid(int behavior) > } > } > > -static bool > -process_madvise_behavior_valid(int behavior, struct task_struct *task) > +static bool process_madvise_behavior_valid(int behavior) > { > switch (behavior) { > case MADV_COLD: > case MADV_PAGEOUT: > case MADV_WILLNEED: > - return true; > case MADV_COLLAPSE: > - return task == current || capable(CAP_SYS_ADMIN); > + return true; > default: > return false; > } > @@ -1457,7 +1455,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, > goto free_iov; > } > > - if (!process_madvise_behavior_valid(behavior, task)) { > + if (!process_madvise_behavior_valid(behavior)) { > ret = -EINVAL; > goto release_task; > } > -- > 2.37.1.455.g008518b4e5-goog
On Tue, Aug 2, 2022 at 2:09 AM Michal Hocko <mhocko@suse.com> wrote: > > Hasn't this been discussed during the MADV_COLLAPSE submission? What has > changed? Does this need more time to settle with the consensus? > > On Mon 01-08-22 14:09:46, Zach O'Keefe wrote: > > process_madvise(MADV_COLLAPSE) currently requires CAP_SYS_ADMIN when not > > acting on the caller's own mm. This is maximally restrictive, and > > perpetuates existing issues with CAP_SYS_ADMIN. Remove this requirement. > > > > When acting on an external process' memory, the biggest concerns for > > process_madvise(MADV_COLLAPSE) are (1) being able to influence process > > performance by moving memory, possibly between nodes, that is mapped > > into the address space of external process(es), (2) defeat of > > address-space-layout randomization, and (3), being able to increase > > process RSS and memcg usage, possibly causing memcg OOM. > > > > process_madvise(2) already enforces CAP_SYS_NICE and PTRACE_MODE_READ (in > > PTRACE_MODE_FSCREDS mode). A process with these credentials can already > > accomplish (1) and (2) via move_pages(MPOL_MF_MOVE_ALL), and (3) via > > process_madvise(MADV_WILLNEED). > > > > process_madvise(MADV_COLLAPSE) may also circumvent sysfs THP settings. > > When acting on one's own memory (which is equivalent to > > madvise(MADV_COLLAPSE)), this is deemed acceptable, since aside from the > > possibility of hoarding available hugepages (which is currently already > > possible) no harm to the system can be done. When acting on an external > > process' memory, circumventing sysfs THP settings should provide no > > additional threat compared to the ones listed. As such, imposing > > additional capabilities (such as CAP_SETUID, as a way to ensure the > > caller could have just altered the sysfs THP settings themselves) > > provides no extra protection. > > > > Fixes: 7ec952341312 ("mm/madvise: add MADV_COLLAPSE to process_madvise()") > > Signed-off-by: Zach O'Keefe <zokeefe@google.com> > > --- > > mm/madvise.c | 8 +++----- > > 1 file changed, 3 insertions(+), 5 deletions(-) > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > index f9e11b6c9916..af97100a0727 100644 > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -1170,16 +1170,14 @@ madvise_behavior_valid(int behavior) > > } > > } > > > > -static bool > > -process_madvise_behavior_valid(int behavior, struct task_struct *task) > > +static bool process_madvise_behavior_valid(int behavior) > > { > > switch (behavior) { > > case MADV_COLD: > > case MADV_PAGEOUT: > > case MADV_WILLNEED: > > - return true; > > case MADV_COLLAPSE: > > - return task == current || capable(CAP_SYS_ADMIN); > > + return true; > > default: > > return false; > > } > > @@ -1457,7 +1455,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, > > goto free_iov; > > } > > > > - if (!process_madvise_behavior_valid(behavior, task)) { > > + if (!process_madvise_behavior_valid(behavior)) { > > ret = -EINVAL; > > goto release_task; > > } > > -- > > 2.37.1.455.g008518b4e5-goog > > -- > Michal Hocko > SUSE Labs Hey Michal, Thanks for taking the time to take a look at this. "mm/madvise: add MADV_COLLAPSE to process_madvise()" in the v7 series ended with me mentioning a couple options, but ultimately I didn't present a solution, and no consensus was reached[1]. After taking a closer look, this is my proposal for what I believe to be the best path forward. It should be squashed into the original patch. What do you think? Thanks again, Zach [1] https://lore.kernel.org/linux-mm/Ys4aTRqWIbjNs1mI@google.com/
On Tue 02-08-22 02:48:33, Zach O'Keefe wrote: [...] > "mm/madvise: add MADV_COLLAPSE to process_madvise()" in the v7 series > ended with me mentioning a couple options, but ultimately I didn't > present a solution, and no consensus was reached[1]. After taking a > closer look, this is my proposal for what I believe to be the best > path forward. It should be squashed into the original patch. What do you think? If it is agreed that the CAP_SYS_ADMIN is too strict of a requirement then yes, this should be squashed into the original patch. There is no real reason to create a potential bisection headache by changing the permission model in a later patch. From my POV, I would agree that CAP_SYS_ADMIN is just too strict of a requirement. I didn't really have time to follow recent discussions but I would argue that the operation is not really destructive or seriously harmful. All applications can already have their memory (almost) equally THP collapsed by khupaged with the proposed process_madvise semantic. NOHUGEMEM and prctl opt out from THP are both honored AFAIU and the only difference is the global THP killswitch behavior which I do not think warrants the strongest CAP_SYS_ADMIN capability (especially because it doesn't really control all kinds of THPs). If there is a userspace agent collapsing memory and causing problems then it can be easily fixed in the userspace. And I find that easier to do than putting the bar so high that userspace agents would be unfeasible because of CAP_SYS_ADMIN (which is nono in many cases as it would allow essentially full control of other stuff). So from practical POV, risking an extended RSS is really a negligible risk to lose a potentially useful feature for all others. Just my 2c > Thanks again, > Zach > > [1] https://lore.kernel.org/linux-mm/Ys4aTRqWIbjNs1mI@google.com/
On Tue, Aug 2, 2022 at 5:04 AM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 02-08-22 02:48:33, Zach O'Keefe wrote: > [...] > > "mm/madvise: add MADV_COLLAPSE to process_madvise()" in the v7 series > > ended with me mentioning a couple options, but ultimately I didn't > > present a solution, and no consensus was reached[1]. After taking a > > closer look, this is my proposal for what I believe to be the best > > path forward. It should be squashed into the original patch. What do you think? > > If it is agreed that the CAP_SYS_ADMIN is too strict of a requirement > then yes, this should be squashed into the original patch. There is no > real reason to create a potential bisection headache by changing the > permission model in a later patch. Sorry about the confusion here. Assumed (incorrectly) that Andrew would kindly squash this in mm-unstable since I added the Fixes: tag. Next time I'll add some explicit verbiage saying it should be squashed. > From my POV, I would agree that CAP_SYS_ADMIN is just too strict of a > requirement. > > I didn't really have time to follow recent discussions but I would argue > that the operation is not really destructive or seriously harmful. All > applications can already have their memory (almost) equally THP > collapsed by khupaged with the proposed process_madvise semantic. > > NOHUGEMEM and prctl opt out from THP are both honored AFAIU and the only > difference is the global THP killswitch behavior which I do not think > warrants the strongest CAP_SYS_ADMIN capability (especially because it > doesn't really control all kinds of THPs). Ya. In fact, I don't think the ignoring the THP sysfs controls warrants any additional capability (set alone CAPS_SYS_ADMIN), since a malicious program can't really inflict any more damage than they would with CAP_SYS_NICE and PTRACE_MODE_READ. > If there is a userspace agent collapsing memory and causing problems > then it can be easily fixed in the userspace. And I find that easier > to do than putting the bar so high that userspace agents would be > unfeasible because of CAP_SYS_ADMIN (which is nono in many cases as it > would allow essentially full control of other stuff). So from practical > POV, risking an extended RSS is really a negligible risk to lose a > potentially useful feature for all others. > Agreed. Thanks for taking the time, Michal! Zach > Just my 2c > > > Thanks again, > > Zach > > > > [1] https://lore.kernel.org/linux-mm/Ys4aTRqWIbjNs1mI@google.com/ > > -- > Michal Hocko > SUSE Labs
On Tue, Aug 2, 2022 at 12:43 PM Zach O'Keefe <zokeefe@google.com> wrote: > > On Tue, Aug 2, 2022 at 5:04 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Tue 02-08-22 02:48:33, Zach O'Keefe wrote: > > [...] > > > "mm/madvise: add MADV_COLLAPSE to process_madvise()" in the v7 series > > > ended with me mentioning a couple options, but ultimately I didn't > > > present a solution, and no consensus was reached[1]. After taking a > > > closer look, this is my proposal for what I believe to be the best > > > path forward. It should be squashed into the original patch. What do you think? > > > > If it is agreed that the CAP_SYS_ADMIN is too strict of a requirement > > then yes, this should be squashed into the original patch. There is no > > real reason to create a potential bisection headache by changing the > > permission model in a later patch. > > Sorry about the confusion here. Assumed (incorrectly) that Andrew > would kindly squash this in mm-unstable since I added the Fixes: tag. > Next time I'll add some explicit verbiage saying it should be > squashed. > > > From my POV, I would agree that CAP_SYS_ADMIN is just too strict of a > > requirement. > > > > I didn't really have time to follow recent discussions but I would argue > > that the operation is not really destructive or seriously harmful. All > > applications can already have their memory (almost) equally THP > > collapsed by khupaged with the proposed process_madvise semantic. > > > > NOHUGEMEM and prctl opt out from THP are both honored AFAIU and the only > > difference is the global THP killswitch behavior which I do not think > > warrants the strongest CAP_SYS_ADMIN capability (especially because it > > doesn't really control all kinds of THPs). > > Ya. In fact, I don't think the ignoring the THP sysfs controls > warrants any additional capability (set alone CAPS_SYS_ADMIN), since a > malicious program can't really inflict any more damage than they would > with CAP_SYS_NICE and PTRACE_MODE_READ. > > > If there is a userspace agent collapsing memory and causing problems > > then it can be easily fixed in the userspace. And I find that easier > > to do than putting the bar so high that userspace agents would be > > unfeasible because of CAP_SYS_ADMIN (which is nono in many cases as it > > would allow essentially full control of other stuff). So from practical > > POV, risking an extended RSS is really a negligible risk to lose a > > potentially useful feature for all others. > > > > Agreed. +1 > > Thanks for taking the time, Michal! > Zach > > > > Just my 2c > > > > > Thanks again, > > > Zach > > > > > > [1] https://lore.kernel.org/linux-mm/Ys4aTRqWIbjNs1mI@google.com/ > > > > -- > > Michal Hocko > > SUSE Labs
diff --git a/mm/madvise.c b/mm/madvise.c index f9e11b6c9916..af97100a0727 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1170,16 +1170,14 @@ madvise_behavior_valid(int behavior) } } -static bool -process_madvise_behavior_valid(int behavior, struct task_struct *task) +static bool process_madvise_behavior_valid(int behavior) { switch (behavior) { case MADV_COLD: case MADV_PAGEOUT: case MADV_WILLNEED: - return true; case MADV_COLLAPSE: - return task == current || capable(CAP_SYS_ADMIN); + return true; default: return false; } @@ -1457,7 +1455,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, goto free_iov; } - if (!process_madvise_behavior_valid(behavior, task)) { + if (!process_madvise_behavior_valid(behavior)) { ret = -EINVAL; goto release_task; }
process_madvise(MADV_COLLAPSE) currently requires CAP_SYS_ADMIN when not acting on the caller's own mm. This is maximally restrictive, and perpetuates existing issues with CAP_SYS_ADMIN. Remove this requirement. When acting on an external process' memory, the biggest concerns for process_madvise(MADV_COLLAPSE) are (1) being able to influence process performance by moving memory, possibly between nodes, that is mapped into the address space of external process(es), (2) defeat of address-space-layout randomization, and (3), being able to increase process RSS and memcg usage, possibly causing memcg OOM. process_madvise(2) already enforces CAP_SYS_NICE and PTRACE_MODE_READ (in PTRACE_MODE_FSCREDS mode). A process with these credentials can already accomplish (1) and (2) via move_pages(MPOL_MF_MOVE_ALL), and (3) via process_madvise(MADV_WILLNEED). process_madvise(MADV_COLLAPSE) may also circumvent sysfs THP settings. When acting on one's own memory (which is equivalent to madvise(MADV_COLLAPSE)), this is deemed acceptable, since aside from the possibility of hoarding available hugepages (which is currently already possible) no harm to the system can be done. When acting on an external process' memory, circumventing sysfs THP settings should provide no additional threat compared to the ones listed. As such, imposing additional capabilities (such as CAP_SETUID, as a way to ensure the caller could have just altered the sysfs THP settings themselves) provides no extra protection. Fixes: 7ec952341312 ("mm/madvise: add MADV_COLLAPSE to process_madvise()") Signed-off-by: Zach O'Keefe <zokeefe@google.com> --- mm/madvise.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-)