diff mbox series

[mm-unstable,v7,12/18] mm/madvise: add MADV_COLLAPSE to process_madvise()

Message ID 20220706235936.2197195-13-zokeefe@google.com (mailing list archive)
State New
Headers show
Series mm: userspace hugepage collapse | expand

Commit Message

Zach O'Keefe July 6, 2022, 11:59 p.m. UTC
Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
CAP_SYS_ADMIN or is requesting collapse of it's own memory.

This is useful for the development of userspace agents that seek to
optimize THP utilization system-wide by using userspace signals to
prioritize what memory is most deserving of being THP-backed.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Acked-by: David Rientjes <rientjes@google.com>
---
 mm/madvise.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Andrew Morton July 8, 2022, 8:47 p.m. UTC | #1
On Wed,  6 Jul 2022 16:59:30 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:

> Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> CAP_SYS_ADMIN or is requesting collapse of it's own memory.

This is maximally restrictive.  I didn't see any discussion of why this
was chosen either here of in the [0/N].  I expect that people will be
coming after us to relax this.

So please do add (a lot of) words explaining this decision, and
describing what might be done in the future to relax it.
Zach O'Keefe July 13, 2022, 1:05 a.m. UTC | #2
On Jul 08 13:47, Andrew Morton wrote:
> On Wed,  6 Jul 2022 16:59:30 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:
> 
> > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> > CAP_SYS_ADMIN or is requesting collapse of it's own memory.
> 
> This is maximally restrictive.  I didn't see any discussion of why this
> was chosen either here of in the [0/N].  I expect that people will be
> coming after us to relax this.
> 
> So please do add (a lot of) words explaining this decision, and
> describing what might be done in the future to relax it.

Hey Andrew,

Thanks for taking the time to look at this series. After taking a look through
capabilities(7) I think you're absolutely right to call this out - thanks for
that.

I think move_pages(2) seems to be the best comparison here. There, we use
CAP_SYS_NICE + PTRACE_MODE_READ_REALCREDS to ensure the caller is able to
copying + moving memory of an eternal process, between nodes.  This is also the
current default for process_madvise(2). However, MADV_COLLAPSE additionally is
able to:

1) Influence the RSS of a process / memory charged to a cgroup (by
  collapsing a hugepage-sized/aligned region with nonresident pages). Note that
  for file/shmem, this might cause increase in file/shmem RSS for non-target
  mm's.
2) Bypass sysfs THP settings

For (1), process_madvise(MADV_WILLNEED) could presumably be used to increase RSS
/ memcg usage, and we don't require any additional capabilities there.

For (2), I don't think there is an easy precedent. I think it makes sense that
the caller has write permission to /sys/kernel/mm/transparent_hugapage/*.
AFAICT, this means an effective user ID of 0 ... which is similarly restrictive
like CAP_SYS_ADMIN. One idea would be to use CAP_SETUID, since these threads
could always assume an real/effective user ID of 0.

That said, I'm note sure CAP_SETUID is needed, and perhaps the existing
process_madvise(2) restrictions are enough given CAP_SYS_NICE confers ability to
copy around all the same memory.. we'll just be doing some additional page table
manipulations after some of that copying - which should (mostly) be transparent
to the users. I.e. I don't think it expands CAP_SYS_NICE's "security silo" that
much. Could be wrong through.

Again, thanks for your time,
Zach
diff mbox series

Patch

diff --git a/mm/madvise.c b/mm/madvise.c
index 9f08e958ea86..6fb6b7160bda 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1171,13 +1171,15 @@  madvise_behavior_valid(int behavior)
 }
 
 static bool
-process_madvise_behavior_valid(int behavior)
+process_madvise_behavior_valid(int behavior, struct task_struct *task)
 {
 	switch (behavior) {
 	case MADV_COLD:
 	case MADV_PAGEOUT:
 	case MADV_WILLNEED:
 		return true;
+	case MADV_COLLAPSE:
+		return task == current || capable(CAP_SYS_ADMIN);
 	default:
 		return false;
 	}
@@ -1455,7 +1457,7 @@  SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		goto free_iov;
 	}
 
-	if (!process_madvise_behavior_valid(behavior)) {
+	if (!process_madvise_behavior_valid(behavior, task)) {
 		ret = -EINVAL;
 		goto release_task;
 	}