Message ID | 20220706235936.2197195-13-zokeefe@google.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm: userspace hugepage collapse | expand |
On Wed, 6 Jul 2022 16:59:30 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote: > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has > CAP_SYS_ADMIN or is requesting collapse of it's own memory. This is maximally restrictive. I didn't see any discussion of why this was chosen either here of in the [0/N]. I expect that people will be coming after us to relax this. So please do add (a lot of) words explaining this decision, and describing what might be done in the future to relax it.
On Jul 08 13:47, Andrew Morton wrote: > On Wed, 6 Jul 2022 16:59:30 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote: > > > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has > > CAP_SYS_ADMIN or is requesting collapse of it's own memory. > > This is maximally restrictive. I didn't see any discussion of why this > was chosen either here of in the [0/N]. I expect that people will be > coming after us to relax this. > > So please do add (a lot of) words explaining this decision, and > describing what might be done in the future to relax it. Hey Andrew, Thanks for taking the time to look at this series. After taking a look through capabilities(7) I think you're absolutely right to call this out - thanks for that. I think move_pages(2) seems to be the best comparison here. There, we use CAP_SYS_NICE + PTRACE_MODE_READ_REALCREDS to ensure the caller is able to copying + moving memory of an eternal process, between nodes. This is also the current default for process_madvise(2). However, MADV_COLLAPSE additionally is able to: 1) Influence the RSS of a process / memory charged to a cgroup (by collapsing a hugepage-sized/aligned region with nonresident pages). Note that for file/shmem, this might cause increase in file/shmem RSS for non-target mm's. 2) Bypass sysfs THP settings For (1), process_madvise(MADV_WILLNEED) could presumably be used to increase RSS / memcg usage, and we don't require any additional capabilities there. For (2), I don't think there is an easy precedent. I think it makes sense that the caller has write permission to /sys/kernel/mm/transparent_hugapage/*. AFAICT, this means an effective user ID of 0 ... which is similarly restrictive like CAP_SYS_ADMIN. One idea would be to use CAP_SETUID, since these threads could always assume an real/effective user ID of 0. That said, I'm note sure CAP_SETUID is needed, and perhaps the existing process_madvise(2) restrictions are enough given CAP_SYS_NICE confers ability to copy around all the same memory.. we'll just be doing some additional page table manipulations after some of that copying - which should (mostly) be transparent to the users. I.e. I don't think it expands CAP_SYS_NICE's "security silo" that much. Could be wrong through. Again, thanks for your time, Zach
diff --git a/mm/madvise.c b/mm/madvise.c index 9f08e958ea86..6fb6b7160bda 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1171,13 +1171,15 @@ madvise_behavior_valid(int behavior) } static bool -process_madvise_behavior_valid(int behavior) +process_madvise_behavior_valid(int behavior, struct task_struct *task) { switch (behavior) { case MADV_COLD: case MADV_PAGEOUT: case MADV_WILLNEED: return true; + case MADV_COLLAPSE: + return task == current || capable(CAP_SYS_ADMIN); default: return false; } @@ -1455,7 +1457,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, goto free_iov; } - if (!process_madvise_behavior_valid(behavior)) { + if (!process_madvise_behavior_valid(behavior, task)) { ret = -EINVAL; goto release_task; }