Message ID: 20230907075453.350554-4-gregory.price@memverge.com
State: New, archived
Series: sys_move_phy_pages system call
On Sat, Sep 09, 2023 at 10:03:33AM +0200, Arnd Bergmann wrote:
> On Thu, Sep 7, 2023, at 09:54, Gregory Price wrote:
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl
> > b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 2d0b1bd866ea..25db6d71af0c 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -457,3 +457,4 @@
> > 450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node
> > 451 i386 cachestat sys_cachestat
> > 452 i386 fchmodat2 sys_fchmodat2
> > +454 i386 move_phys_pages sys_move_phys_pages
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl
> > b/arch/x86/entry/syscalls/syscall_64.tbl
> > index 1d6eee30eceb..9676f2e7698c 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -375,6 +375,7 @@
> > 451 common cachestat sys_cachestat
> > 452 common fchmodat2 sys_fchmodat2
> > 453 64 map_shadow_stack sys_map_shadow_stack
> > +454 common move_phys_pages sys_move_phys_pages
>
> Doing only x86 is fine for review purposes, but note that
> once there is consensus on actually merging it and the syscall
> number, you should do the same for all architectures. I think
> most commonly we do one patch to add the code and another
> patch to hook it up to all the syscall.tbl files and the
> include/uapi/asm-generic/unistd.h file.
>

I'd looked through a few prior patches and that's what I observed, so I tried to follow that. For the RFC I think it only made sense to integrate it with the system I'm actually testing on, but I'll eventually need to test it on ARM and such. Noted.

> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index 22bc6bc147f8..6860675a942f 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -821,6 +821,11 @@ asmlinkage long sys_move_pages(pid_t pid, unsigned
> > long nr_pages,
> > const int __user *nodes,
> > int __user *status,
> > int flags);
> > +asmlinkage long sys_move_phys_pages(unsigned long nr_pages,
> > + const void __user * __user *pages,
> > + const int __user *nodes,
> > + int __user *status,
> > + int flags);
>
> The prototype looks good from a portability point of view,
> i.e. I made sure this is portable across CONFIG_COMPAT
> architectures etc.
>
> What I'm not sure about is whether the representation of physical
> memory pages as a 'void *' is a good choice, I can see potential
> problems with this:
>
> - it's not really a pointer, but instead a shifted PFN number
>   in the implementation
>
> - physical addresses may be wider than pointers on 32-bit
>   architectures with CONFIG_PHYS_ADDR_T_64BIT
>

Hm, good points.

I tried to keep the code close to move_pages for the sake of familiarity and ease of review, but the physical address length is not something I'd considered, and I'm not sure how exactly we would handle CONFIG_PHYS_ADDR_T_64BIT. If you have an idea, I'm happy to run with it.

on address vs pfn:

Would using PFNs cause issues with portability of userland code? User code that gets access to a physical address would have to convert that to a PFN, which would be kernel-defined. That could be easy enough if the kernel exposed the shift size somewhere.

I can see arguments for PFN as opposed to physical address for portability's sake. This doesn't handle the 64-bit phys address configuration issue, of course.

In the case where a user already has a PFN (e.g. via mm/page_idle), defining it as a PFN interface would certainly make more sense.
Mayhaps handling both cases would be useful, depending on the source information (see below for more details).

> I'm also not sure where the user space caller gets the
> physical addresses from, are those not intentionally kept
> hidden from userspace?
>
>      Arnd

There are presently 4 places (that I know of), and 1 that is being proposed here in the near future:

1) Generally: /proc/pid/pagemap can be used to do page table translations. I think this is only really useful for testing, since if you have the virtual address and pid you would use move_pages, but it's certainly useful for testing this.

2) X86: IBS (AMD) and PEBS (Intel) can be configured to return physical address information instead of virtual address information. In fact you can configure it to give you both the virtual and physical address for a process.

3) zoneinfo: /proc/zoneinfo exposes the start PFN of zones.

4) /sys/kernel/mm/page_idle: A way to query whether a PFN is idle. While itself seemingly not useful, if the goal is to migrate less-used pages to "lower tiers" of memory, then you can query the bitmap, directly recover idle PFNs, and attempt to migrate them (which may fail, for a variety of reasons).

https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html

5) CXL (Proposed): a CXL memory device on the PCIe bus may provide hot/cold information about its memory. If a heatmap is provided, for example, it would have to be a device address (0-based) or a physical address (some base defined by the kernel and programmed into device decoders such that it can convert them to 0-based). This is presently being proposed but has not been agreed upon yet.

~Gregory
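[To make source (1) above concrete, here is a minimal userspace sketch, an illustration rather than part of the patch set, that translates a virtual address to a physical address via /proc/self/pagemap, assuming the documented entry format: 64-bit entries, PFN in bits 0-54, present bit 63. On recent kernels, unprivileged readers see the PFN field zeroed, so this effectively requires CAP_SYS_ADMIN.]

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	char *buf = malloc(psize);          /* page whose PFN we want */
	uint64_t entry;
	int fd;

	*(volatile char *)buf = 1;          /* fault the page in first */

	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return 1;

	/* one 64-bit entry per virtual page */
	off_t off = ((uintptr_t)buf / psize) * sizeof(entry);
	if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry))
		return 1;

	if (entry & (1ULL << 63)) {         /* present bit */
		uint64_t pfn = entry & ((1ULL << 55) - 1); /* bits 0-54 */
		printf("virt %p -> phys 0x%llx\n", (void *)buf,
		       (unsigned long long)(pfn * psize +
					    (uintptr_t)buf % psize));
	}
	close(fd);
	return 0;
}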
On Thu, Sep 7, 2023, at 09:54, Gregory Price wrote:
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl
> b/arch/x86/entry/syscalls/syscall_32.tbl
> index 2d0b1bd866ea..25db6d71af0c 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -457,3 +457,4 @@
> 450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node
> 451 i386 cachestat sys_cachestat
> 452 i386 fchmodat2 sys_fchmodat2
> +454 i386 move_phys_pages sys_move_phys_pages
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index 1d6eee30eceb..9676f2e7698c 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -375,6 +375,7 @@
> 451 common cachestat sys_cachestat
> 452 common fchmodat2 sys_fchmodat2
> 453 64 map_shadow_stack sys_map_shadow_stack
> +454 common move_phys_pages sys_move_phys_pages

Doing only x86 is fine for review purposes, but note that once there is consensus on actually merging it and the syscall number, you should do the same for all architectures. I think most commonly we do one patch to add the code and another patch to hook it up to all the syscall.tbl files and the include/uapi/asm-generic/unistd.h file.

> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 22bc6bc147f8..6860675a942f 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -821,6 +821,11 @@ asmlinkage long sys_move_pages(pid_t pid, unsigned
> long nr_pages,
> const int __user *nodes,
> int __user *status,
> int flags);
> +asmlinkage long sys_move_phys_pages(unsigned long nr_pages,
> + const void __user * __user *pages,
> + const int __user *nodes,
> + int __user *status,
> + int flags);

The prototype looks good from a portability point of view, i.e. I made sure this is portable across CONFIG_COMPAT architectures etc.

What I'm not sure about is whether the representation of physical memory pages as a 'void *' is a good choice, I can see potential problems with this:

- it's not really a pointer, but instead a shifted PFN number in the implementation

- physical addresses may be wider than pointers on 32-bit architectures with CONFIG_PHYS_ADDR_T_64BIT

I'm also not sure where the user space caller gets the physical addresses from, are those not intentionally kept hidden from userspace?

     Arnd
On Fri, Sep 8, 2023, at 09:46, Gregory Price wrote:
> On Sat, Sep 09, 2023 at 10:03:33AM +0200, Arnd Bergmann wrote:
>> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> > index 22bc6bc147f8..6860675a942f 100644
>> > --- a/include/linux/syscalls.h
>> > +++ b/include/linux/syscalls.h
>> > @@ -821,6 +821,11 @@ asmlinkage long sys_move_pages(pid_t pid, unsigned
>> > long nr_pages,
>> > const int __user *nodes,
>> > int __user *status,
>> > int flags);
>> > +asmlinkage long sys_move_phys_pages(unsigned long nr_pages,
>> > + const void __user * __user *pages,
>> > + const int __user *nodes,
>> > + int __user *status,
>> > + int flags);
>>
>> The prototype looks good from a portability point of view,
>> i.e. I made sure this is portable across CONFIG_COMPAT
>> architectures etc.
>>
>> What I'm not sure about is whether the representation of physical
>> memory pages as a 'void *' is a good choice, I can see potential
>> problems with this:
>>
>> - it's not really a pointer, but instead a shifted PFN number
>>   in the implementation
>>
>> - physical addresses may be wider than pointers on 32-bit
>>   architectures with CONFIG_PHYS_ADDR_T_64BIT
>>
>
> Hm, good points.
>
> I tried to keep the code close to move_pages for the sake of familiarity and ease of review, but the physical address length is not something I'd considered, and I'm not sure how exactly we would handle CONFIG_PHYS_ADDR_T_64BIT. If you have an idea, I'm happy to run with it.

I think a pointer to '__u64' is the most appropriate here, that is compatible between 32-bit and 64-bit architectures and covers all addresses until we get architectures with 128-bit addressing.

Thinking about it more, I noticed an existing bug in both sys_move_pages and your current version of sys_move_phys_pages: the 'pages' array is in fact not suitable for compat tasks and needs an is_compat_task check to load a 32-bit address from compat userspace on the "if (get_user(p, pages + i))" line.

> on address vs pfn:
>
> Would using PFNs cause issues with portability of userland code? User code that gets access to a physical address would have to convert that to a PFN, which would be kernel-defined. That could be easy enough if the kernel exposed the shift size somewhere.

I don't think we currently use PFN anywhere in the syscall ABI, so this introduces a new basic type into uapi headers that is currently kernel internal.

A 32-bit PFN is always sufficient to represent all physical addresses on a native 32-bit kernel, but not necessarily for compat user space on 64-bit kernels with an address space wider than 44 bits (16 terabyte address range).

For the interface it would also require the same quirk for compat tasks that I pointed out above that is missing in sys_move_pages today.

A minor quirk of PFN values is that they require knowing the page size to convert addresses, but I suppose you need that anyway.

>> I'm also not sure where the user space caller gets the
>> physical addresses from, are those not intentionally kept
>> hidden from userspace?
>
> There are presently 4 places (that I know of), and 1 that is being proposed here in the near future:
>
> 1) Generally: /proc/pid/pagemap can be used to do page table translations. I think this is only really useful for testing, since if you have the virtual address and pid you would use move_pages, but it's certainly useful for testing this.
>
> 2) X86: IBS (AMD) and PEBS (Intel) can be configured to return physical address information instead of virtual address information. In fact you can configure it to give you both the virtual and physical address for a process.

Ah right. I see these already require CAP_SYS_ADMIN permissions because of the risk of rowhammer or speculative execution attacks, so I suppose users of move_phys_pages() would need additional permissions compared to move_pages() to actually use that information.

> 3) zoneinfo: /proc/zoneinfo exposes the start PFN of zones
>
> 4) /sys/kernel/mm/page_idle: A way to query whether a PFN is idle. While itself seemingly not useful, if the goal is to migrate less-used pages to "lower tiers" of memory, then you can query the bitmap, directly recover idle PFNs, and attempt to migrate them (which may fail, for a variety of reasons).
>
> https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
>
> 5) CXL (Proposed): a CXL memory device on the PCIe bus may provide hot/cold information about its memory. If a heatmap is provided, for example, it would have to be a device address (0-based) or a physical address (some base defined by the kernel and programmed into device decoders such that it can convert them to 0-based).
>
> This is presently being proposed but has not been agreed upon yet.

These do not seem to be problematic from the ASLR perspective, so I guess it may still be useful without CAP_SYS_ADMIN.

     Arnd
On Sun, Sep 10, 2023 at 02:36:40PM -0600, Jonathan Corbet wrote:
> Gregory Price <gourry.memverge@gmail.com> writes:
>
> > Similar to the move_pages system call, instead of taking a pid and list of virtual addresses, this system call takes a list of physical addresses.
> >
> > Because there is no task to validate the memory policy against, each page needs to be interrogated to determine whether the migration is valid, and all tasks that map it need to be interrogated.
> >
> > This is accomplished via an rmap_walk on the folio containing the page, and interrogating all tasks that map the page.
> >
> > Each page must be interrogated individually, which should be considered when using this to migrate shared regions.
> >
> > The remaining logic is the same as the move_pages syscall. One change to do_pages_move is made (to check whether an mm_struct is passed) in order to re-use the existing migration code.
> >
> > Signed-off-by: Gregory Price <gregory.price@memverge.com>
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl  |   1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl  |   1 +
> >  include/linux/syscalls.h                |   5 +
> >  include/uapi/asm-generic/unistd.h       |   8 +-
> >  kernel/sys_ni.c                         |   1 +
> >  mm/migrate.c                            | 178 +++++++++++++++++++++++-
> >  tools/include/uapi/asm-generic/unistd.h |   8 +-
> >  7 files changed, 197 insertions(+), 5 deletions(-)
>
> So this is probably a silly question, but just to be sure ... what is the permission model for this system call? As far as I can tell, the ability to move pages is entirely unrestricted, with the exception of pages that would need MPOL_MF_MOVE_ALL. If so, that seems undesirable, but probably I'm just missing something ... ?
>
> Thanks,
>
> jon

Not silly, looks like when I dropped the CAP_SYS_NICE check (no task to check against), I neglected to add a CAP_SYS_ADMIN check.

Oversight on my part, I'll work it in with other feedback. Thanks!

~Gregory
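[For reference, a minimal sketch of what the missing check could look like in kernel_move_phys_pages(), alongside the flag validation already in the patch. The placement and the choice of capability are assumptions; the thread goes on to debate CAP_SYS_NICE vs CAP_SYS_ADMIN.]

	/* Physical addressing exposes system-wide state, so gate the
	 * whole syscall, not just the MPOL_MF_MOVE_ALL case.
	 * (Sketch only, not the merged form.) */
	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
		return -EINVAL;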
On Sat, Sep 09, 2023 at 05:18:13PM +0200, Arnd Bergmann wrote:
>
> I think a pointer to '__u64' is the most appropriate here, that is compatible between 32-bit and 64-bit architectures and covers all addresses until we get architectures with 128-bit addressing.
>
> Thinking about it more, I noticed an existing bug in both sys_move_pages and your current version of sys_move_phys_pages: the 'pages' array is in fact not suitable for compat tasks and needs an is_compat_task check to load a 32-bit address from compat userspace on the "if (get_user(p, pages + i))" line.
>

I'll clean up the current implementation for what I have on a v2 of an RFC, and then look at adding some pull-ahead patches to fix both move_pages and move_phys_pages for compat processes. Might take me a bit, I've only done compat work once before and I remember it being annoying to get right.

I did see other work on migrate.c hanging around on the list, I'll double check this hasn't already been discovered/handled.

> > on address vs pfn:
> >
> > Would using PFNs cause issues with portability of userland code? User code that gets access to a physical address would have to convert that to a PFN, which would be kernel-defined. That could be easy enough if the kernel exposed the shift size somewhere.
>
> I don't think we currently use PFN anywhere in the syscall ABI, so this introduces a new basic type into uapi headers that is currently kernel internal.
>
> A 32-bit PFN is always sufficient to represent all physical addresses on a native 32-bit kernel, but not necessarily for compat user space on 64-bit kernels with an address space wider than 44 bits (16 terabyte address range).
>
> For the interface it would also require the same quirk for compat tasks that I pointed out above that is missing in sys_move_pages today.
>
> A minor quirk of PFN values is that they require knowing the page size to convert addresses, but I suppose you need that anyway.
>

Hm. So, based on the 5 sources I listed below, some can be PFN and others can be phys-addr. To validate a physical address and ensure the page is online, we have to convert to PFN anyway. So for the user's sake, let's consider:

Source Data: PFN
User Actions:
1) If interface requires phys-addr, convert to phys-addr
2) If interface allows PFN, simply pass in as-is
Kernel Actions:
0) If Phys-Addr, convert to PFN
1) Validate the PFN (no bits beyond architecture length)
2) Get and validate struct page

Source Data: Phys-Addr
User Actions:
1) If interface requires phys-addr, simply pass in as-is
2) If interface requires PFN, convert to PFN
Kernel Actions:
0) If Phys-addr, convert to PFN
1) Validate the PFN (no bits beyond arch length)
2) Get and validate struct page

I think the take-away here is that the kernel has to validate the PFN regardless, and the kernel already has the information to do that.

If you want to use both PFN (Idle Bit) and Phys Addr (IBS/PEBS) information, then the user needs the information to do page shifts. That requires more exposure of kernel information, and is possibly complicated by things like hugepages.

It seems better to simply allow the interface to take both values, and add flags for which format is being used.

The worst possible result of malformed input is that the wrong page is migrated. That's no different than move_phys_pages(/dev/random), and the user already requires CAP_SYS_NICE or _ADMIN, so there are probably worse things they could accomplish.
Additionally, this allows the page buffer to simply use __u64*, and there is no need to expose a new user type for PFN. Now the interactions look like:

Source Data: Phys-Addr or PFN
User Actions:
0) move_phys_pages([phys-addr | pfn], MPOL_MF_(PHYS_ADDR ^ PFN))
Kernel Actions:
0) If Phys-addr, convert to PFN
1) Validate the PFN (no bits beyond arch length)
2) Get and validate struct page

This only requires plumbing 2 new flags through do_pages_move, and no new user-exposed types or information.

Is there an ick-factor with the idea of adding the following?

MPOL_MF_PHYS_ADDR : Treat page migration addresses as physical
MPOL_MF_PFN : Treat page migration addresses as PFNs

> These do not seem to be problematic from the ASLR perspective, so I guess it may still be useful without CAP_SYS_ADMIN.
>
>     Arnd

After reviewing the capabilities documentation it seems like CAP_SYS_NICE is the appropriate capability. In my last message I said CAP_SYS_ADMIN was probably correct, but I think using CAP_SYS_NICE is more appropriate unless there are strong feelings due to the use of PFN and Physical Address.

I'm not sure rowhammer is of great concern in this interface because you can't choose the destination address, only the destination node. Though I suppose someone could go nuts and try to "massage" a node in some way to get a statistical likelihood of placement (similar to heap grooming).

~Gregory
On Mon, Sep 11, 2023 at 07:26:45PM +0200, Arnd Bergmann wrote:
> On Sun, Sep 10, 2023, at 14:52, Gregory Price wrote:
> > I'll clean up the current implementation for what I have on a v2 of an RFC, and then look at adding some pull-ahead patches to fix both move_pages and move_phys_pages for compat processes. Might take me a bit, I've only done compat work once before and I remember it being annoying to get right.
>
> I think what you want is roughly this (untested):
>
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2159,6 +2159,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
>  			const int __user *nodes,
>  			int __user *status, int flags)
>  {
> +	compat_uptr_t __user *compat_pages = (void __user *)pages;
>  	int current_node = NUMA_NO_NODE;
>  	LIST_HEAD(pagelist);
>  	int start, i;
> @@ -2171,8 +2172,17 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
>  		int node;
>
>  		err = -EFAULT;
> -		if (get_user(p, pages + i))
> -			goto out_flush;
> +		if (in_compat_syscall()) {
> +			compat_uptr_t cp;
> +
> +			if (get_user(cp, compat_pages + i))
> +				goto out_flush;
> +
> +			p = compat_ptr(cp);
> +		} else {
> +			if (get_user(p, pages + i))
> +				goto out_flush;
> +		}
>  		if (get_user(node, nodes + i))
>  			goto out_flush;
>
> alternatively you could use the get_compat_pages_array() helper that is already used in the do_pages_stat() function.
>

Appreciated, I'll give it a hack before I submit v2.

Just to be clear, it sounds like you want move_pages to be converted from (const void __user * __user *pages) to (const __u64 __user *pages) as well, correct? That seems like a fairly trivial change.

> > This only requires plumbing 2 new flags through do_pages_move, and no new user-exposed types or information.
> >
> > Is there an ick-factor with the idea of adding the following?
> >
> > MPOL_MF_PHYS_ADDR : Treat page migration addresses as physical
> > MPOL_MF_PFN : Treat page migration addresses as PFNs
>
> I would strongly prefer supporting only one of the two, and a 64-bit physical address seems like the logical choice here.
>
> I agree that this doesn't introduce any additional risk for rowhammer attacks, but it seems slightly more logical to me to use CAP_SYS_ADMIN if that is what the other interfaces use that handle physical addresses and may leak address information.
>
>     Arnd

Fair enough, I'll swap to ADMIN and limit to phys_addr.

I suppose I could add /sys/kernel/mm/page_size accessible only by root for the same purpose, so that PFNs from idle and such can be useful. I don't know of another way for userland to determine the shift.

~Gregory
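[To make the settled shape concrete, a hypothetical userspace sketch of invoking the RFC syscall as discussed at this point in the thread: a __u64 array of physical addresses, CAP_SYS_ADMIN required, no libc wrapper. The number 454 is only what this patch set proposes for x86-64, MPOL_MF_MOVE mirrors move_pages(2) usage, and 0x123450000 is the arbitrary example HPA used elsewhere in the thread.]

#include <numaif.h>        /* MPOL_MF_MOVE */
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_move_phys_pages 454   /* proposed in this RFC, not final */

int main(void)
{
	uint64_t pages[1] = { 0x123450000ULL }; /* example HPA, e.g. from a heatmap */
	int nodes[1] = { 1 };                   /* migration target node */
	int status[1] = { 0 };

	long ret = syscall(__NR_move_phys_pages, 1UL, pages, nodes,
			   status, MPOL_MF_MOVE);
	if (ret < 0)
		perror("move_phys_pages");
	else
		printf("page 0: status %d\n", status[0]);
	return 0;
}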
Gregory Price <gourry.memverge@gmail.com> writes:

> Similar to the move_pages system call, instead of taking a pid and list of virtual addresses, this system call takes a list of physical addresses.
>
> Because there is no task to validate the memory policy against, each page needs to be interrogated to determine whether the migration is valid, and all tasks that map it need to be interrogated.
>
> This is accomplished via an rmap_walk on the folio containing the page, and interrogating all tasks that map the page.
>
> Each page must be interrogated individually, which should be considered when using this to migrate shared regions.
>
> The remaining logic is the same as the move_pages syscall. One change to do_pages_move is made (to check whether an mm_struct is passed) in order to re-use the existing migration code.
>
> Signed-off-by: Gregory Price <gregory.price@memverge.com>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl  |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl  |   1 +
>  include/linux/syscalls.h                |   5 +
>  include/uapi/asm-generic/unistd.h       |   8 +-
>  kernel/sys_ni.c                         |   1 +
>  mm/migrate.c                            | 178 +++++++++++++++++++++++-
>  tools/include/uapi/asm-generic/unistd.h |   8 +-
>  7 files changed, 197 insertions(+), 5 deletions(-)

So this is probably a silly question, but just to be sure ... what is the permission model for this system call? As far as I can tell, the ability to move pages is entirely unrestricted, with the exception of pages that would need MPOL_MF_MOVE_ALL. If so, that seems undesirable, but probably I'm just missing something ... ?

Thanks,

jon
On Sun, Sep 10, 2023, at 14:52, Gregory Price wrote:
> On Sat, Sep 09, 2023 at 05:18:13PM +0200, Arnd Bergmann wrote:
>>
>> I think a pointer to '__u64' is the most appropriate here, that is compatible between 32-bit and 64-bit architectures and covers all addresses until we get architectures with 128-bit addressing.
>>
>> Thinking about it more, I noticed an existing bug in both sys_move_pages and your current version of sys_move_phys_pages: the 'pages' array is in fact not suitable for compat tasks and needs an is_compat_task check to load a 32-bit address from compat userspace on the "if (get_user(p, pages + i))" line.
>
> I'll clean up the current implementation for what I have on a v2 of an RFC, and then look at adding some pull-ahead patches to fix both move_pages and move_phys_pages for compat processes. Might take me a bit, I've only done compat work once before and I remember it being annoying to get right.

I think what you want is roughly this (untested):

--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2159,6 +2159,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 			const int __user *nodes,
 			int __user *status, int flags)
 {
+	compat_uptr_t __user *compat_pages = (void __user *)pages;
 	int current_node = NUMA_NO_NODE;
 	LIST_HEAD(pagelist);
 	int start, i;
@@ -2171,8 +2172,17 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 		int node;

 		err = -EFAULT;
-		if (get_user(p, pages + i))
-			goto out_flush;
+		if (in_compat_syscall()) {
+			compat_uptr_t cp;
+
+			if (get_user(cp, compat_pages + i))
+				goto out_flush;
+
+			p = compat_ptr(cp);
+		} else {
+			if (get_user(p, pages + i))
+				goto out_flush;
+		}
 		if (get_user(node, nodes + i))
 			goto out_flush;

alternatively you could use the get_compat_pages_array() helper that is already used in the do_pages_stat() function.

> I did see other work on migrate.c hanging around on the list, I'll double check this hasn't already been discovered/handled.

It looks like I broke it, and it was working before my own 5b1b561ba73c8 ("mm: simplify compat_sys_move_pages"), which only handled the nodes=NULL path. I suppose nobody noticed the regression because there are very few 32-bit NUMA systems, and even fewer cases in which one would run compat userspace to manage a 64-bit NUMA machine.

> This only requires plumbing 2 new flags through do_pages_move, and no new user-exposed types or information.
>
> Is there an ick-factor with the idea of adding the following?
>
> MPOL_MF_PHYS_ADDR : Treat page migration addresses as physical
> MPOL_MF_PFN : Treat page migration addresses as PFNs

I would strongly prefer supporting only one of the two, and a 64-bit physical address seems like the logical choice here.

>> These do not seem to be problematic from the ASLR perspective, so I guess it may still be useful without CAP_SYS_ADMIN.
>
> After reviewing the capabilities documentation it seems like CAP_SYS_NICE is the appropriate capability. In my last message I said CAP_SYS_ADMIN was probably correct, but I think using CAP_SYS_NICE is more appropriate unless there are strong feelings due to the use of PFN and Physical Address.
>
> I'm not sure rowhammer is of great concern in this interface because you can't choose the destination address, only the destination node. Though I suppose someone could go nuts and try to "massage" a node in some way to get a statistical likelihood of placement (similar to heap grooming).
I agree that this doesn't introduce any additional risk for rowhammer attacks, but it seems slightly more logical to me to use CAP_SYS_ADMIN if that is what the other interfaces use that handle physical addresses and may leak address information. Arnd
On Thu, Sep 07 2023 at 03:54, Gregory Price wrote:
> Similar to the move_pages system call, instead of taking a pid and list of virtual addresses, this system call takes a list of physical addresses.

Silly question. Where are these physical addresses coming from?

In my naive understanding user space deals with virtual addresses for a reason.

Exposing access to physical addresses is definitely helpful to write more powerful exploits, so what are the restrictions applied to this?

> +/*
> + * Move a list of pages in the address space of the currently executing
> + * process.
> + */
> +static int kernel_move_phys_pages(unsigned long nr_pages,
> +				  const void __user * __user *pages,
> +				  const int __user *nodes,
> +				  int __user *status, int flags)
> +{
> +	int err;
> +	nodemask_t target_nodes;
> +
> +	/* Check flags */

Documenting the obvious ...

> +	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
> +		return -EINVAL;
> +
> +	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
> +		return -EPERM;

According to this logic here MPOL_MF_MOVE is unrestricted, right?

But how is an unprivileged process knowing which physical address the pages have? Confused....

> +	/* All tasks mapping each page is checked in phys_page_migratable */
> +	nodes_setall(target_nodes);

How is the comment related to nodes_setall() and why is nodes_setall() unconditional when target_nodes is only used in the @nodes != NULL case?

> +	if (nodes)
> +		err = do_pages_move(NULL, target_nodes, nr_pages, pages,
> +				    nodes, status, flags);
> +	else
> +		err = do_pages_stat(NULL, nr_pages, pages, status);

Thanks,

     tglx
On Sun, Sep 10, 2023, at 4:49 AM, Gregory Price wrote:
> On Sun, Sep 10, 2023 at 02:36:40PM -0600, Jonathan Corbet wrote:
>> Gregory Price <gourry.memverge@gmail.com> writes:
>>
>> > Similar to the move_pages system call, instead of taking a pid and list of virtual addresses, this system call takes a list of physical addresses.
>> >
>> > Because there is no task to validate the memory policy against, each page needs to be interrogated to determine whether the migration is valid, and all tasks that map it need to be interrogated.
>> >
>> > This is accomplished via an rmap_walk on the folio containing the page, and interrogating all tasks that map the page.
>> >
>> > Each page must be interrogated individually, which should be considered when using this to migrate shared regions.
>> >
>> > The remaining logic is the same as the move_pages syscall. One change to do_pages_move is made (to check whether an mm_struct is passed) in order to re-use the existing migration code.
>> >
>> > Signed-off-by: Gregory Price <gregory.price@memverge.com>
>> > ---
>> >  arch/x86/entry/syscalls/syscall_32.tbl  |   1 +
>> >  arch/x86/entry/syscalls/syscall_64.tbl  |   1 +
>> >  include/linux/syscalls.h                |   5 +
>> >  include/uapi/asm-generic/unistd.h       |   8 +-
>> >  kernel/sys_ni.c                         |   1 +
>> >  mm/migrate.c                            | 178 +++++++++++++++++++++++-
>> >  tools/include/uapi/asm-generic/unistd.h |   8 +-
>> >  7 files changed, 197 insertions(+), 5 deletions(-)
>>
>> So this is probably a silly question, but just to be sure ... what is the permission model for this system call? As far as I can tell, the ability to move pages is entirely unrestricted, with the exception of pages that would need MPOL_MF_MOVE_ALL. If so, that seems undesirable, but probably I'm just missing something ... ?
>>
>> Thanks,
>>
>> jon
>
> Not silly, looks like when I dropped the CAP_SYS_NICE check (no task to check against), I neglected to add a CAP_SYS_ADMIN check.

Global, I presume?

I have to admit that I don't think this patch set makes sense at all.

As I understand it, there are two kinds of physical memory resource in CXL: those that live on a device and those that live in host memory.

Device memory doesn't migrate as such: if a page is on an accelerator, it's on that accelerator. (If someone makes an accelerator with *two* PCIe targets and connects each target to a different node, that's a different story.)

Host memory is host memory. CXL may access it, and the CXL access from a given device may be faster if that device is connected closer to the memory. And the device may or may not know the virtual address and PASID of the memory.

I fully believe that there's some use for migrating host memory to a node that's closer to a device. But I don't think this API is the right way. First, something needs to figure out that the host memory should be migrated. Doing this presumably involves identifying which (logical!) memory is being accessed and deciding to move it. Maybe new APIs are needed to enable this.

But this API is IMO rather silly. Just as a trivial observation, if you migrate a page you identify by physical address, *that physical address changes*. So the only way it possibly works is that whatever heuristic is using the API knows to invalidate itself after calling the API, but of course it also needs to invalidate itself if the kernel becomes intelligent enough to migrate the page on its own or the owner of the logical page triggers migration, etc.
Put differently, the operation "migrate physical page 0xABCD000 to node 3" makes no sense. That physical address belongs to whatever node it's on, and without some magic hardware support that does not currently exist, it's not going anywhere at runtime.

I just don't see this code working well, never mind the security issues.
On Tue, Sep 19, 2023 at 02:17:15AM +0200, Thomas Gleixner wrote:
> On Thu, Sep 07 2023 at 03:54, Gregory Price wrote:
> > Similar to the move_pages system call, instead of taking a pid and list of virtual addresses, this system call takes a list of physical addresses.
>
> Silly question. Where are these physical addresses coming from?
>
> In my naive understanding user space deals with virtual addresses for a reason.
>
> Exposing access to physical addresses is definitely helpful to write more powerful exploits, so what are the restrictions applied to this?
>

There are a variety of sources from which userspace can acquire physical addresses, most of which require CAP_SYS_ADMIN. I should probably add this list explicitly to the RFC for clarification:

1) /proc/pid/pagemap: can be used to do page table translation. This is only really useful for testing, and it requires CAP_SYS_ADMIN.

2) X86: IBS (AMD) and PEBS (Intel) can be configured to return physical, virtual, or both physical and virtual addressing. This requires CAP_SYS_ADMIN.

3) zoneinfo: /proc/zoneinfo exposes the start PFN of zones. Not largely useful in this context, but notably it does expose a PFN and does not require privilege.

4) /sys/kernel/mm/page_idle: A way to query whether a PFN is idle. One tiering strategy can be to use idle information to move less-used pages to "lower tiers" of memory. With page_idle, you can query idle state based on PFN and migrate accordingly, without the need to know which process the page belongs to (see the sketch after this message). Obviously migrations must still respect memcg restrictions, which is why iterating through each mapping VMA is required.

https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html

5) CXL (Proposed): a CXL memory device on the PCIe bus may provide hot/cold information about its memory. If a heatmap is provided, for example, it would have to be a device address (0-based) or a physical address (some base defined by the kernel and programmed into device decoders such that it can convert them to 0-based). This is presently being proposed but has not been agreed upon yet.

> > +	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
> > +		return -EINVAL;
> > +
> > +	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
> > +		return -EPERM;
>
> According to this logic here MPOL_MF_MOVE is unrestricted, right?
>
> But how is an unprivileged process knowing which physical address the pages have? Confused....
>

Someone else pointed this out as well; I should have a CAP_SYS_ADMIN check here that I neglected to add. I have a v2 of the RFC with a test and fixes, and I am working on a sample man page.

This entire syscall should require CAP_SYS_ADMIN, though there may be an argument for CAP_SYS_NICE. I personally am of the opinion that ADMIN is the right choice, because you're right that operations on physical addresses are a major security issue.

> > +	/* All tasks mapping each page is checked in phys_page_migratable */
> > +	nodes_setall(target_nodes);
>
> How is the comment related to nodes_setall() and why is nodes_setall() unconditional when target_nodes is only used in the @nodes != NULL case?
>

Yeah, this comment is confusing. This is a bit of a wart that comes from trying to re-use the existing do_pages_move to ensure correctness, and I really should have added this information directly in the comments.

The relevant code is here:

static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
			 unsigned long nr_pages,
			 const void __user * __user *pages,
			 const int __user *nodes,
			 int __user *status, int flags)
{
	...
	err = -EACCES;
	if (!node_isset(node, task_nodes))
		goto out_flush;
	...
	if (mm)
		err = add_virt_page_for_migration(mm, p, current_node,
						  &pagelist,
						  flags & MPOL_MF_MOVE_ALL);
	else
		err = add_phys_page_for_migration(p, current_node,
						  &pagelist,
						  flags & MPOL_MF_MOVE_ALL);

do_pages_move was originally written to check that the memcg contained the node as a valid destination for the entire request. This nodemask is acquired from find_mm_struct in a call to cpuset_mems_allowed(task).

e.g. if memcg.valid_nodes=0,1 and you attempt a move_pages(task, 2), then the entire syscall will error out with -EACCES. The first chunk above checks this condition.

The second chunk (add_virt/phys...): those calls check that individual pages are actually migratable and, if so, remove them from the LRU and place them onto the list of pages to migrate in bulk at a later time.

In the physical addressing path, there is no pid/task provided by the user, because physical addresses are not associated directly with a task. We must look up each VMA, and each owner of that VMA, and validate that the intersection of the allowed nodes defined by that set contains the requested node.

This is implemented here:

static int add_phys_page_for_migration(const void __user *p, int node,
				       struct list_head *pagelist,
				       bool migrate_all)
	...
	pfn = ((unsigned long)p) >> PAGE_SHIFT;
	page = pfn_to_online_page(pfn);
	if (!page || PageTail(page))
		return -ENOENT;

	folio = phys_migrate_get_folio(page);
	if (folio) {
		rmap_walk(folio, &rwc);	/* per-vma walk */
		folio_put(folio);
	}

static bool phys_page_migratable(struct folio *folio,
				 struct vm_area_struct *vma,
				 unsigned long address, void *arg)
{
	struct rmap_page_ctxt *ctxt = (struct rmap_page_ctxt *)arg;
	struct task_struct *owner = vma->vm_mm->owner;
	nodemask_t task_nodes = cpuset_mems_allowed(owner);
	...
	/* the actual node check */
	ctxt->node_allowed &= node_isset(ctxt->node, task_nodes);

I will add some better comments that lay this out.

~Gregory
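[As referenced in source (4) above, a minimal userspace sketch, an illustration rather than part of the patch set, of testing whether a given PFN is idle via the page_idle bitmap. Per the cited documentation, the file is an array of 64-bit words with one bit per PFN (PFN #i maps to bit i%64 of word i/64), and access is typically root-only. The example PFN is arbitrary.]

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static int pfn_is_idle(int fd, uint64_t pfn)
{
	uint64_t word;

	/* each 8-byte word covers 64 consecutive PFNs */
	if (pread(fd, &word, sizeof(word),
		  (off_t)(pfn / 64) * sizeof(word)) != sizeof(word))
		return -1;
	return (word >> (pfn % 64)) & 1;
}

int main(void)
{
	int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDONLY);

	if (fd < 0)
		return 1;
	/* 0x123450 is an arbitrary example PFN */
	printf("idle: %d\n", pfn_is_idle(fd, 0x123450));
	close(fd);
	return 0;
}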
On Mon, Sep 18, 2023 at 08:34:16PM -0700, Andy Lutomirski wrote:
>
> On Sun, Sep 10, 2023, at 4:49 AM, Gregory Price wrote:
> > On Sun, Sep 10, 2023 at 02:36:40PM -0600, Jonathan Corbet wrote:
> >>
> >> So this is probably a silly question, but just to be sure ... what is the permission model for this system call? As far as I can tell, the ability to move pages is entirely unrestricted, with the exception of pages that would need MPOL_MF_MOVE_ALL. If so, that seems undesirable, but probably I'm just missing something ... ?
> >>
> >> Thanks,
> >>
> >> jon
> >
> > Not silly, looks like when I dropped the CAP_SYS_NICE check (no task to check against), I neglected to add a CAP_SYS_ADMIN check.
>
> Global, I presume?
>
> I have to admit that I don't think this patch set makes sense at all.
>
> As I understand it, there are two kinds of physical memory resource in CXL: those that live on a device and those that live in host memory.
>
> Device memory doesn't migrate as such: if a page is on an accelerator, it's on that accelerator. (If someone makes an accelerator with *two* PCIe targets and connects each target to a different node, that's a different story.)
>
> Host memory is host memory. CXL may access it, and the CXL access from a given device may be faster if that device is connected closer to the memory. And the device may or may not know the virtual address and PASID of the memory.
>

The CXL memory description here is a bit inaccurate. Memory on the CXL bus is not limited to host and accelerator; CXL memory devices may also present memory for use by the system as if it were just DRAM. The accessing mechanisms are the same (i.e. you can 'mov rax, [rbx]' and the result is a cacheline fetch that goes over the CXL bus rather than the DRAM memory controllers).

Small CXL background for the sake of clarity:

type-2 devices are "accelerators", and the memory relationships you describe here are roughly accurate. The intent of this interface is not really for the purpose of managing type-2/accelerator device memory.

type-3 devices are "memory devices", whose intent is to provide the system additional memory resources that get mapped into one or more numa nodes. The intent of these devices is to present memory to the kernel *as if* it were regular old DRAM, just with different latency and bandwidth attributes. This is a simplification of the overall goal.

So from the perspective of the kernel and a memory-tiering system, we have numa nodes which abstract physical memory, and that physical memory may actually live anywhere (DRAM, CXL, wherever). This memory is fungible, with the exception that CXL memory should be placed in ZONE_MOVABLE to ensure the hot-pluggability of those memory devices.

The intent of this interface is to make page-migration decisions without the need to track individual processes or virtual address mappings.

One example would be to utilize the idle page tracking mechanism from userland to make migration decisions.

https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html

This mechanism allows a user to determine which PFNs are idle. Combine this information with a move_phys_pages syscall, and you can now implement demotion/promotion in user-land without having to identify the virtual address mapping of those PFNs in user-land.

> I fully believe that there's some use for migrating host memory to a node that's closer to a device. But I don't think this API is the right way.
> First, something needs to figure out that the host memory should be migrated. Doing this presumably involves identifying which (logical!) memory is being accessed and deciding to move it. Maybe new APIs are needed to enable this.
>

The intent is not to migrate memory to make it "closer to a device", assuming you mean the intent is to make that data closer to a device that is using it (i.e. an accelerator).

The intent is to allow migration of memory based on a user-defined policy via the usage of physical addresses.

Let's consider a bandwidth-expansion focused tiering policy. Each additional CXL Type-3 Memory device provides additional memory bandwidth to a processor via its pcie/cxl lanes.

If all you care about is latency, moving/migrating pages closer to the processor is beneficial. However, if you care about maximizing bandwidth, distributing memory across all possible devices with some statistical distribution is a better goal.

So sometimes you actually want hot data further away, because it allows for more concurrent cacheline fetches to occur.

The question as to whether getting the logical memory address is required, useful, or performant depends on what sources of information you can pull physical address information from.

Above I explained idle page tracking, but another example would be the CXL device directly, which knows 2 pieces of information (generally):

1) The extent of the memory it is hosting (some size)
2) The physical-to-device address mapping for the system contacting it.

The device talks (internally) in 0-based addressing (0x0 up to 0x...), but the host places the host physical address (HPA) on the bus (0x123450000). The device receives and converts 0x123450000 (HPA) into a 0-based address (device physical address, DPA).

How does this relate to this interface?

Consider a device which provides a "heat-map" for the memory it is hosting. If a user or system requests this heat-map, the device can only provide that information in terms of either HPA or DPA. If DPA, then the host can recover the HPA by simply looking at the mapping it programmed the device with. This reverse translation (DPA-to-HPA) is relatively inexpensive.

The idle-page tracking interface is actually a good example of this. It is functionally a heat-map for the entire system.

However, things get extraordinarily expensive after this. HPA to host virtual address translation (HPA to HVA) requires inspecting every task that may map that HPA in its page tables. When the cacheline fetch hits the bus, you are well below the construct of a "task", and the device has no way of telling you what task is using memory on that device.

This makes any kind of tiering operation based on this type of heat-map information somewhat of a non-starter. You steal so much performance just converting that information into task-specific information that you may as well not bother doing it.

Instead, this interface would allow for a tiering policy to operate on such heat-map information directly, and since all CXL memory is intended to be placed in ZONE_MOVABLE, that memory should always be migratable.

> But this API is IMO rather silly. Just as a trivial observation, if you migrate a page you identify by physical address, *that physical address changes*.
> So the only way it possibly works is that whatever heuristic is using the API knows to invalidate itself after calling the API, but of course it also needs to invalidate itself if the kernel becomes intelligent enough to migrate the page on its own or the owner of the logical page triggers migration, etc.
>
> Put differently, the operation "migrate physical page 0xABCD000 to node 3" makes no sense. That physical address belongs to whatever node it's on, and without some magic hardware support that does not currently exist, it's not going anywhere at runtime.
>
> I just don't see this code working well, never mind the security issues.

I think this is more of a terminology issue. I'm not married to the name, but to me move_phys_pages is intuitively easier to understand because move_pages exists and the only difference between the two interfaces is virtual vs physical addressing.

move_pages doesn't "migrate a virtual page" either; it "migrates the data pointed to by this virtual address to another physical page located on the target numa node".

Likewise, this interface "migrates the data located at the physical address, assuming the physical address is mapped, to another page on the target numa node".

The virtual-address sister-function doesn't "move the physical page" either; it moves the data from one physical page to another and updates the page tables.

~Gregory
On Tue, Sep 19, 2023 at 02:17:15AM +0200, Thomas Gleixner wrote:
> On Thu, Sep 07 2023 at 03:54, Gregory Price wrote:
> > +	/* All tasks mapping each page is checked in phys_page_migratable */
> > +	nodes_setall(target_nodes);
>
> How is the comment related to nodes_setall() and why is nodes_setall() unconditional when target_nodes is only used in the @nodes != NULL case?
>

Short follow-up, sorry for the spam. I realized there is a better way to do this by simply bypassing the task node check in do_pages_move if the mm_struct is NULL. This removes the need for this nodemask altogether. Will simplify down for a v2 and add comments to do_pages_move accordingly.

~Gregory
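[A sketch of the simplification being described; this is a guess at the shape, the actual v2 change may differ. With no mm there is no task nodemask to check, so the -EACCES test in do_pages_move becomes conditional and the all-set target_nodes mask can be dropped:]

	/* Only enforce the task's allowed-nodes mask when acting on a
	 * specific task's address space; the physical path validates
	 * each mapping task via the rmap_walk instead. (Sketch only.) */
	err = -EACCES;
	if (mm && !node_isset(node, task_nodes))
		goto out_flush;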
On Tue, Sep 19, 2023, at 9:31 AM, Gregory Price wrote:
> On Mon, Sep 18, 2023 at 08:34:16PM -0700, Andy Lutomirski wrote:
>>
>> On Sun, Sep 10, 2023, at 4:49 AM, Gregory Price wrote:
>> > On Sun, Sep 10, 2023 at 02:36:40PM -0600, Jonathan Corbet wrote:
>> >>
>> >> So this is probably a silly question, but just to be sure ... what is the permission model for this system call? As far as I can tell, the ability to move pages is entirely unrestricted, with the exception of pages that would need MPOL_MF_MOVE_ALL. If so, that seems undesirable, but probably I'm just missing something ... ?
>> >>
>> >> Thanks,
>> >>
>> >> jon
>> >
>> > Not silly, looks like when I dropped the CAP_SYS_NICE check (no task to check against), I neglected to add a CAP_SYS_ADMIN check.
>>
>> Global, I presume?
>>
>> I have to admit that I don't think this patch set makes sense at all.
>>
>> As I understand it, there are two kinds of physical memory resource in CXL: those that live on a device and those that live in host memory.
>>
>> Device memory doesn't migrate as such: if a page is on an accelerator, it's on that accelerator. (If someone makes an accelerator with *two* PCIe targets and connects each target to a different node, that's a different story.)
>>
>> Host memory is host memory. CXL may access it, and the CXL access from a given device may be faster if that device is connected closer to the memory. And the device may or may not know the virtual address and PASID of the memory.
>
> The CXL memory description here is a bit inaccurate. Memory on the CXL bus is not limited to host and accelerator; CXL memory devices may also present memory for use by the system as if it were just DRAM. The accessing mechanisms are the same (i.e. you can 'mov rax, [rbx]' and the result is a cacheline fetch that goes over the CXL bus rather than the DRAM memory controllers).
>
> Small CXL background for the sake of clarity:
>
> type-2 devices are "accelerators", and the memory relationships you describe here are roughly accurate. The intent of this interface is not really for the purpose of managing type-2/accelerator device memory.
>
> type-3 devices are "memory devices", whose intent is to provide the system additional memory resources that get mapped into one or more numa nodes. The intent of these devices is to present memory to the kernel *as if* it were regular old DRAM, just with different latency and bandwidth attributes. This is a simplification of the overall goal.
>
> So from the perspective of the kernel and a memory-tiering system, we have numa nodes which abstract physical memory, and that physical memory may actually live anywhere (DRAM, CXL, wherever). This memory is fungible, with the exception that CXL memory should be placed in ZONE_MOVABLE to ensure the hot-pluggability of those memory devices.
>
> The intent of this interface is to make page-migration decisions without the need to track individual processes or virtual address mappings.
>
> One example would be to utilize the idle page tracking mechanism from userland to make migration decisions.
>
> https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
>
> This mechanism allows a user to determine which PFNs are idle. Combine this information with a move_phys_pages syscall, and you can now implement demotion/promotion in user-land without having to identify the virtual address mapping of those PFNs in user-land.
>> I fully believe that there's some use for migrating host memory to a node that's closer to a device. But I don't think this API is the right way. First, something needs to figure out that the host memory should be migrated. Doing this presumably involves identifying which (logical!) memory is being accessed and deciding to move it. Maybe new APIs are needed to enable this.
>
> The intent is not to migrate memory to make it "closer to a device", assuming you mean the intent is to make that data closer to a device that is using it (i.e. an accelerator).
>
> The intent is to allow migration of memory based on a user-defined policy via the usage of physical addresses.
>
> Let's consider a bandwidth-expansion focused tiering policy. Each additional CXL Type-3 Memory device provides additional memory bandwidth to a processor via its pcie/cxl lanes.
>
> If all you care about is latency, moving/migrating pages closer to the processor is beneficial. However, if you care about maximizing bandwidth, distributing memory across all possible devices with some statistical distribution is a better goal.
>
> So sometimes you actually want hot data further away, because it allows for more concurrent cacheline fetches to occur.
>
> The question as to whether getting the logical memory address is required, useful, or performant depends on what sources of information you can pull physical address information from.
>
> Above I explained idle page tracking, but another example would be the CXL device directly, which knows 2 pieces of information (generally):
>
> 1) The extent of the memory it is hosting (some size)
> 2) The physical-to-device address mapping for the system contacting it.
>
> The device talks (internally) in 0-based addressing (0x0 up to 0x...), but the host places the host physical address (HPA) on the bus (0x123450000). The device receives and converts 0x123450000 (HPA) into a 0-based address (device physical address, DPA).
>
> How does this relate to this interface?
>
> Consider a device which provides a "heat-map" for the memory it is hosting. If a user or system requests this heat-map, the device can only provide that information in terms of either HPA or DPA. If DPA, then the host can recover the HPA by simply looking at the mapping it programmed the device with. This reverse translation (DPA-to-HPA) is relatively inexpensive.
>
> The idle-page tracking interface is actually a good example of this. It is functionally a heat-map for the entire system.
>
> However, things get extraordinarily expensive after this. HPA to host virtual address translation (HPA to HVA) requires inspecting every task that may map that HPA in its page tables. When the cacheline fetch hits the bus, you are well below the construct of a "task", and the device has no way of telling you what task is using memory on that device.
>
> This makes any kind of tiering operation based on this type of heat-map information somewhat of a non-starter. You steal so much performance just converting that information into task-specific information that you may as well not bother doing it.
>
> Instead, this interface would allow for a tiering policy to operate on such heat-map information directly, and since all CXL memory is intended to be placed in ZONE_MOVABLE, that memory should always be migratable.
>
>> But this API is IMO rather silly. Just as a trivial observation, if you migrate a page you identify by physical address, *that physical address changes*.
>> So the only way it possibly works is that whatever heuristic is using the API knows to invalidate itself after calling the API, but of course it also needs to invalidate itself if the kernel becomes intelligent enough to migrate the page on its own or the owner of the logical page triggers migration, etc.
>>
>> Put differently, the operation "migrate physical page 0xABCD000 to node 3" makes no sense. That physical address belongs to whatever node it's on, and without some magic hardware support that does not currently exist, it's not going anywhere at runtime.
>>
>> I just don't see this code working well, never mind the security issues.
>
> I think this is more of a terminology issue. I'm not married to the name, but to me move_phys_pages is intuitively easier to understand because move_pages exists and the only difference between the two interfaces is virtual vs physical addressing.
>
> move_pages doesn't "migrate a virtual page" either; it "migrates the data pointed to by this virtual address to another physical page located on the target numa node".
>
> Likewise, this interface "migrates the data located at the physical address, assuming the physical address is mapped, to another page on the target numa node".

I'm not complaining about the name. I'm objecting to the semantics.

Apparently you have a system to collect usage statistics of physical addresses, but you have no idea what those pages map to (without crawling /proc or /sys, anyway). But that means you have no idea when the logical contents of those pages *changes*. So you fundamentally have a nasty race: anything else that swaps or migrates those pages will mess up your statistics, and you'll start trying to migrate the wrong thing.
On Tue, Sep 19, 2023 at 10:59:33AM -0700, Andy Lutomirski wrote:
>
> I'm not complaining about the name. I'm objecting to the semantics.
>
> Apparently you have a system to collect usage statistics of physical addresses, but you have no idea what those pages map to (without crawling /proc or /sys, anyway). But that means you have no idea when the logical contents of those pages *changes*. So you fundamentally have a nasty race: anything else that swaps or migrates those pages will mess up your statistics, and you'll start trying to migrate the wrong thing.

How does this change if I use virtual address based migration?

I could do sampling based on virtual address (page faults, IBS/PEBS, whatever), and by the time I make a decision, the kernel could have migrated the data or even my task from Node A to Node B. The sample I took is now stale, and I could make a poor migration decision.

If I do move_pages(pid, some_virt_addr, some_node) and it migrates the page from NodeA to NodeB, then the device-side collection is likewise no longer valid. This problem doesn't change because I used virtual address compared to physical address.

But if I have a 512GB memory device, and I can see a wide swath of that 512GB is hot, while a good chunk of my local DRAM is not, then I probably don't care *what* gets migrated up to DRAM, I just care that a vast majority of that hot data does.

The goal here isn't 100% precision; you will never get there. The goal here is broad-scope performance enhancements of the overall system while minimizing the cost to compute the migration actions to be taken.

I don't think the contents of the page are always relevant. The entire concept here is to enable migration without caring about what programs are using the memory for, just so long as the memcgs and zoning are respected.

~Gregory
On Tue, Sep 19, 2023, at 11:20 AM, Gregory Price wrote:
> On Tue, Sep 19, 2023 at 10:59:33AM -0700, Andy Lutomirski wrote:
>>
>> I'm not complaining about the name. I'm objecting to the semantics.
>>
>> Apparently you have a system to collect usage statistics of physical
>> addresses, but you have no idea what those pages map to (without
>> crawling /proc or /sys, anyway). But that means you have no idea when
>> the logical contents of those pages *change*. So you fundamentally have
>> a nasty race: anything else that swaps or migrates those pages will
>> mess up your statistics, and you'll start trying to migrate the wrong
>> thing.
>
> How does this change if I use virtual address based migration?
>
> I could do sampling based on virtual address (page faults, IBS/PEBS,
> whatever), and by the time I make a decision, the kernel could have
> migrated the data or even my task from Node A to Node B. The sample I
> took is now stale, and I could make a poor migration decision.

The window is a lot narrower. If you're sampling by VA, you collect
stats and associate them with the logical page (the tuple (mapping,
VA), for example). The kernel can do this without races from page fault
handlers. If you sample based on PA, you fundamentally race against
anything that does migration.

>
> If I do move_pages(pid, some_virt_addr, some_node) and it migrates the
> page from NodeA to NodeB, then the device-side collection is likewise
> no longer valid. This problem doesn't change because I used a virtual
> address instead of a physical one.

Sure it does, as long as you collect those samples when you migrate.
And I think the kernel migrating to or from device memory (or more
generally allocating and freeing device memory and possibly even
regular memory) *should* be aware of whatever hotness statistics are in
use.

>
> But if I have a 512GB memory device, and I can see a wide swath of that
> 512GB is hot, while a good chunk of my local DRAM is not - then I
> probably don't care *what* gets migrated up to DRAM, I just care that a
> vast majority of that hot data does.
>
> The goal here isn't 100% precision; you will never get there. The goal
> here is broad-scope performance enhancements of the overall system
> while minimizing the cost to compute the migration actions to be taken.
>
> I don't think the contents of the page are always relevant. The entire
> concept here is to enable migration without caring about what programs
> are using the memory for - just so long as the memcgs and zoning are
> respected.
>

At the very least I think you need to be aware of page *size*. And if
you want to avoid excessive fragmentation, you probably also want to be
aware of the boundaries of a logical allocation.

I think that doing this entire process by PA, blind, from userspace
will end up stuck in a not-so-good solution, and the ABI will be set in
stone, and it will not be a great situation for long term
maintainability or performance.
On Tue, Sep 19, 2023 at 11:52:27AM -0700, Andy Lutomirski wrote:
>
>
> On Tue, Sep 19, 2023, at 11:20 AM, Gregory Price wrote:
> > On Tue, Sep 19, 2023 at 10:59:33AM -0700, Andy Lutomirski wrote:
> >>
> >> I'm not complaining about the name. I'm objecting to the semantics.
> >>
> >> Apparently you have a system to collect usage statistics of physical
> >> addresses, but you have no idea what those pages map to (without
> >> crawling /proc or /sys, anyway). But that means you have no idea when
> >> the logical contents of those pages *change*. So you fundamentally have
> >> a nasty race: anything else that swaps or migrates those pages will
> >> mess up your statistics, and you'll start trying to migrate the wrong
> >> thing.
> >
> > How does this change if I use virtual address based migration?
> >
> > I could do sampling based on virtual address (page faults, IBS/PEBS,
> > whatever), and by the time I make a decision, the kernel could have
> > migrated the data or even my task from Node A to Node B. The sample I
> > took is now stale, and I could make a poor migration decision.
>
> The window is a lot narrower. If you're sampling by VA, you collect
> stats and associate them with the logical page (the tuple (mapping,
> VA), for example). The kernel can do this without races from page fault
> handlers. If you sample based on PA, you fundamentally race against
> anything that does migration.
>

Race conditions are race conditions, narrow or otherwise. The
narrowness of the window is either irrelevant, or you can accept some
level of race because the goal allows for slop within a certain window.
In fact, the entire purpose of this interface is to very explicitly
reduce the time it takes to go from sampling to migration decision.

Certainly a user could simply use a heatmap with cgroups.numa_stat to
determine what processes they should interrogate - but for systems with
many tenants/processes/tasks, acting on the overall view of memory is
sometimes the better option when *major* changes have to be made to the
overall distribution.

Similarly, collection of the data can be made more performant. Faults
introduce runtime overhead. Pushing heatmap collection to devices
alleviates the need for introducing that overhead. Use of IBS/PEBS is
(by nature of sampling and a variety of quirks having to do with the
prefetcher) quite inaccurate and costly to compute after being
collected.

Ultimately, if your goal is extremely high precision for the
performance of a single process, I agree with your analysis. However, I
never claimed the goal is precision at the level of a single process.
On the contrary, the goal is system-wide tiering based on
cheap-to-acquire information.

> >
> > If I do move_pages(pid, some_virt_addr, some_node) and it migrates the
> > page from NodeA to NodeB, then the device-side collection is likewise
> > no longer valid. This problem doesn't change because I used a virtual
> > address instead of a physical one.
>
> Sure it does, as long as you collect those samples when you migrate.
> And I think the kernel migrating to or from device memory (or more
> generally allocating and freeing device memory and possibly even
> regular memory) *should* be aware of whatever hotness statistics are in
> use.
>

I'm not keen on "the kernel should..." implying "userland should not".
I don't disagree that the kernel could/should/would use this same
information. My proposal here is that there is value in enabling
userland to do the same thing (so long as security is not compromised).
> >
> > But if I have a 512GB memory device, and I can see a wide swath of that
> > 512GB is hot, while a good chunk of my local DRAM is not - then I
> > probably don't care *what* gets migrated up to DRAM, I just care that a
> > vast majority of that hot data does.
> >
> > The goal here isn't 100% precision; you will never get there. The goal
> > here is broad-scope performance enhancements of the overall system
> > while minimizing the cost to compute the migration actions to be taken.
> >
> > I don't think the contents of the page are always relevant. The entire
> > concept here is to enable migration without caring about what programs
> > are using the memory for - just so long as the memcgs and zoning are
> > respected.
> >
>
> At the very least I think you need to be aware of page *size*. And if
> you want to avoid excessive fragmentation, you probably also want to be
> aware of the boundaries of a logical allocation.
>

Re page size: this was brought up in a prior email, and I agree, but
that also doesn't really change for move_pages. The man page for
move_pages already says this:

"""
pages is an array of pointers to the pages that should be moved. These
are pointers that should be aligned to page boundaries.
"""

But to an extent I agree with you. A device collecting data as I
describe won't necessarily know the page size (or there may even be
multiple page sizes hosted on the device). Operating on that data blind
does have some risks. Some work probably could be done to ensure the
data being used doesn't, for example, cause a hugepage to be migrated
multiple times. In that regard, if the address doesn't match the base
of the page, the migration should fail. I think there are simple ways
to avoid these footguns; one is sketched below.

>
> I think that doing this entire process by PA, blind, from userspace
> will end up stuck in a not-so-good solution, and the ABI will be set in
> stone, and it will not be a great situation for long term
> maintainability or performance.

"Entire process" and "blind" are a bit of a straw man here. The goal is
to marry many pieces of data to make decisions about how and what to
move, with as many (flexible) tools available to enact changes quickly,
to reduce the staleness-of-information issue.

Sometimes it would be "more blind" than others, yes. In other cases,
such a device-provided heatmap would simply be informational for
overall system health.

To me, unless I'm misunderstanding, it sounds like you think
system-wide tiering decisions should not be a function userland is
empowered to perform directly, but instead should be implemented
entirely in the kernel?

~Gregory
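As an illustration of the footgun guard mentioned above - a sketch
only, under the assumption of a 2MB device page size (the granularity
is a made-up parameter, not something the interface defines): a caller
can collapse sampled physical addresses to their page base and
deduplicate before calling move_phys_pages, so each hugepage is
submitted at most once.

#include <stdlib.h>

#define HUGEPAGE_SHIFT 21    /* assumption: 2MB pages back the device */

static int cmp_addr(const void *a, const void *b)
{
        unsigned long x = *(const unsigned long *)a;
        unsigned long y = *(const unsigned long *)b;

        return (x > y) - (x < y);
}

/*
 * Mask each sampled address down to its (assumed) page base, then sort
 * and deduplicate in place. Returns the number of unique page-base
 * addresses left at the front of addrs[].
 */
static unsigned long coalesce_to_page_base(unsigned long *addrs,
                                           unsigned long n)
{
        unsigned long i, out = 0;

        for (i = 0; i < n; i++)
                addrs[i] &= ~((1UL << HUGEPAGE_SHIFT) - 1);
        qsort(addrs, n, sizeof(*addrs), cmp_addr);
        for (i = 0; i < n; i++)
                if (out == 0 || addrs[i] != addrs[out - 1])
                        addrs[out++] = addrs[i];
        return out;
}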
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 2d0b1bd866ea..25db6d71af0c 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -457,3 +457,4 @@
 450     i386    set_mempolicy_home_node sys_set_mempolicy_home_node
 451     i386    cachestat               sys_cachestat
 452     i386    fchmodat2               sys_fchmodat2
+454     i386    move_phys_pages         sys_move_phys_pages
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 1d6eee30eceb..9676f2e7698c 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -375,6 +375,7 @@
 451     common  cachestat               sys_cachestat
 452     common  fchmodat2               sys_fchmodat2
 453     64      map_shadow_stack        sys_map_shadow_stack
+454     common  move_phys_pages         sys_move_phys_pages
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 22bc6bc147f8..6860675a942f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -821,6 +821,11 @@ asmlinkage long sys_move_pages(pid_t pid, unsigned long nr_pages,
                                const int __user *nodes,
                                int __user *status,
                                int flags);
+asmlinkage long sys_move_phys_pages(unsigned long nr_pages,
+                               const void __user * __user *pages,
+                               const int __user *nodes,
+                               int __user *status,
+                               int flags);
 asmlinkage long sys_rt_tgsigqueueinfo(pid_t tgid, pid_t pid, int sig,
                siginfo_t __user *uinfo);
 asmlinkage long sys_perf_event_open(
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index abe087c53b4b..8838fcfaf261 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -823,8 +823,14 @@ __SYSCALL(__NR_cachestat, sys_cachestat)
 #define __NR_fchmodat2 452
 __SYSCALL(__NR_fchmodat2, sys_fchmodat2)

+/* CONFIG_MMU only */
+#ifndef __ARCH_NOMMU
+#define __NR_move_phys_pages 454
+__SYSCALL(__NR_move_phys_pages, sys_move_phys_pages)
+#endif
+
 #undef __NR_syscalls
-#define __NR_syscalls 453
+#define __NR_syscalls 455

 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index e137c1385c56..07441b10f92a 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -192,6 +192,7 @@ COND_SYSCALL(migrate_pages);
 COND_SYSCALL(move_pages);
 COND_SYSCALL(set_mempolicy_home_node);
 COND_SYSCALL(cachestat);
+COND_SYSCALL(move_phys_pages);
 COND_SYSCALL(perf_event_open);
 COND_SYSCALL(accept4);
diff --git a/mm/migrate.c b/mm/migrate.c
index 3506b8202937..8a6f1eb6e512 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2161,6 +2161,101 @@ static int move_pages_and_store_status(int node,
        return store_status(status, start, node, i - start);
 }

+struct rmap_page_ctxt {
+       bool found;
+       bool migratable;
+       bool node_allowed;
+       int node;
+};
+
+/*
+ * Walks each vma mapping a given page and determines whether each vma
+ * is migratable and whether the target node is within the allowed
+ * cpuset of the owning task.
+ */
+static bool phys_page_migratable(struct folio *folio,
+                                struct vm_area_struct *vma,
+                                unsigned long address,
+                                void *arg)
+{
+       struct rmap_page_ctxt *ctxt = (struct rmap_page_ctxt *)arg;
+       struct task_struct *owner = vma->vm_mm->owner;
+       nodemask_t task_nodes = cpuset_mems_allowed(owner);
+
+       ctxt->found |= true;
+       ctxt->migratable &= vma_migratable(vma);
+       ctxt->node_allowed &= node_isset(ctxt->node, task_nodes);
+
+       return ctxt->migratable && ctxt->node_allowed;
+}
+
+static struct folio *phys_migrate_get_folio(struct page *page)
+{
+       struct folio *folio;
+
+       folio = page_folio(page);
+       if (!folio_test_lru(folio) || !folio_try_get(folio))
+               return NULL;
+       if (unlikely(page_folio(page) != folio || !folio_test_lru(folio))) {
+               folio_put(folio);
+               folio = NULL;
+       }
+       return folio;
+}
+
+/*
+ * Validates the physical address is online and migratable. Walks the folio
+ * containing the page to validate the vma is migratable and the cpuset node
+ * restrictions. Then calls add_page_for_migration to isolate it from the
+ * LRU and place it into the given pagelist.
+ * Returns:
+ *     errno - if the page is not online, migratable, or can't be isolated
+ *     0 - when it doesn't have to be migrated because it is already on the
+ *         target node
+ *     1 - when it has been queued
+ */
+static int add_phys_page_for_migration(const void __user *p, int node,
+                                      struct list_head *pagelist,
+                                      bool migrate_all)
+{
+       unsigned long pfn;
+       struct page *page;
+       struct folio *folio;
+       int err;
+       struct rmap_page_ctxt rmctxt = {
+               .found = false,
+               .migratable = true,
+               .node_allowed = true,
+               .node = node
+       };
+       struct rmap_walk_control rwc = {
+               .rmap_one = phys_page_migratable,
+               .arg = &rmctxt
+       };
+
+       pfn = ((unsigned long)p) >> PAGE_SHIFT;
+       page = pfn_to_online_page(pfn);
+       if (!page || PageTail(page))
+               return -ENOENT;
+
+       folio = phys_migrate_get_folio(page);
+       if (folio) {
+               rmap_walk(folio, &rwc);
+               folio_put(folio);
+       }
+
+       if (!rmctxt.found)
+               err = -ENOENT;
+       else if (!rmctxt.migratable)
+               err = -EFAULT;
+       else if (!rmctxt.node_allowed)
+               err = -EACCES;
+       else
+               err = add_page_for_migration(page, node, pagelist, migrate_all);
+
+       return err;
+}
+
 /*
  * Migrate an array of page address onto an array of nodes and fill
  * the corresponding array of status.
@@ -2214,8 +2309,13 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
                 * Errors in the page lookup or isolation are not fatal and we simply
                 * report them via status
                 */
-               err = add_virt_page_for_migration(mm, p, current_node, &pagelist,
-                               flags & MPOL_MF_MOVE_ALL);
+               if (mm)
+                       err = add_virt_page_for_migration(mm, p, current_node,
+                                       &pagelist, flags & MPOL_MF_MOVE_ALL);
+               else
+                       err = add_phys_page_for_migration(p, current_node,
+                                       &pagelist, flags & MPOL_MF_MOVE_ALL);
+
                if (err > 0) {
                        /* The page is successfully queued for migration */
@@ -2303,6 +2403,36 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
        mmap_read_unlock(mm);
 }

+/*
+ * Determine the nodes of an array of pages and store it in an array of status.
+ */
+static void do_phys_pages_stat_array(unsigned long nr_pages,
+                                    const void __user **pages, int *status)
+{
+       unsigned long i;
+
+       for (i = 0; i < nr_pages; i++) {
+               unsigned long pfn = (unsigned long)(*pages) >> PAGE_SHIFT;
+               struct page *page = pfn_to_online_page(pfn);
+               int err = -ENOENT;
+
+               if (!page)
+                       goto set_status;
+
+               get_page(page);
+
+               if (!is_zone_device_page(page))
+                       err = page_to_nid(page);
+
+               put_page(page);
+set_status:
+               *status = err;
+
+               pages++;
+               status++;
+       }
+}
+
 static int get_compat_pages_array(const void __user *chunk_pages[],
                                  const void __user * __user *pages,
                                  unsigned long chunk_nr)
@@ -2345,7 +2475,10 @@ static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages,
                        break;
                }

-               do_pages_stat_array(mm, chunk_nr, chunk_pages, chunk_status);
+               if (mm)
+                       do_pages_stat_array(mm, chunk_nr, chunk_pages, chunk_status);
+               else
+                       do_phys_pages_stat_array(chunk_nr, chunk_pages, chunk_status);

                if (copy_to_user(status, chunk_status, chunk_nr * sizeof(*status)))
                        break;
@@ -2446,6 +2579,45 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
        return kernel_move_pages(pid, nr_pages, pages, nodes, status, flags);
 }

+/*
+ * Move a list of physically-addressed pages to the given set of nodes.
+ * Unlike move_pages, this does not operate on a single task's address space.
+ */
+static int kernel_move_phys_pages(unsigned long nr_pages,
+                                 const void __user * __user *pages,
+                                 const int __user *nodes,
+                                 int __user *status, int flags)
+{
+       int err;
+       nodemask_t target_nodes;
+
+       /* Check flags */
+       if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
+               return -EINVAL;
+
+       if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
+               return -EPERM;
+
+       /* All tasks mapping each page are checked in phys_page_migratable */
+       nodes_setall(target_nodes);
+       if (nodes)
+               err = do_pages_move(NULL, target_nodes, nr_pages, pages,
+                                   nodes, status, flags);
+       else
+               err = do_pages_stat(NULL, nr_pages, pages, status);
+
+       return err;
+}
+
+SYSCALL_DEFINE5(move_phys_pages, unsigned long, nr_pages,
+               const void __user * __user *, pages,
+               const int __user *, nodes,
+               int __user *, status, int, flags)
+{
+       return kernel_move_phys_pages(nr_pages, pages, nodes, status, flags);
+}
+
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * Returns true if this is a safe migration target node for misplaced NUMA
diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index fd6c1cb585db..b140ad444946 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -820,8 +820,14 @@ __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 #define __NR_cachestat 451
 __SYSCALL(__NR_cachestat, sys_cachestat)

+/* CONFIG_MMU only */
+#ifndef __ARCH_NOMMU
+#define __NR_move_phys_pages 454
+__SYSCALL(__NR_move_phys_pages, sys_move_phys_pages)
+#endif
+
 #undef __NR_syscalls
-#define __NR_syscalls 452
+#define __NR_syscalls 455

 /*
  * 32 bit systems traditionally used different
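A usage note on the stat path above: when nodes == NULL, the call
degenerates to a node-residency query, filling status[] with the node
each physical address currently sits on (or a negative errno). A
minimal sketch - the example address is arbitrary and the syscall
number is the one proposed in this patch, both assumptions for
illustration:

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_move_phys_pages
#define __NR_move_phys_pages 454    /* x86_64 number from this patch */
#endif

int main(void)
{
        /* Arbitrary example physical address; a real caller would use
         * addresses reported by a device heatmap or mm/page_idle. */
        const void *pages[1] = { (const void *)0x100000000UL };
        int status[1];

        if (syscall(__NR_move_phys_pages, 1UL, pages, NULL, status, 0))
                return 1;
        if (status[0] >= 0)
                printf("page resides on node %d\n", status[0]);
        else
                printf("lookup failed: %d\n", status[0]);
        return 0;
}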
Similar to the move_pages system call, instead of taking a pid and a
list of virtual addresses, this system call takes a list of physical
addresses.

Because there is no task to validate the memory policy against, each
page needs to be interrogated individually to determine whether the
migration is valid. This is accomplished via an rmap_walk on the folio
containing the page, which checks every task that maps the page. The
per-page walk should be considered when using this interface to
migrate shared regions.

The remaining logic is the same as the move_pages syscall. One change
to do_pages_move is made (to check whether an mm_struct is passed) in
order to re-use the existing migration code.

Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl  |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl  |   1 +
 include/linux/syscalls.h                |   5 +
 include/uapi/asm-generic/unistd.h       |   8 +-
 kernel/sys_ni.c                         |   1 +
 mm/migrate.c                            | 178 +++++++++++++++++++++++-
 tools/include/uapi/asm-generic/unistd.h |   8 +-
 7 files changed, 197 insertions(+), 5 deletions(-)
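Given the three validation failures produced by the rmap walk described
above, a caller might decode per-page status values like this - a
sketch only; the helper and strings are illustrative, and only the
errno values come from add_phys_page_for_migration in the patch:

#include <errno.h>

/*
 * Map the per-page status values written by move_phys_pages to
 * human-readable causes. Values >= 0 mean the page ended up on (or
 * already resided on) that node.
 */
static const char *phys_move_errstr(int status)
{
        switch (status) {
        case -ENOENT:
                return "page offline, tail page, or unmapped";
        case -EFAULT:
                return "a mapping vma is not migratable";
        case -EACCES:
                return "target node not allowed by a mapping task's cpuset";
        default:
                return status < 0 ? "other migration error" : "on node";
        }
}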