
[RFC,3/3] mm/migrate: Create move_phys_pages syscall

Message ID 20230907075453.350554-4-gregory.price@memverge.com
State New, archived
Series: sys_move_phy_pages system call

Commit Message

Gregory Price Sept. 7, 2023, 7:54 a.m. UTC
Similar to the move_pages system call, instead of taking a pid and
list of virtual addresses, this system call takes a list of physical
addresses.

Because there is no task to validate the memory policy against, each
page needs to be interrogated to determine whether the migration is
valid, and all tasks that map it need to be interrogated.

This is accomplished via an rmap_walk on the folio containing
the page, and interrogating all tasks that map the page.

Each page must be interrogated individually, which should be
considered when using this to migrate shared regions.

The remaining logic is the same as the move_pages syscall. One
change to do_pages_move is made (to check whether an mm_struct is
passed) in order to re-use the existing migration code.

Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl  |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl  |   1 +
 include/linux/syscalls.h                |   5 +
 include/uapi/asm-generic/unistd.h       |   8 +-
 kernel/sys_ni.c                         |   1 +
 mm/migrate.c                            | 178 +++++++++++++++++++++++-
 tools/include/uapi/asm-generic/unistd.h |   8 +-
 7 files changed, 197 insertions(+), 5 deletions(-)

Comments

Gregory Price Sept. 8, 2023, 7:46 a.m. UTC | #1
On Sat, Sep 09, 2023 at 10:03:33AM +0200, Arnd Bergmann wrote:
> On Thu, Sep 7, 2023, at 09:54, Gregory Price wrote:
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> > b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 2d0b1bd866ea..25db6d71af0c 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -457,3 +457,4 @@
> >  450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
> >  451	i386	cachestat		sys_cachestat
> >  452	i386	fchmodat2		sys_fchmodat2
> > +454	i386	move_phys_pages		sys_move_phys_pages
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> > b/arch/x86/entry/syscalls/syscall_64.tbl
> > index 1d6eee30eceb..9676f2e7698c 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -375,6 +375,7 @@
> >  451	common	cachestat		sys_cachestat
> >  452	common	fchmodat2		sys_fchmodat2
> >  453	64	map_shadow_stack	sys_map_shadow_stack
> > +454	common	move_phys_pages		sys_move_phys_pages
> 
> Doing only x86 is fine for review purposes, but note that
> once there is consensus on actually merging it and the syscall
> number, you should do the same for all architectures. I think
> most commonly we do one patch to add the code and another
> patch to hook it up to all the syscall.tbl files and the
> include/uapi/asm-generic/unistd.h file.
>

I'd looked through a few prior patches and that's what I observed, so I
tried to follow that.  For the RFC I think it only made sense to
integrate it with the system I'm actually testing on, but I'll
eventually need to test it on ARM and such.

Noted.

> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index 22bc6bc147f8..6860675a942f 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -821,6 +821,11 @@ asmlinkage long sys_move_pages(pid_t pid, unsigned 
> > long nr_pages,
> >  				const int __user *nodes,
> >  				int __user *status,
> >  				int flags);
> > +asmlinkage long sys_move_phys_pages(unsigned long nr_pages,
> > +				    const void __user * __user *pages,
> > +				    const int __user *nodes,
> > +				    int __user *status,
> > +				    int flags);
> 
> The prototype looks good from a portability point of view,
> i.e. I made sure this is portable across CONFIG_COMPAT
> architectures etc.
> 
> What I'm not sure about is whether the representation of physical
> memory pages as a 'void *' is a good choice, I can see potential
> problems with this:
> 
> - it's not really a pointer, but instead a shifted PFN number
>   in the implementation
> 
> - physical addresses may be wider than pointers on 32-bit
>   architectures with CONFIG_PHYS_ADDR_T_64BIT
> 

Hm, good points.

I tried to keep the code close to move_pages for the sake of
familiarity and ease of review, but the physical address length
is not something I'd considered, and I'm not sure how exactly
we would handle CONFIG_PHYS_ADDR_T_64BIT.  If you have an idea,
I'm happy to run with it.

on address vs pfn:

Would using PFNs cause issues with portability of userland code? User
code that gets access to a physical address would have to convert
that to a PFN, which would be kernel-defined.  That could be easy
enough if the kernel exposed the shift size somewhere.

I can see arguments for PFN as opposed to physical address for
portability's sake.  This doesn't handle the 64-bit phys address
configuration issue, of course.

In the case where a user already has a PFN (e.g. via mm/page_idle),
defining it as a PFN interface would certainly make more sense.

Mayhaps handling both cases would be useful, depending on the source
information (see below for more details).

> I'm also not sure where the user space caller gets the
> physical addresses from, are those not intentionally kept
> hidden from userspace?
> 
>      Arnd

There are presently 4 places (that I know of) where userspace can get
physical addresses, and 1 that is being proposed for the near future:

1) Generally: /proc/pid/pagemap can be used to do page table
     translations.  I think this is only really useful for testing,
     since if you have the virtual address and pid you would use
     move_pages, but it's certainly useful for testing this (see the
     sketch after this list).

2) X86:  IBS (AMD) and PEBS (Intel) can be configured to return physical
     address information instead of virtual address information. In fact
     you can configure it to give you both the virtual and physical
     address for a process.

3) zoneinfo:  /proc/zoneinfo exposes the start PFN of zones

4) /sys/kernel/mm/page_idle:  A way to query whether a PFN is idle.
     While itself seemingly not useful, if the goal is to migrate
     less-used pages to "lower tiers" of memory, then you can query
     the bitmap, directly recover idle PFNs, and attempt to migrate
     them (which may fail, for a variety of reasons).

     https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html


5) CXL (Proposed): a CXL memory device on the PCIe bus may provide
     hot/cold information about its memory.  If a heatmap is provided,
     for example, it would have to be a device address (0-based) or a
     physical address (some base defined by the kernel and programmed
     into device decoders such that it can convert them to 0-based).

     This is presently being proposed but has not been agreed upon yet.
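
For illustration, a minimal userspace sketch of (1): translating a
virtual address to a physical address via /proc/self/pagemap.  This is
an editorial sketch, not part of the patch; it relies only on the
documented pagemap entry format (bits 0-54 hold the PFN, bit 63 is the
"present" bit) and on CAP_SYS_ADMIN to see real PFNs:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Translate a virtual address of the calling process to a physical
 * address via /proc/self/pagemap.  Returns 0 on failure.
 */
static uint64_t virt_to_phys(const void *virt)
{
	uint64_t page_size = (uint64_t)sysconf(_SC_PAGESIZE);
	uint64_t vpn = (uint64_t)virt / page_size;
	uint64_t entry;
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return 0;
	/* One 8-byte entry per virtual page. */
	if (pread(fd, &entry, sizeof(entry), vpn * sizeof(entry)) !=
	    sizeof(entry)) {
		close(fd);
		return 0;
	}
	close(fd);
	if (!(entry & (1ULL << 63)))	/* page not present */
		return 0;
	/* Bits 0-54 are the PFN; add the offset within the page. */
	return (entry & ((1ULL << 55) - 1)) * page_size +
	       (uint64_t)virt % page_size;
}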

~Gregory
Arnd Bergmann Sept. 9, 2023, 8:03 a.m. UTC | #2
On Thu, Sep 7, 2023, at 09:54, Gregory Price wrote:
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> b/arch/x86/entry/syscalls/syscall_32.tbl
> index 2d0b1bd866ea..25db6d71af0c 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -457,3 +457,4 @@
>  450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
>  451	i386	cachestat		sys_cachestat
>  452	i386	fchmodat2		sys_fchmodat2
> +454	i386	move_phys_pages		sys_move_phys_pages
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index 1d6eee30eceb..9676f2e7698c 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -375,6 +375,7 @@
>  451	common	cachestat		sys_cachestat
>  452	common	fchmodat2		sys_fchmodat2
>  453	64	map_shadow_stack	sys_map_shadow_stack
> +454	common	move_phys_pages		sys_move_phys_pages

Doing only x86 is fine for review purposes, but note that
once there is consensus on actually merging it and the syscall
number, you should do the same for all architectures. I think
most commonly we do one patch to add the code and another
patch to hook it up to all the syscall.tbl files and the
include/uapi/asm-generic/unistd.h file.

> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 22bc6bc147f8..6860675a942f 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -821,6 +821,11 @@ asmlinkage long sys_move_pages(pid_t pid, unsigned 
> long nr_pages,
>  				const int __user *nodes,
>  				int __user *status,
>  				int flags);
> +asmlinkage long sys_move_phys_pages(unsigned long nr_pages,
> +				    const void __user * __user *pages,
> +				    const int __user *nodes,
> +				    int __user *status,
> +				    int flags);

The prototype looks good from a portability point of view,
i.e. I made sure this is portable across CONFIG_COMPAT
architectures etc.

What I'm not sure about is whether the representation of physical
memory pages as a 'void *' is a good choice, I can see potential
problems with this:

- it's not really a pointer, but instead a shifted PFN number
  in the implementation

- physical addresses may be wider than pointers on 32-bit
  architectures with CONFIG_PHYS_ADDR_T_64BIT

I'm also not sure where the user space caller gets the
physical addresses from, are those not intentionally kept
hidden from userspace?

     Arnd
Arnd Bergmann Sept. 9, 2023, 3:18 p.m. UTC | #3
On Fri, Sep 8, 2023, at 09:46, Gregory Price wrote:
> On Sat, Sep 09, 2023 at 10:03:33AM +0200, Arnd Bergmann wrote:
>> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> > index 22bc6bc147f8..6860675a942f 100644
>> > --- a/include/linux/syscalls.h
>> > +++ b/include/linux/syscalls.h
>> > @@ -821,6 +821,11 @@ asmlinkage long sys_move_pages(pid_t pid, unsigned 
>> > long nr_pages,
>> >  				const int __user *nodes,
>> >  				int __user *status,
>> >  				int flags);
>> > +asmlinkage long sys_move_phys_pages(unsigned long nr_pages,
>> > +				    const void __user * __user *pages,
>> > +				    const int __user *nodes,
>> > +				    int __user *status,
>> > +				    int flags);
>> 
>> The prototype looks good from a portability point of view,
>> i.e. I made sure this is portable across CONFIG_COMPAT
>> architectures etc.
>> 
>> What I'm not sure about is whether the representation of physical
>> memory pages as a 'void *' is a good choice, I can see potential
>> problems with this:
>> 
>> - it's not really a pointer, but instead a shifted PFN number
>>   in the implementation
>> 
>> - physical addresses may be wider than pointers on 32-bit
>>   architectures with CONFIG_PHYS_ADDR_T_64BIT
>> 
>
> Hm, good points.
>
> I tried to keep the code close to move_pages for the sake of
> familiarity and ease of review, but the physical address length
> is not something I'd considered, and I'm not sure how exactly
> we would handle CONFIG_PHYS_ADDR_T_64BIT.  If you have an idea,
> I'm happy to run with it.

I think a pointer to '__u64' is the most appropriate here,
that is compatible between 32-bit and 64-bit architectures
and covers all addresses until we get architectures with
128-bit addressing.

Thinking about it more, I noticed an existing bug in
both sys_move_pages and your current version of
sys_move_phys_pages: the 'pages' array is in fact not
suitable for compat tasks and needs an is_compat_task
check to load a 32-bit address from compat userspace on
the "if (get_user(p, pages + i))" line.

> on address vs pfn:
>
> Would using PFNs cause issues with portability of userland code? User
> code that gets access to a physical address would have to convert
> that to a PFN, which would be kernel-defined.  That could be easy
> enough if the kernel exposed the shift size somewhere.

I don't think we currently use PFN anywhere in the syscall
ABI, so this introduces a new basic type into uapi headers that
is currently kernel internal.

A 32-bit PFN is always sufficient to represent all physical
addresses on native 32-bit kernel, but not necessarily for
compat user space on 64-bit kernels with an address space wider
than 44 bits (16 terabyte address range).

For the interface it would also require the same quirk
for compat tasks that I pointed out above that is missing
in sys_move_pages today.

A minor quirk of PFN values is that they require knowing
the page size to convert addresses, but I suppose you need
that anyway.

>> I'm also not sure where the user space caller gets the
>> physical addresses from, are those not intentionally kept
>> hidden from userspace?
>> 
>
> There are presently 4 places (that I know of) where userspace can get
> physical addresses, and 1 that is being proposed for the near future:
>
> 1) Generally: /proc/pid/pagemap can be used to do page table
>      translations.  I think this is only really useful for testing, 
>      since if you have the virtual address and pid you would use
>      move_pages, but it's certainly useful for testing this.
>
> 2) X86:  IBS (AMD) and PEBS (Intel) can be configured to return physical
>      address information instead of virtual address information. In fact
>      you can configure it to give you both the virtual and physical
>      address for a process.

Ah right. I see these already require CAP_SYS_ADMIN permissions
because of the risk of rowhammer or speculative execution attacks,
so I suppose users of move_phys_pages() would need additional
permissions compared to move_pages() to actually use that information.

> 3) zoneinfo:  /proc/zoneinfo exposes the start PFN of zones
>
> 4) /sys/kernel/mm/page_idle:  A way to query whether a PFN is idle.
>      While itself seemingly not useful, if the goal is to migrate
>      less-used pages to "lower tiers" of memory, then you can query
>      the bitmap, directly recover idle PFNs, and attempt to migrate
>      them (which may fail, for a variety of reasons).
>
>      https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
>
>
> 5) CXL (Proposed): a CXL memory device on the PCIe bus may provide
>      hot/cold information about its memory.  If a heatmap is provided,
>      for example, it would have to be a device address (0-based) or a
>      physical address (some base defined by the kernel and programmed
>      into device decoders such that it can convert them to 0-based).
>
>      This is presently being proposed but has not been agreed upon yet.

These do not seem to be problematic from the ASLR perspective, so
I guess it may still be useful without CAP_SYS_ADMIN.

     Arnd
Gregory Price Sept. 10, 2023, 11:49 a.m. UTC | #4
On Sun, Sep 10, 2023 at 02:36:40PM -0600, Jonathan Corbet wrote:
> Gregory Price <gourry.memverge@gmail.com> writes:
> 
> > Similar to the move_pages system call, instead of taking a pid and
> > list of virtual addresses, this system call takes a list of physical
> > addresses.
> >
> > Because there is no task to validate the memory policy against, each
> > page needs to be interrogated to determine whether the migration is
> > valid, and all tasks that map it need to be interrogated.
> >
> > This is accomplished via an rmap_walk on the folio containing
> > the page, and interrogating all tasks that map the page.
> >
> > Each page must be interrogated individually, which should be
> > considered when using this to migrate shared regions.
> >
> > The remaining logic is the same as the move_pages syscall. One
> > change to do_pages_move is made (to check whether an mm_struct is
> > passed) in order to re-use the existing migration code.
> >
> > Signed-off-by: Gregory Price <gregory.price@memverge.com>
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl  |   1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl  |   1 +
> >  include/linux/syscalls.h                |   5 +
> >  include/uapi/asm-generic/unistd.h       |   8 +-
> >  kernel/sys_ni.c                         |   1 +
> >  mm/migrate.c                            | 178 +++++++++++++++++++++++-
> >  tools/include/uapi/asm-generic/unistd.h |   8 +-
> >  7 files changed, 197 insertions(+), 5 deletions(-)
> 
> So this is probably a silly question, but just to be sure ... what is
> the permission model for this system call?  As far as I can tell, the
> ability to move pages is entirely unrestricted, with the exception of
> pages that would need MPOL_MF_MOVE_ALL.  If so, that seems undesirable,
> but probably I'm just missing something ... ?
> 
> Thanks,
> 
> jon

Not silly, looks like when I dropped the CAP_SYS_NICE check (no task to
check against), I neglected to add a CAP_SYS_ADMIN check.

Oversight on my part, I'll work it in with other feedback.

Thanks!
~Gregory
Gregory Price Sept. 10, 2023, 12:52 p.m. UTC | #5
On Sat, Sep 09, 2023 at 05:18:13PM +0200, Arnd Bergmann wrote:
> 
> I think a pointer to '__u64' is the most appropriate here,
> that is compatible between 32-bit and 64-bit architectures
> and covers all addresses until we get architectures with
> 128-bit addressing.
> 
> Thinking about it more, I noticed an existing bug in
> both sys_move_pages and your current version of
> sys_move_phys_pages: the 'pages' array is in fact not
> suitable for compat tasks and needs an is_compat_task
> check to load a 32-bit address from compat userspace on
> the "if (get_user(p, pages + i))" line.
>

I'll clean up the current implementation for what I have on a v2 of an
RFC, and then look at adding some pull-ahead patches to fix both
move_pages and move_phys_pages for compat processes.  Might take me a
bit, I've only done compat work once before and I remember it being
annoying to get right.

I did see other work on migrate.c hanging around on the list, I'll
double check this hasn't already been discovered/handled.

> > on address vs pfn:
> >
> > Would using PFNs cause issues with portability of userland code? User
> > code that gets access to a physical address would have to convert
> > that to a PFN, which would be kernel-defined.  That could be easy
> > enough if the kernel exposed the shift size somewhere.
> 
> I don't think we currently use PFN anywhere in the syscall
> ABI, so this introduces a new basic type into uapi headers that
> is currently kernel internal.
> 
> A 32-bit PFN is always sufficient to represent all physical
> addresses on native 32-bit kernel, but not necessarily for
> compat user space on 64-bit kernels with an address space wider
> than 44 bits (16 terabyte address range).
> 
> For the interface it would also require the same quirk
> for compat tasks that I pointed out above that is missing
> in sys_move_pages today.
> 
> A minor quirk of PFN values is that they require knowing
> the page size to convert addresses, but I suppose you need
> that anyway.
>

Hm.  So, based on the 5 sources I listed previously, some can be PFN and
others can be phys-addr.  To validate a physical address and
ensure the page is online, we have to convert to PFN anyway.

So for the user's sake, let's consider

Source Data: PFN
User Actions:
    1) If interface requires phys-addr, convert to phys-addr
    2) If interface allows PFN, simply pass in as-is
Kernel Actions:
    0) If Phys-Addr, convert to PFN
    1) Validate the PFN (no bits beyond architecture length)
    2) Get and validate struct page

Source Data: Phys-Addr
User Actions:
    1) If interface requires phys-addr, simply pass in as-is
    2) If interface requires PFN, convert to PFN
Kernel Actions:
    0) If Phys-addr, convert to PFN
    1) Validate the PFN (no bits beyond arch length)
    2) Get and validate struct page

I think the take-away here is that the kernel has to validate the PFN
regardless, and the kernel already has the information to do that.

If you want to use both PFN (Idle Bit) and Phys Addr (IBS/PEBS)
information, then the user needs the information to do page shifts.
That requires more exposure of kernel information, and is possibly
complicated by things like hugepages.  It seems better to simply
allow the interface to take both values, and add flags indicating
which format is being used.

The worst possible result of malformed input is that the wrong page
is migrated.  That's no different than move_phys_pages(/dev/random),
and the user already requires CAP_SYS_NICE or _ADMIN, so there are
probably worse things they could accomplish.

Additionally, this allows the page buffer to simply use __u64*, with
no need to expose a new user type for PFN.

Now the interactions look like:

Source Data: Phys-Addr or PFN
User Actions:
    0) move_phys_pages([phys-addr | pfn], MPOL_MF_(PHYS_ADDR ^ PFN))
Kernel Actions:
    0) If Phys-addr, convert to PFN
    1) Validate the PFN (no bits beyond arch length)
    2) Get and validate struct page

This only requires plumbing 2 new flags through do_pages_move, and no
new user-exposed types or information.

Is there an ick-factor with the idea of adding the following?

MPOL_MF_PHYS_ADDR : Treat page migration addresses as physical
MPOL_MF_PFN : Treat page migration addresses as PFNs
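
For concreteness, a hypothetical caller under that scheme might look
as follows.  This is a sketch only: the syscall number 454 comes from
the x86 tables in this patch, and the MPOL_MF_PHYS_ADDR value is made
up here; none of this is a real ABI yet.

#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/mempolicy.h>	/* MPOL_MF_MOVE */

/* Hypothetical values: 454 is from the x86 syscall tables in this
 * RFC; the MPOL_MF_PHYS_ADDR bit is invented for this sketch.
 */
#define __NR_move_phys_pages	454
#define MPOL_MF_PHYS_ADDR	(1 << 7)

/* Migrate one page, identified by physical address, to node 1. */
static long migrate_one_phys_page(uint64_t paddr)
{
	int node = 1;
	int status = 0;

	return syscall(__NR_move_phys_pages, 1UL, &paddr, &node,
		       &status, MPOL_MF_MOVE | MPOL_MF_PHYS_ADDR);
}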

> 
> These do not seem to be problematic from the ASLR perspective, so
> I guess it may still be useful without CAP_SYS_ADMIN.
> 
>      Arnd

After reviewing the capabilities documentation it seems like
CAP_SYS_NICE is the appropriate capability.  In my last message I said
CAP_SYS_ADMIN was probably correct, but I think using CAP_SYS_NICE
is more appropriate unless there are strong feelings due to the use of
PFN and Physical Address.

I'm not sure rowhammer is of great concern in this interface because you
can't choose the destination address, only the destination node. Though
I suppose someone could go nuts and try to "massage" a node in some way
to get a statistical likelihood of placement (similar to heap grooming).

~Gregory
Gregory Price Sept. 10, 2023, 3:01 p.m. UTC | #6
On Mon, Sep 11, 2023 at 07:26:45PM +0200, Arnd Bergmann wrote:
> On Sun, Sep 10, 2023, at 14:52, Gregory Price wrote:
> > I'll clean up the current implementation for what I have on a v2 of an
> > RFC, and then look at adding some pull-ahead patches to fix both
> > move_pages and move_phys_pages for compat processes.  Might take me a
> > bit, I've only done compat work once before and I remember it being
> > annoying to get right.
> 
> I think what you want is roughly this (untested):
> 
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2159,6 +2159,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
>                          const int __user *nodes,
>                          int __user *status, int flags)
>  {
> +       compat_uptr_t __user *compat_pages = (void __user *)pages;
>         int current_node = NUMA_NO_NODE;
>         LIST_HEAD(pagelist);
>         int start, i;
> @@ -2171,8 +2172,17 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
>                 int node;
>  
>                 err = -EFAULT;
> -               if (get_user(p, pages + i))
> -                       goto out_flush;
> +               if (in_compat_syscall()) {
> +                       compat_uptr_t cp;
> +
> +                       if (get_user(cp, compat_pages + i))
> +                               goto out_flush;
> +
> +                       p = compat_ptr(cp);
> +               } else {
> +                       if (get_user(p, pages + i))
> +                               goto out_flush;
> +               }
>                 if (get_user(node, nodes + i))
>                         goto out_flush;
>  
> alternatively you could use the get_compat_pages_array()
> helper that is already used in the do_pages_stat()
> function.
> 

Appreciated, I'll give it a hack before I submit v2.

Just to be clear, it sounds like you want move_pages to be converted
from (const void __user * __user *pages) to (const __u64 __user *pages) as
well, correct?  That seems like a fairly trivial change.
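
For example, the loop body would reduce to something like this (a
sketch of the idea, untested, in the spirit of the diff above):

		__u64 p;

		err = -EFAULT;
		/* A __u64 load works identically for native and compat
		 * callers, so the in_compat_syscall() branch above is
		 * no longer needed.
		 */
		if (get_user(p, pages + i))
			goto out_flush;
		if (get_user(node, nodes + i))
			goto out_flush;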

> 
> >
> > This only requires plumbing 2 new flags through do_pages_move, and no
> > new user-exposed types or information.
> >
> > Is there an ick-factor with the idea of adding the following?
> >
> > MPOL_MF_PHYS_ADDR : Treat page migration addresses as physical
> > MPOL_MF_PFN : Treat page migration addresses as PFNs
> 
> I would strongly prefer supporting only one of the two, and
> a 64-bit physical address seems like the logical choice here.
> 
> I agree that this doesn't introduce any additional risk for rowhammer
> attacks, but it seems slightly more logical to me to use CAP_SYS_ADMIN
> if that is what the other interfaces use that handle physical addresses
> and may leak address information.
> 
>      Arnd

Fair enough, I'll swap to ADMIN and limit to phys_addr.

I suppose I could add /sys/kernel/mm/page_size accessible only by root
for the same purpose, so that PFNs from idle and such can be useful.

I don't know of another way for userland to determine the shift.
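
For now, a sketch of the conversion userland would do, assuming the
kernel's PFN shift corresponds to the base page size reported by
sysconf (an assumption, which is exactly why an explicit export might
be warranted):

#include <stdint.h>
#include <unistd.h>

/* Convert a PFN (e.g. from page_idle) to a physical address,
 * assuming PFN granularity equals the base page size.
 */
static uint64_t pfn_to_phys(uint64_t pfn)
{
	return pfn * (uint64_t)sysconf(_SC_PAGESIZE);
}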

~Gregory
Jonathan Corbet Sept. 10, 2023, 8:36 p.m. UTC | #7
Gregory Price <gourry.memverge@gmail.com> writes:

> Similar to the move_pages system call, instead of taking a pid and
> list of virtual addresses, this system call takes a list of physical
> addresses.
>
> Because there is no task to validate the memory policy against, each
> page needs to be interrogated to determine whether the migration is
> valid, and all tasks that map it need to be interrogated.
>
> This is accomplished via an rmap_walk on the folio containing
> the page, and interrogating all tasks that map the page.
>
> Each page must be interrogated individually, which should be
> considered when using this to migrate shared regions.
>
> The remaining logic is the same as the move_pages syscall. One
> change to do_pages_move is made (to check whether an mm_struct is
> passed) in order to re-use the existing migration code.
>
> Signed-off-by: Gregory Price <gregory.price@memverge.com>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl  |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl  |   1 +
>  include/linux/syscalls.h                |   5 +
>  include/uapi/asm-generic/unistd.h       |   8 +-
>  kernel/sys_ni.c                         |   1 +
>  mm/migrate.c                            | 178 +++++++++++++++++++++++-
>  tools/include/uapi/asm-generic/unistd.h |   8 +-
>  7 files changed, 197 insertions(+), 5 deletions(-)

So this is probably a silly question, but just to be sure ... what is
the permission model for this system call?  As far as I can tell, the
ability to move pages is entirely unrestricted, with the exception of
pages that would need MPOL_MF_MOVE_ALL.  If so, that seems undesirable,
but probably I'm just missing something ... ?

Thanks,

jon
Arnd Bergmann Sept. 11, 2023, 5:26 p.m. UTC | #8
On Sun, Sep 10, 2023, at 14:52, Gregory Price wrote:
> On Sat, Sep 09, 2023 at 05:18:13PM +0200, Arnd Bergmann wrote:
>> 
>> I think a pointer to '__u64' is the most appropriate here,
>> that is compatible between 32-bit and 64-bit architectures
>> and covers all addresses until we get architectures with
>> 128-bit addressing.
>> 
>> Thinking about it more, I noticed an existing bug in
>> both sys_move_pages and your current version of
>> sys_move_phys_pages: the 'pages' array is in fact not
>> suitable for compat tasks and needs an is_compat_task
>> check to load a 32-bit address from compat userspace on
>> the "if (get_user(p, pages + i))" line.
>>
>
> I'll clean up the current implementation for what I have on a v2 of an
> RFC, and then look at adding some pull-ahead patches to fix both
> move_pages and move_phys_pages for compat processes.  Might take me a
> bit, I've only done compat work once before and I remember it being
> annoying to get right.

I think what you want is roughly this (untested):

--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2159,6 +2159,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
                         const int __user *nodes,
                         int __user *status, int flags)
 {
+       compat_uptr_t __user *compat_pages = (void __user *)pages;
        int current_node = NUMA_NO_NODE;
        LIST_HEAD(pagelist);
        int start, i;
@@ -2171,8 +2172,17 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
                int node;
 
                err = -EFAULT;
-               if (get_user(p, pages + i))
-                       goto out_flush;
+               if (in_compat_syscall()) {
+                       compat_uptr_t cp;
+
+                       if (get_user(cp, compat_pages + i))
+                               goto out_flush;
+
+                       p = compat_ptr(cp);
+               } else {
+                       if (get_user(p, pages + i))
+                               goto out_flush;
+               }
                if (get_user(node, nodes + i))
                        goto out_flush;
 
alternatively you could use the get_compat_pages_array()
helper that is already used in the do_pages_stat()
function.

> I did see other work on migrate.c hanging around on the list, I'll
> double check this hasn't already been discovered/handled.

It looks like I broke it, and it was working before my own
5b1b561ba73c8 ("mm: simplify compat_sys_move_pages"), which
only handled the nodes=NULL path.

I suppose nobody noticed the regression because there are very
few 32-bit NUMA systems, and even fewer cases in which one
would run compat userspace to manage a 64-bit NUMA machine.

>
> This only requires plumbing 2 new flags through do_pages_move, and no
> new user-exposed types or information.
>
> Is there an ick-factor with the idea of adding the following?
>
> MPOL_MF_PHYS_ADDR : Treat page migration addresses as physical
> MPOL_MF_PFN : Treat page migration addresses as PFNs

I would strongly prefer supporting only one of the two, and
a 64-bit physical address seems like the logical choice here.

>> These do not seem to be problematic from the ASLR perspective, so
>> I guess it may still be useful without CAP_SYS_ADMIN.
>
> After reviewing the capabilities documentation it seems like
> CAP_SYS_NICE is the appropriate capability.  In my last message I said
> CAP_SYS_ADMIN was probably correct, but I think using CAP_SYS_NICE
> is more appropriate unless there are strong feelings due to the use of
> PFN and Physical Address.
>
> I'm not sure rowhammer is of great concern in this interface because you
> can't choose the destination address, only the destination node. Though
> I suppose someone could go nuts and try to "massage" a node in some way
> to get a statistical likelihood of placement (similar to heap grooming).

I agree that this doesn't introduce any additional risk for rowhammer
attacks, but it seems slightly more logical to me to use CAP_SYS_ADMIN
if that is what the other interfaces use that handle physical addresses
and may leak address information.

     Arnd
Thomas Gleixner Sept. 19, 2023, 12:17 a.m. UTC | #9
On Thu, Sep 07 2023 at 03:54, Gregory Price wrote:
> Similar to the move_pages system call, instead of taking a pid and
> list of virtual addresses, this system call takes a list of physical
> addresses.

Silly question. Where are these physical addresses coming from?

In my naive understanding user space deals with virtual addresses for a
reason.

Exposing access to physical addresses is definitely helpful to write
more powerful exploits, so what are the restrictions applied to this?

> +/*
> + * Move a list of pages in the address space of the currently executing
> + * process.
> + */
> +static int kernel_move_phys_pages(unsigned long nr_pages,
> +				  const void __user * __user *pages,
> +				  const int __user *nodes,
> +				  int __user *status, int flags)
> +{
> +	int err;
> +	nodemask_t target_nodes;
> +
> +	/* Check flags */

Documenting the obvious ...

> +	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
> +		return -EINVAL;
> +
> +	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
> +		return -EPERM;

According to this logic here MPOL_MF_MOVE is unrestricted, right?

But how does an unprivileged process know which physical address the
pages have? Confused....

> +	/* All tasks mapping each page is checked in phys_page_migratable */
> +	nodes_setall(target_nodes);

How is the comment related to nodes_setall() and why is nodes_setall()
unconditional when target_nodes is only used in the @nodes != NULL case?

> +	if (nodes)
> +		err = do_pages_move(NULL, target_nodes, nr_pages, pages,
> +			nodes, status, flags);
> +	else
> +		err = do_pages_stat(NULL, nr_pages, pages, status);

Thanks,

        tglx
Andy Lutomirski Sept. 19, 2023, 3:34 a.m. UTC | #10
On Sun, Sep 10, 2023, at 4:49 AM, Gregory Price wrote:
> On Sun, Sep 10, 2023 at 02:36:40PM -0600, Jonathan Corbet wrote:
>> Gregory Price <gourry.memverge@gmail.com> writes:
>> 
>> > Similar to the move_pages system call, instead of taking a pid and
>> > list of virtual addresses, this system call takes a list of physical
>> > addresses.
>> >
>> > Because there is no task to validate the memory policy against, each
>> > page needs to be interrogated to determine whether the migration is
>> > valid, and all tasks that map it need to be interrogated.
>> >
>> > This is accomplished via an rmap_walk on the folio containing
>> > the page, and interrogating all tasks that map the page.
>> >
>> > Each page must be interrogated individually, which should be
>> > considered when using this to migrate shared regions.
>> >
>> > The remaining logic is the same as the move_pages syscall. One
>> > change to do_pages_move is made (to check whether an mm_struct is
>> > passed) in order to re-use the existing migration code.
>> >
>> > Signed-off-by: Gregory Price <gregory.price@memverge.com>
>> > ---
>> >  arch/x86/entry/syscalls/syscall_32.tbl  |   1 +
>> >  arch/x86/entry/syscalls/syscall_64.tbl  |   1 +
>> >  include/linux/syscalls.h                |   5 +
>> >  include/uapi/asm-generic/unistd.h       |   8 +-
>> >  kernel/sys_ni.c                         |   1 +
>> >  mm/migrate.c                            | 178 +++++++++++++++++++++++-
>> >  tools/include/uapi/asm-generic/unistd.h |   8 +-
>> >  7 files changed, 197 insertions(+), 5 deletions(-)
>> 
>> So this is probably a silly question, but just to be sure ... what is
>> the permission model for this system call?  As far as I can tell, the
>> ability to move pages is entirely unrestricted, with the exception of
>> pages that would need MPOL_MF_MOVE_ALL.  If so, that seems undesirable,
>> but probably I'm just missing something ... ?
>> 
>> Thanks,
>> 
>> jon
>
> Not silly, looks like when I dropped the CAP_SYS_NICE check (no task to
> check against), I neglected to add a CAP_SYS_ADMIN check.

Global, I presume?

I have to admit that I don’t think this patch set makes sense at all.

As I understand it, there are two kinds of physical memory resource in CXL: those that live on a device and those that live in host memory.

Device memory doesn’t migrate as such: if a page is on an accelerator, it’s on that accelerator. (If someone makes an accelerator with *two* PCIe targets and connects each target to a different node, that’s a different story.)

Host memory is host memory. CXL may access it, and the CXL access from a given device may be faster if that device is connected closer to the memory. And the device may or may not know the virtual address and PASID of the memory.

I fully believe that there’s some use for migrating host memory to a node that's closer to a device.  But I don't think this API is the right way.  First, something needs to figure out that the host memory should be migrated.  Doing this presumably involves identifying which (logical!) memory is being accessed and deciding to move it.  Maybe new APIs are needed to enable this.

But this API is IMO rather silly.  Just as a trivial observation, if you migrate a page you identify by physical address, *that physical address changes*.  So the only way it possibly works is that whatever heuristic is using the API knows to invalidate itself after calling the API, but of course it also needs to invalidate itself if the kernel becomes intelligent enough to migrate the page on its own or the owner of the logical page triggers migration, etc.

Put differently, the operation "migrate physical page 0xABCD000 to node 3" makes no sense.  That physical address belongs to whatever node it's on, and without some magic hardware support that does not currently exist, it's not going anywhere at runtime.

I just don't see this code working well, never mind the security issues.
Gregory Price Sept. 19, 2023, 3:57 p.m. UTC | #11
On Tue, Sep 19, 2023 at 02:17:15AM +0200, Thomas Gleixner wrote:
> On Thu, Sep 07 2023 at 03:54, Gregory Price wrote:
> > Similar to the move_pages system call, instead of taking a pid and
> > list of virtual addresses, this system call takes a list of physical
> > addresses.
> 
> Silly question. Where are these physical addresses coming from?
> 
> In my naive understanding user space deals with virtual addresses for a
> reason.
> 
> Exposing access to physical addresses is definitely helpful to write
> more powerful exploits, so what are the restrictions applied to this?
> 

There are a variety of sources from which userspace can acquire physical
addresses; most require CAP_SYS_ADMIN.

I should probably add this list explicitly to the RFC for clarification:

1) /proc/pid/pagemap: can be used to do page table translation.
     This is only really useful for testing, and it requires
     CAP_SYS_ADMIN.

2) X86:  IBS (AMD) and PEBS (Intel) can be configured to return
     physical, virtual, or both physical and virtual adressing.
     This requires CAP_SYS_ADMIN.

3) zoneinfo:  /proc/zoneinfo exposes the start PFN of zones. Not largely
     useful in this context, but notably it does expose PFNs and does
     not require privilege.

4) /sys/kernel/mm/page_idle:  A way to query whether a PFN is idle.
     One tiering strategy can be to use idle information to move
     less-used pages to "lower tiers" of memory.

     With page_idle, you can query idle status and migrate based on PFN,
     without the need to know which process the page belongs to.  Obviously
     migrations must still respect cpuset restrictions, which is why
     iterating through each mapping VMA is required.

     https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html

5) CXL (Proposed): a CXL memory device on the PCIe bus may provide
     hot/cold information about its memory.  If a heatmap is provided,
     for example, it would have to be a device address (0-based) or a
     physical address (some base defined by the kernel and programmed
     into device decoders such that it can convert them to 0-based).

     This is presently being proposed but has not been agreed upon yet.


> > +	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
> > +		return -EINVAL;
> > +
> > +	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
> > +		return -EPERM;
> 
> According to this logic here MPOL_MF_MOVE is unrestricted, right?
> 
> But how does an unprivileged process know which physical address the
> pages have? Confused....
> 

Someone else pointed this out as well; I should have a CAP_SYS_ADMIN
check here that I neglected to add.  I have a v2 of the RFC with a test
and fixes, and I am working on a sample man page.

This entire syscall should require CAP_SYS_ADMIN, though there may be an
argument for CAP_SYS_NICE.  I personally am of the opinion that ADMIN is
the right choice, because you're right that operations on physical
addresses are a major security issue.

> > +	/* All tasks mapping each page is checked in phys_page_migratable */
> > +	nodes_setall(target_nodes);
> 
> How is the comment related to nodes_setall() and why is nodes_setall()
> unconditional when target_nodes is only used in the @nodes != NULL case?
> 

Yeah this comment is confusing.  This is a bit of a wart that comes from
trying to re-use the existing do_pages_move to ensure correctness, and
I really should have added this information directly in the comments.

The relevant code is here:

static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
                         unsigned long nr_pages,
                         const void __user * __user *pages,
                         const int __user *nodes,
                         int __user *status, int flags)
{
...
	err = -EACCES;
	if (!node_isset(node, task_nodes))
		goto out_flush;
...
	if (mm)
		err = add_virt_page_for_migration(mm, p, current_node, &pagelist,
					     flags & MPOL_MF_MOVE_ALL);
	else
		err = add_phys_page_for_migration(p, current_node, &pagelist,
				flags & MPOL_MF_MOVE_ALL);


do_pages_move was originally written to check that the cpuset contained
the node as a valid destination for the entire request. This nodemask
is acquired from find_mm_struct in a call to cpuset_mems_allowed(task).

e.g. if cpuset.mems=0,1 and you attempt a move_pages(task, 2), then
the entire syscall will error out with -EACCES.

The first chunk above checks this condition.

The second chunk (add_virt/phys...), these calls check that individual
pages are actually migratable, and if so removes them from the LRU and
places them onto the list of pages to migrate in-bulk at a later time.

In the physical addressing path, there is no pid/task provided by the
user, because physical addresses are not associated directly with a task.
We must look up each VMA, and each owner of that VMA, and validate
the intersection of the allowed nodes defined by that set contain the
requested node.

This is implemented here:

static int add_phys_page_for_migration(const void __user *p, int node,
                                       struct list_head *pagelist,
                                       bool migrate_all)
...
        pfn = ((unsigned long)p) >> PAGE_SHIFT;
        page = pfn_to_online_page(pfn);
        if (!page || PageTail(page))
                return -ENOENT;

        folio = phys_migrate_get_folio(page);
        if (folio) {
                rmap_walk(folio, &rwc); <---- per-vma walk
                folio_put(folio);
        }


static bool phys_page_migratable(struct folio *folio,
                                 struct vm_area_struct *vma,
                                 unsigned long address,
                                 void *arg)
{
        struct rmap_page_ctxt *ctxt = (struct rmap_page_ctxt *)arg;
        struct task_struct *owner = vma->vm_mm->owner;
        nodemask_t task_nodes = cpuset_mems_allowed(owner);
... the actual node check here ...
        ctxt->node_allowed &= node_isset(ctxt->node, task_nodes);



I will add some better comments that lay this out.

~Gregory
Gregory Price Sept. 19, 2023, 4:31 p.m. UTC | #12
On Mon, Sep 18, 2023 at 08:34:16PM -0700, Andy Lutomirski wrote:
> 
> 
> On Sun, Sep 10, 2023, at 4:49 AM, Gregory Price wrote:
> > On Sun, Sep 10, 2023 at 02:36:40PM -0600, Jonathan Corbet wrote:
> >> 
> >> So this is probably a silly question, but just to be sure ... what is
> >> the permission model for this system call?  As far as I can tell, the
> >> ability to move pages is entirely unrestricted, with the exception of
> >> pages that would need MPOL_MF_MOVE_ALL.  If so, that seems undesirable,
> >> but probably I'm just missing something ... ?
> >> 
> >> Thanks,
> >> 
> >> jon
> >
> > Not silly, looks like when I dropped the CAP_SYS_NICE check (no task to
> > check against), I neglected to add a CAP_SYS_ADMIN check.
> 
> Global, I presume?
> 
> I have to admit that I don’t think this patch set makes sense at all.
> 
> As I understand it, there are two kinds of physical memory resource in CXL: those that live on a device and those that live in host memory.
> 
> Device memory doesn’t migrate as such: if a page is on an accelerator, it’s on that accelerator. (If someone makes an accelerator with *two* PCIe targets and connects each target to a different node, that’s a different story.)
> 
> Host memory is host memory. CXL may access it, and the CXL access from a given device may be faster if that device is connected closer to the memory. And the device may or may not know the virtual address and PASID of the memory.
> 

The CXL memory description here is a bit inaccurate.  Memory on the CXL
bus is not limited to host and accelerator, CXL memory devices may also
present memory for use by the system as-if it were just DRAM as well.
The accessing mechanisms are the same (i.e. you can 'mov rax, [rbx]'
and the result is a cacheline fetch that goes over the cxl bus rather
than the DRAM memory controllers).

Small CXL background for the sake of clarity:

type-2 devices are "accelerators", and the memory relationships you
describe here are roughly accurate.  The intent of this interface is not
really for the purpose of managing type-2/accelerator device memory.

type-3 devices are "memory devices", whose intent is to provide the
system additional memory resources that get mapped into one or more numa
nodes.  The intent of these devices is to present memory to the kernel
*as-if* it were regular old DRAM just with different latency and
bandwidth attributes. This is a simplification of the overall goal.


So from the perspective of the kernel and a memory-tiering system, we
have numa nodes which abstract physical memory, and that physical memory
may actually live anywhere (DRAM, CXL, where-ever).  This memory is
fungible with the exception that CXL memory should be placed in
ZONE_MOVABLE to ensure the hot-pluggability of those memory devices.


The intent of this interface is to make page-migration decisions without
the need to track individual processes or virtual address mappings.

One example would be to utilize the idle page tracking mechanism from
userland to make migration decisions.

https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html

This mechanism allows a user to determine which PFNs are idle.  Combine
this information with a move_phys_pages syscall and you can implement
demotion/promotion in userland without having to identify the virtual
address mappings of those PFNs.
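
A minimal sketch of the query half of that flow, relying only on the
bitmap layout documented for idle page tracking (64 PFNs per 8-byte
word); pairing it with move_phys_pages is of course hypothetical until
this RFC lands:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Check whether a PFN is idle via /sys/kernel/mm/page_idle/bitmap.
 * Each 8-byte word covers 64 PFNs; requires root.
 * Returns 1 if idle, 0 if not, -1 on error.
 */
static int pfn_is_idle(uint64_t pfn)
{
	uint64_t word;
	int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDONLY);

	if (fd < 0)
		return -1;
	if (pread(fd, &word, sizeof(word), (pfn / 64) * 8) != sizeof(word)) {
		close(fd);
		return -1;
	}
	close(fd);
	return (word >> (pfn % 64)) & 1;
}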


> I fully believe that there’s some use for migrating host memory to a node that's closer to a device.  But I don't think this API is the right way.  First, something needs to figure out that the host memory should be migrated.  Doing this presumably involves identifying which (logical!) memory is being accessed and deciding to move it.  Maybe new APIs are needed to enable this.
> 

The intent is not to migrate memory to make it "closer to a device",
assuming you mean the intent is to make that data closer to a device
that is using it (i.e. an accelerator).

The intent is to allow migration of memory based on a user-defined
policy via the usage of physical addresses.

Let's consider a bandwidth-expansion focused tiering policy.  Each
additional CXL Type-3 Memory device provides additional memory
bandwidth to a processor via its PCIe/CXL lanes.

If all you care about is latency, moving/migrating pages closer to the
processor is beneficial.  However, if you care about maximizing
bandwidth, distributing memory across all possible devices with some
statistical distribution is a better goal.

So sometimes you actually want hot data further away because it allows
for more concurrent cacheline fetches to occur.


The question as to whether getting the logical memory address is
required, useful, or performant depends on what sources of information
you can pull physical address information from.

Above I explained idle page tracking, but another example would be the
CXL device directly, which knows 2 pieces of information (generally):

1) The extent of the memory it is hosting (some size)
2) The physical-to-device address mapping for the system contacting it.

The device talks (internally) in 0-based addressing (0x0 up to 0x...),
but the host places the host physical address (HPA) on the bus
(0x123450000).  The device receives and converts 0x123450000 (HPA) into
a 0-based address (device-physical-address, DPA).

How does this relate to this interface?

Consider a device which provides a "heat-map" for the memory it is
hosting.  If a user or system requests this heat-map, the device can
only provide that information in terms of either HPA or DPA.  If DPA,
then the host can recover the HPA by simply looking at the mapping it
programmed the device with.  This reverse translation (DPA-to-HPA) is
relatively inexpensive.
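
In code terms, for a single non-interleaved region the cheap direction
is plain offset arithmetic.  A sketch: 'region_base' stands in for
whatever HPA base the kernel programmed into the device decoder, and
real CXL interleave makes this more involved.

#include <stdint.h>

/* DPA <-> HPA translation for one contiguous, non-interleaved region.
 * region_base is the host physical base programmed into the decoder.
 */
static uint64_t dpa_to_hpa(uint64_t dpa, uint64_t region_base)
{
	return region_base + dpa;
}

static uint64_t hpa_to_dpa(uint64_t hpa, uint64_t region_base)
{
	return hpa - region_base;
}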

The idle-page tracking interface is actually a good example of this. It
is functionally a heat-map for the entire system.

However, things get extraordinarily expensive after this.  HPA to host
virtual address translation (HPA to HVA) requires inspecting every task
that may map that HPA in its page tables.  When the cacheline fetch hits
the bus, you are well below the construct of a "task", and the device
has no way of telling you what task is using memory on that device.

This makes any kind of tiering operation based on this type of heat-map
information somewhat of a non-starter.  You steal so much performance
just converting that information into task-specific information, that
you may as well not bother doing it.

Instead, this interface would allow for a tiering policy to operate on
such heat-map information directly, and since all CXL memory is intended
to be placed in ZONE_MOVABLE, that memory should always be migratable.

> But this API is IMO rather silly.  Just as a trivial observation, if you migrate a page you identify by physical address, *that physical address changes*.  So the only way it possibly works is that whatever heuristic is using the API knows to invalidate itself after calling the API, but of course it also needs to invalidate itself if the kernel becomes intelligent enough to migrate the page on its own or the owner of the logical page triggers migration, etc.
> 
> Put differently, the operation "migrate physical page 0xABCD000 to node 3" makes no sense.  That physical address belongs to whatever node it's on, and without some magic hardware support that does not currently exist, it's not going anywhere at runtime.
> 
> I just don't see this code working well, never mind the security issues.

I think this is more of a terminology issue.  I'm not married to the
name, but to me move_phys_pages is intuitively easier to understand
because move_pages exists and the only difference between the two
interfaces is virtual vs physical addressing.

move_pages doesn't "migrate a virtual page" either, it "migrates the
data pointed to by this virtual address to another physical page located 
on the target numa node".

Likewise this interface "migrates the data located at the physical address,
assuming the physical address is mapped, to another page on the target numa
node".

The virtual-address sister-function doesn't "move the physical page"
either, it moves the data from one physical page to another and updates
the page tables.

~Gregory
Gregory Price Sept. 19, 2023, 4:37 p.m. UTC | #13
On Tue, Sep 19, 2023 at 02:17:15AM +0200, Thomas Gleixner wrote:
> On Thu, Sep 07 2023 at 03:54, Gregory Price wrote:
> > +	/* All tasks mapping each page is checked in phys_page_migratable */
> > +	nodes_setall(target_nodes);
> 
> How is the comment related to nodes_setall() and why is nodes_setall()
> unconditional when target_nodes is only used in the @nodes != NULL case?
> 

Short follow-up, sorry for the spam.  I realized there is a better way
to do this by simply bypassing the task node check in do_pages_move if
the mm_struct is NULL.  This removes the need for this struct
altogether.

Will simplify down for a v2 and add comments to do_pages_move
accordingly.

~Gregory
Andy Lutomirski Sept. 19, 2023, 5:59 p.m. UTC | #14
On Tue, Sep 19, 2023, at 9:31 AM, Gregory Price wrote:
> On Mon, Sep 18, 2023 at 08:34:16PM -0700, Andy Lutomirski wrote:
>> 
>> 
>> On Sun, Sep 10, 2023, at 4:49 AM, Gregory Price wrote:
>> > On Sun, Sep 10, 2023 at 02:36:40PM -0600, Jonathan Corbet wrote:
>> >> 
>> >> So this is probably a silly question, but just to be sure ... what is
>> >> the permission model for this system call?  As far as I can tell, the
>> >> ability to move pages is entirely unrestricted, with the exception of
>> >> pages that would need MPOL_MF_MOVE_ALL.  If so, that seems undesirable,
>> >> but probably I'm just missing something ... ?
>> >> 
>> >> Thanks,
>> >> 
>> >> jon
>> >
>> > Not silly, looks like when I dropped the CAP_SYS_NICE check (no task to
>> > check against), I neglected to add a CAP_SYS_ADMIN check.
>> 
>> Global, I presume?
>> 
>> I have to admit that I don’t think this patch set makes sense at all.
>> 
>> As I understand it, there are two kinds of physical memory resource in CXL: those that live on a device and those that live in host memory.
>> 
>> Device memory doesn’t migrate as such: if a page is on an accelerator, it’s on that accelerator. (If someone makes an accelerator with *two* PCIe targets and connects each target to a different node, that’s a different story.)
>> 
>> Host memory is host memory. CXL may access it, and the CXL access from a given device may be faster if that device is connected closer to the memory. And the device may or may not know the virtual address and PASID of the memory.
>> 
>
> The CXL memory description here is a bit inaccurate.  Memory on the CXL
> bus is not limited to host and accelerator, CXL memory devices may also
> present memory for use by the system as-if it were just DRAM as well.
> The accessing mechanisms are the same (i.e. you can 'mov rax, [rbx]'
> and the result is a cacheline fetch that goes over the cxl bus rather
> than the DRAM memory controllers).
>
> Small CXL background for the sake of clarity:
>
> type-2 devices are "accelerators", and the memory relationships you
> describe here are roughly accurate.  The intent of this interface is not
> really for the purpose of managing type-2/accelerator device memory.
>
> type-3 devices are "memory devices", whose intent is to provide the
> system additional memory resources that get mapped into one or more numa
> nodes.  The intent of these devices is to present memory to the kernel
> *as-if* it were regular old DRAM just with different latency and
> bandwidth attributes. This is a simplification of the overall goal.
>
>
> So from the perspective of the kernel and a memory-tiering system, we
> have numa nodes which abstract physical memory, and that physical memory
> may actually live anywhere (DRAM, CXL, where-ever).  This memory is
> fungible with the exception that CXL memory should be placed in
> ZONE_MOVABLE to ensure the hot-pluggability of those memory devices.
>
>
> The intent of this interface is to make page-migration decisions without
> the need to track individual processes or virtual address mappings.
>
> One example would be to utilize the idle page tracking mechanism from
> userland to make migration decisions.
>
> https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
>
> This mechanism allows a user to determine which PFNs are idle.  Combine
> this information with a move_phys_pages syscall and you can implement
> demotion/promotion in userland without having to identify the virtual
> address mappings of those PFNs.
>
>
>> I fully believe that there’s some use for migrating host memory to a node that's closer to a device.  But I don't think this API is the right way.  First, something needs to figure out that the host memory should be migrated.  Doing this presumably involves identifying which (logical!) memory is being accessed and deciding to move it.  Maybe new APIs are needed to enable this.
>> 
>
> The intent is not to migrate memory to make it "closer to a device",
> assuming you mean the intent is to make that data closer to a device
> that is using it (i.e. an accelerator).
>
> The intent is to allow migration of memory based on a user-defined
> policy via the usage of physical addresses.
>
> Let's consider a bandwidth-expansion focused tiering policy.  Each
> additional CXL Type-3 Memory device provides additional memory
> bandwidth to a processor via its PCIe/CXL lanes.
>
> If all you care about is latency, moving/migrating pages closer to the
> processor is beneficial.  However, if you care about maximizing
> bandwidth, distributing memory across all possible devices with some
> statistical distribution is a better goal.
>
> So sometimes you actually want hot data further away because it allows
> for more concurrent cacheline fetches to occur.
>
>
> The question as to whether getting the logical memory address is
> required, useful, or performant depends on what sources of information
> you can pull physical address information from.
>
> Above I explained idle page tracking, but another example would be the
> CXL device directly, which knows 2 pieces of information (generally):
>
> 1) The extent of the memory it is hosting (some size)
> 2) The physical-to-device address mapping for the system contacting it.
>
> The device talks (internally) in 0-based addressing (0x0 up to 0x...),
> but the host places the host physical address (HPA) on the bus
> (0x123450000).  The device receives and converts 0x123450000 (HPA) into
> a 0-based address (device-physical-address, DPA).
>
> How does this relate to this interface?
>
> Consider a device which provides a "heat-map" for the memory it is
> hosting.  If a user or system requests this heat-map, the device can
> only provide that information in terms of either HPA or DPA.  If DPA,
> then the host can recover the HPA by simply looking at the mapping it
> programmed the device with.  This reverse translation (DPA-to-HPA) is
> relatively inexpensive.
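>
> As a sketch of why (assuming a single, non-interleaved decoder
> programmed with an (hpa_base, size) window; interleave adds arithmetic
> but still no lookups):
>
> /* Hypothetical helper: recover the HPA for a device-reported DPA,
>  * given the window the host programmed into the decoder. */
> static inline uint64_t dpa_to_hpa(uint64_t dpa, uint64_t hpa_base,
> 				  uint64_t size)
> {
> 	return dpa < size ? hpa_base + dpa : ~0ULL; /* ~0: out of range */
> }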
>
> The idle-page tracking interface is actually a good example of this. It
> is functionally a heat-map for the entire system.
>
> However, things get extraordinarily expensive after this.  HPA to host
> virtual address translation (HPA to HVA) requires inspecting every task
> that may map that HPA in its page tables.  When the cacheline fetch hits
> the bus, you are well below the construct of a "task", and the device
> has no way of telling you what task is using memory on that device.
>
> This makes any kind of tiering operation based on this type of heat-map
> information somewhat of a non-starter.  You lose so much performance
> just converting that information into task-specific information that
> you may as well not bother doing it.
>
> Instead, this interface would allow for a tiering policy to operate on
> such heat-map information directly, and since all CXL memory is intended
> to be placed in ZONE_MOVABLE, that memory should always be migratable.
>
>> But this API is IMO rather silly.  Just as a trivial observation, if you migrate a page you identify by physical address, *that physical address changes*.  So the only way it possibly works is that whatever heuristic is using the API knows to invalidate itself after calling the API, but of course it also needs to invalidate itself if the kernel becomes intelligent enough to migrate the page on its own or the owner of the logical page triggers migration, etc.
>> 
>> Put differently, the operation "migrate physical page 0xABCD000 to node 3" makes no sense.  That physical address belongs to whatever node it's on, and without some magic hardware support that does not currently exist, it's not going anywhere at runtime.
>> 
>> I just don't see this code working well, never mind the security issues.
>
> I think this is more of a terminology issue.  I'm not married to the
> name, but to me move_phys_pages is intuitively easier to understand
> because move_pages exists and the only difference between the two
> interfaces is virtual vs physical addressing.
>
> move_pages doesn't "migrate a virtual page" either, it "migrates the
> data pointed to by this virtual address to another physical page located 
> on the target numa node".
>
> Likewise this interface "migrates the data located at the physical address,
> assuming the physical address is mapped, to another page on the target numa
> node".
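>
> Side by side, the calls are deliberately parallel (sketch only; the
> syscall number is simply the one proposed in this patch, and move_pages
> here is the libnuma wrapper):
>
> 	/* name the data by (pid, virtual address) */
> 	move_pages(pid, 1, &vaddr, &node, &status, MPOL_MF_MOVE);
>
> 	/* name the same data by host physical address */
> 	syscall(__NR_move_phys_pages, 1, &paddr, &node, &status,
> 		MPOL_MF_MOVE);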

I'm not complaining about the name.  I'm objecting to the semantics.

Apparently you have a system to collect usage statistics of physical addresses, but you have no idea what those pages map to (without crawling /proc or /sys, anyway).  But that means you have no idea when the logical contents of those pages *change*.  So you fundamentally have a nasty race: anything else that swaps or migrates those pages will mess up your statistics, and you'll start trying to migrate the wrong thing.
Gregory Price Sept. 19, 2023, 6:20 p.m. UTC | #15
On Tue, Sep 19, 2023 at 10:59:33AM -0700, Andy Lutomirski wrote:
> 
> I'm not complaining about the name.  I'm objecting to the semantics.
> 
> Apparently you have a system to collect usage statistics of physical addresses, but you have no idea what those pages map to (without crawling /proc or /sys, anyway).  But that means you have no idea when the logical contents of those pages *change*.  So you fundamentally have a nasty race: anything else that swaps or migrates those pages will mess up your statistics, and you'll start trying to migrate the wrong thing.

How does this change if I use virtual address based migration?

I could do sampling based on virtual address (page faults, IBS/PEBS,
whatever), and by the time I make a decision, the kernel could have
migrated the data or even my task from Node A to Node B.  The sample I
took is now stale, and I could make a poor migration decision.

If I do move_pages(pid, some_virt_addr, some_node) and it migrates the
page from NodeA to NodeB, then the device-side collection is likewise
no longer valid.  This problem doesn't change because I used virtual
address compared to physical address.

But if I have a 512GB memory device, and I can see a wide swath of that
512GB is hot, while a good chunk of my local DRAM is not - then I
probably don't care *what* gets migrated up to DRAM, I just care that a
vast majority of that hot data does.

The goal here isn't 100% precision; you will never get there. The goal
here is broad-scope performance enhancements of the overall system
while minimizing the cost to compute the migration actions to be taken.

I don't think the contents of the page are always relevant.  The entire
concept here is to enable migration without caring about what programs
are using the memory for - just so long as memcg and zone restrictions
are respected.

~Gregory
Andy Lutomirski Sept. 19, 2023, 6:52 p.m. UTC | #16
On Tue, Sep 19, 2023, at 11:20 AM, Gregory Price wrote:
> On Tue, Sep 19, 2023 at 10:59:33AM -0700, Andy Lutomirski wrote:
>> 
>> I'm not complaining about the name.  I'm objecting to the semantics.
>> 
>> Apparently you have a system to collect usage statistics of physical addresses, but you have no idea what those pages map to (without crawling /proc or /sys, anyway).  But that means you have no idea when the logical contents of those pages *change*.  So you fundamentally have a nasty race: anything else that swaps or migrates those pages will mess up your statistics, and you'll start trying to migrate the wrong thing.
>
> How does this change if I use virtual address based migration?
>
> I could do sampling based on virtual address (page faults, IBS/PEBS,
> whatever), and by the time I make a decision, the kernel could have
> migrated the data or even my task from Node A to Node B.  The sample I
> took is now stale, and I could make a poor migration decision.

The window is a lot narrower. If you’re sampling by VA, you collect stats and associate them with the logical page (the tuple (mapping, VA), for example).  The kernel can do this without races from page fault handlers.  If you sample based on PA, you fundamentally race against anything that does migration.

>
> If I do move_pages(pid, some_virt_addr, some_node) and it migrates the
> page from NodeA to NodeB, then the device-side collection is likewise
> no longer valid.  This problem doesn't change because I used virtual
> address compared to physical address.

Sure it does, as long as you collect those samples when you migrate. And I think the kernel migrating to or from device memory (or more generally allocating and freeing device memory and possibly even regular memory) *should* be aware of whatever hotness statistics are in use.

>
> But if I have a 512GB memory device, and I can see a wide swath of that
> 512GB is hot, while a good chunk of my local DRAM is not - then I
> probably don't care *what* gets migrated up to DRAM, I just care that a
> vast majority of that hot data does.
>
> The goal here isn't 100% precision; you will never get there. The goal
> here is broad-scope performance enhancements of the overall system
> while minimizing the cost to compute the migration actions to be taken.
>
> I don't think the contents of the page are always relevant.  The entire
> concept here is to enable migration without caring about what programs
> are using the memory for - just so long as memcg and zone restrictions
> are respected.
>

At the very least I think you need to be aware of page *size*.  And if you want to avoid excessive fragmentation, you probably also want to be aware of the boundaries of a logical allocation.

I think that doing this entire process by PA, blind, from userspace will end up stuck in a not-so-good solution, and the ABI will be set in stone, and it will not be a great situation for long term maintainability or performance.
Gregory Price Sept. 19, 2023, 7:59 p.m. UTC | #17
On Tue, Sep 19, 2023 at 11:52:27AM -0700, Andy Lutomirski wrote:
> 
> 
> On Tue, Sep 19, 2023, at 11:20 AM, Gregory Price wrote:
> > On Tue, Sep 19, 2023 at 10:59:33AM -0700, Andy Lutomirski wrote:
> >> 
> >> I'm not complaining about the name.  I'm objecting to the semantics.
> >> 
> >> Apparently you have a system to collect usage statistics of physical addresses, but you have no idea what those pages map to (without crawling /proc or /sys, anyway).  But that means you have no idea when the logical contents of those pages *change*.  So you fundamentally have a nasty race: anything else that swaps or migrates those pages will mess up your statistics, and you'll start trying to migrate the wrong thing.
> >
> > How does this change if I use virtual address based migration?
> >
> > I could do sampling based on virtual address (page faults, IBS/PEBS,
> > whatever), and by the time I make a decision, the kernel could have
> > migrated the data or even my task from Node A to Node B.  The sample I
> > took is now stale, and I could make a poor migration decision.
> 
> The window is a lot narrower. If you’re sampling by VA, you collect stats and associate them with the logical page (the tuple (mapping, VA), for example).  The kernel can do this without races from page fault handlers.  If you sample based on PA, you fundamentally race against anything that does migration.
> 

Race conditions are race conditions, narrow or otherwise.  The narrowness
of the condition is either irrelevant, or you can accept some level of race
because the goal allows for slop within a certain window.

In fact, the entire purpose of this interface is to very explicitly
reduce the time it takes to go from sampling to migration decision.

Certainly a user could simply use a heatmap with cgroups.numa_stat
to determine which processes they should interrogate - but for systems
with many tenants/processes/tasks, the system is sometimes better
served by just acting on the overall view of memory when *major*
changes have to be made to the overall distribution.


Similarly, data collection can be made more performant.

Faults introduce runtime overhead.  Pushing heatmap collection to
devices alleviates the need for introducing that overhead.  Use of
IBS/PEBS is (by nature of sampling and a variety of quirks having to do
with the prefetcher) quite inaccurate, and costly to process after
being collected.


Ultimately, if your goal is extremely high precision for the performance
of a single process, I agree with your analysis.  However, I never claimed
the goal is precision at the level of a single process.  On the contrary,
the goal is system-wide tiering based on cheap-to-acquire information.

> >
> > If I do move_pages(pid, some_virt_addr, some_node) and it migrates the
> > page from NodeA to NodeB, then the device-side collection is likewise
> > no longer valid.  This problem doesn't change because I used virtual
> > address compared to physical address.
> 
> Sure it does, as long as you collect those samples when you migrate. And I think the kernel migrating to or from device memory (or more generally allocating and freeing device memory and possibly even regular memory) *should* be aware of whatever hotness statistics are in use.
>

I'm not keen on "the kernel should..." implying "userland should not".

I don't disagree that the kernel could/should/would use this same
information.  My proposal here is that there is value in enabling
userland to do the same thing (so long as security is not compromised).

> >
> > But if I have a 512GB memory device, and I can see a wide swath of that
> > 512GB is hot, while a good chunk of my local DRAM is not - then I
> > probably don't care *what* gets migrated up to DRAM, I just care that a
> > vast majority of that hot data does.
> >
> > The goal here isn't 100% precision; you will never get there. The goal
> > here is broad-scope performance enhancements of the overall system
> > while minimizing the cost to compute the migration actions to be taken.
> >
> > I don't think the contents of the page are always relevant.  The entire
> > concept here is to enable migration without caring about what programs
> > are using the memory for - just so long as memcg and zone restrictions
> > are respected.
> >
> 
> At the very least I think you need to be aware of page *size*.  And if you want to avoid excessive fragmentation, you probably also want to be aware of the boundaries of a logical allocation.
>

Re page size: this was brought up in a prior email, and I agree,
but that also doesn't really change for move_pages.  The man page for
move_pages already says this:

"""
pages is an array of pointers to the pages that should be moved.
These are pointers that should be aligned to page boundaries.
"""

But to an extent I agree with you.

A device collecting data as I describe won't necessarily know the
page size (or there may even be multiple page sizes hosted on the
device).  Operating on that data blind does have some risks.

Some work could probably be done to ensure the data being used doesn't,
for example, cause a hugepage to be migrated multiple times.  In that
regard, if the address doesn't match the base of the page, the migration
should fail.

I think there are simple ways to avoid these footguns.
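
For example (a hypothetical guard, not something in the patch below),
the kernel side could require that a physical address name the base of
its folio:

/*
 * Hypothetical guard: reject sub-page offsets and PFNs that point into
 * the middle of a large folio, so a heatmap entry can never migrate
 * the same hugepage more than once.
 */
static int phys_addr_is_folio_base(const void __user *p, struct page *page)
{
	struct folio *folio = page_folio(page);

	if ((unsigned long)p & ~PAGE_MASK)
		return -EINVAL;		/* not page aligned */
	if (page != folio_page(folio, 0))
		return -EINVAL;		/* tail page of a large folio */
	return 0;
}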

> 
> I think that doing this entire process by PA, blind, from userspace will end up stuck in a not-so-good solution, and the ABI will be set in stone, and it will not be a great situation for long term maintainability or performance.

"entire process" and "blind" are a little bit of a straw man here. The
goal is to marry many pieces of data to make decisions about how and
what to move, with as many (flexible) tools available to enact changes
quickly to reduce the staleness of information issue.

Sometimes it would be "more blind" than others, yes.  In other cases,
such a device-provided heatmap would simply be informational for overall
system health.

To me, unless I'm misunderstanding, it sounds like you think system-wide
tiering decisions should not be a function userland is empowered to do
directly, but instead should be implemented entirely in the kernel?

~Gregory
diff mbox series

Patch

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 2d0b1bd866ea..25db6d71af0c 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -457,3 +457,4 @@ 
 450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
 451	i386	cachestat		sys_cachestat
 452	i386	fchmodat2		sys_fchmodat2
+454	i386	move_phys_pages		sys_move_phys_pages
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 1d6eee30eceb..9676f2e7698c 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -375,6 +375,7 @@ 
 451	common	cachestat		sys_cachestat
 452	common	fchmodat2		sys_fchmodat2
 453	64	map_shadow_stack	sys_map_shadow_stack
+454	common	move_phys_pages		sys_move_phys_pages
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 22bc6bc147f8..6860675a942f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -821,6 +821,11 @@  asmlinkage long sys_move_pages(pid_t pid, unsigned long nr_pages,
 				const int __user *nodes,
 				int __user *status,
 				int flags);
+asmlinkage long sys_move_phys_pages(unsigned long nr_pages,
+				    const void __user * __user *pages,
+				    const int __user *nodes,
+				    int __user *status,
+				    int flags);
 asmlinkage long sys_rt_tgsigqueueinfo(pid_t tgid, pid_t  pid, int sig,
 		siginfo_t __user *uinfo);
 asmlinkage long sys_perf_event_open(
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index abe087c53b4b..8838fcfaf261 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -823,8 +823,14 @@  __SYSCALL(__NR_cachestat, sys_cachestat)
 #define __NR_fchmodat2 452
 __SYSCALL(__NR_fchmodat2, sys_fchmodat2)
 
+/* CONFIG_MMU only */
+#ifndef __ARCH_NOMMU
+#define __NR_move_phys_pages 454
+__SYSCALL(__NR_move_phys_pages, sys_move_phys_pages)
+#endif
+
 #undef __NR_syscalls
-#define __NR_syscalls 453
+#define __NR_syscalls 455
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index e137c1385c56..07441b10f92a 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -192,6 +192,7 @@  COND_SYSCALL(migrate_pages);
 COND_SYSCALL(move_pages);
 COND_SYSCALL(set_mempolicy_home_node);
 COND_SYSCALL(cachestat);
+COND_SYSCALL(move_phys_pages);
 
 COND_SYSCALL(perf_event_open);
 COND_SYSCALL(accept4);
diff --git a/mm/migrate.c b/mm/migrate.c
index 3506b8202937..8a6f1eb6e512 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2161,6 +2161,101 @@  static int move_pages_and_store_status(int node,
 	return store_status(status, start, node, i - start);
 }
 
+struct rmap_page_ctxt {
+	bool found;
+	bool migratable;
+	bool node_allowed;
+	int node;
+};
+
+/*
+ * Walks each vma mapping a given page and determines if those
+ * vma's are both migratable, and that the target node is within
+ * the allowed cpuset of the owning task.
+ */
+static bool phys_page_migratable(struct folio *folio,
+				 struct vm_area_struct *vma,
+				 unsigned long address,
+				 void *arg)
+{
+	struct rmap_page_ctxt *ctxt = (struct rmap_page_ctxt *)arg;
+	struct task_struct *owner = vma->vm_mm->owner;
+	nodemask_t task_nodes = cpuset_mems_allowed(owner);
+
+	ctxt->found |= true;
+	ctxt->migratable &= vma_migratable(vma);
+	ctxt->node_allowed &= node_isset(ctxt->node, task_nodes);
+
+	return ctxt->migratable && ctxt->node_allowed;
+}
+
+static struct folio *phys_migrate_get_folio(struct page *page)
+{
+	struct folio *folio;
+
+	folio = page_folio(page);
+	if (!folio_test_lru(folio) || !folio_try_get(folio))
+		return NULL;
+	if (unlikely(page_folio(page) != folio || !folio_test_lru(folio))) {
+		folio_put(folio);
+		folio = NULL;
+	}
+	return folio;
+}
+
+/*
+ * Validates the physical address is online and migratable.  Walks the folio
+ * containing the page to validate the vma is migratable and the cpuset node
+ * restrictions.  Then calls add_page_for_migration to isolate it from the
+ * LRU and place it into the given pagelist.
+ * Returns:
+ *     errno - if the page is not online, migratable, or can't be isolated
+ *     0 - when it doesn't have to be migrated because it is already on the
+ *         target node
+ *     1 - when it has been queued
+ */
+static int add_phys_page_for_migration(const void __user *p, int node,
+				       struct list_head *pagelist,
+				       bool migrate_all)
+{
+	unsigned long pfn;
+	struct page *page;
+	struct folio *folio;
+	int err;
+	struct rmap_page_ctxt rmctxt = {
+		.found = false,
+		.migratable = true,
+		.node_allowed = true,
+		.node = node
+	};
+	struct rmap_walk_control rwc = {
+		.rmap_one = phys_page_migratable,
+		.arg = &rmctxt
+	};
+
+	pfn = ((unsigned long)p) >> PAGE_SHIFT;
+	page = pfn_to_online_page(pfn);
+	if (!page || PageTail(page))
+		return -ENOENT;
+
+	folio = phys_migrate_get_folio(page);
+	if (folio) {
+		rmap_walk(folio, &rwc);
+		folio_put(folio);
+	}
+
+	if (!rmctxt.found)
+		err = -ENOENT;
+	else if (!rmctxt.migratable)
+		err = -EFAULT;
+	else if (!rmctxt.node_allowed)
+		err = -EACCES;
+	else
+		err = add_page_for_migration(page, node, pagelist, migrate_all);
+
+	return err;
+}
+
 /*
  * Migrate an array of page address onto an array of nodes and fill
  * the corresponding array of status.
@@ -2214,8 +2309,13 @@  static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 		 * Errors in the page lookup or isolation are not fatal and we simply
 		 * report them via status
 		 */
-		err = add_virt_page_for_migration(mm, p, current_node, &pagelist,
-					     flags & MPOL_MF_MOVE_ALL);
+		if (mm)
+			err = add_virt_page_for_migration(mm, p, current_node, &pagelist,
+						     flags & MPOL_MF_MOVE_ALL);
+		else
+			err = add_phys_page_for_migration(p, current_node, &pagelist,
+					flags & MPOL_MF_MOVE_ALL);
+
 
 		if (err > 0) {
 			/* The page is successfully queued for migration */
@@ -2303,6 +2403,36 @@  static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 	mmap_read_unlock(mm);
 }
 
+/*
+ * Determine the nodes of an array of pages and store it in an array of status.
+ */
+static void do_phys_pages_stat_array(unsigned long nr_pages,
+				     const void __user **pages, int *status)
+{
+	unsigned long i;
+
+	for (i = 0; i < nr_pages; i++) {
+		unsigned long pfn = (unsigned long)(*pages) >> PAGE_SHIFT;
+		struct page *page = pfn_to_online_page(pfn);
+		int err = -ENOENT;
+
+		if (!page)
+			goto set_status;
+
+		get_page(page);
+
+		if (!is_zone_device_page(page))
+			err = page_to_nid(page);
+
+		put_page(page);
+set_status:
+		*status = err;
+
+		pages++;
+		status++;
+	}
+}
+
 static int get_compat_pages_array(const void __user *chunk_pages[],
 				  const void __user * __user *pages,
 				  unsigned long chunk_nr)
@@ -2345,7 +2475,10 @@  static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages,
 				break;
 		}
 
-		do_pages_stat_array(mm, chunk_nr, chunk_pages, chunk_status);
+		if (mm)
+			do_pages_stat_array(mm, chunk_nr, chunk_pages, chunk_status);
+		else
+			do_phys_pages_stat_array(chunk_nr, chunk_pages, chunk_status);
 
 		if (copy_to_user(status, chunk_status, chunk_nr * sizeof(*status)))
 			break;
@@ -2446,6 +2579,45 @@  SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
 	return kernel_move_pages(pid, nr_pages, pages, nodes, status, flags);
 }
 
+/*
+ * Move a list of physically addressed pages to the given nodes and
+ * report status, without reference to any owning task.
+ */
+static int kernel_move_phys_pages(unsigned long nr_pages,
+				  const void __user * __user *pages,
+				  const int __user *nodes,
+				  int __user *status, int flags)
+{
+	int err;
+	nodemask_t target_nodes;
+
+	/* Check flags */
+	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
+		return -EINVAL;
+
+	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
+		return -EPERM;
+
+	/* All tasks mapping each page is checked in phys_page_migratable */
+	nodes_setall(target_nodes);
+	if (nodes)
+		err = do_pages_move(NULL, target_nodes, nr_pages, pages,
+			nodes, status, flags);
+	else
+		err = do_pages_stat(NULL, nr_pages, pages, status);
+
+	return err;
+}
+
+SYSCALL_DEFINE5(move_phys_pages, unsigned long, nr_pages,
+		const void __user * __user *, pages,
+		const int __user *, nodes,
+		int __user *, status, int, flags)
+{
+	return kernel_move_phys_pages(nr_pages, pages, nodes, status, flags);
+}
+
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * Returns true if this is a safe migration target node for misplaced NUMA
diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index fd6c1cb585db..b140ad444946 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -820,8 +820,14 @@  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 #define __NR_cachestat 451
 __SYSCALL(__NR_cachestat, sys_cachestat)
 
+/* CONFIG_MMU only */
+#ifndef __ARCH_NOMMU
+#define __NR_move_phys_pages 454
+__SYSCALL(__NR_move_phys_pages, sys_move_phys_pages)
+#endif
+
 #undef __NR_syscalls
-#define __NR_syscalls 452
+#define __NR_syscalls 455
 
 /*
  * 32 bit systems traditionally used different