[v2,0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping

Message ID 155836064844.2441.10911127801797083064.stgit@localhost.localdomain

Message

Kirill Tkhai May 20, 2019, 2 p.m. UTC
v2: Add PVMMAP_FIXED_NOREPLACE flag.
    Use find_vma_filter_flags() and may_mmap_overlapped_region() helpers,
    so even more code is reused.
    Syscall number is changed.
    Fix whitespace.

    Prohibited cloning from the local to a remote process. Only mapping
    into the local process's mm is allowed, since I initially missed
    that get_unmapped_area() can't be used for a remote process. This
    may be solved very simply by passing an @mm argument to all
    .get_unmapped_area handlers. I don't do that in this patchset,
    since it would add a lot of cleanup patches, which would hide the
    main logic away. I'm going to send them later, as another series,
    after we finish with this one.

[Summary]

A new syscall, which allows cloning of a remote process's VMA
into the local process's VM. The remote process's page table
entries related to the VMA are cloned into the local process's
page table (at any desired address, which makes this different
from what happens during fork()). Huge pages are handled
appropriately.

This allows performance to be improved significantly, as
shown in the example below.

[Description] 

This patchset adds a new syscall, which makes it possible
to clone a VMA from another process into the current process.
The syscall supplements the functionality provided by the
process_vm_writev() and process_vm_readv() syscalls, and it
may be useful in many situations.
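
For reference, the examples below assume a userspace wrapper along
these lines (the names and exact signature are my shorthand, inferred
from the examples; the authoritative definition is in patch [7/7]):

	/* Hypothetical wrapper, for illustration only: clone @size bytes
	 * of @pid's mapping at @remote_addr into the current process at
	 * @local_addr (or at any free address, if NULL). @flags takes
	 * PVMMAP_* flags, e.g. PVMMAP_FIXED_NOREPLACE to map at a fixed
	 * address without replacing an existing mapping. */
	void *process_vm_mmap(pid_t pid, void *remote_addr, size_t size,
			      void *local_addr, unsigned long flags);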

For example, it allows a zero copy of data in cases where
process_vm_writev() was previously used:

	struct iovec local_iov, remote_iov;
	void *buf;

	buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
		   MAP_PRIVATE|MAP_ANONYMOUS, ...);
	recv(sock, buf, n * PAGE_SIZE, 0);

	local_iov.iov_base = buf;
	local_iov.iov_len = n * PAGE_SIZE;
	remote_iov = ...;

	process_vm_writev(pid, &local_iov, 1, &remote_iov, 1, 0);
	munmap(buf, n * PAGE_SIZE);

	(Note that the above completely ignores error handling.)
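
For completeness, the same sequence with minimal error handling might
look like this (a sketch; assumes err() from <err.h>):

	buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
		   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		err(1, "mmap");
	if (recv(sock, buf, n * PAGE_SIZE, MSG_WAITALL) !=
	    (ssize_t)(n * PAGE_SIZE))
		err(1, "recv");

	local_iov.iov_base = buf;
	local_iov.iov_len = n * PAGE_SIZE;
	remote_iov = ...;

	if (process_vm_writev(pid, &local_iov, 1, &remote_iov, 1, 0) !=
	    (ssize_t)(n * PAGE_SIZE))
		err(1, "process_vm_writev");
	munmap(buf, n * PAGE_SIZE);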

There are several problems with process_vm_writev() in this example:

1) it causes a page fault on the remote process's memory, and it
   forces allocation of a new page (if one was not preallocated);

2) the amount of memory for this example is momentarily doubled --
   n pages in the current task and n pages in the remote task are
   occupied at the same time;

3) the received data has no chance to be properly swapped out for
   a long time.

The third is the most critical in case the remote process touches
the data pages some time after process_vm_writev() was made.
Imagine the node is under memory pressure:

a) the kernel moves @buf pages into swap right after recv();
b) process_vm_writev() reads the data back from swap into pages;
c) process_vm_writev() allocates duplicate pages in the remote
   process and populates them;
d) munmap() unmaps @buf;
e) 5 minutes later the remote task touches the data.

In stages "a" and "b" the kernel submits unneeded IO and makes
system IO throughput worse. To do "b" and "c", the kernel
reclaims memory and moves pages of some other processes to swap,
so those processes have to read their pages back from swap. Also,
unneeded copying of pages occurs, while zero-copy would be
preferable.

We observe a similar problem during the online migration of large
enough containers: after a doubling of the container's size, the
migration time increases 100 times. The system comes under high IO
load and throws out useful caches.

The proposed syscall aims to introduce an interface which
supplements the currently existing process_vm_writev() and
process_vm_readv(), and allows the problem of anonymous memory
transfer to be solved. The above example may be rewritten as:

[Task 1]
	void *buf;

	buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
		   MAP_PRIVATE|MAP_ANONYMOUS, ...);
	recv(sock, buf, n * PAGE_SIZE, 0);

[Task 2]
	buf2 = process_vm_mmap(pid_of_task1, buf, n * PAGE_SIZE, NULL, 0);

This creates a copy of the VMA related to buf in task1 inside task2's
VM. Task1's page table entries are copied into the corresponding page
table entries of task2's VM.

It is swap-friendly: in case the memory was swapped out right after
recv(), the syscall just copies the page table entries, like we do on
fork(), so no real access to the pages occurs, and no IO is needed.
No excess pages are reclaimed, and the number of pages is not doubled.
Also, zero-copy takes place, which further reduces the overhead.

The patchset does not introduce much new code, since we simply
reuse the existing copy_page_range() and copy_vma() functions.
We extend copy_vma() to be able to merge VMAs in the remote task
[2/7], and teach copy_page_range() to work with different local
and remote addresses [3/7]. Patch [7/7] introduces the syscall
logic, which mostly consists of sanity checks. The rest of the
patches are preparations.

This syscall may be used by page servers like in the example
above, for migration (I assume even virtual machines may want
something like this), by users of process_vm_writev() and
process_vm_readv() desiring zero-copy, for debugging purposes,
etc. It requires the same permissions as the existing
process_vm_xxx() syscalls.

The tests I used may be obtained here (UPDATED):

[1]https://gist.github.com/tkhai/ce46502fc53580372da35e8c3b7818b9
[2]https://gist.github.com/tkhai/40bda78e304d2fe0d90863214b9ac5b5

Previous version (RFC):
[3]https://lore.kernel.org/lkml/CAG48ez0itiEE1x=SXeMbjKvMGkrj7wxjM6c+ZB00LpXAAhqmiw@mail.gmail.com/T/

---

Kirill Tkhai (7):
      mm: Add process_vm_mmap() syscall declaration
      mm: Extend copy_vma()
      mm: Extend copy_page_range()
      mm: Export round_hint_to_min()
      mm: Introduce may_mmap_overlapped_region() helper
      mm: Introduce find_vma_filter_flags() helper
      mm: Add process_vm_mmap()


 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    2 
 include/linux/huge_mm.h                |    6 +
 include/linux/mm.h                     |   14 ++
 include/linux/mm_types.h               |    2 
 include/linux/mman.h                   |   14 ++
 include/linux/syscalls.h               |    5 +
 include/uapi/asm-generic/mman-common.h |    6 +
 include/uapi/asm-generic/unistd.h      |    5 +
 init/Kconfig                           |    9 +-
 kernel/fork.c                          |    5 +
 kernel/sys_ni.c                        |    2 
 mm/huge_memory.c                       |   30 ++++-
 mm/memory.c                            |  165 +++++++++++++++++++---------
 mm/mmap.c                              |  186 ++++++++++++++++++++++++++------
 mm/mremap.c                            |   43 +++++--
 mm/process_vm_access.c                 |   69 ++++++++++++
 17 files changed, 439 insertions(+), 125 deletions(-)

--
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>

Comments

Andy Lutomirski May 21, 2019, 2:43 p.m. UTC | #1
On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>

> [Summary]
>
> New syscall, which allows to clone a remote process VMA
> into local process VM. The remote process's page table
> entries related to the VMA are cloned into local process's
> page table (in any desired address, which makes this different
> from that happens during fork()). Huge pages are handled
> appropriately.
>
> This allows to improve performance in significant way like
> it's shows in the example below.
>
> [Description]
>
> This patchset adds a new syscall, which makes possible
> to clone a VMA from a process to current process.
> The syscall supplements the functionality provided
> by process_vm_writev() and process_vm_readv() syscalls,
> and it may be useful in many situation.
>
> For example, it allows to make a zero copy of data,
> when process_vm_writev() was previously used:
>
>         struct iovec local_iov, remote_iov;
>         void *buf;
>
>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>         recv(sock, buf, n * PAGE_SIZE, 0);
>
>         local_iov->iov_base = buf;
>         local_iov->iov_len = n * PAGE_SIZE;
>         remove_iov = ...;
>
>         process_vm_writev(pid, &local_iov, 1, &remote_iov, 1 0);
>         munmap(buf, n * PAGE_SIZE);
>
>         (Note, that above completely ignores error handling)
>
> There are several problems with process_vm_writev() in this example:
>
> 1)it causes pagefault on remote process memory, and it forces
>   allocation of a new page (if was not preallocated);

I don't see how your new syscall helps.  You're writing to remote
memory.  If that memory wasn't allocated, it's going to get allocated
regardless of whether you use a write-like interface or an mmap-like
interface.  Keep in mind that, on x86, just the hardware part of a
page fault is very slow -- populating the memory with a syscall
instead of a fault may well be faster.

>
> 2)amount of memory for this example is doubled in a moment --
>   n pages in current and n pages in remote tasks are occupied
>   at the same time;

This seems disingenuous.  If you're writing p pages total in chunks of
n pages, you will use a total of p pages if you use mmap and p+n if
you use write.  That only doubles the amount of memory if you let n
scale linearly with p, which seems unlikely.

>
> 3)received data has no a chance to be properly swapped for
>   a long time.

...

> a)kernel moves @buf pages into swap right after recv();
> b)process_vm_writev() reads the data back from swap to pages;

If you're under that much memory pressure and thrashing that badly,
your performance is going to be awful no matter what you're doing.  If
you indeed observe this behavior under normal loads, then this seems
like a VM issue that should be addressed in its own right.

>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>         recv(sock, buf, n * PAGE_SIZE, 0);
>
> [Task 2]
>         buf2 = process_vm_mmap(pid_of_task1, buf, n * PAGE_SIZE, NULL, 0);
>
> This creates a copy of VMA related to buf from task1 in task2's VM.
> Task1's page table entries are copied into corresponding page table
> entries of VM of task2.

You need to fully explain a whole bunch of details that you've
ignored.  For example, if the remote VMA is MAP_ANONYMOUS, do you get
a CoW copy of it?  I assume you don't since the whole point is to
write to remote memory, but it's at the very least quite unusual in
Linux to have two different anonymous VMAs such that writing one of
them changes the other one.  But there are plenty of other questions.
What happens if the remote VMA is a gate area or other special mapping
(vDSO, vvar area, etc)?  What if the remote memory comes from a driver
that wasn't expecting the mapping to get magically copied to a
different process?

This new API seems quite dangerous and complex to me, and I don't
think the value has been adequately demonstrated.
Kirill Tkhai May 21, 2019, 3:52 p.m. UTC | #2
On 21.05.2019 17:43, Andy Lutomirski wrote:
> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
> 
>> [Summary]
>>
>> New syscall, which allows to clone a remote process VMA
>> into local process VM. The remote process's page table
>> entries related to the VMA are cloned into local process's
>> page table (in any desired address, which makes this different
>> from that happens during fork()). Huge pages are handled
>> appropriately.
>>
>> This allows to improve performance in significant way like
>> it's shows in the example below.
>>
>> [Description]
>>
>> This patchset adds a new syscall, which makes possible
>> to clone a VMA from a process to current process.
>> The syscall supplements the functionality provided
>> by process_vm_writev() and process_vm_readv() syscalls,
>> and it may be useful in many situation.
>>
>> For example, it allows to make a zero copy of data,
>> when process_vm_writev() was previously used:
>>
>>         struct iovec local_iov, remote_iov;
>>         void *buf;
>>
>>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>         recv(sock, buf, n * PAGE_SIZE, 0);
>>
>>         local_iov->iov_base = buf;
>>         local_iov->iov_len = n * PAGE_SIZE;
>>         remove_iov = ...;
>>
>>         process_vm_writev(pid, &local_iov, 1, &remote_iov, 1 0);
>>         munmap(buf, n * PAGE_SIZE);
>>
>>         (Note, that above completely ignores error handling)
>>
>> There are several problems with process_vm_writev() in this example:
>>
>> 1)it causes pagefault on remote process memory, and it forces
>>   allocation of a new page (if was not preallocated);
> 
> I don't see how your new syscall helps.  You're writing to remote
> memory.  If that memory wasn't allocated, it's going to get allocated
> regardless of whether you use a write-like interface or an mmap-like
> interface.

No, the talk is not about just another interface for copying memory.
The talk is about borrowing a remote task's VMA and the corresponding
page table content. The syscall allows copying a part of the page table,
with its preallocated pages, from the remote to the local process. See here:

[task1]                                                        [task2]

buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
           MAP_PRIVATE|MAP_ANONYMOUS, ...);

<task1 populates buf>

                                                               buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
munmap(buf, n * PAGE_SIZE);


process_vm_mmap() copies the PTEs related to the memory of buf in task1
to task2, just like we do during the fork syscall.

There is no copying of buf's memory content, unless COW happens. This is
the principal difference from process_vm_writev(), which just allocates
pages in the remote VM.

> Keep in mind that, on x86, just the hardware part of a
> page fault is very slow -- populating the memory with a syscall
> instead of a fault may well be faster.

It is not as slow as disk IO. Just compare what happens in case the
anonymous pages related to task1's buf are swapped out:

1) process_vm_writev() reads them back into memory;

2) process_vm_mmap() just copies the swap PTEs from task1's page table
   to task2's page table.

Also, for faster page faults one may use huge pages for the mappings.
But really, it's funny to think about page faults when there are
the disk IO problems I showed.

>>
>> 2)amount of memory for this example is doubled in a moment --
>>   n pages in current and n pages in remote tasks are occupied
>>   at the same time;
> 
> This seems disingenuous.  If you're writing p pages total in chunks of
> n pages, you will use a total of p pages if you use mmap and p+n if
> you use write.

I didn't understand this sentence because of the many ifs, sorry. Could
you please explain your thought once again?

> That only doubles the amount of memory if you let n
> scale linearly with p, which seems unlikely.
>
>>
>> 3)received data has no a chance to be properly swapped for
>>   a long time.
> 
> ...
> 
>> a)kernel moves @buf pages into swap right after recv();
>> b)process_vm_writev() reads the data back from swap to pages;
> 
> If you're under that much memory pressure and thrashing that badly,
> your performance is going to be awful no matter what you're doing.  If
> you indeed observe this behavior under normal loads, then this seems
> like a VM issue that should be addressed in its own right.

I don't think so. Imagine: a container migrates from one node to another.
The nodes are the same; say, each of them has 4GB of RAM.

Before the migration, the container's tasks used 4GB of RAM and 8GB of swap.
After the page server on the second node received the pages, we want these
pages become swapped as soon as possible, and we don't want to read them from
swap to pass a read consumer.

The page server is task1 in the example. The real consumer is task2.

This is a rather normal load, I think.

>>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>         recv(sock, buf, n * PAGE_SIZE, 0);
>>
>> [Task 2]
>>         buf2 = process_vm_mmap(pid_of_task1, buf, n * PAGE_SIZE, NULL, 0);
>>
>> This creates a copy of VMA related to buf from task1 in task2's VM.
>> Task1's page table entries are copied into corresponding page table
>> entries of VM of task2.
> 
> You need to fully explain a whole bunch of details that you're
> ignored.

Yeah, it's not a problem :) I'm ready to explain and describe everything
that may cause a question. Just ask ;)

> For example, if the remote VMA is MAP_ANONYMOUS, do you get
> a CoW copy of it? I assume you don't since the whole point is to
> write to remote memory

But, no, there *is* COW semantics. We do not copy memory. We copy
page table content. This is just the same as what we have on fork(),
when a child duplicates the parent's VMA and the related page table
subset, and the parent's PTEs lose the _PAGE_RW flag.

All the copy_page_range() code is reused for that. Please see [3/7]
for the details.

I'm going to get extra performance using THP, where the number of
entries to copy is smaller than in the PTE case.

Copying several PMDs from one task's page table to another's is much,
much faster than process_vm_writev() copying pages (not to mention its
reading from swap).

>,but it's at the very least quite unusual in
> Linux to have two different anonymous VMAs such that writing one of
> them changes the other one.

Writing to the new VMA does not affect the old VMA. The old VMA is just
used to get vma->anon_vma and vma->vm_file. The two VMAs remain
independent of each other.

> But there are plenty of other questions.
> What happens if the remote VMA is a gate area or other special mapping
> (vDSO, vvar area, etc)?  What if the remote memory comes from a driver
> that wasn't expecting the mapping to get magically copied to a
> different process?

In case someone wants to duplicate such mappings, we may consider
that, and extend the interface in the future for the VMA types which
are safe for that.

But for now the logic is very overprotective, and all the unusual
mappings like you mentioned (also AIO, etc.) are prohibited. Please
see [7/7] for the details.

> This new API seems quite dangerous and complex to me, and I don't
> think the value has been adequately demonstrated.

I don't think it's dangerous or complex, because I haven't introduced
any VMA concepts principally different from what we have now. We just
borrow vma->anon_vma and vma->vm_file from the remote process into the
local one like we do on fork() (borrowing vma->anon_vma means not
blindly copying, but the ordinary anon_vma_fork()).

Maybe I should have focused the description more on the copying of
PTEs/PMDs instead of VMA duplication; it's unexpected to me that
people think about simple memory copying after reading the example I
gave. But I gave more explanation here, so I hope the situation has
become clearer for the reader. Anyway, if you have any questions,
please ask me.

Thanks,
Kirill
Kirill Tkhai May 21, 2019, 3:59 p.m. UTC | #3
On 21.05.2019 18:52, Kirill Tkhai wrote:
> On 21.05.2019 17:43, Andy Lutomirski wrote:
>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>
>>
>>> [Summary]
>>>
>>> New syscall, which allows to clone a remote process VMA
>>> into local process VM. The remote process's page table
>>> entries related to the VMA are cloned into local process's
>>> page table (in any desired address, which makes this different
>>> from that happens during fork()). Huge pages are handled
>>> appropriately.
>>>
>>> This allows to improve performance in significant way like
>>> it's shows in the example below.
>>>
>>> [Description]
>>>
>>> This patchset adds a new syscall, which makes possible
>>> to clone a VMA from a process to current process.
>>> The syscall supplements the functionality provided
>>> by process_vm_writev() and process_vm_readv() syscalls,
>>> and it may be useful in many situation.
>>>
>>> For example, it allows to make a zero copy of data,
>>> when process_vm_writev() was previously used:
>>>
>>>         struct iovec local_iov, remote_iov;
>>>         void *buf;
>>>
>>>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>>         recv(sock, buf, n * PAGE_SIZE, 0);
>>>
>>>         local_iov->iov_base = buf;
>>>         local_iov->iov_len = n * PAGE_SIZE;
>>>         remove_iov = ...;
>>>
>>>         process_vm_writev(pid, &local_iov, 1, &remote_iov, 1 0);
>>>         munmap(buf, n * PAGE_SIZE);
>>>
>>>         (Note, that above completely ignores error handling)
>>>
>>> There are several problems with process_vm_writev() in this example:
>>>
>>> 1)it causes pagefault on remote process memory, and it forces
>>>   allocation of a new page (if was not preallocated);
>>
>> I don't see how your new syscall helps.  You're writing to remote
>> memory.  If that memory wasn't allocated, it's going to get allocated
>> regardless of whether you use a write-like interface or an mmap-like
>> interface.
> 
> No, the talk is not about just another interface for copying memory.
> The talk is about borrowing of remote task's VMA and corresponding
> page table's content. Syscall allows to copy part of page table
> with preallocated pages from remote to local process. See here:
> 
> [task1]                                                        [task2]
> 
> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
> 
> <task1 populates buf>
> 
>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
> munmap(buf);
> 
> 
> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
> just like in the way we do during fork syscall.
> 
> There is no copying of buf memory content, unless COW happens. This is
> the principal difference to process_vm_writev(), which just allocates
> pages in remote VM.
> 
>> Keep in mind that, on x86, just the hardware part of a
>> page fault is very slow -- populating the memory with a syscall
>> instead of a fault may well be faster.
> 
> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
> pages related to buf of task1 are swapped:
> 
> 1)process_vm_writev() reads them back into memory;
> 
> 2)process_vm_mmap() just copies swap PTEs from task1 page table
>   to task2 page table.
> 
> Also, for faster page faults one may use huge pages for the mappings.
> But really, it's funny to think about page faults, when there are
> disk IO problems I shown.
> 
>>>
>>> 2)amount of memory for this example is doubled in a moment --
>>>   n pages in current and n pages in remote tasks are occupied
>>>   at the same time;
>>
>> This seems disingenuous.  If you're writing p pages total in chunks of
>> n pages, you will use a total of p pages if you use mmap and p+n if
>> you use write.
> 
> I didn't understand this sentence because of many ifs, sorry. Could you
> please explain your thought once again?
> 
>> That only doubles the amount of memory if you let n
>> scale linearly with p, which seems unlikely.
>>
>>>
>>> 3)received data has no a chance to be properly swapped for
>>>   a long time.
>>
>> ...
>>
>>> a)kernel moves @buf pages into swap right after recv();
>>> b)process_vm_writev() reads the data back from swap to pages;
>>
>> If you're under that much memory pressure and thrashing that badly,
>> your performance is going to be awful no matter what you're doing.  If
>> you indeed observe this behavior under normal loads, then this seems
>> like a VM issue that should be addressed in its own right.
> 
> I don't think so. Imagine: a container migrates from one node to another.
> The nodes are the same, say, every of them has 4GB of RAM.
> 
> Before the migration, the container's tasks used 4GB of RAM and 8GB of swap.
> After the page server on the second node received the pages, we want these
> pages become swapped as soon as possible, and we don't want to read them from
> swap to pass a read consumer.

Should be "to pass a *real* consumer".

> 
> The page server is task1 in the example. The real consumer is task2.
> 
> This is a rather normal load, I think.
> 
>>>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>>         recv(sock, buf, n * PAGE_SIZE, 0);
>>>
>>> [Task 2]
>>>         buf2 = process_vm_mmap(pid_of_task1, buf, n * PAGE_SIZE, NULL, 0);
>>>
>>> This creates a copy of VMA related to buf from task1 in task2's VM.
>>> Task1's page table entries are copied into corresponding page table
>>> entries of VM of task2.
>>
>> You need to fully explain a whole bunch of details that you're
>> ignored.
> 
> Yeah, it's not a problem :) I'm ready to explain and describe everything,
> what may cause a question. Just ask ;) 
> 
>> For example, if the remote VMA is MAP_ANONYMOUS, do you get
>> a CoW copy of it? I assume you don't since the whole point is to
>> write to remote memory
> 
> But, no, there *is* COW semantic. We do not copy memory. We copy
> page table content. This is just the same we have on fork(), when
> children duplicates parent's VMA and related page table subset,
> and parent's PTEs lose _PAGE_RW flag.
> 
> There is all copy_page_range() code reused for that. Please, see [3/7]
> for the details.
> 
> I'm going to get special performance using THP, when number of entries
> to copy is smaller than in case of PTE.
> 
> Copy several of PMD from one task page table to another's is much much much faster,
> than process_vm_write() copies pages (even not mention about its reading from swap).
> 
>> ,but it's at the very least quite unusual in
>> Linux to have two different anonymous VMAs such that writing one of
>> them changes the other one.
> Writing to a new VMA does not affect old VMA. Old VMA is just used to
> get vma->anon_vma and vma->vm_file from there. Two VMAs remain independent
> each other.
> 
>> But there are plenty of other questions.
>> What happens if the remote VMA is a gate area or other special mapping
>> (vDSO, vvar area, etc)?  What if the remote memory comes from a driver
>> that wasn't expecting the mapping to get magically copied to a
>> different process?
> 
> In case of someone wants to duplicate such the mappings, we may consider
> that, and extend the interface in the future for VMA types, which are
> safe for that.
> 
> But now the logic is very overprotective, and all the unusual mappings
> like you mentioned (also AIO, etc) is prohibited. Please, see [7/7]
> for the details.
> 
>> This new API seems quite dangerous and complex to me, and I don't
>> think the value has been adequately demonstrated.
> 
> I don't think it's dangerous and complex, because of I haven't introduced
> any principal VMA conceptions different to what we have now. We just
> borrow vma->anon_vma and vma->vm_file from remote process to local
> like we did on fork() (borrowing of vma->anon_vma means not blindly
> copying, but ordinary anon_vma_fork()).
> 
> Maybe I had to focus the description more on copying of PTE/PMD
> instead of vma duplication. So, it's unexpected for me, that people
> think about simple memory copying after reading the example I gave.
> But I gave more explanation here, so I hope the situation became
> clearer for a reader. Anyway, if you have any questions, please
> ask me.
> 
> Thanks,
> Kirill
>
Jann Horn May 21, 2019, 4:20 p.m. UTC | #4
On Tue, May 21, 2019 at 5:52 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> On 21.05.2019 17:43, Andy Lutomirski wrote:
> > On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >> New syscall, which allows to clone a remote process VMA
> >> into local process VM. The remote process's page table
> >> entries related to the VMA are cloned into local process's
> >> page table (in any desired address, which makes this different
> >> from that happens during fork()). Huge pages are handled
> >> appropriately.
[...]
> >> There are several problems with process_vm_writev() in this example:
> >>
> >> 1)it causes pagefault on remote process memory, and it forces
> >>   allocation of a new page (if was not preallocated);
> >
> > I don't see how your new syscall helps.  You're writing to remote
> > memory.  If that memory wasn't allocated, it's going to get allocated
> > regardless of whether you use a write-like interface or an mmap-like
> > interface.
>
> No, the talk is not about just another interface for copying memory.
> The talk is about borrowing of remote task's VMA and corresponding
> page table's content. Syscall allows to copy part of page table
> with preallocated pages from remote to local process. See here:
>
> [task1]                                                        [task2]
>
> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
>
> <task1 populates buf>
>
>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
> munmap(buf);
>
>
> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
> just like in the way we do during fork syscall.
>
> There is no copying of buf memory content, unless COW happens. This is
> the principal difference to process_vm_writev(), which just allocates
> pages in remote VM.
>
> > Keep in mind that, on x86, just the hardware part of a
> > page fault is very slow -- populating the memory with a syscall
> > instead of a fault may well be faster.
>
> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
> pages related to buf of task1 are swapped:
>
> 1)process_vm_writev() reads them back into memory;
>
> 2)process_vm_mmap() just copies swap PTEs from task1 page table
>   to task2 page table.
>
> Also, for faster page faults one may use huge pages for the mappings.
> But really, it's funny to think about page faults, when there are
> disk IO problems I shown.
[...]
> > That only doubles the amount of memory if you let n
> > scale linearly with p, which seems unlikely.
> >
> >>
> >> 3)received data has no a chance to be properly swapped for
> >>   a long time.
> >
> > ...
> >
> >> a)kernel moves @buf pages into swap right after recv();
> >> b)process_vm_writev() reads the data back from swap to pages;
> >
> > If you're under that much memory pressure and thrashing that badly,
> > your performance is going to be awful no matter what you're doing.  If
> > you indeed observe this behavior under normal loads, then this seems
> > like a VM issue that should be addressed in its own right.
>
> I don't think so. Imagine: a container migrates from one node to another.
> The nodes are the same, say, every of them has 4GB of RAM.
>
> Before the migration, the container's tasks used 4GB of RAM and 8GB of swap.
> After the page server on the second node received the pages, we want these
> pages become swapped as soon as possible, and we don't want to read them from
> swap to pass a read consumer.

But you don't have to copy that memory into the container's tasks all
at once, right? Can't you, every time you've received a few dozen
kilobytes of data or whatever, shove them into the target task? That
way you don't have problems with swap because the time before the data
has arrived in its final VMA is tiny.
Andy Lutomirski May 21, 2019, 4:43 p.m. UTC | #5
On Tue, May 21, 2019 at 8:52 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 21.05.2019 17:43, Andy Lutomirski wrote:
> > On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>
> >
> >> [Summary]
> >>
> >> New syscall, which allows to clone a remote process VMA
> >> into local process VM. The remote process's page table
> >> entries related to the VMA are cloned into local process's
> >> page table (in any desired address, which makes this different
> >> from that happens during fork()). Huge pages are handled
> >> appropriately.
> >>
> >> This allows to improve performance in significant way like
> >> it's shows in the example below.
> >>
> >> [Description]
> >>
> >> This patchset adds a new syscall, which makes possible
> >> to clone a VMA from a process to current process.
> >> The syscall supplements the functionality provided
> >> by process_vm_writev() and process_vm_readv() syscalls,
> >> and it may be useful in many situation.
> >>
> >> For example, it allows to make a zero copy of data,
> >> when process_vm_writev() was previously used:
> >>
> >>         struct iovec local_iov, remote_iov;
> >>         void *buf;
> >>
> >>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
> >>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
> >>         recv(sock, buf, n * PAGE_SIZE, 0);
> >>
> >>         local_iov->iov_base = buf;
> >>         local_iov->iov_len = n * PAGE_SIZE;
> >>         remove_iov = ...;
> >>
> >>         process_vm_writev(pid, &local_iov, 1, &remote_iov, 1 0);
> >>         munmap(buf, n * PAGE_SIZE);
> >>
> >>         (Note, that above completely ignores error handling)
> >>
> >> There are several problems with process_vm_writev() in this example:
> >>
> >> 1)it causes pagefault on remote process memory, and it forces
> >>   allocation of a new page (if was not preallocated);
> >
> > I don't see how your new syscall helps.  You're writing to remote
> > memory.  If that memory wasn't allocated, it's going to get allocated
> > regardless of whether you use a write-like interface or an mmap-like
> > interface.
>
> No, the talk is not about just another interface for copying memory.
> The talk is about borrowing of remote task's VMA and corresponding
> page table's content. Syscall allows to copy part of page table
> with preallocated pages from remote to local process. See here:
>
> [task1]                                                        [task2]
>
> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
>
> <task1 populates buf>
>
>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
> munmap(buf);
>
>
> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
> just like in the way we do during fork syscall.

If I understand this correctly, your intended use is to have one task
allocate memory and fill it, have the other task clone the VMA, and
have the first task free the VMA?  If so, that wasn't at all obvious
from your original email.

Why don't you use splice() instead?  splice() the data to the remote
task and have the remote task read() it?  All these VMA games will
result in a lot of flushes, which is bad for performance.  Or,
depending on your exact constraints, you could map a memfd in both
tasks instead, which has the same flushing issues but at least has a
sensible API.
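
Roughly like this (a sketch; passing the fd to the other task over a
unix socket with SCM_RIGHTS is omitted):

	/* task 1: create a buffer that both tasks can map */
	int fd = memfd_create("xfer", 0);
	ftruncate(fd, n * PAGE_SIZE);
	void *p1 = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
			MAP_SHARED, fd, 0);

	/* task 2: after receiving the fd, map the very same pages */
	void *p2 = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
			MAP_SHARED, fd, 0);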

>
> There is no copying of buf memory content, unless COW happens. This is
> the principal difference to process_vm_writev(), which just allocates
> pages in remote VM.
>
> > Keep in mind that, on x86, just the hardware part of a
> > page fault is very slow -- populating the memory with a syscall
> > instead of a fault may well be faster.
>
> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
> pages related to buf of task1 are swapped:
>
> 1)process_vm_writev() reads them back into memory;
>
> 2)process_vm_mmap() just copies swap PTEs from task1 page table
>   to task2 page table.
>
> Also, for faster page faults one may use huge pages for the mappings.
> But really, it's funny to think about page faults, when there are
> disk IO problems I shown.

What are you doing that is causing *disk* IO in any of this?  I
suspect your real problem is that you are using far too large of a
buffer. See below.

>
> >>
> >> 2)amount of memory for this example is doubled in a moment --
> >>   n pages in current and n pages in remote tasks are occupied
> >>   at the same time;
> >
> > This seems disingenuous.  If you're writing p pages total in chunks of
> > n pages, you will use a total of p pages if you use mmap and p+n if
> > you use write.
>
> I didn't understand this sentence because of many ifs, sorry. Could you
> please explain your thought once again?

You seem to have a function that tries to populate p pages of memory
with data received from a socket.  It looks like you're doing
something like this:

void copy_p_pages(size_t p)
{
  size_t n = some_value(p);
  char *buf = malloc(n * PAGE_SIZE);
  for (size_t i = 0; i < p; i += n) {  /* i, p, n counted in pages */
    read(fd, buf, n*PAGE_SIZE);  /* check return value, etc */
    process_vm_writev(write n*PAGE_SIZE bytes to remote process);
  }
  free(buf);
}

If you have a *constant* n (i.e. some_value(p) is just a number like
16), then you aren't doubling memory usage.  If you have
some_value(p) return p, then you are indeed doubling memory usage.  So
don't do that!
If buf is getting swapped out, you are very likely doing something
wrong.  If you're using a 100MB buffer or a 10GB one, then I'm not
surprised you have problems.  Try something reasonable like 128kB. For
extra fun, you could mlock() that buf, but if you're thrashing on
access to a 128kB working set, you will probably also get your *code*
swapped out, in which case you pretty much lose.


> > For example, if the remote VMA is MAP_ANONYMOUS, do you get
> > a CoW copy of it? I assume you don't since the whole point is to
> > write to remote memory
>
> But, no, there *is* COW semantic. We do not copy memory. We copy
> page table content. This is just the same we have on fork(), when
> children duplicates parent's VMA and related page table subset,
> and parent's PTEs lose _PAGE_RW flag.

Then you need to document this very carefully, because other people
will use your syscall in different ways than you use it.

And, if you are doing CoW like this, then your syscall is basically
only useful for your really weird use case in which you're using it to
import an already-populated VMA.  Maybe this is a reasonable feature
to add to the kernel, but it needs a benchmark against a reasonable
alternative.

>
> There is all copy_page_range() code reused for that. Please, see [3/7]
> for the details.

You can't ask users of a syscall to read the nitty-gritty mm code to
figure out what the syscall does from a user's perspective.

> > But there are plenty of other questions.
> > What happens if the remote VMA is a gate area or other special mapping
> > (vDSO, vvar area, etc)?  What if the remote memory comes from a driver
> > that wasn't expecting the mapping to get magically copied to a
> > different process?
>
> In case of someone wants to duplicate such the mappings, we may consider
> that, and extend the interface in the future for VMA types, which are
> safe for that.

Do you mean that the code you sent rejects this case?  If so, please
document it.  In any case, I looked at the code, and it seems to be
trying to handle MAP_SHARED and MAP_ANONYMOUS.  I don't see where it
would reject copying a vDSO.
Kirill Tkhai May 21, 2019, 5:03 p.m. UTC | #6
On 21.05.2019 19:20, Jann Horn wrote:
> On Tue, May 21, 2019 at 5:52 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>> On 21.05.2019 17:43, Andy Lutomirski wrote:
>>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>> New syscall, which allows to clone a remote process VMA
>>>> into local process VM. The remote process's page table
>>>> entries related to the VMA are cloned into local process's
>>>> page table (in any desired address, which makes this different
>>>> from that happens during fork()). Huge pages are handled
>>>> appropriately.
> [...]
>>>> There are several problems with process_vm_writev() in this example:
>>>>
>>>> 1)it causes pagefault on remote process memory, and it forces
>>>>   allocation of a new page (if was not preallocated);
>>>
>>> I don't see how your new syscall helps.  You're writing to remote
>>> memory.  If that memory wasn't allocated, it's going to get allocated
>>> regardless of whether you use a write-like interface or an mmap-like
>>> interface.
>>
>> No, the talk is not about just another interface for copying memory.
>> The talk is about borrowing of remote task's VMA and corresponding
>> page table's content. Syscall allows to copy part of page table
>> with preallocated pages from remote to local process. See here:
>>
>> [task1]                                                        [task2]
>>
>> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>
>> <task1 populates buf>
>>
>>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
>> munmap(buf);
>>
>>
>> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
>> just like in the way we do during fork syscall.
>>
>> There is no copying of buf memory content, unless COW happens. This is
>> the principal difference to process_vm_writev(), which just allocates
>> pages in remote VM.
>>
>>> Keep in mind that, on x86, just the hardware part of a
>>> page fault is very slow -- populating the memory with a syscall
>>> instead of a fault may well be faster.
>>
>> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
>> pages related to buf of task1 are swapped:
>>
>> 1)process_vm_writev() reads them back into memory;
>>
>> 2)process_vm_mmap() just copies swap PTEs from task1 page table
>>   to task2 page table.
>>
>> Also, for faster page faults one may use huge pages for the mappings.
>> But really, it's funny to think about page faults, when there are
>> disk IO problems I shown.
> [...]
>>> That only doubles the amount of memory if you let n
>>> scale linearly with p, which seems unlikely.
>>>
>>>>
>>>> 3)received data has no a chance to be properly swapped for
>>>>   a long time.
>>>
>>> ...
>>>
>>>> a)kernel moves @buf pages into swap right after recv();
>>>> b)process_vm_writev() reads the data back from swap to pages;
>>>
>>> If you're under that much memory pressure and thrashing that badly,
>>> your performance is going to be awful no matter what you're doing.  If
>>> you indeed observe this behavior under normal loads, then this seems
>>> like a VM issue that should be addressed in its own right.
>>
>> I don't think so. Imagine: a container migrates from one node to another.
>> The nodes are the same, say, every of them has 4GB of RAM.
>>
>> Before the migration, the container's tasks used 4GB of RAM and 8GB of swap.
>> After the page server on the second node received the pages, we want these
>> pages become swapped as soon as possible, and we don't want to read them from
>> swap to pass a read consumer.
> 
> But you don't have to copy that memory into the container's tasks all
> at once, right? Can't you, every time you've received a few dozen
> kilobytes of data or whatever, shove them into the target task? That
> way you don't have problems with swap because the time before the data
> has arrived in its final VMA is tiny.

We try to maintain online migration with as small a downtime as
possible, and the container on the source node is completely stopped
only at the very end. The memory of the container's tasks is copied in
the background, without stopping the container completely, and
_PAGE_SOFT_DIRTY is used to track dirty pages.
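
(For reference, the soft-dirty tracking is just the documented /proc
interface; roughly, a sketch:)

	/* Clear the soft-dirty bits of all the task's mappings
	 * (see Documentation/admin-guide/mm/soft-dirty.rst). */
	int fd = open("/proc/<pid>/clear_refs", O_WRONLY);
	write(fd, "4", 1);

	/* Later: bit 55 of the /proc/<pid>/pagemap entry for a page
	 * says whether it was written since the clear. */
	uint64_t ent;
	int pm = open("/proc/<pid>/pagemap", O_RDONLY);
	pread(pm, &ent, sizeof(ent), (addr / PAGE_SIZE) * sizeof(ent));
	if (ent & (1ULL << 55))
		/* dirty since the clear: resend this page */;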

The container may create new processes during the migration, and these
processes may contain any memory mappings.

Imagine the situation: we migrate a big web server with a lot of
processes, and some of the child processes have the same COW mapping
as the parent. If the whole memory dump is available at the moment the
grandparent web server process is created, we populate the mapping in
the parent, and all the children may inherit the mapping after fork()
if they want. COW works here. But if some processes are created before
all the memory is available on the destination node, we can't do such
COW inheritance. This would be a reason for the memory consumed by the
container to grow many times after the migration. So, the only
solution is to create the process tree after the memory is available
and all the mappings are known.
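
(In other words, COW inheritance only works if the mapping is already
populated before fork(); a sketch:)

	/* parent: populate the restored mapping first ... */
	char *m = mmap(NULL, len, PROT_READ|PROT_WRITE,
		       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	restore_pages(m, len);          /* hypothetical restore step */

	/* ... and only then fork(): the child inherits the pages
	 * copy-on-write, so the restored memory is not duplicated. */
	if (fork() == 0)
		run_child(m);           /* hypothetical */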

It's one of the examples. But believe me, there are a lot of other
reasons why the process tree should be created only after the whole
source process tree is frozen and no new tasks on the source are
possible. PGID and SID inheritance, for example. All of this requires
a special order of task creation. If you try to restore a process tree
with correct namespaces, and especially with many user namespaces in a
container, you will see hell open before your eyes; we could never
contemplate that.

So, no, we can't create any task before the whole process tree is
known. Believe me, the reason is weighty and serious.

Kirill
Jann Horn May 21, 2019, 5:28 p.m. UTC | #7
On Tue, May 21, 2019 at 7:04 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> On 21.05.2019 19:20, Jann Horn wrote:
> > On Tue, May 21, 2019 at 5:52 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >> On 21.05.2019 17:43, Andy Lutomirski wrote:
> >>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>>> New syscall, which allows to clone a remote process VMA
> >>>> into local process VM. The remote process's page table
> >>>> entries related to the VMA are cloned into local process's
> >>>> page table (in any desired address, which makes this different
> >>>> from that happens during fork()). Huge pages are handled
> >>>> appropriately.
> > [...]
> >>>> There are several problems with process_vm_writev() in this example:
> >>>>
> >>>> 1)it causes pagefault on remote process memory, and it forces
> >>>>   allocation of a new page (if was not preallocated);
> >>>
> >>> I don't see how your new syscall helps.  You're writing to remote
> >>> memory.  If that memory wasn't allocated, it's going to get allocated
> >>> regardless of whether you use a write-like interface or an mmap-like
> >>> interface.
> >>
> >> No, the talk is not about just another interface for copying memory.
> >> The talk is about borrowing of remote task's VMA and corresponding
> >> page table's content. Syscall allows to copy part of page table
> >> with preallocated pages from remote to local process. See here:
> >>
> >> [task1]                                                        [task2]
> >>
> >> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
> >>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
> >>
> >> <task1 populates buf>
> >>
> >>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
> >> munmap(buf);
> >>
> >>
> >> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
> >> just like in the way we do during fork syscall.
> >>
> >> There is no copying of buf memory content, unless COW happens. This is
> >> the principal difference to process_vm_writev(), which just allocates
> >> pages in remote VM.
> >>
> >>> Keep in mind that, on x86, just the hardware part of a
> >>> page fault is very slow -- populating the memory with a syscall
> >>> instead of a fault may well be faster.
> >>
> >> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
> >> pages related to buf of task1 are swapped:
> >>
> >> 1)process_vm_writev() reads them back into memory;
> >>
> >> 2)process_vm_mmap() just copies swap PTEs from task1 page table
> >>   to task2 page table.
> >>
> >> Also, for faster page faults one may use huge pages for the mappings.
> >> But really, it's funny to think about page faults, when there are
> >> disk IO problems I shown.
> > [...]
> >>> That only doubles the amount of memory if you let n
> >>> scale linearly with p, which seems unlikely.
> >>>
> >>>>
> >>>> 3)received data has no a chance to be properly swapped for
> >>>>   a long time.
> >>>
> >>> ...
> >>>
> >>>> a)kernel moves @buf pages into swap right after recv();
> >>>> b)process_vm_writev() reads the data back from swap to pages;
> >>>
> >>> If you're under that much memory pressure and thrashing that badly,
> >>> your performance is going to be awful no matter what you're doing.  If
> >>> you indeed observe this behavior under normal loads, then this seems
> >>> like a VM issue that should be addressed in its own right.
> >>
> >> I don't think so. Imagine: a container migrates from one node to another.
> >> The nodes are the same, say, every of them has 4GB of RAM.
> >>
> >> Before the migration, the container's tasks used 4GB of RAM and 8GB of swap.
> >> After the page server on the second node received the pages, we want these
> >> pages become swapped as soon as possible, and we don't want to read them from
> >> swap to pass a read consumer.
> >
> > But you don't have to copy that memory into the container's tasks all
> > at once, right? Can't you, every time you've received a few dozen
> > kilobytes of data or whatever, shove them into the target task? That
> > way you don't have problems with swap because the time before the data
> > has arrived in its final VMA is tiny.
>
> We try to maintain online migration with as small downtime as possible,
> and the container on source node is completely stopped at the very end.
> Memory of container tasks is copied in background without container
> completely stop, and _PAGE_SOFT_DIRTY is used to track dirty pages.
>
> Container may create any new processes during the migration, and these
> processes may contain any memory mappings.
>
> Imagine the situation. We migrate a big web server with a lot of processes,
> and some of children processes have the same COW mapping as parent has.
> In case of all memory dump is available at the moment of the grand parent
> web server process creation, we populate the mapping in parent, and all
> the children may inherit the mapping in case of they want after fork.
> COW works here. But in case of some processes are created before all memory
> is available on destination node, we can't do such the COW inheritance.
> This will be the reason, the memory consumed by container grows many
> times after the migration. So, the only solution is to create process
> tree after memory is available and all mappings are known.

But if one of the processes modifies the memory after you've started
migrating it to the new machine, that memory can't be CoW anymore
anyway, right? So it should work if you do a first pass of
copying the memory and creating the process hierarchy, and then copy
more recent changes into the individual processes, breaking the CoW
for those pages, right?

> It's on of the examples. But believe me, there are a lot of another reasons,
> why process tree should be created only after all process tree is freezed,
> and no new tasks on source are possible. PGID and SSID inheritance, for
> example. All of this requires special order of tasks creation. In case of
> you try to restore process tree with correct namespaces and especial in
> case of many user namespaces in a container, you will just see like a hell
> will open before your eyes, and we never can think about this.

Could you elaborate on why that is so hellish?


> So, no, we can't create any task before the whole process tree is knows.
> Believe me, the reason is heavy and serious.
>
> Kirill
>
Kirill Tkhai May 21, 2019, 5:44 p.m. UTC | #8
On 21.05.2019 19:43, Andy Lutomirski wrote:
> On Tue, May 21, 2019 at 8:52 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 21.05.2019 17:43, Andy Lutomirski wrote:
>>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>
>>>
>>>> [Summary]
>>>>
>>>> New syscall, which allows to clone a remote process VMA
>>>> into local process VM. The remote process's page table
>>>> entries related to the VMA are cloned into local process's
>>>> page table (in any desired address, which makes this different
>>>> from that happens during fork()). Huge pages are handled
>>>> appropriately.
>>>>
>>>> This allows to improve performance in significant way like
>>>> it's shows in the example below.
>>>>
>>>> [Description]
>>>>
>>>> This patchset adds a new syscall, which makes possible
>>>> to clone a VMA from a process to current process.
>>>> The syscall supplements the functionality provided
>>>> by process_vm_writev() and process_vm_readv() syscalls,
>>>> and it may be useful in many situation.
>>>>
>>>> For example, it allows to make a zero copy of data,
>>>> when process_vm_writev() was previously used:
>>>>
>>>>         struct iovec local_iov, remote_iov;
>>>>         void *buf;
>>>>
>>>>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>>>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>>>         recv(sock, buf, n * PAGE_SIZE, 0);
>>>>
>>>>         local_iov->iov_base = buf;
>>>>         local_iov->iov_len = n * PAGE_SIZE;
>>>>         remove_iov = ...;
>>>>
>>>>         process_vm_writev(pid, &local_iov, 1, &remote_iov, 1 0);
>>>>         munmap(buf, n * PAGE_SIZE);
>>>>
>>>>         (Note, that above completely ignores error handling)
>>>>
>>>> There are several problems with process_vm_writev() in this example:
>>>>
>>>> 1)it causes pagefault on remote process memory, and it forces
>>>>   allocation of a new page (if was not preallocated);
>>>
>>> I don't see how your new syscall helps.  You're writing to remote
>>> memory.  If that memory wasn't allocated, it's going to get allocated
>>> regardless of whether you use a write-like interface or an mmap-like
>>> interface.
>>
>> No, the talk is not about just another interface for copying memory.
>> The talk is about borrowing of remote task's VMA and corresponding
>> page table's content. Syscall allows to copy part of page table
>> with preallocated pages from remote to local process. See here:
>>
>> [task1]                                                        [task2]
>>
>> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>
>> <task1 populates buf>
>>
>>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
>> munmap(buf);
>>
>>
>> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
>> just like in the way we do during fork syscall.
> 
> If I understand this correctly, your intended use is to have one task
> allocate memory and fill it, have the other task clone the VMA, and
> have the first task free the VMA?  If so, that wasn't at all obvious
> from your original email.

Yes, exactly this. Sorry for the confusing initial description; it was not intentional.

> Why don't you use splice() instead?

I just don't see how anonymous memory could be moved from
one process to another via splice(). Could you explain your idea
in more detail?

> splice() the data to the remote
> task and have the remove task read() it?  All these VMA games will
> result in a lot of flushes, which is bad for performance.  Or,
> depending on your exact constraints, you could map a memfd in both
> tasks instead, which has the same flushing issues but at least has a
> sensible API.

A memfd is a file-backed mapping, and it is not suitable for that.
If a process had an anonymous mapping before the migration,
it wants the mapping to remain the same after the migration. So,
if we used a memfd, we would have to copy the memory from the memfd
mapping to its real anonymous mapping target, which has the same
problems as process_vm_writev().

>>
>> There is no copying of buf memory content, unless COW happens. This is
>> the principal difference to process_vm_writev(), which just allocates
>> pages in remote VM.
>>
>>> Keep in mind that, on x86, just the hardware part of a
>>> page fault is very slow -- populating the memory with a syscall
>>> instead of a fault may well be faster.
>>
>> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
>> pages related to buf of task1 are swapped:
>>
>> 1)process_vm_writev() reads them back into memory;
>>
>> 2)process_vm_mmap() just copies swap PTEs from task1 page table
>>   to task2 page table.
>>
>> Also, for faster page faults one may use huge pages for the mappings.
>> But really, it's funny to think about page faults, when there are
>> disk IO problems I shown.
> 
> What are you doing that is causing *disk* IO in any of this?  I
> suspect your real problem is that you are using far too large of a
> buffer. See below.

Imagine we are migrating a container which consists of 9GB of pages,
and we have 8GB of RAM on the destination node. Before the migration,
we had some of the pages in RAM and some in swap.

The source node sends pages to the destination node. And there are
limitations which do not allow creation of the process tree on the
destination node to start before all the memory is received.

Pages are received by a page server task on the destination. After all
the pages are received, we create the process tree and populate the
container tasks' mappings.

When we're populating a task's mappings, we have to copy memory from
the page server to the target task. If the pages were swapped out of
the page server's address space, we have to read them from swap
synchronously. This introduces the big latency and the big IO I talked
about.

> 
>>
>>>>
>>>> 2)amount of memory for this example is doubled in a moment --
>>>>   n pages in current and n pages in remote tasks are occupied
>>>>   at the same time;
>>>
>>> This seems disingenuous.  If you're writing p pages total in chunks of
>>> n pages, you will use a total of p pages if you use mmap and p+n if
>>> you use write.
>>
>> I didn't understand this sentence because of the many ifs, sorry. Could you
>> please explain your thought once again?
> 
> You seem to have a function that tries to populate p pages of memory
> with data received from a socket.  It looks like you're doing
> something like this:
> 
> void copy_p_pages(size_t p)
> {
>   size_t n = some_value(p);  /* chunk size, in pages */
>   char *buf = malloc(n * PAGE_SIZE);
>   for (size_t i = 0; i < p; i += n) {
>     read(fd, buf, n * PAGE_SIZE);  /* check return value, etc */
>     process_vm_writev(write n*PAGE_SIZE bytes to remote process);
>   }
>   free(buf);
> }
> 
> If you have a *constant* n (i.e. some_value(p) is just a number like
> 16), then you aren't doubling memory usage.  If you have
> some_value(p) return p, then you are indeed doubling memory usage.  So
> don't do that!
> If buf is getting swapped out, you are very likely doing something
> wrong.  If you're using a 100MB or a 10GB buffer, then I'm not
> surprised you have problems.  Try something reasonable like 128kB. For
> extra fun, you could mlock() that buf, but if you're thrashing on
> access to a 128kB working set, you will probably also get your *code*
> swapped out, in which case you pretty much lose.

The thing is, we can't use a small buffer. We have to receive all the
restored tasks' pages on the destination node before we start the process
tree creation, as I wrote above. All the anonymous memory is mapped into
the page server's mm, so it may get swapped out before the container's
process tree even starts to be created.
 
>>> For example, if the remote VMA is MAP_ANONYMOUS, do you get
>>> a CoW copy of it? I assume you don't since the whole point is to
>>> write to remote memory
>>
>> But no, there *is* COW semantics. We do not copy memory. We copy
>> page table contents. This is just the same as what we have on fork(), when
>> a child duplicates the parent's VMA and the related page table subset,
>> and the parent's PTEs lose the _PAGE_RW flag.
> 
> Then you need to document this very carefully, because other people
> will use your syscall in different ways than you use it.

OK, I will.

> And, if you are doing CoW like this, then your syscall is basically
> only useful for your really weird use case in which you're using it to
> import an already-populated VMA.  Maybe this is a reasonable feature
> to add to the kernel, but it needs a benchmark against a reasonable
> alternative.

Do you mean a comparison with process_vm_writev()/readv() or something
like that?

>>
>> All the copy_page_range() code is reused for that. Please see [3/7]
>> for the details.
> 
> You can't ask users of a syscall to read the nitty-gritty mm code to
> figure out what the syscall does from a user's perspective.

Yeah, sure :)
 
>>> But there are plenty of other questions.
>>> What happens if the remote VMA is a gate area or other special mapping
>>> (vDSO, vvar area, etc)?  What if the remote memory comes from a driver
>>> that wasn't expecting the mapping to get magically copied to a
>>> different process?
>>
>> If someone wants to duplicate such mappings, we may consider
>> that and extend the interface in the future for the VMA types which are
>> safe for that.
> 
> Do you mean that the code you sent rejects this case?  If so, please
> document it.  In any case, I looked at the code, and it seems to be
> trying to handle MAP_SHARED and MAP_ANONYMOUS.  I don't see where it
> would reject copying a vDSO.

I prohibit all VMAs which contain one of the flags VM_HUGETLB|VM_DONTEXPAND|VM_PFNMAP|VM_IO.
I'll check carefully whether that's enough for the vDSO.
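
A minimal sketch of that restriction (the helper name is illustrative; the
actual check lives in patch 3/7 and may differ):

	static bool vma_allows_remote_clone(struct vm_area_struct *vma)
	{
		/* Reject hugetlbfs, driver and other special mappings. */
		return !(vma->vm_flags &
			 (VM_HUGETLB | VM_DONTEXPAND | VM_PFNMAP | VM_IO));
	}

Note that special mappings such as the vDSO are installed via
_install_special_mapping(), which sets VM_DONTEXPAND, so they would
presumably already be rejected by the VM_DONTEXPAND part of this check.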

Thanks,
Kirill
Kirill Tkhai May 22, 2019, 10:03 a.m. UTC | #9
On 21.05.2019 20:28, Jann Horn wrote:
> On Tue, May 21, 2019 at 7:04 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>> On 21.05.2019 19:20, Jann Horn wrote:
>>> On Tue, May 21, 2019 at 5:52 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>> On 21.05.2019 17:43, Andy Lutomirski wrote:
>>>>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>> [...]
>>>>> That only doubles the amount of memory if you let n
>>>>> scale linearly with p, which seems unlikely.
>>>>>
>>>>>>
>>>>>> 3)the received data has no chance to be properly swapped out for
>>>>>>   a long time.
>>>>>
>>>>> ...
>>>>>
>>>>>> a)kernel moves @buf pages into swap right after recv();
>>>>>> b)process_vm_writev() reads the data back from swap to pages;
>>>>>
>>>>> If you're under that much memory pressure and thrashing that badly,
>>>>> your performance is going to be awful no matter what you're doing.  If
>>>>> you indeed observe this behavior under normal loads, then this seems
>>>>> like a VM issue that should be addressed in its own right.
>>>>
>>>> I don't think so. Imagine: a container migrates from one node to another.
>>>> The nodes are identical; say, each of them has 4GB of RAM.
>>>>
>>>> Before the migration, the container's tasks used 4GB of RAM and 8GB of swap.
>>>> After the page server on the second node has received the pages, we want these
>>>> pages to become swapped out as soon as possible, and we don't want to read them
>>>> back from swap just to pass them to a consumer.
>>>
>>> But you don't have to copy that memory into the container's tasks all
>>> at once, right? Can't you, every time you've received a few dozen
>>> kilobytes of data or whatever, shove them into the target task? That
>>> way you don't have problems with swap because the time before the data
>>> has arrived in its final VMA is tiny.
>>
>> We try to keep the downtime of online migration as small as possible,
>> so the container on the source node is stopped completely only at the very end.
>> The memory of the container's tasks is copied in the background without stopping
>> the container, and _PAGE_SOFT_DIRTY is used to track dirty pages.
>>
>> The container may create new processes during the migration, and these
>> processes may contain any memory mappings.
>>
>> Imagine the situation: we migrate a big web server with a lot of processes,
>> and some of the child processes share the same COW mapping the parent has.
>> If the whole memory dump is available at the moment the grandparent
>> web server process is created, we populate the mapping in the parent, and all
>> the children may inherit the mapping after fork() if they want;
>> COW works here. But if some processes are created before all memory
>> is available on the destination node, we can't do such COW inheritance.
>> Because of this, the memory consumed by the container would grow many
>> times after the migration. So, the only solution is to create the process
>> tree after all memory is available and all mappings are known.
> 
> But if one of the processes modifies the memory after you've started
> migrating it to the new machine, that memory can't be CoW anymore
> anyway, right? So it should work if you first do a first pass of
> copying the memory and creating the process hierarchy, and then copy
> more recent changes into the individual processes, breaking the CoW
> for those pages, right?

Not quite. We have to have all the processes killed on the source node
before creating the same process tree on the destination machine. The
process tree must be completely analyzed before any attempt to recreate it
from the ground up. The analysis lets us choose the strategy and the order
of each process's creation and of the inheritance of entities like
namespaces, mm, fdtables, etc. It's impossible to restore a process tree
if you have already started to create it but haven't stopped the processes
on the source node. Also, we can restore only a subset of process trees,
and you never know what will happen to a live process tree later.
So, the source process tree must be frozen before the restore.

Restoring an arbitrary process tree, within all the Linux restrictions
on the sequence of actions needed to recreate each process entity, is a
nontrivial mathematical problem. It has no known solution at the moment,
and it's unknown whether it has one at all.

So, at the moment the restore starts, we know everything about all tasks
and their relationships. Yes, we start to copy memory from source to
destination while the container on the source is alive. But we track this
memory with the _PAGE_SOFT_DIRTY flag, and dirtied pages are repopulated
with new content. Please also see the comment about "hell" below.
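
For reference, soft-dirty tracking is consumed through the standard /proc
interface; a rough sketch of the two operations involved (error handling
omitted):

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	#define PM_SOFT_DIRTY	(1ULL << 55)	/* bit 55 of a pagemap entry */

	/* Start a new tracking period: clear all soft-dirty bits of @pid. */
	static void clear_soft_dirty(pid_t pid)
	{
		char path[64];
		int fd;

		snprintf(path, sizeof(path), "/proc/%d/clear_refs", (int)pid);
		fd = open(path, O_WRONLY);
		write(fd, "4", 1);	/* "4" resets the soft-dirty bits */
		close(fd);
	}

	/* Was the page at @addr written since the last clear_soft_dirty()?
	 * @pagemap_fd is an open fd for /proc/<pid>/pagemap. */
	static int page_is_soft_dirty(int pagemap_fd, unsigned long addr)
	{
		uint64_t entry;
		long psize = sysconf(_SC_PAGESIZE);

		pread(pagemap_fd, &entry, sizeof(entry),
		      (addr / psize) * sizeof(entry));
		return !!(entry & PM_SOFT_DIRTY);
	}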
 
>> It's one of the examples. But believe me, there are a lot of other reasons
>> why the process tree should be created only after the whole source tree is
>> frozen and no new tasks are possible on the source. PGID and SSID
>> inheritance, for example. All of this requires a special order of task
>> creation. If you try to restore a process tree with correct namespaces, and
>> especially with many user namespaces in a container, you will see hell
>> open before your eyes; cases you could never have imagined.
> 
> Could you elaborate on why that is so hellish?

Because you never know by which path the system came into the state you're
seeing at the moment. Even in a simple process chain like:

          task1
            |
          task2
            |
          task3
          /   \
      task4  task5

any of these processes may change its namespace, pgid or ssid, unshare its
mm or files, become a child subreaper, or die and have its children
reparented, and it may do any of these before or after its parent does one
of its own actions. All of these actions are not independent of each other,
and some actions prohibit others. And you can restore the process tree only
if you replay the whole sequence in the same order the container originally
performed it.

It's impossible to say "task2, set your session to 5!", because the only
way to set the ssid is to call the setsid() syscall, which may be done only
once in a process's life. setsid() itself then imposes limitations on any
further setpgid() calls (see the code if you're interested). And this
limitation does not mean we should always call setpgid() before it; no,
there are no such simple rules. The same applies, even more so, to the
inheritance of namespaces, where some of a task's children may inherit the
task's old namespaces, some may inherit the current ones, and some will
inherit future ones. A child will be able to enter a namespace, say, a
net ns, only if its userns allows it. This imposes another limitation
on the process creation order, while still not giving a stable rule for
choosing the order in which to create the process tree. No, the only rule
is "at each moment in time, one of the tasks must perform one of the above
actions".

This is a mathematical problem of ordering a finite number of actors with
a finite number of rules and relationships between them, where the rules
do not allow an unambiguous replay from scratch given only the end state.

We handle a limited set of likely process tree configurations well,
and this is enough for most cases. But the general problem is very
difficult, and it is not even proven that it has a solution as a finite
set of rules you could apply to restore any process tree.

Kirill
Kirill A. Shutemov May 22, 2019, 3:22 p.m. UTC | #10
On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
> This patchset adds a new syscall which makes it possible
> to clone a VMA from a process into the current process.
> The syscall supplements the functionality provided
> by the process_vm_writev() and process_vm_readv() syscalls,
> and it may be useful in many situations.

Kirill, could you explain how the change affects rmap and why it is safe?

My concern is that the patchset allows mapping the same page multiple times
within one process, or even mapping a page allocated by a child into the parent.

It was not allowed before.

In the best case it makes reasoning about rmap substantially more difficult.

But I worry it will introduce hard-to-debug bugs, like the ones described in
https://lwn.net/Articles/383162/.

Note that in some cases we care about the rmap walk order (see for instance
the mremap() case). I'm not convinced that the feature will not break
something in that area.
Kirill Tkhai May 23, 2019, 4:11 p.m. UTC | #11
On 22.05.2019 18:22, Kirill A. Shutemov wrote:
> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
>> [...]
> 
> Kirill, could you explain how the change affects rmap and why it is safe?
> 
> My concern is that the patchset allows mapping the same page multiple times
> within one process, or even mapping a page allocated by a child into the parent.
> 
> It was not allowed before.
>
> In the best case it makes reasoning about rmap substantially more difficult.

I don't think process relationships have a big impact here, because, as
before, the main rule of VMA chaining is that VMAs are younger or older
than each other. For example, the reuse of an anon_vma in anon_vma_clone()
may be done by either children or siblings. Also, reparenting is possible
after one of the processes dies; or two branches of processes having the
same grandparent may be split into two chains after the grandparent dies.
So it looks like many combinations are already possible.

Mapping the same page multiple times is a different thing, and it was never
allowed for rmap.

Could you please say more specifically what looks suspicious to you, and
I'll try to answer? Otherwise the explanations could grow as big as a
dissertation and still miss all the answers you're actually interested in :)

> 
> But I worry it will introduce hard-to-debug bugs, like the ones described in
> https://lwn.net/Articles/383162/.

I read the article, but there are a lot of messages in the thread, and I'm
not sure I found the actual fix there. It looks like one of the fixes may
be the use of anon_vma->root in __page_set_anon_rmap().

> Note that in some cases we care about the rmap walk order (see for instance
> the mremap() case). I'm not convinced that the feature will not break
> something in that area.

Yeah, thanks for pointing that out, I'll check it.

Kirill
Andy Lutomirski May 23, 2019, 4:19 p.m. UTC | #12
On Tue, May 21, 2019 at 10:44 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 21.05.2019 19:43, Andy Lutomirski wrote:
> > On Tue, May 21, 2019 at 8:52 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>
> >> On 21.05.2019 17:43, Andy Lutomirski wrote:

> > Do you mean that the code you sent rejects this case?  If so, please
> > document it.  In any case, I looked at the code, and it seems to be
> > trying to handle MAP_SHARED and MAP_ANONYMOUS.  I don't see where it
> > would reject copying a vDSO.
>
> I prohibit all VMAs which contain one of the flags VM_HUGETLB|VM_DONTEXPAND|VM_PFNMAP|VM_IO.
> I'll check carefully whether that's enough for the vDSO.

I think you could make the new syscall a lot more comprehensible by
restricting it to just MAP_ANONYMOUS, by making it unmap the source,
or possibly both.  If the new syscall unmaps the source (in such an
order that the source is gone before the newly mapped pages become
accessible), then you avoid issues in which you need to define
sensible semantics for what happens if both copies are accessed
simultaneously.


--Andy
Kirill Tkhai May 24, 2019, 10:36 a.m. UTC | #13
On 23.05.2019 19:19, Andy Lutomirski wrote:
> On Tue, May 21, 2019 at 10:44 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 21.05.2019 19:43, Andy Lutomirski wrote:
>>> On Tue, May 21, 2019 at 8:52 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>
>>>> On 21.05.2019 17:43, Andy Lutomirski wrote:
> 
>>> [...]
> 
> I think you could make the new syscall a lot more comprehensible by
> restricting it to just MAP_ANONYMOUS, by making it unmap the source,
> or possibly both.  If the new syscall unmaps the source (in such an
> order that the source is gone before the newly mapped pages become
> accessible), then you avoid issues in which you need to define
> sensible semantics for what happens if both copies are accessed
> simultaneously.

If we unmap the source, this does not introduce the fundamentally new
behavior of the same page being mapped twice in a single process, which
Kirill pointed out. This sounds like a good idea, and it covers my
application area.

The only fundamentally new thing is that a child process will be able to
inherit a parent's VMA, which is not possible now. But it looks like we
never depend on process relationships in the mapping code, and process
reparenting already gives many combinations, so the new change should not
affect this much.

Kirill
Kirill Tkhai May 24, 2019, 10:45 a.m. UTC | #14
On 22.05.2019 18:22, Kirill A. Shutemov wrote:
> [...]
> 
> But I worry it will introduce hard-to-debug bugs, like the ones described in
> https://lwn.net/Articles/383162/.

Andy suggested unmapping the PTEs from the source page table, and this makes
it impossible for a single page to ever be mapped twice in the same process.
This is OK for my use case, and here we just take the small step of "allow
a VMA to be inherited by a child process", which we didn't have before.
If someone later needs to continue this work and allow the same page to be
mapped twice in a single process, that person will have the supported basis
we create in this small step. I believe something like a debugger may want
this, to make a fast snapshot of a process's private memory (while the task
is stopped for a short time to get its memory). But for me, remapping is
enough at the moment.

What do you think about this?

[...]

Kirill
Kirill A. Shutemov May 24, 2019, 11:52 a.m. UTC | #15
On Fri, May 24, 2019 at 01:45:50PM +0300, Kirill Tkhai wrote:
> On 22.05.2019 18:22, Kirill A. Shutemov wrote:
> > [...]
> 
> Andy suggested unmapping the PTEs from the source page table, and this makes
> it impossible for a single page to ever be mapped twice in the same process.
> This is OK for my use case, and here we just take the small step of "allow
> a VMA to be inherited by a child process", which we didn't have before.
> [...]
> 
> What do you think about this?

I don't think that unmapping alone will do. Consider the following
scenario:

1. Task A creates and populates the mapping.
2. Task A forks. Task B now maps the same pages, but write-protected.
3. Task B calls process_vm_mmap() and passes the mapping to the parent.

After this Task A will have the same anon pages mapped twice.

One possible way out would be to force CoW on all pages in the mapping,
before passing the mapping to the new process.

Thanks,
Kirill.
Kirill Tkhai May 24, 2019, 2 p.m. UTC | #16
On 24.05.2019 14:52, Kirill A. Shutemov wrote:
> On Fri, May 24, 2019 at 01:45:50PM +0300, Kirill Tkhai wrote:
>> [...]
>> What do you think about this?
> 
> I don't think that unmapping alone will do. Consider the following
> scenario:
> 
> 1. Task A creates and populates the mapping.
> 2. Task A forks. Task B now maps the same pages, but write-protected.
> 3. Task B calls process_vm_mmap() and passes the mapping to the parent.
> 
> After this Task A will have the same anon pages mapped twice.

Ah, sure.

> One possible way out would be to force CoW on all pages in the mapping,
> before passing the mapping to the new process.

This would pop all the swapped pages back up, which is exactly what the
patchset aims to prevent.

Hm, what about allowing remapping only of a VMA whose anon_vma::rb_root
contains a single chain and whose vma->anon_vma_chain contains a single
entry? That is a VMA which has been faulted, but whose mm was never
duplicated (or whose forks have already died).
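
An illustrative kernel-side sketch of that restriction (my reading of the
proposal; the helper name is hypothetical, and this is not code from the
series):

	/*
	 * Only allow remapping of a VMA whose anon pages have never been
	 * shared through fork().
	 */
	static bool vma_anon_is_private(struct vm_area_struct *vma)
	{
		/* Nothing faulted yet: no anon pages to share at all. */
		if (!vma->anon_vma)
			return true;

		/* A single entry on vma->anon_vma_chain, with an anon_vma
		 * that is its own root, means neither a parent nor a child
		 * has ever shared these pages. */
		return list_is_singular(&vma->anon_vma_chain) &&
		       vma->anon_vma == vma->anon_vma->root;
	}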

Thanks,
Kirill
Kirill A. Shutemov May 27, 2019, 11:30 p.m. UTC | #17
On Fri, May 24, 2019 at 05:00:32PM +0300, Kirill Tkhai wrote:
> On 24.05.2019 14:52, Kirill A. Shutemov wrote:
> > [...]
> > I don't think that unmapping alone will do. Consider the following
> > scenario:
> > 
> > 1. Task A creates and populates the mapping.
> > 2. Task A forks. Task B now maps the same pages, but write-protected.
> > 3. Task B calls process_vm_mmap() and passes the mapping to the parent.
> > 
> > After this Task A will have the same anon pages mapped twice.
> 
> Ah, sure.
> 
> > One possible way out would be to force CoW on all pages in the mapping,
> > before passing the mapping to the new process.
> 
> This would pop all the swapped pages back up, which is exactly what the
> patchset aims to prevent.
> 
> Hm, what about allowing remapping only of a VMA whose anon_vma::rb_root
> contains a single chain and whose vma->anon_vma_chain contains a single
> entry? That is a VMA which has been faulted, but whose mm was never
> duplicated (or whose forks have already died).

The requirement for the VMA to be faulted (to have any pages mapped) looks
excessive to me, but the general idea may work.

One issue I see is that userspace may not have full control to create such
a VMA. vma_merge() can merge the VMA into the next one without any consent
from userspace, and you'll get an anon_vma inherited from the VMA you've
just merged with.

I don't have any good idea how to get around this.
Kirill Tkhai May 28, 2019, 9:15 a.m. UTC | #18
On 28.05.2019 02:30, Kirill A. Shutemov wrote:
> On Fri, May 24, 2019 at 05:00:32PM +0300, Kirill Tkhai wrote:
>> [...]
>> Hm, what about allowing remapping only of a VMA whose anon_vma::rb_root
>> contains a single chain and whose vma->anon_vma_chain contains a single
>> entry? That is a VMA which has been faulted, but whose mm was never
>> duplicated (or whose forks have already died).
> 
> The requirement for the VMA to be faulted (to have any pages mapped) looks
> excessive to me, but the general idea may work.
> 
> One issue I see is that userspace may not have full control to create such
> a VMA. vma_merge() can merge the VMA into the next one without any consent
> from userspace, and you'll get an anon_vma inherited from the VMA you've
> just merged with.
> 
> I don't have any good idea how to get around this.

Technically it is possible: create boundary one-page VMAs with a different
protection, one just above and one just below the desired region, and then
map the desired mapping in between. But this is not convenient.
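
In userspace, the guard-VMA trick would look roughly like this sketch
(error handling omitted; 4K pages assumed for brevity):

	#include <sys/mman.h>

	static void *mmap_unmergeable(size_t len)
	{
		long pg = 4096;
		char *area = mmap(NULL, len + 2 * pg, PROT_NONE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		/* The middle part becomes a separate VMA with its own
		 * protection; the PROT_NONE page on each side keeps
		 * vma_merge() from merging it with any neighbour. */
		mprotect(area + pg, len, PROT_READ | PROT_WRITE);
		return area + pg;
	}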

I don't think it's difficult to find a natural limitation which prevents
mapping a single page twice, if we want to avoid this at least at the start.
Another suggestion:

prohibit mapping a remote process's VMA only when its vm_area_struct::anon_vma::root
is the same as the root of one of the local process's VMAs.

What about this?

Thanks,
Kirill
Kirill A. Shutemov May 28, 2019, 4:15 p.m. UTC | #19
On Tue, May 28, 2019 at 12:15:16PM +0300, Kirill Tkhai wrote:
> On 28.05.2019 02:30, Kirill A. Shutemov wrote:
> > [...]
> > One issue I see is that userspace may not have full control to create such
> > a VMA. vma_merge() can merge the VMA into the next one without any consent
> > from userspace, and you'll get an anon_vma inherited from the VMA you've
> > just merged with.
> > 
> > I don't have any good idea how to get around this.
> 
> Technically it is possible: create boundary one-page VMAs with a different
> protection, one just above and one just below the desired region, and then
> map the desired mapping in between. But this is not convenient.
> 
> I don't think it's difficult to find a natural limitation which prevents
> mapping a single page twice, if we want to avoid this at least at the start.
> Another suggestion:
> 
> prohibit mapping a remote process's VMA only when its vm_area_struct::anon_vma::root
> is the same as the root of one of the local process's VMAs.
> 
> What about this?

I don't see anything immediately wrong with this, but it's still going to
produce puzzling errors for a user. How would you document such a limitation
in a way that makes sense to a userspace developer?
Kirill Tkhai May 29, 2019, 2:33 p.m. UTC | #20
On 28.05.2019 19:15, Kirill A. Shutemov wrote:
> On Tue, May 28, 2019 at 12:15:16PM +0300, Kirill Tkhai wrote:
>> [...]
>> prohibit mapping a remote process's VMA only when its vm_area_struct::anon_vma::root
>> is the same as the root of one of the local process's VMAs.
>> 
>> What about this?
> 
> I don't see anything immediately wrong with this, but it's still going to
> produce puzzling errors for a user. How would you document such a limitation
> in a way that makes sense to a userspace developer?

It's difficult, since the limitation is artificial.

I can only suggest a stricter limitation.

Something like: "a VMA may be remapped only as a whole region, and only
if there was no fork() after the VMA appeared in the process (by mmap()
or by remapping from another remote process). If the VMA was merged with
a neighbouring VMA, the same rules apply to the neighbours."

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e8834ac32b7..0bcd6f598e73 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -287,13 +287,17 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
+#define VM_MAY_REMOTE_REMAP	VM_HIGH_ARCH_5
+
 #ifdef CONFIG_ARCH_HAS_PKEYS
 # define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
 # define VM_PKEY_BIT0	VM_HIGH_ARCH_0	/* A protection key is a 4-bit value */
diff --git a/kernel/fork.c b/kernel/fork.c
index ff4efd16fd82..a3c758c8cd54 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -584,8 +584,13 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		if (!(tmp->vm_flags & VM_WIPEONFORK))
+		if (!(tmp->vm_flags & VM_WIPEONFORK)) {
 			retval = copy_page_range(mm, oldmm, mpnt);
+			/* Both copies now share the anon pages, so neither
+			 * may be remote-remapped any longer. */
+			tmp->vm_flags &= ~VM_MAY_REMOTE_REMAP;
+			mpnt->vm_flags &= ~VM_MAY_REMOTE_REMAP;
+		}
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
Kirill Tkhai June 3, 2019, 2:38 p.m. UTC | #21
On 22.05.2019 18:22, Kirill A. Shutemov wrote:
> [...]
> 
> Kirill, could you explain how the change affects rmap and why it is safe?
> 
> My concern is that the patchset allows mapping the same page multiple times
> within one process, or even mapping a page allocated by a child into the parent.

Honestly speaking, we already support this model, since ZERO_PAGE() may
be mapped multiple times in any number of mappings.

Kirill
Kirill Tkhai June 3, 2019, 2:56 p.m. UTC | #22
On 03.06.2019 17:38, Kirill Tkhai wrote:
> On 22.05.2019 18:22, Kirill A. Shutemov wrote:
>> [...]
> 
> Honestly speaking, we already support this model, since ZERO_PAGE() may
> be mapped multiple times in any number of mappings.

Picking up huge_zero_page and mremap()ing its VMA to an unaligned address
also gives a case where the same huge page is mapped both as a huge page
and as a set of ordinary pages in the same process.
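
A rough sketch of that construction (only to show how userspace reaches the
situation described; whether the PTEs produced by the split still point
into the huge zero page depends on the kernel version, and error handling
is omitted):

	#define _GNU_SOURCE
	#include <stdint.h>
	#include <sys/mman.h>

	#define SZ_2M	(2UL * 1024 * 1024)

	int main(void)
	{
		/* Over-allocate so a 2MB-aligned start can be chosen. */
		char *a = mmap(NULL, 3 * SZ_2M, PROT_READ,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		char *huge = (char *)(((uintptr_t)a + SZ_2M - 1) &
				      ~(SZ_2M - 1));

		madvise(huge, SZ_2M, MADV_HUGEPAGE);
		(void)*(volatile char *)huge;	/* read fault: huge zero PMD */

		/* Move it to a deliberately non-2MB-aligned address,
		 * which forces the huge PMD mapping to be split. */
		mremap(huge, SZ_2M, SZ_2M, MREMAP_MAYMOVE | MREMAP_FIXED,
		       huge + 4 * SZ_2M + 4096);
		return 0;
	}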

Summing up the two cases above: is there really a fundamental problem with
the functionality the patchset introduces? It looks like these cases are
already supported in stable kernels.

Thanks,
Kirill
Kirill A. Shutemov June 3, 2019, 5:47 p.m. UTC | #23
On Mon, Jun 03, 2019 at 05:56:32PM +0300, Kirill Tkhai wrote:
> On 03.06.2019 17:38, Kirill Tkhai wrote:
> > On 22.05.2019 18:22, Kirill A. Shutemov wrote:
> >> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
> >>> This patchset adds a new syscall, which makes possible
> >>> to clone a VMA from a process to current process.
> >>> The syscall supplements the functionality provided
> >>> by process_vm_writev() and process_vm_readv() syscalls,
> >>> and it may be useful in many situation.
> >>
> >> Kirill, could you explain how the change affects rmap and how it is safe.
> >>
> >> My concern is that the patchset allows to map the same page multiple times
> >> within one process or even map page allocated by child to the parrent.
> > 
> > Honestly speaking, we already support this model, since ZERO_PAGE() may
> > be mapped multiple times in any number of mappings.
> 
> Picking up huge_zero_page and mremap()ing its VMA to an unaligned address
> also gives a case where the same huge page is mapped both as a huge page
> and as a set of ordinary pages in the same process.
> 
> Summing up the two cases above: is there really a fundamental problem with
> the functionality the patchset introduces? It looks like these cases are
> already supported in stable kernels.

It *might* work. But it requires a lot of audit to prove that it actually
*does* work.

For instance, are you sure it will not break KSM? What does it mean for
memory accounting? memcg?

My point is that you are breaking a long-standing invariant in Linux MM, and
it has to be properly justified.

I would expect to see some strange deadlocks or permanent trylock failures
as a result of such a change.
Kirill Tkhai June 4, 2019, 9:32 a.m. UTC | #24
On 03.06.2019 20:47, Kirill A. Shutemov wrote:
> On Mon, Jun 03, 2019 at 05:56:32PM +0300, Kirill Tkhai wrote:
>> On 03.06.2019 17:38, Kirill Tkhai wrote:
>>> [...]
>>>
>>> Honestly speaking, we already support this model, since ZERO_PAGE() may
>>> be mapped multiple times in any number of mappings.
>>
>> Picking up huge_zero_page and mremap()ing its VMA to an unaligned address
>> also gives a case where the same huge page is mapped both as a huge page
>> and as a set of ordinary pages in the same process.
>> 
>> Summing up the two cases above: is there really a fundamental problem with
>> the functionality the patchset introduces? It looks like these cases are
>> already supported in stable kernels.
> 
> It *might* work. But it requires a lot of audit to prove that it actually
> *does* work.

Please describe what the audit results should look like for you. If I hadn't
done some audit before preparing the patchset, I wouldn't have sent it. So,
give me an idea of what you expect from this.

> For instance, are you sure it will not break KSM?

Yes, it does not break KSM. The main point is that with KSM we may already
have not just a page mapped twice in a single process, but even a page
mapped twice in a single VMA. And this is just a particular case of the
generally supported set. (An ordinary page still can't be mapped twice in a
single VMA, since the pgoff difference won't allow merging two such hunks
together.)

The generic rule of KSM is "anything may happen to a page in real time,
and all of it will be reflected in the stable and unstable trees and in the
rmap_items some time later". Pages of a duplicated VMA will be interpreted
as a KSM fork, and the corresponding checks in unstable_tree_search_insert()
and stable_tree_search() provide for this.

When both the source and destination VMAs are mergeable:
1)if a page was added to the stable tree before the duplication of the
  related VMA, then during the scan of the destination VMA in
  cmp_and_merge_page() it will be detected as a duplicate, and we will just
  add the related rmap_item to the stable node chain;
2)if a page was added to the unstable tree before the duplication of the
  related VMA, and it is still there, then the page will be detected as a
  duplicate in the destination VMA, and the scan of the page will be
  skipped until the next turn;
3)if a page was not added to any tree before the duplication, it may be
  added to one of the trees later and will be handled by one of the two
  rules above.

When one of the source or destination VMAs is not mergeable, while a page
becomes PageKsm() during the scan of the other one, the unmergeable VMA ends
up referring to a PageKsm() page which has no rmap_item for it. But it is
still possible to unmap that page from the unmergeable VMA, since
rmap_walk_ksm() goes over all the anon_vmas under rb_root. This is just the
same as what happens when a process forks and its child makes the VMA
unmergeable.

> What does it mean for memory accounting? memcg?

A memcg, once assigned, remains the same after VMA duplication. The mapped
page range advances the counters in vm_stat_account(). Since we keep fork()
semantics, the same thing happens as after fork()+mremap().

> My point is that you are breaking a long-standing invariant in Linux MM, and
> it has to be properly justified.

I'm not against that. Please say which form of justification you expect.
I assume you do not mean retelling every line of the existing code, because
then the words would take 10 times more space than the code itself, which is
simply not humanly possible.

Please make a specific request: what do you expect, and what should it look
like?

> I would expect to see some strange deadlocks or permanent trylock failures
> as a result of such a change.

Are you hinting at some specific area? Do you expect me to run some specific
test cases? Do you want us to add some debugging engine on top of page
locking to detect such trylock failures?

Thanks,
Kirill