Message ID: 155836064844.2441.10911127801797083064.stgit@localhost.localdomain (mailing list archive)
Series: mm: process_vm_mmap() -- syscall for duplication a process mapping
On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: > > [Summary] > > New syscall, which allows to clone a remote process VMA > into local process VM. The remote process's page table > entries related to the VMA are cloned into local process's > page table (in any desired address, which makes this different > from that happens during fork()). Huge pages are handled > appropriately. > > This allows to improve performance in significant way like > it's shows in the example below. > > [Description] > > This patchset adds a new syscall, which makes possible > to clone a VMA from a process to current process. > The syscall supplements the functionality provided > by process_vm_writev() and process_vm_readv() syscalls, > and it may be useful in many situation. > > For example, it allows to make a zero copy of data, > when process_vm_writev() was previously used: > > struct iovec local_iov, remote_iov; > void *buf; > > buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, ...); > recv(sock, buf, n * PAGE_SIZE, 0); > > local_iov->iov_base = buf; > local_iov->iov_len = n * PAGE_SIZE; > remove_iov = ...; > > process_vm_writev(pid, &local_iov, 1, &remote_iov, 1 0); > munmap(buf, n * PAGE_SIZE); > > (Note, that above completely ignores error handling) > > There are several problems with process_vm_writev() in this example: > > 1)it causes pagefault on remote process memory, and it forces > allocation of a new page (if was not preallocated); I don't see how your new syscall helps. You're writing to remote memory. If that memory wasn't allocated, it's going to get allocated regardless of whether you use a write-like interface or an mmap-like interface. Keep in mind that, on x86, just the hardware part of a page fault is very slow -- populating the memory with a syscall instead of a fault may well be faster. > > 2)amount of memory for this example is doubled in a moment -- > n pages in current and n pages in remote tasks are occupied > at the same time; This seems disingenuous. If you're writing p pages total in chunks of n pages, you will use a total of p pages if you use mmap and p+n if you use write. That only doubles the amount of memory if you let n scale linearly with p, which seems unlikely. > > 3)received data has no a chance to be properly swapped for > a long time. ... > a)kernel moves @buf pages into swap right after recv(); > b)process_vm_writev() reads the data back from swap to pages; If you're under that much memory pressure and thrashing that badly, your performance is going to be awful no matter what you're doing. If you indeed observe this behavior under normal loads, then this seems like a VM issue that should be addressed in its own right. > buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, ...); > recv(sock, buf, n * PAGE_SIZE, 0); > > [Task 2] > buf2 = process_vm_mmap(pid_of_task1, buf, n * PAGE_SIZE, NULL, 0); > > This creates a copy of VMA related to buf from task1 in task2's VM. > Task1's page table entries are copied into corresponding page table > entries of VM of task2. You need to fully explain a whole bunch of details that you're ignored. For example, if the remote VMA is MAP_ANONYMOUS, do you get a CoW copy of it? I assume you don't since the whole point is to write to remote memory, but it's at the very least quite unusual in Linux to have two different anonymous VMAs such that writing one of them changes the other one. But there are plenty of other questions. 
What happens if the remote VMA is a gate area or other special mapping (vDSO, vvar area, etc)? What if the remote memory comes from a driver that wasn't expecting the mapping to get magically copied to a different process? This new API seems quite dangerous and complex to me, and I don't think the value has been adequately demonstrated.
On 21.05.2019 17:43, Andy Lutomirski wrote: > On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: >> > >> [Summary] >> >> New syscall, which allows to clone a remote process VMA >> into local process VM. The remote process's page table >> entries related to the VMA are cloned into local process's >> page table (in any desired address, which makes this different >> from that happens during fork()). Huge pages are handled >> appropriately. >> >> This allows to improve performance in significant way like >> it's shows in the example below. >> >> [Description] >> >> This patchset adds a new syscall, which makes possible >> to clone a VMA from a process to current process. >> The syscall supplements the functionality provided >> by process_vm_writev() and process_vm_readv() syscalls, >> and it may be useful in many situation. >> >> For example, it allows to make a zero copy of data, >> when process_vm_writev() was previously used: >> >> struct iovec local_iov, remote_iov; >> void *buf; >> >> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE, >> MAP_PRIVATE|MAP_ANONYMOUS, ...); >> recv(sock, buf, n * PAGE_SIZE, 0); >> >> local_iov->iov_base = buf; >> local_iov->iov_len = n * PAGE_SIZE; >> remove_iov = ...; >> >> process_vm_writev(pid, &local_iov, 1, &remote_iov, 1 0); >> munmap(buf, n * PAGE_SIZE); >> >> (Note, that above completely ignores error handling) >> >> There are several problems with process_vm_writev() in this example: >> >> 1)it causes pagefault on remote process memory, and it forces >> allocation of a new page (if was not preallocated); > > I don't see how your new syscall helps. You're writing to remote > memory. If that memory wasn't allocated, it's going to get allocated > regardless of whether you use a write-like interface or an mmap-like > interface. No, the talk is not about just another interface for copying memory. The talk is about borrowing of remote task's VMA and corresponding page table's content. Syscall allows to copy part of page table with preallocated pages from remote to local process. See here: [task1] [task2] buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, ...); <task1 populates buf> buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...); munmap(buf); process_vm_mmap() copies PTEs related to memory of buf in task1 to task2 just like in the way we do during fork syscall. There is no copying of buf memory content, unless COW happens. This is the principal difference to process_vm_writev(), which just allocates pages in remote VM. > Keep in mind that, on x86, just the hardware part of a > page fault is very slow -- populating the memory with a syscall > instead of a fault may well be faster. It is not as slow, as disk IO has. Just compare, what happens in case of anonymous pages related to buf of task1 are swapped: 1)process_vm_writev() reads them back into memory; 2)process_vm_mmap() just copies swap PTEs from task1 page table to task2 page table. Also, for faster page faults one may use huge pages for the mappings. But really, it's funny to think about page faults, when there are disk IO problems I shown. >> >> 2)amount of memory for this example is doubled in a moment -- >> n pages in current and n pages in remote tasks are occupied >> at the same time; > > This seems disingenuous. If you're writing p pages total in chunks of > n pages, you will use a total of p pages if you use mmap and p+n if > you use write. I didn't understand this sentence because of many ifs, sorry. 
Could you please explain your thought once again? > That only doubles the amount of memory if you let n > scale linearly with p, which seems unlikely. > >> >> 3)received data has no a chance to be properly swapped for >> a long time. > > ... > >> a)kernel moves @buf pages into swap right after recv(); >> b)process_vm_writev() reads the data back from swap to pages; > > If you're under that much memory pressure and thrashing that badly, > your performance is going to be awful no matter what you're doing. If > you indeed observe this behavior under normal loads, then this seems > like a VM issue that should be addressed in its own right. I don't think so. Imagine: a container migrates from one node to another. The nodes are the same, say, every of them has 4GB of RAM. Before the migration, the container's tasks used 4GB of RAM and 8GB of swap. After the page server on the second node received the pages, we want these pages become swapped as soon as possible, and we don't want to read them from swap to pass a read consumer. The page server is task1 in the example. The real consumer is task2. This is a rather normal load, I think. >> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE, >> MAP_PRIVATE|MAP_ANONYMOUS, ...); >> recv(sock, buf, n * PAGE_SIZE, 0); >> >> [Task 2] >> buf2 = process_vm_mmap(pid_of_task1, buf, n * PAGE_SIZE, NULL, 0); >> >> This creates a copy of VMA related to buf from task1 in task2's VM. >> Task1's page table entries are copied into corresponding page table >> entries of VM of task2. > > You need to fully explain a whole bunch of details that you're > ignored. Yeah, it's not a problem :) I'm ready to explain and describe everything, what may cause a question. Just ask ;) > For example, if the remote VMA is MAP_ANONYMOUS, do you get > a CoW copy of it? I assume you don't since the whole point is to > write to remote memory But, no, there *is* COW semantic. We do not copy memory. We copy page table content. This is just the same we have on fork(), when children duplicates parent's VMA and related page table subset, and parent's PTEs lose _PAGE_RW flag. There is all copy_page_range() code reused for that. Please, see [3/7] for the details. I'm going to get special performance using THP, when number of entries to copy is smaller than in case of PTE. Copy several of PMD from one task page table to another's is much much much faster, than process_vm_write() copies pages (even not mention about its reading from swap). >,but it's at the very least quite unusual in > Linux to have two different anonymous VMAs such that writing one of > them changes the other one. Writing to a new VMA does not affect old VMA. Old VMA is just used to get vma->anon_vma and vma->vm_file from there. Two VMAs remain independent each other. > But there are plenty of other questions. > What happens if the remote VMA is a gate area or other special mapping > (vDSO, vvar area, etc)? What if the remote memory comes from a driver > that wasn't expecting the mapping to get magically copied to a > different process? In case of someone wants to duplicate such the mappings, we may consider that, and extend the interface in the future for VMA types, which are safe for that. But now the logic is very overprotective, and all the unusual mappings like you mentioned (also AIO, etc) is prohibited. Please, see [7/7] for the details. > This new API seems quite dangerous and complex to me, and I don't > think the value has been adequately demonstrated. 
I don't think it's dangerous or complex, because I haven't introduced any VMA concepts beyond what we already have. We just borrow vma->anon_vma and vma->vm_file from the remote process into the local one, like we do on fork() (borrowing vma->anon_vma means not blindly copying it, but the ordinary anon_vma_fork()).

Maybe I should have focused the description more on the copying of PTEs/PMDs instead of on VMA duplication; it's unexpected for me that people think about simple memory copying after reading the example I gave. But I gave more explanation here, so I hope the situation is clearer now. Anyway, if you have any questions, please ask me.

Thanks,
Kirill
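A minimal fork()-based sketch of the CoW behaviour being compared to above (illustrative only: process_vm_mmap() is still a proposal, so plain fork() stands in for the page-table duplication; a write in one task does not change the other task's view):

#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	/* error handling omitted for brevity */
	int *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	*p = 1;					/* populate before the duplication */

	pid_t child = fork();			/* PTEs are shared copy-on-write */
	if (child == 0) {
		*p = 2;				/* CoW break: child gets its own page */
		_exit(0);
	}
	waitpid(child, NULL, 0);
	printf("parent still sees %d\n", *p);	/* prints 1 */
	return 0;
}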
On 21.05.2019 18:52, Kirill Tkhai wrote: > On 21.05.2019 17:43, Andy Lutomirski wrote: >> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: >>> >> >>> [Summary] >>> >>> New syscall, which allows to clone a remote process VMA >>> into local process VM. The remote process's page table >>> entries related to the VMA are cloned into local process's >>> page table (in any desired address, which makes this different >>> from that happens during fork()). Huge pages are handled >>> appropriately. >>> >>> This allows to improve performance in significant way like >>> it's shows in the example below. >>> >>> [Description] >>> >>> This patchset adds a new syscall, which makes possible >>> to clone a VMA from a process to current process. >>> The syscall supplements the functionality provided >>> by process_vm_writev() and process_vm_readv() syscalls, >>> and it may be useful in many situation. >>> >>> For example, it allows to make a zero copy of data, >>> when process_vm_writev() was previously used: >>> >>> struct iovec local_iov, remote_iov; >>> void *buf; >>> >>> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE, >>> MAP_PRIVATE|MAP_ANONYMOUS, ...); >>> recv(sock, buf, n * PAGE_SIZE, 0); >>> >>> local_iov->iov_base = buf; >>> local_iov->iov_len = n * PAGE_SIZE; >>> remove_iov = ...; >>> >>> process_vm_writev(pid, &local_iov, 1, &remote_iov, 1 0); >>> munmap(buf, n * PAGE_SIZE); >>> >>> (Note, that above completely ignores error handling) >>> >>> There are several problems with process_vm_writev() in this example: >>> >>> 1)it causes pagefault on remote process memory, and it forces >>> allocation of a new page (if was not preallocated); >> >> I don't see how your new syscall helps. You're writing to remote >> memory. If that memory wasn't allocated, it's going to get allocated >> regardless of whether you use a write-like interface or an mmap-like >> interface. > > No, the talk is not about just another interface for copying memory. > The talk is about borrowing of remote task's VMA and corresponding > page table's content. Syscall allows to copy part of page table > with preallocated pages from remote to local process. See here: > > [task1] [task2] > > buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, ...); > > <task1 populates buf> > > buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...); > munmap(buf); > > > process_vm_mmap() copies PTEs related to memory of buf in task1 to task2 > just like in the way we do during fork syscall. > > There is no copying of buf memory content, unless COW happens. This is > the principal difference to process_vm_writev(), which just allocates > pages in remote VM. > >> Keep in mind that, on x86, just the hardware part of a >> page fault is very slow -- populating the memory with a syscall >> instead of a fault may well be faster. > > It is not as slow, as disk IO has. Just compare, what happens in case of anonymous > pages related to buf of task1 are swapped: > > 1)process_vm_writev() reads them back into memory; > > 2)process_vm_mmap() just copies swap PTEs from task1 page table > to task2 page table. > > Also, for faster page faults one may use huge pages for the mappings. > But really, it's funny to think about page faults, when there are > disk IO problems I shown. > >>> >>> 2)amount of memory for this example is doubled in a moment -- >>> n pages in current and n pages in remote tasks are occupied >>> at the same time; >> >> This seems disingenuous. 
If you're writing p pages total in chunks of >> n pages, you will use a total of p pages if you use mmap and p+n if >> you use write. > > I didn't understand this sentence because of many ifs, sorry. Could you > please explain your thought once again? > >> That only doubles the amount of memory if you let n >> scale linearly with p, which seems unlikely. >> >>> >>> 3)received data has no a chance to be properly swapped for >>> a long time. >> >> ... >> >>> a)kernel moves @buf pages into swap right after recv(); >>> b)process_vm_writev() reads the data back from swap to pages; >> >> If you're under that much memory pressure and thrashing that badly, >> your performance is going to be awful no matter what you're doing. If >> you indeed observe this behavior under normal loads, then this seems >> like a VM issue that should be addressed in its own right. > > I don't think so. Imagine: a container migrates from one node to another. > The nodes are the same, say, every of them has 4GB of RAM. > > Before the migration, the container's tasks used 4GB of RAM and 8GB of swap. > After the page server on the second node received the pages, we want these > pages become swapped as soon as possible, and we don't want to read them from > swap to pass a read consumer. Should be "to pass a *real* consumer". > > The page server is task1 in the example. The real consumer is task2. > > This is a rather normal load, I think. > >>> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE, >>> MAP_PRIVATE|MAP_ANONYMOUS, ...); >>> recv(sock, buf, n * PAGE_SIZE, 0); >>> >>> [Task 2] >>> buf2 = process_vm_mmap(pid_of_task1, buf, n * PAGE_SIZE, NULL, 0); >>> >>> This creates a copy of VMA related to buf from task1 in task2's VM. >>> Task1's page table entries are copied into corresponding page table >>> entries of VM of task2. >> >> You need to fully explain a whole bunch of details that you're >> ignored. > > Yeah, it's not a problem :) I'm ready to explain and describe everything, > what may cause a question. Just ask ;) > >> For example, if the remote VMA is MAP_ANONYMOUS, do you get >> a CoW copy of it? I assume you don't since the whole point is to >> write to remote memory > > But, no, there *is* COW semantic. We do not copy memory. We copy > page table content. This is just the same we have on fork(), when > children duplicates parent's VMA and related page table subset, > and parent's PTEs lose _PAGE_RW flag. > > There is all copy_page_range() code reused for that. Please, see [3/7] > for the details. > > I'm going to get special performance using THP, when number of entries > to copy is smaller than in case of PTE. > > Copy several of PMD from one task page table to another's is much much much faster, > than process_vm_write() copies pages (even not mention about its reading from swap). > >> ,but it's at the very least quite unusual in >> Linux to have two different anonymous VMAs such that writing one of >> them changes the other one. > Writing to a new VMA does not affect old VMA. Old VMA is just used to > get vma->anon_vma and vma->vm_file from there. Two VMAs remain independent > each other. > >> But there are plenty of other questions. >> What happens if the remote VMA is a gate area or other special mapping >> (vDSO, vvar area, etc)? What if the remote memory comes from a driver >> that wasn't expecting the mapping to get magically copied to a >> different process? 
> > In case of someone wants to duplicate such the mappings, we may consider > that, and extend the interface in the future for VMA types, which are > safe for that. > > But now the logic is very overprotective, and all the unusual mappings > like you mentioned (also AIO, etc) is prohibited. Please, see [7/7] > for the details. > >> This new API seems quite dangerous and complex to me, and I don't >> think the value has been adequately demonstrated. > > I don't think it's dangerous and complex, because of I haven't introduced > any principal VMA conceptions different to what we have now. We just > borrow vma->anon_vma and vma->vm_file from remote process to local > like we did on fork() (borrowing of vma->anon_vma means not blindly > copying, but ordinary anon_vma_fork()). > > Maybe I had to focus the description more on copying of PTE/PMD > instead of vma duplication. So, it's unexpected for me, that people > think about simple memory copying after reading the example I gave. > But I gave more explanation here, so I hope the situation became > clearer for a reader. Anyway, if you have any questions, please > ask me. > > Thanks, > Kirill >
On Tue, May 21, 2019 at 5:52 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: > On 21.05.2019 17:43, Andy Lutomirski wrote: > > On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: > >> New syscall, which allows to clone a remote process VMA > >> into local process VM. The remote process's page table > >> entries related to the VMA are cloned into local process's > >> page table (in any desired address, which makes this different > >> from that happens during fork()). Huge pages are handled > >> appropriately. [...] > >> There are several problems with process_vm_writev() in this example: > >> > >> 1)it causes pagefault on remote process memory, and it forces > >> allocation of a new page (if was not preallocated); > > > > I don't see how your new syscall helps. You're writing to remote > > memory. If that memory wasn't allocated, it's going to get allocated > > regardless of whether you use a write-like interface or an mmap-like > > interface. > > No, the talk is not about just another interface for copying memory. > The talk is about borrowing of remote task's VMA and corresponding > page table's content. Syscall allows to copy part of page table > with preallocated pages from remote to local process. See here: > > [task1] [task2] > > buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, ...); > > <task1 populates buf> > > buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...); > munmap(buf); > > > process_vm_mmap() copies PTEs related to memory of buf in task1 to task2 > just like in the way we do during fork syscall. > > There is no copying of buf memory content, unless COW happens. This is > the principal difference to process_vm_writev(), which just allocates > pages in remote VM. > > > Keep in mind that, on x86, just the hardware part of a > > page fault is very slow -- populating the memory with a syscall > > instead of a fault may well be faster. > > It is not as slow, as disk IO has. Just compare, what happens in case of anonymous > pages related to buf of task1 are swapped: > > 1)process_vm_writev() reads them back into memory; > > 2)process_vm_mmap() just copies swap PTEs from task1 page table > to task2 page table. > > Also, for faster page faults one may use huge pages for the mappings. > But really, it's funny to think about page faults, when there are > disk IO problems I shown. [...] > > That only doubles the amount of memory if you let n > > scale linearly with p, which seems unlikely. > > > >> > >> 3)received data has no a chance to be properly swapped for > >> a long time. > > > > ... > > > >> a)kernel moves @buf pages into swap right after recv(); > >> b)process_vm_writev() reads the data back from swap to pages; > > > > If you're under that much memory pressure and thrashing that badly, > > your performance is going to be awful no matter what you're doing. If > > you indeed observe this behavior under normal loads, then this seems > > like a VM issue that should be addressed in its own right. > > I don't think so. Imagine: a container migrates from one node to another. > The nodes are the same, say, every of them has 4GB of RAM. > > Before the migration, the container's tasks used 4GB of RAM and 8GB of swap. > After the page server on the second node received the pages, we want these > pages become swapped as soon as possible, and we don't want to read them from > swap to pass a read consumer. But you don't have to copy that memory into the container's tasks all at once, right? 
Can't you, every time you've received a few dozen kilobytes of data or whatever, shove them into the target task? That way you don't have problems with swap because the time before the data has arrived in its final VMA is tiny.
On Tue, May 21, 2019 at 8:52 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: > > On 21.05.2019 17:43, Andy Lutomirski wrote: > > On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: > >> > > > >> [Summary] > >> > >> New syscall, which allows to clone a remote process VMA > >> into local process VM. The remote process's page table > >> entries related to the VMA are cloned into local process's > >> page table (in any desired address, which makes this different > >> from that happens during fork()). Huge pages are handled > >> appropriately. > >> > >> This allows to improve performance in significant way like > >> it's shows in the example below. > >> > >> [Description] > >> > >> This patchset adds a new syscall, which makes possible > >> to clone a VMA from a process to current process. > >> The syscall supplements the functionality provided > >> by process_vm_writev() and process_vm_readv() syscalls, > >> and it may be useful in many situation. > >> > >> For example, it allows to make a zero copy of data, > >> when process_vm_writev() was previously used: > >> > >> struct iovec local_iov, remote_iov; > >> void *buf; > >> > >> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE, > >> MAP_PRIVATE|MAP_ANONYMOUS, ...); > >> recv(sock, buf, n * PAGE_SIZE, 0); > >> > >> local_iov->iov_base = buf; > >> local_iov->iov_len = n * PAGE_SIZE; > >> remove_iov = ...; > >> > >> process_vm_writev(pid, &local_iov, 1, &remote_iov, 1 0); > >> munmap(buf, n * PAGE_SIZE); > >> > >> (Note, that above completely ignores error handling) > >> > >> There are several problems with process_vm_writev() in this example: > >> > >> 1)it causes pagefault on remote process memory, and it forces > >> allocation of a new page (if was not preallocated); > > > > I don't see how your new syscall helps. You're writing to remote > > memory. If that memory wasn't allocated, it's going to get allocated > > regardless of whether you use a write-like interface or an mmap-like > > interface. > > No, the talk is not about just another interface for copying memory. > The talk is about borrowing of remote task's VMA and corresponding > page table's content. Syscall allows to copy part of page table > with preallocated pages from remote to local process. See here: > > [task1] [task2] > > buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, ...); > > <task1 populates buf> > > buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...); > munmap(buf); > > > process_vm_mmap() copies PTEs related to memory of buf in task1 to task2 > just like in the way we do during fork syscall. If I understand this correctly, your intended use is to have one task allocate memory and fill it, have the other task clone the VMA, and have the first task free the VMA? If so, that wasn't at all obvious from your original email. Why don't you use splice() instead? splice() the data to the remote task and have the remove task read() it? All these VMA games will result in a lot of flushes, which is bad for performance. Or, depending on your exact constraints, you could map a memfd in both tasks instead, which has the same flushing issues but at least has a sensible API. > > There is no copying of buf memory content, unless COW happens. This is > the principal difference to process_vm_writev(), which just allocates > pages in remote VM. > > > Keep in mind that, on x86, just the hardware part of a > > page fault is very slow -- populating the memory with a syscall > > instead of a fault may well be faster. 
> > It is not as slow, as disk IO has. Just compare, what happens in case of anonymous > pages related to buf of task1 are swapped: > > 1)process_vm_writev() reads them back into memory; > > 2)process_vm_mmap() just copies swap PTEs from task1 page table > to task2 page table. > > Also, for faster page faults one may use huge pages for the mappings. > But really, it's funny to think about page faults, when there are > disk IO problems I shown. What are you doing that is causing *disk* IO in any of this? I suspect your real problem is that you are using far too large of a buffer. See below. > > >> > >> 2)amount of memory for this example is doubled in a moment -- > >> n pages in current and n pages in remote tasks are occupied > >> at the same time; > > > > This seems disingenuous. If you're writing p pages total in chunks of > > n pages, you will use a total of p pages if you use mmap and p+n if > > you use write. > > I didn't understand this sentence because of many ifs, sorry. Could you > please explain your thought once again? You seem to have a function that tries to populate p pages of memory with data received from a socket. It looks like you're doing something like this: void copy_p_pages(size_t p) { size_t n = some_value(p); char *buf = malloc(n * PAGE_SIZE); for (int i = 0; i < p; i += n*PAGE_SIZE) { read(fd, buf, n*PAGE_SIZE); /* check return value, etc */ process_vm_writev(write n*PAGE_SIZE bytes to remote process); } free(buf); } If you have a *constant* n (i.e. some_value(p) is just a number like 16)), then you aren't doubling memory usage. If you have some_value(p) return p, then you are indeed doubling memory usage. So don't do that! If buf is getting swapped out, you are very likely doing something wrong. If you're using a 100MB buffer or a 10GB, then I'm not surprised you have problems. Try something reasonable like 128kB. For extra fun, you could mlock() that buf, but if you're thrashing on access to a 128kB working set, you will probably also get your *code* swapped out, in which case you pretty much lose. > > For example, if the remote VMA is MAP_ANONYMOUS, do you get > > a CoW copy of it? I assume you don't since the whole point is to > > write to remote memory > > But, no, there *is* COW semantic. We do not copy memory. We copy > page table content. This is just the same we have on fork(), when > children duplicates parent's VMA and related page table subset, > and parent's PTEs lose _PAGE_RW flag. Then you need to document this very carefully, because other people will use your syscall in different ways than you use it. And, if you are doing CoW like this, then your syscall is basically only useful for your really weird use case in which you're using it to import an already-populated VMA. Maybe this is a reasonable feature to add to the kernel, but it needs a benchmark against a reasonable alternative. > > There is all copy_page_range() code reused for that. Please, see [3/7] > for the details. You can't as users of a syscall to read the nitty gritty mm code to figure out what the syscall does from a user's perspective. > > But there are plenty of other questions. > > What happens if the remote VMA is a gate area or other special mapping > > (vDSO, vvar area, etc)? What if the remote memory comes from a driver > > that wasn't expecting the mapping to get magically copied to a > > different process? > > In case of someone wants to duplicate such the mappings, we may consider > that, and extend the interface in the future for VMA types, which are > safe for that. 
Do you mean that the code you sent rejects this case? If so, please document it. In any case, I looked at the code, and it seems to be trying to handle MAP_SHARED and MAP_ANONYMOUS. I don't see where it would reject copying a vDSO.
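For reference, a fleshed-out version of the fixed-chunk loop Andy sketches above; the 128 KiB chunk size, the function name and the minimal error handling are illustrative, not taken from the patchset:

#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define CHUNK (128 * 1024)	/* constant chunk: extra memory stays bounded */

static int copy_to_remote(int fd, pid_t pid, char *remote_base, size_t total)
{
	char *buf = malloc(CHUNK);
	size_t off = 0;

	if (!buf)
		return -1;

	while (off < total) {
		ssize_t n = read(fd, buf, CHUNK);	/* fill the local chunk */
		if (n <= 0)
			break;

		struct iovec local  = { .iov_base = buf,               .iov_len = n };
		struct iovec remote = { .iov_base = remote_base + off, .iov_len = n };

		/* push the chunk into the target task's address space */
		if (process_vm_writev(pid, &local, 1, &remote, 1, 0) != n)
			break;
		off += n;
	}
	free(buf);
	return off == total ? 0 : -1;
}

With a constant CHUNK, the extra memory used is bounded by the chunk size rather than scaling with the amount of data transferred, which is the p versus p+n point above.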
On 21.05.2019 19:20, Jann Horn wrote: > On Tue, May 21, 2019 at 5:52 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: >> On 21.05.2019 17:43, Andy Lutomirski wrote: >>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: >>>> New syscall, which allows to clone a remote process VMA >>>> into local process VM. The remote process's page table >>>> entries related to the VMA are cloned into local process's >>>> page table (in any desired address, which makes this different >>>> from that happens during fork()). Huge pages are handled >>>> appropriately. > [...] >>>> There are several problems with process_vm_writev() in this example: >>>> >>>> 1)it causes pagefault on remote process memory, and it forces >>>> allocation of a new page (if was not preallocated); >>> >>> I don't see how your new syscall helps. You're writing to remote >>> memory. If that memory wasn't allocated, it's going to get allocated >>> regardless of whether you use a write-like interface or an mmap-like >>> interface. >> >> No, the talk is not about just another interface for copying memory. >> The talk is about borrowing of remote task's VMA and corresponding >> page table's content. Syscall allows to copy part of page table >> with preallocated pages from remote to local process. See here: >> >> [task1] [task2] >> >> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE, >> MAP_PRIVATE|MAP_ANONYMOUS, ...); >> >> <task1 populates buf> >> >> buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...); >> munmap(buf); >> >> >> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2 >> just like in the way we do during fork syscall. >> >> There is no copying of buf memory content, unless COW happens. This is >> the principal difference to process_vm_writev(), which just allocates >> pages in remote VM. >> >>> Keep in mind that, on x86, just the hardware part of a >>> page fault is very slow -- populating the memory with a syscall >>> instead of a fault may well be faster. >> >> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous >> pages related to buf of task1 are swapped: >> >> 1)process_vm_writev() reads them back into memory; >> >> 2)process_vm_mmap() just copies swap PTEs from task1 page table >> to task2 page table. >> >> Also, for faster page faults one may use huge pages for the mappings. >> But really, it's funny to think about page faults, when there are >> disk IO problems I shown. > [...] >>> That only doubles the amount of memory if you let n >>> scale linearly with p, which seems unlikely. >>> >>>> >>>> 3)received data has no a chance to be properly swapped for >>>> a long time. >>> >>> ... >>> >>>> a)kernel moves @buf pages into swap right after recv(); >>>> b)process_vm_writev() reads the data back from swap to pages; >>> >>> If you're under that much memory pressure and thrashing that badly, >>> your performance is going to be awful no matter what you're doing. If >>> you indeed observe this behavior under normal loads, then this seems >>> like a VM issue that should be addressed in its own right. >> >> I don't think so. Imagine: a container migrates from one node to another. >> The nodes are the same, say, every of them has 4GB of RAM. >> >> Before the migration, the container's tasks used 4GB of RAM and 8GB of swap. >> After the page server on the second node received the pages, we want these >> pages become swapped as soon as possible, and we don't want to read them from >> swap to pass a read consumer. 
>
> But you don't have to copy that memory into the container's tasks all
> at once, right? Can't you, every time you've received a few dozen
> kilobytes of data or whatever, shove them into the target task? That
> way you don't have problems with swap because the time before the data
> has arrived in its final VMA is tiny.

We try to keep the downtime of live migration as small as possible, and the container on the source node is completely stopped only at the very end. Memory of the container's tasks is copied in the background without stopping the container, and _PAGE_SOFT_DIRTY is used to track dirty pages.

The container may create new processes during the migration, and these processes may contain any memory mappings.

Imagine the situation: we migrate a big web server with a lot of processes, and some of the child processes share a COW mapping with the parent. If the whole memory dump is available at the moment the grandparent web server process is created, we populate the mapping in the parent, and all the children that want it can inherit the mapping after fork(); COW works here. But if some processes are created before all of the memory is available on the destination node, we can't do such COW inheritance, and the memory consumed by the container grows many times after the migration. So the only solution is to create the process tree after the memory is available and all mappings are known.

That's one of the examples. But believe me, there are a lot of other reasons why the process tree should be created only after the whole source process tree is frozen and no new tasks are possible on the source: PGID and SSID inheritance, for example. All of this requires a special order of task creation. If you try to restore a process tree with correct namespaces, and especially with many user namespaces in a container, you will see a hell open before your eyes; we can't even consider it. So, no, we can't create any task before the whole process tree is known. Believe me, the reason is heavy and serious.

Kirill
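For context, a userspace sketch of the soft-dirty tracking mentioned above, following Documentation/admin-guide/mm/soft-dirty.rst: writing "4" to /proc/<pid>/clear_refs re-arms the tracking, and bit 55 of a /proc/<pid>/pagemap entry reports that the page has been written since then. The proc paths and the bit number are the documented interface; the helper names and the omitted error handling are illustrative.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define SOFT_DIRTY_BIT	(1ULL << 55)	/* pagemap: page is soft-dirty */

static void clear_soft_dirty(pid_t pid)
{
	char path[64];
	int fd;

	snprintf(path, sizeof(path), "/proc/%d/clear_refs", pid);
	fd = open(path, O_WRONLY);
	write(fd, "4", 1);		/* reset soft-dirty bits for this task */
	close(fd);
}

static int page_is_soft_dirty(pid_t pid, unsigned long vaddr)
{
	char path[64];
	uint64_t entry = 0;
	int fd;

	snprintf(path, sizeof(path), "/proc/%d/pagemap", pid);
	fd = open(path, O_RDONLY);
	pread(fd, &entry, sizeof(entry),
	      (off_t)(vaddr / sysconf(_SC_PAGESIZE)) * sizeof(entry));
	close(fd);
	return !!(entry & SOFT_DIRTY_BIT);
}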
On Tue, May 21, 2019 at 7:04 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: > On 21.05.2019 19:20, Jann Horn wrote: > > On Tue, May 21, 2019 at 5:52 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: > >> On 21.05.2019 17:43, Andy Lutomirski wrote: > >>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: > >>>> New syscall, which allows to clone a remote process VMA > >>>> into local process VM. The remote process's page table > >>>> entries related to the VMA are cloned into local process's > >>>> page table (in any desired address, which makes this different > >>>> from that happens during fork()). Huge pages are handled > >>>> appropriately. > > [...] > >>>> There are several problems with process_vm_writev() in this example: > >>>> > >>>> 1)it causes pagefault on remote process memory, and it forces > >>>> allocation of a new page (if was not preallocated); > >>> > >>> I don't see how your new syscall helps. You're writing to remote > >>> memory. If that memory wasn't allocated, it's going to get allocated > >>> regardless of whether you use a write-like interface or an mmap-like > >>> interface. > >> > >> No, the talk is not about just another interface for copying memory. > >> The talk is about borrowing of remote task's VMA and corresponding > >> page table's content. Syscall allows to copy part of page table > >> with preallocated pages from remote to local process. See here: > >> > >> [task1] [task2] > >> > >> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE, > >> MAP_PRIVATE|MAP_ANONYMOUS, ...); > >> > >> <task1 populates buf> > >> > >> buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...); > >> munmap(buf); > >> > >> > >> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2 > >> just like in the way we do during fork syscall. > >> > >> There is no copying of buf memory content, unless COW happens. This is > >> the principal difference to process_vm_writev(), which just allocates > >> pages in remote VM. > >> > >>> Keep in mind that, on x86, just the hardware part of a > >>> page fault is very slow -- populating the memory with a syscall > >>> instead of a fault may well be faster. > >> > >> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous > >> pages related to buf of task1 are swapped: > >> > >> 1)process_vm_writev() reads them back into memory; > >> > >> 2)process_vm_mmap() just copies swap PTEs from task1 page table > >> to task2 page table. > >> > >> Also, for faster page faults one may use huge pages for the mappings. > >> But really, it's funny to think about page faults, when there are > >> disk IO problems I shown. > > [...] > >>> That only doubles the amount of memory if you let n > >>> scale linearly with p, which seems unlikely. > >>> > >>>> > >>>> 3)received data has no a chance to be properly swapped for > >>>> a long time. > >>> > >>> ... > >>> > >>>> a)kernel moves @buf pages into swap right after recv(); > >>>> b)process_vm_writev() reads the data back from swap to pages; > >>> > >>> If you're under that much memory pressure and thrashing that badly, > >>> your performance is going to be awful no matter what you're doing. If > >>> you indeed observe this behavior under normal loads, then this seems > >>> like a VM issue that should be addressed in its own right. > >> > >> I don't think so. Imagine: a container migrates from one node to another. > >> The nodes are the same, say, every of them has 4GB of RAM. 
> >> > >> Before the migration, the container's tasks used 4GB of RAM and 8GB of swap. > >> After the page server on the second node received the pages, we want these > >> pages become swapped as soon as possible, and we don't want to read them from > >> swap to pass a read consumer. > > > > But you don't have to copy that memory into the container's tasks all > > at once, right? Can't you, every time you've received a few dozen > > kilobytes of data or whatever, shove them into the target task? That > > way you don't have problems with swap because the time before the data > > has arrived in its final VMA is tiny. > > We try to maintain online migration with as small downtime as possible, > and the container on source node is completely stopped at the very end. > Memory of container tasks is copied in background without container > completely stop, and _PAGE_SOFT_DIRTY is used to track dirty pages. > > Container may create any new processes during the migration, and these > processes may contain any memory mappings. > > Imagine the situation. We migrate a big web server with a lot of processes, > and some of children processes have the same COW mapping as parent has. > In case of all memory dump is available at the moment of the grand parent > web server process creation, we populate the mapping in parent, and all > the children may inherit the mapping in case of they want after fork. > COW works here. But in case of some processes are created before all memory > is available on destination node, we can't do such the COW inheritance. > This will be the reason, the memory consumed by container grows many > times after the migration. So, the only solution is to create process > tree after memory is available and all mappings are known. But if one of the processes modifies the memory after you've started migrating it to the new machine, that memory can't be CoW anymore anyway, right? So it should work if you first do a first pass of copying the memory and creating the process hierarchy, and then copy more recent changes into the individual processes, breaking the CoW for those pages, right? > It's on of the examples. But believe me, there are a lot of another reasons, > why process tree should be created only after all process tree is freezed, > and no new tasks on source are possible. PGID and SSID inheritance, for > example. All of this requires special order of tasks creation. In case of > you try to restore process tree with correct namespaces and especial in > case of many user namespaces in a container, you will just see like a hell > will open before your eyes, and we never can think about this. Could you elaborate on why that is so hellish? > So, no, we can't create any task before the whole process tree is knows. > Believe me, the reason is heavy and serious. > > Kirill >
On 21.05.2019 19:43, Andy Lutomirski wrote: > On Tue, May 21, 2019 at 8:52 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: >> >> On 21.05.2019 17:43, Andy Lutomirski wrote: >>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: >>>> >>> >>>> [Summary] >>>> >>>> New syscall, which allows to clone a remote process VMA >>>> into local process VM. The remote process's page table >>>> entries related to the VMA are cloned into local process's >>>> page table (in any desired address, which makes this different >>>> from that happens during fork()). Huge pages are handled >>>> appropriately. >>>> >>>> This allows to improve performance in significant way like >>>> it's shows in the example below. >>>> >>>> [Description] >>>> >>>> This patchset adds a new syscall, which makes possible >>>> to clone a VMA from a process to current process. >>>> The syscall supplements the functionality provided >>>> by process_vm_writev() and process_vm_readv() syscalls, >>>> and it may be useful in many situation. >>>> >>>> For example, it allows to make a zero copy of data, >>>> when process_vm_writev() was previously used: >>>> >>>> struct iovec local_iov, remote_iov; >>>> void *buf; >>>> >>>> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE, >>>> MAP_PRIVATE|MAP_ANONYMOUS, ...); >>>> recv(sock, buf, n * PAGE_SIZE, 0); >>>> >>>> local_iov->iov_base = buf; >>>> local_iov->iov_len = n * PAGE_SIZE; >>>> remove_iov = ...; >>>> >>>> process_vm_writev(pid, &local_iov, 1, &remote_iov, 1 0); >>>> munmap(buf, n * PAGE_SIZE); >>>> >>>> (Note, that above completely ignores error handling) >>>> >>>> There are several problems with process_vm_writev() in this example: >>>> >>>> 1)it causes pagefault on remote process memory, and it forces >>>> allocation of a new page (if was not preallocated); >>> >>> I don't see how your new syscall helps. You're writing to remote >>> memory. If that memory wasn't allocated, it's going to get allocated >>> regardless of whether you use a write-like interface or an mmap-like >>> interface. >> >> No, the talk is not about just another interface for copying memory. >> The talk is about borrowing of remote task's VMA and corresponding >> page table's content. Syscall allows to copy part of page table >> with preallocated pages from remote to local process. See here: >> >> [task1] [task2] >> >> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE, >> MAP_PRIVATE|MAP_ANONYMOUS, ...); >> >> <task1 populates buf> >> >> buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...); >> munmap(buf); >> >> >> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2 >> just like in the way we do during fork syscall. > > If I understand this correctly, your intended use is to have one task > allocate memory and fill it, have the other task clone the VMA, and > have the first task free the VMA? If so, that wasn't at all obvious > from your original email. Yes, exactly this. Sorry for confusing in initial description, it's not intentionally. > Why don't you use splice() instead? I just don't see a possibility of anonymous memory may be moved from one process to another via splice(). Maybe you may explain your idea more detailed? > splice() the data to the remote > task and have the remove task read() it? All these VMA games will > result in a lot of flushes, which is bad for performance. Or, > depending on your exact constraints, you could map a memfd in both > tasks instead, which has the same flushing issues but at least has a > sensible API. 
memfd() is file-backed mapping, and it is not suitable for that. In case of a process had anonymous mapping before the migration, it wants the mapping remains the same after the migration. So, if we use memfd(), we have to copy the memory from memfd mapping to its real anonymous mapping target, which has the same problems as process_vm_writev(). >> >> There is no copying of buf memory content, unless COW happens. This is >> the principal difference to process_vm_writev(), which just allocates >> pages in remote VM. >> >>> Keep in mind that, on x86, just the hardware part of a >>> page fault is very slow -- populating the memory with a syscall >>> instead of a fault may well be faster. >> >> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous >> pages related to buf of task1 are swapped: >> >> 1)process_vm_writev() reads them back into memory; >> >> 2)process_vm_mmap() just copies swap PTEs from task1 page table >> to task2 page table. >> >> Also, for faster page faults one may use huge pages for the mappings. >> But really, it's funny to think about page faults, when there are >> disk IO problems I shown. > > What are you doing that is causing *disk* IO in any of this? I > suspect your real problem is that you are using far too large of a > buffer. See below. Imagine, we are migrating a container, which consists of 9 GB of pages, and we have 8GB RAM on destination node. Before the migration, we had some of pages in RAM and some of pages in swap. Source node sends pages to destination node. And there are limitations, which do not allow to start creation of process tree on the destination node, before all memory is received. Pages are received by some page server task on destination. After all pages are received, we create process tree and populate container tasks mappings. When we're populating tasks mapping, we have to copy memory from page server to a target task. In case of the pages were swapped from page server's address space, we have to read synchronously them from swap. This introduces big latency, and big IO I talked. > >> >>>> >>>> 2)amount of memory for this example is doubled in a moment -- >>>> n pages in current and n pages in remote tasks are occupied >>>> at the same time; >>> >>> This seems disingenuous. If you're writing p pages total in chunks of >>> n pages, you will use a total of p pages if you use mmap and p+n if >>> you use write. >> >> I didn't understand this sentence because of many ifs, sorry. Could you >> please explain your thought once again? > > You seem to have a function that tries to populate p pages of memory > with data received from a socket. It looks like you're doing > something like this: > > void copy_p_pages(size_t p) > { > size_t n = some_value(p); > char *buf = malloc(n * PAGE_SIZE); > for (int i = 0; i < p; i += n*PAGE_SIZE) { > read(fd, buf, n*PAGE_SIZE); /* check return value, etc */ > process_vm_writev(write n*PAGE_SIZE bytes to remote process); > } > free(buf); > } > > If you have a *constant* n (i.e. some_value(p) is just a number like > 16)), then you aren't doubling memory usage. If you have > some_value(p) return p, then you are indeed doubling memory usage. So > don't do that! > If buf is getting swapped out, you are very likely doing something > wrong. If you're using a 100MB buffer or a 10GB, then I'm not > surprised you have problems. Try something reasonable like 128kB. 
For > extra fun, you could mlock() that buf, but if you're thrashing on > access to a 128kB working set, you will probably also get your *code* > swapped out, in which case you pretty much lose. The thing is we can't use small buffer. We have to receive all the restored tasks pages on the destination node, before we start the process tree creation like I wrote above. All the anonymous memory is mapped into page server's MM, so it becomes swapped before container's process tree starts to create. >>> For example, if the remote VMA is MAP_ANONYMOUS, do you get >>> a CoW copy of it? I assume you don't since the whole point is to >>> write to remote memory >> >> But, no, there *is* COW semantic. We do not copy memory. We copy >> page table content. This is just the same we have on fork(), when >> children duplicates parent's VMA and related page table subset, >> and parent's PTEs lose _PAGE_RW flag. > > Then you need to document this very carefully, because other people > will use your syscall in different ways than you use it. Ok, I'll do. > And, if you are doing CoW like this, then your syscall is basically > only useful for your really weird use case in which you're using it to > import an already-populated VMA. Maybe this is a reasonable feature > to add to the kernel, but it needs a benchmark against a reasonable > alternative. Do you mean comparison with process_vm_writev/readv() or something like this? >> >> There is all copy_page_range() code reused for that. Please, see [3/7] >> for the details. > > You can't as users of a syscall to read the nitty gritty mm code to > figure out what the syscall does from a user's perspective. Yeah, sure :) >>> But there are plenty of other questions. >>> What happens if the remote VMA is a gate area or other special mapping >>> (vDSO, vvar area, etc)? What if the remote memory comes from a driver >>> that wasn't expecting the mapping to get magically copied to a >>> different process? >> >> In case of someone wants to duplicate such the mappings, we may consider >> that, and extend the interface in the future for VMA types, which are >> safe for that. > > Do you mean that the code you sent rejects this case? If so, please > document it. In any case, I looked at the code, and it seems to be > trying to handle MAP_SHARED and MAP_ANONYMOUS. I don't see where it > would reject copying a vDSO. I prohibit all the VMAs, which contain on of flags: VM_HUGETLB|VM_DONTEXPAND|VM_PFNMAP|VM_IO. I'll check carefully, whether it's enough for vDSO. Thanks, Kirill
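A small userspace illustration of the file-backed point above (a sketch, not part of the patchset): a memfd mapping shows up in /proc/self/maps as a /memfd: file mapping, so handing data to a task that must keep a plain MAP_ANONYMOUS mapping still requires a copy.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = memfd_create("restore-buf", 0);	/* needs glibc >= 2.27 */
	void *p;

	ftruncate(fd, 4096);
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	printf("mapped at %p; /proc/self/maps lists it as a /memfd:restore-buf mapping\n", p);
	/* Moving this data into a true anonymous private VMA of the target
	 * task still means copying it, which the proposal tries to avoid. */
	return 0;
}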
On 21.05.2019 20:28, Jann Horn wrote: > On Tue, May 21, 2019 at 7:04 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: >> On 21.05.2019 19:20, Jann Horn wrote: >>> On Tue, May 21, 2019 at 5:52 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: >>>> On 21.05.2019 17:43, Andy Lutomirski wrote: >>>>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: >>>>>> New syscall, which allows to clone a remote process VMA >>>>>> into local process VM. The remote process's page table >>>>>> entries related to the VMA are cloned into local process's >>>>>> page table (in any desired address, which makes this different >>>>>> from that happens during fork()). Huge pages are handled >>>>>> appropriately. >>> [...] >>>>>> There are several problems with process_vm_writev() in this example: >>>>>> >>>>>> 1)it causes pagefault on remote process memory, and it forces >>>>>> allocation of a new page (if was not preallocated); >>>>> >>>>> I don't see how your new syscall helps. You're writing to remote >>>>> memory. If that memory wasn't allocated, it's going to get allocated >>>>> regardless of whether you use a write-like interface or an mmap-like >>>>> interface. >>>> >>>> No, the talk is not about just another interface for copying memory. >>>> The talk is about borrowing of remote task's VMA and corresponding >>>> page table's content. Syscall allows to copy part of page table >>>> with preallocated pages from remote to local process. See here: >>>> >>>> [task1] [task2] >>>> >>>> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE, >>>> MAP_PRIVATE|MAP_ANONYMOUS, ...); >>>> >>>> <task1 populates buf> >>>> >>>> buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...); >>>> munmap(buf); >>>> >>>> >>>> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2 >>>> just like in the way we do during fork syscall. >>>> >>>> There is no copying of buf memory content, unless COW happens. This is >>>> the principal difference to process_vm_writev(), which just allocates >>>> pages in remote VM. >>>> >>>>> Keep in mind that, on x86, just the hardware part of a >>>>> page fault is very slow -- populating the memory with a syscall >>>>> instead of a fault may well be faster. >>>> >>>> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous >>>> pages related to buf of task1 are swapped: >>>> >>>> 1)process_vm_writev() reads them back into memory; >>>> >>>> 2)process_vm_mmap() just copies swap PTEs from task1 page table >>>> to task2 page table. >>>> >>>> Also, for faster page faults one may use huge pages for the mappings. >>>> But really, it's funny to think about page faults, when there are >>>> disk IO problems I shown. >>> [...] >>>>> That only doubles the amount of memory if you let n >>>>> scale linearly with p, which seems unlikely. >>>>> >>>>>> >>>>>> 3)received data has no a chance to be properly swapped for >>>>>> a long time. >>>>> >>>>> ... >>>>> >>>>>> a)kernel moves @buf pages into swap right after recv(); >>>>>> b)process_vm_writev() reads the data back from swap to pages; >>>>> >>>>> If you're under that much memory pressure and thrashing that badly, >>>>> your performance is going to be awful no matter what you're doing. If >>>>> you indeed observe this behavior under normal loads, then this seems >>>>> like a VM issue that should be addressed in its own right. >>>> >>>> I don't think so. Imagine: a container migrates from one node to another. >>>> The nodes are the same, say, every of them has 4GB of RAM. 
>>>> >>>> Before the migration, the container's tasks used 4GB of RAM and 8GB of swap. >>>> After the page server on the second node received the pages, we want these >>>> pages become swapped as soon as possible, and we don't want to read them from >>>> swap to pass a read consumer. >>> >>> But you don't have to copy that memory into the container's tasks all >>> at once, right? Can't you, every time you've received a few dozen >>> kilobytes of data or whatever, shove them into the target task? That >>> way you don't have problems with swap because the time before the data >>> has arrived in its final VMA is tiny. >> >> We try to maintain online migration with as small downtime as possible, >> and the container on source node is completely stopped at the very end. >> Memory of container tasks is copied in background without container >> completely stop, and _PAGE_SOFT_DIRTY is used to track dirty pages. >> >> Container may create any new processes during the migration, and these >> processes may contain any memory mappings. >> >> Imagine the situation. We migrate a big web server with a lot of processes, >> and some of children processes have the same COW mapping as parent has. >> In case of all memory dump is available at the moment of the grand parent >> web server process creation, we populate the mapping in parent, and all >> the children may inherit the mapping in case of they want after fork. >> COW works here. But in case of some processes are created before all memory >> is available on destination node, we can't do such the COW inheritance. >> This will be the reason, the memory consumed by container grows many >> times after the migration. So, the only solution is to create process >> tree after memory is available and all mappings are known. > > But if one of the processes modifies the memory after you've started > migrating it to the new machine, that memory can't be CoW anymore > anyway, right? So it should work if you first do a first pass of > copying the memory and creating the process hierarchy, and then copy > more recent changes into the individual processes, breaking the CoW > for those pages, right? Not so. We have to have all processes killed on source node, before creation the same process tree on destination machine. The process tree should be completely analyzed before a try of its recreation from ground. The analysis allows to choose the strategy and the sequence of each process creation and inheritance of entities like namespaces, mm, fdtables, etc. It's impossible to restore a process tree in case of you already have started to create it, but haven't stopped processes on source node. Also, we can restore only subset of process trees, but you never know what will happen with a live process tree in further. So, source process tree must be freezed before restore. A restore of arbitrary process tree in laws of all linux limitations on sequence of action to recreate an entity of a process makes this nontrivial mathematical problem, which has no a solution at the moment, and it's unknown whether it has a solution. So, at the moment of restore starts, we know all about all tasks and their relation ships. Yes, we start to copy memory from source to destination, when container on source is alive. But we track this memory with _PAGE_SOFT_DIRTY flag, and dirtied pages are repopulated with new content. Please, see the comment about hellish below. >> It's on of the examples. 
>> But believe me, there are a lot of another reasons, >> why process tree should be created only after all process tree is freezed, >> and no new tasks on source are possible. PGID and SSID inheritance, for >> example. All of this requires special order of tasks creation. In case of >> you try to restore process tree with correct namespaces and especial in >> case of many user namespaces in a container, you will just see like a hell >> will open before your eyes, and we never can think about this. > Could you elaborate on why that is so hellish? Because you never know how the system came into the state you're seeing at the moment. Even in a simple process chain like:

    task1
      |
    task2
      |
    task3
     / \
task4   task5

any of these processes may change its namespace, pgid or ssid, unshare its mm or files, become a child subreaper, or die and have its children reparented, and it may do any of this before or after its parent performed one of its own actions. None of these actions are independent of each other, and some actions prohibit others. You can restore the process tree only if you repeat the whole sequence in the same order the container originally performed it. It's impossible to say "task2, set your session to 5!", because the only way to set an ssid is to call the setsid() syscall, which may be done only once in a process's life. And setsid() itself imposes limitations on a later setpgid() call (see the code, if interested). That limitation does not mean we should always call setpgid() before it; no, there are no such simple rules. The same is true, even more so, for inheritance of namespaces, where some of a task's children may inherit the task's old namespaces, some may inherit the current ones, and some will inherit later ones. A child will be able to join a namespace, say a net ns, only if its userns allows it. This imposes another restriction on process creation order, but it still does not yield a stable rule for choosing the order in which to create the process tree. No, the only rule is "at each moment in time, one of the tasks must perform one of the above actions". This is a mathematical problem of ordering a finite number of actors under a finite number of rules and relationships between them, and the rules do not allow an unambiguous replay from scratch given only the end state. We behave well in the limited and likely process tree configurations, and that is enough for most cases. But the general problem is very difficult, and it is currently not proven that it even has a solution as a finite set of rules you could apply to restore any process tree. Kirill
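As a minimal illustration of the setsid()/setpgid() ordering constraint mentioned above (a standalone sketch, not taken from CRIU or from the patchset): once a task has made itself a session leader, it can no longer be placed into another process group, so a restorer that replays these calls in the wrong order simply cannot reach the recorded state.

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t child = fork();

	if (child == 0) {
		/* The child makes itself a session (and group) leader... */
		if (setsid() < 0)
			perror("setsid");
		pause();
		_exit(0);
	}

	sleep(1);	/* crude synchronization, enough for a sketch */

	/*
	 * ...so it can no longer be placed into an existing process group:
	 * setpgid() on a session leader (or across sessions) fails with EPERM.
	 */
	if (setpgid(child, getpgid(0)) < 0)
		printf("setpgid after setsid: %s\n", strerror(errno));

	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	return 0;
}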
On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote: > This patchset adds a new syscall, which makes possible > to clone a VMA from a process to current process. > The syscall supplements the functionality provided > by process_vm_writev() and process_vm_readv() syscalls, > and it may be useful in many situation. Kirill, could you explain how the change affects rmap and how it is safe. My concern is that the patchset allows to map the same page multiple times within one process or even map page allocated by child to the parrent. It was not allowed before. In the best case it makes reasoning about rmap substantially more difficult. But I'm worry it will introduce hard-to-debug bugs, like described in https://lwn.net/Articles/383162/. Note, that is some cases we care about rmap walk order (see for instance mremap() case). I'm not convinced that the feature will not break something in the area.
On 22.05.2019 18:22, Kirill A. Shutemov wrote: > On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote: >> This patchset adds a new syscall, which makes possible >> to clone a VMA from a process to current process. >> The syscall supplements the functionality provided >> by process_vm_writev() and process_vm_readv() syscalls, >> and it may be useful in many situation. > > Kirill, could you explain how the change affects rmap and how it is safe. > > My concern is that the patchset allows to map the same page multiple times > within one process or even map page allocated by child to the parrent. > > It was not allowed before. > > In the best case it makes reasoning about rmap substantially more difficult. I don't think there is a big impact from process relationships, because, as before, the main rule of VMA chaining is that VMAs are younger or older relative to each other. For example, the reuse of an anon_vma in anon_vma_clone() may happen for either children or siblings. Reparenting is also possible after one of the processes dies, as is splitting two branches of processes that share a grandparent into two chains after the grandparent dies, so many combinations are already possible today. Mapping the same page multiple times is a different thing, and it was never allowed for rmap. Could you please say more specifically what looks suspicious to you, and I'll try to answer that? Otherwise it's possible to write an explanation as big as a dissertation and still miss everything you are actually interested in :) > > But I'm worry it will introduce hard-to-debug bugs, like described in > https://lwn.net/Articles/383162/. I read the article, but there are a lot of messages in the thread, and I'm not sure I found the actual fix there. It looks like one of the fixes may be the use of anon_vma->root in __page_set_anon_rmap(). > Note, that is some cases we care about rmap walk order (see for instance > mremap() case). I'm not convinced that the feature will not break > something in the area. Yeah, thanks for pointing that out, I'll check it. Kirill
On Tue, May 21, 2019 at 10:44 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: > > On 21.05.2019 19:43, Andy Lutomirski wrote: > > On Tue, May 21, 2019 at 8:52 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: > >> > >> On 21.05.2019 17:43, Andy Lutomirski wrote: > > Do you mean that the code you sent rejects this case? If so, please > > document it. In any case, I looked at the code, and it seems to be > > trying to handle MAP_SHARED and MAP_ANONYMOUS. I don't see where it > > would reject copying a vDSO. > > I prohibit all the VMAs, which contain on of flags: VM_HUGETLB|VM_DONTEXPAND|VM_PFNMAP|VM_IO. > I'll check carefully, whether it's enough for vDSO. I think you could make the new syscall a lot more comprehensible by restricting it to just MAP_ANONYMOUS, by making it unmap the source, or possibly both. If the new syscall unmaps the source (in order so that the source is gone before the newly mapped pages become accessible), then you avoid issues in which you need to define sensible semantics for what happens if both copies are accessed simultaneously. --Andy
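For reference, the special mappings being discussed (the vDSO and friends) are visible to userspace in /proc/<pid>/maps, so a caller could conservatively skip them before attempting a remote clone. The helper below only illustrates that caller-side check; the patchset itself filters on VM_* flags inside the kernel, and the helper name here is made up.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

bool is_special_mapping(pid_t pid, unsigned long addr)
{
	char path[64], line[512];
	bool special = false;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/maps", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return true;	/* be conservative on error */

	while (fgets(line, sizeof(line), f)) {
		unsigned long start, end;

		if (sscanf(line, "%lx-%lx", &start, &end) != 2)
			continue;
		if (addr < start || addr >= end)
			continue;
		/* Skip mappings a remote clone should not touch. */
		special = strstr(line, "[vdso]") || strstr(line, "[vvar]") ||
			  strstr(line, "[vsyscall]") || strstr(line, "[stack]");
		break;
	}
	fclose(f);
	return special;
}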
On 23.05.2019 19:19, Andy Lutomirski wrote: > On Tue, May 21, 2019 at 10:44 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: >> >> On 21.05.2019 19:43, Andy Lutomirski wrote: >>> On Tue, May 21, 2019 at 8:52 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: >>>> >>>> On 21.05.2019 17:43, Andy Lutomirski wrote: > >>> Do you mean that the code you sent rejects this case? If so, please >>> document it. In any case, I looked at the code, and it seems to be >>> trying to handle MAP_SHARED and MAP_ANONYMOUS. I don't see where it >>> would reject copying a vDSO. >> >> I prohibit all the VMAs, which contain on of flags: VM_HUGETLB|VM_DONTEXPAND|VM_PFNMAP|VM_IO. >> I'll check carefully, whether it's enough for vDSO. > > I think you could make the new syscall a lot more comprehensible by > restricting it to just MAP_ANONYMOUS, by making it unmap the source, > or possibly both. If the new syscall unmaps the source (in order so > that the source is gone before the newly mapped pages become > accessible), then you avoid issues in which you need to define > sensible semantics for what happens if both copies are accessed > simultaneously. If we unmap the source, this does not introduce the fundamentally new behavior of the same page being mapped twice in a single process that Kirill pointed out. This sounds like a good idea, and it covers my application area. The only fundamentally new thing is that a child process will be able to inherit a parent's VMA, which is not possible now. But it looks like we never depend on process relationships in the mapping code, and process reparenting already produces many combinations, so the new change should not affect this much. Kirill
On 22.05.2019 18:22, Kirill A. Shutemov wrote: > On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote: >> This patchset adds a new syscall, which makes possible >> to clone a VMA from a process to current process. >> The syscall supplements the functionality provided >> by process_vm_writev() and process_vm_readv() syscalls, >> and it may be useful in many situation. > > Kirill, could you explain how the change affects rmap and how it is safe. > > My concern is that the patchset allows to map the same page multiple times > within one process or even map page allocated by child to the parrent. > > It was not allowed before. > > In the best case it makes reasoning about rmap substantially more difficult. > > But I'm worry it will introduce hard-to-debug bugs, like described in > https://lwn.net/Articles/383162/. Andy suggested to unmap PTEs from source page table, and this make the single page never be mapped in the same process twice. This is OK for my use case, and here we will just do a small step "allow to inherit VMA by a child process", which we didn't have before this. If someone still needs to continue the work to allow the same page be mapped twice in a single process in the future, this person will have a supported basis we do in this small step. I believe, someone like debugger may want to have this to make a fast snapshot of a process private memory (when the task is stopped for a small time to get its memory). But for me remapping is enough at the moment. What do you think about this? [...] Kirill
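A rough sketch of the debugger "fast snapshot" use case mentioned above. The raw syscall invocation and the syscall number are placeholders (there is no allocated number or glibc wrapper), the argument order follows the cover letter's example, and whether the snapshot keeps CoW semantics is exactly what is being discussed in this thread.

#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef __NR_process_vm_mmap
#define __NR_process_vm_mmap 436	/* placeholder, not an allocated number */
#endif

void *snapshot_region(pid_t pid, void *remote_addr, size_t len)
{
	long ret;

	/* Stop the target briefly so the snapshot is consistent. */
	if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) < 0)
		return NULL;
	waitpid(pid, NULL, 0);

	/* Clone the remote VMA's page table entries into our own mm. */
	ret = syscall(__NR_process_vm_mmap, pid, remote_addr, len, NULL, 0UL);

	ptrace(PTRACE_DETACH, pid, NULL, NULL);
	return ret == -1 ? NULL : (void *)ret;
}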
On Fri, May 24, 2019 at 01:45:50PM +0300, Kirill Tkhai wrote: > On 22.05.2019 18:22, Kirill A. Shutemov wrote: > > On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote: > >> This patchset adds a new syscall, which makes possible > >> to clone a VMA from a process to current process. > >> The syscall supplements the functionality provided > >> by process_vm_writev() and process_vm_readv() syscalls, > >> and it may be useful in many situation. > > > > Kirill, could you explain how the change affects rmap and how it is safe. > > > > My concern is that the patchset allows to map the same page multiple times > > within one process or even map page allocated by child to the parrent. > > > > It was not allowed before. > > > > In the best case it makes reasoning about rmap substantially more difficult. > > > > But I'm worry it will introduce hard-to-debug bugs, like described in > > https://lwn.net/Articles/383162/. > > Andy suggested to unmap PTEs from source page table, and this make the single > page never be mapped in the same process twice. This is OK for my use case, > and here we will just do a small step "allow to inherit VMA by a child process", > which we didn't have before this. If someone still needs to continue the work > to allow the same page be mapped twice in a single process in the future, this > person will have a supported basis we do in this small step. I believe, someone > like debugger may want to have this to make a fast snapshot of a process private > memory (when the task is stopped for a small time to get its memory). But for > me remapping is enough at the moment. > > What do you think about this? I don't think that unmapping alone will do. Consider the following scenario: 1. Task A creates and populates the mapping. 2. Task A forks. We have now Task B mapping the same pages, but write-protected. 3. Task B calls process_vm_mmap() and passes the mapping to the parent. After this Task A will have the same anon pages mapped twice. One possible way out would be to force CoW on all pages in the mapping, before passing the mapping to the new process. Thanks, Kirill.
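The same double-mapping problem can be spelled out using only the "pull into the local process" direction from the cover letter's example (a sketch; the process_vm_mmap() prototype below is hypothetical). After fork(), parent and child share the same write-protected anonymous pages, so pulling the child's copy back into the parent leaves the parent with those pages mapped at two addresses:

#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical wrapper around the proposed syscall. */
extern void *process_vm_mmap(pid_t pid, void *remote_addr, size_t len,
			     void *local_addr, unsigned long flags);

int main(void)
{
	size_t len = 16 * 4096;

	/* 1. Task A creates and populates the mapping. */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(buf, 0x5a, len);

	/* 2. Task A forks; Task B now maps the same pages, write-protected. */
	pid_t child = fork();
	if (child == 0) {
		pause();
		_exit(0);
	}

	/* 3. Task A clones Task B's VMA into itself: the same anon pages
	 * end up mapped at two different addresses in Task A. */
	char *dup = process_vm_mmap(child, buf, len, NULL, 0);
	(void)dup;

	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	return 0;
}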
On 24.05.2019 14:52, Kirill A. Shutemov wrote: > On Fri, May 24, 2019 at 01:45:50PM +0300, Kirill Tkhai wrote: >> On 22.05.2019 18:22, Kirill A. Shutemov wrote: >>> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote: >>>> This patchset adds a new syscall, which makes possible >>>> to clone a VMA from a process to current process. >>>> The syscall supplements the functionality provided >>>> by process_vm_writev() and process_vm_readv() syscalls, >>>> and it may be useful in many situation. >>> >>> Kirill, could you explain how the change affects rmap and how it is safe. >>> >>> My concern is that the patchset allows to map the same page multiple times >>> within one process or even map page allocated by child to the parrent. >>> >>> It was not allowed before. >>> >>> In the best case it makes reasoning about rmap substantially more difficult. >>> >>> But I'm worry it will introduce hard-to-debug bugs, like described in >>> https://lwn.net/Articles/383162/. >> >> Andy suggested to unmap PTEs from source page table, and this make the single >> page never be mapped in the same process twice. This is OK for my use case, >> and here we will just do a small step "allow to inherit VMA by a child process", >> which we didn't have before this. If someone still needs to continue the work >> to allow the same page be mapped twice in a single process in the future, this >> person will have a supported basis we do in this small step. I believe, someone >> like debugger may want to have this to make a fast snapshot of a process private >> memory (when the task is stopped for a small time to get its memory). But for >> me remapping is enough at the moment. >> >> What do you think about this? > > I don't think that unmapping alone will do. Consider the following > scenario: > > 1. Task A creates and populates the mapping. > 2. Task A forks. We have now Task B mapping the same pages, but > write-protected. > 3. Task B calls process_vm_mmap() and passes the mapping to the parent. > > After this Task A will have the same anon pages mapped twice. Ah, sure. > One possible way out would be to force CoW on all pages in the mapping, > before passing the mapping to the new process. This will pop all swapped pages up, which is the thing the patchset aims to prevent. Hm, what about allow remapping only VMA, which anon_vma::rb_root contain only chain and which vma->anon_vma_chain contains single entry? This is a vma, which were faulted, but its mm never were duplicated (or which forks already died). Thanks, Kirill
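A hypothetical helper (not part of the patchset) sketching what the proposed check could look like on the kernel side, assuming the caller holds the relevant mmap_sem: the VMA must have faulted at least once, must be the only entry on its anon_vma_chain, and its anon_vma's interval tree must contain a single entry, i.e. the mm was never duplicated (or all forks have already exited).

/* Hypothetical helper name; assumes mmap_sem of vma->vm_mm is held. */
static bool vma_never_shared(struct vm_area_struct *vma)
{
	struct anon_vma *anon_vma = vma->anon_vma;
	struct rb_node *first;

	if (!anon_vma)
		return false;		/* never faulted */

	/* Only one anon_vma_chain entry: no anon_vma inherited via fork(). */
	if (!list_is_singular(&vma->anon_vma_chain))
		return false;

	/* And the anon_vma's interval tree holds exactly one entry. */
	first = rb_first_cached(&anon_vma->rb_root);
	return first && !rb_next(first);
}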
On Fri, May 24, 2019 at 05:00:32PM +0300, Kirill Tkhai wrote: > On 24.05.2019 14:52, Kirill A. Shutemov wrote: > > On Fri, May 24, 2019 at 01:45:50PM +0300, Kirill Tkhai wrote: > >> On 22.05.2019 18:22, Kirill A. Shutemov wrote: > >>> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote: > >>>> This patchset adds a new syscall, which makes possible > >>>> to clone a VMA from a process to current process. > >>>> The syscall supplements the functionality provided > >>>> by process_vm_writev() and process_vm_readv() syscalls, > >>>> and it may be useful in many situation. > >>> > >>> Kirill, could you explain how the change affects rmap and how it is safe. > >>> > >>> My concern is that the patchset allows to map the same page multiple times > >>> within one process or even map page allocated by child to the parrent. > >>> > >>> It was not allowed before. > >>> > >>> In the best case it makes reasoning about rmap substantially more difficult. > >>> > >>> But I'm worry it will introduce hard-to-debug bugs, like described in > >>> https://lwn.net/Articles/383162/. > >> > >> Andy suggested to unmap PTEs from source page table, and this make the single > >> page never be mapped in the same process twice. This is OK for my use case, > >> and here we will just do a small step "allow to inherit VMA by a child process", > >> which we didn't have before this. If someone still needs to continue the work > >> to allow the same page be mapped twice in a single process in the future, this > >> person will have a supported basis we do in this small step. I believe, someone > >> like debugger may want to have this to make a fast snapshot of a process private > >> memory (when the task is stopped for a small time to get its memory). But for > >> me remapping is enough at the moment. > >> > >> What do you think about this? > > > > I don't think that unmapping alone will do. Consider the following > > scenario: > > > > 1. Task A creates and populates the mapping. > > 2. Task A forks. We have now Task B mapping the same pages, but > > write-protected. > > 3. Task B calls process_vm_mmap() and passes the mapping to the parent. > > > > After this Task A will have the same anon pages mapped twice. > > Ah, sure. > > > One possible way out would be to force CoW on all pages in the mapping, > > before passing the mapping to the new process. > > This will pop all swapped pages up, which is the thing the patchset aims > to prevent. > > Hm, what about allow remapping only VMA, which anon_vma::rb_root contain > only chain and which vma->anon_vma_chain contains single entry? This is > a vma, which were faulted, but its mm never were duplicated (or which > forks already died). The requirement for the VMA to be faulted (have any pages mapped) looks excessive to me, but the general idea may work. One issue I see is that userspace may not have full control to create such VMA. vma_merge() can merge the VMA to the next one without any consent from userspace and you'll get anon_vma inherited from the VMA you've justed merged with. I don't have any valid idea on how to get around this.
On 28.05.2019 02:30, Kirill A. Shutemov wrote: > On Fri, May 24, 2019 at 05:00:32PM +0300, Kirill Tkhai wrote: >> On 24.05.2019 14:52, Kirill A. Shutemov wrote: >>> On Fri, May 24, 2019 at 01:45:50PM +0300, Kirill Tkhai wrote: >>>> On 22.05.2019 18:22, Kirill A. Shutemov wrote: >>>>> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote: >>>>>> This patchset adds a new syscall, which makes possible >>>>>> to clone a VMA from a process to current process. >>>>>> The syscall supplements the functionality provided >>>>>> by process_vm_writev() and process_vm_readv() syscalls, >>>>>> and it may be useful in many situation. >>>>> >>>>> Kirill, could you explain how the change affects rmap and how it is safe. >>>>> >>>>> My concern is that the patchset allows to map the same page multiple times >>>>> within one process or even map page allocated by child to the parrent. >>>>> >>>>> It was not allowed before. >>>>> >>>>> In the best case it makes reasoning about rmap substantially more difficult. >>>>> >>>>> But I'm worry it will introduce hard-to-debug bugs, like described in >>>>> https://lwn.net/Articles/383162/. >>>> >>>> Andy suggested to unmap PTEs from source page table, and this make the single >>>> page never be mapped in the same process twice. This is OK for my use case, >>>> and here we will just do a small step "allow to inherit VMA by a child process", >>>> which we didn't have before this. If someone still needs to continue the work >>>> to allow the same page be mapped twice in a single process in the future, this >>>> person will have a supported basis we do in this small step. I believe, someone >>>> like debugger may want to have this to make a fast snapshot of a process private >>>> memory (when the task is stopped for a small time to get its memory). But for >>>> me remapping is enough at the moment. >>>> >>>> What do you think about this? >>> >>> I don't think that unmapping alone will do. Consider the following >>> scenario: >>> >>> 1. Task A creates and populates the mapping. >>> 2. Task A forks. We have now Task B mapping the same pages, but >>> write-protected. >>> 3. Task B calls process_vm_mmap() and passes the mapping to the parent. >>> >>> After this Task A will have the same anon pages mapped twice. >> >> Ah, sure. >> >>> One possible way out would be to force CoW on all pages in the mapping, >>> before passing the mapping to the new process. >> >> This will pop all swapped pages up, which is the thing the patchset aims >> to prevent. >> >> Hm, what about allow remapping only VMA, which anon_vma::rb_root contain >> only chain and which vma->anon_vma_chain contains single entry? This is >> a vma, which were faulted, but its mm never were duplicated (or which >> forks already died). > > The requirement for the VMA to be faulted (have any pages mapped) looks > excessive to me, but the general idea may work. > > One issue I see is that userspace may not have full control to create such > VMA. vma_merge() can merge the VMA to the next one without any consent > from userspace and you'll get anon_vma inherited from the VMA you've > justed merged with. > > I don't have any valid idea on how to get around this. Technically it is possible by creating boundary 1-page VMAs with another protection: one above and one below the desired region, then map the desired mapping. But this is not comfortable. I don't think it's difficult to find a natural limitation, which prevents mapping a single page twice if we want to avoid this at least on start. 
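For illustration, a userspace sketch of the boundary-VMA trick just described: carving a region out of a larger PROT_NONE reservation leaves guard pages with different protection on both sides, so vma_merge() cannot glue the middle VMA to any neighbour and it keeps its own anon_vma. Error handling is mostly omitted.

#include <sys/mman.h>
#include <unistd.h>

void *mmap_unmergeable(size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	char *area;

	/* Reserve guard + region + guard in one go, then carve it up. */
	area = mmap(NULL, len + 2 * page, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED)
		return NULL;

	/*
	 * The middle part becomes the usable read-write mapping; the
	 * PROT_NONE pages around it have different vm_flags, so vma_merge()
	 * will not merge the middle VMA with anything else.
	 */
	if (mprotect(area + page, len, PROT_READ | PROT_WRITE) != 0)
		return NULL;

	return area + page;
}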
Another suggestion: prohibit to map a remote process's VMA only in case of its vm_area_struct::anon_vma::root is the same as root of one of local process's VMA. What about this? Thanks, Kirill
On Tue, May 28, 2019 at 12:15:16PM +0300, Kirill Tkhai wrote: > On 28.05.2019 02:30, Kirill A. Shutemov wrote: > > On Fri, May 24, 2019 at 05:00:32PM +0300, Kirill Tkhai wrote: > >> On 24.05.2019 14:52, Kirill A. Shutemov wrote: > >>> On Fri, May 24, 2019 at 01:45:50PM +0300, Kirill Tkhai wrote: > >>>> On 22.05.2019 18:22, Kirill A. Shutemov wrote: > >>>>> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote: > >>>>>> This patchset adds a new syscall, which makes possible > >>>>>> to clone a VMA from a process to current process. > >>>>>> The syscall supplements the functionality provided > >>>>>> by process_vm_writev() and process_vm_readv() syscalls, > >>>>>> and it may be useful in many situation. > >>>>> > >>>>> Kirill, could you explain how the change affects rmap and how it is safe. > >>>>> > >>>>> My concern is that the patchset allows to map the same page multiple times > >>>>> within one process or even map page allocated by child to the parrent. > >>>>> > >>>>> It was not allowed before. > >>>>> > >>>>> In the best case it makes reasoning about rmap substantially more difficult. > >>>>> > >>>>> But I'm worry it will introduce hard-to-debug bugs, like described in > >>>>> https://lwn.net/Articles/383162/. > >>>> > >>>> Andy suggested to unmap PTEs from source page table, and this make the single > >>>> page never be mapped in the same process twice. This is OK for my use case, > >>>> and here we will just do a small step "allow to inherit VMA by a child process", > >>>> which we didn't have before this. If someone still needs to continue the work > >>>> to allow the same page be mapped twice in a single process in the future, this > >>>> person will have a supported basis we do in this small step. I believe, someone > >>>> like debugger may want to have this to make a fast snapshot of a process private > >>>> memory (when the task is stopped for a small time to get its memory). But for > >>>> me remapping is enough at the moment. > >>>> > >>>> What do you think about this? > >>> > >>> I don't think that unmapping alone will do. Consider the following > >>> scenario: > >>> > >>> 1. Task A creates and populates the mapping. > >>> 2. Task A forks. We have now Task B mapping the same pages, but > >>> write-protected. > >>> 3. Task B calls process_vm_mmap() and passes the mapping to the parent. > >>> > >>> After this Task A will have the same anon pages mapped twice. > >> > >> Ah, sure. > >> > >>> One possible way out would be to force CoW on all pages in the mapping, > >>> before passing the mapping to the new process. > >> > >> This will pop all swapped pages up, which is the thing the patchset aims > >> to prevent. > >> > >> Hm, what about allow remapping only VMA, which anon_vma::rb_root contain > >> only chain and which vma->anon_vma_chain contains single entry? This is > >> a vma, which were faulted, but its mm never were duplicated (or which > >> forks already died). > > > > The requirement for the VMA to be faulted (have any pages mapped) looks > > excessive to me, but the general idea may work. > > > > One issue I see is that userspace may not have full control to create such > > VMA. vma_merge() can merge the VMA to the next one without any consent > > from userspace and you'll get anon_vma inherited from the VMA you've > > justed merged with. > > > > I don't have any valid idea on how to get around this. > > Technically it is possible by creating boundary 1-page VMAs with another protection: > one above and one below the desired region, then map the desired mapping. 
But this > is not comfortable. > > I don't think it's difficult to find a natural limitation, which prevents mapping > a single page twice if we want to avoid this at least on start. Another suggestion: > > prohibit to map a remote process's VMA only in case of its vm_area_struct::anon_vma::root > is the same as root of one of local process's VMA. > > What about this? I don't see anything immediately wrong with this, but it's still going to produce puzzling errors for a user. How would you document such limitation in the way it makes sense for userspace developer?
On 28.05.2019 19:15, Kirill A. Shutemov wrote: > On Tue, May 28, 2019 at 12:15:16PM +0300, Kirill Tkhai wrote: >> On 28.05.2019 02:30, Kirill A. Shutemov wrote: >>> On Fri, May 24, 2019 at 05:00:32PM +0300, Kirill Tkhai wrote: >>>> On 24.05.2019 14:52, Kirill A. Shutemov wrote: >>>>> On Fri, May 24, 2019 at 01:45:50PM +0300, Kirill Tkhai wrote: >>>>>> On 22.05.2019 18:22, Kirill A. Shutemov wrote: >>>>>>> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote: >>>>>>>> This patchset adds a new syscall, which makes possible >>>>>>>> to clone a VMA from a process to current process. >>>>>>>> The syscall supplements the functionality provided >>>>>>>> by process_vm_writev() and process_vm_readv() syscalls, >>>>>>>> and it may be useful in many situation. >>>>>>> >>>>>>> Kirill, could you explain how the change affects rmap and how it is safe. >>>>>>> >>>>>>> My concern is that the patchset allows to map the same page multiple times >>>>>>> within one process or even map page allocated by child to the parrent. >>>>>>> >>>>>>> It was not allowed before. >>>>>>> >>>>>>> In the best case it makes reasoning about rmap substantially more difficult. >>>>>>> >>>>>>> But I'm worry it will introduce hard-to-debug bugs, like described in >>>>>>> https://lwn.net/Articles/383162/. >>>>>> >>>>>> Andy suggested to unmap PTEs from source page table, and this make the single >>>>>> page never be mapped in the same process twice. This is OK for my use case, >>>>>> and here we will just do a small step "allow to inherit VMA by a child process", >>>>>> which we didn't have before this. If someone still needs to continue the work >>>>>> to allow the same page be mapped twice in a single process in the future, this >>>>>> person will have a supported basis we do in this small step. I believe, someone >>>>>> like debugger may want to have this to make a fast snapshot of a process private >>>>>> memory (when the task is stopped for a small time to get its memory). But for >>>>>> me remapping is enough at the moment. >>>>>> >>>>>> What do you think about this? >>>>> >>>>> I don't think that unmapping alone will do. Consider the following >>>>> scenario: >>>>> >>>>> 1. Task A creates and populates the mapping. >>>>> 2. Task A forks. We have now Task B mapping the same pages, but >>>>> write-protected. >>>>> 3. Task B calls process_vm_mmap() and passes the mapping to the parent. >>>>> >>>>> After this Task A will have the same anon pages mapped twice. >>>> >>>> Ah, sure. >>>> >>>>> One possible way out would be to force CoW on all pages in the mapping, >>>>> before passing the mapping to the new process. >>>> >>>> This will pop all swapped pages up, which is the thing the patchset aims >>>> to prevent. >>>> >>>> Hm, what about allow remapping only VMA, which anon_vma::rb_root contain >>>> only chain and which vma->anon_vma_chain contains single entry? This is >>>> a vma, which were faulted, but its mm never were duplicated (or which >>>> forks already died). >>> >>> The requirement for the VMA to be faulted (have any pages mapped) looks >>> excessive to me, but the general idea may work. >>> >>> One issue I see is that userspace may not have full control to create such >>> VMA. vma_merge() can merge the VMA to the next one without any consent >>> from userspace and you'll get anon_vma inherited from the VMA you've >>> justed merged with. >>> >>> I don't have any valid idea on how to get around this. 
>> >> Technically it is possible by creating boundary 1-page VMAs with another protection: >> one above and one below the desired region, then map the desired mapping. But this >> is not comfortable. >> >> I don't think it's difficult to find a natural limitation, which prevents mapping >> a single page twice if we want to avoid this at least on start. Another suggestion: >> >> prohibit to map a remote process's VMA only in case of its vm_area_struct::anon_vma::root >> is the same as root of one of local process's VMA. >> >> What about this? > > I don't see anything immediately wrong with this, but it's still going to > produce puzzling errors for a user. How would you document such limitation > in the way it makes sense for userspace developer? It's difficult, since the limitation is artificial. I can only suggest an even stricter limitation, something like: "A VMA may be remapped only as a whole region, and only if there was no fork() after the VMA appeared in the process (via mmap or via remapping from another remote process). If the VMA has been merged with a neighbouring VMA, the same rules apply to the neighbours."

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e8834ac32b7..0bcd6f598e73 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -287,13 +287,17 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
+#define VM_MAY_REMOTE_REMAP	VM_HIGH_ARCH_5
+
 #ifdef CONFIG_ARCH_HAS_PKEYS
 # define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
 # define VM_PKEY_BIT0	VM_HIGH_ARCH_0	/* A protection key is a 4-bit value */
diff --git a/kernel/fork.c b/kernel/fork.c
index ff4efd16fd82..a3c758c8cd54 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -584,8 +584,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		if (!(tmp->vm_flags & VM_WIPEONFORK))
+		if (!(tmp->vm_flags & VM_WIPEONFORK)) {
 			retval = copy_page_range(mm, oldmm, mpnt);
+			mpnt->vm_flags &= ~VM_MAY_REMOTE_REMAP;
+		}
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
On 22.05.2019 18:22, Kirill A. Shutemov wrote: > On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote: >> This patchset adds a new syscall, which makes possible >> to clone a VMA from a process to current process. >> The syscall supplements the functionality provided >> by process_vm_writev() and process_vm_readv() syscalls, >> and it may be useful in many situation. > > Kirill, could you explain how the change affects rmap and how it is safe. > > My concern is that the patchset allows to map the same page multiple times > within one process or even map page allocated by child to the parrent. Speaking honestly, we already support this model, since ZERO_PAGE() may be mapped multiply times in any number of mappings. Kirill
On 03.06.2019 17:38, Kirill Tkhai wrote: > On 22.05.2019 18:22, Kirill A. Shutemov wrote: >> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote: >>> This patchset adds a new syscall, which makes possible >>> to clone a VMA from a process to current process. >>> The syscall supplements the functionality provided >>> by process_vm_writev() and process_vm_readv() syscalls, >>> and it may be useful in many situation. >> >> Kirill, could you explain how the change affects rmap and how it is safe. >> >> My concern is that the patchset allows to map the same page multiple times >> within one process or even map page allocated by child to the parrent. > > Speaking honestly, we already support this model, since ZERO_PAGE() may > be mapped multiply times in any number of mappings. Picking of huge_zero_page and mremapping its VMA to unaligned address also gives the case, when the same huge page is mapped as huge page and as set of ordinary pages in the same process. Summing up two above cases, is there really a fundamental problem with the functionality the patch set introduces? It looks like we already have these cases in stable kernel supported. Thanks, Kirill
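For completeness, the setup being described can be reproduced from userspace roughly as follows (a sketch; whether the run actually ends up with the huge zero page mapped both as a PMD and as ordinary PTEs depends on THP configuration, e.g. transparent_hugepage/use_zero_page, and on how the kernel handles the move -- which is exactly the point under discussion):

#define _GNU_SOURCE
#include <stdint.h>
#include <sys/mman.h>

#define HPAGE (2UL << 20)

int main(void)
{
	/* Over-allocate so a 2MB-aligned start can be picked inside. */
	char *raw = mmap(NULL, 4 * HPAGE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *aligned = (char *)(((uintptr_t)raw + HPAGE - 1) & ~(HPAGE - 1));
	volatile char c;
	char *target;

	madvise(aligned, HPAGE, MADV_HUGEPAGE);
	c = aligned[0];			/* read fault -> huge zero page */
	(void)c;

	/* Move the region to an address that is only 4KB-aligned. */
	target = aligned + 2 * HPAGE + 4096;
	mremap(aligned, HPAGE, HPAGE, MREMAP_MAYMOVE | MREMAP_FIXED, target);
	return 0;
}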
On Mon, Jun 03, 2019 at 05:56:32PM +0300, Kirill Tkhai wrote: > On 03.06.2019 17:38, Kirill Tkhai wrote: > > On 22.05.2019 18:22, Kirill A. Shutemov wrote: > >> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote: > >>> This patchset adds a new syscall, which makes possible > >>> to clone a VMA from a process to current process. > >>> The syscall supplements the functionality provided > >>> by process_vm_writev() and process_vm_readv() syscalls, > >>> and it may be useful in many situation. > >> > >> Kirill, could you explain how the change affects rmap and how it is safe. > >> > >> My concern is that the patchset allows to map the same page multiple times > >> within one process or even map page allocated by child to the parrent. > > > > Speaking honestly, we already support this model, since ZERO_PAGE() may > > be mapped multiply times in any number of mappings. > > Picking of huge_zero_page and mremapping its VMA to unaligned address also gives > the case, when the same huge page is mapped as huge page and as set of ordinary > pages in the same process. > > Summing up two above cases, is there really a fundamental problem with > the functionality the patch set introduces? It looks like we already have > these cases in stable kernel supported. It *might* work. But it requires a lot of audit to prove that it actually *does* work. For instance, are you sure it will not break KSM? What does it mean for memory accounting? memcg? My point is that you breaking long standing invariant in Linux MM and it has to be properly justified. I would expect to see some strange deadlocks or permanent trylock failure as result of such change.
On 03.06.2019 20:47, Kirill A. Shutemov wrote: > On Mon, Jun 03, 2019 at 05:56:32PM +0300, Kirill Tkhai wrote: >> On 03.06.2019 17:38, Kirill Tkhai wrote: >>> On 22.05.2019 18:22, Kirill A. Shutemov wrote: >>>> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote: >>>>> This patchset adds a new syscall, which makes possible >>>>> to clone a VMA from a process to current process. >>>>> The syscall supplements the functionality provided >>>>> by process_vm_writev() and process_vm_readv() syscalls, >>>>> and it may be useful in many situation. >>>> >>>> Kirill, could you explain how the change affects rmap and how it is safe. >>>> >>>> My concern is that the patchset allows to map the same page multiple times >>>> within one process or even map page allocated by child to the parrent. >>> >>> Speaking honestly, we already support this model, since ZERO_PAGE() may >>> be mapped multiply times in any number of mappings. >> >> Picking of huge_zero_page and mremapping its VMA to unaligned address also gives >> the case, when the same huge page is mapped as huge page and as set of ordinary >> pages in the same process. >> >> Summing up two above cases, is there really a fundamental problem with >> the functionality the patch set introduces? It looks like we already have >> these cases in stable kernel supported. > > It *might* work. But it requires a lot of audit to prove that it actually > *does* work. Please give me an idea of what the audit results should look like for you. If I hadn't done some audit before preparing the patchset, I wouldn't have sent it. So, please describe what you expect here. > For instance, are you sure it will not break KSM? Yes, it does not break KSM. The main point is that with KSM we may already have not just a page mapped twice in a single process, but even a page mapped twice in a single VMA, and that is just a particular case of what is already supported. (An ordinary page still can't be mapped twice in a single VMA, since the pgoff difference won't allow two such hunks to be merged together.) The general rule of KSM is "anything may happen to a page at any time, and all of it will be reflected in the stable and unstable trees and in rmap_items some time later". Pages of a duplicated VMA will be interpreted as a KSM fork, and the corresponding checks in unstable_tree_search_insert() and stable_tree_search() provide for that. When both source and destination VMAs are mergeable: 1)if the page was added to the stable tree before the duplication of the related VMA, then while scanning the destination VMA in cmp_and_merge_page() it will be detected as a duplicate, and we simply add the related rmap_item to the stable node chain; 2)if the page was added to the unstable tree before the duplication of the related VMA, and it remains there, then the page will be detected as a duplicate in the destination VMA, and the scan of the page will be skipped until the next pass; 3)if the page was not added to any tree before the duplication, it may be added to one of the trees later and is then handled by one of the two rules above. When one of the source or destination VMAs is not mergeable, while a page becomes PageKsm() during the scan of the other one, the unmergeable VMA ends up referring to a PageKsm() page, which has no rmap_item. But it is still possible to unmap that page from the unmergeable VMA, since rmap_walk_ksm() walks all anon_vmas under the rb_root. This is just the same as what happens when a process forks and its child makes the VMA unmergeable. > What does it mean for memory accounting? memcg? 
The memcg, once assigned, remains the same after the VMA duplication. The mapped page range advances the counters in vm_stat_account(). Since we keep fork() semantics, the same thing happens as after fork()+mremap(). > My point is that you breaking long standing invariant in Linux MM and it > has to be properly justified. I'm not against that. Please say which form of justification you expect. I assume you do not mean a retelling of every line of existing code, because then the words would take ten times more space than the code, which is just not humanly possible. Please make a specific request for what you expect and what it should look like. > I would expect to see some strange deadlocks or permanent trylock failure > as result of such change. Are you hinting at some specific area? Do you expect me to run some specific test cases? Do you want us to add some debugging engine on top of page locking to detect such trylock failures? Thanks, Kirill
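For reference, the "mergeable" VMAs discussed in the KSM part above are simply ranges handed to ksmd with madvise(); a minimal userspace sketch (ksmd must be enabled via /sys/kernel/mm/ksm/run for anything to happen):

#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64 * 4096;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	memset(buf, 0x41, len);			/* identical page content */
	madvise(buf, len, MADV_MERGEABLE);	/* hand the range to ksmd */

	/* ksmd may now merge these pages with identical pages elsewhere;
	 * such a VMA is "mergeable" in the sense used above. */
	return 0;
}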
v2: Add PVMMAP_FIXED_NOREPLACE flag. Use the find_vma_without_flags() and may_mmap_overlapped_region() helpers, so even more code is reused. The syscall number is changed. Fix whitespace. Cloning from the local process to a remote process is now prohibited; only mapping into the local process mm is allowed, since I initially missed that get_unmapped_area() can't be used for a remote process. This could be solved very simply by passing an @mm argument to all .get_unmapped_area handlers. I don't do that in this patchset, since it produces a lot of cleanup patches that hide the main logic away. I'm going to send them later, as another series, after we finish with this.

[Summary] New syscall, which allows cloning a remote process's VMA into the local process's VM. The remote process's page table entries related to the VMA are cloned into the local process's page table (at any desired address, which makes this different from what happens during fork()). Huge pages are handled appropriately. This allows a significant performance improvement, as shown in the example below.

[Description] This patchset adds a new syscall, which makes it possible to clone a VMA from a process into the current process. The syscall supplements the functionality provided by the process_vm_writev() and process_vm_readv() syscalls, and it may be useful in many situations. For example, it allows making a zero copy of data where process_vm_writev() was previously used:

    struct iovec local_iov, remote_iov;
    void *buf;

    buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
               MAP_PRIVATE|MAP_ANONYMOUS, ...);
    recv(sock, buf, n * PAGE_SIZE, 0);

    local_iov->iov_base = buf;
    local_iov->iov_len = n * PAGE_SIZE;
    remote_iov = ...;

    process_vm_writev(pid, &local_iov, 1, &remote_iov, 1, 0);
    munmap(buf, n * PAGE_SIZE);

(Note that the above completely ignores error handling.)

There are several problems with process_vm_writev() in this example: 1)it causes a page fault on the remote process's memory, and it forces allocation of a new page (if one was not preallocated); 2)the amount of memory for this example is doubled for a moment -- n pages in the current task and n pages in the remote task are occupied at the same time; 3)the received data has no chance of being properly swapped for a long time. The third problem is the most critical, in case the remote process touches the data pages some time after the process_vm_writev() call was made. Imagine the node is under memory pressure: a)the kernel moves @buf pages into swap right after recv(); b)process_vm_writev() reads the data back from swap into pages; c)process_vm_writev() allocates duplicate pages in the remote process and populates them; d)munmap() unmaps @buf; e)5 minutes later the remote task touches the data. In stages "a" and "b" the kernel submits unneeded IO and makes system IO throughput worse. To do "b" and "c", the kernel reclaims memory and moves pages of some other processes to swap, so they have to read their pages back from swap. Also, unneeded copying of pages occurs, while zero-copy is preferable. We observe a similar problem during online migration of big enough containers, when, after the doubling of the container's size, the time increases 100 times. The system ends up under high IO load, throwing out useful caches. The proposed syscall aims to introduce an interface which supplements the currently existing process_vm_writev() and process_vm_readv(), and allows solving the problem for anonymous memory transfer. 
The above example may be rewritten as:

[Task 1]

    void *buf;

    buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
               MAP_PRIVATE|MAP_ANONYMOUS, ...);
    recv(sock, buf, n * PAGE_SIZE, 0);

[Task 2]

    buf2 = process_vm_mmap(pid_of_task1, buf, n * PAGE_SIZE, NULL, 0);

This creates a copy of the VMA related to buf from task1 in task2's VM. Task1's page table entries are copied into the corresponding page table entries of task2's VM. It is swap-friendly: in case the memory was swapped out right after recv(), the syscall just copies the page table entries like we do on fork(), so no real access to the pages occurs and no IO is needed. No excess pages are reclaimed, and the number of pages is not doubled. Also, this is zero-copy, which further reduces overhead. The patchset does not introduce much new code, since we simply reuse the existing copy_page_range() and copy_vma() functions. We extend copy_vma() to be able to merge VMAs in the remote task [2/7], and teach copy_page_range() to work with different local and remote addresses [3/7]. Patch [7/7] introduces the syscall logic, which mostly consists of sanity checks. The rest of the patches are preparations. This syscall may be used by page servers like the one in the example above, for migration (I assume even virtual machines may want something like this), by users of process_vm_writev() and process_vm_readv() who want zero-copy, for debugging purposes, etc. It requires the same permissions as the existing process_vm_xxx() syscalls.

The tests I used may be obtained here (UPDATED):

[1]https://gist.github.com/tkhai/ce46502fc53580372da35e8c3b7818b9
[2]https://gist.github.com/tkhai/40bda78e304d2fe0d90863214b9ac5b5

Previous version (RFC):

[3]https://lore.kernel.org/lkml/CAG48ez0itiEE1x=SXeMbjKvMGkrj7wxjM6c+ZB00LpXAAhqmiw@mail.gmail.com/T/

---

Kirill Tkhai (7):
      mm: Add process_vm_mmap() syscall declaration
      mm: Extend copy_vma()
      mm: Extend copy_page_range()
      mm: Export round_hint_to_min()
      mm: Introduce may_mmap_overlapped_region() helper
      mm: Introduce find_vma_filter_flags() helper
      mm: Add process_vm_mmap()

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    2 
 include/linux/huge_mm.h                |    6 +
 include/linux/mm.h                     |   14 ++
 include/linux/mm_types.h               |    2 
 include/linux/mman.h                   |   14 ++
 include/linux/syscalls.h               |    5 +
 include/uapi/asm-generic/mman-common.h |    6 +
 include/uapi/asm-generic/unistd.h      |    5 +
 init/Kconfig                           |    9 +-
 kernel/fork.c                          |    5 +
 kernel/sys_ni.c                        |    2 
 mm/huge_memory.c                       |   30 ++++-
 mm/memory.c                            |  165 +++++++++++++++++++---------
 mm/mmap.c                              |  186 ++++++++++++++++++++++++++------
 mm/mremap.c                            |   43 +++++--
 mm/process_vm_access.c                 |   69 ++++++++++++

 17 files changed, 439 insertions(+), 125 deletions(-)

--
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
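Since the example above intentionally ignores error handling, here is a fuller sketch of the two sides. There is no glibc wrapper for the proposed syscall, so the syscall number below is a placeholder for whatever number the final patch allocates, and the argument order follows the cover letter's example.

#include <stdio.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_process_vm_mmap
#define __NR_process_vm_mmap 436	/* placeholder, not an allocated number */
#endif

/* Task 1: receive the data into a private anonymous buffer. */
void *receive_buffer(int sock, size_t n_pages)
{
	size_t len = n_pages * sysconf(_SC_PAGESIZE);
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return NULL;
	}
	if (recv(sock, buf, len, MSG_WAITALL) != (ssize_t)len) {
		perror("recv");
		munmap(buf, len);
		return NULL;
	}
	return buf;
}

/* Task 2: clone Task 1's VMA into the local address space. */
void *clone_remote_buffer(pid_t pid_of_task1, void *remote_buf, size_t n_pages)
{
	size_t len = n_pages * sysconf(_SC_PAGESIZE);
	long ret = syscall(__NR_process_vm_mmap, pid_of_task1, remote_buf,
			   len, NULL, 0UL);

	if (ret == -1) {
		perror("process_vm_mmap");
		return NULL;
	}
	return (void *)ret;
}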