Message ID | 20190520035254.57579-5-minchan@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | introduce memory hinting API for external process | expand |
Hi. On Mon, May 20, 2019 at 12:52:51PM +0900, Minchan Kim wrote: > This patch factor out madvise's core functionality so that upcoming > patch can reuse it without duplication. > > It shouldn't change any behavior. > > Signed-off-by: Minchan Kim <minchan@kernel.org> > --- > mm/madvise.c | 168 +++++++++++++++++++++++++++------------------------ > 1 file changed, 89 insertions(+), 79 deletions(-) > > diff --git a/mm/madvise.c b/mm/madvise.c > index 9a6698b56845..119e82e1f065 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -742,7 +742,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma, > return 0; > } > > -static long madvise_dontneed_free(struct vm_area_struct *vma, > +static long madvise_dontneed_free(struct task_struct *tsk, > + struct vm_area_struct *vma, > struct vm_area_struct **prev, > unsigned long start, unsigned long end, > int behavior) > @@ -754,8 +755,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, > if (!userfaultfd_remove(vma, start, end)) { > *prev = NULL; /* mmap_sem has been dropped, prev is stale */ > > - down_read(¤t->mm->mmap_sem); > - vma = find_vma(current->mm, start); > + down_read(&tsk->mm->mmap_sem); > + vma = find_vma(tsk->mm, start); > if (!vma) > return -ENOMEM; > if (start < vma->vm_start) { > @@ -802,7 +803,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, > * Application wants to free up the pages and associated backing store. > * This is effectively punching a hole into the middle of a file. > */ > -static long madvise_remove(struct vm_area_struct *vma, > +static long madvise_remove(struct task_struct *tsk, > + struct vm_area_struct *vma, > struct vm_area_struct **prev, > unsigned long start, unsigned long end) > { > @@ -836,13 +838,13 @@ static long madvise_remove(struct vm_area_struct *vma, > get_file(f); > if (userfaultfd_remove(vma, start, end)) { > /* mmap_sem was not released by userfaultfd_remove() */ > - up_read(¤t->mm->mmap_sem); > + up_read(&tsk->mm->mmap_sem); > } > error = vfs_fallocate(f, > FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, > offset, end - start); > fput(f); > - down_read(¤t->mm->mmap_sem); > + down_read(&tsk->mm->mmap_sem); > return error; > } > > @@ -916,12 +918,13 @@ static int madvise_inject_error(int behavior, > #endif What about madvise_inject_error() and get_user_pages_fast() in it please? > > static long > -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, > - unsigned long start, unsigned long end, int behavior) > +madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma, > + struct vm_area_struct **prev, unsigned long start, > + unsigned long end, int behavior) > { > switch (behavior) { > case MADV_REMOVE: > - return madvise_remove(vma, prev, start, end); > + return madvise_remove(tsk, vma, prev, start, end); > case MADV_WILLNEED: > return madvise_willneed(vma, prev, start, end); > case MADV_COOL: > @@ -930,7 +933,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, > return madvise_cold(vma, start, end); > case MADV_FREE: > case MADV_DONTNEED: > - return madvise_dontneed_free(vma, prev, start, end, behavior); > + return madvise_dontneed_free(tsk, vma, prev, start, > + end, behavior); > default: > return madvise_behavior(vma, prev, start, end, behavior); > } > @@ -974,68 +978,8 @@ madvise_behavior_valid(int behavior) > } > } > > -/* > - * The madvise(2) system call. > - * > - * Applications can use madvise() to advise the kernel how it should > - * handle paging I/O in this VM area. The idea is to help the kernel > - * use appropriate read-ahead and caching techniques. The information > - * provided is advisory only, and can be safely disregarded by the > - * kernel without affecting the correct operation of the application. > - * > - * behavior values: > - * MADV_NORMAL - the default behavior is to read clusters. This > - * results in some read-ahead and read-behind. > - * MADV_RANDOM - the system should read the minimum amount of data > - * on any access, since it is unlikely that the appli- > - * cation will need more than what it asks for. > - * MADV_SEQUENTIAL - pages in the given range will probably be accessed > - * once, so they can be aggressively read ahead, and > - * can be freed soon after they are accessed. > - * MADV_WILLNEED - the application is notifying the system to read > - * some pages ahead. > - * MADV_DONTNEED - the application is finished with the given range, > - * so the kernel can free resources associated with it. > - * MADV_FREE - the application marks pages in the given range as lazy free, > - * where actual purges are postponed until memory pressure happens. > - * MADV_REMOVE - the application wants to free up the given range of > - * pages and associated backing store. > - * MADV_DONTFORK - omit this area from child's address space when forking: > - * typically, to avoid COWing pages pinned by get_user_pages(). > - * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking. > - * MADV_WIPEONFORK - present the child process with zero-filled memory in this > - * range after a fork. > - * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK > - * MADV_HWPOISON - trigger memory error handler as if the given memory range > - * were corrupted by unrecoverable hardware memory failure. > - * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory. > - * MADV_MERGEABLE - the application recommends that KSM try to merge pages in > - * this area with pages of identical content from other such areas. > - * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others. > - * MADV_HUGEPAGE - the application wants to back the given range by transparent > - * huge pages in the future. Existing pages might be coalesced and > - * new pages might be allocated as THP. > - * MADV_NOHUGEPAGE - mark the given range as not worth being backed by > - * transparent huge pages so the existing pages will not be > - * coalesced into THP and new pages will not be allocated as THP. > - * MADV_DONTDUMP - the application wants to prevent pages in the given range > - * from being included in its core dump. > - * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump. > - * > - * return values: > - * zero - success > - * -EINVAL - start + len < 0, start is not page-aligned, > - * "behavior" is not a valid value, or application > - * is attempting to release locked or shared pages, > - * or the specified address range includes file, Huge TLB, > - * MAP_SHARED or VMPFNMAP range. > - * -ENOMEM - addresses in the specified range are not currently > - * mapped, or are outside the AS of the process. > - * -EIO - an I/O error occurred while paging in data. > - * -EBADF - map exists, but area maps something that isn't a file. > - * -EAGAIN - a kernel resource was temporarily unavailable. > - */ > -SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) > +static int madvise_core(struct task_struct *tsk, unsigned long start, > + size_t len_in, int behavior) > { > unsigned long end, tmp; > struct vm_area_struct *vma, *prev; > @@ -1071,10 +1015,10 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) > > write = madvise_need_mmap_write(behavior); > if (write) { > - if (down_write_killable(¤t->mm->mmap_sem)) > + if (down_write_killable(&tsk->mm->mmap_sem)) > return -EINTR; > } else { > - down_read(¤t->mm->mmap_sem); > + down_read(&tsk->mm->mmap_sem); > } > > /* > @@ -1082,7 +1026,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) > * ranges, just ignore them, but return -ENOMEM at the end. > * - different from the way of handling in mlock etc. > */ > - vma = find_vma_prev(current->mm, start, &prev); > + vma = find_vma_prev(tsk->mm, start, &prev); > if (vma && start > vma->vm_start) > prev = vma; > > @@ -1107,7 +1051,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) > tmp = end; > > /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */ > - error = madvise_vma(vma, &prev, start, tmp, behavior); > + error = madvise_vma(tsk, vma, &prev, start, tmp, behavior); > if (error) > goto out; > start = tmp; > @@ -1119,14 +1063,80 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) > if (prev) > vma = prev->vm_next; > else /* madvise_remove dropped mmap_sem */ > - vma = find_vma(current->mm, start); > + vma = find_vma(tsk->mm, start); > } > out: > blk_finish_plug(&plug); > if (write) > - up_write(¤t->mm->mmap_sem); > + up_write(&tsk->mm->mmap_sem); > else > - up_read(¤t->mm->mmap_sem); > + up_read(&tsk->mm->mmap_sem); > > return error; > } > + > +/* > + * The madvise(2) system call. > + * > + * Applications can use madvise() to advise the kernel how it should > + * handle paging I/O in this VM area. The idea is to help the kernel > + * use appropriate read-ahead and caching techniques. The information > + * provided is advisory only, and can be safely disregarded by the > + * kernel without affecting the correct operation of the application. > + * > + * behavior values: > + * MADV_NORMAL - the default behavior is to read clusters. This > + * results in some read-ahead and read-behind. > + * MADV_RANDOM - the system should read the minimum amount of data > + * on any access, since it is unlikely that the appli- > + * cation will need more than what it asks for. > + * MADV_SEQUENTIAL - pages in the given range will probably be accessed > + * once, so they can be aggressively read ahead, and > + * can be freed soon after they are accessed. > + * MADV_WILLNEED - the application is notifying the system to read > + * some pages ahead. > + * MADV_DONTNEED - the application is finished with the given range, > + * so the kernel can free resources associated with it. > + * MADV_FREE - the application marks pages in the given range as lazy free, > + * where actual purges are postponed until memory pressure happens. > + * MADV_REMOVE - the application wants to free up the given range of > + * pages and associated backing store. > + * MADV_DONTFORK - omit this area from child's address space when forking: > + * typically, to avoid COWing pages pinned by get_user_pages(). > + * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking. > + * MADV_WIPEONFORK - present the child process with zero-filled memory in this > + * range after a fork. > + * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK > + * MADV_HWPOISON - trigger memory error handler as if the given memory range > + * were corrupted by unrecoverable hardware memory failure. > + * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory. > + * MADV_MERGEABLE - the application recommends that KSM try to merge pages in > + * this area with pages of identical content from other such areas. > + * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others. > + * MADV_HUGEPAGE - the application wants to back the given range by transparent > + * huge pages in the future. Existing pages might be coalesced and > + * new pages might be allocated as THP. > + * MADV_NOHUGEPAGE - mark the given range as not worth being backed by > + * transparent huge pages so the existing pages will not be > + * coalesced into THP and new pages will not be allocated as THP. > + * MADV_DONTDUMP - the application wants to prevent pages in the given range > + * from being included in its core dump. > + * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump. > + * > + * return values: > + * zero - success > + * -EINVAL - start + len < 0, start is not page-aligned, > + * "behavior" is not a valid value, or application > + * is attempting to release locked or shared pages, > + * or the specified address range includes file, Huge TLB, > + * MAP_SHARED or VMPFNMAP range. > + * -ENOMEM - addresses in the specified range are not currently > + * mapped, or are outside the AS of the process. > + * -EIO - an I/O error occurred while paging in data. > + * -EBADF - map exists, but area maps something that isn't a file. > + * -EAGAIN - a kernel resource was temporarily unavailable. > + */ > +SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) > +{ > + return madvise_core(current, start, len_in, behavior); > +} > -- > 2.21.0.1020.gf2820cf01a-goog >
Hi Oleksandr, On Mon, May 20, 2019 at 04:26:33PM +0200, Oleksandr Natalenko wrote: > Hi. > > On Mon, May 20, 2019 at 12:52:51PM +0900, Minchan Kim wrote: > > This patch factor out madvise's core functionality so that upcoming > > patch can reuse it without duplication. > > > > It shouldn't change any behavior. > > > > Signed-off-by: Minchan Kim <minchan@kernel.org> > > --- > > mm/madvise.c | 168 +++++++++++++++++++++++++++------------------------ > > 1 file changed, 89 insertions(+), 79 deletions(-) > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > index 9a6698b56845..119e82e1f065 100644 > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -742,7 +742,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma, > > return 0; > > } > > > > -static long madvise_dontneed_free(struct vm_area_struct *vma, > > +static long madvise_dontneed_free(struct task_struct *tsk, > > + struct vm_area_struct *vma, > > struct vm_area_struct **prev, > > unsigned long start, unsigned long end, > > int behavior) > > @@ -754,8 +755,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, > > if (!userfaultfd_remove(vma, start, end)) { > > *prev = NULL; /* mmap_sem has been dropped, prev is stale */ > > > > - down_read(¤t->mm->mmap_sem); > > - vma = find_vma(current->mm, start); > > + down_read(&tsk->mm->mmap_sem); > > + vma = find_vma(tsk->mm, start); > > if (!vma) > > return -ENOMEM; > > if (start < vma->vm_start) { > > @@ -802,7 +803,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, > > * Application wants to free up the pages and associated backing store. > > * This is effectively punching a hole into the middle of a file. > > */ > > -static long madvise_remove(struct vm_area_struct *vma, > > +static long madvise_remove(struct task_struct *tsk, > > + struct vm_area_struct *vma, > > struct vm_area_struct **prev, > > unsigned long start, unsigned long end) > > { > > @@ -836,13 +838,13 @@ static long madvise_remove(struct vm_area_struct *vma, > > get_file(f); > > if (userfaultfd_remove(vma, start, end)) { > > /* mmap_sem was not released by userfaultfd_remove() */ > > - up_read(¤t->mm->mmap_sem); > > + up_read(&tsk->mm->mmap_sem); > > } > > error = vfs_fallocate(f, > > FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, > > offset, end - start); > > fput(f); > > - down_read(¤t->mm->mmap_sem); > > + down_read(&tsk->mm->mmap_sem); > > return error; > > } > > > > @@ -916,12 +918,13 @@ static int madvise_inject_error(int behavior, > > #endif > > What about madvise_inject_error() and get_user_pages_fast() in it > please? Good point. Maybe, there more places where assume context is "current" so I'm thinking to limit hints we could allow from external process. It would be better for maintainance point of view in that we could know the workload/usecases when someone ask new advises from external process without making every hints works both contexts. Thanks.
Hi. On Tue, May 21, 2019 at 10:26:49AM +0900, Minchan Kim wrote: > On Mon, May 20, 2019 at 04:26:33PM +0200, Oleksandr Natalenko wrote: > > Hi. > > > > On Mon, May 20, 2019 at 12:52:51PM +0900, Minchan Kim wrote: > > > This patch factor out madvise's core functionality so that upcoming > > > patch can reuse it without duplication. > > > > > > It shouldn't change any behavior. > > > > > > Signed-off-by: Minchan Kim <minchan@kernel.org> > > > --- > > > mm/madvise.c | 168 +++++++++++++++++++++++++++------------------------ > > > 1 file changed, 89 insertions(+), 79 deletions(-) > > > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > > index 9a6698b56845..119e82e1f065 100644 > > > --- a/mm/madvise.c > > > +++ b/mm/madvise.c > > > @@ -742,7 +742,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma, > > > return 0; > > > } > > > > > > -static long madvise_dontneed_free(struct vm_area_struct *vma, > > > +static long madvise_dontneed_free(struct task_struct *tsk, > > > + struct vm_area_struct *vma, > > > struct vm_area_struct **prev, > > > unsigned long start, unsigned long end, > > > int behavior) > > > @@ -754,8 +755,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, > > > if (!userfaultfd_remove(vma, start, end)) { > > > *prev = NULL; /* mmap_sem has been dropped, prev is stale */ > > > > > > - down_read(¤t->mm->mmap_sem); > > > - vma = find_vma(current->mm, start); > > > + down_read(&tsk->mm->mmap_sem); > > > + vma = find_vma(tsk->mm, start); > > > if (!vma) > > > return -ENOMEM; > > > if (start < vma->vm_start) { > > > @@ -802,7 +803,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, > > > * Application wants to free up the pages and associated backing store. > > > * This is effectively punching a hole into the middle of a file. > > > */ > > > -static long madvise_remove(struct vm_area_struct *vma, > > > +static long madvise_remove(struct task_struct *tsk, > > > + struct vm_area_struct *vma, > > > struct vm_area_struct **prev, > > > unsigned long start, unsigned long end) > > > { > > > @@ -836,13 +838,13 @@ static long madvise_remove(struct vm_area_struct *vma, > > > get_file(f); > > > if (userfaultfd_remove(vma, start, end)) { > > > /* mmap_sem was not released by userfaultfd_remove() */ > > > - up_read(¤t->mm->mmap_sem); > > > + up_read(&tsk->mm->mmap_sem); > > > } > > > error = vfs_fallocate(f, > > > FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, > > > offset, end - start); > > > fput(f); > > > - down_read(¤t->mm->mmap_sem); > > > + down_read(&tsk->mm->mmap_sem); > > > return error; > > > } > > > > > > @@ -916,12 +918,13 @@ static int madvise_inject_error(int behavior, > > > #endif > > > > What about madvise_inject_error() and get_user_pages_fast() in it > > please? > > Good point. Maybe, there more places where assume context is "current" so > I'm thinking to limit hints we could allow from external process. > It would be better for maintainance point of view in that we could know > the workload/usecases when someone ask new advises from external process > without making every hints works both contexts. Well, for madvise_inject_error() we still have a remote variant of get_user_pages(), and that should work, no? Regarding restricting the hints, I'm definitely interested in having remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote madvise() introduces another issue with traversing remote VMAs reliably. IIUC, one can do this via userspace by parsing [s]maps file only, which is not very consistent, and once some range is parsed, and then it is immediately gone, a wrong hint will be sent. Isn't this a problem we should worry about? > > Thanks.
On Tue 21-05-19 08:36:28, Oleksandr Natalenko wrote: [...] > Regarding restricting the hints, I'm definitely interested in having > remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote > madvise() introduces another issue with traversing remote VMAs reliably. > IIUC, one can do this via userspace by parsing [s]maps file only, which > is not very consistent, and once some range is parsed, and then it is > immediately gone, a wrong hint will be sent. > > Isn't this a problem we should worry about? See http://lkml.kernel.org/r/20190520091829.GY6836@dhcp22.suse.cz
Hi. On Tue, May 21, 2019 at 08:50:00AM +0200, Michal Hocko wrote: > On Tue 21-05-19 08:36:28, Oleksandr Natalenko wrote: > [...] > > Regarding restricting the hints, I'm definitely interested in having > > remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote > > madvise() introduces another issue with traversing remote VMAs reliably. > > IIUC, one can do this via userspace by parsing [s]maps file only, which > > is not very consistent, and once some range is parsed, and then it is > > immediately gone, a wrong hint will be sent. > > > > Isn't this a problem we should worry about? > > See http://lkml.kernel.org/r/20190520091829.GY6836@dhcp22.suse.cz Oh, thanks for the pointer. Indeed, for my specific task with remote KSM I'd go with map_files instead. This doesn't solve the task completely in case of traversal through all the VMAs in one pass, but makes it easier comparing to a remote syscall.
On Tue, May 21, 2019 at 08:36:28AM +0200, Oleksandr Natalenko wrote: > Hi. > > On Tue, May 21, 2019 at 10:26:49AM +0900, Minchan Kim wrote: > > On Mon, May 20, 2019 at 04:26:33PM +0200, Oleksandr Natalenko wrote: > > > Hi. > > > > > > On Mon, May 20, 2019 at 12:52:51PM +0900, Minchan Kim wrote: > > > > This patch factor out madvise's core functionality so that upcoming > > > > patch can reuse it without duplication. > > > > > > > > It shouldn't change any behavior. > > > > > > > > Signed-off-by: Minchan Kim <minchan@kernel.org> > > > > --- > > > > mm/madvise.c | 168 +++++++++++++++++++++++++++------------------------ > > > > 1 file changed, 89 insertions(+), 79 deletions(-) > > > > > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > > > index 9a6698b56845..119e82e1f065 100644 > > > > --- a/mm/madvise.c > > > > +++ b/mm/madvise.c > > > > @@ -742,7 +742,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma, > > > > return 0; > > > > } > > > > > > > > -static long madvise_dontneed_free(struct vm_area_struct *vma, > > > > +static long madvise_dontneed_free(struct task_struct *tsk, > > > > + struct vm_area_struct *vma, > > > > struct vm_area_struct **prev, > > > > unsigned long start, unsigned long end, > > > > int behavior) > > > > @@ -754,8 +755,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, > > > > if (!userfaultfd_remove(vma, start, end)) { > > > > *prev = NULL; /* mmap_sem has been dropped, prev is stale */ > > > > > > > > - down_read(¤t->mm->mmap_sem); > > > > - vma = find_vma(current->mm, start); > > > > + down_read(&tsk->mm->mmap_sem); > > > > + vma = find_vma(tsk->mm, start); > > > > if (!vma) > > > > return -ENOMEM; > > > > if (start < vma->vm_start) { > > > > @@ -802,7 +803,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, > > > > * Application wants to free up the pages and associated backing store. > > > > * This is effectively punching a hole into the middle of a file. > > > > */ > > > > -static long madvise_remove(struct vm_area_struct *vma, > > > > +static long madvise_remove(struct task_struct *tsk, > > > > + struct vm_area_struct *vma, > > > > struct vm_area_struct **prev, > > > > unsigned long start, unsigned long end) > > > > { > > > > @@ -836,13 +838,13 @@ static long madvise_remove(struct vm_area_struct *vma, > > > > get_file(f); > > > > if (userfaultfd_remove(vma, start, end)) { > > > > /* mmap_sem was not released by userfaultfd_remove() */ > > > > - up_read(¤t->mm->mmap_sem); > > > > + up_read(&tsk->mm->mmap_sem); > > > > } > > > > error = vfs_fallocate(f, > > > > FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, > > > > offset, end - start); > > > > fput(f); > > > > - down_read(¤t->mm->mmap_sem); > > > > + down_read(&tsk->mm->mmap_sem); > > > > return error; > > > > } > > > > > > > > @@ -916,12 +918,13 @@ static int madvise_inject_error(int behavior, > > > > #endif > > > > > > What about madvise_inject_error() and get_user_pages_fast() in it > > > please? > > > > Good point. Maybe, there more places where assume context is "current" so > > I'm thinking to limit hints we could allow from external process. > > It would be better for maintainance point of view in that we could know > > the workload/usecases when someone ask new advises from external process > > without making every hints works both contexts. > > Well, for madvise_inject_error() we still have a remote variant of > get_user_pages(), and that should work, no? Regardless of madvise_inject_error, it seems to be risky to expose all of hints for external process, I think. For example, MADV_DONTNEED with race, it's critical for stability. So, until we could get the way to prevent the race, I want to restrict hints. > > Regarding restricting the hints, I'm definitely interested in having > remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote > madvise() introduces another issue with traversing remote VMAs reliably. How is it signifiact when the race happens? It could waste CPU cycle and make unncessary break of that merged pages but expect it should be rare so such non-desruptive hint could be exposed via process_madvise, I think. If the hint is critical for the race, yes, as Michal suggested, we need a way to close it and I guess non-cooperative userfaultfd with synchronous support would help private anonymous vma. > IIUC, one can do this via userspace by parsing [s]maps file only, which > is not very consistent, and once some range is parsed, and then it is > immediately gone, a wrong hint will be sent. > > Isn't this a problem we should worry about? I think it depends on the hint and usecase. > > > > > Thanks. > > -- > Best regards, > Oleksandr Natalenko (post-factum) > Senior Software Maintenance Engineer
On Tue, May 21, 2019 at 09:06:38AM +0200, Oleksandr Natalenko wrote: > Hi. > > On Tue, May 21, 2019 at 08:50:00AM +0200, Michal Hocko wrote: > > On Tue 21-05-19 08:36:28, Oleksandr Natalenko wrote: > > [...] > > > Regarding restricting the hints, I'm definitely interested in having > > > remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote > > > madvise() introduces another issue with traversing remote VMAs reliably. > > > IIUC, one can do this via userspace by parsing [s]maps file only, which > > > is not very consistent, and once some range is parsed, and then it is > > > immediately gone, a wrong hint will be sent. > > > > > > Isn't this a problem we should worry about? > > > > See http://lkml.kernel.org/r/20190520091829.GY6836@dhcp22.suse.cz > > Oh, thanks for the pointer. > > Indeed, for my specific task with remote KSM I'd go with map_files > instead. This doesn't solve the task completely in case of traversal > through all the VMAs in one pass, but makes it easier comparing to a > remote syscall. I'm wondering how map_files can solve your concern exactly if you have a concern about the race of vma unmap/remap even there are anonymous vma which map_files doesn't support. > > -- > Best regards, > Oleksandr Natalenko (post-factum) > Senior Software Maintenance Engineer
On Tue 21-05-19 19:49:49, Minchan Kim wrote: > On Tue, May 21, 2019 at 08:36:28AM +0200, Oleksandr Natalenko wrote: > > Hi. > > > > On Tue, May 21, 2019 at 10:26:49AM +0900, Minchan Kim wrote: > > > On Mon, May 20, 2019 at 04:26:33PM +0200, Oleksandr Natalenko wrote: > > > > Hi. > > > > > > > > On Mon, May 20, 2019 at 12:52:51PM +0900, Minchan Kim wrote: > > > > > This patch factor out madvise's core functionality so that upcoming > > > > > patch can reuse it without duplication. > > > > > > > > > > It shouldn't change any behavior. > > > > > > > > > > Signed-off-by: Minchan Kim <minchan@kernel.org> > > > > > --- > > > > > mm/madvise.c | 168 +++++++++++++++++++++++++++------------------------ > > > > > 1 file changed, 89 insertions(+), 79 deletions(-) > > > > > > > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > > > > index 9a6698b56845..119e82e1f065 100644 > > > > > --- a/mm/madvise.c > > > > > +++ b/mm/madvise.c > > > > > @@ -742,7 +742,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma, > > > > > return 0; > > > > > } > > > > > > > > > > -static long madvise_dontneed_free(struct vm_area_struct *vma, > > > > > +static long madvise_dontneed_free(struct task_struct *tsk, > > > > > + struct vm_area_struct *vma, > > > > > struct vm_area_struct **prev, > > > > > unsigned long start, unsigned long end, > > > > > int behavior) > > > > > @@ -754,8 +755,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, > > > > > if (!userfaultfd_remove(vma, start, end)) { > > > > > *prev = NULL; /* mmap_sem has been dropped, prev is stale */ > > > > > > > > > > - down_read(¤t->mm->mmap_sem); > > > > > - vma = find_vma(current->mm, start); > > > > > + down_read(&tsk->mm->mmap_sem); > > > > > + vma = find_vma(tsk->mm, start); > > > > > if (!vma) > > > > > return -ENOMEM; > > > > > if (start < vma->vm_start) { > > > > > @@ -802,7 +803,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, > > > > > * Application wants to free up the pages and associated backing store. > > > > > * This is effectively punching a hole into the middle of a file. > > > > > */ > > > > > -static long madvise_remove(struct vm_area_struct *vma, > > > > > +static long madvise_remove(struct task_struct *tsk, > > > > > + struct vm_area_struct *vma, > > > > > struct vm_area_struct **prev, > > > > > unsigned long start, unsigned long end) > > > > > { > > > > > @@ -836,13 +838,13 @@ static long madvise_remove(struct vm_area_struct *vma, > > > > > get_file(f); > > > > > if (userfaultfd_remove(vma, start, end)) { > > > > > /* mmap_sem was not released by userfaultfd_remove() */ > > > > > - up_read(¤t->mm->mmap_sem); > > > > > + up_read(&tsk->mm->mmap_sem); > > > > > } > > > > > error = vfs_fallocate(f, > > > > > FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, > > > > > offset, end - start); > > > > > fput(f); > > > > > - down_read(¤t->mm->mmap_sem); > > > > > + down_read(&tsk->mm->mmap_sem); > > > > > return error; > > > > > } > > > > > > > > > > @@ -916,12 +918,13 @@ static int madvise_inject_error(int behavior, > > > > > #endif > > > > > > > > What about madvise_inject_error() and get_user_pages_fast() in it > > > > please? > > > > > > Good point. Maybe, there more places where assume context is "current" so > > > I'm thinking to limit hints we could allow from external process. > > > It would be better for maintainance point of view in that we could know > > > the workload/usecases when someone ask new advises from external process > > > without making every hints works both contexts. > > > > Well, for madvise_inject_error() we still have a remote variant of > > get_user_pages(), and that should work, no? > > Regardless of madvise_inject_error, it seems to be risky to expose all > of hints for external process, I think. For example, MADV_DONTNEED with > race, it's critical for stability. So, until we could get the way to > prevent the race, I want to restrict hints. Well, if you allow the full ptrace access then you can shoot the target whatever you like. > > Regarding restricting the hints, I'm definitely interested in having > > remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote > > madvise() introduces another issue with traversing remote VMAs reliably. > > How is it signifiact when the race happens? It could waste CPU cycle > and make unncessary break of that merged pages but expect it should be > rare so such non-desruptive hint could be exposed via process_madvise, I think. > > If the hint is critical for the race, yes, as Michal suggested, we need a way > to close it and I guess non-cooperative userfaultfd with synchronous support > would help private anonymous vma. If we have a per vma fd approach then we can revalidate atomically and make sure the operation is performed on the range that was really requested. I do not think we want to provide a more specific guarantees. Monitor process has to be careful same way ptrace doesn't want to harm the target.
On Tue 21-05-19 19:52:56, Minchan Kim wrote: > On Tue, May 21, 2019 at 09:06:38AM +0200, Oleksandr Natalenko wrote: > > Hi. > > > > On Tue, May 21, 2019 at 08:50:00AM +0200, Michal Hocko wrote: > > > On Tue 21-05-19 08:36:28, Oleksandr Natalenko wrote: > > > [...] > > > > Regarding restricting the hints, I'm definitely interested in having > > > > remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote > > > > madvise() introduces another issue with traversing remote VMAs reliably. > > > > IIUC, one can do this via userspace by parsing [s]maps file only, which > > > > is not very consistent, and once some range is parsed, and then it is > > > > immediately gone, a wrong hint will be sent. > > > > > > > > Isn't this a problem we should worry about? > > > > > > See http://lkml.kernel.org/r/20190520091829.GY6836@dhcp22.suse.cz > > > > Oh, thanks for the pointer. > > > > Indeed, for my specific task with remote KSM I'd go with map_files > > instead. This doesn't solve the task completely in case of traversal > > through all the VMAs in one pass, but makes it easier comparing to a > > remote syscall. > > I'm wondering how map_files can solve your concern exactly if you have > a concern about the race of vma unmap/remap even there are anonymous > vma which map_files doesn't support. See http://lkml.kernel.org/r/20190521105503.GQ32329@dhcp22.suse.cz
On Tue, May 21, 2019 at 01:00:30PM +0200, Michal Hocko wrote: > On Tue 21-05-19 19:52:56, Minchan Kim wrote: > > On Tue, May 21, 2019 at 09:06:38AM +0200, Oleksandr Natalenko wrote: > > > Hi. > > > > > > On Tue, May 21, 2019 at 08:50:00AM +0200, Michal Hocko wrote: > > > > On Tue 21-05-19 08:36:28, Oleksandr Natalenko wrote: > > > > [...] > > > > > Regarding restricting the hints, I'm definitely interested in having > > > > > remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote > > > > > madvise() introduces another issue with traversing remote VMAs reliably. > > > > > IIUC, one can do this via userspace by parsing [s]maps file only, which > > > > > is not very consistent, and once some range is parsed, and then it is > > > > > immediately gone, a wrong hint will be sent. > > > > > > > > > > Isn't this a problem we should worry about? > > > > > > > > See http://lkml.kernel.org/r/20190520091829.GY6836@dhcp22.suse.cz > > > > > > Oh, thanks for the pointer. > > > > > > Indeed, for my specific task with remote KSM I'd go with map_files > > > instead. This doesn't solve the task completely in case of traversal > > > through all the VMAs in one pass, but makes it easier comparing to a > > > remote syscall. > > > > I'm wondering how map_files can solve your concern exactly if you have > > a concern about the race of vma unmap/remap even there are anonymous > > vma which map_files doesn't support. > > See http://lkml.kernel.org/r/20190521105503.GQ32329@dhcp22.suse.cz Question is how it works for anonymous vma which don't have backing file. > > -- > Michal Hocko > SUSE Labs
On Tue 21-05-19 20:24:23, Minchan Kim wrote: > On Tue, May 21, 2019 at 01:00:30PM +0200, Michal Hocko wrote: > > On Tue 21-05-19 19:52:56, Minchan Kim wrote: > > > On Tue, May 21, 2019 at 09:06:38AM +0200, Oleksandr Natalenko wrote: > > > > Hi. > > > > > > > > On Tue, May 21, 2019 at 08:50:00AM +0200, Michal Hocko wrote: > > > > > On Tue 21-05-19 08:36:28, Oleksandr Natalenko wrote: > > > > > [...] > > > > > > Regarding restricting the hints, I'm definitely interested in having > > > > > > remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote > > > > > > madvise() introduces another issue with traversing remote VMAs reliably. > > > > > > IIUC, one can do this via userspace by parsing [s]maps file only, which > > > > > > is not very consistent, and once some range is parsed, and then it is > > > > > > immediately gone, a wrong hint will be sent. > > > > > > > > > > > > Isn't this a problem we should worry about? > > > > > > > > > > See http://lkml.kernel.org/r/20190520091829.GY6836@dhcp22.suse.cz > > > > > > > > Oh, thanks for the pointer. > > > > > > > > Indeed, for my specific task with remote KSM I'd go with map_files > > > > instead. This doesn't solve the task completely in case of traversal > > > > through all the VMAs in one pass, but makes it easier comparing to a > > > > remote syscall. > > > > > > I'm wondering how map_files can solve your concern exactly if you have > > > a concern about the race of vma unmap/remap even there are anonymous > > > vma which map_files doesn't support. > > > > See http://lkml.kernel.org/r/20190521105503.GQ32329@dhcp22.suse.cz > > Question is how it works for anonymous vma which don't have backing > file. We would have to export map_files like interface for anonymous vmas and have a way to invalidate on the VMA removal (reference counting or something similar), no question this will be some additional work to do.
diff --git a/mm/madvise.c b/mm/madvise.c index 9a6698b56845..119e82e1f065 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -742,7 +742,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma, return 0; } -static long madvise_dontneed_free(struct vm_area_struct *vma, +static long madvise_dontneed_free(struct task_struct *tsk, + struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end, int behavior) @@ -754,8 +755,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, if (!userfaultfd_remove(vma, start, end)) { *prev = NULL; /* mmap_sem has been dropped, prev is stale */ - down_read(¤t->mm->mmap_sem); - vma = find_vma(current->mm, start); + down_read(&tsk->mm->mmap_sem); + vma = find_vma(tsk->mm, start); if (!vma) return -ENOMEM; if (start < vma->vm_start) { @@ -802,7 +803,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, * Application wants to free up the pages and associated backing store. * This is effectively punching a hole into the middle of a file. */ -static long madvise_remove(struct vm_area_struct *vma, +static long madvise_remove(struct task_struct *tsk, + struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end) { @@ -836,13 +838,13 @@ static long madvise_remove(struct vm_area_struct *vma, get_file(f); if (userfaultfd_remove(vma, start, end)) { /* mmap_sem was not released by userfaultfd_remove() */ - up_read(¤t->mm->mmap_sem); + up_read(&tsk->mm->mmap_sem); } error = vfs_fallocate(f, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, end - start); fput(f); - down_read(¤t->mm->mmap_sem); + down_read(&tsk->mm->mmap_sem); return error; } @@ -916,12 +918,13 @@ static int madvise_inject_error(int behavior, #endif static long -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, - unsigned long start, unsigned long end, int behavior) +madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma, + struct vm_area_struct **prev, unsigned long start, + unsigned long end, int behavior) { switch (behavior) { case MADV_REMOVE: - return madvise_remove(vma, prev, start, end); + return madvise_remove(tsk, vma, prev, start, end); case MADV_WILLNEED: return madvise_willneed(vma, prev, start, end); case MADV_COOL: @@ -930,7 +933,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, return madvise_cold(vma, start, end); case MADV_FREE: case MADV_DONTNEED: - return madvise_dontneed_free(vma, prev, start, end, behavior); + return madvise_dontneed_free(tsk, vma, prev, start, + end, behavior); default: return madvise_behavior(vma, prev, start, end, behavior); } @@ -974,68 +978,8 @@ madvise_behavior_valid(int behavior) } } -/* - * The madvise(2) system call. - * - * Applications can use madvise() to advise the kernel how it should - * handle paging I/O in this VM area. The idea is to help the kernel - * use appropriate read-ahead and caching techniques. The information - * provided is advisory only, and can be safely disregarded by the - * kernel without affecting the correct operation of the application. - * - * behavior values: - * MADV_NORMAL - the default behavior is to read clusters. This - * results in some read-ahead and read-behind. - * MADV_RANDOM - the system should read the minimum amount of data - * on any access, since it is unlikely that the appli- - * cation will need more than what it asks for. - * MADV_SEQUENTIAL - pages in the given range will probably be accessed - * once, so they can be aggressively read ahead, and - * can be freed soon after they are accessed. - * MADV_WILLNEED - the application is notifying the system to read - * some pages ahead. - * MADV_DONTNEED - the application is finished with the given range, - * so the kernel can free resources associated with it. - * MADV_FREE - the application marks pages in the given range as lazy free, - * where actual purges are postponed until memory pressure happens. - * MADV_REMOVE - the application wants to free up the given range of - * pages and associated backing store. - * MADV_DONTFORK - omit this area from child's address space when forking: - * typically, to avoid COWing pages pinned by get_user_pages(). - * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking. - * MADV_WIPEONFORK - present the child process with zero-filled memory in this - * range after a fork. - * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK - * MADV_HWPOISON - trigger memory error handler as if the given memory range - * were corrupted by unrecoverable hardware memory failure. - * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory. - * MADV_MERGEABLE - the application recommends that KSM try to merge pages in - * this area with pages of identical content from other such areas. - * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others. - * MADV_HUGEPAGE - the application wants to back the given range by transparent - * huge pages in the future. Existing pages might be coalesced and - * new pages might be allocated as THP. - * MADV_NOHUGEPAGE - mark the given range as not worth being backed by - * transparent huge pages so the existing pages will not be - * coalesced into THP and new pages will not be allocated as THP. - * MADV_DONTDUMP - the application wants to prevent pages in the given range - * from being included in its core dump. - * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump. - * - * return values: - * zero - success - * -EINVAL - start + len < 0, start is not page-aligned, - * "behavior" is not a valid value, or application - * is attempting to release locked or shared pages, - * or the specified address range includes file, Huge TLB, - * MAP_SHARED or VMPFNMAP range. - * -ENOMEM - addresses in the specified range are not currently - * mapped, or are outside the AS of the process. - * -EIO - an I/O error occurred while paging in data. - * -EBADF - map exists, but area maps something that isn't a file. - * -EAGAIN - a kernel resource was temporarily unavailable. - */ -SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) +static int madvise_core(struct task_struct *tsk, unsigned long start, + size_t len_in, int behavior) { unsigned long end, tmp; struct vm_area_struct *vma, *prev; @@ -1071,10 +1015,10 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) write = madvise_need_mmap_write(behavior); if (write) { - if (down_write_killable(¤t->mm->mmap_sem)) + if (down_write_killable(&tsk->mm->mmap_sem)) return -EINTR; } else { - down_read(¤t->mm->mmap_sem); + down_read(&tsk->mm->mmap_sem); } /* @@ -1082,7 +1026,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) * ranges, just ignore them, but return -ENOMEM at the end. * - different from the way of handling in mlock etc. */ - vma = find_vma_prev(current->mm, start, &prev); + vma = find_vma_prev(tsk->mm, start, &prev); if (vma && start > vma->vm_start) prev = vma; @@ -1107,7 +1051,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) tmp = end; /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */ - error = madvise_vma(vma, &prev, start, tmp, behavior); + error = madvise_vma(tsk, vma, &prev, start, tmp, behavior); if (error) goto out; start = tmp; @@ -1119,14 +1063,80 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) if (prev) vma = prev->vm_next; else /* madvise_remove dropped mmap_sem */ - vma = find_vma(current->mm, start); + vma = find_vma(tsk->mm, start); } out: blk_finish_plug(&plug); if (write) - up_write(¤t->mm->mmap_sem); + up_write(&tsk->mm->mmap_sem); else - up_read(¤t->mm->mmap_sem); + up_read(&tsk->mm->mmap_sem); return error; } + +/* + * The madvise(2) system call. + * + * Applications can use madvise() to advise the kernel how it should + * handle paging I/O in this VM area. The idea is to help the kernel + * use appropriate read-ahead and caching techniques. The information + * provided is advisory only, and can be safely disregarded by the + * kernel without affecting the correct operation of the application. + * + * behavior values: + * MADV_NORMAL - the default behavior is to read clusters. This + * results in some read-ahead and read-behind. + * MADV_RANDOM - the system should read the minimum amount of data + * on any access, since it is unlikely that the appli- + * cation will need more than what it asks for. + * MADV_SEQUENTIAL - pages in the given range will probably be accessed + * once, so they can be aggressively read ahead, and + * can be freed soon after they are accessed. + * MADV_WILLNEED - the application is notifying the system to read + * some pages ahead. + * MADV_DONTNEED - the application is finished with the given range, + * so the kernel can free resources associated with it. + * MADV_FREE - the application marks pages in the given range as lazy free, + * where actual purges are postponed until memory pressure happens. + * MADV_REMOVE - the application wants to free up the given range of + * pages and associated backing store. + * MADV_DONTFORK - omit this area from child's address space when forking: + * typically, to avoid COWing pages pinned by get_user_pages(). + * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking. + * MADV_WIPEONFORK - present the child process with zero-filled memory in this + * range after a fork. + * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK + * MADV_HWPOISON - trigger memory error handler as if the given memory range + * were corrupted by unrecoverable hardware memory failure. + * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory. + * MADV_MERGEABLE - the application recommends that KSM try to merge pages in + * this area with pages of identical content from other such areas. + * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others. + * MADV_HUGEPAGE - the application wants to back the given range by transparent + * huge pages in the future. Existing pages might be coalesced and + * new pages might be allocated as THP. + * MADV_NOHUGEPAGE - mark the given range as not worth being backed by + * transparent huge pages so the existing pages will not be + * coalesced into THP and new pages will not be allocated as THP. + * MADV_DONTDUMP - the application wants to prevent pages in the given range + * from being included in its core dump. + * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump. + * + * return values: + * zero - success + * -EINVAL - start + len < 0, start is not page-aligned, + * "behavior" is not a valid value, or application + * is attempting to release locked or shared pages, + * or the specified address range includes file, Huge TLB, + * MAP_SHARED or VMPFNMAP range. + * -ENOMEM - addresses in the specified range are not currently + * mapped, or are outside the AS of the process. + * -EIO - an I/O error occurred while paging in data. + * -EBADF - map exists, but area maps something that isn't a file. + * -EAGAIN - a kernel resource was temporarily unavailable. + */ +SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) +{ + return madvise_core(current, start, len_in, behavior); +}
This patch factor out madvise's core functionality so that upcoming patch can reuse it without duplication. It shouldn't change any behavior. Signed-off-by: Minchan Kim <minchan@kernel.org> --- mm/madvise.c | 168 +++++++++++++++++++++++++++------------------------ 1 file changed, 89 insertions(+), 79 deletions(-)