Message ID | 1557844195-18882-1-git-send-email-rppt@linux.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mm/gup: continue VM_FAULT_RETRY processing event for pre-faults | expand |
On Tue, May 14, 2019 at 7:32 AM Mike Rapoport <rppt@linux.ibm.com> wrote: > > When get_user_pages*() is called with pages = NULL, the processing of > VM_FAULT_RETRY terminates early without actually retrying to fault-in all > the pages. > > If the pages in the requested range belong to a VMA that has userfaultfd > registered, handle_userfault() returns VM_FAULT_RETRY *after* user space > has populated the page, but for the gup pre-fault case there's no actual > retry and the caller will get no pages although they are present. > > This issue was uncovered when running post-copy memory restore in CRIU > after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if > copy_fpstate_to_sigframe() fails"). > > After this change, the copying of FPU state to the sigframe switched from > copy_to_user() variants which caused a real page fault to get_user_pages() > with pages parameter set to NULL. > > In post-copy mode of CRIU, the destination memory is managed with > userfaultfd and lack of the retry for pre-fault case in get_user_pages() > causes a crash of the restored process. > > Making the pre-fault behavior of get_user_pages() the same as the "normal" > one fixes the issue. > Tested-by: Andrei Vagin <avagin@gmail.com> https://travis-ci.org/avagin/linux/builds/533184940 > Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails") > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> > --- > mm/gup.c | 15 ++++++++------- > 1 file changed, 8 insertions(+), 7 deletions(-) > > diff --git a/mm/gup.c b/mm/gup.c > index 91819b8..c32ae5a 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -936,10 +936,6 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, > BUG_ON(ret >= nr_pages); > } > > - if (!pages) > - /* If it's a prefault don't insist harder */ > - return ret; > - > if (ret > 0) { > nr_pages -= ret; > pages_done += ret; > @@ -955,8 +951,12 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, > pages_done = ret; > break; > } > - /* VM_FAULT_RETRY triggered, so seek to the faulting offset */ > - pages += ret; > + /* > + * VM_FAULT_RETRY triggered, so seek to the faulting offset. > + * For the prefault case (!pages) we only update counts. > + */ > + if (likely(pages)) > + pages += ret; > start += ret << PAGE_SHIFT; > > /* > @@ -979,7 +979,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, > pages_done++; > if (!nr_pages) > break; > - pages++; > + if (likely(pages)) > + pages++; > start += PAGE_SIZE; > } > if (lock_dropped && *locked) { > -- > 2.7.4 >
Hi, Any comments on this? On Tue, May 14, 2019 at 05:29:55PM +0300, Mike Rapoport wrote: > When get_user_pages*() is called with pages = NULL, the processing of > VM_FAULT_RETRY terminates early without actually retrying to fault-in all > the pages. > > If the pages in the requested range belong to a VMA that has userfaultfd > registered, handle_userfault() returns VM_FAULT_RETRY *after* user space > has populated the page, but for the gup pre-fault case there's no actual > retry and the caller will get no pages although they are present. > > This issue was uncovered when running post-copy memory restore in CRIU > after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if > copy_fpstate_to_sigframe() fails"). > > After this change, the copying of FPU state to the sigframe switched from > copy_to_user() variants which caused a real page fault to get_user_pages() > with pages parameter set to NULL. > > In post-copy mode of CRIU, the destination memory is managed with > userfaultfd and lack of the retry for pre-fault case in get_user_pages() > causes a crash of the restored process. > > Making the pre-fault behavior of get_user_pages() the same as the "normal" > one fixes the issue. > > Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails") > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> > --- > mm/gup.c | 15 ++++++++------- > 1 file changed, 8 insertions(+), 7 deletions(-) > > diff --git a/mm/gup.c b/mm/gup.c > index 91819b8..c32ae5a 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -936,10 +936,6 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, > BUG_ON(ret >= nr_pages); > } > > - if (!pages) > - /* If it's a prefault don't insist harder */ > - return ret; > - > if (ret > 0) { > nr_pages -= ret; > pages_done += ret; > @@ -955,8 +951,12 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, > pages_done = ret; > break; > } > - /* VM_FAULT_RETRY triggered, so seek to the faulting offset */ > - pages += ret; > + /* > + * VM_FAULT_RETRY triggered, so seek to the faulting offset. > + * For the prefault case (!pages) we only update counts. > + */ > + if (likely(pages)) > + pages += ret; > start += ret << PAGE_SHIFT; > > /* > @@ -979,7 +979,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, > pages_done++; > if (!nr_pages) > break; > - pages++; > + if (likely(pages)) > + pages++; > start += PAGE_SIZE; > } > if (lock_dropped && *locked) { > -- > 2.7.4 >
On Tue, 14 May 2019 17:29:55 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote: > When get_user_pages*() is called with pages = NULL, the processing of > VM_FAULT_RETRY terminates early without actually retrying to fault-in all > the pages. > > If the pages in the requested range belong to a VMA that has userfaultfd > registered, handle_userfault() returns VM_FAULT_RETRY *after* user space > has populated the page, but for the gup pre-fault case there's no actual > retry and the caller will get no pages although they are present. > > This issue was uncovered when running post-copy memory restore in CRIU > after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if > copy_fpstate_to_sigframe() fails"). > > After this change, the copying of FPU state to the sigframe switched from > copy_to_user() variants which caused a real page fault to get_user_pages() > with pages parameter set to NULL. You're saying that argument buf_fx in copy_fpstate_to_sigframe() is NULL? If so was that expected by the (now cc'ed) developers of d9c9ce34ed5c8923 ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")? It seems rather odd. copy_fpregs_to_sigframe() doesn't look like it's expecting a NULL argument. Also, I wonder if copy_fpstate_to_sigframe() would be better using fault_in_pages_writeable() rather than get_user_pages_unlocked(). That seems like it operates at a more suitable level and I guess it will fix this issue also. > In post-copy mode of CRIU, the destination memory is managed with > userfaultfd and lack of the retry for pre-fault case in get_user_pages() > causes a crash of the restored process. > > Making the pre-fault behavior of get_user_pages() the same as the "normal" > one fixes the issue. Should this be backported into -stable trees? > Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails") > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
On 2019-05-22 12:21:13 [-0700], Andrew Morton wrote: > On Tue, 14 May 2019 17:29:55 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote: > > > When get_user_pages*() is called with pages = NULL, the processing of > > VM_FAULT_RETRY terminates early without actually retrying to fault-in all > > the pages. > > > > If the pages in the requested range belong to a VMA that has userfaultfd > > registered, handle_userfault() returns VM_FAULT_RETRY *after* user space > > has populated the page, but for the gup pre-fault case there's no actual > > retry and the caller will get no pages although they are present. > > > > This issue was uncovered when running post-copy memory restore in CRIU > > after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if > > copy_fpstate_to_sigframe() fails"). > > > > After this change, the copying of FPU state to the sigframe switched from > > copy_to_user() variants which caused a real page fault to get_user_pages() > > with pages parameter set to NULL. > > You're saying that argument buf_fx in copy_fpstate_to_sigframe() is NULL? buf_fx is user stack pointer and it should not be NULL. > If so was that expected by the (now cc'ed) developers of > d9c9ce34ed5c8923 ("x86/fpu: Fault-in user stack if > copy_fpstate_to_sigframe() fails")? > > It seems rather odd. copy_fpregs_to_sigframe() doesn't look like it's > expecting a NULL argument. exactly, this is not expected. > Also, I wonder if copy_fpstate_to_sigframe() would be better using > fault_in_pages_writeable() rather than get_user_pages_unlocked(). That > seems like it operates at a more suitable level and I guess it will fix > this issue also. It looks, like fault_in_pages_writeable() would work. If this is the recommendation from the MM department than I can switch to that. > > In post-copy mode of CRIU, the destination memory is managed with > > userfaultfd and lack of the retry for pre-fault case in get_user_pages() > > causes a crash of the restored process. > > > > Making the pre-fault behavior of get_user_pages() the same as the "normal" > > one fixes the issue. > > Should this be backported into -stable trees? Sebastian
(added kvm) On Wed, May 22, 2019 at 12:21:13PM -0700, Andrew Morton wrote: > On Tue, 14 May 2019 17:29:55 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote: > > > When get_user_pages*() is called with pages = NULL, the processing of > > VM_FAULT_RETRY terminates early without actually retrying to fault-in all > > the pages. > > > > If the pages in the requested range belong to a VMA that has userfaultfd > > registered, handle_userfault() returns VM_FAULT_RETRY *after* user space > > has populated the page, but for the gup pre-fault case there's no actual > > retry and the caller will get no pages although they are present. > > > > This issue was uncovered when running post-copy memory restore in CRIU > > after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if > > copy_fpstate_to_sigframe() fails"). > > > > After this change, the copying of FPU state to the sigframe switched from > > copy_to_user() variants which caused a real page fault to get_user_pages() > > with pages parameter set to NULL. > > You're saying that argument buf_fx in copy_fpstate_to_sigframe() is NULL? Apparently I haven't explained well. The 'pages' parameter in the call to get_user_pages_unlocked() is NULL. > If so was that expected by the (now cc'ed) developers of > d9c9ce34ed5c8923 ("x86/fpu: Fault-in user stack if > copy_fpstate_to_sigframe() fails")? > > It seems rather odd. copy_fpregs_to_sigframe() doesn't look like it's > expecting a NULL argument. > > Also, I wonder if copy_fpstate_to_sigframe() would be better using > fault_in_pages_writeable() rather than get_user_pages_unlocked(). That > seems like it operates at a more suitable level and I guess it will fix > this issue also. If I understand correctly, one of the points of d9c9ce34ed5c8923 ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails") was to to avoid page faults, hence the use of get_user_pages(). With fault_in_pages_writeable() there might be a page fault, unless I've completely mistaken. Unrelated to copy_fpstate_to_sigframe(), the issue could happen if any call to get_user_pages() with pages parameter set to NULL tries to access userfaultfd-managed memory. Currently, there are 4 in tree users: arch/x86/kernel/fpu/signal.c:198:8-31: -> gup with !pages arch/x86/mm/mpx.c:423:11-25: -> gup with !pages virt/kvm/async_pf.c:90:1-22: -> gup with !pages virt/kvm/kvm_main.c:1437:6-20: -> gup with !pages I don't know if anybody is using mpx with uffd and anyway mpx seems to go away. As for KVM, I think that post-copy live migration of L2 guest might trigger that as well. Not sure though, I'm not really familiar with KVM code. > > In post-copy mode of CRIU, the destination memory is managed with > > userfaultfd and lack of the retry for pre-fault case in get_user_pages() > > causes a crash of the restored process. > > > > Making the pre-fault behavior of get_user_pages() the same as the "normal" > > one fixes the issue. > > Should this be backported into -stable trees? I think that it depends on whether KVM affected by this or not. > > Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails") > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> > >
On Wed, 22 May 2019 23:38:29 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote: > (added kvm) > > On Wed, May 22, 2019 at 12:21:13PM -0700, Andrew Morton wrote: > > On Tue, 14 May 2019 17:29:55 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote: > > > > > When get_user_pages*() is called with pages = NULL, the processing of > > > VM_FAULT_RETRY terminates early without actually retrying to fault-in all > > > the pages. > > > > > > If the pages in the requested range belong to a VMA that has userfaultfd > > > registered, handle_userfault() returns VM_FAULT_RETRY *after* user space > > > has populated the page, but for the gup pre-fault case there's no actual > > > retry and the caller will get no pages although they are present. > > > > > > This issue was uncovered when running post-copy memory restore in CRIU > > > after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if > > > copy_fpstate_to_sigframe() fails"). > > > > > > After this change, the copying of FPU state to the sigframe switched from > > > copy_to_user() variants which caused a real page fault to get_user_pages() > > > with pages parameter set to NULL. > > > > You're saying that argument buf_fx in copy_fpstate_to_sigframe() is NULL? > > Apparently I haven't explained well. The 'pages' parameter in the call to > get_user_pages_unlocked() is NULL. Doh. > > If so was that expected by the (now cc'ed) developers of > > d9c9ce34ed5c8923 ("x86/fpu: Fault-in user stack if > > copy_fpstate_to_sigframe() fails")? > > > > It seems rather odd. copy_fpregs_to_sigframe() doesn't look like it's > > expecting a NULL argument. > > > > Also, I wonder if copy_fpstate_to_sigframe() would be better using > > fault_in_pages_writeable() rather than get_user_pages_unlocked(). That > > seems like it operates at a more suitable level and I guess it will fix > > this issue also. > > If I understand correctly, one of the points of d9c9ce34ed5c8923 ("x86/fpu: > Fault-in user stack if copy_fpstate_to_sigframe() fails") was to to avoid > page faults, hence the use of get_user_pages(). > > With fault_in_pages_writeable() there might be a page fault, unless I've > completely mistaken. > > Unrelated to copy_fpstate_to_sigframe(), the issue could happen if any call > to get_user_pages() with pages parameter set to NULL tries to access > userfaultfd-managed memory. Currently, there are 4 in tree users: > > arch/x86/kernel/fpu/signal.c:198:8-31: -> gup with !pages > arch/x86/mm/mpx.c:423:11-25: -> gup with !pages > virt/kvm/async_pf.c:90:1-22: -> gup with !pages > virt/kvm/kvm_main.c:1437:6-20: -> gup with !pages OK. > I don't know if anybody is using mpx with uffd and anyway mpx seems to go > away. > > As for KVM, I think that post-copy live migration of L2 guest might trigger > that as well. Not sure though, I'm not really familiar with KVM code. > > > > In post-copy mode of CRIU, the destination memory is managed with > > > userfaultfd and lack of the retry for pre-fault case in get_user_pages() > > > causes a crash of the restored process. > > > > > > Making the pre-fault behavior of get_user_pages() the same as the "normal" > > > one fixes the issue. > > > > Should this be backported into -stable trees? > > I think that it depends on whether KVM affected by this or not. > How do we determine this? I guess it doesn't matter much.
On Wed, 22 May 2019, Sebastian Andrzej Siewior wrote: > On 2019-05-22 12:21:13 [-0700], Andrew Morton wrote: > > On Tue, 14 May 2019 17:29:55 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote: > > > > > When get_user_pages*() is called with pages = NULL, the processing of > > > VM_FAULT_RETRY terminates early without actually retrying to fault-in all > > > the pages. > > > > > > If the pages in the requested range belong to a VMA that has userfaultfd > > > registered, handle_userfault() returns VM_FAULT_RETRY *after* user space > > > has populated the page, but for the gup pre-fault case there's no actual > > > retry and the caller will get no pages although they are present. > > > > > > This issue was uncovered when running post-copy memory restore in CRIU > > > after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if > > > copy_fpstate_to_sigframe() fails"). I've been getting unexplained segmentation violations, and "make" giving up early, when running kernel builds under swapping memory pressure: no CRIU involved. Bisected last night to that same x86/fpu commit, not itself guilty, but suffering from the odd behavior of get_user_pages_unlocked() giving up too early. (I wondered at first if copy_fpstate_to_sigframe() ought to retry if non-negative ret < nr_pages, but no, that would be wrong: a present page followed by an invalid area would repeatedly return 1 for nr_pages 2.) Cc'ing Pavel, who's been having segfault trouble in emacs: maybe same? > > > > > > After this change, the copying of FPU state to the sigframe switched from > > > copy_to_user() variants which caused a real page fault to get_user_pages() > > > with pages parameter set to NULL. ... > > Also, I wonder if copy_fpstate_to_sigframe() would be better using > > fault_in_pages_writeable() rather than get_user_pages_unlocked(). That > > seems like it operates at a more suitable level and I guess it will fix > > this issue also. > > It looks, like fault_in_pages_writeable() would work. If this is the > recommendation from the MM department than I can switch to that. I've now run a couple of hours of load successfully with Mike's patch to GUP, no problem; but whatever the merits of that patch in general, I agree with Andrew that fault_in_pages_writeable() seems altogether more appropriate for copy_fpstate_to_sigframe(), and have now run a couple of hours of load successfully with this instead (rewrite to taste): --- 5.2-rc1/arch/x86/kernel/fpu/signal.c +++ linux/arch/x86/kernel/fpu/signal.c @@ -3,6 +3,7 @@ * FPU signal frame handling routines. */ +#include <linux/pagemap.h> #include <linux/compat.h> #include <linux/cpu.h> @@ -189,15 +190,7 @@ retry: fpregs_unlock(); if (ret) { - int aligned_size; - int nr_pages; - - aligned_size = offset_in_page(buf_fx) + fpu_user_xstate_size; - nr_pages = DIV_ROUND_UP(aligned_size, PAGE_SIZE); - - ret = get_user_pages_unlocked((unsigned long)buf_fx, nr_pages, - NULL, FOLL_WRITE); - if (ret == nr_pages) + if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size)) goto retry; return -EFAULT; } (I did wonder whether there needs to be an access_ok() check on buf_fx; but if so, then I think it would already have been needed before the earlier copy_fpregs_to_sigframe(); but I didn't get deep enough into that to be sure, nor into whether access_ok() check on buf covers buf_fx.) Hugh > > > > In post-copy mode of CRIU, the destination memory is managed with > > > userfaultfd and lack of the retry for pre-fault case in get_user_pages() > > > causes a crash of the restored process. > > > > > > Making the pre-fault behavior of get_user_pages() the same as the "normal" > > > one fixes the issue. > > > > Should this be backported into -stable trees? > > Sebastian
On 2019-05-24 15:22:51 [-0700], Hugh Dickins wrote: > I've now run a couple of hours of load successfully with Mike's patch > to GUP, no problem; but whatever the merits of that patch in general, > I agree with Andrew that fault_in_pages_writeable() seems altogether > more appropriate for copy_fpstate_to_sigframe(), and have now run a > couple of hours of load successfully with this instead (rewrite to taste): so this patch instead of Mike's GUP patch fixes the issue you observed? Is this just a taste question or limitation of the function in general? I'm asking because it has been suggested and is used in MPX code (in the signal path but .mmap) and I'm not aware of any limitation. But as I wrote earlier to akpm, if the MM folks suggest to use this instead I am happy to switch. > --- 5.2-rc1/arch/x86/kernel/fpu/signal.c > +++ linux/arch/x86/kernel/fpu/signal.c > @@ -3,6 +3,7 @@ > * FPU signal frame handling routines. > */ > > +#include <linux/pagemap.h> > #include <linux/compat.h> > #include <linux/cpu.h> > > @@ -189,15 +190,7 @@ retry: > fpregs_unlock(); > > if (ret) { > - int aligned_size; > - int nr_pages; > - > - aligned_size = offset_in_page(buf_fx) + fpu_user_xstate_size; > - nr_pages = DIV_ROUND_UP(aligned_size, PAGE_SIZE); > - > - ret = get_user_pages_unlocked((unsigned long)buf_fx, nr_pages, > - NULL, FOLL_WRITE); > - if (ret == nr_pages) > + if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size)) > goto retry; > return -EFAULT; > } > > (I did wonder whether there needs to be an access_ok() check on buf_fx; > but if so, then I think it would already have been needed before the > earlier copy_fpregs_to_sigframe(); but I didn't get deep enough into > that to be sure, nor into whether access_ok() check on buf covers buf_fx.) There is an access_ok() at the begin of copy_fpregs_to_sigframe(). The memory is allocated from user's stack and there is (later) an access_ok() for the whole region (which can be more than the memory used by the FPU code). > Hugh Sebastian
On Sat, 25 May 2019, Sebastian Andrzej Siewior wrote: > On 2019-05-24 15:22:51 [-0700], Hugh Dickins wrote: > > I've now run a couple of hours of load successfully with Mike's patch > > to GUP, no problem; but whatever the merits of that patch in general, > > I agree with Andrew that fault_in_pages_writeable() seems altogether > > more appropriate for copy_fpstate_to_sigframe(), and have now run a > > couple of hours of load successfully with this instead (rewrite to taste): > > so this patch instead of Mike's GUP patch fixes the issue you observed? Yes. > Is this just a taste question or limitation of the function in general? I'd say it's just a taste question. Though the the fact that your usage showed up a bug in the get_user_pages_unlocked() implementation, demanding a fix, does indicate that it's a more fragile and complex route, better avoided if there's a good simple alternative. If it were not already on your slowpath, I'd also argue fault_in_pages_writeable() is a more efficient way to do it. > > I'm asking because it has been suggested and is used in MPX code (in the > signal path but .mmap) and I'm not aware of any limitation. But as I > wrote earlier to akpm, if the MM folks suggest to use this instead I am > happy to switch. I know nothing of MPX, beyond that Dave Hansen has posted patches to remove that support entirely, so I'm surprised arch/x86/mm/mpx.c is still in the tree. But peering at it now, it looks as if it's using get_user_pages() while holding mmap_sem, whereas you (sensibly enough) used get_user_pages_unlocked() to handle the mmap_sem for you - the trouble with that is that since it knows it's in control of mmap_sem, it feels free to drop it internally, and that takes it down the path of the premature return when pages NULL that Mike is fixing. MPX's get_user_pages() is not free to go that way. > > > --- 5.2-rc1/arch/x86/kernel/fpu/signal.c > > +++ linux/arch/x86/kernel/fpu/signal.c > > @@ -3,6 +3,7 @@ > > * FPU signal frame handling routines. > > */ > > > > +#include <linux/pagemap.h> > > #include <linux/compat.h> > > #include <linux/cpu.h> > > > > @@ -189,15 +190,7 @@ retry: > > fpregs_unlock(); > > > > if (ret) { > > - int aligned_size; > > - int nr_pages; > > - > > - aligned_size = offset_in_page(buf_fx) + fpu_user_xstate_size; > > - nr_pages = DIV_ROUND_UP(aligned_size, PAGE_SIZE); > > - > > - ret = get_user_pages_unlocked((unsigned long)buf_fx, nr_pages, > > - NULL, FOLL_WRITE); > > - if (ret == nr_pages) > > + if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size)) > > goto retry; > > return -EFAULT; > > } > > > > (I did wonder whether there needs to be an access_ok() check on buf_fx; > > but if so, then I think it would already have been needed before the > > earlier copy_fpregs_to_sigframe(); but I didn't get deep enough into > > that to be sure, nor into whether access_ok() check on buf covers buf_fx.) > > There is an access_ok() at the begin of copy_fpregs_to_sigframe(). The > memory is allocated from user's stack and there is (later) an > access_ok() for the whole region (which can be more than the memory used > by the FPU code). Yes, but remember I know nothing of this FPU signal code, so I cannot tell whether an access_ok(buf, size) is good enough to cover the range of an access_ok(buf_fx, fpu_user_xstate_size). Your "(later)" worries me a little - I hope you're not writing first and checking the limits later; but what you're doing may be perfectly correct, I'm just too far from understanding the details to say; but raised the matter because (I think) get_user_pages_unlocked() would entail an access_ok() check where fault_in_pages_writable() would not. Hugh
On Fri 2019-05-24 15:22:51, Hugh Dickins wrote: > On Wed, 22 May 2019, Sebastian Andrzej Siewior wrote: > > On 2019-05-22 12:21:13 [-0700], Andrew Morton wrote: > > > On Tue, 14 May 2019 17:29:55 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote: > > > > > > > When get_user_pages*() is called with pages = NULL, the processing of > > > > VM_FAULT_RETRY terminates early without actually retrying to fault-in all > > > > the pages. > > > > > > > > If the pages in the requested range belong to a VMA that has userfaultfd > > > > registered, handle_userfault() returns VM_FAULT_RETRY *after* user space > > > > has populated the page, but for the gup pre-fault case there's no actual > > > > retry and the caller will get no pages although they are present. > > > > > > > > This issue was uncovered when running post-copy memory restore in CRIU > > > > after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if > > > > copy_fpstate_to_sigframe() fails"). > > I've been getting unexplained segmentation violations, and "make" giving > up early, when running kernel builds under swapping memory pressure: no > CRIU involved. > > Bisected last night to that same x86/fpu commit, not itself guilty, but > suffering from the odd behavior of get_user_pages_unlocked() giving up > too early. > > (I wondered at first if copy_fpstate_to_sigframe() ought to retry if > non-negative ret < nr_pages, but no, that would be wrong: a present page > followed by an invalid area would repeatedly return 1 for nr_pages 2.) > > Cc'ing Pavel, who's been having segfault trouble in emacs: maybe same? The emacs segfault was always during process exit. This sounds different... I don't see problems with make. But its true that at least one of affected machines uses swap heavily. Best regards, Pavel
On 2019-05-25 11:09:15 [-0700], Hugh Dickins wrote: > On Sat, 25 May 2019, Sebastian Andrzej Siewior wrote: > > On 2019-05-24 15:22:51 [-0700], Hugh Dickins wrote: > > > I've now run a couple of hours of load successfully with Mike's patch > > > to GUP, no problem; but whatever the merits of that patch in general, > > > I agree with Andrew that fault_in_pages_writeable() seems altogether > > > more appropriate for copy_fpstate_to_sigframe(), and have now run a > > > couple of hours of load successfully with this instead (rewrite to taste): > > > > so this patch instead of Mike's GUP patch fixes the issue you observed? > > Yes. > > > Is this just a taste question or limitation of the function in general? > > I'd say it's just a taste question. Though the the fact that your > usage showed up a bug in the get_user_pages_unlocked() implementation, > demanding a fix, does indicate that it's a more fragile and complex > route, better avoided if there's a good simple alternative. If it were > not already on your slowpath, I'd also argue fault_in_pages_writeable() > is a more efficient way to do it. Okay. The GUP functions are not properly documented for my taste. There is no indication whether or not the mm_sem has to be acquired prior invoking it. Following the call chain of get_user_pages() I ended up in __get_user_pages_locked() `locked = NULL' indicated that mm_sem is no acquired and then I saw this: | if (!locked) | /* VM_FAULT_RETRY couldn't trigger, bypass */ | return ret; kind of suggesting that it is okay to invoke it without holding the mm_sem prefault. It passed a few tests and then https://lkml.kernel.org/r/1556657902.6132.13.camel@lca.pw happened. After that, I switched to the locked variant and the problem disappeared (also I noticed that MPX code is invoked within ->mmap()). > > I'm asking because it has been suggested and is used in MPX code (in the > > signal path but .mmap) and I'm not aware of any limitation. But as I > > wrote earlier to akpm, if the MM folks suggest to use this instead I am > > happy to switch. > > I know nothing of MPX, beyond that Dave Hansen has posted patches to > remove that support entirely, so I'm surprised arch/x86/mm/mpx.c is > still in the tree. I need to poke at that. I has been removed but then KVM folks complained that they kind of depend on that if it has been exposed to the guest. We need to fade it out slowly… > But peering at it now, it looks as if it's using > get_user_pages() while holding mmap_sem, whereas you (sensibly enough) > used get_user_pages_unlocked() to handle the mmap_sem for you - > the trouble with that is that since it knows it's in control of > mmap_sem, it feels free to drop it internally, and that takes it > down the path of the premature return when pages NULL that Mike is > fixing. MPX's get_user_pages() is not free to go that way. oki. > > > --- 5.2-rc1/arch/x86/kernel/fpu/signal.c > > > +++ linux/arch/x86/kernel/fpu/signal.c > > > @@ -3,6 +3,7 @@ > > > * FPU signal frame handling routines. > > > */ > > > > > > +#include <linux/pagemap.h> > > > #include <linux/compat.h> > > > #include <linux/cpu.h> > > > > > > @@ -189,15 +190,7 @@ retry: > > > fpregs_unlock(); > > > > > > if (ret) { > > > - int aligned_size; > > > - int nr_pages; > > > - > > > - aligned_size = offset_in_page(buf_fx) + fpu_user_xstate_size; > > > - nr_pages = DIV_ROUND_UP(aligned_size, PAGE_SIZE); > > > - > > > - ret = get_user_pages_unlocked((unsigned long)buf_fx, nr_pages, > > > - NULL, FOLL_WRITE); > > > - if (ret == nr_pages) > > > + if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size)) > > > goto retry; > > > return -EFAULT; > > > } > > > > > > (I did wonder whether there needs to be an access_ok() check on buf_fx; > > > but if so, then I think it would already have been needed before the > > > earlier copy_fpregs_to_sigframe(); but I didn't get deep enough into > > > that to be sure, nor into whether access_ok() check on buf covers buf_fx.) > > > > There is an access_ok() at the begin of copy_fpregs_to_sigframe(). The > > memory is allocated from user's stack and there is (later) an > > access_ok() for the whole region (which can be more than the memory used > > by the FPU code). > > Yes, but remember I know nothing of this FPU signal code, so I cannot > tell whether an access_ok(buf, size) is good enough to cover the range > of an access_ok(buf_fx, fpu_user_xstate_size). yes, because size >= fpu_user_xstate_size > Your "(later)" worries me a little - I hope you're not writing first > and checking the limits later; but what you're doing may be perfectly > correct, I'm just too far from understanding the details to say; but > raised the matter because (I think) get_user_pages_unlocked() would > entail an access_ok() check where fault_in_pages_writable() would not. no, we first check the range and then write. It is later checked again after the size has been extended. > Hugh Sebastian
On Sun, 26 May 2019, Sebastian Andrzej Siewior wrote: > > Okay. The GUP functions are not properly documented for my taste. There > is no indication whether or not the mm_sem has to be acquired prior > invoking it. Following the call chain of get_user_pages() I ended up in > __get_user_pages_locked() `locked = NULL' indicated that mm_sem is no > acquired and then I saw this: > | if (!locked) > | /* VM_FAULT_RETRY couldn't trigger, bypass */ > | return ret; > > kind of suggesting that it is okay to invoke it without holding the > mm_sem prefault. It passed a few tests and then > https://lkml.kernel.org/r/1556657902.6132.13.camel@lca.pw > > happened. After that, I switched to the locked variant and the problem > disappeared (also I noticed that MPX code is invoked within ->mmap()). Ah, I wasn't following what happened here while it was in linux-next. I certainly agree that all the variants are very confusing, and the matter of mmap_sem particularly so. Because it used to be a simple rule that you needed to hold mmap_sem to call get_user_pages(); but that can be so contended that get_user_pages_fast(), and VM_FAULT_RETRY, optimizations came in, then interfaces like get_user_pages_locked() and get_user_pages_unlocked(). I think when you say that you "switched to the locked variant", you're writing of when you switched to using get_user_pages_unlocked(), which takes and releases the mmap_sem itself: yes, using get_user_pages() without holding mmap_sem would have been open to serious errors. The documentation in comments above get_user_pages_locked() and get_user_pages_unlocked() looks rather good to me, actually. But this is all good reason to use the less challenging fault_in_pages_writable() instead. (And also saves all those GUP developers, who from time to time have to search through all users of GUP in one form or another, to make this or that improvement, from having to visit arch/x86/kernel/fpu/signal.c: they will quietly thank you.) Hugh
Hello everyone, On Wed, May 22, 2019 at 02:18:03PM -0700, Andrew Morton wrote: > > arch/x86/kernel/fpu/signal.c:198:8-31: -> gup with !pages This simply had not to return -EFAULT if ret < nr_pages.. but ret >= 0. Instead it did: if (ret == nr_pages) goto retry; return -EFAULT; That was the bug and the correct code would have been: ret = get_user_pages_unlocked(pages=NULL) if (ret < 0) return -EFAULT; goto retry; This eventually should have worked fine but it was less efficient because it's still acting in a full prefault mode and it just tells GUP that pages = NULL and so all it is trying to do is to issue the blocking I/O after the mmap_sem has been released already. Overall the solution applied in commit b81ff1013eb8eef2934ca7e8cf53d553c1029e84 looks nicer. Alternatively it could have used down_read(); get_user_pages(); which prevents get_user_pages to drop the mmap_sem and break the loop if some blocking I/O had to be executed outside mmap_sem. But that would have the side effect of breaking userfaultfd (uffd requires gup_locked/unlocked and FAULT_FLAG_ALLOW_RETRY to be set in the fault flags). Eventually we need to allow VM_FAULT_RETRY to be returned even if FOLL_TRIED is set, so in theory get_user_pages_unlocked(pages=NULL) in a loop must eventually stop returning VM_FAULT_RETRY. FOLL_TRIED could still disambiguate if VM_FAULT_RETRY should or should not be returned so that it is only returned only if it cannnot be avoided (i.e. userfaultfd case). With gup_unlocked(pages=NULL) however all we are interested about is to execute the blocking I/O and we don't care to map anything in the pagetables. A later page fault has to happen anyway for sure because pages was == NULL, it just needs to be a fast one. > > arch/x86/mm/mpx.c:423:11-25: -> gup with !pages Note that get_user_pages is never affected by whatever change after the below, !locked check in gup_locked: if (!locked) /* VM_FAULT_RETRY couldn't trigger, bypass */ return ret; The bypass means when locked is NULL, there is a 1:1 bypass from __get_user_pages<->get_user_pages and the VM_FAULT_RETRY dance never runs. get_user_pages in fact can't support userfaultfd, which makes ptrace and core dump and the hwpoison non blocking in VM_FAULT_RETRY. All places that must support userfaultfd must use get_user_pages_unlocked/locked or somehow end up with FAULT_FLAG_ALLOW_RETRY set in the fault flags. > > virt/kvm/async_pf.c:90:1-22: -> gup with !pages Didn't this get slowed down with the commit df17277b2a85c00f5710e33ce238ba4114687a28? I mean it was a feature not a bug to skip that additional __get_user_pages(FOLL_TRIED). > > virt/kvm/kvm_main.c:1437:6-20: -> gup with !pages Like for mpx.c get_user_pages is agnostic to all these gup_locked changes because it sets locked = NULL, it couldn't break the loop early because it couldn't return VM_FAULT_RETRY. > > OK. Commit df17277b2a85c00f5710e33ce238ba4114687a28 is now applied. So I think the effect it has is to make async_pf.c slower and we didn't solve anything. There are two __get_user_pages: 1) ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas, locked); if (called by get_user_pages) return ret; /* bypass the whole VM_FAULT_RETRY logic */ *locked = 1; lock_dropped = true; down_read(&mm->mmap_sem); 2) ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED, pages, NULL, NULL); The problem introduced is that 2) is getting executed with pages==NULL but there's no point to ever run 2) with pages = NULL. async_pf especially uses nr_pages == 1, so it couldn't get any more optimal than it already was. Before df17277b2a85c00f5710e33ce238ba4114687a28 we broke the loop as soon as the first __get_user_pages returned VM_FAULT_RETRY. We can argue if we shouldn't have broken the loop and we should have kept executing only the first __get_user_pages (marked "1)" above) for the whole range, but nr_pages == 1 is common and in such case there's no difference between the two behaviors. The prefetch callers with nr_pages == 1, didn't even check the retval at all: down_read(&mm->mmap_sem); get_user_pages_remote(NULL, mm, addr, 1, FOLL_WRITE, NULL, NULL, &locked); ^^^^ pages NULL // retval ignored It should probably check for retval < 0... but the fault will be retried for good later still with get_user_pages_unlocked() but with pages != NULL, so it'll find out later if it's a segfault. Now if we change the code to skip the second __get_user_pages it's not clear if we can return nr_pages because we may still not have faulted in the whole range in the pagetables. I guess we could still return nr_pages even if we scanned the whole range with only the first of the two __get_user_pages. However if you had mmu notifier registered in the range nr_pages would have stronger semantics if you would execute 2) too, but then without pages array not-NULL such stronger semantics cannot be taken advantage of anyway, because you don't know where those pages are and you can't map them in a secondary MMU even if you execute the line 2). I personally preferred the older code which should at least in theory run faster, it just required documentation that if "pages == NULL" we'll break the loop early because it has to be a "prefetch" attempt and it must be retried until nr_pages == ret. Either that or add a "continue" to skip the second __get_user_pages in line 2) above and then returning nr_pages to indicate VM_FAULT_RETRY may very well have been returned on all page offsets in the virtual range. That will behave the same for async_pf.c because nr_pages == 1. When VM_FAULT_RETRY is returned all I/O should be complete (no matter if network or disk with userfaultfd or just pagecache readpage on filebacked kernel faults) and only a minor fault is required to obtain the page. But it is better to defer that second minor fault to the point where "pages" already become != NULL, so we end up calling __get_user_pages 2 times instead of 3 times. Thanks, Andrea
diff --git a/mm/gup.c b/mm/gup.c index 91819b8..c32ae5a 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -936,10 +936,6 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, BUG_ON(ret >= nr_pages); } - if (!pages) - /* If it's a prefault don't insist harder */ - return ret; - if (ret > 0) { nr_pages -= ret; pages_done += ret; @@ -955,8 +951,12 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, pages_done = ret; break; } - /* VM_FAULT_RETRY triggered, so seek to the faulting offset */ - pages += ret; + /* + * VM_FAULT_RETRY triggered, so seek to the faulting offset. + * For the prefault case (!pages) we only update counts. + */ + if (likely(pages)) + pages += ret; start += ret << PAGE_SHIFT; /* @@ -979,7 +979,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, pages_done++; if (!nr_pages) break; - pages++; + if (likely(pages)) + pages++; start += PAGE_SIZE; } if (lock_dropped && *locked) {
When get_user_pages*() is called with pages = NULL, the processing of VM_FAULT_RETRY terminates early without actually retrying to fault-in all the pages. If the pages in the requested range belong to a VMA that has userfaultfd registered, handle_userfault() returns VM_FAULT_RETRY *after* user space has populated the page, but for the gup pre-fault case there's no actual retry and the caller will get no pages although they are present. This issue was uncovered when running post-copy memory restore in CRIU after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails"). After this change, the copying of FPU state to the sigframe switched from copy_to_user() variants which caused a real page fault to get_user_pages() with pages parameter set to NULL. In post-copy mode of CRIU, the destination memory is managed with userfaultfd and lack of the retry for pre-fault case in get_user_pages() causes a crash of the restored process. Making the pre-fault behavior of get_user_pages() the same as the "normal" one fixes the issue. Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails") Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> --- mm/gup.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-)