Message ID | 20220712130542.18836-1-linmiaohe@huawei.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pte | expand |
On 07/12/22 21:05, Miaohe Lin wrote: > In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page > cache are installed in the ptes. But hugepage_add_new_anon_rmap is called > for them mistakenly because they're not vm_shared. This will corrupt the > page->mapping used by page cache code. > > Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl") > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> > --- > mm/hugetlb.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) This looks correct to me. Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> However, I am having a hard time wrapping my head around how UFFDIO_CONTINUE should work on non-anon private mappings. For example, a private mapping of a hugetlbfs file. I think we just map the page in the file/cache and do not set the write bit in the pte. So, yes we would want page_dup_file_rmap() in this case as shown below. Adding Axel and Peter on Cc: as they were more involved in adding that code and the design of UFFDIO_CONTINUE.
On 2022/7/13 1:39, Mike Kravetz wrote: > On 07/12/22 21:05, Miaohe Lin wrote: >> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page >> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called >> for them mistakenly because they're not vm_shared. This will corrupt the >> page->mapping used by page cache code. >> >> Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl") >> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> >> --- >> mm/hugetlb.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) > > This looks correct to me. > > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Many thanks for review. > > However, I am having a hard time wrapping my head around how UFFDIO_CONTINUE > should work on non-anon private mappings. For example, a private mapping of > a hugetlbfs file. I think we just map the page in the file/cache and do not > set the write bit in the pte. So, yes we would want page_dup_file_rmap() > in this case as shown below. +1 > > Adding Axel and Peter on Cc: as they were more involved in adding that code > and the design of UFFDIO_CONTINUE. That would be really helpful. Thanks. >
On Tue, Jul 12, 2022 at 10:39:20AM -0700, Mike Kravetz wrote: > On 07/12/22 21:05, Miaohe Lin wrote: > > In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page > > cache are installed in the ptes. But hugepage_add_new_anon_rmap is called > > for them mistakenly because they're not vm_shared. This will corrupt the > > page->mapping used by page cache code. > > > > Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl") > > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> > > --- > > mm/hugetlb.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > This looks correct to me. > > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> > > However, I am having a hard time wrapping my head around how UFFDIO_CONTINUE > should work on non-anon private mappings. For example, a private mapping of > a hugetlbfs file. I think we just map the page in the file/cache and do not > set the write bit in the pte. So, yes we would want page_dup_file_rmap() > in this case as shown below. > > Adding Axel and Peter on Cc: as they were more involved in adding that code > and the design of UFFDIO_CONTINUE. Yes the change makes sense to me too. There's just one thing to check on whether minor mode should support private mappings at all as it's probably not in the major goal of when it's proposed. I don't see why it can't logically, but I think we should have failed the uffdio-register already somewhere before when the vma was private and registered with minor mode. It's just that I cannot quickly find it in the code anywhere.. ideally it should be checked in vma_can_userfault() but it seems not. Axel? PS: the minor mode man page update seems to be still missing.
On Wed, Jul 13, 2022 at 10:24:09AM -0400, Peter Xu wrote: > On Tue, Jul 12, 2022 at 10:39:20AM -0700, Mike Kravetz wrote: > > On 07/12/22 21:05, Miaohe Lin wrote: > > > In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page > > > cache are installed in the ptes. But hugepage_add_new_anon_rmap is called > > > for them mistakenly because they're not vm_shared. This will corrupt the > > > page->mapping used by page cache code. > > > > > > Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl") > > > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> > > > --- > > > mm/hugetlb.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > This looks correct to me. > > > > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> > > > > However, I am having a hard time wrapping my head around how UFFDIO_CONTINUE > > should work on non-anon private mappings. For example, a private mapping of > > a hugetlbfs file. I think we just map the page in the file/cache and do not > > set the write bit in the pte. So, yes we would want page_dup_file_rmap() > > in this case as shown below. > > > > Adding Axel and Peter on Cc: as they were more involved in adding that code > > and the design of UFFDIO_CONTINUE. > > Yes the change makes sense to me too. There's just one thing to check on > whether minor mode should support private mappings at all as it's probably > not in the major goal of when it's proposed. > > I don't see why it can't logically, but I think we should have failed the > uffdio-register already somewhere before when the vma was private and > registered with minor mode. It's just that I cannot quickly find it in the > code anywhere.. ideally it should be checked in vma_can_userfault() but it > seems not. > > Axel? > > PS: the minor mode man page update seems to be still missing. Oh I should have done a pull first on the man-page repo.. From the man page indeed I didn't see anything mentioned on not allowing private mappings. There's the example given on using two mappings for modifying pages but logically that applies to private mappings too - we could have mapped the uffd region with private mappings but the other one shared, then we can modify page caches but later after pte installed it'll trigger cow for writes. So I think we need to confirm whether private mappings are supported. If no, we should be crystal clear in both the code and man page (we probably want a follow up patch to man-page to mention that too?). If yes, we'll need Miaohe's patch and also make sure they're enabled in the current code path. We'll also want to set test_uffdio_minor=1 for "hugetlb" test case in the userfaultfd kselftest (currently it's not there).
On Tue, 12 Jul 2022 21:05:42 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote: > In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page > cache are installed in the ptes. But hugepage_add_new_anon_rmap is called > for them mistakenly because they're not vm_shared. This will corrupt the > page->mapping used by page cache code. Well that sounds bad. And theories on why this has gone unnoticed for over a year? I assume this doesn't have coverage in our selftests? > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -6038,7 +6038,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, > if (!huge_pte_none_mostly(huge_ptep_get(dst_pte))) > goto out_release_unlock; > > - if (vm_shared) { > + if (page_in_pagecache) { > page_dup_file_rmap(page, true); > } else { > ClearHPageRestoreReserve(page); > -- > 2.23.0
I think there is a small mistake in this patch. Consider the non-minor-fault case. We have this block: /* Add shared, newly allocated pages to the page cache. */ if (vm_shared && !is_continue) { /* ... */ } In here, we've added the newly allocated page to the page cache, and we've set this page_in_pagecache flag to true. But we *do not* setup rmap for this page in this block. I think in this case, the patch will cause us to do the wrong thing: we should hugepage_add_new_anon_rmap() further down, but with this patch we dup instead. On Wed, Jul 13, 2022 at 9:10 AM Peter Xu <peterx@redhat.com> wrote: > > On Wed, Jul 13, 2022 at 10:24:09AM -0400, Peter Xu wrote: > > On Tue, Jul 12, 2022 at 10:39:20AM -0700, Mike Kravetz wrote: > > > On 07/12/22 21:05, Miaohe Lin wrote: > > > > In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page > > > > cache are installed in the ptes. But hugepage_add_new_anon_rmap is called > > > > for them mistakenly because they're not vm_shared. This will corrupt the > > > > page->mapping used by page cache code. > > > > > > > > Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl") > > > > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> > > > > --- > > > > mm/hugetlb.c | 2 +- > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > This looks correct to me. > > > > > > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> > > > > > > However, I am having a hard time wrapping my head around how UFFDIO_CONTINUE > > > should work on non-anon private mappings. For example, a private mapping of > > > a hugetlbfs file. I think we just map the page in the file/cache and do not > > > set the write bit in the pte. So, yes we would want page_dup_file_rmap() > > > in this case as shown below. > > > > > > Adding Axel and Peter on Cc: as they were more involved in adding that code > > > and the design of UFFDIO_CONTINUE. > > > > Yes the change makes sense to me too. There's just one thing to check on > > whether minor mode should support private mappings at all as it's probably > > not in the major goal of when it's proposed. > > > > I don't see why it can't logically, but I think we should have failed the > > uffdio-register already somewhere before when the vma was private and > > registered with minor mode. It's just that I cannot quickly find it in the > > code anywhere.. ideally it should be checked in vma_can_userfault() but it > > seems not. > > > > Axel? > > > > PS: the minor mode man page update seems to be still missing. > > Oh I should have done a pull first on the man-page repo.. > > From the man page indeed I didn't see anything mentioned on not allowing > private mappings. There's the example given on using two mappings for > modifying pages but logically that applies to private mappings too - we > could have mapped the uffd region with private mappings but the other one > shared, then we can modify page caches but later after pte installed it'll > trigger cow for writes. > > So I think we need to confirm whether private mappings are supported. If > no, we should be crystal clear in both the code and man page (we probably > want a follow up patch to man-page to mention that too?). If yes, we'll > need Miaohe's patch and also make sure they're enabled in the current code > path. We'll also want to set test_uffdio_minor=1 for "hugetlb" test case > in the userfaultfd kselftest (currently it's not there). So, originally when I proposed minor fault handling, I was expecting to only support VM_SHARED VMAs. Indeed, I too have a hard time imagining how one might benefit from using it with a private mapping. If my memory serves this restriction was relaxed due to feedback on the original RFC proposal [1], essentially on the basis of: why make it more restrictive than it needs to be? Since all we need for a minor fault to happen is for the pages to be in the page cache, that's the only restriction we should have. I don't see why it shouldn't work in principle though. Imagine a scenario where the VM guest's mapping is private, and the memory manager's mapping is shared. I guess in this case, say for a write from the guest: 1. The guest will generate a minor fault 2. The memory manager can modify the page via its shared mapping, and the guest will see those changes 3. After UFFDIO_CONTINUE resolves the fault, the page is CoW-ed, and the memory manager can no longer see the guest's version of the page I'm not really sure *why* you'd want to do this, but it seems like it should work. [1]: https://patchwork.kernel.org/project/linux-mm/patch/20210107190453.3051110-2-axelrasmussen@google.com/ > > -- > Peter Xu >
On 07/13/22 15:46, Axel Rasmussen wrote: > I think there is a small mistake in this patch. > > Consider the non-minor-fault case. We have this block: > > /* Add shared, newly allocated pages to the page cache. */ > if (vm_shared && !is_continue) { > /* ... */ > } > > In here, we've added the newly allocated page to the page cache, and > we've set this page_in_pagecache flag to true. But we *do not* setup > rmap for this page in this block. I think in this case, the patch will > cause us to do the wrong thing: we should hugepage_add_new_anon_rmap() > further down, but with this patch we dup instead. I am not sure I follow. The patch from Miaohe Lin would not change any behavior in the 'if (vm_shared && !is_continue)' case. In this case both vm_shared and page_in_pagecache are true. IIUC, the patch would address the case where !vm_shared && is_continue. On 07/12/22 21:05, Miaohe Lin wrote: > In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page > cache are installed in the ptes. But hugepage_add_new_anon_rmap is called > for them mistakenly because they're not vm_shared. This will corrupt the > page->mapping used by page cache code. > > Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl") > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> > --- > mm/hugetlb.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 8d379e03f672..b232e1508e49 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -6038,7 +6038,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, > if (!huge_pte_none_mostly(huge_ptep_get(dst_pte))) > goto out_release_unlock; > > - if (vm_shared) { > + if (page_in_pagecache) { > page_dup_file_rmap(page, true); > } else { > ClearHPageRestoreReserve(page);
On Wed, Jul 13, 2022 at 4:36 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 07/13/22 15:46, Axel Rasmussen wrote: > > I think there is a small mistake in this patch. > > > > Consider the non-minor-fault case. We have this block: > > > > /* Add shared, newly allocated pages to the page cache. */ > > if (vm_shared && !is_continue) { > > /* ... */ > > } > > > > In here, we've added the newly allocated page to the page cache, and > > we've set this page_in_pagecache flag to true. But we *do not* setup > > rmap for this page in this block. I think in this case, the patch will > > cause us to do the wrong thing: we should hugepage_add_new_anon_rmap() > > further down, but with this patch we dup instead. > > I am not sure I follow. The patch from Miaohe Lin would not change any > behavior in the 'if (vm_shared && !is_continue)' case. In this case > both vm_shared and page_in_pagecache are true. > > IIUC, the patch would address the case where !vm_shared && is_continue. Ah, you're right, my interpretation of the various flags got mixed up somewhere along the way. page_in_pagecache is equivalent to vm_shared in this function, *except* when we have is_continue. Given that, I think this patch is correct in the vm_shared case (no behavior change). In case of !vm_shared && is_continue, I agree the patch is a correction to the previous behavior. > > On 07/12/22 21:05, Miaohe Lin wrote: > > In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page > > cache are installed in the ptes. But hugepage_add_new_anon_rmap is called > > for them mistakenly because they're not vm_shared. This will corrupt the > > page->mapping used by page cache code. > > > > Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl") > > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> > > --- > > mm/hugetlb.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index 8d379e03f672..b232e1508e49 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -6038,7 +6038,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, > > if (!huge_pte_none_mostly(huge_ptep_get(dst_pte))) > > goto out_release_unlock; > > > > - if (vm_shared) { > > + if (page_in_pagecache) { > > page_dup_file_rmap(page, true); > > } else { > > ClearHPageRestoreReserve(page); > > -- > Mike Kravetz
On 2022/7/14 1:23, Andrew Morton wrote: > On Tue, 12 Jul 2022 21:05:42 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote: > >> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page >> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called >> for them mistakenly because they're not vm_shared. This will corrupt the >> page->mapping used by page cache code. > > Well that sounds bad. And theories on why this has gone unnoticed for > over a year? I assume this doesn't have coverage in our selftests? As discussed in another thread, when minor fault handling is proposed, only VM_SHARED vma is expected to be supported. And the test case is also missing. Thanks. > >> --- a/mm/hugetlb.c >> +++ b/mm/hugetlb.c >> @@ -6038,7 +6038,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, >> if (!huge_pte_none_mostly(huge_ptep_get(dst_pte))) >> goto out_release_unlock; >> >> - if (vm_shared) { >> + if (page_in_pagecache) { >> page_dup_file_rmap(page, true); >> } else { >> ClearHPageRestoreReserve(page); >> -- >> 2.23.0 > . >
On 2022/7/14 8:20, Axel Rasmussen wrote: > On Wed, Jul 13, 2022 at 4:36 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: >> >> On 07/13/22 15:46, Axel Rasmussen wrote: >>> I think there is a small mistake in this patch. >>> >>> Consider the non-minor-fault case. We have this block: >>> >>> /* Add shared, newly allocated pages to the page cache. */ >>> if (vm_shared && !is_continue) { >>> /* ... */ >>> } >>> >>> In here, we've added the newly allocated page to the page cache, and >>> we've set this page_in_pagecache flag to true. But we *do not* setup >>> rmap for this page in this block. I think in this case, the patch will >>> cause us to do the wrong thing: we should hugepage_add_new_anon_rmap() >>> further down, but with this patch we dup instead. >> >> I am not sure I follow. The patch from Miaohe Lin would not change any >> behavior in the 'if (vm_shared && !is_continue)' case. In this case >> both vm_shared and page_in_pagecache are true. >> >> IIUC, the patch would address the case where !vm_shared && is_continue. > > Ah, you're right, my interpretation of the various flags got mixed up > somewhere along the way. > > page_in_pagecache is equivalent to vm_shared in this function, > *except* when we have is_continue. Given that, I think this patch is > correct in the vm_shared case (no behavior change). In case of > !vm_shared && is_continue, I agree the patch is a correction to the > previous behavior. > >> >> On 07/12/22 21:05, Miaohe Lin wrote: >>> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page >>> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called >>> for them mistakenly because they're not vm_shared. This will corrupt the >>> page->mapping used by page cache code. >>> >>> Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl") >>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> >>> --- >>> mm/hugetlb.c | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >>> index 8d379e03f672..b232e1508e49 100644 >>> --- a/mm/hugetlb.c >>> +++ b/mm/hugetlb.c >>> @@ -6038,7 +6038,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, >>> if (!huge_pte_none_mostly(huge_ptep_get(dst_pte))) >>> goto out_release_unlock; >>> >>> - if (vm_shared) { >>> + if (page_in_pagecache) { >>> page_dup_file_rmap(page, true); Many thanks for your comments. As discussed in another thread, we might call page_dup_file_rmap for newly allocated page (regardless of this patch). So should we come up a seperate patch to call page_add_file_rmap here instead? Thanks. >>> } else { >>> ClearHPageRestoreReserve(page); >> >> -- >> Mike Kravetz > . >
On Thu, Jul 14, 2022 at 06:09:49PM +0800, Miaohe Lin wrote: > As discussed in another thread, we might call page_dup_file_rmap for newly > allocated page (regardless of this patch). So should we come up a seperate > patch to call page_add_file_rmap here instead? Hmm, why we need page_add_file_rmap() even if a new page allocated? Say, we're at least also using page_dup_file_rmap() in hugetlb_no_page(). I see majorly two things extra there: memcg accounts on NR_FILE_MAPPED, and mlock. But I assume both of them will not apply to hugetlb pages?
On Thu, Jul 14, 2022 at 05:59:53PM +0800, Miaohe Lin wrote: > On 2022/7/14 1:23, Andrew Morton wrote: > > On Tue, 12 Jul 2022 21:05:42 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote: > > > >> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page > >> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called > >> for them mistakenly because they're not vm_shared. This will corrupt the > >> page->mapping used by page cache code. > > > > Well that sounds bad. And theories on why this has gone unnoticed for > > over a year? I assume this doesn't have coverage in our selftests? > > As discussed in another thread, when minor fault handling is proposed, only > VM_SHARED vma is expected to be supported. And the test case is also missing. Yes, after this patch applied it'll be great to have the test case covering private mappings too. It's just that it'll be a bit more than setting test_uffdio_minor=1 for "hugetlb" test. In hugetlb_allocate_area() we'll need to setup the alias too for !shared case, it'll be a bit challenging since currently we're using anonymous hugetlb mappings for private tests, and I'm not sure whether we'll need the hugetlb path back just like what we have with "hugetlb_shared" tests.
On 2022/7/14 23:45, Peter Xu wrote: > On Thu, Jul 14, 2022 at 06:09:49PM +0800, Miaohe Lin wrote: >> As discussed in another thread, we might call page_dup_file_rmap for newly >> allocated page (regardless of this patch). So should we come up a seperate >> patch to call page_add_file_rmap here instead? > > Hmm, why we need page_add_file_rmap() even if a new page allocated? Say, > we're at least also using page_dup_file_rmap() in hugetlb_no_page(). > > I see majorly two things extra there: memcg accounts on NR_FILE_MAPPED, and > mlock. But I assume both of them will not apply to hugetlb pages? I think you are right. PageDoubleMap is also irrelevant for hugetlb. page_add_file_rmap shouldn't be called for hugetlb page. It seems page_dup_file_rmap can be regarded as hugetlb variant of page_add_file_rmap. Sorry for making noise. > Thanks.
On 2022/7/14 23:52, Peter Xu wrote: > On Thu, Jul 14, 2022 at 05:59:53PM +0800, Miaohe Lin wrote: >> On 2022/7/14 1:23, Andrew Morton wrote: >>> On Tue, 12 Jul 2022 21:05:42 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote: >>> >>>> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page >>>> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called >>>> for them mistakenly because they're not vm_shared. This will corrupt the >>>> page->mapping used by page cache code. >>> >>> Well that sounds bad. And theories on why this has gone unnoticed for >>> over a year? I assume this doesn't have coverage in our selftests? >> >> As discussed in another thread, when minor fault handling is proposed, only >> VM_SHARED vma is expected to be supported. And the test case is also missing. > > Yes, after this patch applied it'll be great to have the test case covering > private mappings too. > > It's just that it'll be a bit more than setting test_uffdio_minor=1 for > "hugetlb" test. In hugetlb_allocate_area() we'll need to setup the alias > too for !shared case, it'll be a bit challenging since currently we're > using anonymous hugetlb mappings for private tests, and I'm not sure > whether we'll need the hugetlb path back just like what we have with > "hugetlb_shared" tests. I'm afraid not. When minor fault handling is proposed, only VM_SHARED vma is expected to be supported. It seems it's hard to image how one might benefit from using it with a private mapping. But I'm not sure as I'm still a layman in userfaultfd now. Any further suggestions? > Thanks!
On Fri, Jul 15, 2022 at 11:56:40AM +0800, Miaohe Lin wrote: > On 2022/7/14 23:52, Peter Xu wrote: > > On Thu, Jul 14, 2022 at 05:59:53PM +0800, Miaohe Lin wrote: > >> On 2022/7/14 1:23, Andrew Morton wrote: > >>> On Tue, 12 Jul 2022 21:05:42 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote: > >>> > >>>> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page > >>>> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called > >>>> for them mistakenly because they're not vm_shared. This will corrupt the > >>>> page->mapping used by page cache code. > >>> > >>> Well that sounds bad. And theories on why this has gone unnoticed for > >>> over a year? I assume this doesn't have coverage in our selftests? > >> > >> As discussed in another thread, when minor fault handling is proposed, only > >> VM_SHARED vma is expected to be supported. And the test case is also missing. > > > > Yes, after this patch applied it'll be great to have the test case covering > > private mappings too. > > > > It's just that it'll be a bit more than setting test_uffdio_minor=1 for > > "hugetlb" test. In hugetlb_allocate_area() we'll need to setup the alias > > too for !shared case, it'll be a bit challenging since currently we're > > using anonymous hugetlb mappings for private tests, and I'm not sure > > whether we'll need the hugetlb path back just like what we have with > > "hugetlb_shared" tests. > > I'm afraid not. When minor fault handling is proposed, only VM_SHARED vma is > expected to be supported. It seems it's hard to image how one might benefit > from using it with a private mapping. But I'm not sure as I'm still a layman > in userfaultfd now. Any further suggestions? IIUC so far we all think it's not required to limit it to shared mappings only? The effort is mostly the same. My suggestion is above - we could enable the kselftest for it, but I don't strongly ask for that too because I don't know any real use of it, it'll still be good to have it though for completeness. It's just that we may need to change some code back in 9ae8f2b849f79 on using fd-based memory, or I don't know how to create the alias mapping properly. Thanks,
On Fri, Jul 15, 2022 at 5:35 AM Peter Xu <peterx@redhat.com> wrote: > On Fri, Jul 15, 2022 at 11:56:40AM +0800, Miaohe Lin wrote: > > On 2022/7/14 23:52, Peter Xu wrote: > > > On Thu, Jul 14, 2022 at 05:59:53PM +0800, Miaohe Lin wrote: > > >> On 2022/7/14 1:23, Andrew Morton wrote: > > >>> On Tue, 12 Jul 2022 21:05:42 +0800 Miaohe Lin <linmiaohe@huawei.com> > wrote: > > >>> > > >>>> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the > page > > >>>> cache are installed in the ptes. But hugepage_add_new_anon_rmap is > called > > >>>> for them mistakenly because they're not vm_shared. This will > corrupt the > > >>>> page->mapping used by page cache code. > > >>> > > >>> Well that sounds bad. And theories on why this has gone unnoticed > for > > >>> over a year? I assume this doesn't have coverage in our selftests? > > >> > > >> As discussed in another thread, when minor fault handling is > proposed, only > > >> VM_SHARED vma is expected to be supported. And the test case is also > missing. > > > > > > Yes, after this patch applied it'll be great to have the test case > covering > > > private mappings too. > > > > > > It's just that it'll be a bit more than setting test_uffdio_minor=1 for > > > "hugetlb" test. In hugetlb_allocate_area() we'll need to setup the > alias > > > too for !shared case, it'll be a bit challenging since currently we're > > > using anonymous hugetlb mappings for private tests, and I'm not sure > > > whether we'll need the hugetlb path back just like what we have with > > > "hugetlb_shared" tests. > > > > I'm afraid not. When minor fault handling is proposed, only VM_SHARED > vma is > > expected to be supported. It seems it's hard to image how one might > benefit > > from using it with a private mapping. But I'm not sure as I'm still a > layman > > in userfaultfd now. Any further suggestions? > > IIUC so far we all think it's not required to limit it to shared mappings > only? The effort is mostly the same. > > My suggestion is above - we could enable the kselftest for it, but I don't > strongly ask for that too because I don't know any real use of it, it'll > still be good to have it though for completeness. It's just that we may > need to change some code back in 9ae8f2b849f79 on using fd-based memory, or > I don't know how to create the alias mapping properly. > I agree we should either: - Update the UFFD selftest to exercise this case - Or, don't allow it, update vma_can_userfault() to also require VM_SHARED for VM_UFFD_MINOR registration. The first one is unfortunately not completely straightforward as Peter described. I would say it's probably not worth holding up this fix while we wait for it to happen? I don't really have a strong preference between the two. The second option is what I originally proposed in the first version of the minor fault series, so going back to that isn't a problem at least from my perspective. If in the future we find a real use case for this, we could always easily re-enable it and add selftests for it at that point. > > Thanks, > > -- > Peter Xu > >
On Fri, Jul 15, 2022 at 09:45:37AM -0700, Axel Rasmussen wrote: > I agree we should either: > - Update the UFFD selftest to exercise this case > - Or, don't allow it, update vma_can_userfault() to also require VM_SHARED > for VM_UFFD_MINOR registration. > > The first one is unfortunately not completely straightforward as Peter > described. I would say it's probably not worth holding up this fix while we > wait for it to happen? Agreed, Andrew has already queued it. It actually is a real fix since we never forbid the user running private mappings upon minor faults, so it's literally a bug in kernel irrelevant of the kselftest. > > I don't really have a strong preference between the two. The second option > is what I originally proposed in the first version of the minor fault > series, so going back to that isn't a problem at least from my perspective. > If in the future we find a real use case for this, we could always easily > re-enable it and add selftests for it at that point. I'd go for fixing the test case if possible. Mike, would it be fine if we go back to /dev/hugepages path based approach in the test case? Thanks,
On Fri, Jul 15, 2022 at 10:07 AM Peter Xu <peterx@redhat.com> wrote: > > On Fri, Jul 15, 2022 at 09:45:37AM -0700, Axel Rasmussen wrote: > > I agree we should either: > > - Update the UFFD selftest to exercise this case > > - Or, don't allow it, update vma_can_userfault() to also require VM_SHARED > > for VM_UFFD_MINOR registration. > > > > The first one is unfortunately not completely straightforward as Peter > > described. I would say it's probably not worth holding up this fix while we > > wait for it to happen? > > Agreed, Andrew has already queued it. It actually is a real fix since we > never forbid the user running private mappings upon minor faults, so > it's literally a bug in kernel irrelevant of the kselftest. > > > > > I don't really have a strong preference between the two. The second option > > is what I originally proposed in the first version of the minor fault > > series, so going back to that isn't a problem at least from my perspective. > > If in the future we find a real use case for this, we could always easily > > re-enable it and add selftests for it at that point. > > I'd go for fixing the test case if possible. Mike, would it be fine if we > go back to /dev/hugepages path based approach in the test case? One possible alternative, can we use memfd_create() with MFD_HUGE_*? This afaict lets us have an fd so we can create two mappings, without having to mount hugetlbfs, pass in a path to the test, ... > > > Thanks, > > -- > Peter Xu >
On 07/15/22 13:07, Peter Xu wrote: > On Fri, Jul 15, 2022 at 09:45:37AM -0700, Axel Rasmussen wrote: > > I don't really have a strong preference between the two. The second option > > is what I originally proposed in the first version of the minor fault > > series, so going back to that isn't a problem at least from my perspective. > > If in the future we find a real use case for this, we could always easily > > re-enable it and add selftests for it at that point. > > I'd go for fixing the test case if possible. Mike, would it be fine if we > go back to /dev/hugepages path based approach in the test case? > No problem going back to using a file for private mapping testing. Removing that was more of a simplification, because of new MADV_DONTNEED support. Just want to make sure we also keep remap and remove event testing.
On Fri, Jul 15, 2022 at 10:29:33AM -0700, Mike Kravetz wrote: > On 07/15/22 13:07, Peter Xu wrote: > > On Fri, Jul 15, 2022 at 09:45:37AM -0700, Axel Rasmussen wrote: > > > I don't really have a strong preference between the two. The second option > > > is what I originally proposed in the first version of the minor fault > > > series, so going back to that isn't a problem at least from my perspective. > > > If in the future we find a real use case for this, we could always easily > > > re-enable it and add selftests for it at that point. > > > > I'd go for fixing the test case if possible. Mike, would it be fine if we > > go back to /dev/hugepages path based approach in the test case? > > > > No problem going back to using a file for private mapping testing. Removing > that was more of a simplification, because of new MADV_DONTNEED support. > Just want to make sure we also keep remap and remove event testing. Yeah definitely, thanks Mike!
On Fri, Jul 15, 2022 at 10:28:44AM -0700, Axel Rasmussen wrote: > On Fri, Jul 15, 2022 at 10:07 AM Peter Xu <peterx@redhat.com> wrote: > > > > On Fri, Jul 15, 2022 at 09:45:37AM -0700, Axel Rasmussen wrote: > > > I agree we should either: > > > - Update the UFFD selftest to exercise this case > > > - Or, don't allow it, update vma_can_userfault() to also require VM_SHARED > > > for VM_UFFD_MINOR registration. > > > > > > The first one is unfortunately not completely straightforward as Peter > > > described. I would say it's probably not worth holding up this fix while we > > > wait for it to happen? > > > > Agreed, Andrew has already queued it. It actually is a real fix since we > > never forbid the user running private mappings upon minor faults, so > > it's literally a bug in kernel irrelevant of the kselftest. > > > > > > > > I don't really have a strong preference between the two. The second option > > > is what I originally proposed in the first version of the minor fault > > > series, so going back to that isn't a problem at least from my perspective. > > > If in the future we find a real use case for this, we could always easily > > > re-enable it and add selftests for it at that point. > > > > I'd go for fixing the test case if possible. Mike, would it be fine if we > > go back to /dev/hugepages path based approach in the test case? > > One possible alternative, can we use memfd_create() with MFD_HUGE_*? > This afaict lets us have an fd so we can create two mappings, > without having to mount hugetlbfs, pass in a path to the test, ... Sounds good. :) We can also rework the shared hugetlb too. Wanna post a patch? I can do that too, let me know otherwise. Thanks!
On Fri, Jul 15, 2022 at 10:39 AM Peter Xu <peterx@redhat.com> wrote: > > On Fri, Jul 15, 2022 at 10:28:44AM -0700, Axel Rasmussen wrote: > > On Fri, Jul 15, 2022 at 10:07 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > On Fri, Jul 15, 2022 at 09:45:37AM -0700, Axel Rasmussen wrote: > > > > I agree we should either: > > > > - Update the UFFD selftest to exercise this case > > > > - Or, don't allow it, update vma_can_userfault() to also require VM_SHARED > > > > for VM_UFFD_MINOR registration. > > > > > > > > The first one is unfortunately not completely straightforward as Peter > > > > described. I would say it's probably not worth holding up this fix while we > > > > wait for it to happen? > > > > > > Agreed, Andrew has already queued it. It actually is a real fix since we > > > never forbid the user running private mappings upon minor faults, so > > > it's literally a bug in kernel irrelevant of the kselftest. > > > > > > > > > > > I don't really have a strong preference between the two. The second option > > > > is what I originally proposed in the first version of the minor fault > > > > series, so going back to that isn't a problem at least from my perspective. > > > > If in the future we find a real use case for this, we could always easily > > > > re-enable it and add selftests for it at that point. > > > > > > I'd go for fixing the test case if possible. Mike, would it be fine if we > > > go back to /dev/hugepages path based approach in the test case? > > > > One possible alternative, can we use memfd_create() with MFD_HUGE_*? > > This afaict lets us have an fd so we can create two mappings, > > without having to mount hugetlbfs, pass in a path to the test, ... > > Sounds good. :) We can also rework the shared hugetlb too. Wanna post a > patch? I can do that too, let me know otherwise. Thanks! Sure, I'll take a whack at it. > > -- > Peter Xu >
On 2022/7/16 1:51, Axel Rasmussen wrote: > On Fri, Jul 15, 2022 at 10:39 AM Peter Xu <peterx@redhat.com> wrote: >> >> On Fri, Jul 15, 2022 at 10:28:44AM -0700, Axel Rasmussen wrote: >>> On Fri, Jul 15, 2022 at 10:07 AM Peter Xu <peterx@redhat.com> wrote: >>>> >>>> On Fri, Jul 15, 2022 at 09:45:37AM -0700, Axel Rasmussen wrote: >>>>> I agree we should either: >>>>> - Update the UFFD selftest to exercise this case >>>>> - Or, don't allow it, update vma_can_userfault() to also require VM_SHARED >>>>> for VM_UFFD_MINOR registration. >>>>> >>>>> The first one is unfortunately not completely straightforward as Peter >>>>> described. I would say it's probably not worth holding up this fix while we >>>>> wait for it to happen? >>>> >>>> Agreed, Andrew has already queued it. It actually is a real fix since we >>>> never forbid the user running private mappings upon minor faults, so >>>> it's literally a bug in kernel irrelevant of the kselftest. >>>> >>>>> >>>>> I don't really have a strong preference between the two. The second option >>>>> is what I originally proposed in the first version of the minor fault >>>>> series, so going back to that isn't a problem at least from my perspective. >>>>> If in the future we find a real use case for this, we could always easily >>>>> re-enable it and add selftests for it at that point. >>>> >>>> I'd go for fixing the test case if possible. Mike, would it be fine if we >>>> go back to /dev/hugepages path based approach in the test case? >>> >>> One possible alternative, can we use memfd_create() with MFD_HUGE_*? >>> This afaict lets us have an fd so we can create two mappings, >>> without having to mount hugetlbfs, pass in a path to the test, ... >> >> Sounds good. :) We can also rework the shared hugetlb too. Wanna post a >> patch? I can do that too, let me know otherwise. Thanks! > > Sure, I'll take a whack at it. Many thanks for all of your hard work. :) > >> >> -- >> Peter Xu >> > . >
On Thu, 14 Jul 2022 17:59:53 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote: > On 2022/7/14 1:23, Andrew Morton wrote: > > On Tue, 12 Jul 2022 21:05:42 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote: > > > >> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page > >> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called > >> for them mistakenly because they're not vm_shared. This will corrupt the > >> page->mapping used by page cache code. > > > > Well that sounds bad. And theories on why this has gone unnoticed for > > over a year? I assume this doesn't have coverage in our selftests? > > As discussed in another thread, when minor fault handling is proposed, only > VM_SHARED vma is expected to be supported So... do we feel that this fix should be backported? And if so, is there a suitable commit for the Fixes:?
On 2022/7/17 7:06, Andrew Morton wrote: > On Thu, 14 Jul 2022 17:59:53 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote: > >> On 2022/7/14 1:23, Andrew Morton wrote: >>> On Tue, 12 Jul 2022 21:05:42 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote: >>> >>>> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page >>>> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called >>>> for them mistakenly because they're not vm_shared. This will corrupt the >>>> page->mapping used by page cache code. >>> >>> Well that sounds bad. And theories on why this has gone unnoticed for >>> over a year? I assume this doesn't have coverage in our selftests? >> >> As discussed in another thread, when minor fault handling is proposed, only >> VM_SHARED vma is expected to be supported > > So... do we feel that this fix should be backported? And if so, is > there a suitable commit for the Fixes:? I tend to backport this fix. And I think the Fixes tag in this patch should be suitable, i.e. Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl"). Thanks. > . >
On Sun, Jul 17, 2022 at 7:25 PM Miaohe Lin <linmiaohe@huawei.com> wrote: > > On 2022/7/17 7:06, Andrew Morton wrote: > > On Thu, 14 Jul 2022 17:59:53 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote: > > > >> On 2022/7/14 1:23, Andrew Morton wrote: > >>> On Tue, 12 Jul 2022 21:05:42 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote: > >>> > >>>> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page > >>>> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called > >>>> for them mistakenly because they're not vm_shared. This will corrupt the > >>>> page->mapping used by page cache code. > >>> > >>> Well that sounds bad. And theories on why this has gone unnoticed for > >>> over a year? I assume this doesn't have coverage in our selftests? > >> > >> As discussed in another thread, when minor fault handling is proposed, only > >> VM_SHARED vma is expected to be supported > > > > So... do we feel that this fix should be backported? And if so, is > > there a suitable commit for the Fixes:? > > I tend to backport this fix. And I think the Fixes tag in this patch should be suitable, > i.e. Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl"). Agreed, it is worth backporting. > > Thanks. > > > . > > >
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 8d379e03f672..b232e1508e49 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6038,7 +6038,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, if (!huge_pte_none_mostly(huge_ptep_get(dst_pte))) goto out_release_unlock; - if (vm_shared) { + if (page_in_pagecache) { page_dup_file_rmap(page, true); } else { ClearHPageRestoreReserve(page);
In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page cache are installed in the ptes. But hugepage_add_new_anon_rmap is called for them mistakenly because they're not vm_shared. This will corrupt the page->mapping used by page cache code. Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> --- mm/hugetlb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)