diff mbox series

mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pte

Message ID 20220712130542.18836-1-linmiaohe@huawei.com (mailing list archive)
State New
Headers show
Series mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pte | expand

Commit Message

Miaohe Lin July 12, 2022, 1:05 p.m. UTC
In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
cache are installed in the ptes. But hugepage_add_new_anon_rmap is called
for them mistakenly because they're not vm_shared. This will corrupt the
page->mapping used by page cache code.

Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/hugetlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Mike Kravetz July 12, 2022, 5:39 p.m. UTC | #1
On 07/12/22 21:05, Miaohe Lin wrote:
> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called
> for them mistakenly because they're not vm_shared. This will corrupt the
> page->mapping used by page cache code.
> 
> Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl")
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/hugetlb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

This looks correct to me.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

However, I am having a hard time wrapping my head around how UFFDIO_CONTINUE
should work on non-anon private mappings.  For example, a private mapping of
a hugetlbfs file.  I think we just map the page in the file/cache and do not
set the write bit in the pte.  So, yes we would want page_dup_file_rmap()
in this case as shown below.

Adding Axel and Peter on Cc: as they were more involved in adding that code
and the design of UFFDIO_CONTINUE.
Miaohe Lin July 13, 2022, 2:10 a.m. UTC | #2
On 2022/7/13 1:39, Mike Kravetz wrote:
> On 07/12/22 21:05, Miaohe Lin wrote:
>> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
>> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called
>> for them mistakenly because they're not vm_shared. This will corrupt the
>> page->mapping used by page cache code.
>>
>> Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl")
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> ---
>>  mm/hugetlb.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> This looks correct to me.
> 
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

Many thanks for review.

> 
> However, I am having a hard time wrapping my head around how UFFDIO_CONTINUE
> should work on non-anon private mappings.  For example, a private mapping of
> a hugetlbfs file.  I think we just map the page in the file/cache and do not
> set the write bit in the pte.  So, yes we would want page_dup_file_rmap()
> in this case as shown below.

+1

> 
> Adding Axel and Peter on Cc: as they were more involved in adding that code
> and the design of UFFDIO_CONTINUE.

That would be really helpful.

Thanks.

>
Peter Xu July 13, 2022, 2:24 p.m. UTC | #3
On Tue, Jul 12, 2022 at 10:39:20AM -0700, Mike Kravetz wrote:
> On 07/12/22 21:05, Miaohe Lin wrote:
> > In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
> > cache are installed in the ptes. But hugepage_add_new_anon_rmap is called
> > for them mistakenly because they're not vm_shared. This will corrupt the
> > page->mapping used by page cache code.
> > 
> > Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl")
> > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> > ---
> >  mm/hugetlb.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> This looks correct to me.
> 
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> However, I am having a hard time wrapping my head around how UFFDIO_CONTINUE
> should work on non-anon private mappings.  For example, a private mapping of
> a hugetlbfs file.  I think we just map the page in the file/cache and do not
> set the write bit in the pte.  So, yes we would want page_dup_file_rmap()
> in this case as shown below.
> 
> Adding Axel and Peter on Cc: as they were more involved in adding that code
> and the design of UFFDIO_CONTINUE.

Yes the change makes sense to me too.  There's just one thing to check on
whether minor mode should support private mappings at all as it's probably
not in the major goal of when it's proposed.

I don't see why it can't logically, but I think we should have failed the
uffdio-register already somewhere before when the vma was private and
registered with minor mode.  It's just that I cannot quickly find it in the
code anywhere..  ideally it should be checked in vma_can_userfault() but it
seems not.

Axel?

PS: the minor mode man page update seems to be still missing.
Peter Xu July 13, 2022, 4:10 p.m. UTC | #4
On Wed, Jul 13, 2022 at 10:24:09AM -0400, Peter Xu wrote:
> On Tue, Jul 12, 2022 at 10:39:20AM -0700, Mike Kravetz wrote:
> > On 07/12/22 21:05, Miaohe Lin wrote:
> > > In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
> > > cache are installed in the ptes. But hugepage_add_new_anon_rmap is called
> > > for them mistakenly because they're not vm_shared. This will corrupt the
> > > page->mapping used by page cache code.
> > > 
> > > Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl")
> > > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> > > ---
> > >  mm/hugetlb.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > This looks correct to me.
> > 
> > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> > 
> > However, I am having a hard time wrapping my head around how UFFDIO_CONTINUE
> > should work on non-anon private mappings.  For example, a private mapping of
> > a hugetlbfs file.  I think we just map the page in the file/cache and do not
> > set the write bit in the pte.  So, yes we would want page_dup_file_rmap()
> > in this case as shown below.
> > 
> > Adding Axel and Peter on Cc: as they were more involved in adding that code
> > and the design of UFFDIO_CONTINUE.
> 
> Yes the change makes sense to me too.  There's just one thing to check on
> whether minor mode should support private mappings at all as it's probably
> not in the major goal of when it's proposed.
> 
> I don't see why it can't logically, but I think we should have failed the
> uffdio-register already somewhere before when the vma was private and
> registered with minor mode.  It's just that I cannot quickly find it in the
> code anywhere..  ideally it should be checked in vma_can_userfault() but it
> seems not.
> 
> Axel?
> 
> PS: the minor mode man page update seems to be still missing.

Oh I should have done a pull first on the man-page repo..

From the man page indeed I didn't see anything mentioned on not allowing
private mappings.  There's the example given on using two mappings for
modifying pages but logically that applies to private mappings too - we
could have mapped the uffd region with private mappings but the other one
shared, then we can modify page caches but later after pte installed it'll
trigger cow for writes.

So I think we need to confirm whether private mappings are supported. If
no, we should be crystal clear in both the code and man page (we probably
want a follow up patch to man-page to mention that too?).  If yes, we'll
need Miaohe's patch and also make sure they're enabled in the current code
path.  We'll also want to set test_uffdio_minor=1 for "hugetlb" test case
in the userfaultfd kselftest (currently it's not there).
Andrew Morton July 13, 2022, 5:23 p.m. UTC | #5
On Tue, 12 Jul 2022 21:05:42 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:

> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called
> for them mistakenly because they're not vm_shared. This will corrupt the
> page->mapping used by page cache code.

Well that sounds bad.  And theories on why this has gone unnoticed for
over a year?  I assume this doesn't have coverage in our selftests?

> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6038,7 +6038,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
>  		goto out_release_unlock;
>  
> -	if (vm_shared) {
> +	if (page_in_pagecache) {
>  		page_dup_file_rmap(page, true);
>  	} else {
>  		ClearHPageRestoreReserve(page);
> -- 
> 2.23.0
Axel Rasmussen July 13, 2022, 10:46 p.m. UTC | #6
I think there is a small mistake in this patch.

Consider the non-minor-fault case. We have this block:

/* Add shared, newly allocated pages to the page cache. */
if (vm_shared && !is_continue) {
        /* ... */
}

In here, we've added the newly allocated page to the page cache, and
we've set this page_in_pagecache flag to true. But we *do not* setup
rmap for this page in this block. I think in this case, the patch will
cause us to do the wrong thing: we should hugepage_add_new_anon_rmap()
further down, but with this patch we dup instead.

On Wed, Jul 13, 2022 at 9:10 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Jul 13, 2022 at 10:24:09AM -0400, Peter Xu wrote:
> > On Tue, Jul 12, 2022 at 10:39:20AM -0700, Mike Kravetz wrote:
> > > On 07/12/22 21:05, Miaohe Lin wrote:
> > > > In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
> > > > cache are installed in the ptes. But hugepage_add_new_anon_rmap is called
> > > > for them mistakenly because they're not vm_shared. This will corrupt the
> > > > page->mapping used by page cache code.
> > > >
> > > > Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl")
> > > > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> > > > ---
> > > >  mm/hugetlb.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > This looks correct to me.
> > >
> > > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> > >
> > > However, I am having a hard time wrapping my head around how UFFDIO_CONTINUE
> > > should work on non-anon private mappings.  For example, a private mapping of
> > > a hugetlbfs file.  I think we just map the page in the file/cache and do not
> > > set the write bit in the pte.  So, yes we would want page_dup_file_rmap()
> > > in this case as shown below.
> > >
> > > Adding Axel and Peter on Cc: as they were more involved in adding that code
> > > and the design of UFFDIO_CONTINUE.
> >
> > Yes the change makes sense to me too.  There's just one thing to check on
> > whether minor mode should support private mappings at all as it's probably
> > not in the major goal of when it's proposed.
> >
> > I don't see why it can't logically, but I think we should have failed the
> > uffdio-register already somewhere before when the vma was private and
> > registered with minor mode.  It's just that I cannot quickly find it in the
> > code anywhere..  ideally it should be checked in vma_can_userfault() but it
> > seems not.
> >
> > Axel?
> >
> > PS: the minor mode man page update seems to be still missing.
>
> Oh I should have done a pull first on the man-page repo..
>
> From the man page indeed I didn't see anything mentioned on not allowing
> private mappings.  There's the example given on using two mappings for
> modifying pages but logically that applies to private mappings too - we
> could have mapped the uffd region with private mappings but the other one
> shared, then we can modify page caches but later after pte installed it'll
> trigger cow for writes.
>
> So I think we need to confirm whether private mappings are supported. If
> no, we should be crystal clear in both the code and man page (we probably
> want a follow up patch to man-page to mention that too?).  If yes, we'll
> need Miaohe's patch and also make sure they're enabled in the current code
> path.  We'll also want to set test_uffdio_minor=1 for "hugetlb" test case
> in the userfaultfd kselftest (currently it's not there).

So, originally when I proposed minor fault handling, I was expecting
to only support VM_SHARED VMAs. Indeed, I too have a hard time
imagining how one might benefit from using it with a private mapping.
If my memory serves this restriction was relaxed due to feedback on
the original RFC proposal [1], essentially on the basis of: why make
it more restrictive than it needs to be? Since all we need for a minor
fault to happen is for the pages to be in the page cache, that's the
only restriction we should have.

I don't see why it shouldn't work in principle though. Imagine a
scenario where the VM guest's mapping is private, and the memory
manager's mapping is shared. I guess in this case, say for a write
from the guest:

1. The guest will generate a minor fault
2. The memory manager can modify the page via its shared mapping, and
the guest will see those changes
3. After UFFDIO_CONTINUE resolves the fault, the page is CoW-ed, and
the memory manager can no longer see the guest's version of the page

I'm not really sure *why* you'd want to do this, but it seems like it
should work.

[1]: https://patchwork.kernel.org/project/linux-mm/patch/20210107190453.3051110-2-axelrasmussen@google.com/

>
> --
> Peter Xu
>
Mike Kravetz July 13, 2022, 11:36 p.m. UTC | #7
On 07/13/22 15:46, Axel Rasmussen wrote:
> I think there is a small mistake in this patch.
> 
> Consider the non-minor-fault case. We have this block:
> 
> /* Add shared, newly allocated pages to the page cache. */
> if (vm_shared && !is_continue) {
>         /* ... */
> }
> 
> In here, we've added the newly allocated page to the page cache, and
> we've set this page_in_pagecache flag to true. But we *do not* setup
> rmap for this page in this block. I think in this case, the patch will
> cause us to do the wrong thing: we should hugepage_add_new_anon_rmap()
> further down, but with this patch we dup instead.

I am not sure I follow.  The patch from Miaohe Lin would not change any
behavior in the 'if (vm_shared && !is_continue)' case.  In this case
both vm_shared and page_in_pagecache are true.

IIUC, the patch would address the case where !vm_shared && is_continue.

On 07/12/22 21:05, Miaohe Lin wrote:
> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called
> for them mistakenly because they're not vm_shared. This will corrupt the
> page->mapping used by page cache code.
> 
> Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl")
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/hugetlb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8d379e03f672..b232e1508e49 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6038,7 +6038,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
>  		goto out_release_unlock;
>  
> -	if (vm_shared) {
> +	if (page_in_pagecache) {
>  		page_dup_file_rmap(page, true);
>  	} else {
>  		ClearHPageRestoreReserve(page);
Axel Rasmussen July 14, 2022, 12:20 a.m. UTC | #8
On Wed, Jul 13, 2022 at 4:36 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 07/13/22 15:46, Axel Rasmussen wrote:
> > I think there is a small mistake in this patch.
> >
> > Consider the non-minor-fault case. We have this block:
> >
> > /* Add shared, newly allocated pages to the page cache. */
> > if (vm_shared && !is_continue) {
> >         /* ... */
> > }
> >
> > In here, we've added the newly allocated page to the page cache, and
> > we've set this page_in_pagecache flag to true. But we *do not* setup
> > rmap for this page in this block. I think in this case, the patch will
> > cause us to do the wrong thing: we should hugepage_add_new_anon_rmap()
> > further down, but with this patch we dup instead.
>
> I am not sure I follow.  The patch from Miaohe Lin would not change any
> behavior in the 'if (vm_shared && !is_continue)' case.  In this case
> both vm_shared and page_in_pagecache are true.
>
> IIUC, the patch would address the case where !vm_shared && is_continue.

Ah, you're right, my interpretation of the various flags got mixed up
somewhere along the way.

page_in_pagecache is equivalent to vm_shared in this function,
*except* when we have is_continue. Given that, I think this patch is
correct in the vm_shared case (no behavior change). In case of
!vm_shared && is_continue, I agree the patch is a correction to the
previous behavior.

>
> On 07/12/22 21:05, Miaohe Lin wrote:
> > In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
> > cache are installed in the ptes. But hugepage_add_new_anon_rmap is called
> > for them mistakenly because they're not vm_shared. This will corrupt the
> > page->mapping used by page cache code.
> >
> > Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl")
> > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> > ---
> >  mm/hugetlb.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 8d379e03f672..b232e1508e49 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6038,7 +6038,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> >       if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
> >               goto out_release_unlock;
> >
> > -     if (vm_shared) {
> > +     if (page_in_pagecache) {
> >               page_dup_file_rmap(page, true);
> >       } else {
> >               ClearHPageRestoreReserve(page);
>
> --
> Mike Kravetz
Miaohe Lin July 14, 2022, 9:59 a.m. UTC | #9
On 2022/7/14 1:23, Andrew Morton wrote:
> On Tue, 12 Jul 2022 21:05:42 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:
> 
>> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
>> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called
>> for them mistakenly because they're not vm_shared. This will corrupt the
>> page->mapping used by page cache code.
> 
> Well that sounds bad.  And theories on why this has gone unnoticed for
> over a year?  I assume this doesn't have coverage in our selftests?

As discussed in another thread, when minor fault handling is proposed, only
VM_SHARED vma is expected to be supported. And the test case is also missing.

Thanks.

> 
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -6038,7 +6038,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>>  	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
>>  		goto out_release_unlock;
>>  
>> -	if (vm_shared) {
>> +	if (page_in_pagecache) {
>>  		page_dup_file_rmap(page, true);
>>  	} else {
>>  		ClearHPageRestoreReserve(page);
>> -- 
>> 2.23.0
> .
>
Miaohe Lin July 14, 2022, 10:09 a.m. UTC | #10
On 2022/7/14 8:20, Axel Rasmussen wrote:
> On Wed, Jul 13, 2022 at 4:36 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> On 07/13/22 15:46, Axel Rasmussen wrote:
>>> I think there is a small mistake in this patch.
>>>
>>> Consider the non-minor-fault case. We have this block:
>>>
>>> /* Add shared, newly allocated pages to the page cache. */
>>> if (vm_shared && !is_continue) {
>>>         /* ... */
>>> }
>>>
>>> In here, we've added the newly allocated page to the page cache, and
>>> we've set this page_in_pagecache flag to true. But we *do not* setup
>>> rmap for this page in this block. I think in this case, the patch will
>>> cause us to do the wrong thing: we should hugepage_add_new_anon_rmap()
>>> further down, but with this patch we dup instead.
>>
>> I am not sure I follow.  The patch from Miaohe Lin would not change any
>> behavior in the 'if (vm_shared && !is_continue)' case.  In this case
>> both vm_shared and page_in_pagecache are true.
>>
>> IIUC, the patch would address the case where !vm_shared && is_continue.
> 
> Ah, you're right, my interpretation of the various flags got mixed up
> somewhere along the way.
> 
> page_in_pagecache is equivalent to vm_shared in this function,
> *except* when we have is_continue. Given that, I think this patch is
> correct in the vm_shared case (no behavior change). In case of
> !vm_shared && is_continue, I agree the patch is a correction to the
> previous behavior.
> 
>>
>> On 07/12/22 21:05, Miaohe Lin wrote:
>>> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
>>> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called
>>> for them mistakenly because they're not vm_shared. This will corrupt the
>>> page->mapping used by page cache code.
>>>
>>> Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl")
>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>> ---
>>>  mm/hugetlb.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 8d379e03f672..b232e1508e49 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -6038,7 +6038,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>>>       if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
>>>               goto out_release_unlock;
>>>
>>> -     if (vm_shared) {
>>> +     if (page_in_pagecache) {
>>>               page_dup_file_rmap(page, true);

Many thanks for your comments.

As discussed in another thread, we might call page_dup_file_rmap for newly
allocated page (regardless of this patch). So should we come up a seperate
patch to call page_add_file_rmap here instead?

Thanks.

>>>       } else {
>>>               ClearHPageRestoreReserve(page);
>>
>> --
>> Mike Kravetz
> .
>
Peter Xu July 14, 2022, 3:45 p.m. UTC | #11
On Thu, Jul 14, 2022 at 06:09:49PM +0800, Miaohe Lin wrote:
> As discussed in another thread, we might call page_dup_file_rmap for newly
> allocated page (regardless of this patch). So should we come up a seperate
> patch to call page_add_file_rmap here instead?

Hmm, why we need page_add_file_rmap() even if a new page allocated?  Say,
we're at least also using page_dup_file_rmap() in hugetlb_no_page().

I see majorly two things extra there: memcg accounts on NR_FILE_MAPPED, and
mlock.  But I assume both of them will not apply to hugetlb pages?
Peter Xu July 14, 2022, 3:52 p.m. UTC | #12
On Thu, Jul 14, 2022 at 05:59:53PM +0800, Miaohe Lin wrote:
> On 2022/7/14 1:23, Andrew Morton wrote:
> > On Tue, 12 Jul 2022 21:05:42 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:
> > 
> >> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
> >> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called
> >> for them mistakenly because they're not vm_shared. This will corrupt the
> >> page->mapping used by page cache code.
> > 
> > Well that sounds bad.  And theories on why this has gone unnoticed for
> > over a year?  I assume this doesn't have coverage in our selftests?
> 
> As discussed in another thread, when minor fault handling is proposed, only
> VM_SHARED vma is expected to be supported. And the test case is also missing.

Yes, after this patch applied it'll be great to have the test case covering
private mappings too.

It's just that it'll be a bit more than setting test_uffdio_minor=1 for
"hugetlb" test.  In hugetlb_allocate_area() we'll need to setup the alias
too for !shared case, it'll be a bit challenging since currently we're
using anonymous hugetlb mappings for private tests, and I'm not sure
whether we'll need the hugetlb path back just like what we have with
"hugetlb_shared" tests.
Miaohe Lin July 15, 2022, 2:50 a.m. UTC | #13
On 2022/7/14 23:45, Peter Xu wrote:
> On Thu, Jul 14, 2022 at 06:09:49PM +0800, Miaohe Lin wrote:
>> As discussed in another thread, we might call page_dup_file_rmap for newly
>> allocated page (regardless of this patch). So should we come up a seperate
>> patch to call page_add_file_rmap here instead?
> 
> Hmm, why we need page_add_file_rmap() even if a new page allocated?  Say,
> we're at least also using page_dup_file_rmap() in hugetlb_no_page().
> 
> I see majorly two things extra there: memcg accounts on NR_FILE_MAPPED, and
> mlock.  But I assume both of them will not apply to hugetlb pages?

I think you are right. PageDoubleMap is also irrelevant for hugetlb.
page_add_file_rmap shouldn't be called for hugetlb page. It seems
page_dup_file_rmap can be regarded as hugetlb variant of page_add_file_rmap.
Sorry for making noise.

> 

Thanks.
Miaohe Lin July 15, 2022, 3:56 a.m. UTC | #14
On 2022/7/14 23:52, Peter Xu wrote:
> On Thu, Jul 14, 2022 at 05:59:53PM +0800, Miaohe Lin wrote:
>> On 2022/7/14 1:23, Andrew Morton wrote:
>>> On Tue, 12 Jul 2022 21:05:42 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>
>>>> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
>>>> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called
>>>> for them mistakenly because they're not vm_shared. This will corrupt the
>>>> page->mapping used by page cache code.
>>>
>>> Well that sounds bad.  And theories on why this has gone unnoticed for
>>> over a year?  I assume this doesn't have coverage in our selftests?
>>
>> As discussed in another thread, when minor fault handling is proposed, only
>> VM_SHARED vma is expected to be supported. And the test case is also missing.
> 
> Yes, after this patch applied it'll be great to have the test case covering
> private mappings too.
> 
> It's just that it'll be a bit more than setting test_uffdio_minor=1 for
> "hugetlb" test.  In hugetlb_allocate_area() we'll need to setup the alias
> too for !shared case, it'll be a bit challenging since currently we're
> using anonymous hugetlb mappings for private tests, and I'm not sure
> whether we'll need the hugetlb path back just like what we have with
> "hugetlb_shared" tests.

I'm afraid not. When minor fault handling is proposed, only VM_SHARED vma is
expected to be supported. It seems it's hard to image how one might benefit
from using it with a private mapping. But I'm not sure as I'm still a layman
in userfaultfd now. Any further suggestions?

> 

Thanks!
Peter Xu July 15, 2022, 12:35 p.m. UTC | #15
On Fri, Jul 15, 2022 at 11:56:40AM +0800, Miaohe Lin wrote:
> On 2022/7/14 23:52, Peter Xu wrote:
> > On Thu, Jul 14, 2022 at 05:59:53PM +0800, Miaohe Lin wrote:
> >> On 2022/7/14 1:23, Andrew Morton wrote:
> >>> On Tue, 12 Jul 2022 21:05:42 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:
> >>>
> >>>> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
> >>>> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called
> >>>> for them mistakenly because they're not vm_shared. This will corrupt the
> >>>> page->mapping used by page cache code.
> >>>
> >>> Well that sounds bad.  And theories on why this has gone unnoticed for
> >>> over a year?  I assume this doesn't have coverage in our selftests?
> >>
> >> As discussed in another thread, when minor fault handling is proposed, only
> >> VM_SHARED vma is expected to be supported. And the test case is also missing.
> > 
> > Yes, after this patch applied it'll be great to have the test case covering
> > private mappings too.
> > 
> > It's just that it'll be a bit more than setting test_uffdio_minor=1 for
> > "hugetlb" test.  In hugetlb_allocate_area() we'll need to setup the alias
> > too for !shared case, it'll be a bit challenging since currently we're
> > using anonymous hugetlb mappings for private tests, and I'm not sure
> > whether we'll need the hugetlb path back just like what we have with
> > "hugetlb_shared" tests.
> 
> I'm afraid not. When minor fault handling is proposed, only VM_SHARED vma is
> expected to be supported. It seems it's hard to image how one might benefit
> from using it with a private mapping. But I'm not sure as I'm still a layman
> in userfaultfd now. Any further suggestions?

IIUC so far we all think it's not required to limit it to shared mappings
only?  The effort is mostly the same.

My suggestion is above - we could enable the kselftest for it, but I don't
strongly ask for that too because I don't know any real use of it, it'll
still be good to have it though for completeness.  It's just that we may
need to change some code back in 9ae8f2b849f79 on using fd-based memory, or
I don't know how to create the alias mapping properly.

Thanks,
Axel Rasmussen July 15, 2022, 4:45 p.m. UTC | #16
On Fri, Jul 15, 2022 at 5:35 AM Peter Xu <peterx@redhat.com> wrote:

> On Fri, Jul 15, 2022 at 11:56:40AM +0800, Miaohe Lin wrote:
> > On 2022/7/14 23:52, Peter Xu wrote:
> > > On Thu, Jul 14, 2022 at 05:59:53PM +0800, Miaohe Lin wrote:
> > >> On 2022/7/14 1:23, Andrew Morton wrote:
> > >>> On Tue, 12 Jul 2022 21:05:42 +0800 Miaohe Lin <linmiaohe@huawei.com>
> wrote:
> > >>>
> > >>>> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the
> page
> > >>>> cache are installed in the ptes. But hugepage_add_new_anon_rmap is
> called
> > >>>> for them mistakenly because they're not vm_shared. This will
> corrupt the
> > >>>> page->mapping used by page cache code.
> > >>>
> > >>> Well that sounds bad.  And theories on why this has gone unnoticed
> for
> > >>> over a year?  I assume this doesn't have coverage in our selftests?
> > >>
> > >> As discussed in another thread, when minor fault handling is
> proposed, only
> > >> VM_SHARED vma is expected to be supported. And the test case is also
> missing.
> > >
> > > Yes, after this patch applied it'll be great to have the test case
> covering
> > > private mappings too.
> > >
> > > It's just that it'll be a bit more than setting test_uffdio_minor=1 for
> > > "hugetlb" test.  In hugetlb_allocate_area() we'll need to setup the
> alias
> > > too for !shared case, it'll be a bit challenging since currently we're
> > > using anonymous hugetlb mappings for private tests, and I'm not sure
> > > whether we'll need the hugetlb path back just like what we have with
> > > "hugetlb_shared" tests.
> >
> > I'm afraid not. When minor fault handling is proposed, only VM_SHARED
> vma is
> > expected to be supported. It seems it's hard to image how one might
> benefit
> > from using it with a private mapping. But I'm not sure as I'm still a
> layman
> > in userfaultfd now. Any further suggestions?
>
> IIUC so far we all think it's not required to limit it to shared mappings
> only?  The effort is mostly the same.
>
> My suggestion is above - we could enable the kselftest for it, but I don't
> strongly ask for that too because I don't know any real use of it, it'll
> still be good to have it though for completeness.  It's just that we may
> need to change some code back in 9ae8f2b849f79 on using fd-based memory, or
> I don't know how to create the alias mapping properly.
>

I agree we should either:
- Update the UFFD selftest to exercise this case
- Or, don't allow it, update vma_can_userfault() to also require VM_SHARED
for VM_UFFD_MINOR registration.

The first one is unfortunately not completely straightforward as Peter
described. I would say it's probably not worth holding up this fix while we
wait for it to happen?

I don't really have a strong preference between the two. The second option
is what I originally proposed in the first version of the minor fault
series, so going back to that isn't a problem at least from my perspective.
If in the future we find a real use case for this, we could always easily
re-enable it and add selftests for it at that point.


>
> Thanks,
>
> --
> Peter Xu
>
>
Peter Xu July 15, 2022, 5:07 p.m. UTC | #17
On Fri, Jul 15, 2022 at 09:45:37AM -0700, Axel Rasmussen wrote:
> I agree we should either:
> - Update the UFFD selftest to exercise this case
> - Or, don't allow it, update vma_can_userfault() to also require VM_SHARED
> for VM_UFFD_MINOR registration.
> 
> The first one is unfortunately not completely straightforward as Peter
> described. I would say it's probably not worth holding up this fix while we
> wait for it to happen?

Agreed, Andrew has already queued it.  It actually is a real fix since we
never forbid the user running private mappings upon minor faults, so
it's literally a bug in kernel irrelevant of the kselftest.

> 
> I don't really have a strong preference between the two. The second option
> is what I originally proposed in the first version of the minor fault
> series, so going back to that isn't a problem at least from my perspective.
> If in the future we find a real use case for this, we could always easily
> re-enable it and add selftests for it at that point.

I'd go for fixing the test case if possible.  Mike, would it be fine if we
go back to /dev/hugepages path based approach in the test case?

Thanks,
Axel Rasmussen July 15, 2022, 5:28 p.m. UTC | #18
On Fri, Jul 15, 2022 at 10:07 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Jul 15, 2022 at 09:45:37AM -0700, Axel Rasmussen wrote:
> > I agree we should either:
> > - Update the UFFD selftest to exercise this case
> > - Or, don't allow it, update vma_can_userfault() to also require VM_SHARED
> > for VM_UFFD_MINOR registration.
> >
> > The first one is unfortunately not completely straightforward as Peter
> > described. I would say it's probably not worth holding up this fix while we
> > wait for it to happen?
>
> Agreed, Andrew has already queued it.  It actually is a real fix since we
> never forbid the user running private mappings upon minor faults, so
> it's literally a bug in kernel irrelevant of the kselftest.
>
> >
> > I don't really have a strong preference between the two. The second option
> > is what I originally proposed in the first version of the minor fault
> > series, so going back to that isn't a problem at least from my perspective.
> > If in the future we find a real use case for this, we could always easily
> > re-enable it and add selftests for it at that point.
>
> I'd go for fixing the test case if possible.  Mike, would it be fine if we
> go back to /dev/hugepages path based approach in the test case?

One possible alternative, can we use memfd_create() with MFD_HUGE_*?
This afaict lets us have an fd so we can create two mappings,
without having to mount hugetlbfs, pass in a path to the test, ...

>
>
> Thanks,

>
> --
> Peter Xu
>
Mike Kravetz July 15, 2022, 5:29 p.m. UTC | #19
On 07/15/22 13:07, Peter Xu wrote:
> On Fri, Jul 15, 2022 at 09:45:37AM -0700, Axel Rasmussen wrote:
> > I don't really have a strong preference between the two. The second option
> > is what I originally proposed in the first version of the minor fault
> > series, so going back to that isn't a problem at least from my perspective.
> > If in the future we find a real use case for this, we could always easily
> > re-enable it and add selftests for it at that point.
> 
> I'd go for fixing the test case if possible.  Mike, would it be fine if we
> go back to /dev/hugepages path based approach in the test case?
> 

No problem going back to using a file for private mapping testing.  Removing
that was more of a simplification, because of new MADV_DONTNEED support.
Just want to make sure we also keep remap and remove event testing.
Peter Xu July 15, 2022, 5:38 p.m. UTC | #20
On Fri, Jul 15, 2022 at 10:29:33AM -0700, Mike Kravetz wrote:
> On 07/15/22 13:07, Peter Xu wrote:
> > On Fri, Jul 15, 2022 at 09:45:37AM -0700, Axel Rasmussen wrote:
> > > I don't really have a strong preference between the two. The second option
> > > is what I originally proposed in the first version of the minor fault
> > > series, so going back to that isn't a problem at least from my perspective.
> > > If in the future we find a real use case for this, we could always easily
> > > re-enable it and add selftests for it at that point.
> > 
> > I'd go for fixing the test case if possible.  Mike, would it be fine if we
> > go back to /dev/hugepages path based approach in the test case?
> > 
> 
> No problem going back to using a file for private mapping testing.  Removing
> that was more of a simplification, because of new MADV_DONTNEED support.
> Just want to make sure we also keep remap and remove event testing.

Yeah definitely, thanks Mike!
Peter Xu July 15, 2022, 5:39 p.m. UTC | #21
On Fri, Jul 15, 2022 at 10:28:44AM -0700, Axel Rasmussen wrote:
> On Fri, Jul 15, 2022 at 10:07 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Jul 15, 2022 at 09:45:37AM -0700, Axel Rasmussen wrote:
> > > I agree we should either:
> > > - Update the UFFD selftest to exercise this case
> > > - Or, don't allow it, update vma_can_userfault() to also require VM_SHARED
> > > for VM_UFFD_MINOR registration.
> > >
> > > The first one is unfortunately not completely straightforward as Peter
> > > described. I would say it's probably not worth holding up this fix while we
> > > wait for it to happen?
> >
> > Agreed, Andrew has already queued it.  It actually is a real fix since we
> > never forbid the user running private mappings upon minor faults, so
> > it's literally a bug in kernel irrelevant of the kselftest.
> >
> > >
> > > I don't really have a strong preference between the two. The second option
> > > is what I originally proposed in the first version of the minor fault
> > > series, so going back to that isn't a problem at least from my perspective.
> > > If in the future we find a real use case for this, we could always easily
> > > re-enable it and add selftests for it at that point.
> >
> > I'd go for fixing the test case if possible.  Mike, would it be fine if we
> > go back to /dev/hugepages path based approach in the test case?
> 
> One possible alternative, can we use memfd_create() with MFD_HUGE_*?
> This afaict lets us have an fd so we can create two mappings,
> without having to mount hugetlbfs, pass in a path to the test, ...

Sounds good. :) We can also rework the shared hugetlb too.  Wanna post a
patch?  I can do that too, let me know otherwise.  Thanks!
Axel Rasmussen July 15, 2022, 5:51 p.m. UTC | #22
On Fri, Jul 15, 2022 at 10:39 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Jul 15, 2022 at 10:28:44AM -0700, Axel Rasmussen wrote:
> > On Fri, Jul 15, 2022 at 10:07 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Fri, Jul 15, 2022 at 09:45:37AM -0700, Axel Rasmussen wrote:
> > > > I agree we should either:
> > > > - Update the UFFD selftest to exercise this case
> > > > - Or, don't allow it, update vma_can_userfault() to also require VM_SHARED
> > > > for VM_UFFD_MINOR registration.
> > > >
> > > > The first one is unfortunately not completely straightforward as Peter
> > > > described. I would say it's probably not worth holding up this fix while we
> > > > wait for it to happen?
> > >
> > > Agreed, Andrew has already queued it.  It actually is a real fix since we
> > > never forbid the user running private mappings upon minor faults, so
> > > it's literally a bug in kernel irrelevant of the kselftest.
> > >
> > > >
> > > > I don't really have a strong preference between the two. The second option
> > > > is what I originally proposed in the first version of the minor fault
> > > > series, so going back to that isn't a problem at least from my perspective.
> > > > If in the future we find a real use case for this, we could always easily
> > > > re-enable it and add selftests for it at that point.
> > >
> > > I'd go for fixing the test case if possible.  Mike, would it be fine if we
> > > go back to /dev/hugepages path based approach in the test case?
> >
> > One possible alternative, can we use memfd_create() with MFD_HUGE_*?
> > This afaict lets us have an fd so we can create two mappings,
> > without having to mount hugetlbfs, pass in a path to the test, ...
>
> Sounds good. :) We can also rework the shared hugetlb too.  Wanna post a
> patch?  I can do that too, let me know otherwise.  Thanks!

Sure, I'll take a whack at it.

>
> --
> Peter Xu
>
Miaohe Lin July 16, 2022, 1:32 a.m. UTC | #23
On 2022/7/16 1:51, Axel Rasmussen wrote:
> On Fri, Jul 15, 2022 at 10:39 AM Peter Xu <peterx@redhat.com> wrote:
>>
>> On Fri, Jul 15, 2022 at 10:28:44AM -0700, Axel Rasmussen wrote:
>>> On Fri, Jul 15, 2022 at 10:07 AM Peter Xu <peterx@redhat.com> wrote:
>>>>
>>>> On Fri, Jul 15, 2022 at 09:45:37AM -0700, Axel Rasmussen wrote:
>>>>> I agree we should either:
>>>>> - Update the UFFD selftest to exercise this case
>>>>> - Or, don't allow it, update vma_can_userfault() to also require VM_SHARED
>>>>> for VM_UFFD_MINOR registration.
>>>>>
>>>>> The first one is unfortunately not completely straightforward as Peter
>>>>> described. I would say it's probably not worth holding up this fix while we
>>>>> wait for it to happen?
>>>>
>>>> Agreed, Andrew has already queued it.  It actually is a real fix since we
>>>> never forbid the user running private mappings upon minor faults, so
>>>> it's literally a bug in kernel irrelevant of the kselftest.
>>>>
>>>>>
>>>>> I don't really have a strong preference between the two. The second option
>>>>> is what I originally proposed in the first version of the minor fault
>>>>> series, so going back to that isn't a problem at least from my perspective.
>>>>> If in the future we find a real use case for this, we could always easily
>>>>> re-enable it and add selftests for it at that point.
>>>>
>>>> I'd go for fixing the test case if possible.  Mike, would it be fine if we
>>>> go back to /dev/hugepages path based approach in the test case?
>>>
>>> One possible alternative, can we use memfd_create() with MFD_HUGE_*?
>>> This afaict lets us have an fd so we can create two mappings,
>>> without having to mount hugetlbfs, pass in a path to the test, ...
>>
>> Sounds good. :) We can also rework the shared hugetlb too.  Wanna post a
>> patch?  I can do that too, let me know otherwise.  Thanks!
> 
> Sure, I'll take a whack at it.

Many thanks for all of your hard work. :)

> 
>>
>> --
>> Peter Xu
>>
> .
>
Andrew Morton July 16, 2022, 11:06 p.m. UTC | #24
On Thu, 14 Jul 2022 17:59:53 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:

> On 2022/7/14 1:23, Andrew Morton wrote:
> > On Tue, 12 Jul 2022 21:05:42 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:
> > 
> >> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
> >> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called
> >> for them mistakenly because they're not vm_shared. This will corrupt the
> >> page->mapping used by page cache code.
> > 
> > Well that sounds bad.  And theories on why this has gone unnoticed for
> > over a year?  I assume this doesn't have coverage in our selftests?
> 
> As discussed in another thread, when minor fault handling is proposed, only
> VM_SHARED vma is expected to be supported

So...  do we feel that this fix should be backported?  And if so, is
there a suitable commit for the Fixes:?
Miaohe Lin July 18, 2022, 2:25 a.m. UTC | #25
On 2022/7/17 7:06, Andrew Morton wrote:
> On Thu, 14 Jul 2022 17:59:53 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:
> 
>> On 2022/7/14 1:23, Andrew Morton wrote:
>>> On Tue, 12 Jul 2022 21:05:42 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>
>>>> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
>>>> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called
>>>> for them mistakenly because they're not vm_shared. This will corrupt the
>>>> page->mapping used by page cache code.
>>>
>>> Well that sounds bad.  And theories on why this has gone unnoticed for
>>> over a year?  I assume this doesn't have coverage in our selftests?
>>
>> As discussed in another thread, when minor fault handling is proposed, only
>> VM_SHARED vma is expected to be supported
> 
> So...  do we feel that this fix should be backported?  And if so, is
> there a suitable commit for the Fixes:?

I tend to backport this fix. And I think the Fixes tag in this patch should be suitable,
i.e. Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl").

Thanks.

> .
>
Axel Rasmussen July 18, 2022, 6:07 p.m. UTC | #26
On Sun, Jul 17, 2022 at 7:25 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2022/7/17 7:06, Andrew Morton wrote:
> > On Thu, 14 Jul 2022 17:59:53 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:
> >
> >> On 2022/7/14 1:23, Andrew Morton wrote:
> >>> On Tue, 12 Jul 2022 21:05:42 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:
> >>>
> >>>> In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
> >>>> cache are installed in the ptes. But hugepage_add_new_anon_rmap is called
> >>>> for them mistakenly because they're not vm_shared. This will corrupt the
> >>>> page->mapping used by page cache code.
> >>>
> >>> Well that sounds bad.  And theories on why this has gone unnoticed for
> >>> over a year?  I assume this doesn't have coverage in our selftests?
> >>
> >> As discussed in another thread, when minor fault handling is proposed, only
> >> VM_SHARED vma is expected to be supported
> >
> > So...  do we feel that this fix should be backported?  And if so, is
> > there a suitable commit for the Fixes:?
>
> I tend to backport this fix. And I think the Fixes tag in this patch should be suitable,
> i.e. Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl").

Agreed, it is worth backporting.

>
> Thanks.
>
> > .
> >
>
diff mbox series

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8d379e03f672..b232e1508e49 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6038,7 +6038,7 @@  int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
 		goto out_release_unlock;
 
-	if (vm_shared) {
+	if (page_in_pagecache) {
 		page_dup_file_rmap(page, true);
 	} else {
 		ClearHPageRestoreReserve(page);