
[07/19] mm/fork: Accept huge pfnmap entries

Message ID 20240809160909.1023470-8-peterx@redhat.com (mailing list archive)
State New, archived
Series mm: Support huge pfnmaps

Commit Message

Peter Xu Aug. 9, 2024, 4:08 p.m. UTC
Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
much easier, the write bit needs to be persisted though for writable and
shared pud mappings like PFNMAP ones, otherwise a follow up write in either
parent or child process will trigger a write fault.

Do the same for pmd level.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/huge_memory.c | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)
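
Condensed, the rule described above is the one visible in the copy_huge_pud()
hunk at the bottom of this page, repeated here only for quick reference:

	/* Write-protect only CoW mappings at fork; a writable shared
	 * PFNMAP pud keeps its write bit, so neither parent nor child
	 * takes a spurious write fault on the next store. */
	if (is_cow_mapping(vma->vm_flags) && pud_write(pud)) {
		pudp_set_wrprotect(src_mm, addr, src_pud);
		pud = pud_wrprotect(pud);
	}
	pud = pud_mkold(pud);
	set_pud_at(dst_mm, addr, dst_pud, pud);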

Comments

David Hildenbrand Aug. 9, 2024, 4:32 p.m. UTC | #1
On 09.08.24 18:08, Peter Xu wrote:
> Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
> much easier, the write bit needs to be persisted though for writable and
> shared pud mappings like PFNMAP ones, otherwise a follow up write in either
> parent or child process will trigger a write fault.
> 
> Do the same for pmd level.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   mm/huge_memory.c | 27 ++++++++++++++++++++++++---
>   1 file changed, 24 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6568586b21ab..015c9468eed5 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1375,6 +1375,22 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>   	pgtable_t pgtable = NULL;
>   	int ret = -ENOMEM;
>   
> +	pmd = pmdp_get_lockless(src_pmd);
> +	if (unlikely(pmd_special(pmd))) {
> +		dst_ptl = pmd_lock(dst_mm, dst_pmd);
> +		src_ptl = pmd_lockptr(src_mm, src_pmd);
> +		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> +		/*
> +		 * No need to recheck the pmd, it can't change with write
> +		 * mmap lock held here.
> +		 */
> +		if (is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)) {
> +			pmdp_set_wrprotect(src_mm, addr, src_pmd);
> +			pmd = pmd_wrprotect(pmd);
> +		}
> +		goto set_pmd;
> +	}
> +

I strongly assume we should be using vm_normal_page_pmd() instead 
of pmd_page() further below. pmd_special() should be mostly limited to 
GUP-fast and vm_normal_page_pmd().

Again, we should be doing this similar to how we handle PTEs.
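
A minimal sketch of what "similar to how we handle PTEs" could look like at
the pmd level, assuming vm_normal_page_pmd() returns NULL whenever no
refcountable struct page backs the entry; the overall structure and the
out_unlock label are illustrative only, not taken from the series:

	/* Ask whether a normal struct page backs this pmd, mirroring what
	 * vm_normal_page() does for the pte copy path. */
	struct page *page = vm_normal_page_pmd(src_vma, addr, pmd);

	if (!page) {
		/* No struct page (e.g. a special PFNMAP pmd): duplicate the
		 * entry as-is, write-protecting it only for CoW mappings. */
		if (is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)) {
			pmdp_set_wrprotect(src_mm, addr, src_pmd);
			pmd = pmd_wrprotect(pmd);
		}
		set_pmd_at(dst_mm, addr, dst_pmd, pmd_mkold(pmd));
		goto out_unlock;	/* illustrative label */
	}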

I'm a bit confused about the "unlikely(!pmd_trans_huge(pmd))" check, 
below: what else should we have here if it's not a migration entry but a 
present entry?

Likely this function needs a bit of rework.
Peter Xu Aug. 9, 2024, 5:15 p.m. UTC | #2
On Fri, Aug 09, 2024 at 06:32:44PM +0200, David Hildenbrand wrote:
> On 09.08.24 18:08, Peter Xu wrote:
> > Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
> > much easier, the write bit needs to be persisted though for writable and
> > shared pud mappings like PFNMAP ones, otherwise a follow up write in either
> > parent or child process will trigger a write fault.
> > 
> > Do the same for pmd level.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   mm/huge_memory.c | 27 ++++++++++++++++++++++++---
> >   1 file changed, 24 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 6568586b21ab..015c9468eed5 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1375,6 +1375,22 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >   	pgtable_t pgtable = NULL;
> >   	int ret = -ENOMEM;
> > +	pmd = pmdp_get_lockless(src_pmd);
> > +	if (unlikely(pmd_special(pmd))) {
> > +		dst_ptl = pmd_lock(dst_mm, dst_pmd);
> > +		src_ptl = pmd_lockptr(src_mm, src_pmd);
> > +		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> > +		/*
> > +		 * No need to recheck the pmd, it can't change with write
> > +		 * mmap lock held here.
> > +		 */
> > +		if (is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)) {
> > +			pmdp_set_wrprotect(src_mm, addr, src_pmd);
> > +			pmd = pmd_wrprotect(pmd);
> > +		}
> > +		goto set_pmd;
> > +	}
> > +
> 
> I strongly assume we should be using vm_normal_page_pmd() instead of
> pmd_page() further below. pmd_special() should be mostly limited to GUP-fast
> and vm_normal_page_pmd().

One thing to mention that it has this:

	if (!vma_is_anonymous(dst_vma))
		return 0;

So it's only about anonymous below that.  In that case I feel like the
pmd_page() is benign, and actually good.

Though what you're saying here made me notice my above check doesn't seem
to be necessary, I mean, "(is_cow_mapping(src_vma->vm_flags) &&
pmd_write(pmd))" can't be true when special bit is set, aka, pfnmaps.. and
if it's writable for CoW it means it's already an anon.

I think I can probably drop that line there, perhaps with a
VM_WARN_ON_ONCE() making sure it won't happen.
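
One possible shape of that, as a sketch (the condition mirrors the check in
the hunk quoted above, just demoted to a warning; not code from the series):

	/* A special (PFNMAP) pmd should never belong to a writable CoW
	 * mapping, so warn once instead of wr-protecting it here. */
	VM_WARN_ON_ONCE(is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd));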

> 
> Again, we should be doing this similar to how we handle PTEs.
> 
> I'm a bit confused about the "unlikely(!pmd_trans_huge(pmd))" check, below:
> what else should we have here if it's not a migration entry but a present
> entry?

I had a feeling that it was just a safety belt since the 1st day of thp
when Andrea worked that out, so that it'll work with e.g. file truncation
races.

But with current code it looks like it's only anonymous indeed, so looks
not possible at least from that pov.

Thanks,

> 
> Likely this function needs a bit of rework.
> 
> -- 
> Cheers,
> 
> David / dhildenb
>
David Hildenbrand Aug. 9, 2024, 5:59 p.m. UTC | #3
On 09.08.24 19:15, Peter Xu wrote:
> On Fri, Aug 09, 2024 at 06:32:44PM +0200, David Hildenbrand wrote:
>> On 09.08.24 18:08, Peter Xu wrote:
>>> Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
>>> much easier, the write bit needs to be persisted though for writable and
>>> shared pud mappings like PFNMAP ones, otherwise a follow up write in either
>>> parent or child process will trigger a write fault.
>>>
>>> Do the same for pmd level.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>    mm/huge_memory.c | 27 ++++++++++++++++++++++++---
>>>    1 file changed, 24 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 6568586b21ab..015c9468eed5 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1375,6 +1375,22 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>>    	pgtable_t pgtable = NULL;
>>>    	int ret = -ENOMEM;
>>> +	pmd = pmdp_get_lockless(src_pmd);
>>> +	if (unlikely(pmd_special(pmd))) {
>>> +		dst_ptl = pmd_lock(dst_mm, dst_pmd);
>>> +		src_ptl = pmd_lockptr(src_mm, src_pmd);
>>> +		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>>> +		/*
>>> +		 * No need to recheck the pmd, it can't change with write
>>> +		 * mmap lock held here.
>>> +		 */
>>> +		if (is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)) {
>>> +			pmdp_set_wrprotect(src_mm, addr, src_pmd);
>>> +			pmd = pmd_wrprotect(pmd);
>>> +		}
>>> +		goto set_pmd;
>>> +	}
>>> +
>>
>> I strongly assume we should be using vm_normal_page_pmd() instead of
>> pmd_page() further below. pmd_special() should be mostly limited to GUP-fast
>> and vm_normal_page_pmd().
> 
> One thing to mention that it has this:
> 
> 	if (!vma_is_anonymous(dst_vma))
> 		return 0;

Another obscure thing in this function. It's not the job of 
copy_huge_pmd() to make the decision whether to copy, it's the job of 
vma_needs_copy() in copy_page_range().

And now I have to suspect that uffd-wp is broken with this function, 
because as vma_needs_copy() clearly states, we must copy, and we don't 
do that for PMDs. Ugh.
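
For reference, vma_needs_copy() in mm/memory.c reads roughly as follows
(paraphrased from memory, so check the tree for the exact wording):

	static bool vma_needs_copy(struct vm_area_struct *dst_vma,
				   struct vm_area_struct *src_vma)
	{
		/* uffd-wp state lives only in the page tables, so they must
		 * always be copied when dst_vma has uffd-wp enabled. */
		if (userfaultfd_wp(dst_vma))
			return true;
		if (src_vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
			return true;
		if (src_vma->anon_vma)
			return true;
		/* Otherwise a later page fault can refill the entries. */
		return false;
	}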

What a mess, we should just do what we do for PTEs and we will be fine ;)

Also, we call copy_huge_pmd() only if "is_swap_pmd(*src_pmd) || 
pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)"

Would that even be the case with PFNMAP? I suspect that pmd_trans_huge() 
would return "true" for special pfnmap, which is rather "surprising", 
but fortunate for us.

Likely we should be calling copy_huge_pmd() if pmd_leaf() ... cleanup 
for another day.

> 
> So it's only about anonymous below that.  In that case I feel like the
> pmd_page() is benign, and actually good.

Yes, it would likely currently work.

> 
> Though what you're saying here made me notice my above check doesn't seem
> to be necessary, I mean, "(is_cow_mapping(src_vma->vm_flags) &&
> pmd_write(pmd))" can't be true when special bit is set, aka, pfnmaps.. and
> if it's writable for CoW it means it's already an anon.
> 
> I think I can probably drop that line there, perhaps with a
> VM_WARN_ON_ONCE() making sure it won't happen.
> 
>>
>> Again, we should be doing this similar to how we handle PTEs.
>>
>> I'm a bit confused about the "unlikely(!pmd_trans_huge(pmd))" check, below:
>> what else should we have here if it's not a migration entry but a present
>> entry?
> 
> I had a feeling that it was just a safety belt since the 1st day of thp
> when Andrea worked that out, so that it'll work with e.g. file truncation
> races.
> 
> But with current code it looks like it's only anonymous indeed, so looks
> not possible at least from that pov.

Yes, as stated above, likely broken with UFFD-WP ...

I really think we should make this code just behave like it would with 
PTEs, instead of throwing in more "different" handling.
Peter Xu Aug. 12, 2024, 6:29 p.m. UTC | #4
On Fri, Aug 09, 2024 at 07:59:58PM +0200, David Hildenbrand wrote:
> On 09.08.24 19:15, Peter Xu wrote:
> > On Fri, Aug 09, 2024 at 06:32:44PM +0200, David Hildenbrand wrote:
> > > On 09.08.24 18:08, Peter Xu wrote:
> > > > Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
> > > > much easier, the write bit needs to be persisted though for writable and
> > > > shared pud mappings like PFNMAP ones, otherwise a follow up write in either
> > > > parent or child process will trigger a write fault.
> > > > 
> > > > Do the same for pmd level.
> > > > 
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > >    mm/huge_memory.c | 27 ++++++++++++++++++++++++---
> > > >    1 file changed, 24 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > index 6568586b21ab..015c9468eed5 100644
> > > > --- a/mm/huge_memory.c
> > > > +++ b/mm/huge_memory.c
> > > > @@ -1375,6 +1375,22 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > >    	pgtable_t pgtable = NULL;
> > > >    	int ret = -ENOMEM;
> > > > +	pmd = pmdp_get_lockless(src_pmd);
> > > > +	if (unlikely(pmd_special(pmd))) {
> > > > +		dst_ptl = pmd_lock(dst_mm, dst_pmd);
> > > > +		src_ptl = pmd_lockptr(src_mm, src_pmd);
> > > > +		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> > > > +		/*
> > > > +		 * No need to recheck the pmd, it can't change with write
> > > > +		 * mmap lock held here.
> > > > +		 */
> > > > +		if (is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)) {
> > > > +			pmdp_set_wrprotect(src_mm, addr, src_pmd);
> > > > +			pmd = pmd_wrprotect(pmd);
> > > > +		}
> > > > +		goto set_pmd;
> > > > +	}
> > > > +
> > > 
> > > I strongly assume we should be using vm_normal_page_pmd() instead of
> > > pmd_page() further below. pmd_special() should be mostly limited to GUP-fast
> > > and vm_normal_page_pmd().
> > 
> > One thing to mention that it has this:
> > 
> > 	if (!vma_is_anonymous(dst_vma))
> > 		return 0;
> 
> Another obscure thing in this function. It's not the job of copy_huge_pmd()
> to make the decision whether to copy, it's the job of vma_needs_copy() in
> copy_page_range().
> 
> And now I have to suspect that uffd-wp is broken with this function, because
> as vma_needs_copy() clearly states, we must copy, and we don't do that for
> PMDs. Ugh.
> 
> What a mess, we should just do what we do for PTEs and we will be fine ;)

IIUC it's not a problem: file uffd-wp is different from anonymous, in that
it pushes everything down to ptes.

It means if we skipped one huge pmd here for file, then it's destined to
have nothing to do with uffd-wp, otherwise it should have already been
split at the first attempt to wr-protect.

> 
> Also, we call copy_huge_pmd() only if "is_swap_pmd(*src_pmd) ||
> pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)"
> 
> Would that even be the case with PFNMAP? I suspect that pmd_trans_huge()
> would return "true" for special pfnmap, which is rather "surprising", but
> fortunate for us.

It's definitely not surprising to me as that's the plan.. and I thought it
shouldn't be surprising to you - if you remember before I sent this one, I
tried to decouple that here with the "thp agnostic" series:

  https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com

in which you reviewed it (which I appreciated).

So yes, pfnmap on pmd so far will report pmd_trans_huge==true.
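
For example the x86 definition, roughly (arch/x86/include/asm/pgtable.h),
does not look at the special bit at all:

	static inline int pmd_trans_huge(pmd_t pmd)
	{
		/* A leaf pmd with PSE set and DEVMAP clear reports
		 * trans-huge, which includes special (PFNMAP) pmds. */
		return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
	}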

> 
> Likely we should be calling copy_huge_pmd() if pmd_leaf() ... cleanup for
> another day.

Yes, ultimately it should really be a pmd_leaf(), but since I didn't get
much feedback there, and that can further postpone this series from being
posted I'm afraid, then I decided to just move on with "taking pfnmap as
THPs".  The corresponding change on this path is here in that series:

https://lore.kernel.org/all/20240717220219.3743374-7-peterx@redhat.com/

@@ -1235,8 +1235,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
-			|| pmd_devmap(*src_pmd)) {
+		if (is_swap_pmd(*src_pmd) || pmd_is_leaf(*src_pmd)) {
 			int err;
 			VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
 			err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,

> 
> > 
> > So it's only about anonymous below that.  In that case I feel like the
> > pmd_page() is benign, and actually good.
> 
> Yes, it would likely currently work.
> 
> > 
> > Though what you're saying here made me notice my above check doesn't seem
> > to be necessary, I mean, "(is_cow_mapping(src_vma->vm_flags) &&
> > pmd_write(pmd))" can't be true when special bit is set, aka, pfnmaps.. and
> > if it's writable for CoW it means it's already an anon.
> > 
> > I think I can probably drop that line there, perhaps with a
> > VM_WARN_ON_ONCE() making sure it won't happen.
> > 
> > > 
> > > Again, we should be doing this similar to how we handle PTEs.
> > > 
> > > I'm a bit confused about the "unlikely(!pmd_trans_huge(pmd))" check, below:
> > > what else should we have here if it's not a migration entry but a present
> > > entry?
> > 
> > I had a feeling that it was just a safety belt since the 1st day of thp
> > when Andrea worked that out, so that it'll work with e.g. file truncation
> > races.
> > 
> > But with current code it looks like it's only anonymous indeed, so looks
> > not possible at least from that pov.
> 
> Yes, as stated above, likely broken with UFFD-WP ...
> 
> I really think we should make this code just behave like it would with PTEs,
> instead of throwing in more "different" handling.

So it could simply be because file / anon uffd-wp work very differently.

Let me know if you still spot something that is suspicious, but in all
cases I guess we can move on with this series; if you do find something, I
can tackle it together with the mremap() issues in the other thread when I
go back to those.

Thanks,
David Hildenbrand Aug. 12, 2024, 6:50 p.m. UTC | #5
On 12.08.24 20:29, Peter Xu wrote:
> On Fri, Aug 09, 2024 at 07:59:58PM +0200, David Hildenbrand wrote:
>> On 09.08.24 19:15, Peter Xu wrote:
>>> On Fri, Aug 09, 2024 at 06:32:44PM +0200, David Hildenbrand wrote:
>>>> On 09.08.24 18:08, Peter Xu wrote:
>>>>> Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
>>>>> much easier, the write bit needs to be persisted though for writable and
>>>>> shared pud mappings like PFNMAP ones, otherwise a follow up write in either
>>>>> parent or child process will trigger a write fault.
>>>>>
>>>>> Do the same for pmd level.
>>>>>
>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>> ---
>>>>>     mm/huge_memory.c | 27 ++++++++++++++++++++++++---
>>>>>     1 file changed, 24 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>> index 6568586b21ab..015c9468eed5 100644
>>>>> --- a/mm/huge_memory.c
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -1375,6 +1375,22 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>>>>     	pgtable_t pgtable = NULL;
>>>>>     	int ret = -ENOMEM;
>>>>> +	pmd = pmdp_get_lockless(src_pmd);
>>>>> +	if (unlikely(pmd_special(pmd))) {
>>>>> +		dst_ptl = pmd_lock(dst_mm, dst_pmd);
>>>>> +		src_ptl = pmd_lockptr(src_mm, src_pmd);
>>>>> +		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>>>>> +		/*
>>>>> +		 * No need to recheck the pmd, it can't change with write
>>>>> +		 * mmap lock held here.
>>>>> +		 */
>>>>> +		if (is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)) {
>>>>> +			pmdp_set_wrprotect(src_mm, addr, src_pmd);
>>>>> +			pmd = pmd_wrprotect(pmd);
>>>>> +		}
>>>>> +		goto set_pmd;
>>>>> +	}
>>>>> +
>>>>
>>>> I strongly assume we should be using vm_normal_page_pmd() instead of
>>>> pmd_page() further below. pmd_special() should be mostly limited to GUP-fast
>>>> and vm_normal_page_pmd().
>>>
>>> One thing to mention that it has this:
>>>
>>> 	if (!vma_is_anonymous(dst_vma))
>>> 		return 0;
>>
>> Another obscure thing in this function. It's not the job of copy_huge_pmd()
>> to make the decision whether to copy, it's the job of vma_needs_copy() in
>> copy_page_range().
>>
>> And now I have to suspect that uffd-wp is broken with this function, because
>> as vma_needs_copy() clearly states, we must copy, and we don't do that for
>> PMDs. Ugh.
>>
>> What a mess, we should just do what we do for PTEs and we will be fine ;)
> 
> IIUC it's not a problem: file uffd-wp is different from anonymous, in that
> it pushes everything down to ptes.
> 
> It means if we skipped one huge pmd here for file, then it's destined to
> have nothing to do with uffd-wp, otherwise it should have already been
> split at the first attempt to wr-protect.

Is that also true for UFFD_FEATURE_WP_ASYNC, when we call 
pagemap_scan_thp_entry()->make_uffd_wp_pmd() ?

I'm not immediately finding the code that does the "pushes everything 
down to ptes", so I might miss that part.

> 
>>
>> Also, we call copy_huge_pmd() only if "is_swap_pmd(*src_pmd) ||
>> pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)"
>>
>> Would that even be the case with PFNMAP? I suspect that pmd_trans_huge()
>> would return "true" for special pfnmap, which is rather "surprising", but
>> fortunate for us.
> 
> It's definitely not surprising to me as that's the plan.. and I thought it
> shouldn't be surprising to you - if you remember before I sent this one, I
> tried to decouple that here with the "thp agnostic" series:
> 
>    https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com
> 
> in which you reviewed it (which I appreciated).
> 
> So yes, pfnmap on pmd so far will report pmd_trans_huge==true.

I review way too much stuff to remember everything :) That certainly 
screams for a cleanup ...

> 
>>
>> Likely we should be calling copy_huge_pmd() if pmd_leaf() ... cleanup for
>> another day.
> 
> Yes, ultimately it should really be a pmd_leaf(), but since I didn't get
> much feedback there, and that can further postpone this series from being
> posted I'm afraid, then I decided to just move on with "taking pfnmap as
> THPs".  The corresponding change on this path is here in that series:
> 
> https://lore.kernel.org/all/20240717220219.3743374-7-peterx@redhat.com/
> 
> @@ -1235,8 +1235,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>   	src_pmd = pmd_offset(src_pud, addr);
>   	do {
>   		next = pmd_addr_end(addr, end);
> -		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
> -			|| pmd_devmap(*src_pmd)) {
> +		if (is_swap_pmd(*src_pmd) || pmd_is_leaf(*src_pmd)) {
>   			int err;
>   			VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
>   			err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
> 

Ah, good.

[...]

>> Yes, as stated above, likely broken with UFFD-WP ...
>>
>> I really think we should make this code just behave like it would with PTEs,
>> instead of throwing in more "different" handling.
> 
> So it could simply be because file / anon uffd-wp work very differently.

Or because nobody wants to clean up that code ;)
Peter Xu Aug. 12, 2024, 7:05 p.m. UTC | #6
On Mon, Aug 12, 2024 at 08:50:12PM +0200, David Hildenbrand wrote:
> On 12.08.24 20:29, Peter Xu wrote:
> > On Fri, Aug 09, 2024 at 07:59:58PM +0200, David Hildenbrand wrote:
> > > On 09.08.24 19:15, Peter Xu wrote:
> > > > On Fri, Aug 09, 2024 at 06:32:44PM +0200, David Hildenbrand wrote:
> > > > > On 09.08.24 18:08, Peter Xu wrote:
> > > > > > Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
> > > > > > much easier, the write bit needs to be persisted though for writable and
> > > > > > shared pud mappings like PFNMAP ones, otherwise a follow up write in either
> > > > > > parent or child process will trigger a write fault.
> > > > > > 
> > > > > > Do the same for pmd level.
> > > > > > 
> > > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > > > ---
> > > > > >     mm/huge_memory.c | 27 ++++++++++++++++++++++++---
> > > > > >     1 file changed, 24 insertions(+), 3 deletions(-)
> > > > > > 
> > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > > index 6568586b21ab..015c9468eed5 100644
> > > > > > --- a/mm/huge_memory.c
> > > > > > +++ b/mm/huge_memory.c
> > > > > > @@ -1375,6 +1375,22 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > > > >     	pgtable_t pgtable = NULL;
> > > > > >     	int ret = -ENOMEM;
> > > > > > +	pmd = pmdp_get_lockless(src_pmd);
> > > > > > +	if (unlikely(pmd_special(pmd))) {
> > > > > > +		dst_ptl = pmd_lock(dst_mm, dst_pmd);
> > > > > > +		src_ptl = pmd_lockptr(src_mm, src_pmd);
> > > > > > +		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> > > > > > +		/*
> > > > > > +		 * No need to recheck the pmd, it can't change with write
> > > > > > +		 * mmap lock held here.
> > > > > > +		 */
> > > > > > +		if (is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)) {
> > > > > > +			pmdp_set_wrprotect(src_mm, addr, src_pmd);
> > > > > > +			pmd = pmd_wrprotect(pmd);
> > > > > > +		}
> > > > > > +		goto set_pmd;
> > > > > > +	}
> > > > > > +
> > > > > 
> > > > > I strongly assume we should be using vm_normal_page_pmd() instead of
> > > > > pmd_page() further below. pmd_special() should be mostly limited to GUP-fast
> > > > > and vm_normal_page_pmd().
> > > > 
> > > > One thing to mention that it has this:
> > > > 
> > > > 	if (!vma_is_anonymous(dst_vma))
> > > > 		return 0;
> > > 
> > > Another obscure thing in this function. It's not the job of copy_huge_pmd()
> > > to make the decision whether to copy, it's the job of vma_needs_copy() in
> > > copy_page_range().
> > > 
> > > And now I have to suspect that uffd-wp is broken with this function, because
> > > as vma_needs_copy() clearly states, we must copy, and we don't do that for
> > > PMDs. Ugh.
> > > 
> > > What a mess, we should just do what we do for PTEs and we will be fine ;)
> > 
> > IIUC it's not a problem: file uffd-wp is different from anonymous, in that
> > it pushes everything down to ptes.
> > 
> > It means if we skipped one huge pmd here for file, then it's destined to
> > have nothing to do with uffd-wp, otherwise it should have already been
> > split at the first attempt to wr-protect.
> 
> Is that also true for UFFD_FEATURE_WP_ASYNC, when we call
> pagemap_scan_thp_entry()->make_uffd_wp_pmd() ?
> 
> I'm not immediately finding the code that does the "pushes everything down
> to ptes", so I might miss that part.

UFFDIO_WRITEPROTECT should have all of those covered, but I guess you're
right, it looks like the pagemap ioctl was overlooked..

> 
> > 
> > > 
> > > Also, we call copy_huge_pmd() only if "is_swap_pmd(*src_pmd) ||
> > > pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)"
> > > 
> > > Would that even be the case with PFNMAP? I suspect that pmd_trans_huge()
> > > would return "true" for special pfnmap, which is rather "surprising", but
> > > fortunate for us.
> > 
> > It's definitely not surprising to me as that's the plan.. and I thought it
> > shouldn't be surprising to you - if you remember before I sent this one, I
> > tried to decouple that here with the "thp agnostic" series:
> > 
> >    https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com
> > 
> > in which you reviewed it (which I appreciated).
> > 
> > So yes, pfnmap on pmd so far will report pmd_trans_huge==true.
> 
> I review way too much stuff to remember everything :) That certainly screams
> for a cleanup ...

Definitely.

> 
> > 
> > > 
> > > Likely we should be calling copy_huge_pmd() if pmd_leaf() ... cleanup for
> > > another day.
> > 
> > Yes, ultimately it should really be a pmd_leaf(), but since I didn't get
> > much feedback there, and that can further postpone this series from being
> > posted I'm afraid, then I decided to just move on with "taking pfnmap as
> > THPs".  The corresponding change on this path is here in that series:
> > 
> > https://lore.kernel.org/all/20240717220219.3743374-7-peterx@redhat.com/
> > 
> > @@ -1235,8 +1235,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >   	src_pmd = pmd_offset(src_pud, addr);
> >   	do {
> >   		next = pmd_addr_end(addr, end);
> > -		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
> > -			|| pmd_devmap(*src_pmd)) {
> > +		if (is_swap_pmd(*src_pmd) || pmd_is_leaf(*src_pmd)) {
> >   			int err;
> >   			VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
> >   			err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
> > 
> 
> Ah, good.
> 
> [...]
> 
> > > Yes, as stated above, likely broken with UFFD-WP ...
> > > 
> > > I really think we should make this code just behave like it would with PTEs,
> > > instead of throwing in more "different" handling.
> > 
> > So it could simply be because file / anon uffd-wp work very differently.
> 
> Or because nobody wants to clean up that code ;)

I think in this case maybe the fork() part is all fine? As long as we can
switch pagemap ioctl to do proper break-downs when necessary, or even try
to reuse what UFFDIO_WRITEPROTECT does if still possible in some way.

In all cases, definitely sounds like another separate effort.

Thanks,

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6568586b21ab..015c9468eed5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1375,6 +1375,22 @@  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pgtable_t pgtable = NULL;
 	int ret = -ENOMEM;
 
+	pmd = pmdp_get_lockless(src_pmd);
+	if (unlikely(pmd_special(pmd))) {
+		dst_ptl = pmd_lock(dst_mm, dst_pmd);
+		src_ptl = pmd_lockptr(src_mm, src_pmd);
+		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+		/*
+		 * No need to recheck the pmd, it can't change with write
+		 * mmap lock held here.
+		 */
+		if (is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)) {
+			pmdp_set_wrprotect(src_mm, addr, src_pmd);
+			pmd = pmd_wrprotect(pmd);
+		}
+		goto set_pmd;
+	}
+
 	/* Skip if can be re-fill on fault */
 	if (!vma_is_anonymous(dst_vma))
 		return 0;
@@ -1456,7 +1472,9 @@  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
 	if (!userfaultfd_wp(dst_vma))
 		pmd = pmd_clear_uffd_wp(pmd);
-	pmd = pmd_mkold(pmd_wrprotect(pmd));
+	pmd = pmd_wrprotect(pmd);
+set_pmd:
+	pmd = pmd_mkold(pmd);
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 
 	ret = 0;
@@ -1502,8 +1520,11 @@  int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * TODO: once we support anonymous pages, use
 	 * folio_try_dup_anon_rmap_*() and split if duplicating fails.
 	 */
-	pudp_set_wrprotect(src_mm, addr, src_pud);
-	pud = pud_mkold(pud_wrprotect(pud));
+	if (is_cow_mapping(vma->vm_flags) && pud_write(pud)) {
+		pudp_set_wrprotect(src_mm, addr, src_pud);
+		pud = pud_wrprotect(pud);
+	}
+	pud = pud_mkold(pud);
 	set_pud_at(dst_mm, addr, dst_pud, pud);
 
 	ret = 0;