
[v2,07/19] mm/fork: Accept huge pfnmap entries

Message ID 20240826204353.2228736-8-peterx@redhat.com (mailing list archive)
State New
Series mm: Support huge pfnmaps

Commit Message

Peter Xu Aug. 26, 2024, 8:43 p.m. UTC
Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
much easier, the write bit needs to be persisted though for writable and
shared pud mappings like PFNMAP ones, otherwise a follow up write in either
parent or child process will trigger a write fault.

Do the same for pmd level.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/huge_memory.c | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

Comments

David Hildenbrand Aug. 29, 2024, 3:10 p.m. UTC | #1
On 26.08.24 22:43, Peter Xu wrote:
> Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
> much easier, the write bit needs to be persisted though for writable and
> shared pud mappings like PFNMAP ones, otherwise a follow up write in either
> parent or child process will trigger a write fault.
> 
> Do the same for pmd level.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   mm/huge_memory.c | 29 ++++++++++++++++++++++++++---
>   1 file changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e2c314f631f3..15418ffdd377 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1559,6 +1559,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>   	pgtable_t pgtable = NULL;
>   	int ret = -ENOMEM;
>   
> +	pmd = pmdp_get_lockless(src_pmd);
> +	if (unlikely(pmd_special(pmd))) {

I assume I have to clean up your mess here as well?
Peter Xu Aug. 29, 2024, 6:26 p.m. UTC | #2
On Thu, Aug 29, 2024 at 05:10:42PM +0200, David Hildenbrand wrote:
> On 26.08.24 22:43, Peter Xu wrote:
> > Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
> > much easier, the write bit needs to be persisted though for writable and
> > shared pud mappings like PFNMAP ones, otherwise a follow up write in either
> > parent or child process will trigger a write fault.
> > 
> > Do the same for pmd level.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   mm/huge_memory.c | 29 ++++++++++++++++++++++++++---
> >   1 file changed, 26 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index e2c314f631f3..15418ffdd377 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1559,6 +1559,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >   	pgtable_t pgtable = NULL;
> >   	int ret = -ENOMEM;
> > +	pmd = pmdp_get_lockless(src_pmd);
> > +	if (unlikely(pmd_special(pmd))) {
> 
> I assume I have to clean up your mess here as well?

Can you leave a meaningful and explicit comment?  I'll try to address it.
David Hildenbrand Aug. 29, 2024, 7:44 p.m. UTC | #3
On 29.08.24 20:26, Peter Xu wrote:
> On Thu, Aug 29, 2024 at 05:10:42PM +0200, David Hildenbrand wrote:
>> On 26.08.24 22:43, Peter Xu wrote:
>>> Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
>>> much easier, the write bit needs to be persisted though for writable and
>>> shared pud mappings like PFNMAP ones, otherwise a follow up write in either
>>> parent or child process will trigger a write fault.
>>>
>>> Do the same for pmd level.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>    mm/huge_memory.c | 29 ++++++++++++++++++++++++++---
>>>    1 file changed, 26 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index e2c314f631f3..15418ffdd377 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1559,6 +1559,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>>    	pgtable_t pgtable = NULL;
>>>    	int ret = -ENOMEM;
>>> +	pmd = pmdp_get_lockless(src_pmd);
>>> +	if (unlikely(pmd_special(pmd))) {
>>
>> I assume I have to clean up your mess here as well?
> 
> Can you leave a meaningful and explicit comment?  I'll try to address it.

Sorry Peter, but I raised all that as a reply to v1. For example, I
stated that vm_normal_page_pmd() already *exists* and why these
pmd_special() checks should be kept there.

I hear you, you're not interested in cleaning that up. So at this point 
it's easier for me to clean it up myself.
Peter Xu Aug. 29, 2024, 8:01 p.m. UTC | #4
On Thu, Aug 29, 2024 at 09:44:01PM +0200, David Hildenbrand wrote:
> On 29.08.24 20:26, Peter Xu wrote:
> > On Thu, Aug 29, 2024 at 05:10:42PM +0200, David Hildenbrand wrote:
> > > On 26.08.24 22:43, Peter Xu wrote:
> > > > Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
> > > > much easier, the write bit needs to be persisted though for writable and
> > > > shared pud mappings like PFNMAP ones, otherwise a follow up write in either
> > > > parent or child process will trigger a write fault.
> > > > 
> > > > Do the same for pmd level.
> > > > 
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > >    mm/huge_memory.c | 29 ++++++++++++++++++++++++++---
> > > >    1 file changed, 26 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > index e2c314f631f3..15418ffdd377 100644
> > > > --- a/mm/huge_memory.c
> > > > +++ b/mm/huge_memory.c
> > > > @@ -1559,6 +1559,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > >    	pgtable_t pgtable = NULL;
> > > >    	int ret = -ENOMEM;
> > > > +	pmd = pmdp_get_lockless(src_pmd);
> > > > +	if (unlikely(pmd_special(pmd))) {
> > > 
> > > I assume I have to clean up your mess here as well?
> > 
> > Can you leave a meaningful and explicit comment?  I'll try to address it.
> 
> Sorry Peter, but I raised all that as a reply to v1. For example, I stated
> that vm_normal_page_pmd() already *exists* and why these pmd_special()
> checks should be kept there.

We discussed the usage of pmd_page(), but I don't think it was clear that
you suggested it be used there.  IOW, copy_huge_pmd() doesn't use
vm_normal_page_pmd() so far, and I'm not sure whether it's always safe.

E.g. at least one thing I spotted is that vm_normal_page_pmd() returns NULL
for a huge zeropage pmd, but here in fork() we need to take a ref with
mm_get_huge_zero_folio().

> 
> I hear you, you're not interested in cleaning that up. So at this point it's
> easier for me to clean it up myself.

It might indeed be easier if you provide a patch that you think is best.

Then I'll leave that to you, and I'll soon send the standalone fixup patch
to the list, to be squashed.
Yan Zhao Sept. 2, 2024, 7:58 a.m. UTC | #5
On Mon, Aug 26, 2024 at 04:43:41PM -0400, Peter Xu wrote:
> Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
> much easier, the write bit needs to be persisted though for writable and
> shared pud mappings like PFNMAP ones, otherwise a follow up write in either
> parent or child process will trigger a write fault.
> 
> Do the same for pmd level.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/huge_memory.c | 29 ++++++++++++++++++++++++++---
>  1 file changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e2c314f631f3..15418ffdd377 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1559,6 +1559,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	pgtable_t pgtable = NULL;
>  	int ret = -ENOMEM;
>  
> +	pmd = pmdp_get_lockless(src_pmd);
> +	if (unlikely(pmd_special(pmd))) {
> +		dst_ptl = pmd_lock(dst_mm, dst_pmd);
> +		src_ptl = pmd_lockptr(src_mm, src_pmd);
> +		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> +		/*
> +		 * No need to recheck the pmd, it can't change with write
> +		 * mmap lock held here.
> +		 *
> +		 * Meanwhile, making sure it's not a CoW VMA with writable
> +		 * mapping, otherwise it means either the anon page wrongly
> +		 * applied special bit, or we made the PRIVATE mapping be
> +		 * able to wrongly write to the backend MMIO.
> +		 */
> +		VM_WARN_ON_ONCE(is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd));
> +		goto set_pmd;
> +	}
> +
>  	/* Skip if can be re-fill on fault */
>  	if (!vma_is_anonymous(dst_vma))
>  		return 0;
> @@ -1640,7 +1658,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	pmdp_set_wrprotect(src_mm, addr, src_pmd);
>  	if (!userfaultfd_wp(dst_vma))
>  		pmd = pmd_clear_uffd_wp(pmd);
> -	pmd = pmd_mkold(pmd_wrprotect(pmd));
> +	pmd = pmd_wrprotect(pmd);
> +set_pmd:
> +	pmd = pmd_mkold(pmd);
>  	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
>  
>  	ret = 0;
> @@ -1686,8 +1706,11 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	 * TODO: once we support anonymous pages, use
>  	 * folio_try_dup_anon_rmap_*() and split if duplicating fails.
>  	 */
> -	pudp_set_wrprotect(src_mm, addr, src_pud);
> -	pud = pud_mkold(pud_wrprotect(pud));
> +	if (is_cow_mapping(vma->vm_flags) && pud_write(pud)) {
> +		pudp_set_wrprotect(src_mm, addr, src_pud);
> +		pud = pud_wrprotect(pud);
> +	}
Do we need the logic to clear dirty bit in the child as that in
__copy_present_ptes()?  (and also for the pmd's case).

e.g.
if (vma->vm_flags & VM_SHARED)
	pud = pud_mkclean(pud);

> +	pud = pud_mkold(pud);
>  	set_pud_at(dst_mm, addr, dst_pud, pud);
>  
>  	ret = 0;
> -- 
> 2.45.0
>
Peter Xu Sept. 3, 2024, 9:23 p.m. UTC | #6
On Mon, Sep 02, 2024 at 03:58:38PM +0800, Yan Zhao wrote:
> On Mon, Aug 26, 2024 at 04:43:41PM -0400, Peter Xu wrote:
> > Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
> > much easier, the write bit needs to be persisted though for writable and
> > shared pud mappings like PFNMAP ones, otherwise a follow up write in either
> > parent or child process will trigger a write fault.
> > 
> > Do the same for pmd level.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/huge_memory.c | 29 ++++++++++++++++++++++++++---
> >  1 file changed, 26 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index e2c314f631f3..15418ffdd377 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1559,6 +1559,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  	pgtable_t pgtable = NULL;
> >  	int ret = -ENOMEM;
> >  
> > +	pmd = pmdp_get_lockless(src_pmd);
> > +	if (unlikely(pmd_special(pmd))) {
> > +		dst_ptl = pmd_lock(dst_mm, dst_pmd);
> > +		src_ptl = pmd_lockptr(src_mm, src_pmd);
> > +		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> > +		/*
> > +		 * No need to recheck the pmd, it can't change with write
> > +		 * mmap lock held here.
> > +		 *
> > +		 * Meanwhile, making sure it's not a CoW VMA with writable
> > +		 * mapping, otherwise it means either the anon page wrongly
> > +		 * applied special bit, or we made the PRIVATE mapping be
> > +		 * able to wrongly write to the backend MMIO.
> > +		 */
> > +		VM_WARN_ON_ONCE(is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd));
> > +		goto set_pmd;
> > +	}
> > +
> >  	/* Skip if can be re-fill on fault */
> >  	if (!vma_is_anonymous(dst_vma))
> >  		return 0;
> > @@ -1640,7 +1658,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  	pmdp_set_wrprotect(src_mm, addr, src_pmd);
> >  	if (!userfaultfd_wp(dst_vma))
> >  		pmd = pmd_clear_uffd_wp(pmd);
> > -	pmd = pmd_mkold(pmd_wrprotect(pmd));
> > +	pmd = pmd_wrprotect(pmd);
> > +set_pmd:
> > +	pmd = pmd_mkold(pmd);
> >  	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> >  
> >  	ret = 0;
> > @@ -1686,8 +1706,11 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  	 * TODO: once we support anonymous pages, use
> >  	 * folio_try_dup_anon_rmap_*() and split if duplicating fails.
> >  	 */
> > -	pudp_set_wrprotect(src_mm, addr, src_pud);
> > -	pud = pud_mkold(pud_wrprotect(pud));
> > +	if (is_cow_mapping(vma->vm_flags) && pud_write(pud)) {
> > +		pudp_set_wrprotect(src_mm, addr, src_pud);
> > +		pud = pud_wrprotect(pud);
> > +	}
> Do we need the logic to clear dirty bit in the child as that in
> __copy_present_ptes()?  (and also for the pmd's case).
> 
> e.g.
> if (vma->vm_flags & VM_SHARED)
> 	pud = pud_mkclean(pud);

Yeah, good question.  I remember I thought about that when initially
working on these lines, but I forgot the details, or maybe I simply tried
to stick with the current code base, as the dirty bit used to be kept even
in the child here.

I'd expect there's only performance differences, but still sounds like I'd
better leave that to whoever knows the best on the implications, then draft
it as a separate patch but only when needed.

Thanks,
Andrew Morton Sept. 9, 2024, 10:25 p.m. UTC | #7
On Tue, 3 Sep 2024 17:23:38 -0400 Peter Xu <peterx@redhat.com> wrote:

> > > @@ -1686,8 +1706,11 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > >  	 * TODO: once we support anonymous pages, use
> > >  	 * folio_try_dup_anon_rmap_*() and split if duplicating fails.
> > >  	 */
> > > -	pudp_set_wrprotect(src_mm, addr, src_pud);
> > > -	pud = pud_mkold(pud_wrprotect(pud));
> > > +	if (is_cow_mapping(vma->vm_flags) && pud_write(pud)) {
> > > +		pudp_set_wrprotect(src_mm, addr, src_pud);
> > > +		pud = pud_wrprotect(pud);
> > > +	}
> > Do we need the logic to clear dirty bit in the child as that in
> > __copy_present_ptes()?  (and also for the pmd's case).
> > 
> > e.g.
> > if (vma->vm_flags & VM_SHARED)
> > 	pud = pud_mkclean(pud);
> 
> Yeah, good question.  I remember I thought about that when initially
> working on these lines, but I forgot the details, or maybe I simply tried
> to stick with the current code base, as the dirty bit used to be kept even
> in the child here.
> 
> I'd expect there's only performance differences, but still sounds like I'd
> better leave that to whoever knows the best on the implications, then draft
> it as a separate patch but only when needed.

Sorry, but this vagueness simply leaves me with nowhere to go.

I'll drop the series - let's revisit after -rc1 please.
Peter Xu Sept. 9, 2024, 10:43 p.m. UTC | #8
On Mon, Sep 09, 2024 at 03:25:46PM -0700, Andrew Morton wrote:
> On Tue, 3 Sep 2024 17:23:38 -0400 Peter Xu <peterx@redhat.com> wrote:
> 
> > > > @@ -1686,8 +1706,11 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > >  	 * TODO: once we support anonymous pages, use
> > > >  	 * folio_try_dup_anon_rmap_*() and split if duplicating fails.
> > > >  	 */
> > > > -	pudp_set_wrprotect(src_mm, addr, src_pud);
> > > > -	pud = pud_mkold(pud_wrprotect(pud));
> > > > +	if (is_cow_mapping(vma->vm_flags) && pud_write(pud)) {
> > > > +		pudp_set_wrprotect(src_mm, addr, src_pud);
> > > > +		pud = pud_wrprotect(pud);
> > > > +	}
> > > Do we need the logic to clear dirty bit in the child as that in
> > > __copy_present_ptes()?  (and also for the pmd's case).
> > > 
> > > e.g.
> > > if (vma->vm_flags & VM_SHARED)
> > > 	pud = pud_mkclean(pud);
> > 
> > Yeah, good question.  I remember I thought about that when initially
> > working on these lines, but I forgot the details, or maybe I simply tried
> > to stick with the current code base, as the dirty bit used to be kept even
> > in the child here.
> > 
> > I'd expect there's only performance differences, but still sounds like I'd
> > better leave that to whoever knows the best on the implications, then draft
> > it as a separate patch but only when needed.
> 
> Sorry, but this vagueness simply leaves me with nowhere to go.
> 
> I'll drop the series - let's revisit after -rc1 please.

Andrew, would you please explain why it needs to be dropped?

I meant in the reply that I think we should leave that as is, and I think
so far nobody in real life should care much on this bit, so I think it's
fine to leave the dirty bit as-is.

I still think whoever has a better use of the dirty bit and would like to
change the behavior should find the use case and work on top, but only if
necessary.

At least this whole fork() path is not useful at all for the use case we're
working on.  Please still consider having this series as I think it's useful.

Thanks,
Andrew Morton Sept. 9, 2024, 11:15 p.m. UTC | #9
On Mon, 9 Sep 2024 18:43:22 -0400 Peter Xu <peterx@redhat.com> wrote:

> > > > Do we need the logic to clear dirty bit in the child as that in
> > > > __copy_present_ptes()?  (and also for the pmd's case).
> > > > 
> > > > e.g.
> > > > if (vma->vm_flags & VM_SHARED)
> > > > 	pud = pud_mkclean(pud);
> > > 
> > > Yeah, good question.  I remember I thought about that when initially
> > > working on these lines, but I forgot the details, or maybe I simply tried
> > > to stick with the current code base, as the dirty bit used to be kept even
> > > in the child here.
> > > 
> > > I'd expect there's only performance differences, but still sounds like I'd
> > > better leave that to whoever knows the best on the implications, then draft
> > > it as a separate patch but only when needed.
> > 
> > Sorry, but this vagueness simply leaves me with nowhere to go.
> > 
> > I'll drop the series - let's revisit after -rc1 please.
> 
> Andrew, would you please explain why it needs to be dropped?
> 
> I meant in the reply that I think we should leave that as is, and I think
> so far nobody in real life should care much on this bit, so I think it's
> fine to leave the dirty bit as-is.
> 
> I still think whoever has a better use of the dirty bit and would like to
> change the behavior should find the use case and work on top, but only if
> necessary.

Well.  "I'd expect there's only performance differences" means to me
"there might be correctness issues, I don't know".  Is it or is it not
merely a performance thing?
Peter Xu Sept. 10, 2024, 12:08 a.m. UTC | #10
On Mon, Sep 09, 2024 at 04:15:39PM -0700, Andrew Morton wrote:
> On Mon, 9 Sep 2024 18:43:22 -0400 Peter Xu <peterx@redhat.com> wrote:
> 
> > > > > Do we need the logic to clear dirty bit in the child as that in
> > > > > __copy_present_ptes()?  (and also for the pmd's case).
> > > > > 
> > > > > e.g.
> > > > > if (vma->vm_flags & VM_SHARED)
> > > > > 	pud = pud_mkclean(pud);
> > > > 
> > > > Yeah, good question.  I remember I thought about that when initially
> > > > working on these lines, but I forgot the details, or maybe I simply tried
> > > > to stick with the current code base, as the dirty bit used to be kept even
> > > > in the child here.
> > > > 
> > > > I'd expect there's only performance differences, but still sounds like I'd
> > > > better leave that to whoever knows the best on the implications, then draft
> > > > it as a separate patch but only when needed.
> > > 
> > > Sorry, but this vagueness simply leaves me with nowhere to go.
> > > 
> > > I'll drop the series - let's revisit after -rc1 please.
> > 
> > Andrew, would you please explain why it needs to be dropped?
> > 
> > I meant in the reply that I think we should leave that as is, and I think
> > so far nobody in real life should care much on this bit, so I think it's
> > fine to leave the dirty bit as-is.
> > 
> > I still think whoever has a better use of the dirty bit and would like to
> > change the behavior should find the use case and work on top, but only if
> > necessary.
> 
> Well.  "I'd expect there's only performance differences" means to me
> "there might be correctness issues, I don't know".  Is it or is it not
> merely a performance thing?

There should be no correctness issue pending.  It can only be about
performance, and AFAIU this patch shouldn't ever change performance
either, as it didn't change how the dirty bit is processed (just like
before this patch), not to mention correctness (in regard to dirty bits).

I can provide some more details.

Here the question we're discussing is "whether we should clear the dirty
bit in the child for a pgtable entry when it's VM_SHARED".  Yan observed
that we don't do the same thing for pte/pmd/pud, which is true.

Before this patch:

  - For pte:     we clear dirty bit if VM_SHARED in child when copy
  - For pmd/pud: we never clear dirty bit in the child when copy

The behavior of clearing the dirty bit for VM_SHARED in the child at the
pte level originates from the first commit that git history starts with.
So we have been doing this for 19 years.

That makes sense to me, because clearing dirty bit in pte normally requires
a SetDirty on the folio, e.g. in unmap path:

        if (pte_dirty(pteval))
                folio_mark_dirty(folio);

Hence cleared dirty bit in the child should avoid some extra overheads when
the pte maps a file cache, so clean pte can at least help us to avoid calls
into e.g. mapping's dirty_folio() functions (in which it should normally
check folio_test_set_dirty() again anyway, and parent pte still have the
dirty bit set so we won't miss setting folio dirty):

folio_mark_dirty():
        if (folio_test_reclaim(folio))
                folio_clear_reclaim(folio);
        return mapping->a_ops->dirty_folio(mapping, folio);

However there's the other side of thing where when the dirty bit is missing
I _think_ it also means when the child writes to the cleaned pte, it'll
require (e.g. on hardware accelerated archs) MMU setting dirty bit which is
slower than if we don't clear the dirty bit... and on software emulated
dirty bits it could even require a page fault, IIUC.

In short, personally I don't know what's the best to do, on keep / remove
the dirty bit even if it's safe either way: there are pros and cons on
different decisions.

That's why I said I'm not sure which is the best way.  I had a feeling that
most of the people didn't even notice this, and we kept running this code
for the past 19 years just all fine..

OTOH, we don't do the same for pmds/puds (in which case we persist dirty
bits always in child), and I didn't check whether it's intended, or why.
It'll have similar reasoning as above discussion on pte, or even more I
overlooked.

So again, the safest approach here is in terms of dirty bit we keep what we
do as before.  And that's what this patch does as of now.

IOW, if I'll need a repost, I'll repost exactly the same thing (with the
fixup I sent later, which is already in mm-unstable).

Thanks,
Yan Zhao Sept. 10, 2024, 2:52 a.m. UTC | #11
On Mon, Sep 09, 2024 at 08:08:16PM -0400, Peter Xu wrote:
> On Mon, Sep 09, 2024 at 04:15:39PM -0700, Andrew Morton wrote:
> > On Mon, 9 Sep 2024 18:43:22 -0400 Peter Xu <peterx@redhat.com> wrote:
> > 
> > > > > > Do we need the logic to clear dirty bit in the child as that in
> > > > > > __copy_present_ptes()?  (and also for the pmd's case).
> > > > > > 
> > > > > > e.g.
> > > > > > if (vma->vm_flags & VM_SHARED)
> > > > > > 	pud = pud_mkclean(pud);
> > > > > 
> > > > > Yeah, good question.  I remember I thought about that when initially
> > > > > working on these lines, but I forgot the details, or maybe I simply tried
> > > > > to stick with the current code base, as the dirty bit used to be kept even
> > > > > in the child here.
> > > > > 
> > > > > I'd expect there's only performance differences, but still sounds like I'd
> > > > > better leave that to whoever knows the best on the implications, then draft
> > > > > it as a separate patch but only when needed.
> > > > 
> > > > Sorry, but this vagueness simply leaves me with nowhere to go.
> > > > 
> > > > I'll drop the series - let's revisit after -rc1 please.
> > > 
> > > Andrew, would you please explain why it needs to be dropped?
> > > 
> > > I meant in the reply that I think we should leave that as is, and I think
> > > so far nobody in real life should care much on this bit, so I think it's
> > > fine to leave the dirty bit as-is.
> > > 
> > > I still think whoever has a better use of the dirty bit and would like to
> > > change the behavior should find the use case and work on top, but only if
> > > necessary.
> > 
> > Well.  "I'd expect there's only performance differences" means to me
> > "there might be correctness issues, I don't know".  Is it or is it not
> > merely a performance thing?
> 
> There should be no correctness issue pending.  It can only be about
> performance, and AFAIU this patch shouldn't ever change performance
> either, as it didn't change how the dirty bit is processed (just like
> before this patch), not to mention correctness (in regard to dirty bits).
> 
> I can provide some more details.
> 
> Here the question we're discussing is "whether we should clear the dirty
> bit in the child for a pgtable entry when it's VM_SHARED".  Yan observed
> that we don't do the same thing for pte/pmd/pud, which is true.
> 
> Before this patch:
> 
>   - For pte:     we clear dirty bit if VM_SHARED in child when copy
>   - For pmd/pud: we never clear dirty bit in the child when copy
Hi Peter,

Not sure if I missed anything.

It looks like before this patch, pmd/pud were always write protected
without checking "is_cow_mapping(vma->vm_flags) && pud_write(pud)".
pud_wrprotect() clears the dirty bit by moving the dirty value to the
software bit.

And I have a question: why were pmd/pud previously always write protected?

Thanks
Yan

> 
> The behavior of clearing the dirty bit for VM_SHARED in the child at the
> pte level originates from the first commit that git history starts with.
> So we have been doing this for 19 years.
> 
> That makes sense to me, because clearing dirty bit in pte normally requires
> a SetDirty on the folio, e.g. in unmap path:
> 
>         if (pte_dirty(pteval))
>                 folio_mark_dirty(folio);
> 
> Hence cleared dirty bit in the child should avoid some extra overheads when
> the pte maps a file cache, so clean pte can at least help us to avoid calls
> into e.g. mapping's dirty_folio() functions (in which it should normally
> check folio_test_set_dirty() again anyway, and parent pte still have the
> dirty bit set so we won't miss setting folio dirty):
> 
> folio_mark_dirty():
>         if (folio_test_reclaim(folio))
>                 folio_clear_reclaim(folio);
>         return mapping->a_ops->dirty_folio(mapping, folio);
> 
> However there's the other side of thing where when the dirty bit is missing
> I _think_ it also means when the child writes to the cleaned pte, it'll
> require (e.g. on hardware accelerated archs) MMU setting dirty bit which is
> slower than if we don't clear the dirty bit... and on software emulated
> dirty bits it could even require a page fault, IIUC.
> 
> In short, personally I don't know what's the best to do, on keep / remove
> the dirty bit even if it's safe either way: there are pros and cons on
> different decisions.
> 
> That's why I said I'm not sure which is the best way.  I had a feeling that
> most of the people didn't even notice this, and we kept running this code
> for the past 19 years just all fine..
> 
> OTOH, we don't do the same for pmds/puds (in which case we persist dirty
> bits always in child), and I didn't check whether it's intended, or why.
> It'll have similar reasoning as above discussion on pte, or even more I
> overlooked.
> 
> So again, the safest approach here is in terms of dirty bit we keep what we
> do as before.  And that's what this patch does as of now.
> 
> IOW, if I'll need a repost, I'll repost exactly the same thing (with the
> fixup I sent later, which is already in mm-unstable).
> 
> Thanks,
> 
> -- 
> Peter Xu
> 
>
Peter Xu Sept. 10, 2024, 12:16 p.m. UTC | #12
On Tue, Sep 10, 2024 at 10:52:01AM +0800, Yan Zhao wrote:
> Hi Peter,

Hi, Yan,

> 
> Not sure if I missed anything.
> 
> It looks like before this patch, pmd/pud were always write protected
> without checking "is_cow_mapping(vma->vm_flags) && pud_write(pud)".
> pud_wrprotect() clears the dirty bit by moving the dirty value to the
> software bit.
> 
> And I have a question: why were pmd/pud previously always write protected?

IIUC this is a separate question - the move of dirty bit in pud_wrprotect()
is to avoid wrongly creating shadow stack mappings.  In our discussion I
think that's an extra complexity and can be put aside; the dirty bit will
get recovered in pud_clear_saveddirty() later, so it's not the same as
pud_mkclean().

AFAIU pmd/pud paths don't consider is_cow_mapping() because normally we
will not duplicate pgtables in fork() for most of shared file mappings
(!CoW).  Please refer to vma_needs_copy(), and the comment before returning
false at last.  I think it's not strictly is_cow_mapping(), as we're
checking anon_vma there, however it's mostly it, just to also cover
MAP_PRIVATE on file mappings too when there's no CoW happened (as if CoW
happened then anon_vma will appear already).

There're some outliers, e.g. userfault protected, or pfnmaps/mixedmaps.
Userfault & mixedmap are not involved in this series at all, so let's
discuss pfnmaps.

It means, fork() can still copy pgtable for pfnmap vmas, and it's relevant
to this series, because before this series pfnmap only exists in pte level,
hence IMO the is_cow_mapping() must exist for pte level as you described,
because it needs to properly take care of those.  Note that in the pte
processing it also checks pte_write() to make sure it's a COWed page, not a
RO page cache / pfnmap / ..., for example.

Meanwhile, since pfnmap won't appear in pmd/pud, I think it's fair that
pmd/pud assumes when seeing a huge mapping it must be MAP_PRIVATE otherwise
the whole copy_page_range() could be already skipped.  IOW I think they
only need to process COWed pages here, and those pages require write bit
removed in both parent and child when fork().

After this series, pfnmaps can appear in the form of pmd/pud, then the
previous assumption will stop holding true, as we'll still copy pfnmaps
during fork() always. My guessing of the reason is because most of the
drivers map pfnmap vmas only during mmap(), it means there can normally
have no fault() handler at all for those pfns.

In this case, we'll need to also identify whether the page is COWed, using
the newly added "is_cow_mapping() && pxx_write()" in this series (added
to pud path, while for pmd path I used a WARN_ON_ONCE instead).

If we don't do that, it means e.g. for a VM_SHARED pfnmap vma, after fork()
we'll wrongly observe write protected entries.  Here the change will make
sure VM_SHARED can properly persist the write bits on pmds/puds.

Hope that explains.

Thanks,
Yan Zhao Sept. 11, 2024, 2:16 a.m. UTC | #13
On Tue, Sep 10, 2024 at 08:16:10AM -0400, Peter Xu wrote:
> On Tue, Sep 10, 2024 at 10:52:01AM +0800, Yan Zhao wrote:
> > Hi Peter,
> 
> Hi, Yan,
> 
> > 
> > Not sure if I missed anything.
> > 
> > > It looks like before this patch, pmd/pud were always write protected
> > > without checking "is_cow_mapping(vma->vm_flags) && pud_write(pud)".
> > > pud_wrprotect() clears the dirty bit by moving the dirty value to the
> > > software bit.
> > > 
> > > And I have a question: why were pmd/pud previously always write protected?
> 
> IIUC this is a separate question - the move of dirty bit in pud_wrprotect()
> is to avoid wrongly creating shadow stack mappings.  In our discussion I
> think that's an extra complexity and can be put aside; the dirty bit will
> get recovered in pud_clear_saveddirty() later, so it's not the same as
> pud_mkclean().
But pud_clear_saveddirty() will only set dirty bit when write bit is 1.

> 
> AFAIU pmd/pud paths don't consider is_cow_mapping() because normally we
> will not duplicate pgtables in fork() for most of shared file mappings
> (!CoW).  Please refer to vma_needs_copy(), and the comment before returning
> false at last.  I think it's not strictly is_cow_mapping(), as we're
> checking anon_vma there, however it's mostly it, just to also cover
> MAP_PRIVATE on file mappings too when there's no CoW happened (as if CoW
> happened then anon_vma will appear already).
> 
> There're some outliers, e.g. userfault protected, or pfnmaps/mixedmaps.
> Userfault & mixedmap are not involved in this series at all, so let's
> discuss pfnmaps.
> 
> It means, fork() can still copy pgtable for pfnmap vmas, and it's relevant
> to this series, because before this series pfnmap only exists in pte level,
> hence IMO the is_cow_mapping() must exist for pte level as you described,
> because it needs to properly take care of those.  Note that in the pte
> processing it also checks pte_write() to make sure it's a COWed page, not a
> RO page cache / pfnmap / ..., for example.
> 
> Meanwhile, since pfnmaps won't appear at pmd/pud, I think it's fair that
> the pmd/pud paths assume a huge mapping must be MAP_PRIVATE, otherwise
> the whole copy_page_range() would already have been skipped.  IOW I think they
> only need to process COWed pages here, and those pages require write bit
> removed in both parent and child when fork().
Is it also based on the fact that there are no MAP_SHARED huge DEVMAP pages up to now?

> 
> After this series, pfnmaps can appear in the form of pmd/pud, then the
> previous assumption stops holding true, as we'll still always copy
> pfnmaps during fork().  My guess for the reason is that most drivers
> map pfnmap vmas only during mmap(), which means there is normally no
> fault() handler at all for those pfns.
> 
> In this case, we'll need to also identify whether the page is COWed, using
> the newly added "is_cow_mapping() && pxx_write()" in this series (added
> to pud path, while for pmd path I used a WARN_ON_ONCE instead).
> 
> If we don't do that, it means e.g. for a VM_SHARED pfnmap vma, after fork()
> we'll wrongly observe write protected entries.  Here the change will make
> sure VM_SHARED can properly persist the write bits on pmds/puds.
> 
> Hope that explains.
> 
Thanks a lot for such detailed explanation!
>
Peter Xu Sept. 11, 2024, 2:34 p.m. UTC | #14
On Wed, Sep 11, 2024 at 10:16:55AM +0800, Yan Zhao wrote:
> On Tue, Sep 10, 2024 at 08:16:10AM -0400, Peter Xu wrote:
> > On Tue, Sep 10, 2024 at 10:52:01AM +0800, Yan Zhao wrote:
> > > Hi Peter,
> > 
> > Hi, Yan,
> > 
> > > 
> > > Not sure if I missed anything.
> > > 
> > > It looks that before this patch, pmd/pud are always write protected without
> > > checking "is_cow_mapping(vma->vm_flags) && pud_write(pud)". pud_wrprotect()
> > > clears dirty bit by moving the dirty value to the software bit.
> > > 
> > > And I have a question: why were pmd/pud previously always write protected?
> > 
> > IIUC this is a separate question - the move of dirty bit in pud_wrprotect()
> > is to avoid wrongly creating shadow stack mappings.  In our discussion I
> > think that's an extra complexity and can be put aside; the dirty bit will
> > get recovered in pud_clear_saveddirty() later, so it's not the same as
> > pud_mkclean().
> But pud_clear_saveddirty() will only set dirty bit when write bit is 1.

Yes, it's because x86 wants to avoid unexpected write=0 && dirty=1 entries,
since they can wrongly appear to be a shadow stack mapping.  Here we can
recover the dirty bit, if it was set, only when the write bit is 1 first.

> 
> > 
> > AFAIU pmd/pud paths don't consider is_cow_mapping() because normally we
> > will not duplicate pgtables in fork() for most shared file mappings
> > (!CoW).  Please refer to vma_needs_copy(), and the comment before its
> > final "return false".  I think it's not strictly is_cow_mapping(), as
> > we're checking anon_vma there; however it's mostly that, just also
> > covering MAP_PRIVATE file mappings where no CoW has happened yet (once
> > CoW happens, anon_vma will already be present).
> > 
> > There're some outliers, e.g. userfault protected, or pfnmaps/mixedmaps.
> > Userfault & mixedmap are not involved in this series at all, so let's
> > discuss pfnmaps.
> > 
> > It means, fork() can still copy pgtable for pfnmap vmas, and it's relevant
> > to this series, because before this series pfnmap only exists in pte level,
> > hence IMO the is_cow_mapping() must exist for pte level as you described,
> > because it needs to properly take care of those.  Note that in the pte
> > processing it also checks pte_write() to make sure it's a COWed page, not a
> > RO page cache / pfnmap / ..., for example.
> > 
> > Meanwhile, since pfnmaps won't appear at pmd/pud, I think it's fair that
> > the pmd/pud paths assume a huge mapping must be MAP_PRIVATE, otherwise
> > the whole copy_page_range() would already have been skipped.  IOW I think they
> > only need to process COWed pages here, and those pages require write bit
> > removed in both parent and child when fork().
> Is it also based on the fact that there are no MAP_SHARED huge DEVMAP pages up to now?

Correct.

Thanks,
diff mbox series

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2c314f631f3..15418ffdd377 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1559,6 +1559,24 @@  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pgtable_t pgtable = NULL;
 	int ret = -ENOMEM;
 
+	pmd = pmdp_get_lockless(src_pmd);
+	if (unlikely(pmd_special(pmd))) {
+		dst_ptl = pmd_lock(dst_mm, dst_pmd);
+		src_ptl = pmd_lockptr(src_mm, src_pmd);
+		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+		/*
+		 * No need to recheck the pmd, it can't change with write
+		 * mmap lock held here.
+		 *
+		 * Meanwhile, making sure it's not a CoW VMA with writable
+		 * mapping, otherwise it means either the anon page wrongly
+		 * applied special bit, or we made the PRIVATE mapping be
+		 * able to wrongly write to the backend MMIO.
+		 */
+		VM_WARN_ON_ONCE(is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd));
+		goto set_pmd;
+	}
+
 	/* Skip if can be re-fill on fault */
 	if (!vma_is_anonymous(dst_vma))
 		return 0;
@@ -1640,7 +1658,9 @@  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
 	if (!userfaultfd_wp(dst_vma))
 		pmd = pmd_clear_uffd_wp(pmd);
-	pmd = pmd_mkold(pmd_wrprotect(pmd));
+	pmd = pmd_wrprotect(pmd);
+set_pmd:
+	pmd = pmd_mkold(pmd);
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 
 	ret = 0;
@@ -1686,8 +1706,11 @@  int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * TODO: once we support anonymous pages, use
 	 * folio_try_dup_anon_rmap_*() and split if duplicating fails.
 	 */
-	pudp_set_wrprotect(src_mm, addr, src_pud);
-	pud = pud_mkold(pud_wrprotect(pud));
+	if (is_cow_mapping(vma->vm_flags) && pud_write(pud)) {
+		pudp_set_wrprotect(src_mm, addr, src_pud);
+		pud = pud_wrprotect(pud);
+	}
+	pud = pud_mkold(pud);
 	set_pud_at(dst_mm, addr, dst_pud, pud);
 
 	ret = 0;