
[5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()

Message ID: 20200921212031.25233-1-peterx@redhat.com
Series: mm: Break COW for pinned pages during fork()

Commit Message

Peter Xu Sept. 21, 2020, 9:20 p.m. UTC
Pinned pages shouldn't be write-protected when fork() happens, because a
follow-up copy-on-write on these pages could cause the pinned pages to be
replaced by random newly allocated pages.

For huge PMDs, we split the huge pmd if pinning is detected, so that future
handling will be done at the PTE level (with our latest changes, each of the
small pages will be copied).  We can achieve this by letting copy_huge_pmd()
return -EAGAIN for pinned pages, so that we fall through in copy_pmd_range()
and finally land in the next copy_pte_range() call.
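
For illustration, the caller-side fallthrough looks roughly like this (a
simplified sketch of copy_pmd_range() in mm/memory.c; shown here only for
illustration, not part of this patch):

	if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd) ||
	    pmd_devmap(*src_pmd)) {
		int err;

		/* Try to copy the whole huge pmd in one go first. */
		err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
				    addr, vma);
		if (err == -ENOMEM)
			return -ENOMEM;
		if (!err)
			continue;
		/*
		 * err == -EAGAIN here: the huge pmd was split because the
		 * page may be pinned; fall through and copy small pages.
		 */
	}
	if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd, vma, addr, next))
		return -ENOMEM;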

Huge PUDs are even more special - so far they do not support anonymous pages.
But they can be handled the same way as huge PMDs, even though splitting a
huge PUD means erasing the PUD entry.  That still guarantees that follow-up
fault-ins will remap the same pages in both parent and child later.

This might not be the most efficient way, but it should be easy and clean
enough.  It should be fine, since we're tackling a very rare case just to make
sure userspace that pinned some THPs will keep working after fork() even
without MADV_DONTFORK.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/huge_memory.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

Comments

John Hubbard Sept. 22, 2020, 6:41 a.m. UTC | #1
On 9/21/20 2:20 PM, Peter Xu wrote:
...
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 7ff29cc3d55c..c40aac0ad87e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1074,6 +1074,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>   
>   	src_page = pmd_page(pmd);
>   	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
> +
> +	/*
> +	 * If this page is a potentially pinned page, split and retry the fault
> +	 * with smaller page size.  Normally this should not happen because the
> +	 * userspace should use MADV_DONTFORK upon pinned regions.  This is a
> +	 * best effort that the pinned pages won't be replaced by another
> +	 * random page during the coming copy-on-write.
> +	 */
> +	if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> +		     page_maybe_dma_pinned(src_page))) {

This condition would make a good static inline function. It's used in 3 places,
and the condition is quite special and worth documenting, and having a separate
function helps with that, because the function name adds to the story. I'd suggest
approximately:

     page_likely_dma_pinned()

for the name.
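
Something in roughly this shape, as a sketch (the signature and the has_pinned
field are assumptions based on this series, nothing final):

	/*
	 * Conservative check: has this mm ever pinned pages, and does this
	 * page look DMA-pinned?  Either half may report a false positive.
	 */
	static inline bool page_likely_dma_pinned(struct mm_struct *mm,
						  struct page *page)
	{
		return READ_ONCE(mm->has_pinned) && page_maybe_dma_pinned(page);
	}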

> +		pte_free(dst_mm, pgtable);
> +		spin_unlock(src_ptl);
> +		spin_unlock(dst_ptl);
> +		__split_huge_pmd(vma, src_pmd, addr, false, NULL);
> +		return -EAGAIN;
> +	}


Why wait until we are so deep into this routine to detect this and unwind?
It seems like you could do the check near the beginning of this routine and
handle it there, with less unwinding. In fact, after taking only the src_ptl,
the check could be made, right?


thanks,
Jan Kara Sept. 22, 2020, 10:33 a.m. UTC | #2
On Mon 21-09-20 23:41:16, John Hubbard wrote:
> On 9/21/20 2:20 PM, Peter Xu wrote:
> ...
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 7ff29cc3d55c..c40aac0ad87e 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1074,6 +1074,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >   	src_page = pmd_page(pmd);
> >   	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
> > +
> > +	/*
> > +	 * If this page is a potentially pinned page, split and retry the fault
> > +	 * with smaller page size.  Normally this should not happen because the
> > +	 * userspace should use MADV_DONTFORK upon pinned regions.  This is a
> > +	 * best effort that the pinned pages won't be replaced by another
> > +	 * random page during the coming copy-on-write.
> > +	 */
> > +	if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > +		     page_maybe_dma_pinned(src_page))) {
> 
> This condition would make a good static inline function. It's used in 3
> places, and the condition is quite special and worth documenting, and
> having a separate function helps with that, because the function name
> adds to the story. I'd suggest approximately:
> 
>     page_likely_dma_pinned()
> 
> for the name.

Well, but we should also capture that this really only works for anonymous
pages. For file pages mm->has_pinned does not work because the page may still
be pinned by a completely unrelated process, as Jann already properly pointed
out earlier in the thread. So maybe anon_page_likely_pinned()?
Possibly also assert PageAnon(page) in it if we want to be paranoid...

								Honza
Jason Gunthorpe Sept. 22, 2020, 12:05 p.m. UTC | #3
On Mon, Sep 21, 2020 at 05:20:31PM -0400, Peter Xu wrote:
> Pinned pages shouldn't be write-protected when fork() happens, because follow
> up copy-on-write on these pages could cause the pinned pages to be replaced by
> random newly allocated pages.
> 
> For huge PMDs, we split the huge pmd if pinning is detected.  So that future
> handling will be done by the PTE level (with our latest changes, each of the
> small pages will be copied).  We can achieve this by let copy_huge_pmd() return
> -EAGAIN for pinned pages, so that we'll fallthrough in copy_pmd_range() and
> finally land the next copy_pte_range() call.
> 
> Huge PUDs will be even more special - so far it does not support anonymous
> pages.  But it can actually be done the same as the huge PMDs even if the split
> huge PUDs means to erase the PUD entries.  It'll guarantee the follow up fault
> ins will remap the same pages in either parent/child later.
> 
> This might not be the most efficient way, but it should be easy and clean
> enough.  It should be fine, since we're tackling with a very rare case just to
> make sure userspaces that pinned some thps will still work even without
> MADV_DONTFORK and after they fork()ed.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
>  mm/huge_memory.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 7ff29cc3d55c..c40aac0ad87e 100644
> +++ b/mm/huge_memory.c
> @@ -1074,6 +1074,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  
>  	src_page = pmd_page(pmd);
>  	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
> +
> +	/*
> +	 * If this page is a potentially pinned page, split and retry the fault
> +	 * with smaller page size.  Normally this should not happen because the
> +	 * userspace should use MADV_DONTFORK upon pinned regions.  This is a
> +	 * best effort that the pinned pages won't be replaced by another
> +	 * random page during the coming copy-on-write.
> +	 */
> +	if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> +		     page_maybe_dma_pinned(src_page))) {
> +		pte_free(dst_mm, pgtable);
> +		spin_unlock(src_ptl);
> +		spin_unlock(dst_ptl);
> +		__split_huge_pmd(vma, src_pmd, addr, false, NULL);
> +		return -EAGAIN;
> +	}

Not sure why, but the PMD stuff here is not calling is_cow_mapping()
before doing the write protect. Seems like it might be an existing
bug?

In any event, the has_pinned logic shouldn't be used without also
checking is_cow_mapping(), so it should be added to that test. Same
remarks for PUD
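
I.e. only apply the pin check to COW mappings, roughly (just a sketch reusing
the hunk above, not a final form):

	if (unlikely(is_cow_mapping(vma->vm_flags) &&
		     READ_ONCE(src_mm->has_pinned) &&
		     page_maybe_dma_pinned(src_page))) {
		pte_free(dst_mm, pgtable);
		spin_unlock(src_ptl);
		spin_unlock(dst_ptl);
		__split_huge_pmd(vma, src_pmd, addr, false, NULL);
		return -EAGAIN;
	}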

Jason
John Hubbard Sept. 22, 2020, 8:01 p.m. UTC | #4
On 9/22/20 3:33 AM, Jan Kara wrote:
> On Mon 21-09-20 23:41:16, John Hubbard wrote:
>> On 9/21/20 2:20 PM, Peter Xu wrote:
>> ...
>>> +	if (unlikely(READ_ONCE(src_mm->has_pinned) &&
>>> +		     page_maybe_dma_pinned(src_page))) {
>>
>> This condition would make a good static inline function. It's used in 3
>> places, and the condition is quite special and worth documenting, and
>> having a separate function helps with that, because the function name
>> adds to the story. I'd suggest approximately:
>>
>>      page_likely_dma_pinned()
>>
>> for the name.
> 
> Well, but we should also capture that this really only works for anonymous
> pages. For file pages mm->has_pinned does not work because the page may be
> still pinned by completely unrelated process as Jann already properly
> pointed out earlier in the thread. So maybe anon_page_likely_pinned()?
> Possibly also assert PageAnon(page) in it if we want to be paranoid...
> 
> 								Honza

The file-backed case doesn't really change anything, though:
page_maybe_dma_pinned() is already a "fuzzy yes" in the same sense: you
can get a false positive. Just like here, with an mm->has_pinned that
could be a false positive for a process.

And for that reason, I'm also not sure an "assert PageAnon(page)" is
desirable. That assertion would prevent file-backed callers from being
able to call a function that provides a fuzzy answer, but I don't see
why you'd want or need to do that. The goal here is to make the fuzzy
answer a little bit more definite, but it's not "broken" just because
the result is still fuzzy, right?

Apologies if I'm missing a huge point here... :)


thanks,
Jan Kara Sept. 23, 2020, 9:22 a.m. UTC | #5
On Tue 22-09-20 13:01:13, John Hubbard wrote:
> On 9/22/20 3:33 AM, Jan Kara wrote:
> > On Mon 21-09-20 23:41:16, John Hubbard wrote:
> > > On 9/21/20 2:20 PM, Peter Xu wrote:
> > > ...
> > > > +	if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > > > +		     page_maybe_dma_pinned(src_page))) {
> > > 
> > > This condition would make a good static inline function. It's used in 3
> > > places, and the condition is quite special and worth documenting, and
> > > having a separate function helps with that, because the function name
> > > adds to the story. I'd suggest approximately:
> > > 
> > >      page_likely_dma_pinned()
> > > 
> > > for the name.
> > 
> > Well, but we should also capture that this really only works for anonymous
> > pages. For file pages mm->has_pinned does not work because the page may be
> > still pinned by completely unrelated process as Jann already properly
> > pointed out earlier in the thread. So maybe anon_page_likely_pinned()?
> > Possibly also assert PageAnon(page) in it if we want to be paranoid...
> > 
> > 								Honza
> 
> The file-backed case doesn't really change anything, though:
> page_maybe_dma_pinned() is already a "fuzzy yes" in the same sense: you
> can get a false positive. Just like here, with an mm->has_pinned that
> could be a false positive for a process.
> 
> And for that reason, I'm also not sure an "assert PageAnon(page)" is
> desirable. That assertion would prevent file-backed callers from being
> able to call a function that provides a fuzzy answer, but I don't see
> why you'd want or need to do that. The goal here is to make the fuzzy
> answer a little bit more definite, but it's not "broken" just because
> the result is still fuzzy, right?
> 
> Apologies if I'm missing a huge point here... :)

But the problem is that if you apply mm->has_pinned check on file pages,
you can get false negatives now. And that's not acceptable...

								Honza
Peter Xu Sept. 23, 2020, 1:50 p.m. UTC | #6
On Wed, Sep 23, 2020 at 11:22:05AM +0200, Jan Kara wrote:
> On Tue 22-09-20 13:01:13, John Hubbard wrote:
> > On 9/22/20 3:33 AM, Jan Kara wrote:
> > > On Mon 21-09-20 23:41:16, John Hubbard wrote:
> > > > On 9/21/20 2:20 PM, Peter Xu wrote:
> > > > ...
> > > > > +	if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > > > > +		     page_maybe_dma_pinned(src_page))) {
> > > > 
> > > > This condition would make a good static inline function. It's used in 3
> > > > places, and the condition is quite special and worth documenting, and
> > > > having a separate function helps with that, because the function name
> > > > adds to the story. I'd suggest approximately:
> > > > 
> > > >      page_likely_dma_pinned()
> > > > 
> > > > for the name.
> > > 
> > > Well, but we should also capture that this really only works for anonymous
> > > pages. For file pages mm->has_pinned does not work because the page may be
> > > still pinned by completely unrelated process as Jann already properly
> > > pointed out earlier in the thread. So maybe anon_page_likely_pinned()?
> > > Possibly also assert PageAnon(page) in it if we want to be paranoid...
> > > 
> > > 								Honza
> > 
> > The file-backed case doesn't really change anything, though:
> > page_maybe_dma_pinned() is already a "fuzzy yes" in the same sense: you
> > can get a false positive. Just like here, with an mm->has_pinned that
> > could be a false positive for a process.
> > 
> > And for that reason, I'm also not sure an "assert PageAnon(page)" is
> > desirable. That assertion would prevent file-backed callers from being
> > able to call a function that provides a fuzzy answer, but I don't see
> > why you'd want or need to do that. The goal here is to make the fuzzy
> > answer a little bit more definite, but it's not "broken" just because
> > the result is still fuzzy, right?
> > 
> > Apologies if I'm missing a huge point here... :)
> 
> But the problem is that if you apply mm->has_pinned check on file pages,
> you can get false negatives now. And that's not acceptable...

Do you mean the case where proc A pinned page P from a file, then proc B mapped
the same page P on the file, then fork() on proc B?

If proc B didn't explicitly pin page P in B's address space too, shouldn't
we return "false" for page_likely_dma_pinned(P)?  Because if proc B didn't pin
the page in its own address space, I'd think it's ok to get the page replaced
at any time as long as the content stays the same.  Or couldn't we?

Thanks,
Jan Kara Sept. 23, 2020, 2:01 p.m. UTC | #7
On Wed 23-09-20 09:50:04, Peter Xu wrote:
> On Wed, Sep 23, 2020 at 11:22:05AM +0200, Jan Kara wrote:
> > On Tue 22-09-20 13:01:13, John Hubbard wrote:
> > > On 9/22/20 3:33 AM, Jan Kara wrote:
> > > > On Mon 21-09-20 23:41:16, John Hubbard wrote:
> > > > > On 9/21/20 2:20 PM, Peter Xu wrote:
> > > > > ...
> > > > > > +	if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > > > > > +		     page_maybe_dma_pinned(src_page))) {
> > > > > 
> > > > > This condition would make a good static inline function. It's used in 3
> > > > > places, and the condition is quite special and worth documenting, and
> > > > > having a separate function helps with that, because the function name
> > > > > adds to the story. I'd suggest approximately:
> > > > > 
> > > > >      page_likely_dma_pinned()
> > > > > 
> > > > > for the name.
> > > > 
> > > > Well, but we should also capture that this really only works for anonymous
> > > > pages. For file pages mm->has_pinned does not work because the page may be
> > > > still pinned by completely unrelated process as Jann already properly
> > > > pointed out earlier in the thread. So maybe anon_page_likely_pinned()?
> > > > Possibly also assert PageAnon(page) in it if we want to be paranoid...
> > > > 
> > > > 								Honza
> > > 
> > > The file-backed case doesn't really change anything, though:
> > > page_maybe_dma_pinned() is already a "fuzzy yes" in the same sense: you
> > > can get a false positive. Just like here, with an mm->has_pinned that
> > > could be a false positive for a process.
> > > 
> > > And for that reason, I'm also not sure an "assert PageAnon(page)" is
> > > desirable. That assertion would prevent file-backed callers from being
> > > able to call a function that provides a fuzzy answer, but I don't see
> > > why you'd want or need to do that. The goal here is to make the fuzzy
> > > answer a little bit more definite, but it's not "broken" just because
> > > the result is still fuzzy, right?
> > > 
> > > Apologies if I'm missing a huge point here... :)
> > 
> > But the problem is that if you apply mm->has_pinned check on file pages,
> > you can get false negatives now. And that's not acceptable...
> 
> Do you mean the case where proc A pinned page P from a file, then proc B
> mapped the same page P on the file, then fork() on proc B?

Yes.

> If proc B didn't explicitly pinned page P in B's address space too,
> shouldn't we return "false" for page_likely_dma_pinned(P)?  Because if
> proc B didn't pin the page in its own address space, I'd think it's ok to
> get the page replaced at any time as long as the content keeps the same.
> Or couldn't we?

So it depends on the reason why you call page_likely_dma_pinned(). For your
COW purposes the check is correct but e.g. for "can filesystem safely
writeback this page" the page_likely_dma_pinned() would be wrong. So I'm
not objecting to the mechanism as such. I'm mainly objecting to the generic
function name which suggests something else than what it really checks and
thus it could be used in wrong places in the future... That's why I'd
prefer to restrict the function to PageAnon pages where there's no risk of
confusion what the check actually does.

								Honza
Peter Xu Sept. 23, 2020, 3:24 p.m. UTC | #8
On Tue, Sep 22, 2020 at 09:05:05AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 21, 2020 at 05:20:31PM -0400, Peter Xu wrote:
> > Pinned pages shouldn't be write-protected when fork() happens, because follow
> > up copy-on-write on these pages could cause the pinned pages to be replaced by
> > random newly allocated pages.
> > 
> > For huge PMDs, we split the huge pmd if pinning is detected.  So that future
> > handling will be done by the PTE level (with our latest changes, each of the
> > small pages will be copied).  We can achieve this by let copy_huge_pmd() return
> > -EAGAIN for pinned pages, so that we'll fallthrough in copy_pmd_range() and
> > finally land the next copy_pte_range() call.
> > 
> > Huge PUDs will be even more special - so far it does not support anonymous
> > pages.  But it can actually be done the same as the huge PMDs even if the split
> > huge PUDs means to erase the PUD entries.  It'll guarantee the follow up fault
> > ins will remap the same pages in either parent/child later.
> > 
> > This might not be the most efficient way, but it should be easy and clean
> > enough.  It should be fine, since we're tackling with a very rare case just to
> > make sure userspaces that pinned some thps will still work even without
> > MADV_DONTFORK and after they fork()ed.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> >  mm/huge_memory.c | 26 ++++++++++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 7ff29cc3d55c..c40aac0ad87e 100644
> > +++ b/mm/huge_memory.c
> > @@ -1074,6 +1074,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  
> >  	src_page = pmd_page(pmd);
> >  	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
> > +
> > +	/*
> > +	 * If this page is a potentially pinned page, split and retry the fault
> > +	 * with smaller page size.  Normally this should not happen because the
> > +	 * userspace should use MADV_DONTFORK upon pinned regions.  This is a
> > +	 * best effort that the pinned pages won't be replaced by another
> > +	 * random page during the coming copy-on-write.
> > +	 */
> > +	if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > +		     page_maybe_dma_pinned(src_page))) {
> > +		pte_free(dst_mm, pgtable);
> > +		spin_unlock(src_ptl);
> > +		spin_unlock(dst_ptl);
> > +		__split_huge_pmd(vma, src_pmd, addr, false, NULL);
> > +		return -EAGAIN;
> > +	}
> 
> Not sure why, but the PMD stuff here is not calling is_cow_mapping()
> before doing the write protect. Seems like it might be an existing
> bug?

IMHO it's not a bug, because splitting a huge pmd should always be safe.

One thing I can think of that might be special here is when the pmd is
anonymously mapped but also shared (a shared tmpfs THP, I think?); then we'll
also mark it as wrprotected even if we don't need to (or maybe we need it
for some reason..).  But again I think it's safe anyway - when a page fault
happens, wp_huge_pmd() should split it into smaller pages unconditionally.  I
just don't know whether it's the ideal way for the shared case.  Andrea should
definitely know better (because it has been there since the first day of THP).

> 
> In any event, the has_pinned logic shouldn't be used without also
> checking is_cow_mapping(), so it should be added to that test. Same
> remarks for PUD

I think the case mentioned above is also the special case here where we didn't
check is_cow_mapping().  The major difference is whether we split the page
right now, or postpone it until the next write in each mm.  But I think, yes,
maybe I should still keep the is_cow_mapping() check, to be explicit.

Thanks,
Peter Xu Sept. 23, 2020, 3:44 p.m. UTC | #9
On Wed, Sep 23, 2020 at 04:01:14PM +0200, Jan Kara wrote:
> On Wed 23-09-20 09:50:04, Peter Xu wrote:
> > On Wed, Sep 23, 2020 at 11:22:05AM +0200, Jan Kara wrote:
> > > On Tue 22-09-20 13:01:13, John Hubbard wrote:
> > > > On 9/22/20 3:33 AM, Jan Kara wrote:
> > > > > On Mon 21-09-20 23:41:16, John Hubbard wrote:
> > > > > > On 9/21/20 2:20 PM, Peter Xu wrote:
> > > > > > ...
> > > > > > > +	if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > > > > > > +		     page_maybe_dma_pinned(src_page))) {
> > > > > > 
> > > > > > This condition would make a good static inline function. It's used in 3
> > > > > > places, and the condition is quite special and worth documenting, and
> > > > > > having a separate function helps with that, because the function name
> > > > > > adds to the story. I'd suggest approximately:
> > > > > > 
> > > > > >      page_likely_dma_pinned()
> > > > > > 
> > > > > > for the name.
> > > > > 
> > > > > Well, but we should also capture that this really only works for anonymous
> > > > > pages. For file pages mm->has_pinned does not work because the page may be
> > > > > still pinned by completely unrelated process as Jann already properly
> > > > > pointed out earlier in the thread. So maybe anon_page_likely_pinned()?
> > > > > Possibly also assert PageAnon(page) in it if we want to be paranoid...
> > > > > 
> > > > > 								Honza
> > > > 
> > > > The file-backed case doesn't really change anything, though:
> > > > page_maybe_dma_pinned() is already a "fuzzy yes" in the same sense: you
> > > > can get a false positive. Just like here, with an mm->has_pinned that
> > > > could be a false positive for a process.
> > > > 
> > > > And for that reason, I'm also not sure an "assert PageAnon(page)" is
> > > > desirable. That assertion would prevent file-backed callers from being
> > > > able to call a function that provides a fuzzy answer, but I don't see
> > > > why you'd want or need to do that. The goal here is to make the fuzzy
> > > > answer a little bit more definite, but it's not "broken" just because
> > > > the result is still fuzzy, right?
> > > > 
> > > > Apologies if I'm missing a huge point here... :)
> > > 
> > > But the problem is that if you apply mm->has_pinned check on file pages,
> > > you can get false negatives now. And that's not acceptable...
> > 
> > Do you mean the case where proc A pinned page P from a file, then proc B
> > mapped the same page P on the file, then fork() on proc B?
> 
> Yes.
> 
> > If proc B didn't explicitly pinned page P in B's address space too,
> > shouldn't we return "false" for page_likely_dma_pinned(P)?  Because if
> > proc B didn't pin the page in its own address space, I'd think it's ok to
> > get the page replaced at any time as long as the content keeps the same.
> > Or couldn't we?
> 
> So it depends on the reason why you call page_likely_dma_pinned(). For your
> COW purposes the check is correct but e.g. for "can filesystem safely
> writeback this page" the page_likely_dma_pinned() would be wrong. So I'm
> not objecting to the mechanism as such. I'm mainly objecting to the generic
> function name which suggests something else than what it really checks and
> thus it could be used in wrong places in the future... That's why I'd
> prefer to restrict the function to PageAnon pages where there's no risk of
> confusion what the check actually does.

How about I introduce the helper as John suggested, but rename it to

  page_maybe_dma_pinned_by_mm()

?

Then we also don't need to judge which is more likely to happen (between
"maybe" and "likely" - that will confuse me if I only read those words..).

I didn't use any extra suffix like "cow" because I think it might be useful for
things besides cow.  Fundamentally the new helper will be mm-based, so "by_mm"
seems to fit better to me.

Does that sound ok?
Peter Xu Sept. 23, 2020, 4:06 p.m. UTC | #10
On Mon, Sep 21, 2020 at 11:41:16PM -0700, John Hubbard wrote:
> On 9/21/20 2:20 PM, Peter Xu wrote:
> ...
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 7ff29cc3d55c..c40aac0ad87e 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1074,6 +1074,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >   	src_page = pmd_page(pmd);
> >   	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
> > +
> > +	/*
> > +	 * If this page is a potentially pinned page, split and retry the fault
> > +	 * with smaller page size.  Normally this should not happen because the
> > +	 * userspace should use MADV_DONTFORK upon pinned regions.  This is a
> > +	 * best effort that the pinned pages won't be replaced by another
> > +	 * random page during the coming copy-on-write.
> > +	 */
> > +	if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > +		     page_maybe_dma_pinned(src_page))) {

[...]

> > +		pte_free(dst_mm, pgtable);
> > +		spin_unlock(src_ptl);
> > +		spin_unlock(dst_ptl);
> > +		__split_huge_pmd(vma, src_pmd, addr, false, NULL);
> > +		return -EAGAIN;
> > +	}
> 
> 
> Why wait until we are so deep into this routine to detect this and unwind?
> It seems like if you could do a check near the beginning of this routine, and
> handle it there, with less unwinding? In fact, after taking only the src_ptl,
> the check could be made, right?

Because that's where we've fetched the page from the pmd, so I can directly
reference src_page.  Also, I think I at least need to check against swp
entries?  So it still seems easier to keep it here, considering it's an
unlikely path.

Thanks,
Yang Shi Sept. 23, 2020, 4:07 p.m. UTC | #11
On Wed, Sep 23, 2020 at 8:26 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Sep 22, 2020 at 09:05:05AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 21, 2020 at 05:20:31PM -0400, Peter Xu wrote:
> > > Pinned pages shouldn't be write-protected when fork() happens, because follow
> > > up copy-on-write on these pages could cause the pinned pages to be replaced by
> > > random newly allocated pages.
> > >
> > > For huge PMDs, we split the huge pmd if pinning is detected.  So that future
> > > handling will be done by the PTE level (with our latest changes, each of the
> > > small pages will be copied).  We can achieve this by let copy_huge_pmd() return
> > > -EAGAIN for pinned pages, so that we'll fallthrough in copy_pmd_range() and
> > > finally land the next copy_pte_range() call.
> > >
> > > Huge PUDs will be even more special - so far it does not support anonymous
> > > pages.  But it can actually be done the same as the huge PMDs even if the split
> > > huge PUDs means to erase the PUD entries.  It'll guarantee the follow up fault
> > > ins will remap the same pages in either parent/child later.
> > >
> > > This might not be the most efficient way, but it should be easy and clean
> > > enough.  It should be fine, since we're tackling with a very rare case just to
> > > make sure userspaces that pinned some thps will still work even without
> > > MADV_DONTFORK and after they fork()ed.
> > >
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > >  mm/huge_memory.c | 26 ++++++++++++++++++++++++++
> > >  1 file changed, 26 insertions(+)
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 7ff29cc3d55c..c40aac0ad87e 100644
> > > +++ b/mm/huge_memory.c
> > > @@ -1074,6 +1074,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > >
> > >     src_page = pmd_page(pmd);
> > >     VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
> > > +
> > > +   /*
> > > +    * If this page is a potentially pinned page, split and retry the fault
> > > +    * with smaller page size.  Normally this should not happen because the
> > > +    * userspace should use MADV_DONTFORK upon pinned regions.  This is a
> > > +    * best effort that the pinned pages won't be replaced by another
> > > +    * random page during the coming copy-on-write.
> > > +    */
> > > +   if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > > +                page_maybe_dma_pinned(src_page))) {
> > > +           pte_free(dst_mm, pgtable);
> > > +           spin_unlock(src_ptl);
> > > +           spin_unlock(dst_ptl);
> > > +           __split_huge_pmd(vma, src_pmd, addr, false, NULL);
> > > +           return -EAGAIN;
> > > +   }
> >
> > Not sure why, but the PMD stuff here is not calling is_cow_mapping()
> > before doing the write protect. Seems like it might be an existing
> > bug?
>
> IMHO it's not a bug, because splitting a huge pmd should always be safe.
>
> One thing I can think of that might be special here is when the pmd is
> anonymously mapped but also shared (shared, tmpfs thp, I think?), then here
> we'll also mark it as wrprotected even if we don't need to (or maybe we need it
> for some reason..).  But again I think it's safe anyways - when page fault

For a tmpfs mapping, the pmd split just clears the pmd entry without
reinstalling ptes (an anonymous mapping, by contrast, would reinstall ptes). It
looks like this patch intends to copy at the pte level by splitting the pmd,
but I'm afraid that may not work for tmpfs mappings.
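
Roughly, the distinction looks like this (a heavily simplified sketch of
__split_huge_pmd_locked(); the names are from mainline but most details are
omitted):

	if (!vma_is_anonymous(vma)) {
		/*
		 * File/tmpfs THP: just clear the pmd entry (plus rmap and
		 * dirty/accessed bookkeeping, omitted here).  No ptes are
		 * reinstalled; the small pages get refaulted later.
		 */
		_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
		...
		return;
	}
	/*
	 * Anonymous THP: withdraw the deposited page table and reinstall
	 * HPAGE_PMD_NR ptes pointing at the individual subpages.
	 */
	...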

> happens, wp_huge_pmd() should split it into smaller pages unconditionally.  I
> just don't know whether it's the ideal way for the shared case.  Andrea should
> definitely know it better (because it is there since the 1st day of thp).
>
> >
> > In any event, the has_pinned logic shouldn't be used without also
> > checking is_cow_mapping(), so it should be added to that test. Same
> > remarks for PUD
>
> I think the case mentioned above is also the special case here when we didn't
> check is_cow_mapping().  The major difference is whether we'll split the page
> right now, or postpone it until the next write to each mm.  But I think, yes,
> maybe I should better still keep the is_cow_mapping() to be explicit.
>
> Thanks,
>
> --
> Peter Xu
>
Jason Gunthorpe Sept. 23, 2020, 5:17 p.m. UTC | #12
On Wed, Sep 23, 2020 at 11:24:09AM -0400, Peter Xu wrote:
> On Tue, Sep 22, 2020 at 09:05:05AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 21, 2020 at 05:20:31PM -0400, Peter Xu wrote:
> > > Pinned pages shouldn't be write-protected when fork() happens, because follow
> > > up copy-on-write on these pages could cause the pinned pages to be replaced by
> > > random newly allocated pages.
> > > 
> > > For huge PMDs, we split the huge pmd if pinning is detected.  So that future
> > > handling will be done by the PTE level (with our latest changes, each of the
> > > small pages will be copied).  We can achieve this by let copy_huge_pmd() return
> > > -EAGAIN for pinned pages, so that we'll fallthrough in copy_pmd_range() and
> > > finally land the next copy_pte_range() call.
> > > 
> > > Huge PUDs will be even more special - so far it does not support anonymous
> > > pages.  But it can actually be done the same as the huge PMDs even if the split
> > > huge PUDs means to erase the PUD entries.  It'll guarantee the follow up fault
> > > ins will remap the same pages in either parent/child later.
> > > 
> > > This might not be the most efficient way, but it should be easy and clean
> > > enough.  It should be fine, since we're tackling with a very rare case just to
> > > make sure userspaces that pinned some thps will still work even without
> > > MADV_DONTFORK and after they fork()ed.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > >  mm/huge_memory.c | 26 ++++++++++++++++++++++++++
> > >  1 file changed, 26 insertions(+)
> > > 
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 7ff29cc3d55c..c40aac0ad87e 100644
> > > +++ b/mm/huge_memory.c
> > > @@ -1074,6 +1074,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > >  
> > >  	src_page = pmd_page(pmd);
> > >  	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
> > > +
> > > +	/*
> > > +	 * If this page is a potentially pinned page, split and retry the fault
> > > +	 * with smaller page size.  Normally this should not happen because the
> > > +	 * userspace should use MADV_DONTFORK upon pinned regions.  This is a
> > > +	 * best effort that the pinned pages won't be replaced by another
> > > +	 * random page during the coming copy-on-write.
> > > +	 */
> > > +	if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > > +		     page_maybe_dma_pinned(src_page))) {
> > > +		pte_free(dst_mm, pgtable);
> > > +		spin_unlock(src_ptl);
> > > +		spin_unlock(dst_ptl);
> > > +		__split_huge_pmd(vma, src_pmd, addr, false, NULL);
> > > +		return -EAGAIN;
> > > +	}
> > 
> > Not sure why, but the PMD stuff here is not calling is_cow_mapping()
> > before doing the write protect. Seems like it might be an existing
> > bug?
> 
> IMHO it's not a bug, because splitting a huge pmd should always be safe.

Sure, splitting is safe, but testing has_pinned without checking COW is
not, for the reason Jann explained.

The 'maybe' in page_maybe_dma_pinned() means it can return true when
the correct answer is false. It can never return false when the
correct answer is true.

It is the same when has_pinned is involved: the combined expression
must never return false when true is correct.  Which means it can only
be applied to COW cases.

Jason
John Hubbard Sept. 23, 2020, 8:19 p.m. UTC | #13
On 9/23/20 8:44 AM, Peter Xu wrote:
> On Wed, Sep 23, 2020 at 04:01:14PM +0200, Jan Kara wrote:
>> On Wed 23-09-20 09:50:04, Peter Xu wrote:
...
>>>> But the problem is that if you apply mm->has_pinned check on file pages,
>>>> you can get false negatives now. And that's not acceptable...
>>>
>>> Do you mean the case where proc A pinned page P from a file, then proc B
>>> mapped the same page P on the file, then fork() on proc B?
>>
>> Yes.

aha, thanks for spelling out the false negative problem.

>>
>>> If proc B didn't explicitly pinned page P in B's address space too,
>>> shouldn't we return "false" for page_likely_dma_pinned(P)?  Because if
>>> proc B didn't pin the page in its own address space, I'd think it's ok to
>>> get the page replaced at any time as long as the content keeps the same.
>>> Or couldn't we?
>>
>> So it depends on the reason why you call page_likely_dma_pinned(). For your
>> COW purposes the check is correct but e.g. for "can filesystem safely
>> writeback this page" the page_likely_dma_pinned() would be wrong. So I'm
>> not objecting to the mechanism as such. I'm mainly objecting to the generic
>> function name which suggests something else than what it really checks and
>> thus it could be used in wrong places in the future... That's why I'd
>> prefer to restrict the function to PageAnon pages where there's no risk of
>> confusion what the check actually does.
> 
> How about I introduce the helper as John suggested, but rename it to
> 
>    page_maybe_dma_pinned_by_mm()
> 
> ?
> 
> Then we also don't need to judge on which is more likely to happen (between
> "maybe" and "likely", since that will confuse me if I only read these words..).
>

You're right, it is too subtle of a distinction after all. I agree that sticking
with "_maybe_" avoids that confusion.


> I didn't use any extra suffix like "cow" because I think it might be useful for
> things besides cow.  Fundamentally the new helper will be mm-based, so "by_mm"
> seems to suite better to me.
> 
> Does that sound ok?
> 

Actually, Jan nailed it. I just wasn't understanding his scenario, but now that
I do, and considering your other point about wording, I think we end up with:

     anon_page_maybe_pinned()

as a pretty good name for a helper function. (We don't want "_mm" because that
refers more to the mechanism used internally, rather than the behavior of the
function. "anon_" adds more meaning.)

...now I better go and try to grok what Jason is recommending for the new
meaning of FOLL_PIN, in another tributary of this thread. I don't *think* it affects
this naming point, though. :)

thanks,
Peter Xu Sept. 24, 2020, 3:47 p.m. UTC | #14
On Wed, Sep 23, 2020 at 09:07:49AM -0700, Yang Shi wrote:
> For tmpfs map, the pmd split just clears the pmd entry without
> reinstalling ptes (oppositely anonymous map would reinstall ptes). It
> looks this patch intends to copy at pte level by splitting pmd. But
> I'm afraid this may not work for tmpfs mappings.

IIUC that's exactly what we want.

We only want to make sure the pinned tmpfs shared pages will be kept there in
the parent.  It's not a must to copy the pages to the child, as long as they
can be faulted in later correctly.
Yang Shi Sept. 24, 2020, 5:29 p.m. UTC | #15
On Thu, Sep 24, 2020 at 8:47 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Sep 23, 2020 at 09:07:49AM -0700, Yang Shi wrote:
> > For tmpfs map, the pmd split just clears the pmd entry without
> > reinstalling ptes (oppositely anonymous map would reinstall ptes). It
> > looks this patch intends to copy at pte level by splitting pmd. But
> > I'm afraid this may not work for tmpfs mappings.
>
> IIUC that's exactly what we want.
>
> We only want to make sure the pinned tmpfs shared pages will be kept there in
> the parent.  It's not a must to copy the pages to the child, as long as they
> can be faulted in later correctly.

Aha, got your point. Yes, they can be refaulted in later. This is how
the file THP pmd split was designed.

>
> --
> Peter Xu
>
Peter Xu Sept. 24, 2020, 6:49 p.m. UTC | #16
On Wed, Sep 23, 2020 at 01:19:08PM -0700, John Hubbard wrote:
> On 9/23/20 8:44 AM, Peter Xu wrote:
> > On Wed, Sep 23, 2020 at 04:01:14PM +0200, Jan Kara wrote:
> > > On Wed 23-09-20 09:50:04, Peter Xu wrote:
> ...
> > > > > But the problem is that if you apply mm->has_pinned check on file pages,
> > > > > you can get false negatives now. And that's not acceptable...
> > > > 
> > > > Do you mean the case where proc A pinned page P from a file, then proc B
> > > > mapped the same page P on the file, then fork() on proc B?
> > > 
> > > Yes.
> 
> aha, thanks for spelling out the false negative problem.
> 
> > > 
> > > > If proc B didn't explicitly pinned page P in B's address space too,
> > > > shouldn't we return "false" for page_likely_dma_pinned(P)?  Because if
> > > > proc B didn't pin the page in its own address space, I'd think it's ok to
> > > > get the page replaced at any time as long as the content keeps the same.
> > > > Or couldn't we?
> > > 
> > > So it depends on the reason why you call page_likely_dma_pinned(). For your
> > > COW purposes the check is correct but e.g. for "can filesystem safely
> > > writeback this page" the page_likely_dma_pinned() would be wrong. So I'm
> > > not objecting to the mechanism as such. I'm mainly objecting to the generic
> > > function name which suggests something else than what it really checks and
> > > thus it could be used in wrong places in the future... That's why I'd
> > > prefer to restrict the function to PageAnon pages where there's no risk of
> > > confusion what the check actually does.
> > 
> > How about I introduce the helper as John suggested, but rename it to
> > 
> >    page_maybe_dma_pinned_by_mm()
> > 
> > ?
> > 
> > Then we also don't need to judge on which is more likely to happen (between
> > "maybe" and "likely", since that will confuse me if I only read these words..).
> > 
> 
> You're right, it is too subtle of a distinction after all. I agree that sticking
> with "_maybe_" avoids that confusion.
> 
> 
> > I didn't use any extra suffix like "cow" because I think it might be useful for
> > things besides cow.  Fundamentally the new helper will be mm-based, so "by_mm"
> > seems to suite better to me.
> > 
> > Does that sound ok?
> > 
> 
> Actually, Jan nailed it. I just wasn't understanding his scenario, but now that
> I do, and considering your other point about wording, I think we end up with:
> 
>     anon_page_maybe_pinned()
> 
> as a pretty good name for a helper function. (We don't want "_mm" because that
> refers more to the mechanism used internally, rather than the behavior of the
> function. "anon_" adds more meaning.)

Actually that was really my intention when I suggested "_by_mm", because IMHO
the new helper actually means "whether this page may be pinned by _this mm_
(not any other address space)".  IOW, the shared-page case that Jan mentioned
is covered by this: although that page was pinned, it was not pinned "by this
mm", e.g. for proc B above.

Though I've no strong opinion either way. I'll start with
anon_page_maybe_pinned().  To me it's probably more important to prepare the
next spin first and see whether we still like it for this release.

Thanks,

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7ff29cc3d55c..c40aac0ad87e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1074,6 +1074,23 @@  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 	src_page = pmd_page(pmd);
 	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+
+	/*
+	 * If this page is a potentially pinned page, split and retry the fault
+	 * with smaller page size.  Normally this should not happen because the
+	 * userspace should use MADV_DONTFORK upon pinned regions.  This is a
+	 * best effort that the pinned pages won't be replaced by another
+	 * random page during the coming copy-on-write.
+	 */
+	if (unlikely(READ_ONCE(src_mm->has_pinned) &&
+		     page_maybe_dma_pinned(src_page))) {
+		pte_free(dst_mm, pgtable);
+		spin_unlock(src_ptl);
+		spin_unlock(dst_ptl);
+		__split_huge_pmd(vma, src_pmd, addr, false, NULL);
+		return -EAGAIN;
+	}
+
 	get_page(src_page);
 	page_dup_rmap(src_page, true);
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
@@ -1177,6 +1194,15 @@  int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		/* No huge zero pud yet */
 	}
 
+	/* Please refer to comments in copy_huge_pmd() */
+	if (unlikely(READ_ONCE(src_mm->has_pinned) &&
+		     page_maybe_dma_pinned(pud_page(pud)))) {
+		spin_unlock(src_ptl);
+		spin_unlock(dst_ptl);
+		__split_huge_pud(vma, src_pud, addr);
+		return -EAGAIN;
+	}
+
 	pudp_set_wrprotect(src_mm, addr, src_pud);
 	pud = pud_mkold(pud_wrprotect(pud));
 	set_pud_at(dst_mm, addr, dst_pud, pud);