
[4/5] mm: Do early cow for pinned pages during fork() for ptes

Message ID 20200921212028.25184-1-peterx@redhat.com (mailing list archive)
State New, archived
Series mm: Break COW for pinned pages during fork()

Commit Message

Peter Xu Sept. 21, 2020, 9:20 p.m. UTC
This patch is greatly inspired by the discussions on the list from Linus, Jason
Gunthorpe and others [1].

It allows copy_pte_range() to do early cow if the pages were pinned on the
source mm.  Currently we don't have an accurate way to know whether a page is
pinned or not.  The only thing we have is page_maybe_dma_pinned().  However
that's good enough for now.  Especially, with the newly added mm->has_pinned
flag to make sure we won't affect processes that never pinned any pages.

It would be easier if we could do GFP_KERNEL allocations within copy_one_pte().
Unfortunately, we can't, because we hold the page table locks of both the
parent and the child processes there.  So the page copy needs to be done
outside copy_one_pte().

The new COPY_MM_BREAK_COW return code is introduced for this: copy_one_pte()
returns it when it finds a pte that may need an early break of COW.

page_duplicate() handles the actual page copy in copy_pte_range().  Of course
we need to do that after releasing the locks.

The slightly tricky part is that page_duplicate() fills copy_mm_data with the
newly copied page, and we then need to re-install the pte with the page table
locks held again.  That's done in pte_install_copied_page().

The whole procedure looks quite similar to wp_page_copy(), but it's simpler
because we know the page is special (pinned) and we know we don't need TLB
flushes, since no one is referencing the new mm yet.

We still have to be very careful to keep track of the two pages (the old
source page and the newly allocated page) across all this lock taking and
releasing, and to make sure neither of them gets lost.
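
Condensed, the retry flow implemented below looks roughly like this (a
simplified sketch with arguments elided, not the literal code - see the full
diff at the end):

    again:
            memset(&data, 0, sizeof(data)); /* skipped when retrying a COW break */
    again_break_cow:
            /* take the src/dst page table locks */
            do {
                    if (unlikely(data.cow_new_page)) {
                            /* 2nd round: arm the page copied by page_duplicate() */
                            if (pte_install_copied_page(..., &data)) {
                                    progress++;
                                    continue;
                            }
                            /* src pte changed under us; fall through and recopy */
                    }
                    copy_ret = copy_one_pte(...);   /* may return COPY_MM_BREAK_COW */
            } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
            /* drop the locks */

            if (copy_ret == COPY_MM_BREAK_COW) {
                    ret = page_duplicate(src_mm, vma, addr, &data); /* GFP_KERNEL copy */
                    if (ret)
                            return ret;
                    goto again_break_cow;   /* retry the same pte with the new page */
            }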

[1] https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 174 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 167 insertions(+), 7 deletions(-)

Comments

Jann Horn Sept. 21, 2020, 9:55 p.m. UTC | #1
On Mon, Sep 21, 2020 at 11:20 PM Peter Xu <peterx@redhat.com> wrote:
> This patch is greatly inspired by the discussions on the list from Linus, Jason
> Gunthorpe and others [1].
>
> It allows copy_pte_range() to do early cow if the pages were pinned on the
> source mm.  Currently we don't have an accurate way to know whether a page is
> pinned or not.  The only thing we have is page_maybe_dma_pinned().  However
> that's good enough for now.  Especially, with the newly added mm->has_pinned
> flag to make sure we won't affect processes that never pinned any pages.

To clarify: This patch only handles pin_user_pages() callers and
doesn't try to address other GUP users, right? E.g. if task A uses
process_vm_write() on task B while task B is going through fork(),
that can still race in such a way that the written data only shows up
in the child and not in B, right?

I dislike the whole pin_user_pages() concept because (as far as I
understand) it fundamentally tries to fix a problem in the subset of
cases that are more likely to occur in practice (long-term pins
overlapping with things like writeback), and ignores the rarer cases
("short-term" GUP).
John Hubbard Sept. 21, 2020, 10:18 p.m. UTC | #2
On 9/21/20 2:55 PM, Jann Horn wrote:
> On Mon, Sep 21, 2020 at 11:20 PM Peter Xu <peterx@redhat.com> wrote:
...
> I dislike the whole pin_user_pages() concept because (as far as I
> understand) it fundamentally tries to fix a problem in the subset of
> cases that are more likely to occur in practice (long-term pins
> overlapping with things like writeback), and ignores the rarer cases
> ("short-term" GUP).
> 

Well, no, that's not really fair. pin_user_pages() provides a key
prerequisite to fixing *all* of the bugs in that area, not just a
subset. The 5 cases in Documentation/core-api/pin_user_pages.rst cover
this pretty well. Or if they don't, let me know and I'll have another
pass at it.

The case for a "pin count" that is (logically) separate from a
page->_refcount is real, and it fixes real problems. An elevated
refcount can be caused by a lot of things, but it can normally be waited
for and/or retried. The FOLL_PIN pages cannot.

Of course, a valid remaining criticism of the situation is: why not
just *always* mark any of these pages as "dma-pinned"?  In other words,
why even have a separate gup/pup API?  And in fact, perhaps eventually
we'll just get rid of the get_user_pages*() side of it. But the pin
count will need to remain, in order to discern between DMA pins and
temporary refcount boosts.
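
For reference, the heuristic being discussed boils down to roughly the
following for order-0 pages (a simplified sketch; the real
page_maybe_dma_pinned() also special-cases compound pages):

	/* pin_user_pages() bumps _refcount by a large bias instead of 1 ... */
	#define GUP_PIN_COUNTING_BIAS	(1U << 10)

	/* ... so this can guess whether a FOLL_PIN reference exists. */
	static inline bool page_maybe_dma_pinned(struct page *page)
	{
		/* Can report false positives on pages with huge refcounts. */
		return page_ref_count(page) >= GUP_PIN_COUNTING_BIAS;
	}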

thanks,
Peter Xu Sept. 21, 2020, 10:27 p.m. UTC | #3
Hi, Jann,

On Mon, Sep 21, 2020 at 11:55:06PM +0200, Jann Horn wrote:
> On Mon, Sep 21, 2020 at 11:20 PM Peter Xu <peterx@redhat.com> wrote:
> > This patch is greatly inspired by the discussions on the list from Linus, Jason
> > Gunthorpe and others [1].
> >
> > It allows copy_pte_range() to do early cow if the pages were pinned on the
> > source mm.  Currently we don't have an accurate way to know whether a page is
> > pinned or not.  The only thing we have is page_maybe_dma_pinned().  However
> > that's good enough for now.  Especially, with the newly added mm->has_pinned
> > flag to make sure we won't affect processes that never pinned any pages.
> 
> To clarify: This patch only handles pin_user_pages() callers and
> doesn't try to address other GUP users, right? E.g. if task A uses
> process_vm_write() on task B while task B is going through fork(),
> that can still race in such a way that the written data only shows up
> in the child and not in B, right?

I saw that process_vm_write() is using pin_user_pages_remote(), so I think that
after this patch is applied the data will only be written to B, not the child,
because when B fork()s with these temporarily pinned pages it will copy the
pages rather than write-protect them.  IIUC the child could still see partial
data, but in the end (after the pages are unpinned) B should always have the
complete data set.
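
For concreteness, the scenario above is task A doing roughly this against
task B's address space (a minimal userspace sketch; names and error handling
are illustrative):

	#define _GNU_SOURCE
	#include <sys/types.h>
	#include <sys/uio.h>

	/* Write len bytes from local_buf into remote_addr inside task B. */
	static ssize_t write_into_b(pid_t b_pid, void *remote_addr,
				    void *local_buf, size_t len)
	{
		struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
		struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

		/* Internally pins B's pages (pin_user_pages_remote) and copies. */
		return process_vm_writev(b_pid, &local, 1, &remote, 1, 0);
	}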

> 
> I dislike the whole pin_user_pages() concept because (as far as I
> understand) it fundamentally tries to fix a problem in the subset of
> cases that are more likely to occur in practice (long-term pins
> overlapping with things like writeback), and ignores the rarer cases
> ("short-term" GUP).

John/Jason or others may be better placed to comment on this one.  From my own
understanding, I thought it was the right thing to do so that we'll always
guarantee process B gets the whole data.  From that pov this patch should make
sense even for short-term gups.  But maybe I've missed something.
Jann Horn Sept. 21, 2020, 10:27 p.m. UTC | #4
On Tue, Sep 22, 2020 at 12:18 AM John Hubbard <jhubbard@nvidia.com> wrote:
> On 9/21/20 2:55 PM, Jann Horn wrote:
> > On Mon, Sep 21, 2020 at 11:20 PM Peter Xu <peterx@redhat.com> wrote:
> ...
> > I dislike the whole pin_user_pages() concept because (as far as I
> > understand) it fundamentally tries to fix a problem in the subset of
> > cases that are more likely to occur in practice (long-term pins
> > overlapping with things like writeback), and ignores the rarer cases
> > ("short-term" GUP).
> >
>
> Well, no, that's not really fair. pin_user_pages() provides a key
> prerequisite to fixing *all* of the bugs in that area, not just a
> subset. The 5 cases in Documentation/core-api/pin_user_pages.rst cover
> this pretty well. Or if they don't, let me know and I'll have another
> pass at it.
>
> The case for a "pin count" that is (logically) separate from a
> page->_refcount is real, and it fixes real problems. An elevated
> refcount can be caused by a lot of things, but it can normally be waited
> for and/or retried. The FOLL_PIN pages cannot.
>
> Of course, a valid remaining criticism of the situation is, "why not
> just *always* mark any of these pages as "dma-pinned"? In other words,
> why even have a separate gup/pup API? And in fact, perhaps eventually
> we'll just get rid of the get_user_pages*() side of it. But the pin
> count will need to remain, in order to discern between DMA pins and
> temporary refcount boosts.

Ah... the documentation you linked implies that FOLL_WRITE should more
or less imply FOLL_PIN? I didn't realize that.

Whoops, and actually, process_vm_writev() does use FOLL_PIN
already, and I just grepped the code the wrong way.

Thanks for the enlightenment; I take back everything I said.
John Hubbard Sept. 22, 2020, 12:08 a.m. UTC | #5
On 9/21/20 3:27 PM, Jann Horn wrote:
> On Tue, Sep 22, 2020 at 12:18 AM John Hubbard <jhubbard@nvidia.com> wrote:
>> On 9/21/20 2:55 PM, Jann Horn wrote:
>>> On Mon, Sep 21, 2020 at 11:20 PM Peter Xu <peterx@redhat.com> wrote:
>> ...
> Ah... the documentation you linked implies that FOLL_WRITE should more
> or less imply FOLL_PIN? I didn't realize that.
> 

hmmm, that does seem like a pretty close approximation. It's certainly
true that if we were only doing reads, and also never marking pages
dirty, then the file system writeback code would be OK.

For completeness we should add: even just reading a page is still a
problem, if one also marks the page as dirty (which is inconsistent and
wrong, but still). That's because the file system code can then break,
during writeback in particular.


thanks,
Oleg Nesterov Sept. 22, 2020, 11:48 a.m. UTC | #6
On 09/21, Peter Xu wrote:
>
> @@ -859,6 +989,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
>  				break;
>  		}
> +
> +		if (unlikely(data.cow_new_page)) {
> +			/*
> +			 * If cow_new_page set, we must be at the 2nd round of
> +			 * a previous COPY_MM_BREAK_COW.  Try to arm the new
> +			 * page now.  Note that in all cases page_break_cow()
> +			 * will properly release the objects in copy_mm_data.
> +			 */
> +			WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
> +			if (pte_install_copied_page(dst_mm, new, src_pte,
> +						    dst_pte, addr, rss,
> +						    &data)) {
> +				/* We installed the pte successfully; move on */
> +				progress++;
> +				continue;

I'm afraid I misread this patch too ;)

But it seems to me in this case the main loop can really "leak"
COPY_MM_BREAK_COW. Suppose the next 31 pte's are pte_none() and
need_resched() is true.

No?

Oleg.
Oleg Nesterov Sept. 22, 2020, 12:40 p.m. UTC | #7
On 09/22, Oleg Nesterov wrote:
>
> On 09/21, Peter Xu wrote:
> >
> > @@ -859,6 +989,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> >  				break;
> >  		}
> > +
> > +		if (unlikely(data.cow_new_page)) {
> > +			/*
> > +			 * If cow_new_page set, we must be at the 2nd round of
> > +			 * a previous COPY_MM_BREAK_COW.  Try to arm the new
> > +			 * page now.  Note that in all cases page_break_cow()
> > +			 * will properly release the objects in copy_mm_data.
> > +			 */
> > +			WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
> > +			if (pte_install_copied_page(dst_mm, new, src_pte,
> > +						    dst_pte, addr, rss,
> > +						    &data)) {
> > +				/* We installed the pte successfully; move on */
> > +				progress++;
> > +				continue;
>
> I'm afraid I misread this patch too ;)
>
> But it seems to me in this case the main loop can really "leak"
> COPY_MM_BREAK_COW. Suppose the next 31 pte's are pte_none() and
> need_resched() is true.
>
> No?

If yes, perhaps we can simplify the copy_ret/cow_new_page logic and make
it more explicit?

Something like below, on top of this patch...

Oleg.


--- x/mm/memory.c
+++ x/mm/memory.c
@@ -704,17 +704,6 @@
 	};
 };
 
-static inline void page_release_cow(struct copy_mm_data *data)
-{
-	/* The old page should only be released in page_duplicate() */
-	WARN_ON_ONCE(data->cow_old_page);
-
-	if (data->cow_new_page) {
-		put_page(data->cow_new_page);
-		data->cow_new_page = NULL;
-	}
-}
-
 /*
  * Duplicate the page for this PTE.  Returns zero if page copied (so we need to
  * retry on the same PTE again to arm the copied page very soon), or negative
@@ -925,7 +914,7 @@
 
 	if (!pte_same(*src_pte, data->cow_oldpte)) {
 		/* PTE has changed under us.  Release the page and retry */
-		page_release_cow(data);
+		put_page(data->cow_new_page);
 		return false;
 	}
 
@@ -936,12 +925,6 @@
 	set_pte_at(dst_mm, addr, dst_pte, entry);
 	rss[mm_counter(new_page)]++;
 
-	/*
-	 * Manually clear the new page pointer since we've moved ownership to
-	 * the newly armed PTE.
-	 */
-	data->cow_new_page = NULL;
-
 	return true;
 }
 
@@ -958,16 +941,12 @@
 	struct copy_mm_data data;
 
 again:
-	/* We don't reset this for COPY_MM_BREAK_COW */
-	memset(&data, 0, sizeof(data));
-
-again_break_cow:
 	init_rss_vec(rss);
 
 	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
 	if (!dst_pte) {
-		/* Guarantee that the new page is released if there is */
-		page_release_cow(&data);
+		if (unlikely(copy_ret == COPY_MM_BREAK_COW))
+			put_page(data.cow_new_page);
 		return -ENOMEM;
 	}
 	src_pte = pte_offset_map(src_pmd, addr);
@@ -978,6 +957,22 @@
 	arch_enter_lazy_mmu_mode();
 
 	progress = 0;
+	if (unlikely(copy_ret == COPY_MM_BREAK_COW)) {
+		/*
+		 * Note that in all cases pte_install_copied_page()
+		 * will properly release the objects in copy_mm_data.
+		 */
+		copy_ret = COPY_MM_DONE;
+		if (pte_install_copied_page(dst_mm, new, src_pte,
+					    dst_pte, addr, rss,
+					    &data)) {
+			/* We installed the pte successfully; move on */
+			progress++;
+			goto next;
+		}
+		/* PTE changed.  Retry this pte (falls through) */
+	}
+
 	do {
 		/*
 		 * We are holding two locks at this point - either of them
@@ -990,24 +985,6 @@
 				break;
 		}
 
-		if (unlikely(data.cow_new_page)) {
-			/*
-			 * If cow_new_page set, we must be at the 2nd round of
-			 * a previous COPY_MM_BREAK_COW.  Try to arm the new
-			 * page now.  Note that in all cases page_break_cow()
-			 * will properly release the objects in copy_mm_data.
-			 */
-			WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
-			if (pte_install_copied_page(dst_mm, new, src_pte,
-						    dst_pte, addr, rss,
-						    &data)) {
-				/* We installed the pte successfully; move on */
-				progress++;
-				continue;
-			}
-			/* PTE changed.  Retry this pte (falls through) */
-		}
-
 		if (pte_none(*src_pte)) {
 			progress++;
 			continue;
@@ -1017,6 +994,7 @@
 		if (copy_ret != COPY_MM_DONE)
 			break;
 		progress += 8;
+next:
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
 	arch_leave_lazy_mmu_mode();
@@ -1030,13 +1008,14 @@
 	case COPY_MM_SWAP_CONT:
 		if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
 			return -ENOMEM;
-		break;
+		copy_ret = COPY_MM_DONE;
+		goto again;
 	case COPY_MM_BREAK_COW:
 		/* Do accounting onto parent mm directly */
 		ret = page_duplicate(src_mm, vma, addr, &data);
 		if (ret)
 			return ret;
-		goto again_break_cow;
+		goto again;
 	case COPY_MM_DONE:
 		/* This means we're all good. */
 		break;
Peter Xu Sept. 22, 2020, 3:58 p.m. UTC | #8
On Tue, Sep 22, 2020 at 02:40:14PM +0200, Oleg Nesterov wrote:
> On 09/22, Oleg Nesterov wrote:
> >
> > On 09/21, Peter Xu wrote:
> > >
> > > @@ -859,6 +989,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > >  			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> > >  				break;
> > >  		}
> > > +
> > > +		if (unlikely(data.cow_new_page)) {
> > > +			/*
> > > +			 * If cow_new_page set, we must be at the 2nd round of
> > > +			 * a previous COPY_MM_BREAK_COW.  Try to arm the new
> > > +			 * page now.  Note that in all cases page_break_cow()
> > > +			 * will properly release the objects in copy_mm_data.
> > > +			 */
> > > +			WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
> > > +			if (pte_install_copied_page(dst_mm, new, src_pte,
> > > +						    dst_pte, addr, rss,
> > > +						    &data)) {
> > > +				/* We installed the pte successfully; move on */
> > > +				progress++;
> > > +				continue;
> >
> > I'm afraid I misread this patch too ;)
> >
> > But it seems to me in this case the main loop can really "leak"
> > COPY_MM_BREAK_COW. Suppose the next 31 pte's are pte_none() and
> > need_resched() is true.
> >
> > No?

I still think it's a no...

Note that now we'll reset "progress" every time before the do loop, so we'll
never reach need_resched() (since progress<32) before pte_install_copied_page()
when needed.

I explicitly put the pte_install_copied_page() into the loop just...

> 
> If yes, perhaps we can simplify the copy_ret/cow_new_page logic and make
> it more explicit?
> 
> Something like below, on top of this patch...
> 
> Oleg.
> 
> 
> --- x/mm/memory.c
> +++ x/mm/memory.c
> @@ -704,17 +704,6 @@
>  	};
>  };
>  
> -static inline void page_release_cow(struct copy_mm_data *data)
> -{
> -	/* The old page should only be released in page_duplicate() */
> -	WARN_ON_ONCE(data->cow_old_page);
> -
> -	if (data->cow_new_page) {
> -		put_page(data->cow_new_page);
> -		data->cow_new_page = NULL;
> -	}
> -}

(I'm not very sure whether I should drop this helper.  I wanted to have more
 spots for checking that everything is right and to warn if something goes
 wrong, and I also wanted cow_new_page to never contain an invalid page
 pointer, since after the put_page() it's invalid (otherwise we'd need to set
 it to NULL explicitly after every put_page()).  I'll still tend to keep this
 if there's no strong opinion.. or I can also drop it if there's another vote.)

> -
>  /*
>   * Duplicate the page for this PTE.  Returns zero if page copied (so we need to
>   * retry on the same PTE again to arm the copied page very soon), or negative
> @@ -925,7 +914,7 @@
>  
>  	if (!pte_same(*src_pte, data->cow_oldpte)) {
>  		/* PTE has changed under us.  Release the page and retry */
> -		page_release_cow(data);
> +		put_page(data->cow_new_page);
>  		return false;
>  	}
>  
> @@ -936,12 +925,6 @@
>  	set_pte_at(dst_mm, addr, dst_pte, entry);
>  	rss[mm_counter(new_page)]++;
>  
> -	/*
> -	 * Manually clear the new page pointer since we've moved ownership to
> -	 * the newly armed PTE.
> -	 */
> -	data->cow_new_page = NULL;
> -
>  	return true;
>  }
>  
> @@ -958,16 +941,12 @@
>  	struct copy_mm_data data;
>  
>  again:
> -	/* We don't reset this for COPY_MM_BREAK_COW */
> -	memset(&data, 0, sizeof(data));
> -
> -again_break_cow:
>  	init_rss_vec(rss);
>  
>  	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
>  	if (!dst_pte) {
> -		/* Guarantee that the new page is released if there is */
> -		page_release_cow(&data);
> +		if (unlikely(copy_ret == COPY_MM_BREAK_COW))
> +			put_page(data.cow_new_page);
>  		return -ENOMEM;
>  	}
>  	src_pte = pte_offset_map(src_pmd, addr);
> @@ -978,6 +957,22 @@
>  	arch_enter_lazy_mmu_mode();
>  
>  	progress = 0;
> +	if (unlikely(copy_ret == COPY_MM_BREAK_COW)) {
> +		/*
> +		 * Note that in all cases pte_install_copied_page()
> +		 * will properly release the objects in copy_mm_data.
> +		 */
> +		copy_ret = COPY_MM_DONE;
> +		if (pte_install_copied_page(dst_mm, new, src_pte,
> +					    dst_pte, addr, rss,
> +					    &data)) {
> +			/* We installed the pte successfully; move on */
> +			progress++;
> +			goto next;

... to avoid jumps like this because I think it's really tricky. :)

> +		}
> +		/* PTE changed.  Retry this pte (falls through) */
> +	}
> +
>  	do {
>  		/*
>  		 * We are holding two locks at this point - either of them
> @@ -990,24 +985,6 @@
>  				break;
>  		}
>  
> -		if (unlikely(data.cow_new_page)) {
> -			/*
> -			 * If cow_new_page set, we must be at the 2nd round of
> -			 * a previous COPY_MM_BREAK_COW.  Try to arm the new
> -			 * page now.  Note that in all cases page_break_cow()
> -			 * will properly release the objects in copy_mm_data.
> -			 */
> -			WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
> -			if (pte_install_copied_page(dst_mm, new, src_pte,
> -						    dst_pte, addr, rss,
> -						    &data)) {
> -				/* We installed the pte successfully; move on */
> -				progress++;
> -				continue;
> -			}
> -			/* PTE changed.  Retry this pte (falls through) */
> -		}
> -
>  		if (pte_none(*src_pte)) {
>  			progress++;
>  			continue;
> @@ -1017,6 +994,7 @@
>  		if (copy_ret != COPY_MM_DONE)
>  			break;
>  		progress += 8;
> +next:
>  	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
>  
>  	arch_leave_lazy_mmu_mode();
> @@ -1030,13 +1008,14 @@
>  	case COPY_MM_SWAP_CONT:
>  		if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
>  			return -ENOMEM;
> -		break;
> +		copy_ret = COPY_MM_DONE;

Kind of a continuation of the discussion from previous patch - I think we'd
better reset copy_ret not only for this case, but move it after the switch
(just in case there'll be new ones).  The new BREAK_COW uses goto so it's quite
special.

> +		goto again;

I feel like this could go wrong without the "addr != end" check later, when
this is the last pte to check.

Thanks,

>  	case COPY_MM_BREAK_COW:
>  		/* Do accounting onto parent mm directly */
>  		ret = page_duplicate(src_mm, vma, addr, &data);
>  		if (ret)
>  			return ret;
> -		goto again_break_cow;
> +		goto again;
>  	case COPY_MM_DONE:
>  		/* This means we're all good. */
>  		break;
>
Oleg Nesterov Sept. 22, 2020, 4:52 p.m. UTC | #9
On 09/22, Peter Xu wrote:
>
> On Tue, Sep 22, 2020 at 02:40:14PM +0200, Oleg Nesterov wrote:
> > On 09/22, Oleg Nesterov wrote:
> > >
> > > On 09/21, Peter Xu wrote:
> > > >
> > > > @@ -859,6 +989,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > >  			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> > > >  				break;
> > > >  		}
> > > > +
> > > > +		if (unlikely(data.cow_new_page)) {
> > > > +			/*
> > > > +			 * If cow_new_page set, we must be at the 2nd round of
> > > > +			 * a previous COPY_MM_BREAK_COW.  Try to arm the new
> > > > +			 * page now.  Note that in all cases page_break_cow()
> > > > +			 * will properly release the objects in copy_mm_data.
> > > > +			 */
> > > > +			WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
> > > > +			if (pte_install_copied_page(dst_mm, new, src_pte,
> > > > +						    dst_pte, addr, rss,
> > > > +						    &data)) {
> > > > +				/* We installed the pte successfully; move on */
> > > > +				progress++;
> > > > +				continue;
> > >
> > > I'm afraid I misread this patch too ;)
> > >
> > > But it seems to me in this case the main loop can really "leak"
> > > COPY_MM_BREAK_COW. Suppose the next 31 pte's are pte_none() and
> > > need_resched() is true.
> > >
> > > No?
>
> I still think it's a no...
>
> Note that now we'll reset "progress" every time before the do loop, so we'll
> never reach need_resched() (since progress<32) before pte_install_copied_page()
> when needed.

Yes. But copy_ret is still COPY_MM_BREAK_COW after pte_install_copied_page().
Now suppose that the next 31 pte's are pte_none(): progress will be incremented
every time, so the loop can exit with copy_ret still set to COPY_MM_BREAK_COW.

> I explicitly put the pte_install_copied_page() into the loop just...
...
> >  	progress = 0;
> > +	if (unlikely(copy_ret == COPY_MM_BREAK_COW)) {
> > +		/*
> > +		 * Note that in all cases pte_install_copied_page()
> > +		 * will properly release the objects in copy_mm_data.
> > +		 */
> > +		copy_ret = COPY_MM_DONE;
> > +		if (pte_install_copied_page(dst_mm, new, src_pte,
> > +					    dst_pte, addr, rss,
> > +					    &data)) {
> > +			/* We installed the pte successfully; move on */
> > +			progress++;
> > +			goto next;
>
> ... to avoid jumps like this because I think it's really tricky. :)

To me it looks better before the main loop because we know that
data.cow_new_page != NULL is only possible at the 1st iteration after
restart ;)

But I agree, this is subjective, please ignore. However, I still think
it is better to rely on the copy_ret == COPY_MM_BREAK_COW check rather
than data.cow_new_page != NULL.

> >  	case COPY_MM_SWAP_CONT:
> >  		if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
> >  			return -ENOMEM;
> > -		break;
> > +		copy_ret = COPY_MM_DONE;
>
> Kind of a continuation of the discussion from previous patch - I think we'd
> better reset copy_ret not only for this case, but move it after the switch
> (just in case there'll be new ones).  The new BREAK_COW uses goto so it's quite
> special.
>
> > +		goto again;
>
> I feel like this could go wrong without the "addr != end" check later, when
> this is the last pte to check.

How? We know that copy_one_pte() failed and returned COPY_MM_SWAP_CONT
before addr = end.

And this matters for the "case COPY_MM_BREAK_COW" below, which does "goto again"
without the "addr != end" check.

Oleg.
Peter Xu Sept. 22, 2020, 6:34 p.m. UTC | #10
On Tue, Sep 22, 2020 at 06:52:17PM +0200, Oleg Nesterov wrote:
> On 09/22, Peter Xu wrote:
> >
> > On Tue, Sep 22, 2020 at 02:40:14PM +0200, Oleg Nesterov wrote:
> > > On 09/22, Oleg Nesterov wrote:
> > > >
> > > > On 09/21, Peter Xu wrote:
> > > > >
> > > > > @@ -859,6 +989,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > > >  			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> > > > >  				break;
> > > > >  		}
> > > > > +
> > > > > +		if (unlikely(data.cow_new_page)) {
> > > > > +			/*
> > > > > +			 * If cow_new_page set, we must be at the 2nd round of
> > > > > +			 * a previous COPY_MM_BREAK_COW.  Try to arm the new
> > > > > +			 * page now.  Note that in all cases page_break_cow()
> > > > > +			 * will properly release the objects in copy_mm_data.
> > > > > +			 */
> > > > > +			WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
> > > > > +			if (pte_install_copied_page(dst_mm, new, src_pte,
> > > > > +						    dst_pte, addr, rss,
> > > > > +						    &data)) {
> > > > > +				/* We installed the pte successfully; move on */
> > > > > +				progress++;
> > > > > +				continue;
> > > >
> > > > I'm afraid I misread this patch too ;)
> > > >
> > > > But it seems to me in this case the main loop can really "leak"
> > > > COPY_MM_BREAK_COW. Suppose the next 31 pte's are pte_none() and
> > > > need_resched() is true.
> > > >
> > > > No?
> >
> > I still think it's a no...
> >
> > Note that now we'll reset "progress" every time before the do loop, so we'll
> > never reach need_resched() (since progress<32) before pte_install_copied_page()
> > when needed.
> 
> Yes. But copy_ret is still COPY_MM_BREAK_COW after pte_install_copied_page().
> Now suppose that the next 31 pte's are pte_none(), progress will be incremented
> every time.

Yes, I think you're right - I'll need to reset that.

> 
> > I explicitly put the pte_install_copied_page() into the loop just...
> ...
> > >  	progress = 0;
> > > +	if (unlikely(copy_ret == COPY_MM_BREAK_COW)) {
> > > +		/*
> > > +		 * Note that in all cases pte_install_copied_page()
> > > +		 * will properly release the objects in copy_mm_data.
> > > +		 */
> > > +		copy_ret = COPY_MM_DONE;
> > > +		if (pte_install_copied_page(dst_mm, new, src_pte,
> > > +					    dst_pte, addr, rss,
> > > +					    &data)) {
> > > +			/* We installed the pte successfully; move on */
> > > +			progress++;
> > > +			goto next;
> >
> > ... to avoid jumps like this because I think it's really tricky. :)
> 
> To me it looks better before the main loop because we know that
> data.cow_new_page != NULL is only possible at the 1st iteration after
> restart ;)
> 
> But I agree, this is subjective, please ignore.

Thanks.  For simplicity, I'll keep the code mostly as is.  But I'm still open
to changing it if e.g. someone else still prefers the other way.

> However, I still think
> it is better to rely on the copy_ret == COPY_MM_BREAK_COW check rather
> than data.cow_new_page != NULL.

Yes.  Logically we should check both, but for now I've written it as:

        if (unlikely(data.cow_new_page)) {
                WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
                ...
        }

I think it's even safer because it's actually checking both, but it also warns
if only cow_new_page is set, which should never happen anyway.

Or I can also do it in the inverted order if you think that's better:

        if (unlikely(copy_ret == COPY_MM_BREAK_COW)) {
                WARN_ON_ONCE(!data.cow_new_page);
                ...
        }

> 
> > >  	case COPY_MM_SWAP_CONT:
> > >  		if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
> > >  			return -ENOMEM;
> > > -		break;
> > > +		copy_ret = COPY_MM_DONE;
> >
> > Kind of a continuation of the discussion from previous patch - I think we'd
> > better reset copy_ret not only for this case, but move it after the switch
> > (just in case there'll be new ones).  The new BREAK_COW uses goto so it's quite
> > special.
> >
> > > +		goto again;
> >
> > I feel like this could go wrong without the "addr != end" check later, when
> > this is the last pte to check.
> 
> How? We know that copy_one_pte() failed and returned COPY_MM_SWAP_CONT
> before addr = end.

I think you're right, again. :)

Thanks,
Oleg Nesterov Sept. 22, 2020, 6:44 p.m. UTC | #11
On 09/22, Peter Xu wrote:
>
> Or I can also do it in inverted order if you think better:
>
>         if (unlikely(copy_ret == COPY_MM_BREAK_COW)) {
>                 WARN_ON_ONCE(!data.cow_new_page);
>                 ...
>         }

Peter, let me say this again. I don't understand this code enough, you
can safely ignore me ;)

However. Personally I strongly prefer the above. Personally I really
dislike this part of 4/5:

	 again:
	+	/* We don't reset this for COPY_MM_BREAK_COW */
	+	memset(&data, 0, sizeof(data));
	+
	+again_break_cow:

If we rely on "copy_ret == COPY_MM_BREAK_COW" we can unify "again" and
"again_break_cow", we don't need to clear ->cow_new_page, this makes the
logic more understandable. To me at least ;)

Oleg.
Peter Xu Sept. 23, 2020, 1:03 a.m. UTC | #12
On Tue, Sep 22, 2020 at 08:44:00PM +0200, Oleg Nesterov wrote:
> On 09/22, Peter Xu wrote:
> >
> > Or I can also do it in inverted order if you think better:
> >
> >         if (unlikely(copy_ret == COPY_MM_BREAK_COW)) {
> >                 WARN_ON_ONCE(!data.cow_new_page);
> >                 ...
> >         }
> 
> Peter, let me say this again. I don't understand this code enough, you
> can safely ignore me ;)

Why? I appreciate every single comment from you! :)

> 
> However. Personally I strongly prefer the above. Personally I really
> dislike this part of 4/5:
> 
> 	 again:
> 	+	/* We don't reset this for COPY_MM_BREAK_COW */
> 	+	memset(&data, 0, sizeof(data));
> 	+
> 	+again_break_cow:
> 
> If we rely on "copy_ret == COPY_MM_BREAK_COW" we can unify "again" and
> "again_break_cow", we don't need to clear ->cow_new_page, this makes the
> logic more understandable. To me at least ;)

I see your point.  I'll definitely try it out.  I think I'll at least use what
you preferred above, since it's actually the same as before, logically.  Then
I'll consider dropping the again_break_cow label, as long as I'm still confident
that nothing gets leaked after I do the change :).

Thanks,
Linus Torvalds Sept. 23, 2020, 8:25 p.m. UTC | #13
On Tue, Sep 22, 2020 at 6:03 PM Peter Xu <peterx@redhat.com> wrote:
>
> > If we rely on "copy_ret == COPY_MM_BREAK_COW" we can unify "again" and
> > "again_break_cow", we don't need to clear ->cow_new_page, this makes the
> > logic more understandable. To me at least ;)
>
> I see your point.  I'll definitely try it out.  I think I'll at least use what
> you preferred above since it's actually the same as before, logically.  Then
> I'll consider drop the again_break_cow, as long as I'm still as confident after
> I do the change on not leaking anything :).

So the two patches I sent out to re-organize copy_one_pte() were
literally meant to make all this mess go away.

IOW, the third patch would be something (COMPLETELY UNTESTED) like the attached.

I think the logic for the preallocation is fairly obvious, but it
might be better to allocate a batch of pages for all I know. That
said, I can't really make myself care about the performance of a
fork() after you've pinned pages in it, so..
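
The shape of such a preallocation approach would be roughly the following (a
rough sketch with made-up names such as copy_present_pte(); this is not the
attached patch):

	struct page *prealloc = NULL;
	int ret;

again:
	/* take the dst/src page table locks */
	do {
		/*
		 * The copy helper consumes *prealloc when it hits a pinned
		 * page; if it needs a page and has none, it bails out with
		 * -EAGAIN instead of allocating under the locks.
		 */
		ret = copy_present_pte(dst_mm, src_mm, dst_pte, src_pte,
				       vma, addr, rss, &prealloc);
		if (ret == -EAGAIN)
			break;
	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
	/* drop the locks */

	if (ret == -EAGAIN) {
		/* Sleepable context: preallocate, then retry the same pte. */
		prealloc = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
		if (!prealloc)
			return -ENOMEM;
		goto again;
	}
	if (prealloc)
		put_page(prealloc);	/* leftover page never got used */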

                 Linus
Kirill Tkhai Sept. 24, 2020, 11:48 a.m. UTC | #14
On 22.09.2020 00:20, Peter Xu wrote:
> This patch is greatly inspired by the discussions on the list from Linus, Jason
> Gunthorpe and others [1].
> 
> It allows copy_pte_range() to do early cow if the pages were pinned on the
> source mm.  Currently we don't have an accurate way to know whether a page is
> pinned or not.  The only thing we have is page_maybe_dma_pinned().  However
> that's good enough for now.  Especially, with the newly added mm->has_pinned
> flag to make sure we won't affect processes that never pinned any pages.
> 
> It would be easier if we can do GFP_KERNEL allocation within copy_one_pte().
> Unluckily, we can't because we're with the page table locks held for both the
> parent and child processes.  So the page copy process needs to be done outside
> copy_one_pte().
> 
> The new COPY_MM_BREAK_COW is introduced for this - copy_one_pte() would return
> this when it finds any pte that may need an early breaking of cow.
> 
> page_duplicate() is used to handle the page copy process in copy_pte_range().
> Of course we need to do that after releasing of the locks.
> 
> The slightly tricky part is page_duplicate() will fill in the copy_mm_data with
> the new page copied and we'll need to re-install the pte again with page table
> locks held again.  That's done in pte_install_copied_page().
> 
> The whole procedure looks quite similar to wp_page_copy() however it's simpler
> because we know the page is special (pinned) and we know we don't need tlb
> flushings because no one is referencing the new mm yet.
> 
> Though we still have to be very careful on maintaining the two pages (one old
> source page, one new allocated page) across all these lock taking/releasing
> process and make sure neither of them will get lost.
> 
> [1] https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/memory.c | 174 +++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 167 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 1530bb1070f4..8f3521be80ca 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -691,12 +691,72 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>  
>  #define  COPY_MM_DONE               0
>  #define  COPY_MM_SWAP_CONT          1
> +#define  COPY_MM_BREAK_COW          2
>  
>  struct copy_mm_data {
>  	/* COPY_MM_SWAP_CONT */
>  	swp_entry_t entry;
> +	/* COPY_MM_BREAK_COW */
> +	struct {
> +		struct page *cow_old_page; /* Released by page_duplicate() */
> +		struct page *cow_new_page; /* Released by page_release_cow() */
> +		pte_t cow_oldpte;
> +	};
>  };
>  
> +static inline void page_release_cow(struct copy_mm_data *data)
> +{
> +	/* The old page should only be released in page_duplicate() */
> +	WARN_ON_ONCE(data->cow_old_page);
> +
> +	if (data->cow_new_page) {
> +		put_page(data->cow_new_page);
> +		data->cow_new_page = NULL;
> +	}
> +}
> +
> +/*
> + * Duplicate the page for this PTE.  Returns zero if page copied (so we need to
> + * retry on the same PTE again to arm the copied page very soon), or negative
> + * if error happened.  In all cases, the old page will be properly released.
> + */
> +static int page_duplicate(struct mm_struct *src_mm, struct vm_area_struct *vma,
> +			  unsigned long address, struct copy_mm_data *data)
> +{
> +	struct page *new_page = NULL;
> +	int ret;
> +
> +	/* This should have been set in copy_one_pte() when we reach here */
> +	WARN_ON_ONCE(!data->cow_old_page);

Despite WARN() is preferred over BUG() in kernel, it looks a little strange that
we catch WARN once here, but later avoid panic in put_page().

> +	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> +	if (!new_page) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	copy_user_highpage(new_page, data->cow_old_page, address, vma);
> +	ret = mem_cgroup_charge(new_page, src_mm, GFP_KERNEL);

All failing operations should go first, while copy_user_highpage() should go last.
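
I.e. roughly (just a sketch; only the ordering changes):

	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
	if (!new_page) {
		ret = -ENOMEM;
		goto out;
	}

	/* Charge first: it can fail, and then the copy would be wasted work. */
	ret = mem_cgroup_charge(new_page, src_mm, GFP_KERNEL);
	if (ret) {
		put_page(new_page);
		ret = -ENOMEM;
		goto out;
	}
	cgroup_throttle_swaprate(new_page, GFP_KERNEL);

	/* Only now do the expensive copy, when nothing can fail any more. */
	copy_user_highpage(new_page, data->cow_old_page, address, vma);
	__SetPageUptodate(new_page);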

> +	if (ret) {
> +		put_page(new_page);
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	cgroup_throttle_swaprate(new_page, GFP_KERNEL);
> +	__SetPageUptodate(new_page);
> +
> +	/* So far so good; arm the new page for the next attempt */
> +	data->cow_new_page = new_page;
> +
> +out:
> +	/* Always release the old page */
> +	put_page(data->cow_old_page);
> +	data->cow_old_page = NULL;
> +
> +	return ret;
> +}
> +
>  /*
>   * copy one vm_area from one task to the other. Assumes the page tables
>   * already present in the new task to be cleared in the whole range
> @@ -711,6 +771,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	unsigned long vm_flags = vma->vm_flags;
>  	pte_t pte = *src_pte;
>  	struct page *page;
> +	bool wp;
>  
>  	/* pte contains position in swap or file, so copy. */
>  	if (unlikely(!pte_present(pte))) {
> @@ -789,10 +850,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	 * If it's a COW mapping, write protect it both
>  	 * in the parent and the child
>  	 */
> -	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
> -		ptep_set_wrprotect(src_mm, addr, src_pte);
> -		pte = pte_wrprotect(pte);
> -	}
> +	wp = is_cow_mapping(vm_flags) && pte_write(pte);
>  
>  	/*
>  	 * If it's a shared mapping, mark it clean in
> @@ -813,15 +871,80 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	page = vm_normal_page(vma, addr, pte);
>  	if (page) {
>  		get_page(page);
> +
> +		/*
> +		 * If the page is pinned in source mm, do early cow right now
> +		 * so that the pinned page won't be replaced by another random
> +		 * page without being noticed after the fork().
> +		 *
> +		 * Note: there can be some very rare cases that we'll do
> +		 * unnecessary cow here, due to page_maybe_dma_pinned() is
> +		 * sometimes bogus, and has_pinned flag is currently aggressive
> +		 * too.  However this should be good enough for us for now as
> +		 * long as we covered all the pinned pages.  We can make this
> +		 * better in the future by providing an accurate accounting for
> +		 * pinned pages.
> +		 *
> +		 * Because we'll need to release the locks before doing cow,
> +		 * pass this work to upper layer.
> +		 */
> +		if (READ_ONCE(src_mm->has_pinned) && wp &&
> +		    page_maybe_dma_pinned(page)) {
> +			/* We've got the page already; we're safe */
> +			data->cow_old_page = page;
> +			data->cow_oldpte = *src_pte;
> +			return COPY_MM_BREAK_COW;
> +		}
> +
>  		page_dup_rmap(page, false);
>  		rss[mm_counter(page)]++;
>  	}
>  
> +	if (wp) {
> +		ptep_set_wrprotect(src_mm, addr, src_pte);
> +		pte = pte_wrprotect(pte);
> +	}
> +
>  out_set_pte:
>  	set_pte_at(dst_mm, addr, dst_pte, pte);
>  	return COPY_MM_DONE;
>  }
>  
> +/*
> + * Install the pte with the copied page stored in `data'.  Returns true when
> + * installation completes, or false when src pte has changed.
> + */
> +static int pte_install_copied_page(struct mm_struct *dst_mm,
> +				   struct vm_area_struct *new,
> +				   pte_t *src_pte, pte_t *dst_pte,
> +				   unsigned long addr, int *rss,
> +				   struct copy_mm_data *data)
> +{
> +	struct page *new_page = data->cow_new_page;
> +	pte_t entry;
> +
> +	if (!pte_same(*src_pte, data->cow_oldpte)) {
> +		/* PTE has changed under us.  Release the page and retry */
> +		page_release_cow(data);
> +		return false;
> +	}
> +
> +	entry = mk_pte(new_page, new->vm_page_prot);
> +	entry = pte_sw_mkyoung(entry);
> +	entry = maybe_mkwrite(pte_mkdirty(entry), new);
> +	page_add_new_anon_rmap(new_page, new, addr, false);
> +	set_pte_at(dst_mm, addr, dst_pte, entry);
> +	rss[mm_counter(new_page)]++;
> +
> +	/*
> +	 * Manually clear the new page pointer since we've moved ownership to
> +	 * the newly armed PTE.
> +	 */
> +	data->cow_new_page = NULL;
> +
> +	return true;
> +}
> +
>  static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  		   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
>  		   struct vm_area_struct *new,
> @@ -830,16 +953,23 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	pte_t *orig_src_pte, *orig_dst_pte;
>  	pte_t *src_pte, *dst_pte;
>  	spinlock_t *src_ptl, *dst_ptl;
> -	int progress, copy_ret = COPY_MM_DONE;
> +	int progress, ret, copy_ret = COPY_MM_DONE;
>  	int rss[NR_MM_COUNTERS];
>  	struct copy_mm_data data;
>  
>  again:
> +	/* We don't reset this for COPY_MM_BREAK_COW */
> +	memset(&data, 0, sizeof(data));
> +
> +again_break_cow:
>  	init_rss_vec(rss);
>  
>  	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
> -	if (!dst_pte)
> +	if (!dst_pte) {
> +		/* Guarantee that the new page is released if there is */
> +		page_release_cow(&data);
>  		return -ENOMEM;
> +	}
>  	src_pte = pte_offset_map(src_pmd, addr);
>  	src_ptl = pte_lockptr(src_mm, src_pmd);
>  	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> @@ -859,6 +989,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
>  				break;
>  		}
> +
> +		if (unlikely(data.cow_new_page)) {
> +			/*
> +			 * If cow_new_page set, we must be at the 2nd round of
> +			 * a previous COPY_MM_BREAK_COW.  Try to arm the new
> +			 * page now.  Note that in all cases page_break_cow()
> +			 * will properly release the objects in copy_mm_data.
> +			 */
> +			WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
> +			if (pte_install_copied_page(dst_mm, new, src_pte,
> +						    dst_pte, addr, rss,
> +						    &data)) {

It looks a little confusing that all helpers in this function return 0 in case of success,
while pte_install_copied_page() returns true. Wouldn't it be better to return 0 and -EAGAIN
from it instead?
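
E.g. (a sketch; only the return convention changes):

	if (!pte_same(*src_pte, data->cow_oldpte)) {
		/* PTE has changed under us.  Release the page and retry */
		page_release_cow(data);
		return -EAGAIN;		/* instead of false */
	}
	/* ... install the pte exactly as before ... */
	return 0;			/* instead of true */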

> +				/* We installed the pte successfully; move on */
> +				progress++;
> +				continue;
> +			}
> +			/* PTE changed.  Retry this pte (falls through) */
> +		}
> +
>  		if (pte_none(*src_pte)) {
>  			progress++;
>  			continue;
> @@ -882,8 +1031,19 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  		if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
>  			return -ENOMEM;
>  		break;
> -	default:
> +	case COPY_MM_BREAK_COW:
> +		/* Do accounting onto parent mm directly */
> +		ret = page_duplicate(src_mm, vma, addr, &data);
> +		if (ret)
> +			return ret;
> +		goto again_break_cow;
> +	case COPY_MM_DONE:
> +		/* This means we're all good. */
>  		break;
> +	default:
> +		/* This should mean copy_ret < 0.  Time to fail this fork().. */
> +		WARN_ON_ONCE(copy_ret >= 0);
> +		return copy_ret;
>  	}
>  
>  	if (addr != end)
>
Peter Xu Sept. 24, 2020, 3:08 p.m. UTC | #15
On Wed, Sep 23, 2020 at 01:25:52PM -0700, Linus Torvalds wrote:
> IOW, the third patch would be something (COMPLETELY UNTESTED) like the attached.

Thanks.  I'll rework on top.
Peter Xu Sept. 24, 2020, 3:16 p.m. UTC | #16
On Thu, Sep 24, 2020 at 02:48:00PM +0300, Kirill Tkhai wrote:
> > +/*
> > + * Duplicate the page for this PTE.  Returns zero if page copied (so we need to
> > + * retry on the same PTE again to arm the copied page very soon), or negative
> > + * if error happened.  In all cases, the old page will be properly released.
> > + */
> > +static int page_duplicate(struct mm_struct *src_mm, struct vm_area_struct *vma,
> > +			  unsigned long address, struct copy_mm_data *data)
> > +{
> > +	struct page *new_page = NULL;
> > +	int ret;
> > +
> > +	/* This should have been set in copy_one_pte() when we reach here */
> > +	WARN_ON_ONCE(!data->cow_old_page);
> 
> Despite WARN() is preferred over BUG() in kernel, it looks a little strange that
> we catch WARN once here, but later avoid panic in put_page().

Do you mean "it'll panic in put_page()"?  I'll agree if so, seems this
WARN_ON_ONCE() won't help much.

> 
> > +	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> > +	if (!new_page) {
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +
> > +	copy_user_highpage(new_page, data->cow_old_page, address, vma);
> > +	ret = mem_cgroup_charge(new_page, src_mm, GFP_KERNEL);
> 
> All failing operations should go first, while copy_user_highpage() should go last.

Since I'll rebase to Linus's patch, I'll move this into the critical section,
because the preallocated page can be used by any pte after that.  The spin
locks will need to be held longer for that, but I'm assuming that's not a
problem for an unlikely path.

> > @@ -859,6 +989,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> >  				break;
> >  		}
> > +
> > +		if (unlikely(data.cow_new_page)) {
> > +			/*
> > +			 * If cow_new_page set, we must be at the 2nd round of
> > +			 * a previous COPY_MM_BREAK_COW.  Try to arm the new
> > +			 * page now.  Note that in all cases page_break_cow()
> > +			 * will properly release the objects in copy_mm_data.
> > +			 */
> > +			WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
> > +			if (pte_install_copied_page(dst_mm, new, src_pte,
> > +						    dst_pte, addr, rss,
> > +						    &data)) {
> 
> It looks a little confusing, that all helpers in this function return 0 in case of success,
> while pte_install_copied_page() returns true. Won't be better to return 0 and -EAGAIN instead
> from it?

IMHO it's fine as long as no real errno will be popped out of the new helper.
But no strong opinion either, I'll see what I can do after the rebase.

Thanks for reviewing the patch even if it's going away.

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 1530bb1070f4..8f3521be80ca 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -691,12 +691,72 @@  struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 
 #define  COPY_MM_DONE               0
 #define  COPY_MM_SWAP_CONT          1
+#define  COPY_MM_BREAK_COW          2
 
 struct copy_mm_data {
 	/* COPY_MM_SWAP_CONT */
 	swp_entry_t entry;
+	/* COPY_MM_BREAK_COW */
+	struct {
+		struct page *cow_old_page; /* Released by page_duplicate() */
+		struct page *cow_new_page; /* Released by page_release_cow() */
+		pte_t cow_oldpte;
+	};
 };
 
+static inline void page_release_cow(struct copy_mm_data *data)
+{
+	/* The old page should only be released in page_duplicate() */
+	WARN_ON_ONCE(data->cow_old_page);
+
+	if (data->cow_new_page) {
+		put_page(data->cow_new_page);
+		data->cow_new_page = NULL;
+	}
+}
+
+/*
+ * Duplicate the page for this PTE.  Returns zero if page copied (so we need to
+ * retry on the same PTE again to arm the copied page very soon), or negative
+ * if error happened.  In all cases, the old page will be properly released.
+ */
+static int page_duplicate(struct mm_struct *src_mm, struct vm_area_struct *vma,
+			  unsigned long address, struct copy_mm_data *data)
+{
+	struct page *new_page = NULL;
+	int ret;
+
+	/* This should have been set in copy_one_pte() when we reach here */
+	WARN_ON_ONCE(!data->cow_old_page);
+
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+	if (!new_page) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	copy_user_highpage(new_page, data->cow_old_page, address, vma);
+	ret = mem_cgroup_charge(new_page, src_mm, GFP_KERNEL);
+	if (ret) {
+		put_page(new_page);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	cgroup_throttle_swaprate(new_page, GFP_KERNEL);
+	__SetPageUptodate(new_page);
+
+	/* So far so good; arm the new page for the next attempt */
+	data->cow_new_page = new_page;
+
+out:
+	/* Always release the old page */
+	put_page(data->cow_old_page);
+	data->cow_old_page = NULL;
+
+	return ret;
+}
+
 /*
  * copy one vm_area from one task to the other. Assumes the page tables
  * already present in the new task to be cleared in the whole range
@@ -711,6 +771,7 @@  copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	bool wp;
 
 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
@@ -789,10 +850,7 @@  copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * If it's a COW mapping, write protect it both
 	 * in the parent and the child
 	 */
-	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
-		pte = pte_wrprotect(pte);
-	}
+	wp = is_cow_mapping(vm_flags) && pte_write(pte);
 
 	/*
 	 * If it's a shared mapping, mark it clean in
@@ -813,15 +871,80 @@  copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
 		get_page(page);
+
+		/*
+		 * If the page is pinned in source mm, do early cow right now
+		 * so that the pinned page won't be replaced by another random
+		 * page without being noticed after the fork().
+		 *
+		 * Note: there can be some very rare cases that we'll do
+		 * unnecessary cow here, due to page_maybe_dma_pinned() is
+		 * sometimes bogus, and has_pinned flag is currently aggressive
+		 * too.  However this should be good enough for us for now as
+		 * long as we covered all the pinned pages.  We can make this
+		 * better in the future by providing an accurate accounting for
+		 * pinned pages.
+		 *
+		 * Because we'll need to release the locks before doing cow,
+		 * pass this work to upper layer.
+		 */
+		if (READ_ONCE(src_mm->has_pinned) && wp &&
+		    page_maybe_dma_pinned(page)) {
+			/* We've got the page already; we're safe */
+			data->cow_old_page = page;
+			data->cow_oldpte = *src_pte;
+			return COPY_MM_BREAK_COW;
+		}
+
 		page_dup_rmap(page, false);
 		rss[mm_counter(page)]++;
 	}
 
+	if (wp) {
+		ptep_set_wrprotect(src_mm, addr, src_pte);
+		pte = pte_wrprotect(pte);
+	}
+
 out_set_pte:
 	set_pte_at(dst_mm, addr, dst_pte, pte);
 	return COPY_MM_DONE;
 }
 
+/*
+ * Install the pte with the copied page stored in `data'.  Returns true when
+ * installation completes, or false when src pte has changed.
+ */
+static int pte_install_copied_page(struct mm_struct *dst_mm,
+				   struct vm_area_struct *new,
+				   pte_t *src_pte, pte_t *dst_pte,
+				   unsigned long addr, int *rss,
+				   struct copy_mm_data *data)
+{
+	struct page *new_page = data->cow_new_page;
+	pte_t entry;
+
+	if (!pte_same(*src_pte, data->cow_oldpte)) {
+		/* PTE has changed under us.  Release the page and retry */
+		page_release_cow(data);
+		return false;
+	}
+
+	entry = mk_pte(new_page, new->vm_page_prot);
+	entry = pte_sw_mkyoung(entry);
+	entry = maybe_mkwrite(pte_mkdirty(entry), new);
+	page_add_new_anon_rmap(new_page, new, addr, false);
+	set_pte_at(dst_mm, addr, dst_pte, entry);
+	rss[mm_counter(new_page)]++;
+
+	/*
+	 * Manually clear the new page pointer since we've moved ownership to
+	 * the newly armed PTE.
+	 */
+	data->cow_new_page = NULL;
+
+	return true;
+}
+
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
 		   struct vm_area_struct *new,
@@ -830,16 +953,23 @@  static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
 	spinlock_t *src_ptl, *dst_ptl;
-	int progress, copy_ret = COPY_MM_DONE;
+	int progress, ret, copy_ret = COPY_MM_DONE;
 	int rss[NR_MM_COUNTERS];
 	struct copy_mm_data data;
 
 again:
+	/* We don't reset this for COPY_MM_BREAK_COW */
+	memset(&data, 0, sizeof(data));
+
+again_break_cow:
 	init_rss_vec(rss);
 
 	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
-	if (!dst_pte)
+	if (!dst_pte) {
+		/* Guarantee that the new page is released if there is */
+		page_release_cow(&data);
 		return -ENOMEM;
+	}
 	src_pte = pte_offset_map(src_pmd, addr);
 	src_ptl = pte_lockptr(src_mm, src_pmd);
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
@@ -859,6 +989,25 @@  static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
 				break;
 		}
+
+		if (unlikely(data.cow_new_page)) {
+			/*
+			 * If cow_new_page set, we must be at the 2nd round of
+			 * a previous COPY_MM_BREAK_COW.  Try to arm the new
+			 * page now.  Note that in all cases page_break_cow()
+			 * will properly release the objects in copy_mm_data.
+			 */
+			WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
+			if (pte_install_copied_page(dst_mm, new, src_pte,
+						    dst_pte, addr, rss,
+						    &data)) {
+				/* We installed the pte successfully; move on */
+				progress++;
+				continue;
+			}
+			/* PTE changed.  Retry this pte (falls through) */
+		}
+
 		if (pte_none(*src_pte)) {
 			progress++;
 			continue;
@@ -882,8 +1031,19 @@  static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
 			return -ENOMEM;
 		break;
-	default:
+	case COPY_MM_BREAK_COW:
+		/* Do accounting onto parent mm directly */
+		ret = page_duplicate(src_mm, vma, addr, &data);
+		if (ret)
+			return ret;
+		goto again_break_cow;
+	case COPY_MM_DONE:
+		/* This means we're all good. */
 		break;
+	default:
+		/* This should mean copy_ret < 0.  Time to fail this fork().. */
+		WARN_ON_ONCE(copy_ret >= 0);
+		return copy_ret;
 	}
 
 	if (addr != end)