diff mbox series

[RFC,v2,9/9] mm: Introduce Copy-On-Write PTE table

Message ID 20220927162957.270460-10-shiyn.lin@gmail.com (mailing list archive)
State New
Headers show
Series Introduce Copy-On-Write to Page Table | expand

Commit Message

Chih-En Lin Sept. 27, 2022, 4:29 p.m. UTC
This patch adds the Copy-On-Write (COW) mechanism to the PTE table.
To enable the COW page table use the sysctl vm.cow_pte file with the
corresponding PID. It will set the MMF_COW_PTE_READY flag to the
process for enabling COW PTE during the next time of fork.

It uses the MMF_COW_PTE flag to distinguish the normal page table
and the COW one. Moreover, it is difficult to distinguish whether the
entire page table is out of COW state. So the MMF_COW_PTE flag won't be
disabled after its setup.

Since the memory space of the page table is distinctive for each process
in kernel space. It uses the address of the PMD index for the PTE table
ownership to identify which one of the processes needs to update the
page table state. In other words, only the owner will update shared
(COWed) PTE table state, like the RSS and pgtable_bytes.

Some PTE tables (e.g., pinned pages that reside in the table) still need
to be copied immediately for consistency with the current COW logic. As
a result, a flag, COW_PTE_OWNER_EXCLUSIVE, indicating whether a PTE
table is exclusive (i.e., only one task owns it at a time) is added to
the table’s owner pointer. Every time a PTE table is copied during the
fork, the owner pointer (and thus the exclusive flag) will be checked to
determine whether the PTE table can be shared across processes.

It uses a reference count to track the lifetime of COWed PTE table.
Doing the fork with COW PTE will increase the refcount. And, when
someone writes to the COWed PTE table, it will cause the write fault to
break COW PTE. If the COWed PTE table's refcount is one, the process
that triggers the fault will reuse the COWed PTE table. Otherwise, the
process will decrease the refcount, copy the information to a new PTE
table or dereference all the information and change the owner if they
have the COWed PTE table.

If doing the COW to the PTE table once as the time touching the PMD
entry, it cannot preserves the reference count of the COWed PTE table.
Since the address range of VMA may overlap the PTE table, the copying
function will use VMA to travel the page table for copying it. So it may
increase the reference count of the COWed PTE table multiple times in
one COW page table forking. Generically it will only increase once time
as the child reference it. To solve this problem, it needs to check the
destination of PMD entry does exist. And the reference count of the
source PTE table is more than one before doing the COW.

This patch modifies the part of the copy page table to do the basic COW.
For the break COW, it modifies the part of a page fault, zaps page table
, unmapping, and remapping.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 mm/memory.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++----
 mm/mmap.c   |  3 ++
 mm/mremap.c |  3 ++
 3 files changed, 87 insertions(+), 6 deletions(-)

Comments

Nadav Amit Sept. 27, 2022, 6:38 p.m. UTC | #1
On Sep 27, 2022, at 9:29 AM, Chih-En Lin <shiyn.lin@gmail.com> wrote:

> This patch adds the Copy-On-Write (COW) mechanism to the PTE table.
> To enable the COW page table use the sysctl vm.cow_pte file with the
> corresponding PID. It will set the MMF_COW_PTE_READY flag to the
> process for enabling COW PTE during the next time of fork.
> 
> It uses the MMF_COW_PTE flag to distinguish the normal page table
> and the COW one. Moreover, it is difficult to distinguish whether the
> entire page table is out of COW state. So the MMF_COW_PTE flag won't be
> disabled after its setup.
> 
> Since the memory space of the page table is distinctive for each process
> in kernel space. It uses the address of the PMD index for the PTE table
> ownership to identify which one of the processes needs to update the
> page table state. In other words, only the owner will update shared
> (COWed) PTE table state, like the RSS and pgtable_bytes.
> 
> Some PTE tables (e.g., pinned pages that reside in the table) still need
> to be copied immediately for consistency with the current COW logic. As
> a result, a flag, COW_PTE_OWNER_EXCLUSIVE, indicating whether a PTE
> table is exclusive (i.e., only one task owns it at a time) is added to
> the table’s owner pointer. Every time a PTE table is copied during the
> fork, the owner pointer (and thus the exclusive flag) will be checked to
> determine whether the PTE table can be shared across processes.
> 
> It uses a reference count to track the lifetime of COWed PTE table.
> Doing the fork with COW PTE will increase the refcount. And, when
> someone writes to the COWed PTE table, it will cause the write fault to
> break COW PTE. If the COWed PTE table's refcount is one, the process
> that triggers the fault will reuse the COWed PTE table. Otherwise, the
> process will decrease the refcount, copy the information to a new PTE
> table or dereference all the information and change the owner if they
> have the COWed PTE table.
> 
> If doing the COW to the PTE table once as the time touching the PMD
> entry, it cannot preserves the reference count of the COWed PTE table.
> Since the address range of VMA may overlap the PTE table, the copying
> function will use VMA to travel the page table for copying it. So it may
> increase the reference count of the COWed PTE table multiple times in
> one COW page table forking. Generically it will only increase once time
> as the child reference it. To solve this problem, it needs to check the
> destination of PMD entry does exist. And the reference count of the
> source PTE table is more than one before doing the COW.
> 
> This patch modifies the part of the copy page table to do the basic COW.
> For the break COW, it modifies the part of a page fault, zaps page table
> , unmapping, and remapping.

I only skimmed the patches that you sent. The last couple of patches seem a
bit rough and dirty, so I am sorry to say that I skipped them (too many
“TODO” and “XXX” for my taste).

I am sure other will have better feedback than me. I understand there is a
tradeoff and that this mechanism is mostly for high performance
snapshotting/forking. It would be beneficial to see whether this mechanism
can somehow be combined with existing ones (mshare?).

The code itself can be improved. I found the reasoning about synchronization
and TLB flushes and synchronizations to be lacking, and the code to seem
potentially incorrect. Better comments would help, even if the code is
correct.

There are additional general questions. For instance, when sharing a
page-table, do you properly update the refcount/mapcount of the mapped
pages? And are there any possible interactions with THP?

Thanks,
Nadav
Chih-En Lin Sept. 27, 2022, 7:53 p.m. UTC | #2
On Tue, Sep 27, 2022 at 06:38:05PM +0000, Nadav Amit wrote:
> I only skimmed the patches that you sent. The last couple of patches seem a
> bit rough and dirty, so I am sorry to say that I skipped them (too many
> “TODO” and “XXX” for my taste).
> 
> I am sure other will have better feedback than me. I understand there is a
> tradeoff and that this mechanism is mostly for high performance
> snapshotting/forking. It would be beneficial to see whether this mechanism
> can somehow be combined with existing ones (mshare?).

Still thanks for your feedback. :)
I'm looking at the PTE refcount and mshare patches. And, maybe it can
combine with them in the future.

> The code itself can be improved. I found the reasoning about synchronization
> and TLB flushes and synchronizations to be lacking, and the code to seem
> potentially incorrect. Better comments would help, even if the code is
> correct.
> 
> There are additional general questions. For instance, when sharing a
> page-table, do you properly update the refcount/mapcount of the mapped
> pages? And are there any possible interactions with THP?

Since access to those mapped pages will cost a lot of time, and this
will make fork() even have more overhead. It will not update the
refcount/mapcount of the mapped pages.

I'm not familiar with THP right now. But we have a plan for looking
at it to see what will happen with COW PTE.
Currently, I can only say that I prefer to avoid involving the behavior
of huge-page/THP. If there are any ideas here please tell us.

Thanks,
Chih-En Lin
John Hubbard Sept. 27, 2022, 9:26 p.m. UTC | #3
On 9/27/22 12:53, Chih-En Lin wrote:
> I'm not familiar with THP right now. But we have a plan for looking
> at it to see what will happen with COW PTE.
> Currently, I can only say that I prefer to avoid involving the behavior
> of huge-page/THP. If there are any ideas here please tell us.
> 
In order to be considered at all, this would have to at least behave 
correctly in the presence of THP and hugetlbfs, IMHO. Those are no
longer niche features.


thanks,
Chih-En Lin Sept. 28, 2022, 8:52 a.m. UTC | #4
On Tue, Sep 27, 2022 at 02:26:19PM -0700, John Hubbard wrote:
> On 9/27/22 12:53, Chih-En Lin wrote:
> > I'm not familiar with THP right now. But we have a plan for looking
> > at it to see what will happen with COW PTE.
> > Currently, I can only say that I prefer to avoid involving the behavior
> > of huge-page/THP. If there are any ideas here please tell us.
> > 
> In order to be considered at all, this would have to at least behave 
> correctly in the presence of THP and hugetlbfs, IMHO. Those are no
> longer niche features.
> 

To make sure it will work well with them. During fork() and page fault,
we put the mechanism after the hug-page/THP to make it doesn't mess up.
It may have corner cases that I didn't handle. I will keep looking at
it.

Thanks,
Chih-En lin
David Hildenbrand Sept. 28, 2022, 2:03 p.m. UTC | #5
On 27.09.22 21:53, Chih-En Lin wrote:
> On Tue, Sep 27, 2022 at 06:38:05PM +0000, Nadav Amit wrote:
>> I only skimmed the patches that you sent. The last couple of patches seem a
>> bit rough and dirty, so I am sorry to say that I skipped them (too many
>> “TODO” and “XXX” for my taste).
>>
>> I am sure other will have better feedback than me. I understand there is a
>> tradeoff and that this mechanism is mostly for high performance
>> snapshotting/forking. It would be beneficial to see whether this mechanism
>> can somehow be combined with existing ones (mshare?).
> 
> Still thanks for your feedback. :)
> I'm looking at the PTE refcount and mshare patches. And, maybe it can
> combine with them in the future.
> 
>> The code itself can be improved. I found the reasoning about synchronization
>> and TLB flushes and synchronizations to be lacking, and the code to seem
>> potentially incorrect. Better comments would help, even if the code is
>> correct.
>>
>> There are additional general questions. For instance, when sharing a
>> page-table, do you properly update the refcount/mapcount of the mapped
>> pages? And are there any possible interactions with THP?
> 
> Since access to those mapped pages will cost a lot of time, and this
> will make fork() even have more overhead. It will not update the
> refcount/mapcount of the mapped pages.

Oh no.

So we'd have pages logically mapped into two processes (two page table 
structures), but the refcount/mapcount/PageAnonExclusive would not 
reflect that?

Honestly, I don't think it is upstream material in that hacky form. No, 
we don't need more COW CVEs or more COW over-complications that 
destabilize the whole system.

IMHO, a relaxed form that focuses on only the memory consumption 
reduction could *possibly* be accepted upstream if it's not too invasive 
or complex. During fork(), we'd do exactly what we used to do to PTEs 
(increment mapcount, refcount, trying to clear PageAnonExclusive, map 
the page R/O, duplicate swap entries; all while holding the page table 
lock), however, sharing the prepared page table with the child process 
using COW after we prepared it.

Any (most once we want to *optimize* rmap handling) modification 
attempts require breaking COW -- copying the page table for the faulting 
process. But at that point, the PTEs are already write-protected and 
properly accounted (refcount/mapcount/PageAnonExclusive).

Doing it that way might not require any questionable GUP hacks and 
swapping, MMU notifiers etc. "might just work as expected" because the 
accounting remains unchanged" -- we simply de-duplicate the page table 
itself we'd have after fork and any modification attempts simply replace 
the mapped copy.

But devil is in the detail (page table lock, TLB flushing).

"will make fork() even have more overhead" is not a good excuse for such 
complexity/hacks -- sure, it will make your benchmark results look 
better in comparison ;)
Chih-En Lin Sept. 29, 2022, 1:38 p.m. UTC | #6
Sorry for replying late.

On Wed, Sep 28, 2022 at 04:03:19PM +0200, David Hildenbrand wrote:
> On 27.09.22 21:53, Chih-En Lin wrote:
> > On Tue, Sep 27, 2022 at 06:38:05PM +0000, Nadav Amit wrote:
> > > I only skimmed the patches that you sent. The last couple of patches seem a
> > > bit rough and dirty, so I am sorry to say that I skipped them (too many
> > > “TODO” and “XXX” for my taste).
> > > 
> > > I am sure other will have better feedback than me. I understand there is a
> > > tradeoff and that this mechanism is mostly for high performance
> > > snapshotting/forking. It would be beneficial to see whether this mechanism
> > > can somehow be combined with existing ones (mshare?).
> > 
> > Still thanks for your feedback. :)
> > I'm looking at the PTE refcount and mshare patches. And, maybe it can
> > combine with them in the future.
> > 
> > > The code itself can be improved. I found the reasoning about synchronization
> > > and TLB flushes and synchronizations to be lacking, and the code to seem
> > > potentially incorrect. Better comments would help, even if the code is
> > > correct.
> > > 
> > > There are additional general questions. For instance, when sharing a
> > > page-table, do you properly update the refcount/mapcount of the mapped
> > > pages? And are there any possible interactions with THP?
> > 
> > Since access to those mapped pages will cost a lot of time, and this
> > will make fork() even have more overhead. It will not update the
> > refcount/mapcount of the mapped pages.
> 
> Oh no.
> 
> So we'd have pages logically mapped into two processes (two page table
> structures), but the refcount/mapcount/PageAnonExclusive would not reflect
> that?
> 
> Honestly, I don't think it is upstream material in that hacky form. No, we
> don't need more COW CVEs or more COW over-complications that destabilize the
> whole system.
>

I know setting the write protection is not enough to prove the security
safe since the previous COW CVEs are related to it. And, if skipping the
accounting to reduce the overhead of fork() is not suitable for upstream
, we can change it. But, I think COW to the table can still be an
upstream material.

Recently the patches, like refcount for the empty user PTE page table
pages and mshare for the pages shared between the processes require more
PTE entries, showing that the modern system uses a lot of memory for the
page table (especially the PTE table). So, I think the method, COW to
the table, might reduce the memory usage for the side of the multiple
users PTE page table.

> IMHO, a relaxed form that focuses on only the memory consumption reduction
> could *possibly* be accepted upstream if it's not too invasive or complex.
> During fork(), we'd do exactly what we used to do to PTEs (increment
> mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O,
> duplicate swap entries; all while holding the page table lock), however,
> sharing the prepared page table with the child process using COW after we
> prepared it.
> 
> Any (most once we want to *optimize* rmap handling) modification attempts
> require breaking COW -- copying the page table for the faulting process. But
> at that point, the PTEs are already write-protected and properly accounted
> (refcount/mapcount/PageAnonExclusive).
> 
> Doing it that way might not require any questionable GUP hacks and swapping,
> MMU notifiers etc. "might just work as expected" because the accounting
> remains unchanged" -- we simply de-duplicate the page table itself we'd have
> after fork and any modification attempts simply replace the mapped copy.

Agree.
However for GUP hacks, if we want to do the COW to page table, we still
need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to
check whether the PTE table is available or not before we do the COW to
the table). Otherwise, it will be more complicated since it might need
to handle situations like while preparing the COW work, it just figuring
out that it needs to duplicate the whole table and roll back (recover
the state and copy it to new table). Hopefully, I'm not wrong here.

> But devil is in the detail (page table lock, TLB flushing).

Sure, it might be an overhead in the page fault and needs to be handled
carefully. ;)

> "will make fork() even have more overhead" is not a good excuse for such
> complexity/hacks -- sure, it will make your benchmark results look better in
> comparison ;)

;);)
I think that, even if we do the accounting with the COW page table, it
still has a little bit improve.

> -- 
> Thanks,
> 
> David / dhildenb
>

Thanks,
Chih-En Lin
Chih-En Lin Sept. 29, 2022, 1:49 p.m. UTC | #7
On Thu, Sep 29, 2022 at 09:38:53PM +0800, Chih-En Lin wrote:
> Sorry for replying late.
> 
> On Wed, Sep 28, 2022 at 04:03:19PM +0200, David Hildenbrand wrote:
> > On 27.09.22 21:53, Chih-En Lin wrote:
> > > On Tue, Sep 27, 2022 at 06:38:05PM +0000, Nadav Amit wrote:
> > > > I only skimmed the patches that you sent. The last couple of patches seem a
> > > > bit rough and dirty, so I am sorry to say that I skipped them (too many
> > > > “TODO” and “XXX” for my taste).
> > > > 
> > > > I am sure other will have better feedback than me. I understand there is a
> > > > tradeoff and that this mechanism is mostly for high performance
> > > > snapshotting/forking. It would be beneficial to see whether this mechanism
> > > > can somehow be combined with existing ones (mshare?).
> > > 
> > > Still thanks for your feedback. :)
> > > I'm looking at the PTE refcount and mshare patches. And, maybe it can
> > > combine with them in the future.
> > > 
> > > > The code itself can be improved. I found the reasoning about synchronization
> > > > and TLB flushes and synchronizations to be lacking, and the code to seem
> > > > potentially incorrect. Better comments would help, even if the code is
> > > > correct.
> > > > 
> > > > There are additional general questions. For instance, when sharing a
> > > > page-table, do you properly update the refcount/mapcount of the mapped
> > > > pages? And are there any possible interactions with THP?
> > > 
> > > Since access to those mapped pages will cost a lot of time, and this
> > > will make fork() even have more overhead. It will not update the
> > > refcount/mapcount of the mapped pages.
> > 
> > Oh no.
> > 
> > So we'd have pages logically mapped into two processes (two page table
> > structures), but the refcount/mapcount/PageAnonExclusive would not reflect
> > that?
> > 
> > Honestly, I don't think it is upstream material in that hacky form. No, we
> > don't need more COW CVEs or more COW over-complications that destabilize the
> > whole system.
> >
> 
> I know setting the write protection is not enough to prove the security
> safe since the previous COW CVEs are related to it. And, if skipping the
> accounting to reduce the overhead of fork() is not suitable for upstream
> , we can change it. But, I think COW to the table can still be an
> upstream material.
> 
> Recently the patches, like refcount for the empty user PTE page table
> pages and mshare for the pages shared between the processes require more
> PTE entries, showing that the modern system uses a lot of memory for the
> page table (especially the PTE table). So, I think the method, COW to
> the table, might reduce the memory usage for the side of the multiple
> users PTE page table.

Sorry, I think I need to explain more about "the multiple users PTE page
table". It means that it has more than one user holding the page table
and the mapped page that still has the same context. So, we can use COW
to reduce the memory usage at first.

>
> > IMHO, a relaxed form that focuses on only the memory consumption reduction
> > could *possibly* be accepted upstream if it's not too invasive or complex.
> > During fork(), we'd do exactly what we used to do to PTEs (increment
> > mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O,
> > duplicate swap entries; all while holding the page table lock), however,
> > sharing the prepared page table with the child process using COW after we
> > prepared it.
> > 
> > Any (most once we want to *optimize* rmap handling) modification attempts
> > require breaking COW -- copying the page table for the faulting process. But
> > at that point, the PTEs are already write-protected and properly accounted
> > (refcount/mapcount/PageAnonExclusive).
> > 
> > Doing it that way might not require any questionable GUP hacks and swapping,
> > MMU notifiers etc. "might just work as expected" because the accounting
> > remains unchanged" -- we simply de-duplicate the page table itself we'd have
> > after fork and any modification attempts simply replace the mapped copy.
> 
> Agree.
> However for GUP hacks, if we want to do the COW to page table, we still
> need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to
> check whether the PTE table is available or not before we do the COW to
> the table). Otherwise, it will be more complicated since it might need
> to handle situations like while preparing the COW work, it just figuring
> out that it needs to duplicate the whole table and roll back (recover
> the state and copy it to new table). Hopefully, I'm not wrong here.
> 
> > But devil is in the detail (page table lock, TLB flushing).
> 
> Sure, it might be an overhead in the page fault and needs to be handled
> carefully. ;)
> 
> > "will make fork() even have more overhead" is not a good excuse for such
> > complexity/hacks -- sure, it will make your benchmark results look better in
> > comparison ;)
> 
> ;);)
> I think that, even if we do the accounting with the COW page table, it
> still has a little bit improve.
> 
> > -- 
> > Thanks,
> > 
> > David / dhildenb
> >
> 
> Thanks,
> Chih-En Lin

Thanks,
Chih-En Lin
David Hildenbrand Sept. 29, 2022, 5:24 p.m. UTC | #8
>> IMHO, a relaxed form that focuses on only the memory consumption reduction
>> could *possibly* be accepted upstream if it's not too invasive or complex.
>> During fork(), we'd do exactly what we used to do to PTEs (increment
>> mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O,
>> duplicate swap entries; all while holding the page table lock), however,
>> sharing the prepared page table with the child process using COW after we
>> prepared it.
>>
>> Any (most once we want to *optimize* rmap handling) modification attempts
>> require breaking COW -- copying the page table for the faulting process. But
>> at that point, the PTEs are already write-protected and properly accounted
>> (refcount/mapcount/PageAnonExclusive).
>>
>> Doing it that way might not require any questionable GUP hacks and swapping,
>> MMU notifiers etc. "might just work as expected" because the accounting
>> remains unchanged" -- we simply de-duplicate the page table itself we'd have
>> after fork and any modification attempts simply replace the mapped copy.
> 
> Agree.
> However for GUP hacks, if we want to do the COW to page table, we still
> need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to
> check whether the PTE table is available or not before we do the COW to
> the table). Otherwise, it will be more complicated since it might need
> to handle situations like while preparing the COW work, it just figuring
> out that it needs to duplicate the whole table and roll back (recover
> the state and copy it to new table). Hopefully, I'm not wrong here.

The nice thing is that GUP itself *usually* doesn't modify page tables. 
One corner case is follow_pfn_pte(). All other modifications should 
happen in the actual fault handler that has to deal with such kind of 
unsharing either way when modifying the PTE.

If the pages are already in a COW-ed pagetable in the desired "shared" 
state (e.g., PageAnonExclusive cleared on an anonymous page), R/O 
pinning of such pages will just work as expected and we shouldn't be 
surprised by another set of GUP+COW CVEs.

We'd really only deduplicate the page table and not play other tricks 
with the actual page table content that differ from the existing way of 
handling fork().

I don't immediately see why we need COW_PTE_OWN_EXCLUSIVE in GUP code 
when not modifying the page table. I think we only need "we have to 
unshare this page table now" in follow_pfn_pte() and inside the fault 
handling when GUP triggers a fault.

I hope my assumption is correct, or am I missing something?

> 
>> But devil is in the detail (page table lock, TLB flushing).
> 
> Sure, it might be an overhead in the page fault and needs to be handled
> carefully. ;)
> 
>> "will make fork() even have more overhead" is not a good excuse for such
>> complexity/hacks -- sure, it will make your benchmark results look better in
>> comparison ;)
> 
> ;);)
> I think that, even if we do the accounting with the COW page table, it
> still has a little bit improve.

:)

My gut feeling is that this is true. While we have to do a pass over the 
parent page table during fork and wrprotect all PTEs etc., we don't have 
to duplicate the page table content and allocate/free memory for that.

One interesting case is when we cannot share an anon page with the child 
process because it maybe pinned -- and we have to copy it via 
copy_present_page(). In that case, the page table between the parent and 
the child would differ and we'd not be able to share the page table.

That case could be caught in copy_pte_range(): in case we'd have to 
allocate a page via page_copy_prealloc(), we'd have to fall back to the 
ordinary "separate page table for the child" way of doing things.

But that looks doable to me.
Chih-En Lin Sept. 29, 2022, 6:29 p.m. UTC | #9
On Thu, Sep 29, 2022 at 07:24:31PM +0200, David Hildenbrand wrote:
> > > IMHO, a relaxed form that focuses on only the memory consumption reduction
> > > could *possibly* be accepted upstream if it's not too invasive or complex.
> > > During fork(), we'd do exactly what we used to do to PTEs (increment
> > > mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O,
> > > duplicate swap entries; all while holding the page table lock), however,
> > > sharing the prepared page table with the child process using COW after we
> > > prepared it.
> > > 
> > > Any (most once we want to *optimize* rmap handling) modification attempts
> > > require breaking COW -- copying the page table for the faulting process. But
> > > at that point, the PTEs are already write-protected and properly accounted
> > > (refcount/mapcount/PageAnonExclusive).
> > > 
> > > Doing it that way might not require any questionable GUP hacks and swapping,
> > > MMU notifiers etc. "might just work as expected" because the accounting
> > > remains unchanged" -- we simply de-duplicate the page table itself we'd have
> > > after fork and any modification attempts simply replace the mapped copy.
> > 
> > Agree.
> > However for GUP hacks, if we want to do the COW to page table, we still
> > need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to
> > check whether the PTE table is available or not before we do the COW to
> > the table). Otherwise, it will be more complicated since it might need
> > to handle situations like while preparing the COW work, it just figuring
> > out that it needs to duplicate the whole table and roll back (recover
> > the state and copy it to new table). Hopefully, I'm not wrong here.
> 
> The nice thing is that GUP itself *usually* doesn't modify page tables. One
> corner case is follow_pfn_pte(). All other modifications should happen in
> the actual fault handler that has to deal with such kind of unsharing either
> way when modifying the PTE.
> 
> If the pages are already in a COW-ed pagetable in the desired "shared" state
> (e.g., PageAnonExclusive cleared on an anonymous page), R/O pinning of such
> pages will just work as expected and we shouldn't be surprised by another
> set of GUP+COW CVEs.
> 
> We'd really only deduplicate the page table and not play other tricks with
> the actual page table content that differ from the existing way of handling
> fork().
> 
> I don't immediately see why we need COW_PTE_OWN_EXCLUSIVE in GUP code when
> not modifying the page table. I think we only need "we have to unshare this
> page table now" in follow_pfn_pte() and inside the fault handling when GUP
> triggers a fault.
> 
> I hope my assumption is correct, or am I missing something?
> 

My consideration is when we pinned the page and did the COW to make the
page table be shared. It might not allow mapping the pinned page to R/O)
into both processes.

So, if the fork is working on the shared state, it needs to recover the
table and copy to a new one since that pinned page will need to copy
immediately. We can hold the shared state after occurring such a
situation. So we still need some trick to let the fork() know which page
table already has the pinned page (or such page won't let us share)
before going to duplicate.

Am I wrong here?

After that, since we handled the accounting in fork(), we don't need
ownership (pmd_t pointer) anymore. We have to find another way to mark
the table to be exclusive. (Right now, COW_PTE_OWNER_EXCLUSIVE flag is
stored at that space.)

> > 
> > > But devil is in the detail (page table lock, TLB flushing).
> > 
> > Sure, it might be an overhead in the page fault and needs to be handled
> > carefully. ;)
> > 
> > > "will make fork() even have more overhead" is not a good excuse for such
> > > complexity/hacks -- sure, it will make your benchmark results look better in
> > > comparison ;)
> > 
> > ;);)
> > I think that, even if we do the accounting with the COW page table, it
> > still has a little bit improve.
> 
> :)
> 
> My gut feeling is that this is true. While we have to do a pass over the
> parent page table during fork and wrprotect all PTEs etc., we don't have to
> duplicate the page table content and allocate/free memory for that.
> 
> One interesting case is when we cannot share an anon page with the child
> process because it maybe pinned -- and we have to copy it via
> copy_present_page(). In that case, the page table between the parent and the
> child would differ and we'd not be able to share the page table.

That is what I want to say above.
The case might happen in the middle of the shared page table progress.
It might cost more overhead to recover it. Therefore, if GUP wants to
pin the mapped page we can mark the PTE table first, so fork() won't
waste time doing the work for sharing.

> That case could be caught in copy_pte_range(): in case we'd have to allocate
> a page via page_copy_prealloc(), we'd have to fall back to the ordinary
> "separate page table for the child" way of doing things.
> 
> But that looks doable to me.

Sounds good. :)

> -- 
> Thanks,
> 
> David / dhildenb
> 

Thanks,
Chih-En Lin
David Hildenbrand Sept. 29, 2022, 6:38 p.m. UTC | #10
On 29.09.22 20:29, Chih-En Lin wrote:
> On Thu, Sep 29, 2022 at 07:24:31PM +0200, David Hildenbrand wrote:
>>>> IMHO, a relaxed form that focuses on only the memory consumption reduction
>>>> could *possibly* be accepted upstream if it's not too invasive or complex.
>>>> During fork(), we'd do exactly what we used to do to PTEs (increment
>>>> mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O,
>>>> duplicate swap entries; all while holding the page table lock), however,
>>>> sharing the prepared page table with the child process using COW after we
>>>> prepared it.
>>>>
>>>> Any (most once we want to *optimize* rmap handling) modification attempts
>>>> require breaking COW -- copying the page table for the faulting process. But
>>>> at that point, the PTEs are already write-protected and properly accounted
>>>> (refcount/mapcount/PageAnonExclusive).
>>>>
>>>> Doing it that way might not require any questionable GUP hacks and swapping,
>>>> MMU notifiers etc. "might just work as expected" because the accounting
>>>> remains unchanged" -- we simply de-duplicate the page table itself we'd have
>>>> after fork and any modification attempts simply replace the mapped copy.
>>>
>>> Agree.
>>> However for GUP hacks, if we want to do the COW to page table, we still
>>> need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to
>>> check whether the PTE table is available or not before we do the COW to
>>> the table). Otherwise, it will be more complicated since it might need
>>> to handle situations like while preparing the COW work, it just figuring
>>> out that it needs to duplicate the whole table and roll back (recover
>>> the state and copy it to new table). Hopefully, I'm not wrong here.
>>
>> The nice thing is that GUP itself *usually* doesn't modify page tables. One
>> corner case is follow_pfn_pte(). All other modifications should happen in
>> the actual fault handler that has to deal with such kind of unsharing either
>> way when modifying the PTE.
>>
>> If the pages are already in a COW-ed pagetable in the desired "shared" state
>> (e.g., PageAnonExclusive cleared on an anonymous page), R/O pinning of such
>> pages will just work as expected and we shouldn't be surprised by another
>> set of GUP+COW CVEs.
>>
>> We'd really only deduplicate the page table and not play other tricks with
>> the actual page table content that differ from the existing way of handling
>> fork().
>>
>> I don't immediately see why we need COW_PTE_OWN_EXCLUSIVE in GUP code when
>> not modifying the page table. I think we only need "we have to unshare this
>> page table now" in follow_pfn_pte() and inside the fault handling when GUP
>> triggers a fault.
>>
>> I hope my assumption is correct, or am I missing something?
>>
> 
> My consideration is when we pinned the page and did the COW to make the
> page table be shared. It might not allow mapping the pinned page to R/O)
> into both processes.
> 
> So, if the fork is working on the shared state, it needs to recover the
> table and copy to a new one since that pinned page will need to copy
> immediately. We can hold the shared state after occurring such a
> situation. So we still need some trick to let the fork() know which page
> table already has the pinned page (or such page won't let us share)
> before going to duplicate.
> 
> Am I wrong here?

I think you might be overthinking this. Let's keep it simple:

1) Handle pinned anon pages just as I described below, falling back to 
the "slow" path of page table copying.

2) Once we passed that stage, you can be sure that the COW-ed page table 
cannot have actually pinned anon pages. All anon pages in such a page 
table have PageAnonExclusive cleared and are "maybe shared". GUP cannot 
succeed in pinning these pages anymore, because it will only pin 
exclusive anon pages!

3) If anybody wants to take a R/O pin on a shared anon page that is 
mapped into a COW-ed page table, we trigger a fault with 
FAULT_FLAG_UNSHARE instead of pinning the page. This has to break COW on 
the page table and properly map an exclusive anon page into it, breaking 
COW.

Do you see a problem with that?

> 
> After that, since we handled the accounting in fork(), we don't need
> ownership (pmd_t pointer) anymore. We have to find another way to mark
> the table to be exclusive. (Right now, COW_PTE_OWNER_EXCLUSIVE flag is
> stored at that space.)
> 
>>>
>>>> But devil is in the detail (page table lock, TLB flushing).
>>>
>>> Sure, it might be an overhead in the page fault and needs to be handled
>>> carefully. ;)
>>>
>>>> "will make fork() even have more overhead" is not a good excuse for such
>>>> complexity/hacks -- sure, it will make your benchmark results look better in
>>>> comparison ;)
>>>
>>> ;);)
>>> I think that, even if we do the accounting with the COW page table, it
>>> still has a little bit improve.
>>
>> :)
>>
>> My gut feeling is that this is true. While we have to do a pass over the
>> parent page table during fork and wrprotect all PTEs etc., we don't have to
>> duplicate the page table content and allocate/free memory for that.
>>
>> One interesting case is when we cannot share an anon page with the child
>> process because it maybe pinned -- and we have to copy it via
>> copy_present_page(). In that case, the page table between the parent and the
>> child would differ and we'd not be able to share the page table.
> 
> That is what I want to say above.
> The case might happen in the middle of the shared page table progress.
> It might cost more overhead to recover it. Therefore, if GUP wants to
> pin the mapped page we can mark the PTE table first, so fork() won't
> waste time doing the work for sharing.

Having pinned pages is a corner case for most apps. No need to worry 
about optimizing this corner case for now.

I see what you are trying to optimize, but I don't think this is needed 
in a first version, and probably never is needed.


Any attempts to mark page tables in a certain way from GUP 
(COW_PTE_OWNER_EXCLUSIVE) is problematic either way: GUP-fast 
(get_user_pages_fast) can race with pretty much anything, even with 
concurrent fork. I suspect your current code might be really racy in 
that regard.
Nadav Amit Sept. 29, 2022, 6:40 p.m. UTC | #11
On Sep 29, 2022, at 11:29 AM, Chih-En Lin <shiyn.lin@gmail.com> wrote:

> That case could be caught in copy_pte_range(): in case we'd have to allocate
>> a page via page_copy_prealloc(), we'd have to fall back to the ordinary
>> "separate page table for the child" way of doing things.
>> 
>> But that looks doable to me.
> 
> Sounds good. :)

Chih-En, I admit I did not fully read the entire correspondence and got deep
into all the details.

I would note, however, that there are several additional components that I
did not see (and perhaps missed) in your patches. Basically, there are many
page-table manipulations that are done not through the page-fault handler or
reclamation mechanisms. I did not see any of them being addressed.

So if/when you send a new version, please have a look at mprotect(),
madvise(), soft-dirty, userfaultfd and THP. In these cases, I presume, you
would have to COW-break (aka COW-unshare) the page-tables.
Chih-En Lin Sept. 29, 2022, 6:57 p.m. UTC | #12
On Thu, Sep 29, 2022 at 08:38:52PM +0200, David Hildenbrand wrote:
> On 29.09.22 20:29, Chih-En Lin wrote:
> > On Thu, Sep 29, 2022 at 07:24:31PM +0200, David Hildenbrand wrote:
> > > > > IMHO, a relaxed form that focuses on only the memory consumption reduction
> > > > > could *possibly* be accepted upstream if it's not too invasive or complex.
> > > > > During fork(), we'd do exactly what we used to do to PTEs (increment
> > > > > mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O,
> > > > > duplicate swap entries; all while holding the page table lock), however,
> > > > > sharing the prepared page table with the child process using COW after we
> > > > > prepared it.
> > > > > 
> > > > > Any (most once we want to *optimize* rmap handling) modification attempts
> > > > > require breaking COW -- copying the page table for the faulting process. But
> > > > > at that point, the PTEs are already write-protected and properly accounted
> > > > > (refcount/mapcount/PageAnonExclusive).
> > > > > 
> > > > > Doing it that way might not require any questionable GUP hacks and swapping,
> > > > > MMU notifiers etc. "might just work as expected" because the accounting
> > > > > remains unchanged" -- we simply de-duplicate the page table itself we'd have
> > > > > after fork and any modification attempts simply replace the mapped copy.
> > > > 
> > > > Agree.
> > > > However for GUP hacks, if we want to do the COW to page table, we still
> > > > need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to
> > > > check whether the PTE table is available or not before we do the COW to
> > > > the table). Otherwise, it will be more complicated since it might need
> > > > to handle situations like while preparing the COW work, it just figuring
> > > > out that it needs to duplicate the whole table and roll back (recover
> > > > the state and copy it to new table). Hopefully, I'm not wrong here.
> > > 
> > > The nice thing is that GUP itself *usually* doesn't modify page tables. One
> > > corner case is follow_pfn_pte(). All other modifications should happen in
> > > the actual fault handler that has to deal with such kind of unsharing either
> > > way when modifying the PTE.
> > > 
> > > If the pages are already in a COW-ed pagetable in the desired "shared" state
> > > (e.g., PageAnonExclusive cleared on an anonymous page), R/O pinning of such
> > > pages will just work as expected and we shouldn't be surprised by another
> > > set of GUP+COW CVEs.
> > > 
> > > We'd really only deduplicate the page table and not play other tricks with
> > > the actual page table content that differ from the existing way of handling
> > > fork().
> > > 
> > > I don't immediately see why we need COW_PTE_OWN_EXCLUSIVE in GUP code when
> > > not modifying the page table. I think we only need "we have to unshare this
> > > page table now" in follow_pfn_pte() and inside the fault handling when GUP
> > > triggers a fault.
> > > 
> > > I hope my assumption is correct, or am I missing something?
> > > 
> > 
> > My consideration is when we pinned the page and did the COW to make the
> > page table be shared. It might not allow mapping the pinned page to R/O)
> > into both processes.
> > 
> > So, if the fork is working on the shared state, it needs to recover the
> > table and copy to a new one since that pinned page will need to copy
> > immediately. We can hold the shared state after occurring such a
> > situation. So we still need some trick to let the fork() know which page
> > table already has the pinned page (or such page won't let us share)
> > before going to duplicate.
> > 
> > Am I wrong here?
> 
> I think you might be overthinking this. Let's keep it simple:
> 
> 1) Handle pinned anon pages just as I described below, falling back to the
> "slow" path of page table copying.
> 
> 2) Once we passed that stage, you can be sure that the COW-ed page table
> cannot have actually pinned anon pages. All anon pages in such a page table
> have PageAnonExclusive cleared and are "maybe shared". GUP cannot succeed in
> pinning these pages anymore, because it will only pin exclusive anon pages!
> 
> 3) If anybody wants to take a R/O pin on a shared anon page that is mapped
> into a COW-ed page table, we trigger a fault with FAULT_FLAG_UNSHARE instead
> of pinning the page. This has to break COW on the page table and properly
> map an exclusive anon page into it, breaking COW.
> 
> Do you see a problem with that?
> 
> > 
> > After that, since we handled the accounting in fork(), we don't need
> > ownership (pmd_t pointer) anymore. We have to find another way to mark
> > the table to be exclusive. (Right now, COW_PTE_OWNER_EXCLUSIVE flag is
> > stored at that space.)
> > 
> > > > 
> > > > > But devil is in the detail (page table lock, TLB flushing).
> > > > 
> > > > Sure, it might be an overhead in the page fault and needs to be handled
> > > > carefully. ;)
> > > > 
> > > > > "will make fork() even have more overhead" is not a good excuse for such
> > > > > complexity/hacks -- sure, it will make your benchmark results look better in
> > > > > comparison ;)
> > > > 
> > > > ;);)
> > > > I think that, even if we do the accounting with the COW page table, it
> > > > still has a little bit improve.
> > > 
> > > :)
> > > 
> > > My gut feeling is that this is true. While we have to do a pass over the
> > > parent page table during fork and wrprotect all PTEs etc., we don't have to
> > > duplicate the page table content and allocate/free memory for that.
> > > 
> > > One interesting case is when we cannot share an anon page with the child
> > > process because it maybe pinned -- and we have to copy it via
> > > copy_present_page(). In that case, the page table between the parent and the
> > > child would differ and we'd not be able to share the page table.
> > 
> > That is what I want to say above.
> > The case might happen in the middle of the shared page table progress.
> > It might cost more overhead to recover it. Therefore, if GUP wants to
> > pin the mapped page we can mark the PTE table first, so fork() won't
> > waste time doing the work for sharing.
> 
> Having pinned pages is a corner case for most apps. No need to worry about
> optimizing this corner case for now.
> 
> I see what you are trying to optimize, but I don't think this is needed in a
> first version, and probably never is needed.
> 
> 
> Any attempts to mark page tables in a certain way from GUP
> (COW_PTE_OWNER_EXCLUSIVE) is problematic either way: GUP-fast
> (get_user_pages_fast) can race with pretty much anything, even with
> concurrent fork. I suspect your current code might be really racy in that
> regard.

I see.
Now, I know why optimizing that corner case is not worth it.
Thank you for explaining that.

Thanks,
Chih-En Lin
David Hildenbrand Sept. 29, 2022, 7 p.m. UTC | #13
On 29.09.22 20:57, Chih-En Lin wrote:
> On Thu, Sep 29, 2022 at 08:38:52PM +0200, David Hildenbrand wrote:
>> On 29.09.22 20:29, Chih-En Lin wrote:
>>> On Thu, Sep 29, 2022 at 07:24:31PM +0200, David Hildenbrand wrote:
>>>>>> IMHO, a relaxed form that focuses on only the memory consumption reduction
>>>>>> could *possibly* be accepted upstream if it's not too invasive or complex.
>>>>>> During fork(), we'd do exactly what we used to do to PTEs (increment
>>>>>> mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O,
>>>>>> duplicate swap entries; all while holding the page table lock), however,
>>>>>> sharing the prepared page table with the child process using COW after we
>>>>>> prepared it.
>>>>>>
>>>>>> Any (most once we want to *optimize* rmap handling) modification attempts
>>>>>> require breaking COW -- copying the page table for the faulting process. But
>>>>>> at that point, the PTEs are already write-protected and properly accounted
>>>>>> (refcount/mapcount/PageAnonExclusive).
>>>>>>
>>>>>> Doing it that way might not require any questionable GUP hacks and swapping,
>>>>>> MMU notifiers etc. "might just work as expected" because the accounting
>>>>>> remains unchanged" -- we simply de-duplicate the page table itself we'd have
>>>>>> after fork and any modification attempts simply replace the mapped copy.
>>>>>
>>>>> Agree.
>>>>> However for GUP hacks, if we want to do the COW to page table, we still
>>>>> need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to
>>>>> check whether the PTE table is available or not before we do the COW to
>>>>> the table). Otherwise, it will be more complicated since it might need
>>>>> to handle situations like while preparing the COW work, it just figuring
>>>>> out that it needs to duplicate the whole table and roll back (recover
>>>>> the state and copy it to new table). Hopefully, I'm not wrong here.
>>>>
>>>> The nice thing is that GUP itself *usually* doesn't modify page tables. One
>>>> corner case is follow_pfn_pte(). All other modifications should happen in
>>>> the actual fault handler that has to deal with such kind of unsharing either
>>>> way when modifying the PTE.
>>>>
>>>> If the pages are already in a COW-ed pagetable in the desired "shared" state
>>>> (e.g., PageAnonExclusive cleared on an anonymous page), R/O pinning of such
>>>> pages will just work as expected and we shouldn't be surprised by another
>>>> set of GUP+COW CVEs.
>>>>
>>>> We'd really only deduplicate the page table and not play other tricks with
>>>> the actual page table content that differ from the existing way of handling
>>>> fork().
>>>>
>>>> I don't immediately see why we need COW_PTE_OWN_EXCLUSIVE in GUP code when
>>>> not modifying the page table. I think we only need "we have to unshare this
>>>> page table now" in follow_pfn_pte() and inside the fault handling when GUP
>>>> triggers a fault.
>>>>
>>>> I hope my assumption is correct, or am I missing something?
>>>>
>>>
>>> My consideration is when we pinned the page and did the COW to make the
>>> page table be shared. It might not allow mapping the pinned page to R/O)
>>> into both processes.
>>>
>>> So, if the fork is working on the shared state, it needs to recover the
>>> table and copy to a new one since that pinned page will need to copy
>>> immediately. We can hold the shared state after occurring such a
>>> situation. So we still need some trick to let the fork() know which page
>>> table already has the pinned page (or such page won't let us share)
>>> before going to duplicate.
>>>
>>> Am I wrong here?
>>
>> I think you might be overthinking this. Let's keep it simple:
>>
>> 1) Handle pinned anon pages just as I described below, falling back to the
>> "slow" path of page table copying.
>>
>> 2) Once we passed that stage, you can be sure that the COW-ed page table
>> cannot have actually pinned anon pages. All anon pages in such a page table
>> have PageAnonExclusive cleared and are "maybe shared". GUP cannot succeed in
>> pinning these pages anymore, because it will only pin exclusive anon pages!
>>
>> 3) If anybody wants to take a R/O pin on a shared anon page that is mapped
>> into a COW-ed page table, we trigger a fault with FAULT_FLAG_UNSHARE instead
>> of pinning the page. This has to break COW on the page table and properly
>> map an exclusive anon page into it, breaking COW.
>>
>> Do you see a problem with that?
>>
>>>
>>> After that, since we handled the accounting in fork(), we don't need
>>> ownership (pmd_t pointer) anymore. We have to find another way to mark
>>> the table to be exclusive. (Right now, COW_PTE_OWNER_EXCLUSIVE flag is
>>> stored at that space.)
>>>
>>>>>
>>>>>> But devil is in the detail (page table lock, TLB flushing).
>>>>>
>>>>> Sure, it might be an overhead in the page fault and needs to be handled
>>>>> carefully. ;)
>>>>>
>>>>>> "will make fork() even have more overhead" is not a good excuse for such
>>>>>> complexity/hacks -- sure, it will make your benchmark results look better in
>>>>>> comparison ;)
>>>>>
>>>>> ;);)
>>>>> I think that, even if we do the accounting with the COW page table, it
>>>>> still has a little bit improve.
>>>>
>>>> :)
>>>>
>>>> My gut feeling is that this is true. While we have to do a pass over the
>>>> parent page table during fork and wrprotect all PTEs etc., we don't have to
>>>> duplicate the page table content and allocate/free memory for that.
>>>>
>>>> One interesting case is when we cannot share an anon page with the child
>>>> process because it maybe pinned -- and we have to copy it via
>>>> copy_present_page(). In that case, the page table between the parent and the
>>>> child would differ and we'd not be able to share the page table.
>>>
>>> That is what I want to say above.
>>> The case might happen in the middle of the shared page table progress.
>>> It might cost more overhead to recover it. Therefore, if GUP wants to
>>> pin the mapped page we can mark the PTE table first, so fork() won't
>>> waste time doing the work for sharing.
>>
>> Having pinned pages is a corner case for most apps. No need to worry about
>> optimizing this corner case for now.
>>
>> I see what you are trying to optimize, but I don't think this is needed in a
>> first version, and probably never is needed.
>>
>>
>> Any attempts to mark page tables in a certain way from GUP
>> (COW_PTE_OWNER_EXCLUSIVE) is problematic either way: GUP-fast
>> (get_user_pages_fast) can race with pretty much anything, even with
>> concurrent fork. I suspect your current code might be really racy in that
>> regard.
> 
> I see.
> Now, I know why optimizing that corner case is not worth it.
> Thank you for explaining that.

Falling back after already processing some PTEs requires some care, 
though. I guess it's not too hard to get it right -- it might be harder 
to get it "clean". But we can talk about that detail later.
Chih-En Lin Sept. 29, 2022, 7:02 p.m. UTC | #14
On Thu, Sep 29, 2022 at 06:40:36PM +0000, Nadav Amit wrote:
> On Sep 29, 2022, at 11:29 AM, Chih-En Lin <shiyn.lin@gmail.com> wrote:
> 
> > That case could be caught in copy_pte_range(): in case we'd have to allocate
> >> a page via page_copy_prealloc(), we'd have to fall back to the ordinary
> >> "separate page table for the child" way of doing things.
> >> 
> >> But that looks doable to me.
> > 
> > Sounds good. :)
> 
> Chih-En, I admit I did not fully read the entire correspondence and got deep
> into all the details.
> 
> I would note, however, that there are several additional components that I
> did not see (and perhaps missed) in your patches. Basically, there are many
> page-table manipulations that are done not through the page-fault handler or
> reclamation mechanisms. I did not see any of them being addressed.
> 
> So if/when you send a new version, please have a look at mprotect(),
> madvise(), soft-dirty, userfaultfd and THP. In these cases, I presume, you
> would have to COW-break (aka COW-unshare) the page-tables.
> 

Sure. Before I send the new version I will try to handle all of them.
Thank you for the note.

Thanks,
Chih-En Lin
diff mbox series

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 4cf3f74fb183f..c532448b5e086 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -250,6 +250,9 @@  static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
+		VM_BUG_ON(cow_pte_count(pmd) != 1);
+		if (!pmd_cow_pte_exclusive(pmd))
+			VM_BUG_ON(!cow_pte_owner_is_same(pmd, NULL));
 		free_pte_range(tlb, pmd, addr);
 	} while (pmd++, addr = next, addr != end);
 
@@ -1006,7 +1009,12 @@  copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	 * in the parent and the child
 	 */
 	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
+		/*
+		 * If parent's PTE table is COWing, keep it as it is.
+		 * Don't set wrprotect to that table.
+		 */
+		if (!__is_pte_table_cowing(src_vma, NULL, addr))
+			ptep_set_wrprotect(src_mm, addr, src_pte);
 		pte = pte_wrprotect(pte);
 	}
 	VM_BUG_ON(page && PageAnon(page) && PageAnonExclusive(page));
@@ -1197,11 +1205,64 @@  copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 				continue;
 			/* fall through */
 		}
-		if (pmd_none_or_clear_bad(src_pmd))
-			continue;
-		if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
-				   addr, next))
-			return -ENOMEM;
+
+		if (is_cow_pte_available(src_vma, src_pmd)) {
+			/*
+			 * Setting wrprotect to pmd entry will trigger
+			 * pmd_bad() for normal PTE table. Skip the bad
+			 * checking here.
+			 */
+			if (pmd_none(*src_pmd))
+				continue;
+
+			/* Skip if the PTE already COW this time. */
+			if (!pmd_none(*dst_pmd) && !pmd_write(*dst_pmd))
+				continue;
+
+			/*
+			 * If PTE doesn't have an owner, the parent needs to
+			 * take this PTE.
+			 */
+			if (cow_pte_owner_is_same(src_pmd, NULL)) {
+				set_cow_pte_owner(src_pmd, src_pmd);
+				/*
+				 * XXX: The process may COW PTE fork two times.
+				 * But in some situations, owner has cleared.
+				 * Previously Child (This time is the parent)
+				 * COW PTE forking, but previously parent, the
+				 * owner , break COW. So it needs to add back
+				 * the RSS state and pgtable bytes.
+				 */
+				if (!pmd_write(*src_pmd)) {
+					cow_pte_rss(src_mm, src_vma, src_pmd,
+						    get_pmd_start_edge(src_vma,
+									addr),
+						    get_pmd_end_edge(src_vma,
+									addr),
+						    true /* inc */);
+					/* Do we need pt lock here? */
+					mm_inc_nr_ptes(src_mm);
+					/* See the comments in pmd_install(). */
+					smp_wmb();
+					pmd_populate(src_mm, src_pmd,
+						     pmd_page(*src_pmd));
+				}
+			}
+
+			pmdp_set_wrprotect(src_mm, addr, src_pmd);
+
+			/* Child reference count */
+			pmd_get_pte(src_pmd);
+
+			/* COW for PTE table */
+			set_pmd_at(dst_mm, addr, dst_pmd, *src_pmd);
+		} else {
+			if (pmd_none_or_clear_bad(src_pmd))
+				continue;
+			if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
+					   addr, next))
+				return -ENOMEM;
+		}
 	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
 	return 0;
 }
@@ -1594,6 +1655,10 @@  static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 			spin_unlock(ptl);
 		}
 
+		/* TODO: Does TLB needs to flush page info in COWed table? */
+		if (is_pte_table_cowing(vma, pmd))
+			handle_cow_pte(vma, pmd, addr, false);
+
 		/*
 		 * Here there can be other concurrent MADV_DONTNEED or
 		 * trans huge page faults running, and if the pmd is
@@ -5321,6 +5386,16 @@  static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 				return 0;
 			}
 		}
+
+		/*
+		 * When the PMD entry is set with write protection, it needs to
+		 * handle the on-demand PTE. It will allocate a new PTE and copy
+		 * the old one, then set this entry writeable and decrease the
+		 * reference count at COW PTE.
+		 */
+		if (handle_cow_pte(vmf.vma, vmf.pmd, vmf.real_address,
+				   cow_pte_count(&vmf.orig_pmd) > 1) < 0)
+			return VM_FAULT_OOM;
 	}
 
 	return handle_pte_fault(&vmf);
diff --git a/mm/mmap.c b/mm/mmap.c
index 9d780f415be3c..463359292f8a9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2685,6 +2685,9 @@  int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 			return err;
 	}
 
+	if (handle_cow_pte(vma, NULL, addr, true) < 0)
+		return -ENOMEM;
+
 	new = vm_area_dup(vma);
 	if (!new)
 		return -ENOMEM;
diff --git a/mm/mremap.c b/mm/mremap.c
index b522cd0259a0f..14f6ad250289c 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -532,6 +532,9 @@  unsigned long move_page_tables(struct vm_area_struct *vma,
 		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
 		if (!old_pmd)
 			continue;
+
+		handle_cow_pte(vma, old_pmd, old_addr, true);
+
 		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;