[RFC,0/5] hugetlb: Change huge pmd sharing

Message ID 20220406204823.46548-1-mike.kravetz@oracle.com

Message

Mike Kravetz April 6, 2022, 8:48 p.m. UTC
hugetlb fault scalability regressions have recently been reported [1].
This is not the first such report, as regressions were also noted when
commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7.  At that time, a proposal to
address the regression was suggested [3] but went nowhere.

To illustrate the regression, I created a simple program (sketched just
below the list) that does the following in an infinite loop:
- mmap a 4GB hugetlb file (size ensures pmd sharing)
- fault in all pages
- unmap the hugetlb file
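
A minimal sketch of such a program is below.  It assumes a hugetlbfs mount
at /dev/hugepages with enough 2MB huge pages reserved; the file name and
lack of error reporting are illustrative, not the exact instrumented
program used for the numbers that follow.

#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE   (4UL << 30)  /* 4GB: large enough for pmd sharing */
#define HPAGE_SIZE (2UL << 20)  /* 2MB huge page size */

int main(void)
{
        /* all instances open the same hugetlbfs file so pmds can be shared */
        int fd = open("/dev/hugepages/pmd_share_test", O_CREAT | O_RDWR, 0644);
        unsigned long off;
        char *addr;

        if (fd < 0 || ftruncate(fd, MAP_SIZE))
                exit(1);

        for (;;) {
                /* mmap the 4GB hugetlb file */
                addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
                if (addr == MAP_FAILED)
                        exit(1);

                /* fault in all pages: touch one byte per huge page */
                for (off = 0; off < MAP_SIZE; off += HPAGE_SIZE)
                        addr[off] = 1;

                /* unmap the hugetlb file */
                munmap(addr, MAP_SIZE);
        }
}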

The hugetlb fault code was then instrumented to collect number of times
the mutex was locked and wait time.  Samples are from 10 second
intervals on a 4 CPU VM with 8GB memory.  Eight instances of the
map/fault/unmap program are running.

v5.17
-----
[  708.763114] Wait_debug: faults sec  3622
[  708.764010]             num faults  36220
[  708.765016]             num waits   36220
[  708.766054]             intvl wait time 54074 msecs
[  708.767287]             max_wait_time   31000 usecs


v5.17 + this series (similar to v5.6)
-------------------------------------
[  282.191391] Wait_debug: faults sec  1777939
[  282.192571]             num faults  17779393
[  282.193746]             num locks   5517
[  282.194858]             intvl wait time 19907 msecs
[  282.196226]             max_wait_time   43000 usecs

As can be seen, fault time suffers when there are other operations
taking i_mmap_rwsem in write mode such as unmap.

This series proposes reverting c0d0381ade79 and 87bf91d39bb5, which
depends on c0d0381ade79.  This moves acquisition of i_mmap_rwsem in the
fault path back to huge_pmd_share, where it is only taken when necessary.
After reverting these patches, we still need to handle:
fault and truncate races
- Catch and properly back out faults beyond i_size
  Backing out reservations is much easier after commit 846be08578ed, which
  expanded restore_reserve_on_error functionality.
unshare and fault/lookup races
- Since the pointer returned from huge_pte_offset or huge_pte_alloc may
  become invalid before we take the page table lock, we must revalidate
  after taking the lock.  Code paths must back out and possibly retry if
  the page table pointer changes.  (A sketch of this check follows below.)
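
A rough sketch of that check, as it might look in mm/hugetlb.c (the helper
name is hypothetical; the actual patches open-code an equivalent test in
the affected fault/lookup paths):

/*
 * Sketch only: re-walk the page tables under the page table lock and
 * verify that the previously looked up pte pointer is still current.
 * A concurrent huge_pmd_unshare() may have freed the shared PMD page
 * that ptep points into, so only pointer values are compared here.
 */
static bool hugetlb_ptep_is_current(struct hstate *h, struct mm_struct *mm,
                                    unsigned long addr, pte_t *ptep)
{
        return ptep == huge_pte_offset(mm, addr, huge_page_size(h));
}

A caller would look up ptep, take the page table lock, perform this check,
and on a mismatch unlock and jump to its existing backout/retry path.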

The commit message in patch 5 suggests that it is not safe to use
SPLIT_PMD_PTLOCKS for hugetlb mappings if sharing is possible.  If
others confirm/agree then there will need to be additional work.

Please help with comments or suggestions.  I would like to come up with
something that is performant and safe.

[1] https://lore.kernel.org/linux-mm/43faf292-245b-5db5-cce9-369d8fb6bd21@infradead.org/
[2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
[3] https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/

Mike Kravetz (5):
  hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
  hugetlbfs: revert use i_mmap_rwsem for more pmd sharing
    synchronization
  hugetlbfs: move routine remove_huge_page to hugetlb.c
  hugetlbfs: catch and handle truncate racing with page faults
  hugetlb: Check for pmd unshare and fault/lookup races

 fs/hugetlbfs/inode.c    |  84 ++++++++------------
 include/linux/hugetlb.h |   3 +-
 mm/hugetlb.c            | 169 +++++++++++++++++++---------------------
 mm/rmap.c               |  14 +---
 mm/userfaultfd.c        |  11 +--
 5 files changed, 118 insertions(+), 163 deletions(-)

Comments

David Hildenbrand April 7, 2022, 10:08 a.m. UTC | #1
On 06.04.22 22:48, Mike Kravetz wrote:
> hugetlb fault scalability regressions have recently been reported [1].
> This is not the first such report, as regressions were also noted when
> commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
> synchronization") was added [2] in v5.7.  At that time, a proposal to
> address the regression was suggested [3] but went nowhere.
> 
> To illustrate the regression, I created a simple program that does the
> following in an infinite loop:
> - mmap a 4GB hugetlb file (size ensures pmd sharing)
> - fault in all pages
> - unmap the hugetlb file
> 
> The hugetlb fault code was then instrumented to collect number of times
> the mutex was locked and wait time.  Samples are from 10 second
> intervals on a 4 CPU VM with 8GB memory.  Eight instances of the
> map/fault/unmap program are running.
> 
> v5.17
> -----
> [  708.763114] Wait_debug: faults sec  3622
> [  708.764010]             num faults  36220
> [  708.765016]             num waits   36220
> [  708.766054]             intvl wait time 54074 msecs
> [  708.767287]             max_wait_time   31000 usecs
> 
> 
> v5.17 + this series (similar to v5.6)
> -------------------------------------
> [  282.191391] Wait_debug: faults sec  1777939
> [  282.192571]             num faults  17779393
> [  282.193746]             num locks   5517
> [  282.194858]             intvl wait time 19907 msecs
> [  282.196226]             max_wait_time   43000 usecs
> 
> As can be seen, fault time suffers when there are other operations
> taking i_mmap_rwsem in write mode such as unmap.
> 
> This series proposes reverting c0d0381ade79 and 87bf91d39bb5, which
> depends on c0d0381ade79.  This moves acquisition of i_mmap_rwsem in the
> fault path back to huge_pmd_share, where it is only taken when necessary.
> After reverting these patches, we still need to handle:
> fault and truncate races
> - Catch and properly back out faults beyond i_size
>   Backing out reservations is much easier after commit 846be08578ed, which
>   expanded restore_reserve_on_error functionality.
> unshare and fault/lookup races
> - Since the pointer returned from huge_pte_offset or huge_pte_alloc may
>   become invalid before we take the page table lock, we must revalidate
>   after taking the lock.  Code paths must back out and possibly retry if
>   the page table pointer changes.
> 
> The commit message in patch 5 suggests that it is not safe to use
> SPLIT_PMD_PTLOCKS for hugetlb mappings if sharing is possible.  If
> others confirm/agree then there will need to be additional work.
> 
> Please help with comments or suggestions.  I would like to come up with
> something that is performant and safe.

May I challenge the existence of huge PMD sharing? TBH I am not
convinced that the code complexity is worth the benefit.


Let me know if I get something wrong:

Let's assume a 4 TiB device and 2 MiB hugepage size. That's 2097152 huge
pages. Each such PMD entry consumes 8 bytes. That's 16 MiB.

Sure, with thousands of processes sharing that memory, the size of page
tables required would increase with each and every process. But TBH,
that's in no way different to other file systems where we're even
dealing with PTE tables.


Which results in me wondering if

a) We should simply use gigantic pages for such extreme use case. Allows
   for freeing up more memory via vmemmap either way.
b) We should instead look into reclaiming reconstruct-able page table.
   It's hard to imagine that each and every process accesses each and
   every part of the gigantic file all of the time.
c) We should instead establish a more generic page table sharing
   mechanism.


Consequently, I'd be much more in favor of ripping it out :/ but that's
just my personal opinion.

Mike Kravetz April 7, 2022, 4:17 p.m. UTC | #2
On 4/7/22 03:08, David Hildenbrand wrote:
> On 06.04.22 22:48, Mike Kravetz wrote:
>> hugetlb fault scalability regressions have recently been reported [1].
>> This is not the first such report, as regressions were also noted when
>> commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
>> synchronization") was added [2] in v5.7.  At that time, a proposal to
>> address the regression was suggested [3] but went nowhere.
<snip>
>> Please help with comments or suggestions.  I would like to come up with
>> something that is performant and safe.
> 
> May I challenge the existence of huge PMD sharing? TBH I am not
> convinced that the code complexity is worth the benefit.
> 

That is a fair question.
Huge PMD sharing is not a documented or well known feature.  Most people would
not notice it going away.  However, I suspect some people will notice.

> Let me know if I get something wrong:
> 
> Let's assume a 4 TiB device and 2 MiB hugepage size. That's 2097152 huge
> pages. Each such PMD entry consumes 8 bytes. That's 16 MiB.
> 
> Sure, with thousands of processes sharing that memory, the size of page
> tables required would increase with each and every process. But TBH,
> that's in no way different to other file systems where we're even
> dealing with PTE tables.

The numbers for a real use case I am frequently quoted are something like
the following (broken down just below):
1TB shared mapping, 10,000 processes sharing the mapping
4K PMD Page per 1GB of shared mapping
4M saving for each shared process
9,999 * 4M ~= 39GB savings
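
For reference, those figures break down as follows, assuming 4K PMD pages
holding 512 8-byte entries and a 2MB huge page size:

1 PMD page = 4K = 512 entries * 8 bytes; 512 entries * 2MB = 1GB mapped
1TB mapping / 1GB per PMD page = 1024 PMD pages = 4M of PMD pages per process
9,999 of the 10,000 processes * 4M ~= 39GB of page tables avoided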

However, if you look at commit 39dde65c9940c which introduced huge pmd sharing
it states that performance rather than memory savings was the primary
objective.

"For hugetlb, the saving on page table memory is not the primary
 objective (as hugetlb itself already cuts down page table overhead
 significantly), instead, the purpose of using shared page table on hugetlb is
 to allow faster TLB refill and smaller cache pollution upon TLB miss.
    
 With PT sharing, pte entries are shared among hundreds of processes, the
 cache consumption used by all the page table is smaller and in return,
 application gets much higher cache hit ratio.  One other effect is that
 cache hit ratio with hardware page walker hitting on pte in cache will be
 higher and this helps to reduce tlb miss latency.  These two effects
 contribute to higher application performance."

That 'makes sense', but I have never tried to measure any such performance
benefit.  It is easier to calculate the memory savings.

> 
> Which results in me wondering if
> 
> a) We should simply use gigantic pages for such extreme use case. Allows
>    for freeing up more memory via vmemmap either way.

The only problem with this is that many processors in use today have
limited TLB entries for gigantic pages.

> b) We should instead look into reclaiming reconstruct-able page table.
>    It's hard to imagine that each and every process accesses each and
>    every part of the gigantic file all of the time.
> c) We should instead establish a more generic page table sharing
>    mechanism.

Yes.  I think that is the direction taken by mshare() proposal.  If we have
a more generic approach we can certainly start deprecating hugetlb pmd
sharing.

> 
> 
> Consequently, I'd be much more in favor of ripping it out :/ but that's
> just my personal opinion.
>

David Hildenbrand April 8, 2022, 9:26 a.m. UTC | #3
>>
>> Let's assume a 4 TiB device and 2 MiB hugepage size. That's 2097152 huge
>> pages. Each such PMD entry consumes 8 bytes. That's 16 MiB.
>>
>> Sure, with thousands of processes sharing that memory, the size of page
>> tables required would increase with each and every process. But TBH,
>> that's in no way different to other file systems where we're even
>> dealing with PTE tables.
> 
> The numbers for a real use case I am frequently quoted are something like:
> 1TB shared mapping, 10,000 processes sharing the mapping
> 4K PMD Page per 1GB of shared mapping
> 4M saving for each shared process
> 9,999 * 4M ~= 39GB savings

3.7 % of all memory. Noticeable if the feature is removed? yes. Do we
care about supporting such corner cases that result in a maintenance
burden? My take is a clear no.

> 
> However, if you look at commit 39dde65c9940c which introduced huge pmd sharing
> it states that performance rather than memory savings was the primary
> objective.
> 
> "For hugetlb, the saving on page table memory is not the primary
>  objective (as hugetlb itself already cuts down page table overhead
>  significantly), instead, the purpose of using shared page table on hugetlb is
>  to allow faster TLB refill and smaller cache pollution upon TLB miss.
>     
>  With PT sharing, pte entries are shared among hundreds of processes, the
>  cache consumption used by all the page table is smaller and in return,
>  application gets much higher cache hit ratio.  One other effect is that
>  cache hit ratio with hardware page walker hitting on pte in cache will be
>  higher and this helps to reduce tlb miss latency.  These two effects
>  contribute to higher application performance."
> 
> That 'makes sense', but I have never tried to measure any such performance
> benefit.  It is easier to calculate the memory savings.

It does make sense; but then again, what's specific here about hugetlb?

Most probably it was just easy to add to hugetlb in contrast to other
types of shared memory.

> 
>>
>> Which results in me wondering if
>>
>> a) We should simply use gigantic pages for such extreme use case. Allows
>>    for freeing up more memory via vmemmap either way.
> 
> The only problem with this is that many processors in use today have
> limited TLB entries for gigantic pages.
> 
>> b) We should instead look into reclaiming reconstruct-able page table.
>>    It's hard to imagine that each and every process accesses each and
>>    every part of the gigantic file all of the time.
>> c) We should instead establish a more generic page table sharing
>>    mechanism.
> 
> Yes.  I think that is the direction taken by mshare() proposal.  If we have
> a more generic approach we can certainly start deprecating hugetlb pmd
> sharing.

My strong opinion is to remove it ASAP and get something proper into place.

Mike Kravetz April 19, 2022, 10:50 p.m. UTC | #4
On 4/8/22 02:26, David Hildenbrand wrote:
>>>
>>> Let's assume a 4 TiB device and 2 MiB hugepage size. That's 2097152 huge
>>> pages. Each such PMD entry consumes 8 bytes. That's 16 MiB.
>>>
>>> Sure, with thousands of processes sharing that memory, the size of page
>>> tables required would increase with each and every process. But TBH,
>>> that's in no way different to other file systems where we're even
>>> dealing with PTE tables.
>>
>> The numbers for a real use case I am frequently quoted are something like:
>> 1TB shared mapping, 10,000 processes sharing the mapping
>> 4K PMD Page per 1GB of shared mapping
>> 4M saving for each shared process
>> 9,999 * 4M ~= 39GB savings
> 
> 3.7 % of all memory. Noticeable if the feature is removed? yes. Do we
> care about supporting such corner cases that result in a maintenance
> burden? My take is a clear no.
> 
>>
>> However, if you look at commit 39dde65c9940c which introduced huge pmd sharing
>> it states that performance rather than memory savings was the primary
>> objective.
>>
>> "For hugetlb, the saving on page table memory is not the primary
>>  objective (as hugetlb itself already cuts down page table overhead
>>  significantly), instead, the purpose of using shared page table on hugetlb is
>>  to allow faster TLB refill and smaller cache pollution upon TLB miss.
>>     
>>  With PT sharing, pte entries are shared among hundreds of processes, the
>>  cache consumption used by all the page table is smaller and in return,
>>  application gets much higher cache hit ratio.  One other effect is that
>>  cache hit ratio with hardware page walker hitting on pte in cache will be
>>  higher and this helps to reduce tlb miss latency.  These two effects
>>  contribute to higher application performance."
>>
>> That 'makes sense', but I have never tried to measure any such performance
>> benefit.  It is easier to calculate the memory savings.
> 
> It does make sense; but then again, what's specific here about hugetlb?
> 
> Most probably it was just easy to add to hugetlb in contrast to other
> types of shared memory.
> 
>>
>>>
>>> Which results in me wondering if
>>>
>>> a) We should simply use gigantic pages for such extreme use case. Allows
>>>    for freeing up more memory via vmemmap either way.
>>
>> The only problem with this is that many processors in use today have
>> limited TLB entries for gigantic pages.
>>
>>> b) We should instead look into reclaiming reconstruct-able page table.
>>>    It's hard to imagine that each and every process accesses each and
>>>    every part of the gigantic file all of the time.
>>> c) We should instead establish a more generic page table sharing
>>>    mechanism.
>>
>> Yes.  I think that is the direction taken by mshare() proposal.  If we have
>> a more generic approach we can certainly start deprecating hugetlb pmd
>> sharing.
> 
> My strong opinion is to remove it ASAP and get something proper into place.
> 

No arguments about the complexity of this code.  However, there will be some
people who will notice if it is removed.

Whether or not we remove huge pmd sharing support, I would still like to
address the scalability issue.  To do so, taking i_mmap_rwsem in read mode
for fault processing needs to go away.  With this gone, the issue of faults
racing with truncation needs to be addressed as it depended on fault code
taking the mutex.  At a high level, this is fairly simple but hugetlb
reservations add to the complexity.  This was not completely addressed in
this series.

I will be sending out another RFC that more correctly addresses all the issues
this series attempted to address.  I am not discounting your opinion that we
should get rid of huge pmd sharing.  Rather, I would at least like to get
some eyes on my approach to addressing the issue with reservations during
fault and truncate races.

David Hildenbrand April 20, 2022, 7:12 a.m. UTC | #5
On 20.04.22 00:50, Mike Kravetz wrote:
> On 4/8/22 02:26, David Hildenbrand wrote:
>>>>
>>>> Let's assume a 4 TiB device and 2 MiB hugepage size. That's 2097152 huge
>>>> pages. Each such PMD entry consumes 8 bytes. That's 16 MiB.
>>>>
>>>> Sure, with thousands of processes sharing that memory, the size of page
>>>> tables required would increase with each and every process. But TBH,
>>>> that's in no way different to other file systems where we're even
>>>> dealing with PTE tables.
>>>
>>> The numbers for a real use case I am frequently quoted are something like:
>>> 1TB shared mapping, 10,000 processes sharing the mapping
>>> 4K PMD Page per 1GB of shared mapping
>>> 4M saving for each shared process
>>> 9,999 * 4M ~= 39GB savings
>>
>> 3.7 % of all memory. Noticeable if the feature is removed? yes. Do we
>> care about supporting such corner cases that result in a maintenance
>> burden? My take is a clear no.
>>
>>>
>>> However, if you look at commit 39dde65c9940c which introduced huge pmd sharing
>>> it states that performance rather than memory savings was the primary
>>> objective.
>>>
>>> "For hugetlb, the saving on page table memory is not the primary
>>>  objective (as hugetlb itself already cuts down page table overhead
>>>  significantly), instead, the purpose of using shared page table on hugetlb is
>>>  to allow faster TLB refill and smaller cache pollution upon TLB miss.
>>>     
>>>  With PT sharing, pte entries are shared among hundreds of processes, the
>>>  cache consumption used by all the page table is smaller and in return,
>>>  application gets much higher cache hit ratio.  One other effect is that
>>>  cache hit ratio with hardware page walker hitting on pte in cache will be
>>>  higher and this helps to reduce tlb miss latency.  These two effects
>>>  contribute to higher application performance."
>>>
>>> That 'makes sense', but I have never tried to measure any such performance
>>> benefit.  It is easier to calculate the memory savings.
>>
>> It does make sense; but then again, what's specific here about hugetlb?
>>
>> Most probably it was just easy to add to hugetlb in contrast to other
>> types of shared memory.
>>
>>>
>>>>
>>>> Which results in me wondering if
>>>>
>>>> a) We should simply use gigantic pages for such extreme use case. Allows
>>>>    for freeing up more memory via vmemmap either way.
>>>
>>> The only problem with this is that many processors in use today have
>>> limited TLB entries for gigantic pages.
>>>
>>>> b) We should instead look into reclaiming reconstruct-able page table.
>>>>    It's hard to imagine that each and every process accesses each and
>>>>    every part of the gigantic file all of the time.
>>>> c) We should instead establish a more generic page table sharing
>>>>    mechanism.
>>>
>>> Yes.  I think that is the direction taken by mshare() proposal.  If we have
>>> a more generic approach we can certainly start deprecating hugetlb pmd
>>> sharing.
>>
>> My strong opinion is to remove it ASAP and get something proper into place.
>>
> 
> No arguments about the complexity of this code.  However, there will be some
> people who will notice if it is removed.

Yes, it should never have been added that way -- unfortunately.

> 
> Whether or not we remove huge pmd sharing support, I would still like to
> address the scalability issue.  To do so, taking i_mmap_rwsem in read mode
> for fault processing needs to go away.  With this gone, the issue of faults
> racing with truncation needs to be addressed as it depended on fault code
> taking the mutex.  At a high level, this is fairly simple but hugetlb
> reservations add to the complexity.  This was not completely addressed in
> this series.

Okay.

> 
> I will be sending out another RFC that more correctly addresses all the issues
> this series attempted to address.  I am not discounting your opinion that we
> should get rid of huge pmd sharing.  Rather, I would at least like to get
> some eyes on my approach to addressing the issue with reservations during
> fault and truncate races.

Makes sense to me. I agree that we should fix all that. What I
experienced is that the pmd sharing over-complicates the situation quite
a lot and makes the code hard to follow.

[huge page reservation is another thing I dislike, especially because
it's no good in NUMA setups and we still have to preallocate huge pages
to make it work reliably]