[v4,0/6] add mTHP support for anonymous shmem

Message ID cover.1717495894.git.baolin.wang@linux.alibaba.com (mailing list archive)

Message

Baolin Wang June 4, 2024, 10:17 a.m. UTC
Anonymous pages already support multi-size THP (mTHP) allocation since
commit 19eaf44954df, which allows THP to be configured through the
sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.

However, anonymous shmem ignores the anonymous mTHP rule configured through
the sysfs interface and can only use PMD-mapped THP, which is not reasonable.
Many users implement anonymous page sharing through mmap(MAP_SHARED |
MAP_ANONYMOUS), especially in database usage scenarios. Users therefore expect
a unified mTHP strategy for anonymous pages, including anonymous shared pages,
in order to enjoy the benefits of mTHP: for example, lower latency than
PMD-mapped THP, smaller memory bloat than PMD-mapped THP, and contiguous PTEs
on the ARM architecture to reduce TLB misses.
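
For illustration only (this is not part of the patch set), a minimal sketch of
the kind of mapping discussed above: an anonymous shared region created with
mmap(MAP_SHARED | MAP_ANONYMOUS) and then touched so that each page is faulted
in. The helper names map_anon_shared() and touch_range() are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map `len` bytes of anonymous shared memory; the kernel backs this with shmem.
 * Returns NULL on failure. (Hypothetical helper, for illustration only.) */
void *map_anon_shared(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* Write one byte every `stride` bytes so the touched pages are faulted in.
 * Returns the number of writes performed. */
size_t touch_range(unsigned char *p, size_t len, size_t stride)
{
	size_t touched = 0;

	assert(stride > 0);
	for (size_t off = 0; off < len; off += stride) {
		p[off] = 1;
		touched++;
	}
	return touched;
}
```

A caller would map a region, pass it to touch_range() with the base page size
as the stride, and release it with munmap(). Whether the faults are served by
small pages, mTHP or PMD-mapped THP depends on the sysfs settings.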

As discussed in the bi-weekly MM meeting[1], the mTHP controls should control
all of shmem, not only anonymous shmem, but support will be added iteratively.
Therefore, this patch set starts with support for anonymous shmem.

The primary strategy is similar to the one used for anonymous mTHP. Introduce
a new interface '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
which accepts almost the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', adding a new
"inherit" option and dropping the testing options 'force' and
'deny'. By default all sizes are set to "never" except PMD size, which
is set to "inherit". This ensures backward compatibility with the top-level
anonymous shmem setting, while also allowing independent control of
anonymous shmem for each mTHP size.
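
As a rough sketch of how user space might drive this interface (not taken from
the patches), the helpers below build the per-size knob path, write a value to
it, and extract the currently selected value from the file contents. The
function names are hypothetical. The sketch assumes the directory is spelled
'hugepage-<size>kB' and that the per-size file marks the active value with
square brackets, as the top-level shmem_enabled file does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Build the per-size knob path, e.g. size_kb = 64 for 64K mTHP.
 * Assumes the directory name uses the "kB" spelling. Returns 0 or -1. */
int mthp_shmem_knob_path(char *buf, size_t buflen, unsigned int size_kb)
{
	int n;

	assert(buf != NULL && buflen > 0);
	n = snprintf(buf, buflen,
		     "/sys/kernel/mm/transparent_hugepage/hugepage-%ukB/shmem_enabled",
		     size_kb);
	return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/* Write a value such as "always", "inherit" or "never" to a knob file.
 * Writing under /sys normally requires root. Returns 0 or -1. */
int write_knob(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(value, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Copy the bracketed (selected) value out of a line such as
 * "always inherit within_size advise [never]". Returns 0 or -1. */
int sysfs_selected_value(const char *line, char *out, size_t outlen)
{
	const char *lb = strchr(line, '[');
	const char *rb = lb ? strchr(lb, ']') : NULL;
	size_t n;

	if (!lb || !rb)
		return -1;
	n = (size_t)(rb - lb - 1);
	if (n + 1 > outlen)
		return -1;
	memcpy(out, lb + 1, n);
	out[n] = '\0';
	return 0;
}
```

For example, enabling 64K mTHP for anonymous shmem would amount to building the
path with size_kb = 64 and writing "always" to it.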

I used the page fault latency tool to measure the performance of 1G of
anonymous shmem with 32 threads on my machine (ARM64 architecture, 32 cores,
125G memory):
base: mm-unstable
user-time    sys_time    faults_per_sec_per_cpu    faults_per_sec
0.04s        3.10s       83516.416                 2669684.890

mm-unstable + patchset, anon shmem mTHP disabled
user-time    sys_time    faults_per_sec_per_cpu    faults_per_sec
0.02s        3.14s       82936.359                 2630746.027

mm-unstable + patchset, anon shmem 64K mTHP enabled
user-time    sys_time    faults_per_sec_per_cpu    faults_per_sec
0.08s        0.31s       678630.231                17082522.495

From the data above, the patchset has minimal impact when mTHP is not enabled
(faults_per_sec drops by about 1.5%, and some fluctuation was observed during
testing). Enabling 64K mTHP gives a significant improvement in page fault
latency: faults_per_sec rises by roughly 6.4x and sys_time falls from 3.10s
to 0.31s.
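
The comparison can be reproduced from the three result rows with a few lines of
C. This is only a sketch of the arithmetic: struct pft_result and ratio() are
hypothetical names, and the values are copied from the tables above.

```c
#include <assert.h>

/* One row of the results above. */
struct pft_result {
	double user_s;
	double sys_s;
	double faults_per_sec_per_cpu;
	double faults_per_sec;
};

const struct pft_result pft_base     = { 0.04, 3.10, 83516.416, 2669684.890 };
const struct pft_result pft_disabled = { 0.02, 3.14, 82936.359, 2630746.027 };
const struct pft_result pft_mthp_64k = { 0.08, 0.31, 678630.231, 17082522.495 };

/* Ratio of a measurement to its baseline; above 1.0 means the value grew. */
double ratio(double after, double before)
{
	assert(before > 0.0);
	return after / before;
}
```

Here ratio(pft_mthp_64k.faults_per_sec, pft_base.faults_per_sec) gives the
throughput gain with 64K mTHP enabled, and ratio(pft_disabled.faults_per_sec,
pft_base.faults_per_sec) gives the relative throughput with mTHP disabled.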

[1] https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/

Changes from v3:
 - Drop 'force' and 'deny' testing options for each mTHP.
 - Use new helper update_mmu_tlb_range(), per Lance.
 - Update documentation to drop "anonymous thp" terminology, per David.
 - Initialize the 'suitable_orders' in shmem_alloc_and_add_folio(),
   reported by kernel test robot.
 - Fix the highest mTHP order in shmem_get_unmapped_area().
 - Update some commit messages.

Changes from v2:
 - Rebased to mm/mm-unstable.
 - Remove 'huge' parameter for shmem_alloc_and_add_folio(), per Lance.

Changes from v1:
 - Drop the patch that re-arranges the position of highest_order() and
   next_order(), per Ryan.
 - Modify the finish_fault() to fix VA alignment issue, per Ryan and
   David.
 - Fix some build issues, reported by Lance and kernel test robot.
 - Update some commit messages.

Changes from RFC:
 - Rebase the patch set against the new mm-unstable branch, per Lance.
 - Add a new patch to export highest_order() and next_order().
 - Add a new patch to align mTHP size in shmem_get_unmapped_area().
 - Handle the uffd case and the VMA limits case when building mapping for
   large folio in the finish_fault() function, per Ryan.
 - Remove unnecessary 'order' variable in patch 3, per Kefeng.
 - Keep the anon shmem counters' names consistent.
 - Modify the strategy to support mTHP for anonymous shmem, discussed with
   Ryan and David.
 - Add reviewed tag from Barry.
 - Update the commit message.

Baolin Wang (6):
  mm: memory: extend finish_fault() to support large folio
  mm: shmem: add THP validation for PMD-mapped THP related statistics
  mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  mm: shmem: add mTHP support for anonymous shmem
  mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
  mm: shmem: add mTHP counters for anonymous shmem

 Documentation/admin-guide/mm/transhuge.rst |  23 ++
 include/linux/huge_mm.h                    |  23 ++
 mm/huge_memory.c                           |  17 +-
 mm/memory.c                                |  57 +++-
 mm/shmem.c                                 | 344 ++++++++++++++++++---
 5 files changed, 403 insertions(+), 61 deletions(-)

Comments

Andrew Morton June 4, 2024, 11:50 p.m. UTC | #1
On Tue,  4 Jun 2024 18:17:44 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:

> base: mm-unstable
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.04s        3.10s         83516.416                  2669684.890
>
> ...
>
> mm-unstable + patchset, anon shmem 64K mTHP enabled
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.08s        0.31s         678630.231                 17082522.495
> 

Geeze, is that the best you can do ;)

It's early and there's review work to be done.  But I'll queue this up
for testing now, as it's clearly something we should finish off and
get merged.

Daniel Gomez June 10, 2024, 12:10 p.m. UTC | #2
Hi Baolin,

On Tue, Jun 04, 2024 at 06:17:44PM +0800, Baolin Wang wrote:
> Anonymous pages have already been supported for multi-size (mTHP) allocation
> through commit 19eaf44954df, that can allow THP to be configured through the
> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
> 
> However, the anonymous shmem will ignore the anonymous mTHP rule configured
> through the sysfs interface, and can only use the PMD-mapped THP, that is not
> reasonable. Many implement anonymous page sharing through mmap(MAP_SHARED |
> MAP_ANONYMOUS), especially in database usage scenarios, therefore, users expect
> to apply an unified mTHP strategy for anonymous pages, also including the
> anonymous shared pages, in order to enjoy the benefits of mTHP. For example,
> lower latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP,
> contiguous PTEs on ARM architecture to reduce TLB miss etc.
> 
> As discussed in the bi-weekly MM meeting[1], the mTHP controls should control
> all of shmem, not only anonymous shmem, but support will be added iteratively.
> Therefore, this patch set starts with support for anonymous shmem.
> 
> The primary strategy is similar to supporting anonymous mTHP. Introduce
> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> which can have almost the same values as the top-level
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> additional "inherit" option and dropping the testing options 'force' and
> 'deny'. By default all sizes will be set to "never" except PMD size, which
> is set to "inherit". This ensures backward compatibility with the anonymous
> shmem enabled of the top level, meanwhile also allows independent control of
> anonymous shmem enabled for each mTHP.
> 
> Use the page fault latency tool to measure the performance of 1G anonymous shmem

I'm not familiar with this tool. Could you share which repo/tool you are
referring to?

Also, are you running or are you aware of any other tools/tests available for
shmem that we can use to make sure we do not introduce any regressions?

Thanks!
Daniel

> with 32 threads on my machine environment with: ARM64 Architecture, 32 cores,
> 125G memory:
> base: mm-unstable
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.04s        3.10s         83516.416                  2669684.890
> 
> mm-unstable + patchset, anon shmem mTHP disabled
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.02s        3.14s         82936.359                  2630746.027
> 
> mm-unstable + patchset, anon shmem 64K mTHP enabled
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.08s        0.31s         678630.231                 17082522.495
> 
> From the data above, it is observed that the patchset has a minimal impact when
> mTHP is not enabled (some fluctuations observed during testing). When enabling 64K
> mTHP, there is a significant improvement of the page fault latency.
> 
> [1] https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/
> 
> Changes from v3:
>  - Drop 'force' and 'deny' testing options for each mTHP.
>  - Use new helper update_mmu_tlb_range(), per Lance.
>  - Update documentation to drop "anonymous thp" terminology, per David.
>  - Initialize the 'suitable_orders' in shmem_alloc_and_add_folio(),
>    reported by kernel test robot.
>  - Fix the highest mTHP order in shmem_get_unmapped_area().
>  - Update some commit message.
> 
> Changes from v2:
>  - Rebased to mm/mm-unstable.
>  - Remove 'huge' parameter for shmem_alloc_and_add_folio(), per Lance.
> 
> Changes from v1:
>  - Drop the patch that re-arranges the position of highest_order() and
>    next_order(), per Ryan.
>  - Modify the finish_fault() to fix VA alignment issue, per Ryan and
>    David.
>  - Fix some building issues, reported by Lance and kernel test robot.
>  - Update some commit message.
> 
> Changes from RFC:
>  - Rebase the patch set against the new mm-unstable branch, per Lance.
>  - Add a new patch to export highest_order() and next_order().
>  - Add a new patch to align mTHP size in shmem_get_unmapped_area().
>  - Handle the uffd case and the VMA limits case when building mapping for
>    large folio in the finish_fault() function, per Ryan.
>  - Remove unnecessary 'order' variable in patch 3, per Kefeng.
>  - Keep the anon shmem counters' name consistency.
>  - Modify the strategy to support mTHP for anonymous shmem, discussed with
>    Ryan and David.
>  - Add reviewed tag from Barry.
>  - Update the commit message.
> 
> Baolin Wang (6):
>   mm: memory: extend finish_fault() to support large folio
>   mm: shmem: add THP validation for PMD-mapped THP related statistics
>   mm: shmem: add multi-size THP sysfs interface for anonymous shmem
>   mm: shmem: add mTHP support for anonymous shmem
>   mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
>   mm: shmem: add mTHP counters for anonymous shmem
> 
>  Documentation/admin-guide/mm/transhuge.rst |  23 ++
>  include/linux/huge_mm.h                    |  23 ++
>  mm/huge_memory.c                           |  17 +-
>  mm/memory.c                                |  57 +++-
>  mm/shmem.c                                 | 344 ++++++++++++++++++---
>  5 files changed, 403 insertions(+), 61 deletions(-)
> 
> -- 
> 2.39.3
>

Baolin Wang June 11, 2024, 2:53 a.m. UTC | #3
On 2024/6/10 20:10, Daniel Gomez wrote:
> Hi Baolin,
> 
> On Tue, Jun 04, 2024 at 06:17:44PM +0800, Baolin Wang wrote:
>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>> through commit 19eaf44954df, that can allow THP to be configured through the
>> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>
>> However, the anonymous shmem will ignore the anonymous mTHP rule configured
>> through the sysfs interface, and can only use the PMD-mapped THP, that is not
>> reasonable. Many implement anonymous page sharing through mmap(MAP_SHARED |
>> MAP_ANONYMOUS), especially in database usage scenarios, therefore, users expect
>> to apply an unified mTHP strategy for anonymous pages, also including the
>> anonymous shared pages, in order to enjoy the benefits of mTHP. For example,
>> lower latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP,
>> contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>
>> As discussed in the bi-weekly MM meeting[1], the mTHP controls should control
>> all of shmem, not only anonymous shmem, but support will be added iteratively.
>> Therefore, this patch set starts with support for anonymous shmem.
>>
>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>> which can have almost the same values as the top-level
>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>> additional "inherit" option and dropping the testing options 'force' and
>> 'deny'. By default all sizes will be set to "never" except PMD size, which
>> is set to "inherit". This ensures backward compatibility with the anonymous
>> shmem enabled of the top level, meanwhile also allows independent control of
>> anonymous shmem enabled for each mTHP.
>>
>> Use the page fault latency tool to measure the performance of 1G anonymous shmem
> 
> I'm not familiar with this tool. Could you share which repo/tool you are
> referring to?

Sure. The git repo is: https://github.com/gormanm/pft.git

I made a small change to test anon shmem:
diff --git a/pft.c b/pft.c
index 3ab1457..bbcd7e6 100644
--- a/pft.c
+++ b/pft.c
@@ -739,7 +739,7 @@ alloc_test_memory(void)
         int j;

         if (do_shm) {
-               if (p = alloc_shm(bytes)) {
+               if (p = valloc_shared(bytes)) {
                         do_mbind(p, bytes);
                         do_noclear(p, bytes);
                 }

> Also, are you running or are you aware of any other tools/tests available for
> shmem that we can use to make sure we do not introduce any regressions?

I ran the mm selftests, as well as some test cases I wrote for anon
shmem.