[0/8] add mTHP support for anonymous shmem

Message ID: cover.1714978902.git.baolin.wang@linux.alibaba.com

Message

Baolin Wang May 6, 2024, 8:46 a.m. UTC
Anonymous pages have supported multi-size THP (mTHP) allocation since commit
19eaf44954df, which allows mTHP to be configured through the sysfs interface
located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.

However, anonymous shared pages ignore the mTHP rules configured through that
sysfs interface and can only use PMD-mapped THP, which is not reasonable. Many
applications implement anonymous page sharing through
mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios, so
users expect a unified mTHP strategy for anonymous pages, including anonymous
shared pages, in order to enjoy the benefits of mTHP: for example, lower
latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP, and
contiguous PTEs on ARM architectures to reduce TLB misses.
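
For reference, the kind of anonymous shared mapping described above looks like
the following minimal sketch (illustrative only; the mapping size and access
pattern are arbitrary and not taken from this series):

/* Anonymous shared memory is backed by shmem; the write faults on this
 * mapping are the path this series extends to allocate mTHP-sized folios.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64UL * 1024 * 1024;	/* arbitrary example size */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(buf, 0, len);	/* touch every page to fault it in */
	munmap(buf, len);
	return 0;
}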

The primary strategy is similar to supporting anonymous mTHP. Introduce a new
interface, '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
which accepts all the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', plus an additional
"inherit" option. By default all sizes are set to "never" except the PMD
size, which is set to "inherit". This keeps backward compatibility with the
top-level shmem_enabled setting, while also allowing independent control of
shmem_enabled for each mTHP size.
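
As an illustration of the intended usage (a sketch only: the per-size
directory name below, 'hugepage-64kB', assumes the naming of the existing
anon mTHP interface), enabling 64K mTHP for anonymous shmem while leaving the
other sizes at their defaults could look like:

/* Sketch: write "always" to the 64K entry; the other sizes keep their
 * defaults ("never", with the PMD size set to "inherit").
 */
#include <stdio.h>

int main(void)
{
	const char *path = "/sys/kernel/mm/transparent_hugepage/"
			   "hugepage-64kB/shmem_enabled";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return 1;
	}
	fprintf(f, "always\n");
	return fclose(f) ? 1 : 0;
}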

I used the page fault latency tool to measure the performance of 1G of
anonymous shmem with 32 threads on my test machine (ARM64 architecture,
32 cores, 125G memory):
base: mm-unstable
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.04s        3.10s         83516.416                  2669684.890

mm-unstable + patchset, anon shmem mTHP disabled
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.02s        3.14s         82936.359                  2630746.027

mm-unstable + patchset, anon shmem 64K mTHP enabled
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.08s        0.31s         678630.231                 17082522.495

From the data above, the patchset has minimal impact when mTHP is not enabled
(only some fluctuations were observed during testing). With 64K mTHP enabled,
there is a significant improvement in page fault latency.
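
For context, the measurement is conceptually similar to the sketch below (a
hypothetical re-implementation, not the actual page fault latency tool; build
with -pthread): each thread write-faults its own slice of a shared anonymous
region, and a rate is derived from the elapsed wall-clock time.

/* Hypothetical sketch: 32 threads write-fault disjoint slices of a 1G
 * MAP_SHARED|MAP_ANONYMOUS region. The printed rate counts touched base
 * pages; with 64K mTHP enabled one fault populates several base pages,
 * which is where much of the speedup comes from.
 */
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define NTHREADS 32
#define REGION	 (1UL << 30)	/* 1G */

static char *region;
static long page_size;

static void *toucher(void *arg)
{
	size_t slice = REGION / NTHREADS;
	size_t base = (long)arg * slice;

	for (size_t off = 0; off < slice; off += page_size)
		region[base + off] = 1;	/* one write per base page */
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	struct timespec t0, t1;

	page_size = sysconf(_SC_PAGESIZE);
	region = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
		      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED)
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, toucher, (void *)i);
	for (long i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) +
		      (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("touched pages/sec: %.0f\n", (REGION / page_size) / secs);
	return 0;
}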

TODO:
 - Support mTHP for tmpfs.
 - Do not split large folios when shared memory is swapped out.
 - Support swapping in large folios for shared memory.

Changes from RFC:
 - Rebase the patch set against the new mm-unstable branch, per Lance.
 - Add a new patch to export highest_order() and next_order().
 - Add a new patch to align mTHP size in shmem_get_unmapped_area().
 - Handle the uffd case and the VMA limits case when building mapping for
   large folio in the finish_fault() function, per Ryan.
 - Remove unnecessary 'order' variable in patch 3, per Kefeng.
 - Keep the anon shmem counters' names consistent.
 - Modify the strategy to support mTHP for anonymous shmem, discussed with
   Ryan and David.
 - Add Reviewed-by tag from Barry.
 - Update the commit message.

Baolin Wang (8):
  mm: move highest_order() and next_order() out of the THP config
  mm: memory: extend finish_fault() to support large folio
  mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
  mm: shmem: add THP validation for PMD-mapped THP related statistics
  mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  mm: shmem: add mTHP support for anonymous shmem
  mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
  mm: shmem: add mTHP counters for anonymous shmem

 Documentation/admin-guide/mm/transhuge.rst |  29 ++
 include/linux/huge_mm.h                    |  35 ++-
 mm/huge_memory.c                           |  17 +-
 mm/memory.c                                |  43 ++-
 mm/shmem.c                                 | 335 ++++++++++++++++++---
 5 files changed, 387 insertions(+), 72 deletions(-)

Comments

Lance Yang May 6, 2024, 10:54 a.m. UTC | #1
Hey Baolin,

I found a compilation failure in one[1] of my configurations after applying
this series. The error message is as follows:

mm/shmem.c: In function ‘shmem_get_unmapped_area’:
././include/linux/compiler_types.h:460:45: error: call to ‘__compiletime_assert_481’ declared with attribute error: BUILD_BUG failed
        _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
                                            ^
././include/linux/compiler_types.h:441:25: note: in definition of macro ‘__compiletime_assert’
                         prefix ## suffix();                             \
                         ^~~~~~
././include/linux/compiler_types.h:460:9: note: in expansion of macro ‘_compiletime_assert’
        _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
        ^~~~~~~~~~~~~~~~~~~
./include/linux/build_bug.h:39:37: note: in expansion of macro ‘compiletime_assert’
 #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                     ^~~~~~~~~~~~~~~~~~
./include/linux/build_bug.h:59:21: note: in expansion of macro ‘BUILD_BUG_ON_MSG’
 #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
                     ^~~~~~~~~~~~~~~~
./include/linux/huge_mm.h:97:28: note: in expansion of macro ‘BUILD_BUG’
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
                            ^~~~~~~~~
./include/linux/huge_mm.h:104:35: note: in expansion of macro ‘HPAGE_PMD_SHIFT’
 #define HPAGE_PMD_SIZE  ((1UL) << HPAGE_PMD_SHIFT)
                                   ^~~~~~~~~~~~~~~
mm/shmem.c:2419:36: note: in expansion of macro ‘HPAGE_PMD_SIZE’
        unsigned long hpage_size = HPAGE_PMD_SIZE;
                                   ^~~~~~~~~~~~~~~

It seems like we need to handle the case where CONFIG_PGTABLE_HAS_HUGE_LEAVES
is undefined.

[1] export ARCH=arm64 && make allnoconfig && make olddefconfig && make -j$(nproc)

Thanks,
Lance
Baolin Wang May 7, 2024, 1:47 a.m. UTC | #2
Hi Lance,

On 2024/5/6 18:54, Lance Yang wrote:
> Hey Baolin,
> 
> I found a compilation issue that failed one[1] of my configurations
> after applying this series. The error message is as follows:
> 
> mm/shmem.c: In function ‘shmem_get_unmapped_area’:
> ././include/linux/compiler_types.h:460:45: error: call to ‘__compiletime_assert_481’ declared with attribute error: BUILD_BUG failed
>          _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>                                              ^
> ././include/linux/compiler_types.h:441:25: note: in definition of macro ‘__compiletime_assert’
>                           prefix ## suffix();                             \
>                           ^~~~~~
> ././include/linux/compiler_types.h:460:9: note: in expansion of macro ‘_compiletime_assert’
>          _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          ^~~~~~~~~~~~~~~~~~~
> ./include/linux/build_bug.h:39:37: note: in expansion of macro ‘compiletime_assert’
>   #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>                                       ^~~~~~~~~~~~~~~~~~
> ./include/linux/build_bug.h:59:21: note: in expansion of macro ‘BUILD_BUG_ON_MSG’
>   #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
>                       ^~~~~~~~~~~~~~~~
> ./include/linux/huge_mm.h:97:28: note: in expansion of macro ‘BUILD_BUG’
>   #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
>                              ^~~~~~~~~
> ./include/linux/huge_mm.h:104:35: note: in expansion of macro ‘HPAGE_PMD_SHIFT’
>   #define HPAGE_PMD_SIZE  ((1UL) << HPAGE_PMD_SHIFT)
>                                     ^~~~~~~~~~~~~~~
> mm/shmem.c:2419:36: note: in expansion of macro ‘HPAGE_PMD_SIZE’
>          unsigned long hpage_size = HPAGE_PMD_SIZE;
>                                     ^~~~~~~~~~~~~~~
> 
> It seems like we need to handle the case where CONFIG_PGTABLE_HAS_HUGE_LEAVES
> is undefined.
> 
> [1] export ARCH=arm64 && make allnoconfig && make olddefconfig && make -j$(nproc)

Thanks for reporting. I can move the use of HPAGE_PMD_SIZE to after the 
check for CONFIG_TRANSPARENT_HUGEPAGE, which avoids the build error:

diff --git a/mm/shmem.c b/mm/shmem.c
index 1af2f0aa384d..d603e36e0f4f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2416,7 +2416,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
         unsigned long inflated_len;
         unsigned long inflated_addr;
         unsigned long inflated_offset;
-       unsigned long hpage_size = HPAGE_PMD_SIZE;
+       unsigned long hpage_size;

         if (len > TASK_SIZE)
                 return -ENOMEM;
@@ -2446,6 +2446,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
         if (uaddr == addr)
                 return addr;

+       hpage_size = HPAGE_PMD_SIZE;
         if (shmem_huge != SHMEM_HUGE_FORCE) {
                 struct super_block *sb;
                 unsigned long __maybe_unused hpage_orders;
Lance Yang May 7, 2024, 6:50 a.m. UTC | #3
On Tue, May 7, 2024 at 9:47 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Hi Lance,
>
> On 2024/5/6 18:54, Lance Yang wrote:
> > Hey Baolin,
> >
> > I found a compilation issue that failed one[1] of my configurations
> > after applying this series. The error message is as follows:
> >
> > mm/shmem.c: In function ‘shmem_get_unmapped_area’:
> > ././include/linux/compiler_types.h:460:45: error: call to ‘__compiletime_assert_481’ declared with attribute error: BUILD_BUG failed
> >          _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
> >                                              ^
> > ././include/linux/compiler_types.h:441:25: note: in definition of macro ‘__compiletime_assert’
> >                           prefix ## suffix();                             \
> >                           ^~~~~~
> > ././include/linux/compiler_types.h:460:9: note: in expansion of macro ‘_compiletime_assert’
> >          _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
> >          ^~~~~~~~~~~~~~~~~~~
> > ./include/linux/build_bug.h:39:37: note: in expansion of macro ‘compiletime_assert’
> >   #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
> >                                       ^~~~~~~~~~~~~~~~~~
> > ./include/linux/build_bug.h:59:21: note: in expansion of macro ‘BUILD_BUG_ON_MSG’
> >   #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
> >                       ^~~~~~~~~~~~~~~~
> > ./include/linux/huge_mm.h:97:28: note: in expansion of macro ‘BUILD_BUG’
> >   #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> >                              ^~~~~~~~~
> > ./include/linux/huge_mm.h:104:35: note: in expansion of macro ‘HPAGE_PMD_SHIFT’
> >   #define HPAGE_PMD_SIZE  ((1UL) << HPAGE_PMD_SHIFT)
> >                                     ^~~~~~~~~~~~~~~
> > mm/shmem.c:2419:36: note: in expansion of macro ‘HPAGE_PMD_SIZE’
> >          unsigned long hpage_size = HPAGE_PMD_SIZE;
> >                                     ^~~~~~~~~~~~~~~
> >
> > It seems like we need to handle the case where CONFIG_PGTABLE_HAS_HUGE_LEAVES
> > is undefined.
> >
> > [1] export ARCH=arm64 && make allnoconfig && make olddefconfig && make -j$(nproc)
>
> Thanks for reporting. I can move the use of HPAGE_PMD_SIZE to after the
> check for CONFIG_TRANSPARENT_HUGEPAGE, which can avoid the building error:

I confirmed that the issue I reported before has disappeared after applying
this change. For the fix,

Tested-by: Lance Yang <ioworker0@gmail.com>

Thanks,
Lance

>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 1af2f0aa384d..d603e36e0f4f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2416,7 +2416,7 @@ unsigned long shmem_get_unmapped_area(struct file
> *file,
>          unsigned long inflated_len;
>          unsigned long inflated_addr;
>          unsigned long inflated_offset;
> -       unsigned long hpage_size = HPAGE_PMD_SIZE;
> +       unsigned long hpage_size;
>
>          if (len > TASK_SIZE)
>                  return -ENOMEM;
> @@ -2446,6 +2446,7 @@ unsigned long shmem_get_unmapped_area(struct file
> *file,
>          if (uaddr == addr)
>                  return addr;
>
> +       hpage_size = HPAGE_PMD_SIZE;
>          if (shmem_huge != SHMEM_HUGE_FORCE) {
>                  struct super_block *sb;
>                  unsigned long __maybe_unused hpage_orders;
Ryan Roberts May 7, 2024, 10:20 a.m. UTC | #4
On 06/05/2024 09:46, Baolin Wang wrote:
> Anonymous pages have already been supported for multi-size (mTHP) allocation
> through commit 19eaf44954df, that can allow THP to be configured through the
> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
> 
> However, the anonymous shared pages will ignore the anonymous mTHP rule
> configured through the sysfs interface, and can only use the PMD-mapped
> THP, that is not reasonable. Many implement anonymous page sharing through
> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
> also including the anonymous shared pages, in order to enjoy the benefits of
> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
> 
> The primary strategy is similar to supporting anonymous mTHP. Introduce
> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> which can have all the same values as the top-level

Didn't we agree that "force" would not be supported for now, and would return an
error when attempting to set for a non-PMD-size hugepage-XXkb/shmem_enabled (or
indirectly through inheritance)?
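
For concreteness, a purely hypothetical sketch of such a check (not the code
from patch 5) in the per-size shmem_enabled store path could look like:

/* Hypothetical helper, not taken from patch 5: "force" applies huge pages
 * to all shmem/tmpfs and currently only has a meaning at PMD size, so a
 * write of "force" for any other order would be rejected.
 */
#include <linux/errno.h>
#include <linux/huge_mm.h>
#include <linux/string.h>

static int shmem_enabled_reject_force(int order, const char *buf)
{
	if (sysfs_streq(buf, "force") && order != HPAGE_PMD_ORDER)
		return -EINVAL;
	return 0;
}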

> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> additional "inherit" option. By default all sizes will be set to "never"
> except PMD size, which is set to "inherit". This ensures backward compatibility
> with the shmem enabled of the top level, meanwhile also allows independent
> control of shmem enabled for each mTHP.
> 
> Use the page fault latency tool to measure the performance of 1G anonymous shmem
> with 32 threads on my machine environment with: ARM64 Architecture, 32 cores,
> 125G memory:
> base: mm-unstable
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.04s        3.10s         83516.416                  2669684.890
> 
> mm-unstable + patchset, anon shmem mTHP disabled
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.02s        3.14s         82936.359                  2630746.027
> 
> mm-unstable + patchset, anon shmem 64K mTHP enabled
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.08s        0.31s         678630.231                 17082522.495
> 
> From the data above, it is observed that the patchset has a minimal impact when
> mTHP is not enabled (some fluctuations observed during testing). When enabling 64K
> mTHP, there is a significant improvement of the page fault latency.
> 
> TODO:
>  - Support mTHP for tmpfs.
>  - Do not split the large folio when share memory swap out.
>  - Can swap in a large folio for share memory.
> 
> Changes from RFC:
>  - Rebase the patch set against the new mm-unstable branch, per Lance.
>  - Add a new patch to export highest_order() and next_order().
>  - Add a new patch to align mTHP size in shmem_get_unmapped_area().
>  - Handle the uffd case and the VMA limits case when building mapping for
>    large folio in the finish_fault() function, per Ryan.
>  - Remove unnecessary 'order' variable in patch 3, per Kefeng.
>  - Keep the anon shmem counters' name consistency.
>  - Modify the strategy to support mTHP for anonymous shmem, discussed with
>    Ryan and David.
>  - Add reviewed tag from Barry.
>  - Update the commit message.
> 
> Baolin Wang (8):
>   mm: move highest_order() and next_order() out of the THP config
>   mm: memory: extend finish_fault() to support large folio
>   mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
>   mm: shmem: add THP validation for PMD-mapped THP related statistics
>   mm: shmem: add multi-size THP sysfs interface for anonymous shmem
>   mm: shmem: add mTHP support for anonymous shmem
>   mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
>   mm: shmem: add mTHP counters for anonymous shmem
> 
>  Documentation/admin-guide/mm/transhuge.rst |  29 ++
>  include/linux/huge_mm.h                    |  35 ++-
>  mm/huge_memory.c                           |  17 +-
>  mm/memory.c                                |  43 ++-
>  mm/shmem.c                                 | 335 ++++++++++++++++++---
>  5 files changed, 387 insertions(+), 72 deletions(-)
>
Baolin Wang May 8, 2024, 5:45 a.m. UTC | #5
On 2024/5/7 18:20, Ryan Roberts wrote:
> On 06/05/2024 09:46, Baolin Wang wrote:
>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>> through commit 19eaf44954df, that can allow THP to be configured through the
>> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>
>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>> configured through the sysfs interface, and can only use the PMD-mapped
>> THP, that is not reasonable. Many implement anonymous page sharing through
>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>> also including the anonymous shared pages, in order to enjoy the benefits of
>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>
>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>> which can have all the same values as the top-level
> 
> Didn't we agree that "force" would not be supported for now, and would return an
> error when attempting to set for a non-PMD-size hugepage-XXkb/shmem_enabled (or
> indirectly through inheritance)?

Yes. Sorry, I did not explain it in detail in the cover letter. Please see 
patch 5, which you have already commented on.
Daniel Gomez May 8, 2024, 11:39 a.m. UTC | #6
On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
> Anonymous pages have already been supported for multi-size (mTHP) allocation
> through commit 19eaf44954df, that can allow THP to be configured through the
> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
> 
> However, the anonymous shared pages will ignore the anonymous mTHP rule
> configured through the sysfs interface, and can only use the PMD-mapped
> THP, that is not reasonable. Many implement anonymous page sharing through
> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
> also including the anonymous shared pages, in order to enjoy the benefits of
> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
> 
> The primary strategy is similar to supporting anonymous mTHP. Introduce
> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> which can have all the same values as the top-level
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> additional "inherit" option. By default all sizes will be set to "never"
> except PMD size, which is set to "inherit". This ensures backward compatibility
> with the shmem enabled of the top level, meanwhile also allows independent
> control of shmem enabled for each mTHP.

I'm trying to understand the adoption of mTHP and how it fits into the adoption
of (large) folios that the kernel is moving towards. Can you, or anyone involved
here, explain this? How much do they overlap, and can we benefit from having
both? Is there any argument against the adoption of large folios here that I
might have missed?

> 
> Use the page fault latency tool to measure the performance of 1G anonymous shmem
> with 32 threads on my machine environment with: ARM64 Architecture, 32 cores,
> 125G memory:
> base: mm-unstable
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.04s        3.10s         83516.416                  2669684.890
> 
> mm-unstable + patchset, anon shmem mTHP disabled
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.02s        3.14s         82936.359                  2630746.027
> 
> mm-unstable + patchset, anon shmem 64K mTHP enabled
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.08s        0.31s         678630.231                 17082522.495
> 
> From the data above, it is observed that the patchset has a minimal impact when
> mTHP is not enabled (some fluctuations observed during testing). When enabling 64K
> mTHP, there is a significant improvement of the page fault latency.
> 
> TODO:
>  - Support mTHP for tmpfs.
>  - Do not split the large folio when share memory swap out.
>  - Can swap in a large folio for share memory.
> 
> Changes from RFC:
>  - Rebase the patch set against the new mm-unstable branch, per Lance.
>  - Add a new patch to export highest_order() and next_order().
>  - Add a new patch to align mTHP size in shmem_get_unmapped_area().
>  - Handle the uffd case and the VMA limits case when building mapping for
>    large folio in the finish_fault() function, per Ryan.
>  - Remove unnecessary 'order' variable in patch 3, per Kefeng.
>  - Keep the anon shmem counters' name consistency.
>  - Modify the strategy to support mTHP for anonymous shmem, discussed with
>    Ryan and David.
>  - Add reviewed tag from Barry.
>  - Update the commit message.
> 
> Baolin Wang (8):
>   mm: move highest_order() and next_order() out of the THP config
>   mm: memory: extend finish_fault() to support large folio
>   mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
>   mm: shmem: add THP validation for PMD-mapped THP related statistics
>   mm: shmem: add multi-size THP sysfs interface for anonymous shmem
>   mm: shmem: add mTHP support for anonymous shmem
>   mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
>   mm: shmem: add mTHP counters for anonymous shmem
> 
>  Documentation/admin-guide/mm/transhuge.rst |  29 ++
>  include/linux/huge_mm.h                    |  35 ++-
>  mm/huge_memory.c                           |  17 +-
>  mm/memory.c                                |  43 ++-
>  mm/shmem.c                                 | 335 ++++++++++++++++++---
>  5 files changed, 387 insertions(+), 72 deletions(-)
> 
> -- 
> 2.39.3
>
David Hildenbrand May 8, 2024, 11:58 a.m. UTC | #7
On 08.05.24 13:39, Daniel Gomez wrote:
> On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>> through commit 19eaf44954df, that can allow THP to be configured through the
>> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>
>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>> configured through the sysfs interface, and can only use the PMD-mapped
>> THP, that is not reasonable. Many implement anonymous page sharing through
>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>> also including the anonymous shared pages, in order to enjoy the benefits of
>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>
>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>> which can have all the same values as the top-level
>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>> additional "inherit" option. By default all sizes will be set to "never"
>> except PMD size, which is set to "inherit". This ensures backward compatibility
>> with the shmem enabled of the top level, meanwhile also allows independent
>> control of shmem enabled for each mTHP.
> 
> I'm trying to understand the adoption of mTHP and how it fits into the adoption
> of (large) folios that the kernel is moving towards. Can you, or anyone involved
> here, explain this? How much do they overlap, and can we benefit from having
> both? Is there any argument against the adoption of large folios here that I
> might have missed?

mTHP are implemented using large folios, just like traditional PMD-sized 
THP are. (you really should explore the history of mTHP and how it all 
works internally)

The biggest challenge with memory that cannot be evicted on memory 
pressure to be reclaimed (in contrast to your ordinary files in the 
pagecache) is memory waste, well, and placement of large chunks of 
memory in general, during page faults.

In the worst case (no swap), you allocate a large chunk of memory once 
and it will stick around until freed: no reclaim of that memory.

That's the reason why THP for anonymous memory and SHMEM have toggles to 
manually enable and configure them, in contrast to the pagecache. The 
same was done for mTHP for anonymous memory, and now (anon) shmem follows.

There are plans to, at some point, have it all working automatically, but a 
lot of what that needs for anonymous memory (and similarly for shmem) is 
still missing and unclear.
Daniel Gomez May 8, 2024, 2:28 p.m. UTC | #8
On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
> On 08.05.24 13:39, Daniel Gomez wrote:
> > On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
> > > Anonymous pages have already been supported for multi-size (mTHP) allocation
> > > through commit 19eaf44954df, that can allow THP to be configured through the
> > > sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
> > > 
> > > However, the anonymous shared pages will ignore the anonymous mTHP rule
> > > configured through the sysfs interface, and can only use the PMD-mapped
> > > THP, that is not reasonable. Many implement anonymous page sharing through
> > > mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
> > > therefore, users expect to apply an unified mTHP strategy for anonymous pages,
> > > also including the anonymous shared pages, in order to enjoy the benefits of
> > > mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
> > > than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
> > > 
> > > The primary strategy is similar to supporting anonymous mTHP. Introduce
> > > a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> > > which can have all the same values as the top-level
> > > '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> > > additional "inherit" option. By default all sizes will be set to "never"
> > > except PMD size, which is set to "inherit". This ensures backward compatibility
> > > with the shmem enabled of the top level, meanwhile also allows independent
> > > control of shmem enabled for each mTHP.
> > 
> > I'm trying to understand the adoption of mTHP and how it fits into the adoption
> > of (large) folios that the kernel is moving towards. Can you, or anyone involved
> > here, explain this? How much do they overlap, and can we benefit from having
> > both? Is there any argument against the adoption of large folios here that I
> > might have missed?
> 
> mTHP are implemented using large folios, just like traditional PMD-sized THP
> are. (you really should explore the history of mTHP and how it all works
> internally)

I'll dig deeper into the code. By any chance, are any of you going to be at
LSFMM this year? I have this session [1] scheduled for Wednesday, and it would
be nice to get your feedback on it and whether you see it working together with
mTHP/THP.

[1] https://lore.kernel.org/all/4ktpayu66noklllpdpspa3vm5gbmb5boxskcj2q6qn7md3pwwt@kvlu64pqwjzl/

> 
> The biggest challenge with memory that cannot be evicted on memory pressure
> to be reclaimed (in contrast to your ordinary files in the pagecache) is
> memory waste, well, and placement of large chunks of memory in general,
> during page faults.
> 
> In the worst case (no swap), you allocate a large chunk of memory once and
> it will stick around until freed: no reclaim of that memory.

I can see that path being triggered by some fstests but only for THP (where we
can actually reclaim memory).

> 
> That's the reason why THP for anonymous memory and SHMEM have toggles to
> manually enable and configure them, in contrast to the pagecache. The same
> was done for mTHP for anonymous memory, and now (anon) shmem follows.
> 
> There are plans to have, at some point, have it all working automatically,
> but a lot for that for anonymous memory (and shmem similarly) is still
> missing and unclear.

Thanks.

> 
> -- 
> Cheers,
> 
> David / dhildenb
>
David Hildenbrand May 8, 2024, 5:03 p.m. UTC | #9
On 08.05.24 16:28, Daniel Gomez wrote:
> On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
>> On 08.05.24 13:39, Daniel Gomez wrote:
>>> On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
>>>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>>>> through commit 19eaf44954df, that can allow THP to be configured through the
>>>> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>
>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>>>> also including the anonymous shared pages, in order to enjoy the benefits of
>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>>>
>>>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>>>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>>>> which can have all the same values as the top-level
>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>>>> additional "inherit" option. By default all sizes will be set to "never"
>>>> except PMD size, which is set to "inherit". This ensures backward compatibility
>>>> with the shmem enabled of the top level, meanwhile also allows independent
>>>> control of shmem enabled for each mTHP.
>>>
>>> I'm trying to understand the adoption of mTHP and how it fits into the adoption
>>> of (large) folios that the kernel is moving towards. Can you, or anyone involved
>>> here, explain this? How much do they overlap, and can we benefit from having
>>> both? Is there any argument against the adoption of large folios here that I
>>> might have missed?
>>
>> mTHP are implemented using large folios, just like traditional PMD-sized THP
>> are. (you really should explore the history of mTHP and how it all works
>> internally)
> 
> I'll check more in deep the code. By any chance are any of you going to be at
> LSFMM this year? I have this session [1] scheduled for Wednesday and it would
> be nice to get your feedback on it and if you see this working together with
> mTHP/THP.
>

I'll be around and will attend that session! But note that I am still 
scratching my head what to do with "ordinary" shmem, especially because 
of the weird way shmem behaves in contrast to real files (below). Some 
input from Hugh might be very helpful.

Example: you write() to a shmem file and populate a 2M THP. Then, nobody 
touches that file for a long time. There are certainly other mmap() 
users that could better benefit from that THP ... and without swap that 
THP will be trapped there possibly a long time (unless I am missing an 
important piece of shmem THP design :) )? Sure, if we only have THP's 
it's nice, that's just not the reality unfortunately. IIRC, that's one 
of the reasons why THP for shmem can be enabled/disabled. But again, 
still scratching my head ...


Note that this patch set only tackles anonymous shmem 
(MAP_SHARED|MAP_ANON), which is in 99.999% of all cases only accessed 
via page tables (memory allocated during page faults). I think there are 
ways to grab the fd (/proc/self/fd), but IIRC only corner cases 
read/write that.

So in that sense, anonymous shmem (this patch set) behaves mostly like 
ordinary anonymous memory, and likely there is not much overlap with 
other "allocate large folios during read/write/fallocate" as in [1]. 
swap might have an overlap.


The real confusion begins when we have ordinary shmem: some users never 
mmap it and only read/write, some users never read/write it and only 
mmap it and some (less common?) users do both.

And shmem really is special: it looks like "just another file", but 
memory-consumption and reclaim wise it behaves just like anonymous 
memory. It might be swappable ("usually very limited backing disk space 
available") or it might not.

In a subthread here we are discussing what to do with that special 
"shmem_enabled = force" mode ... and it's all complicated I think.

> [1] https://lore.kernel.org/all/4ktpayu66noklllpdpspa3vm5gbmb5boxskcj2q6qn7md3pwwt@kvlu64pqwjzl/
> 
>>
>> The biggest challenge with memory that cannot be evicted on memory pressure
>> to be reclaimed (in contrast to your ordinary files in the pagecache) is
>> memory waste, well, and placement of large chunks of memory in general,
>> during page faults.
>>
>> In the worst case (no swap), you allocate a large chunk of memory once and
>> it will stick around until freed: no reclaim of that memory.
> 
> I can see that path being triggered by some fstests but only for THP (where we
> can actually reclaim memory).

Is that when we punch-hole a partial THP and split it? I'd be interested 
in what that test does.
Luis Chamberlain May 8, 2024, 7:23 p.m. UTC | #10
On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
> On 08.05.24 13:39, Daniel Gomez wrote:
> > On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
> > > The primary strategy is similar to supporting anonymous mTHP. Introduce
> > > a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> > > which can have all the same values as the top-level
> > > '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> > > additional "inherit" option. By default all sizes will be set to "never"
> > > except PMD size, which is set to "inherit". This ensures backward compatibility
> > > with the shmem enabled of the top level, meanwhile also allows independent
> > > control of shmem enabled for each mTHP.
> > 
> > I'm trying to understand the adoption of mTHP and how it fits into the adoption
> > of (large) folios that the kernel is moving towards. Can you, or anyone involved
> > here, explain this? How much do they overlap, and can we benefit from having
> > both? Is there any argument against the adoption of large folios here that I
> > might have missed?
> 
> mTHP are implemented using large folios, just like traditional PMD-sized THP
> are.
> 
> The biggest challenge with memory that cannot be evicted on memory pressure
> to be reclaimed (in contrast to your ordinary files in the pagecache) is
> memory waste, well, and placement of large chunks of memory in general,
> during page faults.
> 
> In the worst case (no swap), you allocate a large chunk of memory once and
> it will stick around until freed: no reclaim of that memory.
> 
> That's the reason why THP for anonymous memory and SHMEM have toggles to
> manually enable and configure them, in contrast to the pagecache. The same
> was done for mTHP for anonymous memory, and now (anon) shmem follows.
> 
> There are plans to have, at some point, have it all working automatically,
> but a lot for that for anonymous memory (and shmem similarly) is still
> missing and unclear.

Whereas the use of large folios for filesystems is already automatic,
so long as the filesystem supports it. We already do this in the readahead
and write paths for iomap: we opportunistically use large folios if we can,
otherwise we use smaller folios.

So a recommended approach by Matthew was to use the readahead and write
paths, just as iomap does, to determine the folio size to use [0]. The use
of large folios would then be automatic and not require any knobs at all.
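
For comparison, the pagecache-side opt-in is basically a one-liner for the
filesystem; a rough sketch (example_fs_setup_inode() is a made-up stand-in
for a filesystem's inode setup helper, mapping_set_large_folios() is the
existing pagecache interface):

/* Sketch: once the mapping is flagged, readahead and the iomap buffered
 * write path pick folio sizes automatically, with no per-size knobs.
 */
#include <linux/fs.h>
#include <linux/pagemap.h>

static void example_fs_setup_inode(struct inode *inode)
{
	/* ... usual inode and address_space setup ... */
	mapping_set_large_folios(inode->i_mapping);
}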

The mTHP approach would grow "THP" use in filesystems through the one
filesystem that already uses THP. Meanwhile, the use of large folios is
already automatic with the approach taken by iomap.

We're at a crux where it begs the question whether we should continue to
chug on with tmpfs being special and doing things differently, extending the
old THP interface with mTHP, or whether it should just use large folios with
the same approach as iomap.

From my perspective the more shared code the better, and the more shared
paths the better. There is a chance to help test swap with large folios
instead of splitting the folios for swap, and that could be done first with
tmpfs. I have not evaluated the difference in testing, or how we could get
the most out of shared code with either the mTHP approach or the iomap
approach for tmpfs; that should be considered.

Are there other things to consider? Does this require some dialog at
LSFMM?

[0] https://lore.kernel.org/all/ZHD9zmIeNXICDaRJ@casper.infradead.org/

  Luis
Baolin Wang May 9, 2024, 3:08 a.m. UTC | #11
On 2024/5/8 22:28, Daniel Gomez wrote:
> On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
>> On 08.05.24 13:39, Daniel Gomez wrote:
>>> On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
>>>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>>>> through commit 19eaf44954df, that can allow THP to be configured through the
>>>> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>
>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>>>> also including the anonymous shared pages, in order to enjoy the benefits of
>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>>>
>>>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>>>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>>>> which can have all the same values as the top-level
>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>>>> additional "inherit" option. By default all sizes will be set to "never"
>>>> except PMD size, which is set to "inherit". This ensures backward compatibility
>>>> with the shmem enabled of the top level, meanwhile also allows independent
>>>> control of shmem enabled for each mTHP.
>>>
>>> I'm trying to understand the adoption of mTHP and how it fits into the adoption
>>> of (large) folios that the kernel is moving towards. Can you, or anyone involved
>>> here, explain this? How much do they overlap, and can we benefit from having
>>> both? Is there any argument against the adoption of large folios here that I
>>> might have missed?
>>
>> mTHP are implemented using large folios, just like traditional PMD-sized THP
>> are. (you really should explore the history of mTHP and how it all works
>> internally)
> 
> I'll check more in deep the code. By any chance are any of you going to be at
> LSFMM this year? I have this session [1] scheduled for Wednesday and it would
> be nice to get your feedback on it and if you see this working together with
> mTHP/THP.
> 
> [1] https://lore.kernel.org/all/4ktpayu66noklllpdpspa3vm5gbmb5boxskcj2q6qn7md3pwwt@kvlu64pqwjzl/

Great. I'm also interested in tmpfs support for large folios (or mTHP), 
so please CC me if you plan to send a new version.

As David mentioned, this patchset is mainly about adding mTHP support for 
anonymous shmem, and I think some of the large folio swap support could work 
together with it.
David Hildenbrand May 9, 2024, 5:48 p.m. UTC | #12
On 08.05.24 21:23, Luis Chamberlain wrote:
> On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
>> On 08.05.24 13:39, Daniel Gomez wrote:
>>> On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
>>>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>>>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>>>> which can have all the same values as the top-level
>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>>>> additional "inherit" option. By default all sizes will be set to "never"
>>>> except PMD size, which is set to "inherit". This ensures backward compatibility
>>>> with the shmem enabled of the top level, meanwhile also allows independent
>>>> control of shmem enabled for each mTHP.
>>>
>>> I'm trying to understand the adoption of mTHP and how it fits into the adoption
>>> of (large) folios that the kernel is moving towards. Can you, or anyone involved
>>> here, explain this? How much do they overlap, and can we benefit from having
>>> both? Is there any argument against the adoption of large folios here that I
>>> might have missed?
>>
>> mTHP are implemented using large folios, just like traditional PMD-sized THP
>> are.
>>
>> The biggest challenge with memory that cannot be evicted on memory pressure
>> to be reclaimed (in contrast to your ordinary files in the pagecache) is
>> memory waste, well, and placement of large chunks of memory in general,
>> during page faults.
>>
>> In the worst case (no swap), you allocate a large chunk of memory once and
>> it will stick around until freed: no reclaim of that memory.
>>
>> That's the reason why THP for anonymous memory and SHMEM have toggles to
>> manually enable and configure them, in contrast to the pagecache. The same
>> was done for mTHP for anonymous memory, and now (anon) shmem follows.
>>
>> There are plans to have, at some point, have it all working automatically,
>> but a lot for that for anonymous memory (and shmem similarly) is still
>> missing and unclear.
> 
> Whereas the use for large folios for filesystems is already automatic,
> so long as the filesystem supports it. We do this in readahead and write
> path already for iomap, we opportunistically use large folios if we can,
> otherwise we use smaller folios.
> 
> So a recommended approach by Matthew was to use the readahead and write
> path, just as in iomap to determine the size of the folio to use [0].
> The use of large folios would also be automatic and not require any
> knobs at all.

Yes, I remember discussing that with Willy at some point, including why 
shmem is unfortunately a bit more "special", because you might not even 
have a disk backend ("swap") at all where you could easily reclaim memory.

In the extreme form, you can consider SHMEM as memory that might be 
always mlocked, even without the user requiring special mlock limits ...

> 
> The mTHP approach would be growing the "THP" use in filesystems by the
> only single filesystem to use THP. Meanwhile use of large folios is already
> automatic with the approach taken by iomap.

Yes, it's the extension of existing shmem_enabled (that -- I'm afraid -- 
was added for good reasons).

> 
> We're at a crux where it does beg the question if we should continue to
> chug on with tmpfs being special and doing things differently extending
> the old THP interface with mTHP, or if it should just use large folios
> using the same approach as iomap did.

I'm afraid shmem will remain to some degree special. Fortunately it's 
not alone, hugetlbfs is even more special ;)

> 
>  From my perspective the more shared code the better, and the more shared
> paths the better. There is a chance to help test swap with large folios
> instead of splitting the folios for swap, and that would could be done
> first with tmpfs. I have not evaluated the difference in testing or how
> we could get the most of shared code if we take a mTHP approach or the
> iomap approach for tmpfs, that should be considered.

I don't have a clear picture yet of what might be best for ordinary shmem 
(IOW, not MAP_SHARED|MAP_ANON), and I'm afraid there is no easy answer.

As long as we don't end up wasting memory, it's not obviously bad. But 
some things might be tricky (see my example about large folios stranding 
in shmem and never being able to be really reclaimed+reused for better 
purposes)

I'll note that mTHP really is just (supposed to be) a user interface to 
enable the various folio sizes (well, and to expose better per-size 
stats), not more.

 From that point of view, it's just a filter. Enable all, and you get 
the same behavior as you likely would in the pagecache mode.

 From a shared-code and testing point of view, there really wouldn't be 
a lot of differences. Again, essentially just a filter.


> 
> Are there other things to consider? Does this require some dialog at
> LSFMM?

As raised in my reply to Daniel, I'll be at LSF/MM and happy to discuss. 
I'm also not a SHMEM expert, so I'm hoping at some point we'd get 
feedback from Hugh.
Daniel Gomez May 9, 2024, 7:18 p.m. UTC | #13
On Wed, May 08, 2024 at 07:03:57PM +0200, David Hildenbrand wrote:
> On 08.05.24 16:28, Daniel Gomez wrote:
> > On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
> > > On 08.05.24 13:39, Daniel Gomez wrote:
> > > > On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
> > > > > Anonymous pages have already been supported for multi-size (mTHP) allocation
> > > > > through commit 19eaf44954df, that can allow THP to be configured through the
> > > > > sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
> > > > > 
> > > > > However, the anonymous shared pages will ignore the anonymous mTHP rule
> > > > > configured through the sysfs interface, and can only use the PMD-mapped
> > > > > THP, that is not reasonable. Many implement anonymous page sharing through
> > > > > mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
> > > > > therefore, users expect to apply an unified mTHP strategy for anonymous pages,
> > > > > also including the anonymous shared pages, in order to enjoy the benefits of
> > > > > mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
> > > > > than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
> > > > > 
> > > > > The primary strategy is similar to supporting anonymous mTHP. Introduce
> > > > > a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> > > > > which can have all the same values as the top-level
> > > > > '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> > > > > additional "inherit" option. By default all sizes will be set to "never"
> > > > > except PMD size, which is set to "inherit". This ensures backward compatibility
> > > > > with the shmem enabled of the top level, meanwhile also allows independent
> > > > > control of shmem enabled for each mTHP.
> > > > 
> > > > I'm trying to understand the adoption of mTHP and how it fits into the adoption
> > > > of (large) folios that the kernel is moving towards. Can you, or anyone involved
> > > > here, explain this? How much do they overlap, and can we benefit from having
> > > > both? Is there any argument against the adoption of large folios here that I
> > > > might have missed?
> > > 
> > > mTHP are implemented using large folios, just like traditional PMD-sized THP
> > > are. (you really should explore the history of mTHP and how it all works
> > > internally)
> > 
> > I'll check more in deep the code. By any chance are any of you going to be at
> > LSFMM this year? I have this session [1] scheduled for Wednesday and it would
> > be nice to get your feedback on it and if you see this working together with
> > mTHP/THP.
> > 
> 
> I'll be around and will attend that session! But note that I am still
> scratching my head what to do with "ordinary" shmem, especially because of
> the weird way shmem behaves in contrast to real files (below). Some input
> from Hugh might be very helpful.

I'm looking forward to meeting you there and hearing your feedback!

> 
> Example: you write() to a shmem file and populate a 2M THP. Then, nobody
> touches that file for a long time. There are certainly other mmap() users
> that could better benefit from that THP ... and without swap that THP will
> be trapped there possibly a long time (unless I am missing an important
> piece of shmem THP design :) )? Sure, if we only have THP's it's nice,
> that's just not the reality unfortunately. IIRC, that's one of the reasons
> why THP for shmem can be enabled/disabled. But again, still scratching my
> head ...
> 
> 
> Note that this patch set only tackles anonymous shmem (MAP_SHARED|MAP_ANON),
> which is in 99.999% of all cases only accessed via page tables (memory
> allocated during page faults). I think there are ways to grab the fd
> (/proc/self/fd), but IIRC only corner cases read/write that.
> 
> So in that sense, anonymous shmem (this patch set) behaves mostly like
> ordinary anonymous memory, and likely there is not much overlap with other
> "allocate large folios during read/write/fallocate" as in [1]. swap might
> have an overlap.
> 
> 
> The real confusion begins when we have ordinary shmem: some users never mmap
> it and only read/write, some users never read/write it and only mmap it and
> some (less common?) users do both.
> 
> And shmem really is special: it looks like "just another file", but
> memory-consumption and reclaim wise it behaves just like anonymous memory.
> It might be swappable ("usually very limited backing disk space available")
> or it might not.
> 
> In a subthread here we are discussing what to do with that special
> "shmem_enabled = force" mode ... and it's all complicated I think.
> 
> > [1] https://lore.kernel.org/all/4ktpayu66noklllpdpspa3vm5gbmb5boxskcj2q6qn7md3pwwt@kvlu64pqwjzl/
> > 
> > > 
> > > The biggest challenge with memory that cannot be evicted on memory pressure
> > > to be reclaimed (in contrast to your ordinary files in the pagecache) is
> > > memory waste, well, and placement of large chunks of memory in general,
> > > during page faults.
> > > 
> > > In the worst case (no swap), you allocate a large chunk of memory once and
> > > it will stick around until freed: no reclaim of that memory.
> > 
> > I can see that path being triggered by some fstests but only for THP (where we
> > can actually reclaim memory).
> 
> Is that when we punch-hole a partial THP and split it? I'd be interested in
> what that test does.

The reclaim path I'm referring to is triggered when we reach max capacity
(-ENOSPC) in shmem_alloc_and_add_folio(). We reclaim space by splitting large
folios (regardless of their dirty or uptodate condition).

One of the tests that hits this path is generic/100 (with the huge option
enabled):
- First, it creates a directory structure in $TEMP_DIR (/tmp). The directory
  size is around 26M.
- Then, it tars it up into $TEMP_DIR/temp.tar.
- Finally, it untars the archive into $TEST_DIR (/media/test, which is the
  huge tmpfs mountdir).

What happens in generic/100 under the huge=always case is that you fill up the
dedicated space very quickly (this is 1G in xfstests for tmpfs) and then you
start reclaiming.

> 
> 
> 
> -- 
> Cheers,
> 
> David / dhildenb
>
Luis Chamberlain May 10, 2024, 6:53 p.m. UTC | #14
On Thu, May 09, 2024 at 07:48:46PM +0200, David Hildenbrand wrote:
> On 08.05.24 21:23, Luis Chamberlain wrote:
> >  From my perspective the more shared code the better, and the more shared
> > paths the better. There is a chance to help test swap with large folios
> > instead of splitting the folios for swap, and that would could be done
> > first with tmpfs. I have not evaluated the difference in testing or how
> > we could get the most of shared code if we take a mTHP approach or the
> > iomap approach for tmpfs, that should be considered.
> 
> I don't have a clear picture yet of what might be best for ordinary shmem
> (IOW, not MAP_SHARED|MAP_PRIVATE), and I'm afraid there is no easy answer.

OK, so it sounds like the different options need to be thought out and
reviewed.

> As long as we don't end up wasting memory, it's not obviously bad.

Sure.

> But some
> things might be tricky (see my example about large folios stranding in shmem
> and never being able to be really reclaimed+reused for better purposes)

Where is that stated BTW? Could that be resolved?

> I'll note that mTHP really is just (supposed to be) a user interface to
> enable the various folio sizes (well, and to expose better per-size stats),
> not more.

Sure, but given that filesystems using large folios don't have silly APIs
for choosing which large folios to enable, it just seems odd for tmpfs to
take a different approach.

> From that point of view, it's just a filter. Enable all, and you get the
> same behavior as you likely would in the pagecache mode.

Which begs the question: *why* have an API to constrain usage to certain
large folio sizes, which diverges from what filesystems are doing with large
folios?

> > Are there other things to consider? Does this require some dialog at
> > LSFMM?
> 
> As raised in my reply to Daniel, I'll be at LSF/MM and happy to discuss. I'm
> also not a SHMEM expert, so I'm hoping at some point we'd get feedback from
> Hugh.

Hugh, will you be at LSFMM?

  Luis