
[v5,0/6] add mTHP support for anonymous shmem

Message ID cover.1718090413.git.baolin.wang@linux.alibaba.com (mailing list archive)
Series: add mTHP support for anonymous shmem

Message

Baolin Wang June 11, 2024, 10:11 a.m. UTC
Anonymous pages have supported multi-size THP (mTHP) allocation since commit
19eaf44954df, which allows mTHP to be configured through the sysfs interface
located at '/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled'.

However, anonymous shmem ignores the mTHP rules configured through that sysfs
interface and can only use PMD-mapped THP, which is not reasonable. Many
applications implement anonymous page sharing through mmap(MAP_SHARED |
MAP_ANONYMOUS), especially in database usage scenarios, so users expect to
apply a unified mTHP strategy to all anonymous pages, including anonymous
shared pages, in order to enjoy the benefits of mTHP: for example, lower
latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP, and
contiguous PTEs on the ARM architecture to reduce TLB misses.
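
For reference, the usage pattern in question is simply a shared anonymous
mapping; a minimal illustrative sketch (the size and fork-based access pattern
are made up for illustration):

/*
 * Minimal sketch of anonymous page sharing via mmap(MAP_SHARED | MAP_ANONYMOUS):
 * the region is backed by shmem in the kernel, and data written by the child
 * is visible to the parent through the shared mapping.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	size_t len = 64UL * 1024 * 1024;	/* 64M shared anonymous region */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	if (fork() == 0) {
		strcpy(buf, "hello from the child");	/* touching the region faults it in */
		_exit(0);
	}

	wait(NULL);
	printf("parent reads: %s\n", buf);
	munmap(buf, len);
	return 0;
}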

As discussed in the bi-weekly MM meeting[1], the mTHP controls should control
all of shmem, not only anonymous shmem, but support will be added iteratively.
Therefore, this patch set starts with support for anonymous shmem.

The primary strategy is similar to the existing anonymous mTHP support:
introduce a new interface,
'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled', which
accepts almost the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', adding a new "inherit"
option and dropping the testing options 'force' and 'deny'. By default all
sizes are set to "never" except the PMD size, which is set to "inherit". This
ensures backward compatibility with the top-level anonymous shmem control,
while also allowing each mTHP size to be enabled independently for anonymous
shmem.
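
A minimal configuration sketch, assuming the per-size directory naming of the
existing anon mTHP knobs ('hugepages-64kB') and a kernel with this series
applied; writing the knob needs root, and reading it back shows the current
selection in brackets like the other THP sysfs controls:

/*
 * Sketch: enable 64K mTHP for anonymous shmem via the new per-size knob and
 * read the setting back. The path and value names follow the description
 * above ("always", "inherit", "within_size", "advise", "never").
 */
#include <stdio.h>

int main(void)
{
	const char *knob =
		"/sys/kernel/mm/transparent_hugepage/hugepages-64kB/shmem_enabled";
	char buf[128] = "";
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror("fopen");	/* needs root and a kernel with this series */
		return 1;
	}
	fputs("always\n", f);
	fclose(f);

	f = fopen(knob, "r");
	if (f && fgets(buf, sizeof(buf), f))
		printf("%s: %s", knob, buf);
	if (f)
		fclose(f);
	return 0;
}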

I used a page fault latency tool to measure the performance of 1G of anonymous
shmem with 32 threads on my machine (ARM64 architecture, 32 cores, 125G of
memory):
base: mm-unstable
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.04s        3.10s         83516.416                  2669684.890

mm-unstable + patchset, anon shmem mTHP disabled
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.02s        3.14s         82936.359                  2630746.027

mm-unstable + patchset, anon shmem 64K mTHP enabled
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.08s        0.31s         678630.231                 17082522.495

From the data above, the patchset has minimal impact when mTHP is not enabled
(some fluctuation was observed during testing). When 64K mTHP is enabled, the
page fault latency improves significantly.
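
The tool itself is not included here; a rough approximation of the kind of
measurement described (not the actual tool) is 32 threads each first-touching
a private slice of a 1G MAP_SHARED | MAP_ANONYMOUS region and timing the
faults:

/*
 * Rough approximation of the measurement above: each thread writes one byte
 * per page of its slice so every page of the shared anonymous region is
 * faulted in exactly once, and the total wall-clock time is reported.
 * Build with -pthread.
 */
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define NR_THREADS	32
#define TOTAL_SIZE	(1UL << 30)	/* 1G */

static char *region;

static void *touch_slice(void *arg)
{
	size_t slice = TOTAL_SIZE / NR_THREADS;
	char *start = region + (long)arg * slice;

	for (size_t off = 0; off < slice; off += 4096)
		start[off] = 1;		/* first touch -> page fault */
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_THREADS];
	struct timespec t0, t1;

	region = mmap(NULL, TOTAL_SIZE, PROT_READ | PROT_WRITE,
		      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < NR_THREADS; i++)
		pthread_create(&threads[i], NULL, touch_slice, (void *)i);
	for (long i = 0; i < NR_THREADS; i++)
		pthread_join(threads[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("faulted %lu MiB in %.3f s\n", TOTAL_SIZE >> 20,
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	munmap(region, TOTAL_SIZE);
	return 0;
}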

[1] https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/

Changes from v4:
 - Fix the unused variable warning reported by kernel test robot.
 - Drop the 'anon' prefix for variables and functions, per Daniel.

Changes from v3:
 - Drop 'force' and 'deny' testing options for each mTHP.
 - Use new helper update_mmu_tlb_range(), per Lance.
 - Update documentation to drop "anonymous thp" terminology, per David.
 - Initialize the 'suitable_orders' in shmem_alloc_and_add_folio(),
   reported by kernel test robot.
 - Fix the highest mTHP order in shmem_get_unmapped_area().
 - Update some commit messages.

Changes from v2:
 - Rebased to mm/mm-unstable.
 - Remove 'huge' parameter for shmem_alloc_and_add_folio(), per Lance.

Changes from v1:
 - Drop the patch that re-arranges the position of highest_order() and
   next_order(), per Ryan.
 - Modify the finish_fault() to fix VA alignment issue, per Ryan and
   David.
 - Fix some building issues, reported by Lance and kernel test robot.
 - Update some commit messages.

Changes from RFC:
 - Rebase the patch set against the new mm-unstable branch, per Lance.
 - Add a new patch to export highest_order() and next_order().
 - Add a new patch to align mTHP size in shmem_get_unmapped_area().
 - Handle the uffd case and the VMA limits case when building mapping for
   large folio in the finish_fault() function, per Ryan.
 - Remove unnecessary 'order' variable in patch 3, per Kefeng.
 - Keep the anon shmem counters' names consistent.
 - Modify the strategy to support mTHP for anonymous shmem, discussed with
   Ryan and David.
 - Add reviewed tag from Barry.
 - Update the commit message.

Baolin Wang (6):
  mm: memory: extend finish_fault() to support large folio
  mm: shmem: add THP validation for PMD-mapped THP related statistics
  mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  mm: shmem: add mTHP support for anonymous shmem
  mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
  mm: shmem: add mTHP counters for anonymous shmem

 Documentation/admin-guide/mm/transhuge.rst |  23 ++
 include/linux/huge_mm.h                    |  23 ++
 mm/huge_memory.c                           |  17 +-
 mm/memory.c                                |  57 +++-
 mm/shmem.c                                 | 344 ++++++++++++++++++---
 5 files changed, 403 insertions(+), 61 deletions(-)

Comments

Matthew Wilcox July 4, 2024, 6:43 p.m. UTC | #1
On Tue, Jun 11, 2024 at 06:11:04PM +0800, Baolin Wang wrote:
> Anonymous pages have already been supported for multi-size (mTHP) allocation
> through commit 19eaf44954df, that can allow THP to be configured through the
> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
> 
> However, the anonymous shmem will ignore the anonymous mTHP rule configured
> through the sysfs interface, and can only use the PMD-mapped THP, that is not
> reasonable. Many implement anonymous page sharing through mmap(MAP_SHARED |
> MAP_ANONYMOUS), especially in database usage scenarios, therefore, users expect
> to apply an unified mTHP strategy for anonymous pages, also including the
> anonymous shared pages, in order to enjoy the benefits of mTHP. For example,
> lower latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP,
> contiguous PTEs on ARM architecture to reduce TLB miss etc.

OK, this makes sense.

> As discussed in the bi-weekly MM meeting[1], the mTHP controls should control
> all of shmem, not only anonymous shmem, but support will be added iteratively.
> Therefore, this patch set starts with support for anonymous shmem.

But then this doesn't.  You say first that users want the same controls
to control all anonymous memory, then you introduce a completely
separate set of controls for shared anonymous memory.

shmem has two uses:

 - MAP_ANONYMOUS | MAP_SHARED (this patch set)
 - tmpfs

For the second use case we don't want controls *at all*, we want the
same heuristics used for all other filesystems to apply to tmpfs.

There's no reason to have separate controls for choosing folio size
in shmem.

> The primary strategy is similar to supporting anonymous mTHP. Introduce
> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> which can have almost the same values as the top-level
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> additional "inherit" option and dropping the testing options 'force' and
> 'deny'. By default all sizes will be set to "never" except PMD size, which
> is set to "inherit". This ensures backward compatibility with the anonymous
> shmem enabled of the top level, meanwhile also allows independent control of
> anonymous shmem enabled for each mTHP.
David Hildenbrand July 4, 2024, 7:03 p.m. UTC | #2
> shmem has two uses:
> 
>   - MAP_ANONYMOUS | MAP_SHARED (this patch set)
>   - tmpfs
> 
> For the second use case we don't want controls *at all*, we want the
> same heiristics used for all other filesystems to apply to tmpfs.

As discussed in the MM meeting, Hugh had a different opinion on that.
David Hildenbrand July 4, 2024, 7:19 p.m. UTC | #3
On 04.07.24 21:03, David Hildenbrand wrote:
>> shmem has two uses:
>>
>>    - MAP_ANONYMOUS | MAP_SHARED (this patch set)
>>    - tmpfs
>>
>> For the second use case we don't want controls *at all*, we want the
>> same heiristics used for all other filesystems to apply to tmpfs.
> 
> As discussed in the MM meeting, Hugh had a different opinion on that.

FWIW, I just recalled that I wrote a quick summary:

https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com

I believe the meetings are recorded as well, but I have never looked at the recordings.
Matthew Wilcox July 4, 2024, 7:49 p.m. UTC | #4
On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
> On 04.07.24 21:03, David Hildenbrand wrote:
> > > shmem has two uses:
> > > 
> > >    - MAP_ANONYMOUS | MAP_SHARED (this patch set)
> > >    - tmpfs
> > > 
> > > For the second use case we don't want controls *at all*, we want the
> > > same heiristics used for all other filesystems to apply to tmpfs.
> > 
> > As discussed in the MM meeting, Hugh had a different opinion on that.
> 
> FWIW, I just recalled that I wrote a quick summary:
> 
> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
> 
> I believe the meetings are recorded as well, but never looked at recordings.

That's not what I understood Hugh to mean.  To me, it seemed that Hugh
was expressing an opinion on using shmem as shmem, not as using it as
tmpfs.

If I misunderstood Hugh, well, I still disagree.  We should not have
separate controls for this.  tmpfs is just not that special.
Baolin Wang July 5, 2024, 5:47 a.m. UTC | #5
On 2024/7/5 03:49, Matthew Wilcox wrote:
> On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
>> On 04.07.24 21:03, David Hildenbrand wrote:
>>>> shmem has two uses:
>>>>
>>>>     - MAP_ANONYMOUS | MAP_SHARED (this patch set)
>>>>     - tmpfs
>>>>
>>>> For the second use case we don't want controls *at all*, we want the
>>>> same heiristics used for all other filesystems to apply to tmpfs.
>>>
>>> As discussed in the MM meeting, Hugh had a different opinion on that.
>>
>> FWIW, I just recalled that I wrote a quick summary:
>>
>> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
>>
>> I believe the meetings are recorded as well, but never looked at recordings.
> 
> That's not what I understood Hugh to mean.  To me, it seemed that Hugh
> was expressing an opinion on using shmem as shmem, not as using it as
> tmpfs.
> 
> If I misunderstood Hugh, well, I still disagree.  We should not have
> separate controls for this.  tmpfs is just not that special.

But now we already have a PMD-mapped THP control for tmpfs, and mTHP simply
extends this control to per-size granularity.

IIUC, as David mentioned before, for tmpfs, mTHP should act like a huge
order filter that the huge orders chosen in the write() and fallocate()
paths should respect. This would also solve the issue of allocating huge
orders in the writable mmap() path for tmpfs, as well as unifying the
interface.
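
To make the "huge order filter" idea concrete, a rough, hypothetical sketch
(the helper and the 'enabled_orders' mask are invented for illustration and
are not code from this series): the enabled mTHP sizes form a mask, and the
write()/fallocate() path picks the largest enabled order that still fits the
request.

/*
 * Hypothetical sketch of the "huge order filter" idea: 'enabled_orders'
 * stands in for the per-size shmem_enabled configuration, and the helper
 * returns the largest enabled order whose folio size still fits entirely
 * within the length of the write()/fallocate() request.
 */
#include <stdio.h>

#define PAGE_SHIFT	12
#define PMD_ORDER	9

static int pick_order(unsigned long enabled_orders, size_t len)
{
	for (int order = PMD_ORDER; order > 0; order--) {
		if (!(enabled_orders & (1UL << order)))
			continue;		/* this size is disabled */
		if ((1UL << (order + PAGE_SHIFT)) <= len)
			return order;		/* largest enabled order that fits */
	}
	return 0;			/* fall back to a single base page */
}

int main(void)
{
	unsigned long enabled = (1UL << 4) | (1UL << PMD_ORDER);	/* 64K + 2M enabled */

	printf("1M write  -> order %d\n", pick_order(enabled, 1UL << 20));	/* 4 (64K) */
	printf("4M write  -> order %d\n", pick_order(enabled, 4UL << 20));	/* 9 (2M) */
	printf("16K write -> order %d\n", pick_order(enabled, 16UL << 10));	/* 0 */
	return 0;
}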

Anyway, I will try to provide an RFC to discuss the mTHP for tmpfs approach.
Ryan Roberts July 5, 2024, 8:45 a.m. UTC | #6
On 05/07/2024 06:47, Baolin Wang wrote:
> 
> 
> On 2024/7/5 03:49, Matthew Wilcox wrote:
>> On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
>>> On 04.07.24 21:03, David Hildenbrand wrote:
>>>>> shmem has two uses:
>>>>>
>>>>>     - MAP_ANONYMOUS | MAP_SHARED (this patch set)
>>>>>     - tmpfs
>>>>>
>>>>> For the second use case we don't want controls *at all*, we want the
>>>>> same heiristics used for all other filesystems to apply to tmpfs.
>>>>
>>>> As discussed in the MM meeting, Hugh had a different opinion on that.
>>>
>>> FWIW, I just recalled that I wrote a quick summary:
>>>
>>> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
>>>
>>> I believe the meetings are recorded as well, but never looked at recordings.
>>
>> That's not what I understood Hugh to mean.  To me, it seemed that Hugh
>> was expressing an opinion on using shmem as shmem, not as using it as
>> tmpfs.
>>
>> If I misunderstood Hugh, well, I still disagree.  We should not have
>> separate controls for this.  tmpfs is just not that special.

I wasn't at the meeting that's being referred to, but I thought we previously
agreed that tmpfs *is* special because in some configurations it's not backed by
swap and so is locked in RAM?

> 
> But now we already have a PMD-mapped THP control for tmpfs, and mTHP simply
> extends this control to per-size.
> 
> IIUC, as David mentioned before, for tmpfs, mTHP should act like a huge order
> filter which should be respected by the expected huge orders in the write() and
> fallocate() paths. This would also solve the issue of allocating huge orders in
> writable mmap() path for tmpfs, as well as unifying the interface.
> 
> Anyway, I will try to provide an RFC to discuss the mTHP for tmpfs approach.
David Hildenbrand July 5, 2024, 8:59 a.m. UTC | #7
On 05.07.24 10:45, Ryan Roberts wrote:
> On 05/07/2024 06:47, Baolin Wang wrote:
>>
>>
>> On 2024/7/5 03:49, Matthew Wilcox wrote:
>>> On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
>>>> On 04.07.24 21:03, David Hildenbrand wrote:
>>>>>> shmem has two uses:
>>>>>>
>>>>>>      - MAP_ANONYMOUS | MAP_SHARED (this patch set)
>>>>>>      - tmpfs
>>>>>>
>>>>>> For the second use case we don't want controls *at all*, we want the
>>>>>> same heiristics used for all other filesystems to apply to tmpfs.
>>>>>
>>>>> As discussed in the MM meeting, Hugh had a different opinion on that.
>>>>
>>>> FWIW, I just recalled that I wrote a quick summary:
>>>>
>>>> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
>>>>
>>>> I believe the meetings are recorded as well, but never looked at recordings.
>>>
>>> That's not what I understood Hugh to mean.  To me, it seemed that Hugh
>>> was expressing an opinion on using shmem as shmem, not as using it as
>>> tmpfs.
>>>
>>> If I misunderstood Hugh, well, I still disagree.  We should not have
>>> separate controls for this.  tmpfs is just not that special.
> 
> I wasn't at the meeting that's being referred to, but I thought we previously
> agreed that tmpfs *is* special because in some configurations its not backed by
> swap so is locked in ram?

There are multiple things to that, like:

* Machines only having limited/no swap configured
* tmpfs can be configured to never go to swap
* memfd/tmpfs files getting used purely for mmap(): there is no real
   difference to MAP_ANON|MAP_SHARED besides the processes we share that
   memory with.

Especially when it comes to memory waste concerns and access behavior in 
some cases, tmpfs behaves much more like anonymous memory. But there are 
certainly other use cases where tmpfs is not that special.

My opinion is that we need to let people configure orders (if you feel 
like it, configure all), but *select* the order to allocate based on 
readahead information -- in contrast to anonymous memory where we start 
at the highest order and don't have readahead information available.

Maybe we need different "order allocation" logic for read/write vs. 
fault, not sure.

But I don't maintain that code, so I can only give stupid suggestions 
and repeat what I understood from the meeting with Hugh and Kirill :)
Ryan Roberts July 5, 2024, 9:13 a.m. UTC | #8
On 05/07/2024 09:59, David Hildenbrand wrote:
> On 05.07.24 10:45, Ryan Roberts wrote:
>> On 05/07/2024 06:47, Baolin Wang wrote:
>>>
>>>
>>> On 2024/7/5 03:49, Matthew Wilcox wrote:
>>>> On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
>>>>> On 04.07.24 21:03, David Hildenbrand wrote:
>>>>>>> shmem has two uses:
>>>>>>>
>>>>>>>      - MAP_ANONYMOUS | MAP_SHARED (this patch set)
>>>>>>>      - tmpfs
>>>>>>>
>>>>>>> For the second use case we don't want controls *at all*, we want the
>>>>>>> same heiristics used for all other filesystems to apply to tmpfs.
>>>>>>
>>>>>> As discussed in the MM meeting, Hugh had a different opinion on that.
>>>>>
>>>>> FWIW, I just recalled that I wrote a quick summary:
>>>>>
>>>>> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
>>>>>
>>>>> I believe the meetings are recorded as well, but never looked at recordings.
>>>>
>>>> That's not what I understood Hugh to mean.  To me, it seemed that Hugh
>>>> was expressing an opinion on using shmem as shmem, not as using it as
>>>> tmpfs.
>>>>
>>>> If I misunderstood Hugh, well, I still disagree.  We should not have
>>>> separate controls for this.  tmpfs is just not that special.
>>
>> I wasn't at the meeting that's being referred to, but I thought we previously
>> agreed that tmpfs *is* special because in some configurations its not backed by
>> swap so is locked in ram?
> 
> There are multiple things to that, like:
> 
> * Machines only having limited/no swap configured
> * tmpfs can be configured to never go to swap
> * memfd/tmpfs files getting used purely for mmap(): there is no real
>   difference to MAP_ANON|MAP_SHARE besides the processes we share that
>   memory with.
> 
> Especially when it comes to memory waste concerns and access behavior in some
> cases, tmpfs behaved much more like anonymous memory. But there are for sure
> other use cases where tmpfs is not that special.
> 
> My opinion is that we need to let people configure orders (if you feel like it,
> configure all), but *select* the order to allocate based on readahead
> information -- in contrast to anonymous memory where we start at the highest
> order and don't have readahead information available.

That approach is exactly what I proposed to start playing with yesterday [1] for
regular pagecache folio allocations too :)

[1] https://lore.kernel.org/linux-mm/bdde4008-60db-4717-a6b5-53d77ab76bdb@arm.com/

> 
> Maybe we need different "order allcoation" logic for read/write vs. fault, not
> sure.
> 
> But I don't maintain that code, so I can only give stupid suggestions and repeat
> what I understood from the meeting with Hugh and Kirill :)
>
David Hildenbrand July 5, 2024, 9:16 a.m. UTC | #9
On 05.07.24 11:13, Ryan Roberts wrote:
> On 05/07/2024 09:59, David Hildenbrand wrote:
>> On 05.07.24 10:45, Ryan Roberts wrote:
>>> On 05/07/2024 06:47, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/7/5 03:49, Matthew Wilcox wrote:
>>>>> On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
>>>>>> On 04.07.24 21:03, David Hildenbrand wrote:
>>>>>>>> shmem has two uses:
>>>>>>>>
>>>>>>>>       - MAP_ANONYMOUS | MAP_SHARED (this patch set)
>>>>>>>>       - tmpfs
>>>>>>>>
>>>>>>>> For the second use case we don't want controls *at all*, we want the
>>>>>>>> same heiristics used for all other filesystems to apply to tmpfs.
>>>>>>>
>>>>>>> As discussed in the MM meeting, Hugh had a different opinion on that.
>>>>>>
>>>>>> FWIW, I just recalled that I wrote a quick summary:
>>>>>>
>>>>>> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
>>>>>>
>>>>>> I believe the meetings are recorded as well, but never looked at recordings.
>>>>>
>>>>> That's not what I understood Hugh to mean.  To me, it seemed that Hugh
>>>>> was expressing an opinion on using shmem as shmem, not as using it as
>>>>> tmpfs.
>>>>>
>>>>> If I misunderstood Hugh, well, I still disagree.  We should not have
>>>>> separate controls for this.  tmpfs is just not that special.
>>>
>>> I wasn't at the meeting that's being referred to, but I thought we previously
>>> agreed that tmpfs *is* special because in some configurations its not backed by
>>> swap so is locked in ram?
>>
>> There are multiple things to that, like:
>>
>> * Machines only having limited/no swap configured
>> * tmpfs can be configured to never go to swap
>> * memfd/tmpfs files getting used purely for mmap(): there is no real
>>    difference to MAP_ANON|MAP_SHARE besides the processes we share that
>>    memory with.
>>
>> Especially when it comes to memory waste concerns and access behavior in some
>> cases, tmpfs behaved much more like anonymous memory. But there are for sure
>> other use cases where tmpfs is not that special.
>>
>> My opinion is that we need to let people configure orders (if you feel like it,
>> configure all), but *select* the order to allocate based on readahead
>> information -- in contrast to anonymous memory where we start at the highest
>> order and don't have readahead information available.
> 
> That approach is exactly what I proposed to start playing with yesterday [1] for
> regular pagecache folio allocations too :)

In German, there is this saying "zwei Dumme ein Gedanke".

The official English alternative is "great minds think alike".

... well, the direct German->English translation definitely has a 
"German touch" to it: "two stupid ones one thought"
Ryan Roberts July 5, 2024, 9:23 a.m. UTC | #10
On 05/07/2024 10:16, David Hildenbrand wrote:
> On 05.07.24 11:13, Ryan Roberts wrote:
>> On 05/07/2024 09:59, David Hildenbrand wrote:
>>> On 05.07.24 10:45, Ryan Roberts wrote:
>>>> On 05/07/2024 06:47, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/7/5 03:49, Matthew Wilcox wrote:
>>>>>> On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
>>>>>>> On 04.07.24 21:03, David Hildenbrand wrote:
>>>>>>>>> shmem has two uses:
>>>>>>>>>
>>>>>>>>>       - MAP_ANONYMOUS | MAP_SHARED (this patch set)
>>>>>>>>>       - tmpfs
>>>>>>>>>
>>>>>>>>> For the second use case we don't want controls *at all*, we want the
>>>>>>>>> same heiristics used for all other filesystems to apply to tmpfs.
>>>>>>>>
>>>>>>>> As discussed in the MM meeting, Hugh had a different opinion on that.
>>>>>>>
>>>>>>> FWIW, I just recalled that I wrote a quick summary:
>>>>>>>
>>>>>>> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
>>>>>>>
>>>>>>> I believe the meetings are recorded as well, but never looked at recordings.
>>>>>>
>>>>>> That's not what I understood Hugh to mean.  To me, it seemed that Hugh
>>>>>> was expressing an opinion on using shmem as shmem, not as using it as
>>>>>> tmpfs.
>>>>>>
>>>>>> If I misunderstood Hugh, well, I still disagree.  We should not have
>>>>>> separate controls for this.  tmpfs is just not that special.
>>>>
>>>> I wasn't at the meeting that's being referred to, but I thought we previously
>>>> agreed that tmpfs *is* special because in some configurations its not backed by
>>>> swap so is locked in ram?
>>>
>>> There are multiple things to that, like:
>>>
>>> * Machines only having limited/no swap configured
>>> * tmpfs can be configured to never go to swap
>>> * memfd/tmpfs files getting used purely for mmap(): there is no real
>>>    difference to MAP_ANON|MAP_SHARE besides the processes we share that
>>>    memory with.
>>>
>>> Especially when it comes to memory waste concerns and access behavior in some
>>> cases, tmpfs behaved much more like anonymous memory. But there are for sure
>>> other use cases where tmpfs is not that special.
>>>
>>> My opinion is that we need to let people configure orders (if you feel like it,
>>> configure all), but *select* the order to allocate based on readahead
>>> information -- in contrast to anonymous memory where we start at the highest
>>> order and don't have readahead information available.
>>
>> That approach is exactly what I proposed to start playing with yesterday [1] for
>> regular pagecache folio allocations too :)
> 
> In German, there is this saying "zwei Dumme ein Gedanke".
> 
> The official English alternative is "great minds think alike".
> 
> ... well, the direct German->English translation definitely has a "German touch"
> to it: "two stupid ones one thought"

I definitely prefer the direct translation. :)
Daniel Gomez July 7, 2024, 4:39 p.m. UTC | #11
On Fri, Jul 05, 2024 at 10:59:02AM GMT, David Hildenbrand wrote:
> On 05.07.24 10:45, Ryan Roberts wrote:
> > On 05/07/2024 06:47, Baolin Wang wrote:
> > > 
> > > 
> > > On 2024/7/5 03:49, Matthew Wilcox wrote:
> > > > On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
> > > > > On 04.07.24 21:03, David Hildenbrand wrote:
> > > > > > > shmem has two uses:
> > > > > > > 
> > > > > > >      - MAP_ANONYMOUS | MAP_SHARED (this patch set)
> > > > > > >      - tmpfs
> > > > > > > 
> > > > > > > For the second use case we don't want controls *at all*, we want the
> > > > > > > same heiristics used for all other filesystems to apply to tmpfs.
> > > > > > 
> > > > > > As discussed in the MM meeting, Hugh had a different opinion on that.
> > > > > 
> > > > > FWIW, I just recalled that I wrote a quick summary:
> > > > > 
> > > > > https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
> > > > > 
> > > > > I believe the meetings are recorded as well, but never looked at recordings.
> > > > 
> > > > That's not what I understood Hugh to mean.  To me, it seemed that Hugh
> > > > was expressing an opinion on using shmem as shmem, not as using it as
> > > > tmpfs.
> > > > 
> > > > If I misunderstood Hugh, well, I still disagree.  We should not have
> > > > separate controls for this.  tmpfs is just not that special.
> > 
> > I wasn't at the meeting that's being referred to, but I thought we previously
> > agreed that tmpfs *is* special because in some configurations its not backed by
> > swap so is locked in ram?
> 
> There are multiple things to that, like:
> 
> * Machines only having limited/no swap configured
> * tmpfs can be configured to never go to swap
> * memfd/tmpfs files getting used purely for mmap(): there is no real
>   difference to MAP_ANON|MAP_SHARE besides the processes we share that
>   memory with.
> 
> Especially when it comes to memory waste concerns and access behavior in
> some cases, tmpfs behaved much more like anonymous memory. But there are for
> sure other use cases where tmpfs is not that special.

Having controls to select the allowable folio order allocations for
tmpfs does not address any of these issues. The suggested filesystem
approach [1] involves allocating larger orders, but always the same
total size you would allocate when using order-0 folios. So it's a
conservative approach. Using mTHP knobs in tmpfs would cause:
* Over-allocation when using mTHP and/or THP under the 'always' flag.
* Allocation in bigger chunks in a non-optimal way when not all mTHP
and THP orders are enabled.
* Operation in a similar manner as in [1] when all mTHP and THP orders
are enabled and the 'within_size' flag is used (assuming we use patch 11
from [1]).

[1] Last 3 patches of these series:
https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/

My understanding of why mTHP was preferred is to raise awareness in
user space and allow tmpfs mounts used at boot time to operate in
'safe' mode (no large folios). Does it make more sense to have a single
large-folios enable flag to control order allocation, as in [1], instead
of a knob for every possible order?

> 
> My opinion is that we need to let people configure orders (if you feel like
> it, configure all), but *select* the order to allocate based on readahead
> information -- in contrast to anonymous memory where we start at the highest
> order and don't have readahead information available.
> 
> Maybe we need different "order allcoation" logic for read/write vs. fault,
> not sure.

I would suggest [1] using the size of the write for the write
and fallocate paths. But when does it make sense to use readahead
information? Maybe when swap is involved?

> 
> But I don't maintain that code, so I can only give stupid suggestions and
> repeat what I understood from the meeting with Hugh and Kirill :)
> 
> -- 
> Cheers,
> 
> David / dhildenb
>
Ryan Roberts July 9, 2024, 8:28 a.m. UTC | #12
On 07/07/2024 17:39, Daniel Gomez wrote:
> On Fri, Jul 05, 2024 at 10:59:02AM GMT, David Hildenbrand wrote:
>> On 05.07.24 10:45, Ryan Roberts wrote:
>>> On 05/07/2024 06:47, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/7/5 03:49, Matthew Wilcox wrote:
>>>>> On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
>>>>>> On 04.07.24 21:03, David Hildenbrand wrote:
>>>>>>>> shmem has two uses:
>>>>>>>>
>>>>>>>>      - MAP_ANONYMOUS | MAP_SHARED (this patch set)
>>>>>>>>      - tmpfs
>>>>>>>>
>>>>>>>> For the second use case we don't want controls *at all*, we want the
>>>>>>>> same heiristics used for all other filesystems to apply to tmpfs.
>>>>>>>
>>>>>>> As discussed in the MM meeting, Hugh had a different opinion on that.
>>>>>>
>>>>>> FWIW, I just recalled that I wrote a quick summary:
>>>>>>
>>>>>> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
>>>>>>
>>>>>> I believe the meetings are recorded as well, but never looked at recordings.
>>>>>
>>>>> That's not what I understood Hugh to mean.  To me, it seemed that Hugh
>>>>> was expressing an opinion on using shmem as shmem, not as using it as
>>>>> tmpfs.
>>>>>
>>>>> If I misunderstood Hugh, well, I still disagree.  We should not have
>>>>> separate controls for this.  tmpfs is just not that special.
>>>
>>> I wasn't at the meeting that's being referred to, but I thought we previously
>>> agreed that tmpfs *is* special because in some configurations its not backed by
>>> swap so is locked in ram?
>>
>> There are multiple things to that, like:
>>
>> * Machines only having limited/no swap configured
>> * tmpfs can be configured to never go to swap
>> * memfd/tmpfs files getting used purely for mmap(): there is no real
>>   difference to MAP_ANON|MAP_SHARE besides the processes we share that
>>   memory with.
>>
>> Especially when it comes to memory waste concerns and access behavior in
>> some cases, tmpfs behaved much more like anonymous memory. But there are for
>> sure other use cases where tmpfs is not that special.
> 
> Having controls to select the allowable folio order allocations for
> tmpfs does not address any of these issues. The suggested filesystem
> approach [1] involves allocating orders in larger chunks, but always
> the same size you would allocate when using order-0 folios. 

Well you can't know that you will never allocate more. If you allocate a 2M
block, you probably have some good readahead data that tells you you are likely
to keep reading sequentially, but you don't know for sure that the application
won't stop after just 4K.

> So,
> it's a conservative approach. Using mTHP knobs in tmpfs would cause:
> * Over allocation when using mTHP and/ord THP under the 'always' flag.
> * Allocate in bigger chunks in a non optimal way, when
> not all mTHP and THP orders are enabled.
> * Operate in a similar manner as in [1] when all mTHP and THP orders
> are enabled and 'within_size' flag is used (assuming we use patch 11
> from [1]).

Large folios may still be considered scarce resources even if the amount of
memory allocated is still the same. And if shmem isn't backed by swap then once
you have allocated a large folio for shmem, it is stuck in shmem, even if it
would be better used somewhere else.

And it's possible (likely even, in my opinion) that allocating lots of different
folio sizes will exacerbate memory fragmentation, leading to more order-0
fallbacks, which would hurt the overall system performance in the long run, vs
restricting to a couple of folio sizes.

I'm starting some work to actually measure how limiting the folio sizes
allocated for page cache memory can help reduce large folio allocation failure
overall. My hypothesis is that the data will show us that in an environment like
Android, where memory pressure is high, limiting everything to order-0 and
order-4 will significantly improve the allocation success rate of order-4. Let's
see.

> 
> [1] Last 3 patches of these series:
> https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
> 
> My understanding of why mTHP was preferred is to raise awareness in
> user space and allow tmpfs mounts used at boot time to operate in
> 'safe' mode (no large folios). Does it make more sense to have a large
> folios enable flag to control order allocation as in [1], instead of
> every single order possible?

My intuition is towards every order possible, as per above. Let's see what the
data tells us.

> 
>>
>> My opinion is that we need to let people configure orders (if you feel like
>> it, configure all), but *select* the order to allocate based on readahead
>> information -- in contrast to anonymous memory where we start at the highest
>> order and don't have readahead information available.
>>
>> Maybe we need different "order allcoation" logic for read/write vs. fault,
>> not sure.
> 
> I would suggest [1] the file size of the write for the write
> and fallocate paths. But when does make sense to use readahead
> information? Maybe when swap is involved?
> 
>>
>> But I don't maintain that code, so I can only give stupid suggestions and
>> repeat what I understood from the meeting with Hugh and Kirill :)
>>
>> -- 
>> Cheers,
>>
>> David / dhildenb
Daniel Gomez July 16, 2024, 1:11 p.m. UTC | #13
On Tue, Jul 09, 2024 at 09:28:48AM GMT, Ryan Roberts wrote:
> On 07/07/2024 17:39, Daniel Gomez wrote:
> > On Fri, Jul 05, 2024 at 10:59:02AM GMT, David Hildenbrand wrote:
> >> On 05.07.24 10:45, Ryan Roberts wrote:
> >>> On 05/07/2024 06:47, Baolin Wang wrote:
> >>>>
> >>>>
> >>>> On 2024/7/5 03:49, Matthew Wilcox wrote:
> >>>>> On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
> >>>>>> On 04.07.24 21:03, David Hildenbrand wrote:
> >>>>>>>> shmem has two uses:
> >>>>>>>>
> >>>>>>>>      - MAP_ANONYMOUS | MAP_SHARED (this patch set)
> >>>>>>>>      - tmpfs
> >>>>>>>>
> >>>>>>>> For the second use case we don't want controls *at all*, we want the
> >>>>>>>> same heiristics used for all other filesystems to apply to tmpfs.
> >>>>>>>
> >>>>>>> As discussed in the MM meeting, Hugh had a different opinion on that.
> >>>>>>
> >>>>>> FWIW, I just recalled that I wrote a quick summary:
> >>>>>>
> >>>>>> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
> >>>>>>
> >>>>>> I believe the meetings are recorded as well, but never looked at recordings.
> >>>>>
> >>>>> That's not what I understood Hugh to mean.  To me, it seemed that Hugh
> >>>>> was expressing an opinion on using shmem as shmem, not as using it as
> >>>>> tmpfs.
> >>>>>
> >>>>> If I misunderstood Hugh, well, I still disagree.  We should not have
> >>>>> separate controls for this.  tmpfs is just not that special.
> >>>
> >>> I wasn't at the meeting that's being referred to, but I thought we previously
> >>> agreed that tmpfs *is* special because in some configurations its not backed by
> >>> swap so is locked in ram?
> >>
> >> There are multiple things to that, like:
> >>
> >> * Machines only having limited/no swap configured
> >> * tmpfs can be configured to never go to swap
> >> * memfd/tmpfs files getting used purely for mmap(): there is no real
> >>   difference to MAP_ANON|MAP_SHARE besides the processes we share that
> >>   memory with.
> >>
> >> Especially when it comes to memory waste concerns and access behavior in
> >> some cases, tmpfs behaved much more like anonymous memory. But there are for
> >> sure other use cases where tmpfs is not that special.
> > 
> > Having controls to select the allowable folio order allocations for
> > tmpfs does not address any of these issues. The suggested filesystem
> > approach [1] involves allocating orders in larger chunks, but always
> > the same size you would allocate when using order-0 folios. 
> 
> Well you can't know that you will never allocate more. If you allocate a 2M

In the fs large folio approach implementation [1], the allocation of a 2M folio
(or any non-order-0 folio) occurs when the size of the write/fallocate is 2M
(and the index is aligned).

> block, you probably have some good readahead data that tells you you are likely
> to keep reading sequentially, but you don't know for sure that the application
> won't stop after just 4K.

Is shmem_file_read_iter() getting readahead data to perform the read? Or what do
you mean exactly?

In [1], reads are performed in 4K chunks, so I think this does not apply.

> 
> > So,
> > it's a conservative approach. Using mTHP knobs in tmpfs would cause:
> > * Over allocation when using mTHP and/ord THP under the 'always' flag.
> > * Allocate in bigger chunks in a non optimal way, when
> > not all mTHP and THP orders are enabled.
> > * Operate in a similar manner as in [1] when all mTHP and THP orders
> > are enabled and 'within_size' flag is used (assuming we use patch 11
> > from [1]).
> 
> Large folios may still be considered scarce resources even if the amount of
> memory allocated is still the same. And if shmem isn't backed by swap then once
> you have allocated a large folio for shmem, it is stuck in shmem, even if it
> would be better used somewhere else.

Is that true for tmpfs as well? We have shmem_unused_huge_shrink() that will
reclaim unused large folios (on ENOSPC and via free_cached_objects()). Can't we
reuse that when the system is under memory pressure?

> 
> And it's possible (likely even, in my opinion) that allocating lots of different
> folio sizes will exacerbate memory fragmentation, leading to more order-0
> fallbacks, which would hurt the overall system performance in the long run, vs
> restricting to a couple of folio sizes.

Since we are transitioning to large folios in other filesystems, the impact
of restricting the order here will only depend on the extent of tmpfs usage
relative to the rest of the system. Luis discussed the topic of mm fragmentation
and measurement in a session at LSFMM this year [2].

[2] https://lore.kernel.org/all/ZkUOXQvVjXP1T6Nk@bombadil.infradead.org/

> 
> I'm starting some work to actually measure how limiting the folio sizes
> allocated for page cache memory can help reduce large folio allocation failure

It would be great to hear more about that effort.

> overall. My hypothesis is that the data will show us that in an environment like
> Android, where memory pressure is high, limiting everything to order-0 and
> order-4 will significantly improve the allocation success rate of order-4. Let's
> see.
> 
> > 
> > [1] Last 3 patches of these series:
> > https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
> > 
> > My understanding of why mTHP was preferred is to raise awareness in
> > user space and allow tmpfs mounts used at boot time to operate in
> > 'safe' mode (no large folios). Does it make more sense to have a large
> > folios enable flag to control order allocation as in [1], instead of
> > every single order possible?
> 
> My intuition is towards every order possible, as per above. Let's see what the
> data tells us.
> 
> > 
> >>
> >> My opinion is that we need to let people configure orders (if you feel like
> >> it, configure all), but *select* the order to allocate based on readahead
> >> information -- in contrast to anonymous memory where we start at the highest
> >> order and don't have readahead information available.
> >>
> >> Maybe we need different "order allcoation" logic for read/write vs. fault,
> >> not sure.
> > 
> > I would suggest [1] the file size of the write for the write
> > and fallocate paths. But when does make sense to use readahead
> > information? Maybe when swap is involved?
> > 
> >>
> >> But I don't maintain that code, so I can only give stupid suggestions and
> >> repeat what I understood from the meeting with Hugh and Kirill :)
> >>
> >> -- 
> >> Cheers,
> >>
> >> David / dhildenb
>
David Hildenbrand July 16, 2024, 1:22 p.m. UTC | #14
On 16.07.24 15:11, Daniel Gomez wrote:
> On Tue, Jul 09, 2024 at 09:28:48AM GMT, Ryan Roberts wrote:
>> On 07/07/2024 17:39, Daniel Gomez wrote:
>>> On Fri, Jul 05, 2024 at 10:59:02AM GMT, David Hildenbrand wrote:
>>>> On 05.07.24 10:45, Ryan Roberts wrote:
>>>>> On 05/07/2024 06:47, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/7/5 03:49, Matthew Wilcox wrote:
>>>>>>> On Thu, Jul 04, 2024 at 09:19:10PM +0200, David Hildenbrand wrote:
>>>>>>>> On 04.07.24 21:03, David Hildenbrand wrote:
>>>>>>>>>> shmem has two uses:
>>>>>>>>>>
>>>>>>>>>>       - MAP_ANONYMOUS | MAP_SHARED (this patch set)
>>>>>>>>>>       - tmpfs
>>>>>>>>>>
>>>>>>>>>> For the second use case we don't want controls *at all*, we want the
>>>>>>>>>> same heiristics used for all other filesystems to apply to tmpfs.
>>>>>>>>>
>>>>>>>>> As discussed in the MM meeting, Hugh had a different opinion on that.
>>>>>>>>
>>>>>>>> FWIW, I just recalled that I wrote a quick summary:
>>>>>>>>
>>>>>>>> https://lkml.kernel.org/r/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com
>>>>>>>>
>>>>>>>> I believe the meetings are recorded as well, but never looked at recordings.
>>>>>>>
>>>>>>> That's not what I understood Hugh to mean.  To me, it seemed that Hugh
>>>>>>> was expressing an opinion on using shmem as shmem, not as using it as
>>>>>>> tmpfs.
>>>>>>>
>>>>>>> If I misunderstood Hugh, well, I still disagree.  We should not have
>>>>>>> separate controls for this.  tmpfs is just not that special.
>>>>>
>>>>> I wasn't at the meeting that's being referred to, but I thought we previously
>>>>> agreed that tmpfs *is* special because in some configurations its not backed by
>>>>> swap so is locked in ram?
>>>>
>>>> There are multiple things to that, like:
>>>>
>>>> * Machines only having limited/no swap configured
>>>> * tmpfs can be configured to never go to swap
>>>> * memfd/tmpfs files getting used purely for mmap(): there is no real
>>>>    difference to MAP_ANON|MAP_SHARE besides the processes we share that
>>>>    memory with.
>>>>
>>>> Especially when it comes to memory waste concerns and access behavior in
>>>> some cases, tmpfs behaved much more like anonymous memory. But there are for
>>>> sure other use cases where tmpfs is not that special.
>>>
>>> Having controls to select the allowable folio order allocations for
>>> tmpfs does not address any of these issues. The suggested filesystem
>>> approach [1] involves allocating orders in larger chunks, but always
>>> the same size you would allocate when using order-0 folios.
>>
>> Well you can't know that you will never allocate more. If you allocate a 2M
> 
> In the fs large folio approach implementation [1], the allocation of a 2M (or
> any non order-0) occurs when the size of the write/fallocate is 2M (and index
> is aligned).

I don't have time right now to follow the discussion in detail here (I 
thought we had a meeting to discuss that and received guidance from 
Hugh?), but I'll point out two things:

(1) We need a reasonable model for handling/allocating large folios
     during page faults. shmem/tmpfs can be used just like anon-shmem if
     you only ever mmap the thing (hello VMs!).

(2) Hugh gave (IMHO) clear feedback during the meeting on how he thinks we
     should approach large folios in shmem.

Maybe I got (2) all wrong and people can point out all the issues in my 
summary from the meeting.

Otherwise, if people don't want to accept the result from that meeting, 
we need further guidance from Hugh.