
[RFC,v2,05/17] mm: Routines to determine max anon folio allocation order

Message ID 20230414130303.2345383-6-ryan.roberts@arm.com (mailing list archive)
State New
Series variable-order, large folios for anonymous memory

Commit Message

Ryan Roberts April 14, 2023, 1:02 p.m. UTC
For variable-order anonymous folios, we want to tune the order that we
prefer to allocate based on the vma. Add the routines to manage that
heuristic.

TODO: Currently we always use the global maximum. Add per-vma logic!

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/mm.h | 5 +++++
 mm/memory.c        | 8 ++++++++
 2 files changed, 13 insertions(+)

--
2.25.1

Comments

Kirill A. Shutemov April 14, 2023, 2:09 p.m. UTC | #1
On Fri, Apr 14, 2023 at 02:02:51PM +0100, Ryan Roberts wrote:
> For variable-order anonymous folios, we want to tune the order that we
> prefer to allocate based on the vma. Add the routines to manage that
> heuristic.
> 
> TODO: Currently we always use the global maximum. Add per-vma logic!
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/mm.h | 5 +++++
>  mm/memory.c        | 8 ++++++++
>  2 files changed, 13 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cdb8c6031d0f..cc8d0b239116 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3674,4 +3674,9 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>  }
>  #endif
> 
> +/*
> + * TODO: Should this be set per-architecture?
> + */
> +#define ANON_FOLIO_ORDER_MAX	4
> +

I think it has to be derived from a size in bytes, not specified directly as
a page order. For 4K pages, order 4 is 64K, but for 64K pages it is 1M.
Ryan Roberts April 14, 2023, 2:38 p.m. UTC | #2
On 14/04/2023 15:09, Kirill A. Shutemov wrote:
> On Fri, Apr 14, 2023 at 02:02:51PM +0100, Ryan Roberts wrote:
>> For variable-order anonymous folios, we want to tune the order that we
>> prefer to allocate based on the vma. Add the routines to manage that
>> heuristic.
>>
>> TODO: Currently we always use the global maximum. Add per-vma logic!
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/mm.h | 5 +++++
>>  mm/memory.c        | 8 ++++++++
>>  2 files changed, 13 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index cdb8c6031d0f..cc8d0b239116 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3674,4 +3674,9 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>>  }
>>  #endif
>>
>> +/*
>> + * TODO: Should this be set per-architecture?
>> + */
>> +#define ANON_FOLIO_ORDER_MAX	4
>> +
> 
> I think it has to be derived from size in bytes, not directly specifies
> page order. For 4K pages, order 4 is 64k and for 64k pages it is 1M.
> 

Yes I see where you are coming from. What's your feel for what a sensible upper
bound in bytes is?

My difficulty is that I would like to be able to use this allocation mechanism
to enable using the "contiguous bit" on arm64; that's a set of contiguous PTEs
that are mapped to physically contiguous memory, and the HW can use that hint to
coalesce the TLB entries.

For 4KB pages, the contig size is 64KB (order-4), so that works nicely. But for
16KB and 64KB pages, it's 2MB (order-7 and order-5 respectively). Do you think
allocating 2MB pages here is going to lead to too much memory wastage?
Kirill A. Shutemov April 14, 2023, 3:37 p.m. UTC | #3
On Fri, Apr 14, 2023 at 03:38:35PM +0100, Ryan Roberts wrote:
> On 14/04/2023 15:09, Kirill A. Shutemov wrote:
> > On Fri, Apr 14, 2023 at 02:02:51PM +0100, Ryan Roberts wrote:
> >> For variable-order anonymous folios, we want to tune the order that we
> >> prefer to allocate based on the vma. Add the routines to manage that
> >> heuristic.
> >>
> >> TODO: Currently we always use the global maximum. Add per-vma logic!
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  include/linux/mm.h | 5 +++++
> >>  mm/memory.c        | 8 ++++++++
> >>  2 files changed, 13 insertions(+)
> >>
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index cdb8c6031d0f..cc8d0b239116 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -3674,4 +3674,9 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> >>  }
> >>  #endif
> >>
> >> +/*
> >> + * TODO: Should this be set per-architecture?
> >> + */
> >> +#define ANON_FOLIO_ORDER_MAX	4
> >> +
> > 
> > I think it has to be derived from size in bytes, not directly specifies
> > page order. For 4K pages, order 4 is 64k and for 64k pages it is 1M.
> > 
> 
> Yes I see where you are coming from. What's your feel for what a sensible upper
> bound in bytes is?
> 
> My difficulty is that I would like to be able to use this allocation mechanism
> to enable using the "contiguous bit" on arm64; that's a set of contiguous PTEs
> that are mapped to physically contiguous memory, and the HW can use that hint to
> coalesce the TLB entries.
> 
> For 4KB pages, the contig size is 64KB (order-4), so that works nicely. But for
> 16KB and 64KB pages, its 2MB (order-7 and order-5 respectively). Do you think
> allocating 2MB pages here is going to lead to too much memory wastage?

I think it boils down to the specifics of the microarchitecture.

We can justify 2M PMD-mapped THP in many cases. But PMD-mapped THP not only
reduces TLB pressure (the contiguous bit does that too, I believe), but
also saves one more memory access on the page table walk.

It may or may not matter for the processor. It has to be evaluated.

Maybe moving it to per-arch is the right way, with the default in generic code
being ilog2(SZ_64K >> PAGE_SHIFT) or something.
Ryan Roberts April 14, 2023, 4:06 p.m. UTC | #4
On 14/04/2023 16:37, Kirill A. Shutemov wrote:
> On Fri, Apr 14, 2023 at 03:38:35PM +0100, Ryan Roberts wrote:
>> On 14/04/2023 15:09, Kirill A. Shutemov wrote:
>>> On Fri, Apr 14, 2023 at 02:02:51PM +0100, Ryan Roberts wrote:
>>>> For variable-order anonymous folios, we want to tune the order that we
>>>> prefer to allocate based on the vma. Add the routines to manage that
>>>> heuristic.
>>>>
>>>> TODO: Currently we always use the global maximum. Add per-vma logic!
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>  include/linux/mm.h | 5 +++++
>>>>  mm/memory.c        | 8 ++++++++
>>>>  2 files changed, 13 insertions(+)
>>>>
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index cdb8c6031d0f..cc8d0b239116 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -3674,4 +3674,9 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>>>>  }
>>>>  #endif
>>>>
>>>> +/*
>>>> + * TODO: Should this be set per-architecture?
>>>> + */
>>>> +#define ANON_FOLIO_ORDER_MAX	4
>>>> +
>>>
>>> I think it has to be derived from size in bytes, not directly specifies
>>> page order. For 4K pages, order 4 is 64k and for 64k pages it is 1M.
>>>
>>
>> Yes I see where you are coming from. What's your feel for what a sensible upper
>> bound in bytes is?
>>
>> My difficulty is that I would like to be able to use this allocation mechanism
>> to enable using the "contiguous bit" on arm64; that's a set of contiguous PTEs
>> that are mapped to physically contiguous memory, and the HW can use that hint to
>> coalesce the TLB entries.
>>
>> For 4KB pages, the contig size is 64KB (order-4), so that works nicely. But for
>> 16KB and 64KB pages, its 2MB (order-7 and order-5 respectively). Do you think
>> allocating 2MB pages here is going to lead to too much memory wastage?
> 
> I think it boils down to the specifics of the microarchitecture.
> 
> We can justify 2M PMD-mapped THP in many cases. But PMD-mapped THP is not
> only reduces TLB pressure (that contiguous bit does too, I believe), but
> also saves one more memory access on page table walk.
> 
> It may or may not matter for the processor. It has to be evaluated.

I think you are saying that if the performance uplift is good, then some extra
memory wastage can be justified?

The point I'm thinking about is for 4K pages, we need to allocate 64K blocks to
use the contig bit. Roughly, I guess that means going from an average of 2K wastage
per anon VMA to 32K. Perhaps you can get away with that for a decent perf uplift.

But for 64K pages, we need to allocate 2M blocks to use the contig bit. So that
takes average wastage from 32K to 1M. That feels a bit harder to justify.
Perhaps here, we should make a decision based on MADV_HUGEPAGE?

So perhaps we actually want 2 values: one for if MADV_HUGEPAGE is not set on the
VMA, and one if it is? (with 64K pages I'm guessing there are many cases where
we won't PMD-map THPs - it's 512MB).

> 
> Maybe moving it to per-arch is the right way. With default in generic code
> to be ilog2(SZ_64K >> PAGE_SHIFT) or something.

Yes, I agree that sounds like a good starting point for the !MADV_HUGEPAGE case.
Matthew Wilcox April 14, 2023, 4:18 p.m. UTC | #5
On Fri, Apr 14, 2023 at 05:06:49PM +0100, Ryan Roberts wrote:
> The point I'm thinking about is for 4K pages, we need to allocate 64K blocks to
> use the contig bit. Roughly I guess that means going from average of 2K wastage
> per anon VMA to 32K. Perhaps you can get away with that for a decent perf uplift.
> 
> But for 64K pages, we need to allocate 2M blocks to use the contig bit. So that
> takes average wastage from 32K to 1M. That feels a bit harder to justify.
> Perhaps here, we should make a decision based on MADV_HUGEPAGE?
> 
> So perhaps we actually want 2 values: one for if MADV_HUGEPAGE is not set on the
> VMA, and one if it is? (with 64K pages I'm guessing there are many cases where
> we won't PMD-map THPs - its 512MB).

I'm kind of hoping that all this work takes away the benefit from
CONFIG_PAGE_SIZE_64K, and we can just use 4k pages everywhere.
Ryan Roberts April 14, 2023, 4:31 p.m. UTC | #6
On 14/04/2023 17:18, Matthew Wilcox wrote:
> On Fri, Apr 14, 2023 at 05:06:49PM +0100, Ryan Roberts wrote:
>> The point I'm thinking about is for 4K pages, we need to allocate 64K blocks to
>> use the contig bit. Roughly I guess that means going from average of 2K wastage
>> per anon VMA to 32K. Perhaps you can get away with that for a decent perf uplift.
>>
>> But for 64K pages, we need to allocate 2M blocks to use the contig bit. So that
>> takes average wastage from 32K to 1M. That feels a bit harder to justify.
>> Perhaps here, we should make a decision based on MADV_HUGEPAGE?
>>
>> So perhaps we actually want 2 values: one for if MADV_HUGEPAGE is not set on the
>> VMA, and one if it is? (with 64K pages I'm guessing there are many cases where
>> we won't PMD-map THPs - its 512MB).
> 
> I'm kind of hoping that all this work takes away the benefit from
> CONFIG_PAGE_SIZE_64K, and we can just use 4k pages everywhere.

That sounds great. I'm not sure I share your confidence though ;-)

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cdb8c6031d0f..cc8d0b239116 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3674,4 +3674,9 @@  madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 }
 #endif

+/*
+ * TODO: Should this be set per-architecture?
+ */
+#define ANON_FOLIO_ORDER_MAX	4
+
 #endif /* _LINUX_MM_H */
diff --git a/mm/memory.c b/mm/memory.c
index ca32f59acef2..d7e34a8c46aa 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3022,6 +3022,14 @@  static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
 	return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
 }

+static inline int max_anon_folio_order(struct vm_area_struct *vma)
+{
+	/*
+	 * TODO: Policy for maximum folio order should likely be per-vma.
+	 */
+	return ANON_FOLIO_ORDER_MAX;
+}
+
 /*
  * Handle write page faults for pages that can be reused in the current vma
  *