diff mbox series

[RFC,v1] mm/filemap: Allow arch to request folio size for exec memory

Message ID 20240111154106.3692206-1-ryan.roberts@arm.com (mailing list archive)
State New
Headers show
Series [RFC,v1] mm/filemap: Allow arch to request folio size for exec memory | expand

Commit Message

Ryan Roberts Jan. 11, 2024, 3:41 p.m. UTC
Change the readahead config so that if it is being requested for an
executable mapping, do a synchronous read of an arch-specified size in a
naturally aligned manner.

On arm64 if memory is physically contiguous and naturally aligned to the
"contpte" size, we can use contpte mappings, which improves utilization
of the TLB. When paired with the "multi-size THP" changes, this works
well to reduce dTLB pressure. However iTLB pressure is still high due to
executable mappings having a low liklihood of being in the required
folio size and mapping alignment, even when the filesystem supports
readahead into large folios (e.g. XFS).

The reason for the low liklihood is that the current readahead algorithm
starts with an order-2 folio and increases the folio order by 2 every
time the readahead mark is hit. But most executable memory is faulted in
fairly randomly and so the readahead mark is rarely hit and most
executable folios remain order-2. This is observed impirically and
confirmed from discussion with a gnu linker expert; in general, the
linker does nothing to group temporally accessed text together
spacially. Additionally, with the current read-around approach there are
no alignment guarrantees between the file and folio. This is
insufficient for arm64's contpte mapping requirement (order-4 for 4K
base pages).

So it seems reasonable to special-case the read(ahead) logic for
executable mappings. The trade-off is performance improvement (due to
more efficient storage of the translations in iTLB) vs potential read
amplification (due to reading too much data around the fault which won't
be used), and the latter is independent of base page size. I've chosen
64K folio size for arm64 which benefits both the 4K and 16K base page
size configs and shouldn't lead to any further read-amplification since
the old read-around path was (usually) reading blocks of 128K (with the
last 32K being async).

Performance Benchmarking
------------------------

The below shows kernel compilation and speedometer javascript benchmarks
on Ampere Altra arm64 system. (The contpte patch series is applied in
the baseline).

First, confirmation that this patch causes more memory to be contained
in 64K folios (this is for all file-backed memory so includes
non-executable too):

| File-backed folios      |   Speedometer   |  Kernel Compile |
| by size as percentage   |-----------------|-----------------|
| of all mapped file mem  | before |  after | before |  after |
|=========================|========|========|========|========|
|file-thp-aligned-16kB    |    45% |     9% |    46% |     7% |
|file-thp-aligned-32kB    |     2% |     0% |     3% |     1% |
|file-thp-aligned-64kB    |     3% |    63% |     5% |    80% |
|file-thp-aligned-128kB   |    11% |    11% |     0% |     0% |
|file-thp-unaligned-16kB  |     1% |     0% |     3% |     1% |
|file-thp-unaligned-128kB |     1% |     0% |     0% |     0% |
|file-thp-partial         |     0% |     0% |     0% |     0% |
|-------------------------|--------|--------|--------|--------|
|file-cont-aligned-64kB   |    16% |    75% |     5% |    80% |

The above shows that for both use cases, the amount of file memory
backed by 16K folios reduces and the amount backed by 64K folios
increases significantly. And the amount of memory that is contpte-mapped
significantly increases (last line).

And this is reflected in performance improvement:

Kernel Compilation (smaller is faster):
| kernel   |   real-time |   kern-time |   user-time |   peak memory |
|----------|-------------|-------------|-------------|---------------|
| before   |        0.0% |        0.0% |        0.0% |          0.0% |
| after    |       -1.6% |       -2.1% |       -1.7% |          0.0% |

Speedometer (bigger is faster):
| kernel   |   runs_per_min |   peak memory |
|----------|----------------|---------------|
| before   |           0.0% |          0.0% |
| after    |           1.3% |          1.0% |

Both benchmarks show a ~1.5% improvement once the patch is applied.

Alternatives
------------

I considered (and rejected for now - but I anticipate this patch will
stimulate discussion around what the best approach is) alternative
approaches:

  - Expose a global user-controlled knob to set the preferred folio
    size; this would move policy to user space and allow (e.g.) setting
    it to PMD-size for even better iTLB utilizaiton. But this would add
    ABI, and I prefer to start with the simplest approach first. It also
    has the downside that a change wouldn't apply to memory already in
    the page cache that is in active use (e.g. libc) so we don't get the
    same level of utilization as for something that is fixed from boot.

  - Add a per-vma attribute to allow user space to specify preferred
    folio size for memory faulted from the range. (we've talked about
    such a control in the context of mTHP). The dynamic loader would
    then be responsible for adding the annotations. Again this feels
    like something that could be added later if value was demonstrated.

  - Enhance MADV_COLLAPSE to collapse to THP sizes less than PMD-size.
    This would still require dynamic linker involvement, but would
    additionally neccessitate a copy and all memory in the range would
    be synchronously faulted in, adding to application load time. It
    would work for filesystems that don't support large folios though.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---

Hi all,

I originally concocted something similar to this, with Matthew's help, as a
quick proof of concept hack. Since then I've tried a few different approaches
but always came back to this as the simplest solution. I expect this will raise
a few eyebrows but given it is providing a real performance win, I hope we can
converge to something that can be upstreamed.

This depends on my contpte series to actually set the contiguous bit in the page
table.

Thanks,
Ryan


 arch/arm64/include/asm/pgtable.h | 12 ++++++++++++
 include/linux/pgtable.h          | 12 ++++++++++++
 mm/filemap.c                     | 19 +++++++++++++++++++
 3 files changed, 43 insertions(+)

--
2.25.1

Comments

Barry Song Jan. 12, 2024, 5:23 a.m. UTC | #1
On Fri, Jan 12, 2024 at 4:41 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Change the readahead config so that if it is being requested for an
> executable mapping, do a synchronous read of an arch-specified size in a
> naturally aligned manner.
>
> On arm64 if memory is physically contiguous and naturally aligned to the
> "contpte" size, we can use contpte mappings, which improves utilization
> of the TLB. When paired with the "multi-size THP" changes, this works
> well to reduce dTLB pressure. However iTLB pressure is still high due to
> executable mappings having a low liklihood of being in the required
> folio size and mapping alignment, even when the filesystem supports
> readahead into large folios (e.g. XFS).
>
> The reason for the low liklihood is that the current readahead algorithm
> starts with an order-2 folio and increases the folio order by 2 every
> time the readahead mark is hit. But most executable memory is faulted in
> fairly randomly and so the readahead mark is rarely hit and most
> executable folios remain order-2. This is observed impirically and
> confirmed from discussion with a gnu linker expert; in general, the
> linker does nothing to group temporally accessed text together
> spacially. Additionally, with the current read-around approach there are
> no alignment guarrantees between the file and folio. This is
> insufficient for arm64's contpte mapping requirement (order-4 for 4K
> base pages).
>
> So it seems reasonable to special-case the read(ahead) logic for
> executable mappings. The trade-off is performance improvement (due to
> more efficient storage of the translations in iTLB) vs potential read
> amplification (due to reading too much data around the fault which won't
> be used), and the latter is independent of base page size. I've chosen
> 64K folio size for arm64 which benefits both the 4K and 16K base page
> size configs and shouldn't lead to any further read-amplification since
> the old read-around path was (usually) reading blocks of 128K (with the
> last 32K being async).
>
> Performance Benchmarking
> ------------------------
>
> The below shows kernel compilation and speedometer javascript benchmarks
> on Ampere Altra arm64 system. (The contpte patch series is applied in
> the baseline).
>
> First, confirmation that this patch causes more memory to be contained
> in 64K folios (this is for all file-backed memory so includes
> non-executable too):
>
> | File-backed folios      |   Speedometer   |  Kernel Compile |
> | by size as percentage   |-----------------|-----------------|
> | of all mapped file mem  | before |  after | before |  after |
> |=========================|========|========|========|========|
> |file-thp-aligned-16kB    |    45% |     9% |    46% |     7% |
> |file-thp-aligned-32kB    |     2% |     0% |     3% |     1% |
> |file-thp-aligned-64kB    |     3% |    63% |     5% |    80% |
> |file-thp-aligned-128kB   |    11% |    11% |     0% |     0% |
> |file-thp-unaligned-16kB  |     1% |     0% |     3% |     1% |
> |file-thp-unaligned-128kB |     1% |     0% |     0% |     0% |
> |file-thp-partial         |     0% |     0% |     0% |     0% |
> |-------------------------|--------|--------|--------|--------|
> |file-cont-aligned-64kB   |    16% |    75% |     5% |    80% |
>
> The above shows that for both use cases, the amount of file memory
> backed by 16K folios reduces and the amount backed by 64K folios
> increases significantly. And the amount of memory that is contpte-mapped
> significantly increases (last line).
>
> And this is reflected in performance improvement:
>
> Kernel Compilation (smaller is faster):
> | kernel   |   real-time |   kern-time |   user-time |   peak memory |
> |----------|-------------|-------------|-------------|---------------|
> | before   |        0.0% |        0.0% |        0.0% |          0.0% |
> | after    |       -1.6% |       -2.1% |       -1.7% |          0.0% |
>
> Speedometer (bigger is faster):
> | kernel   |   runs_per_min |   peak memory |
> |----------|----------------|---------------|
> | before   |           0.0% |          0.0% |
> | after    |           1.3% |          1.0% |
>
> Both benchmarks show a ~1.5% improvement once the patch is applied.

Hi Ryan,
you had the data regarding exec-cont-pte in cont-pte series[1], which has
already shown 1-2% improvement.

Kernel Compilation with -j8 (negative is faster):

| kernel                    | real-time | kern-time | user-time |
|---------------------------|-----------|-----------|-----------|
| baseline                  |      0.0% |      0.0% |      0.0% |
| mTHP                      |     -4.6% |    -38.0% |     -0.4% |
| mTHP + contpte            |     -5.4% |    -37.7% |     -1.3% |
| mTHP + contpte + exefolio |     -7.4% |    -39.5% |     -3.5% |

Kernel Compilation with -j80 (negative is faster):

| kernel                    | real-time | kern-time | user-time |
|---------------------------|-----------|-----------|-----------|
| baseline                  |      0.0% |      0.0% |      0.0% |
| mTHP                      |     -4.9% |    -36.1% |     -0.2% |
| mTHP + contpte            |     -5.8% |    -36.0% |     -1.2% |
| mTHP + contpte + exefolio |     -6.8% |    -37.0% |     -3.1% |

Speedometer (positive is faster):

| kernel                    | runs_per_min |
|:--------------------------|--------------|
| baseline                  |         0.0% |
| mTHP                      |         1.5% |
| mTHP + contpte            |         3.7% |
| mTHP + contpte + exefolio |         4.9% |


Is this 1.5% you are saying now an extra improvement after you have
mTHP + contpte + exefolio in [1]?

[1] https://lore.kernel.org/linux-mm/20231218105100.172635-1-ryan.roberts@arm.com/

>
> Alternatives
> ------------
>
> I considered (and rejected for now - but I anticipate this patch will
> stimulate discussion around what the best approach is) alternative
> approaches:
>
>   - Expose a global user-controlled knob to set the preferred folio
>     size; this would move policy to user space and allow (e.g.) setting
>     it to PMD-size for even better iTLB utilizaiton. But this would add
>     ABI, and I prefer to start with the simplest approach first. It also
>     has the downside that a change wouldn't apply to memory already in
>     the page cache that is in active use (e.g. libc) so we don't get the
>     same level of utilization as for something that is fixed from boot.
>
>   - Add a per-vma attribute to allow user space to specify preferred
>     folio size for memory faulted from the range. (we've talked about
>     such a control in the context of mTHP). The dynamic loader would
>     then be responsible for adding the annotations. Again this feels
>     like something that could be added later if value was demonstrated.
>
>   - Enhance MADV_COLLAPSE to collapse to THP sizes less than PMD-size.
>     This would still require dynamic linker involvement, but would
>     additionally neccessitate a copy and all memory in the range would
>     be synchronously faulted in, adding to application load time. It
>     would work for filesystems that don't support large folios though.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> Hi all,
>
> I originally concocted something similar to this, with Matthew's help, as a
> quick proof of concept hack. Since then I've tried a few different approaches
> but always came back to this as the simplest solution. I expect this will raise
> a few eyebrows but given it is providing a real performance win, I hope we can
> converge to something that can be upstreamed.
>
> This depends on my contpte series to actually set the contiguous bit in the page
> table.
>
> Thanks,
> Ryan
>
>
>  arch/arm64/include/asm/pgtable.h | 12 ++++++++++++
>  include/linux/pgtable.h          | 12 ++++++++++++
>  mm/filemap.c                     | 19 +++++++++++++++++++
>  3 files changed, 43 insertions(+)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index f5bf059291c3..8f8f3f7eb8d8 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1143,6 +1143,18 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
>   */
>  #define arch_wants_old_prefaulted_pte  cpu_has_hw_af
>
> +/*
> + * Request exec memory is read into pagecache in at least 64K folios. The
> + * trade-off here is performance improvement due to storing translations more
> + * effciently in the iTLB vs the potential for read amplification due to reading
> + * data from disk that won't be used. The latter is independent of base page
> + * size, so we set a page-size independent block size of 64K. This size can be
> + * contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB entry),
> + * and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base pages are in
> + * use.
> + */
> +#define arch_wants_exec_folio_order(void) ilog2(SZ_64K >> PAGE_SHIFT)
> +
>  static inline bool pud_sect_supported(void)
>  {
>         return PAGE_SIZE == SZ_4K;
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 170925379534..57090616d09c 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -428,6 +428,18 @@ static inline bool arch_has_hw_pte_young(void)
>  }
>  #endif
>
> +#ifndef arch_wants_exec_folio_order
> +/*
> + * Returns preferred minimum folio order for executable file-backed memory. Must
> + * be in range [0, PMD_ORDER]. Negative value implies that the HW has no
> + * preference and mm will not special-case executable memory in the pagecache.
> + */
> +static inline int arch_wants_exec_folio_order(void)
> +{
> +       return -1;
> +}
> +#endif
> +
>  #ifndef arch_check_zapped_pte
>  static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
>                                          pte_t pte)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 67ba56ecdd32..80a76d755534 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3115,6 +3115,25 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>         }
>  #endif
>
> +       /*
> +        * Allow arch to request a preferred minimum folio order for executable
> +        * memory. This can often be beneficial to performance if (e.g.) arm64
> +        * can contpte-map the folio. Executable memory rarely benefits from
> +        * read-ahead anyway, due to its random access nature.
> +        */
> +       if (vm_flags & VM_EXEC) {
> +               int order = arch_wants_exec_folio_order();
> +
> +               if (order >= 0) {
> +                       fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> +                       ra->size = 1UL << order;
> +                       ra->async_size = 0;
> +                       ractl._index &= ~((unsigned long)ra->size - 1);
> +                       page_cache_ra_order(&ractl, ra, order);
> +                       return fpin;
> +               }
> +       }
> +
>         /* If we don't want any read-ahead, don't bother */
>         if (vm_flags & VM_RAND_READ)
>                 return fpin;
> --
> 2.25.1
>

Thanks
barry
Ryan Roberts Jan. 12, 2024, 7:55 a.m. UTC | #2
On 11/01/2024 15:41, Ryan Roberts wrote:
> Change the readahead config so that if it is being requested for an
> executable mapping, do a synchronous read of an arch-specified size in a
> naturally aligned manner.
> 
> On arm64 if memory is physically contiguous and naturally aligned to the
> "contpte" size, we can use contpte mappings, which improves utilization
> of the TLB. When paired with the "multi-size THP" changes, this works
> well to reduce dTLB pressure. However iTLB pressure is still high due to
> executable mappings having a low liklihood of being in the required
> folio size and mapping alignment, even when the filesystem supports
> readahead into large folios (e.g. XFS).
> 
> The reason for the low liklihood is that the current readahead algorithm
> starts with an order-2 folio and increases the folio order by 2 every
> time the readahead mark is hit. But most executable memory is faulted in
> fairly randomly and so the readahead mark is rarely hit and most
> executable folios remain order-2. This is observed impirically and
> confirmed from discussion with a gnu linker expert; in general, the
> linker does nothing to group temporally accessed text together
> spacially. Additionally, with the current read-around approach there are
> no alignment guarrantees between the file and folio. This is
> insufficient for arm64's contpte mapping requirement (order-4 for 4K
> base pages).
> 
> So it seems reasonable to special-case the read(ahead) logic for
> executable mappings. The trade-off is performance improvement (due to
> more efficient storage of the translations in iTLB) vs potential read
> amplification (due to reading too much data around the fault which won't
> be used), and the latter is independent of base page size. I've chosen
> 64K folio size for arm64 which benefits both the 4K and 16K base page
> size configs and shouldn't lead to any further read-amplification since
> the old read-around path was (usually) reading blocks of 128K (with the
> last 32K being async).
> 
> Performance Benchmarking
> ------------------------
> 
> The below shows kernel compilation and speedometer javascript benchmarks
> on Ampere Altra arm64 system. (The contpte patch series is applied in
> the baseline).
> 
> First, confirmation that this patch causes more memory to be contained
> in 64K folios (this is for all file-backed memory so includes
> non-executable too):
> 
> | File-backed folios      |   Speedometer   |  Kernel Compile |
> | by size as percentage   |-----------------|-----------------|
> | of all mapped file mem  | before |  after | before |  after |
> |=========================|========|========|========|========|
> |file-thp-aligned-16kB    |    45% |     9% |    46% |     7% |
> |file-thp-aligned-32kB    |     2% |     0% |     3% |     1% |
> |file-thp-aligned-64kB    |     3% |    63% |     5% |    80% |
> |file-thp-aligned-128kB   |    11% |    11% |     0% |     0% |
> |file-thp-unaligned-16kB  |     1% |     0% |     3% |     1% |
> |file-thp-unaligned-128kB |     1% |     0% |     0% |     0% |
> |file-thp-partial         |     0% |     0% |     0% |     0% |
> |-------------------------|--------|--------|--------|--------|
> |file-cont-aligned-64kB   |    16% |    75% |     5% |    80% |
> 
> The above shows that for both use cases, the amount of file memory
> backed by 16K folios reduces and the amount backed by 64K folios
> increases significantly. And the amount of memory that is contpte-mapped
> significantly increases (last line).
> 
> And this is reflected in performance improvement:
> 
> Kernel Compilation (smaller is faster):
> | kernel   |   real-time |   kern-time |   user-time |   peak memory |
> |----------|-------------|-------------|-------------|---------------|
> | before   |        0.0% |        0.0% |        0.0% |          0.0% |
> | after    |       -1.6% |       -2.1% |       -1.7% |          0.0% |
> 
> Speedometer (bigger is faster):
> | kernel   |   runs_per_min |   peak memory |
> |----------|----------------|---------------|
> | before   |           0.0% |          0.0% |
> | after    |           1.3% |          1.0% |
> 
> Both benchmarks show a ~1.5% improvement once the patch is applied.
> 
> Alternatives
> ------------
> 
> I considered (and rejected for now - but I anticipate this patch will
> stimulate discussion around what the best approach is) alternative
> approaches:
> 
>   - Expose a global user-controlled knob to set the preferred folio
>     size; this would move policy to user space and allow (e.g.) setting
>     it to PMD-size for even better iTLB utilizaiton. But this would add
>     ABI, and I prefer to start with the simplest approach first. It also
>     has the downside that a change wouldn't apply to memory already in
>     the page cache that is in active use (e.g. libc) so we don't get the
>     same level of utilization as for something that is fixed from boot.
> 
>   - Add a per-vma attribute to allow user space to specify preferred
>     folio size for memory faulted from the range. (we've talked about
>     such a control in the context of mTHP). The dynamic loader would
>     then be responsible for adding the annotations. Again this feels
>     like something that could be added later if value was demonstrated.
> 
>   - Enhance MADV_COLLAPSE to collapse to THP sizes less than PMD-size.
>     This would still require dynamic linker involvement, but would
>     additionally neccessitate a copy and all memory in the range would
>     be synchronously faulted in, adding to application load time. It
>     would work for filesystems that don't support large folios though.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> 
> Hi all,
> 
> I originally concocted something similar to this, with Matthew's help, as a
> quick proof of concept hack. Since then I've tried a few different approaches
> but always came back to this as the simplest solution. I expect this will raise
> a few eyebrows but given it is providing a real performance win, I hope we can
> converge to something that can be upstreamed.
> 
> This depends on my contpte series to actually set the contiguous bit in the page
> table.

For clarity, the contpte series I referred to is at [1], and the "exefolio poc
patch" referred to in the performance section of its cover note is a version of
this patch. The dependency is not a build dependency, it's just required to get
the speedup at runtime. Without it, this change is benign.

[1] https://lore.kernel.org/linux-mm/20231218105100.172635-1-ryan.roberts@arm.com/

> 
> Thanks,
> Ryan
> 
> 
>  arch/arm64/include/asm/pgtable.h | 12 ++++++++++++
>  include/linux/pgtable.h          | 12 ++++++++++++
>  mm/filemap.c                     | 19 +++++++++++++++++++
>  3 files changed, 43 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index f5bf059291c3..8f8f3f7eb8d8 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1143,6 +1143,18 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
>   */
>  #define arch_wants_old_prefaulted_pte	cpu_has_hw_af
> 
> +/*
> + * Request exec memory is read into pagecache in at least 64K folios. The
> + * trade-off here is performance improvement due to storing translations more
> + * effciently in the iTLB vs the potential for read amplification due to reading
> + * data from disk that won't be used. The latter is independent of base page
> + * size, so we set a page-size independent block size of 64K. This size can be
> + * contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB entry),
> + * and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base pages are in
> + * use.
> + */
> +#define arch_wants_exec_folio_order(void) ilog2(SZ_64K >> PAGE_SHIFT)
> +
>  static inline bool pud_sect_supported(void)
>  {
>  	return PAGE_SIZE == SZ_4K;
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 170925379534..57090616d09c 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -428,6 +428,18 @@ static inline bool arch_has_hw_pte_young(void)
>  }
>  #endif
> 
> +#ifndef arch_wants_exec_folio_order
> +/*
> + * Returns preferred minimum folio order for executable file-backed memory. Must
> + * be in range [0, PMD_ORDER]. Negative value implies that the HW has no
> + * preference and mm will not special-case executable memory in the pagecache.
> + */
> +static inline int arch_wants_exec_folio_order(void)
> +{
> +	return -1;
> +}
> +#endif
> +
>  #ifndef arch_check_zapped_pte
>  static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
>  					 pte_t pte)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 67ba56ecdd32..80a76d755534 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3115,6 +3115,25 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  	}
>  #endif
> 
> +	/*
> +	 * Allow arch to request a preferred minimum folio order for executable
> +	 * memory. This can often be beneficial to performance if (e.g.) arm64
> +	 * can contpte-map the folio. Executable memory rarely benefits from
> +	 * read-ahead anyway, due to its random access nature.
> +	 */
> +	if (vm_flags & VM_EXEC) {
> +		int order = arch_wants_exec_folio_order();
> +
> +		if (order >= 0) {
> +			fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> +			ra->size = 1UL << order;
> +			ra->async_size = 0;
> +			ractl._index &= ~((unsigned long)ra->size - 1);
> +			page_cache_ra_order(&ractl, ra, order);
> +			return fpin;
> +		}
> +	}
> +
>  	/* If we don't want any read-ahead, don't bother */
>  	if (vm_flags & VM_RAND_READ)
>  		return fpin;
> --
> 2.25.1
>
Ryan Roberts Jan. 12, 2024, 7:59 a.m. UTC | #3
On 12/01/2024 05:23, Barry Song wrote:
> On Fri, Jan 12, 2024 at 4:41 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Change the readahead config so that if it is being requested for an
>> executable mapping, do a synchronous read of an arch-specified size in a
>> naturally aligned manner.
>>
>> On arm64 if memory is physically contiguous and naturally aligned to the
>> "contpte" size, we can use contpte mappings, which improves utilization
>> of the TLB. When paired with the "multi-size THP" changes, this works
>> well to reduce dTLB pressure. However iTLB pressure is still high due to
>> executable mappings having a low liklihood of being in the required
>> folio size and mapping alignment, even when the filesystem supports
>> readahead into large folios (e.g. XFS).
>>
>> The reason for the low liklihood is that the current readahead algorithm
>> starts with an order-2 folio and increases the folio order by 2 every
>> time the readahead mark is hit. But most executable memory is faulted in
>> fairly randomly and so the readahead mark is rarely hit and most
>> executable folios remain order-2. This is observed impirically and
>> confirmed from discussion with a gnu linker expert; in general, the
>> linker does nothing to group temporally accessed text together
>> spacially. Additionally, with the current read-around approach there are
>> no alignment guarrantees between the file and folio. This is
>> insufficient for arm64's contpte mapping requirement (order-4 for 4K
>> base pages).
>>
>> So it seems reasonable to special-case the read(ahead) logic for
>> executable mappings. The trade-off is performance improvement (due to
>> more efficient storage of the translations in iTLB) vs potential read
>> amplification (due to reading too much data around the fault which won't
>> be used), and the latter is independent of base page size. I've chosen
>> 64K folio size for arm64 which benefits both the 4K and 16K base page
>> size configs and shouldn't lead to any further read-amplification since
>> the old read-around path was (usually) reading blocks of 128K (with the
>> last 32K being async).
>>
>> Performance Benchmarking
>> ------------------------
>>
>> The below shows kernel compilation and speedometer javascript benchmarks
>> on Ampere Altra arm64 system. (The contpte patch series is applied in
>> the baseline).
>>
>> First, confirmation that this patch causes more memory to be contained
>> in 64K folios (this is for all file-backed memory so includes
>> non-executable too):
>>
>> | File-backed folios      |   Speedometer   |  Kernel Compile |
>> | by size as percentage   |-----------------|-----------------|
>> | of all mapped file mem  | before |  after | before |  after |
>> |=========================|========|========|========|========|
>> |file-thp-aligned-16kB    |    45% |     9% |    46% |     7% |
>> |file-thp-aligned-32kB    |     2% |     0% |     3% |     1% |
>> |file-thp-aligned-64kB    |     3% |    63% |     5% |    80% |
>> |file-thp-aligned-128kB   |    11% |    11% |     0% |     0% |
>> |file-thp-unaligned-16kB  |     1% |     0% |     3% |     1% |
>> |file-thp-unaligned-128kB |     1% |     0% |     0% |     0% |
>> |file-thp-partial         |     0% |     0% |     0% |     0% |
>> |-------------------------|--------|--------|--------|--------|
>> |file-cont-aligned-64kB   |    16% |    75% |     5% |    80% |
>>
>> The above shows that for both use cases, the amount of file memory
>> backed by 16K folios reduces and the amount backed by 64K folios
>> increases significantly. And the amount of memory that is contpte-mapped
>> significantly increases (last line).
>>
>> And this is reflected in performance improvement:
>>
>> Kernel Compilation (smaller is faster):
>> | kernel   |   real-time |   kern-time |   user-time |   peak memory |
>> |----------|-------------|-------------|-------------|---------------|
>> | before   |        0.0% |        0.0% |        0.0% |          0.0% |
>> | after    |       -1.6% |       -2.1% |       -1.7% |          0.0% |
>>
>> Speedometer (bigger is faster):
>> | kernel   |   runs_per_min |   peak memory |
>> |----------|----------------|---------------|
>> | before   |           0.0% |          0.0% |
>> | after    |           1.3% |          1.0% |
>>
>> Both benchmarks show a ~1.5% improvement once the patch is applied.
> 
> Hi Ryan,
> you had the data regarding exec-cont-pte in cont-pte series[1], which has
> already shown 1-2% improvement.
> 
> Kernel Compilation with -j8 (negative is faster):
> 
> | kernel                    | real-time | kern-time | user-time |
> |---------------------------|-----------|-----------|-----------|
> | baseline                  |      0.0% |      0.0% |      0.0% |
> | mTHP                      |     -4.6% |    -38.0% |     -0.4% |
> | mTHP + contpte            |     -5.4% |    -37.7% |     -1.3% |
> | mTHP + contpte + exefolio |     -7.4% |    -39.5% |     -3.5% |
> 
> Kernel Compilation with -j80 (negative is faster):
> 
> | kernel                    | real-time | kern-time | user-time |
> |---------------------------|-----------|-----------|-----------|
> | baseline                  |      0.0% |      0.0% |      0.0% |
> | mTHP                      |     -4.9% |    -36.1% |     -0.2% |
> | mTHP + contpte            |     -5.8% |    -36.0% |     -1.2% |
> | mTHP + contpte + exefolio |     -6.8% |    -37.0% |     -3.1% |
> 
> Speedometer (positive is faster):
> 
> | kernel                    | runs_per_min |
> |:--------------------------|--------------|
> | baseline                  |         0.0% |
> | mTHP                      |         1.5% |
> | mTHP + contpte            |         3.7% |
> | mTHP + contpte + exefolio |         4.9% |
> 
> 
> Is this 1.5% you are saying now an extra improvement after you have
> mTHP + contpte + exefolio in [1]?

The latter; it's the same ~1.5% I mentioned in [1]. This is the first time I've
posted the "exefolio" change publicly.

> 
> [1] https://lore.kernel.org/linux-mm/20231218105100.172635-1-ryan.roberts@arm.com/
> 
>>
>> Alternatives
>> ------------
>>
>> I considered (and rejected for now - but I anticipate this patch will
>> stimulate discussion around what the best approach is) alternative
>> approaches:
>>
>>   - Expose a global user-controlled knob to set the preferred folio
>>     size; this would move policy to user space and allow (e.g.) setting
>>     it to PMD-size for even better iTLB utilizaiton. But this would add
>>     ABI, and I prefer to start with the simplest approach first. It also
>>     has the downside that a change wouldn't apply to memory already in
>>     the page cache that is in active use (e.g. libc) so we don't get the
>>     same level of utilization as for something that is fixed from boot.
>>
>>   - Add a per-vma attribute to allow user space to specify preferred
>>     folio size for memory faulted from the range. (we've talked about
>>     such a control in the context of mTHP). The dynamic loader would
>>     then be responsible for adding the annotations. Again this feels
>>     like something that could be added later if value was demonstrated.
>>
>>   - Enhance MADV_COLLAPSE to collapse to THP sizes less than PMD-size.
>>     This would still require dynamic linker involvement, but would
>>     additionally neccessitate a copy and all memory in the range would
>>     be synchronously faulted in, adding to application load time. It
>>     would work for filesystems that don't support large folios though.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>
>> Hi all,
>>
>> I originally concocted something similar to this, with Matthew's help, as a
>> quick proof of concept hack. Since then I've tried a few different approaches
>> but always came back to this as the simplest solution. I expect this will raise
>> a few eyebrows but given it is providing a real performance win, I hope we can
>> converge to something that can be upstreamed.
>>
>> This depends on my contpte series to actually set the contiguous bit in the page
>> table.
>>
>> Thanks,
>> Ryan
>>
>>
>>  arch/arm64/include/asm/pgtable.h | 12 ++++++++++++
>>  include/linux/pgtable.h          | 12 ++++++++++++
>>  mm/filemap.c                     | 19 +++++++++++++++++++
>>  3 files changed, 43 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index f5bf059291c3..8f8f3f7eb8d8 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1143,6 +1143,18 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
>>   */
>>  #define arch_wants_old_prefaulted_pte  cpu_has_hw_af
>>
>> +/*
>> + * Request exec memory is read into pagecache in at least 64K folios. The
>> + * trade-off here is performance improvement due to storing translations more
>> + * effciently in the iTLB vs the potential for read amplification due to reading
>> + * data from disk that won't be used. The latter is independent of base page
>> + * size, so we set a page-size independent block size of 64K. This size can be
>> + * contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB entry),
>> + * and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base pages are in
>> + * use.
>> + */
>> +#define arch_wants_exec_folio_order(void) ilog2(SZ_64K >> PAGE_SHIFT)
>> +
>>  static inline bool pud_sect_supported(void)
>>  {
>>         return PAGE_SIZE == SZ_4K;
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 170925379534..57090616d09c 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -428,6 +428,18 @@ static inline bool arch_has_hw_pte_young(void)
>>  }
>>  #endif
>>
>> +#ifndef arch_wants_exec_folio_order
>> +/*
>> + * Returns preferred minimum folio order for executable file-backed memory. Must
>> + * be in range [0, PMD_ORDER]. Negative value implies that the HW has no
>> + * preference and mm will not special-case executable memory in the pagecache.
>> + */
>> +static inline int arch_wants_exec_folio_order(void)
>> +{
>> +       return -1;
>> +}
>> +#endif
>> +
>>  #ifndef arch_check_zapped_pte
>>  static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
>>                                          pte_t pte)
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 67ba56ecdd32..80a76d755534 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3115,6 +3115,25 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>>         }
>>  #endif
>>
>> +       /*
>> +        * Allow arch to request a preferred minimum folio order for executable
>> +        * memory. This can often be beneficial to performance if (e.g.) arm64
>> +        * can contpte-map the folio. Executable memory rarely benefits from
>> +        * read-ahead anyway, due to its random access nature.
>> +        */
>> +       if (vm_flags & VM_EXEC) {
>> +               int order = arch_wants_exec_folio_order();
>> +
>> +               if (order >= 0) {
>> +                       fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>> +                       ra->size = 1UL << order;
>> +                       ra->async_size = 0;
>> +                       ractl._index &= ~((unsigned long)ra->size - 1);
>> +                       page_cache_ra_order(&ractl, ra, order);
>> +                       return fpin;
>> +               }
>> +       }
>> +
>>         /* If we don't want any read-ahead, don't bother */
>>         if (vm_flags & VM_RAND_READ)
>>                 return fpin;
>> --
>> 2.25.1
>>
> 
> Thanks
> barry
Barry Song Jan. 12, 2024, 10:13 a.m. UTC | #4
On Fri, Jan 12, 2024 at 4:41 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Change the readahead config so that if it is being requested for an
> executable mapping, do a synchronous read of an arch-specified size in a
> naturally aligned manner.
>
> On arm64 if memory is physically contiguous and naturally aligned to the
> "contpte" size, we can use contpte mappings, which improves utilization
> of the TLB. When paired with the "multi-size THP" changes, this works
> well to reduce dTLB pressure. However iTLB pressure is still high due to
> executable mappings having a low liklihood of being in the required
> folio size and mapping alignment, even when the filesystem supports
> readahead into large folios (e.g. XFS).
>
> The reason for the low liklihood is that the current readahead algorithm
> starts with an order-2 folio and increases the folio order by 2 every
> time the readahead mark is hit. But most executable memory is faulted in
> fairly randomly and so the readahead mark is rarely hit and most
> executable folios remain order-2. This is observed impirically and
> confirmed from discussion with a gnu linker expert; in general, the
> linker does nothing to group temporally accessed text together
> spacially. Additionally, with the current read-around approach there are
> no alignment guarrantees between the file and folio. This is
> insufficient for arm64's contpte mapping requirement (order-4 for 4K
> base pages).
>
> So it seems reasonable to special-case the read(ahead) logic for
> executable mappings. The trade-off is performance improvement (due to
> more efficient storage of the translations in iTLB) vs potential read
> amplification (due to reading too much data around the fault which won't
> be used), and the latter is independent of base page size. I've chosen
> 64K folio size for arm64 which benefits both the 4K and 16K base page
> size configs and shouldn't lead to any further read-amplification since
> the old read-around path was (usually) reading blocks of 128K (with the
> last 32K being async).
>
> Performance Benchmarking
> ------------------------
>
> The below shows kernel compilation and speedometer javascript benchmarks
> on Ampere Altra arm64 system. (The contpte patch series is applied in
> the baseline).
>
> First, confirmation that this patch causes more memory to be contained
> in 64K folios (this is for all file-backed memory so includes
> non-executable too):
>
> | File-backed folios      |   Speedometer   |  Kernel Compile |
> | by size as percentage   |-----------------|-----------------|
> | of all mapped file mem  | before |  after | before |  after |
> |=========================|========|========|========|========|
> |file-thp-aligned-16kB    |    45% |     9% |    46% |     7% |
> |file-thp-aligned-32kB    |     2% |     0% |     3% |     1% |
> |file-thp-aligned-64kB    |     3% |    63% |     5% |    80% |
> |file-thp-aligned-128kB   |    11% |    11% |     0% |     0% |
> |file-thp-unaligned-16kB  |     1% |     0% |     3% |     1% |
> |file-thp-unaligned-128kB |     1% |     0% |     0% |     0% |
> |file-thp-partial         |     0% |     0% |     0% |     0% |
> |-------------------------|--------|--------|--------|--------|
> |file-cont-aligned-64kB   |    16% |    75% |     5% |    80% |
>
> The above shows that for both use cases, the amount of file memory
> backed by 16K folios reduces and the amount backed by 64K folios
> increases significantly. And the amount of memory that is contpte-mapped
> significantly increases (last line).
>
> And this is reflected in performance improvement:
>
> Kernel Compilation (smaller is faster):
> | kernel   |   real-time |   kern-time |   user-time |   peak memory |
> |----------|-------------|-------------|-------------|---------------|
> | before   |        0.0% |        0.0% |        0.0% |          0.0% |
> | after    |       -1.6% |       -2.1% |       -1.7% |          0.0% |
>
> Speedometer (bigger is faster):
> | kernel   |   runs_per_min |   peak memory |
> |----------|----------------|---------------|
> | before   |           0.0% |          0.0% |
> | after    |           1.3% |          1.0% |
>
> Both benchmarks show a ~1.5% improvement once the patch is applied.
>
> Alternatives
> ------------
>
> I considered (and rejected for now - but I anticipate this patch will
> stimulate discussion around what the best approach is) alternative
> approaches:
>
>   - Expose a global user-controlled knob to set the preferred folio
>     size; this would move policy to user space and allow (e.g.) setting
>     it to PMD-size for even better iTLB utilizaiton. But this would add
>     ABI, and I prefer to start with the simplest approach first. It also
>     has the downside that a change wouldn't apply to memory already in
>     the page cache that is in active use (e.g. libc) so we don't get the
>     same level of utilization as for something that is fixed from boot.
>
>   - Add a per-vma attribute to allow user space to specify preferred
>     folio size for memory faulted from the range. (we've talked about
>     such a control in the context of mTHP). The dynamic loader would
>     then be responsible for adding the annotations. Again this feels
>     like something that could be added later if value was demonstrated.
>
>   - Enhance MADV_COLLAPSE to collapse to THP sizes less than PMD-size.
>     This would still require dynamic linker involvement, but would
>     additionally neccessitate a copy and all memory in the range would
>     be synchronously faulted in, adding to application load time. It
>     would work for filesystems that don't support large folios though.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> Hi all,
>
> I originally concocted something similar to this, with Matthew's help, as a
> quick proof of concept hack. Since then I've tried a few different approaches
> but always came back to this as the simplest solution. I expect this will raise
> a few eyebrows but given it is providing a real performance win, I hope we can
> converge to something that can be upstreamed.
>
> This depends on my contpte series to actually set the contiguous bit in the page
> table.
>
> Thanks,
> Ryan
>
>
>  arch/arm64/include/asm/pgtable.h | 12 ++++++++++++
>  include/linux/pgtable.h          | 12 ++++++++++++
>  mm/filemap.c                     | 19 +++++++++++++++++++
>  3 files changed, 43 insertions(+)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index f5bf059291c3..8f8f3f7eb8d8 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1143,6 +1143,18 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
>   */
>  #define arch_wants_old_prefaulted_pte  cpu_has_hw_af
>
> +/*
> + * Request exec memory is read into pagecache in at least 64K folios. The
> + * trade-off here is performance improvement due to storing translations more
> + * effciently in the iTLB vs the potential for read amplification due to reading
> + * data from disk that won't be used. The latter is independent of base page
> + * size, so we set a page-size independent block size of 64K. This size can be
> + * contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB entry),
> + * and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base pages are in
> + * use.
> + */
> +#define arch_wants_exec_folio_order(void) ilog2(SZ_64K >> PAGE_SHIFT)
> +
>  static inline bool pud_sect_supported(void)
>  {
>         return PAGE_SIZE == SZ_4K;
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 170925379534..57090616d09c 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -428,6 +428,18 @@ static inline bool arch_has_hw_pte_young(void)
>  }
>  #endif
>
> +#ifndef arch_wants_exec_folio_order
> +/*
> + * Returns preferred minimum folio order for executable file-backed memory. Must
> + * be in range [0, PMD_ORDER]. Negative value implies that the HW has no
> + * preference and mm will not special-case executable memory in the pagecache.
> + */
> +static inline int arch_wants_exec_folio_order(void)
> +{
> +       return -1;
> +}
> +#endif
> +
>  #ifndef arch_check_zapped_pte
>  static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
>                                          pte_t pte)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 67ba56ecdd32..80a76d755534 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3115,6 +3115,25 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>         }
>  #endif
>
> +       /*
> +        * Allow arch to request a preferred minimum folio order for executable
> +        * memory. This can often be beneficial to performance if (e.g.) arm64
> +        * can contpte-map the folio. Executable memory rarely benefits from
> +        * read-ahead anyway, due to its random access nature.
> +        */
> +       if (vm_flags & VM_EXEC) {
> +               int order = arch_wants_exec_folio_order();
> +
> +               if (order >= 0) {
> +                       fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> +                       ra->size = 1UL << order;
> +                       ra->async_size = 0;
> +                       ractl._index &= ~((unsigned long)ra->size - 1);
> +                       page_cache_ra_order(&ractl, ra, order);
> +                       return fpin;
> +               }
> +       }

I don't know, but most filesystems don't support large mapping,even iomap.
This patch might negatively affect them. i feel we need to check
mapping_large_folio_support() at least.

> +
>         /* If we don't want any read-ahead, don't bother */
>         if (vm_flags & VM_RAND_READ)
>                 return fpin;
> --
> 2.25.1

Thanks
Barry
Ryan Roberts Jan. 12, 2024, 11:15 a.m. UTC | #5
On 12/01/2024 10:13, Barry Song wrote:
> On Fri, Jan 12, 2024 at 4:41 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Change the readahead config so that if it is being requested for an
>> executable mapping, do a synchronous read of an arch-specified size in a
>> naturally aligned manner.
>>
>> On arm64 if memory is physically contiguous and naturally aligned to the
>> "contpte" size, we can use contpte mappings, which improves utilization
>> of the TLB. When paired with the "multi-size THP" changes, this works
>> well to reduce dTLB pressure. However iTLB pressure is still high due to
>> executable mappings having a low liklihood of being in the required
>> folio size and mapping alignment, even when the filesystem supports
>> readahead into large folios (e.g. XFS).
>>
>> The reason for the low liklihood is that the current readahead algorithm
>> starts with an order-2 folio and increases the folio order by 2 every
>> time the readahead mark is hit. But most executable memory is faulted in
>> fairly randomly and so the readahead mark is rarely hit and most
>> executable folios remain order-2. This is observed impirically and
>> confirmed from discussion with a gnu linker expert; in general, the
>> linker does nothing to group temporally accessed text together
>> spacially. Additionally, with the current read-around approach there are
>> no alignment guarrantees between the file and folio. This is
>> insufficient for arm64's contpte mapping requirement (order-4 for 4K
>> base pages).
>>
>> So it seems reasonable to special-case the read(ahead) logic for
>> executable mappings. The trade-off is performance improvement (due to
>> more efficient storage of the translations in iTLB) vs potential read
>> amplification (due to reading too much data around the fault which won't
>> be used), and the latter is independent of base page size. I've chosen
>> 64K folio size for arm64 which benefits both the 4K and 16K base page
>> size configs and shouldn't lead to any further read-amplification since
>> the old read-around path was (usually) reading blocks of 128K (with the
>> last 32K being async).
>>
>> Performance Benchmarking
>> ------------------------
>>
>> The below shows kernel compilation and speedometer javascript benchmarks
>> on Ampere Altra arm64 system. (The contpte patch series is applied in
>> the baseline).
>>
>> First, confirmation that this patch causes more memory to be contained
>> in 64K folios (this is for all file-backed memory so includes
>> non-executable too):
>>
>> | File-backed folios      |   Speedometer   |  Kernel Compile |
>> | by size as percentage   |-----------------|-----------------|
>> | of all mapped file mem  | before |  after | before |  after |
>> |=========================|========|========|========|========|
>> |file-thp-aligned-16kB    |    45% |     9% |    46% |     7% |
>> |file-thp-aligned-32kB    |     2% |     0% |     3% |     1% |
>> |file-thp-aligned-64kB    |     3% |    63% |     5% |    80% |
>> |file-thp-aligned-128kB   |    11% |    11% |     0% |     0% |
>> |file-thp-unaligned-16kB  |     1% |     0% |     3% |     1% |
>> |file-thp-unaligned-128kB |     1% |     0% |     0% |     0% |
>> |file-thp-partial         |     0% |     0% |     0% |     0% |
>> |-------------------------|--------|--------|--------|--------|
>> |file-cont-aligned-64kB   |    16% |    75% |     5% |    80% |
>>
>> The above shows that for both use cases, the amount of file memory
>> backed by 16K folios reduces and the amount backed by 64K folios
>> increases significantly. And the amount of memory that is contpte-mapped
>> significantly increases (last line).
>>
>> And this is reflected in performance improvement:
>>
>> Kernel Compilation (smaller is faster):
>> | kernel   |   real-time |   kern-time |   user-time |   peak memory |
>> |----------|-------------|-------------|-------------|---------------|
>> | before   |        0.0% |        0.0% |        0.0% |          0.0% |
>> | after    |       -1.6% |       -2.1% |       -1.7% |          0.0% |
>>
>> Speedometer (bigger is faster):
>> | kernel   |   runs_per_min |   peak memory |
>> |----------|----------------|---------------|
>> | before   |           0.0% |          0.0% |
>> | after    |           1.3% |          1.0% |
>>
>> Both benchmarks show a ~1.5% improvement once the patch is applied.
>>
>> Alternatives
>> ------------
>>
>> I considered (and rejected for now - but I anticipate this patch will
>> stimulate discussion around what the best approach is) alternative
>> approaches:
>>
>>   - Expose a global user-controlled knob to set the preferred folio
>>     size; this would move policy to user space and allow (e.g.) setting
>>     it to PMD-size for even better iTLB utilizaiton. But this would add
>>     ABI, and I prefer to start with the simplest approach first. It also
>>     has the downside that a change wouldn't apply to memory already in
>>     the page cache that is in active use (e.g. libc) so we don't get the
>>     same level of utilization as for something that is fixed from boot.
>>
>>   - Add a per-vma attribute to allow user space to specify preferred
>>     folio size for memory faulted from the range. (we've talked about
>>     such a control in the context of mTHP). The dynamic loader would
>>     then be responsible for adding the annotations. Again this feels
>>     like something that could be added later if value was demonstrated.
>>
>>   - Enhance MADV_COLLAPSE to collapse to THP sizes less than PMD-size.
>>     This would still require dynamic linker involvement, but would
>>     additionally neccessitate a copy and all memory in the range would
>>     be synchronously faulted in, adding to application load time. It
>>     would work for filesystems that don't support large folios though.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>
>> Hi all,
>>
>> I originally concocted something similar to this, with Matthew's help, as a
>> quick proof of concept hack. Since then I've tried a few different approaches
>> but always came back to this as the simplest solution. I expect this will raise
>> a few eyebrows but given it is providing a real performance win, I hope we can
>> converge to something that can be upstreamed.
>>
>> This depends on my contpte series to actually set the contiguous bit in the page
>> table.
>>
>> Thanks,
>> Ryan
>>
>>
>>  arch/arm64/include/asm/pgtable.h | 12 ++++++++++++
>>  include/linux/pgtable.h          | 12 ++++++++++++
>>  mm/filemap.c                     | 19 +++++++++++++++++++
>>  3 files changed, 43 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index f5bf059291c3..8f8f3f7eb8d8 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1143,6 +1143,18 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
>>   */
>>  #define arch_wants_old_prefaulted_pte  cpu_has_hw_af
>>
>> +/*
>> + * Request exec memory is read into pagecache in at least 64K folios. The
>> + * trade-off here is performance improvement due to storing translations more
>> + * effciently in the iTLB vs the potential for read amplification due to reading
>> + * data from disk that won't be used. The latter is independent of base page
>> + * size, so we set a page-size independent block size of 64K. This size can be
>> + * contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB entry),
>> + * and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base pages are in
>> + * use.
>> + */
>> +#define arch_wants_exec_folio_order(void) ilog2(SZ_64K >> PAGE_SHIFT)
>> +
>>  static inline bool pud_sect_supported(void)
>>  {
>>         return PAGE_SIZE == SZ_4K;
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 170925379534..57090616d09c 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -428,6 +428,18 @@ static inline bool arch_has_hw_pte_young(void)
>>  }
>>  #endif
>>
>> +#ifndef arch_wants_exec_folio_order
>> +/*
>> + * Returns preferred minimum folio order for executable file-backed memory. Must
>> + * be in range [0, PMD_ORDER]. Negative value implies that the HW has no
>> + * preference and mm will not special-case executable memory in the pagecache.
>> + */
>> +static inline int arch_wants_exec_folio_order(void)
>> +{
>> +       return -1;
>> +}
>> +#endif
>> +
>>  #ifndef arch_check_zapped_pte
>>  static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
>>                                          pte_t pte)
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 67ba56ecdd32..80a76d755534 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3115,6 +3115,25 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>>         }
>>  #endif
>>
>> +       /*
>> +        * Allow arch to request a preferred minimum folio order for executable
>> +        * memory. This can often be beneficial to performance if (e.g.) arm64
>> +        * can contpte-map the folio. Executable memory rarely benefits from
>> +        * read-ahead anyway, due to its random access nature.
>> +        */
>> +       if (vm_flags & VM_EXEC) {
>> +               int order = arch_wants_exec_folio_order();
>> +
>> +               if (order >= 0) {
>> +                       fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>> +                       ra->size = 1UL << order;
>> +                       ra->async_size = 0;
>> +                       ractl._index &= ~((unsigned long)ra->size - 1);
>> +                       page_cache_ra_order(&ractl, ra, order);
>> +                       return fpin;
>> +               }
>> +       }
> 
> I don't know, but most filesystems don't support large mapping,even iomap.

True, but more are coming. For example ext4 is in the works:
https://lore.kernel.org/all/20240102123918.799062-1-yi.zhang@huaweicloud.com/

> This patch might negatively affect them. i feel we need to check
> mapping_large_folio_support() at least.

page_cache_ra_order() does this check and falls back to small folios if needed.
So correctness-wise it all works out. I guess your concern is performance due to
effectively removing the async readahead aspect? But if that is a problem, then
it's not just a problem if we are reading small folios, so I don't think the
proposed check is correct.

Perhaps an alternative would be to double ra->size and set ra->async_size to
(ra->size / 2)? That would ensure we always have 64K aligned blocks but would
give us an async portion so readahead can still happen.

I don't feel very expert with this area of the code so I might be talking
rubbish - would be great to hear from others.

> 
>> +
>>         /* If we don't want any read-ahead, don't bother */
>>         if (vm_flags & VM_RAND_READ)
>>                 return fpin;
>> --
>> 2.25.1
> 
> Thanks
> Barry
Barry Song Jan. 12, 2024, 10:13 p.m. UTC | #6
On Sat, Jan 13, 2024 at 12:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 12/01/2024 10:13, Barry Song wrote:
> > On Fri, Jan 12, 2024 at 4:41 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Change the readahead config so that if it is being requested for an
> >> executable mapping, do a synchronous read of an arch-specified size in a
> >> naturally aligned manner.
> >>
> >> On arm64 if memory is physically contiguous and naturally aligned to the
> >> "contpte" size, we can use contpte mappings, which improves utilization
> >> of the TLB. When paired with the "multi-size THP" changes, this works
> >> well to reduce dTLB pressure. However iTLB pressure is still high due to
> >> executable mappings having a low liklihood of being in the required
> >> folio size and mapping alignment, even when the filesystem supports
> >> readahead into large folios (e.g. XFS).
> >>
> >> The reason for the low liklihood is that the current readahead algorithm
> >> starts with an order-2 folio and increases the folio order by 2 every
> >> time the readahead mark is hit. But most executable memory is faulted in
> >> fairly randomly and so the readahead mark is rarely hit and most
> >> executable folios remain order-2. This is observed impirically and
> >> confirmed from discussion with a gnu linker expert; in general, the
> >> linker does nothing to group temporally accessed text together
> >> spacially. Additionally, with the current read-around approach there are
> >> no alignment guarrantees between the file and folio. This is
> >> insufficient for arm64's contpte mapping requirement (order-4 for 4K
> >> base pages).
> >>
> >> So it seems reasonable to special-case the read(ahead) logic for
> >> executable mappings. The trade-off is performance improvement (due to
> >> more efficient storage of the translations in iTLB) vs potential read
> >> amplification (due to reading too much data around the fault which won't
> >> be used), and the latter is independent of base page size. I've chosen
> >> 64K folio size for arm64 which benefits both the 4K and 16K base page
> >> size configs and shouldn't lead to any further read-amplification since
> >> the old read-around path was (usually) reading blocks of 128K (with the
> >> last 32K being async).
> >>
> >> Performance Benchmarking
> >> ------------------------
> >>
> >> The below shows kernel compilation and speedometer javascript benchmarks
> >> on Ampere Altra arm64 system. (The contpte patch series is applied in
> >> the baseline).
> >>
> >> First, confirmation that this patch causes more memory to be contained
> >> in 64K folios (this is for all file-backed memory so includes
> >> non-executable too):
> >>
> >> | File-backed folios      |   Speedometer   |  Kernel Compile |
> >> | by size as percentage   |-----------------|-----------------|
> >> | of all mapped file mem  | before |  after | before |  after |
> >> |=========================|========|========|========|========|
> >> |file-thp-aligned-16kB    |    45% |     9% |    46% |     7% |
> >> |file-thp-aligned-32kB    |     2% |     0% |     3% |     1% |
> >> |file-thp-aligned-64kB    |     3% |    63% |     5% |    80% |
> >> |file-thp-aligned-128kB   |    11% |    11% |     0% |     0% |
> >> |file-thp-unaligned-16kB  |     1% |     0% |     3% |     1% |
> >> |file-thp-unaligned-128kB |     1% |     0% |     0% |     0% |
> >> |file-thp-partial         |     0% |     0% |     0% |     0% |
> >> |-------------------------|--------|--------|--------|--------|
> >> |file-cont-aligned-64kB   |    16% |    75% |     5% |    80% |
> >>
> >> The above shows that for both use cases, the amount of file memory
> >> backed by 16K folios reduces and the amount backed by 64K folios
> >> increases significantly. And the amount of memory that is contpte-mapped
> >> significantly increases (last line).
> >>
> >> And this is reflected in performance improvement:
> >>
> >> Kernel Compilation (smaller is faster):
> >> | kernel   |   real-time |   kern-time |   user-time |   peak memory |
> >> |----------|-------------|-------------|-------------|---------------|
> >> | before   |        0.0% |        0.0% |        0.0% |          0.0% |
> >> | after    |       -1.6% |       -2.1% |       -1.7% |          0.0% |
> >>
> >> Speedometer (bigger is faster):
> >> | kernel   |   runs_per_min |   peak memory |
> >> |----------|----------------|---------------|
> >> | before   |           0.0% |          0.0% |
> >> | after    |           1.3% |          1.0% |
> >>
> >> Both benchmarks show a ~1.5% improvement once the patch is applied.
> >>
> >> Alternatives
> >> ------------
> >>
> >> I considered (and rejected for now - but I anticipate this patch will
> >> stimulate discussion around what the best approach is) alternative
> >> approaches:
> >>
> >>   - Expose a global user-controlled knob to set the preferred folio
> >>     size; this would move policy to user space and allow (e.g.) setting
> >>     it to PMD-size for even better iTLB utilizaiton. But this would add
> >>     ABI, and I prefer to start with the simplest approach first. It also
> >>     has the downside that a change wouldn't apply to memory already in
> >>     the page cache that is in active use (e.g. libc) so we don't get the
> >>     same level of utilization as for something that is fixed from boot.
> >>
> >>   - Add a per-vma attribute to allow user space to specify preferred
> >>     folio size for memory faulted from the range. (we've talked about
> >>     such a control in the context of mTHP). The dynamic loader would
> >>     then be responsible for adding the annotations. Again this feels
> >>     like something that could be added later if value was demonstrated.
> >>
> >>   - Enhance MADV_COLLAPSE to collapse to THP sizes less than PMD-size.
> >>     This would still require dynamic linker involvement, but would
> >>     additionally neccessitate a copy and all memory in the range would
> >>     be synchronously faulted in, adding to application load time. It
> >>     would work for filesystems that don't support large folios though.
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>
> >> Hi all,
> >>
> >> I originally concocted something similar to this, with Matthew's help, as a
> >> quick proof of concept hack. Since then I've tried a few different approaches
> >> but always came back to this as the simplest solution. I expect this will raise
> >> a few eyebrows but given it is providing a real performance win, I hope we can
> >> converge to something that can be upstreamed.
> >>
> >> This depends on my contpte series to actually set the contiguous bit in the page
> >> table.
> >>
> >> Thanks,
> >> Ryan
> >>
> >>
> >>  arch/arm64/include/asm/pgtable.h | 12 ++++++++++++
> >>  include/linux/pgtable.h          | 12 ++++++++++++
> >>  mm/filemap.c                     | 19 +++++++++++++++++++
> >>  3 files changed, 43 insertions(+)
> >>
> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >> index f5bf059291c3..8f8f3f7eb8d8 100644
> >> --- a/arch/arm64/include/asm/pgtable.h
> >> +++ b/arch/arm64/include/asm/pgtable.h
> >> @@ -1143,6 +1143,18 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
> >>   */
> >>  #define arch_wants_old_prefaulted_pte  cpu_has_hw_af
> >>
> >> +/*
> >> + * Request exec memory is read into pagecache in at least 64K folios. The
> >> + * trade-off here is performance improvement due to storing translations more
> >> + * effciently in the iTLB vs the potential for read amplification due to reading
> >> + * data from disk that won't be used. The latter is independent of base page
> >> + * size, so we set a page-size independent block size of 64K. This size can be
> >> + * contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB entry),
> >> + * and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base pages are in
> >> + * use.
> >> + */
> >> +#define arch_wants_exec_folio_order(void) ilog2(SZ_64K >> PAGE_SHIFT)
> >> +
> >>  static inline bool pud_sect_supported(void)
> >>  {
> >>         return PAGE_SIZE == SZ_4K;
> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >> index 170925379534..57090616d09c 100644
> >> --- a/include/linux/pgtable.h
> >> +++ b/include/linux/pgtable.h
> >> @@ -428,6 +428,18 @@ static inline bool arch_has_hw_pte_young(void)
> >>  }
> >>  #endif
> >>
> >> +#ifndef arch_wants_exec_folio_order
> >> +/*
> >> + * Returns preferred minimum folio order for executable file-backed memory. Must
> >> + * be in range [0, PMD_ORDER]. Negative value implies that the HW has no
> >> + * preference and mm will not special-case executable memory in the pagecache.
> >> + */
> >> +static inline int arch_wants_exec_folio_order(void)
> >> +{
> >> +       return -1;
> >> +}
> >> +#endif
> >> +
> >>  #ifndef arch_check_zapped_pte
> >>  static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
> >>                                          pte_t pte)
> >> diff --git a/mm/filemap.c b/mm/filemap.c
> >> index 67ba56ecdd32..80a76d755534 100644
> >> --- a/mm/filemap.c
> >> +++ b/mm/filemap.c
> >> @@ -3115,6 +3115,25 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> >>         }
> >>  #endif
> >>
> >> +       /*
> >> +        * Allow arch to request a preferred minimum folio order for executable
> >> +        * memory. This can often be beneficial to performance if (e.g.) arm64
> >> +        * can contpte-map the folio. Executable memory rarely benefits from
> >> +        * read-ahead anyway, due to its random access nature.
> >> +        */
> >> +       if (vm_flags & VM_EXEC) {
> >> +               int order = arch_wants_exec_folio_order();
> >> +
> >> +               if (order >= 0) {
> >> +                       fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> >> +                       ra->size = 1UL << order;
> >> +                       ra->async_size = 0;
> >> +                       ractl._index &= ~((unsigned long)ra->size - 1);
> >> +                       page_cache_ra_order(&ractl, ra, order);
> >> +                       return fpin;
> >> +               }
> >> +       }
> >
> > I don't know, but most filesystems don't support large mapping,even iomap.
>
> True, but more are coming. For example ext4 is in the works:
> https://lore.kernel.org/all/20240102123918.799062-1-yi.zhang@huaweicloud.com/

right, hopefully more filesystems will join.

>
> > This patch might negatively affect them. i feel we need to check
> > mapping_large_folio_support() at least.
>
> page_cache_ra_order() does this check and falls back to small folios if needed.
> So correctness-wise it all works out. I guess your concern is performance due to
> effectively removing the async readahead aspect? But if that is a problem, then
> it's not just a problem if we are reading small folios, so I don't think the
> proposed check is correct.

My point is that this patch is actually changing two things.
1. readahead index/size and async_size=0
2. try to use CONT-PTE

for filesystems which support large mapping, we are getting 2 to help
improve performance; for filesystems without large_mapping, 1 has
been changed from losing read-around,
        /*
         * mmap read-around
         */
        fpin = maybe_unlock_mmap_for_io(vmf, fpin);
        ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
        ra->size = ra->ra_pages;
        ra->async_size = ra->ra_pages / 4;
        ractl._index = ra->start;
        page_cache_ra_order(&ractl, ra, 0);

We probably need data to prove this makes no regression. otherwise,
it is safer to let the code have no side effects on other file systems if
we haven't data.

>
> Perhaps an alternative would be to double ra->size and set ra->async_size to
> (ra->size / 2)? That would ensure we always have 64K aligned blocks but would
> give us an async portion so readahead can still happen.

this might be worth to try as PMD is exactly doing this because async
can decrease
the latency of subsequent page faults.

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
        /* Use the readahead code, even if readahead is disabled */
        if (vm_flags & VM_HUGEPAGE) {
                fpin = maybe_unlock_mmap_for_io(vmf, fpin);
                ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
                ra->size = HPAGE_PMD_NR;
                /*
                 * Fetch two PMD folios, so we get the chance to actually
                 * readahead, unless we've been told not to.
                 */
                if (!(vm_flags & VM_RAND_READ))
                        ra->size *= 2;
                ra->async_size = HPAGE_PMD_NR;
                page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
                return fpin;
        }
#endif

>
> I don't feel very expert with this area of the code so I might be talking
> rubbish - would be great to hear from others.
>
> >
> >> +
> >>         /* If we don't want any read-ahead, don't bother */
> >>         if (vm_flags & VM_RAND_READ)
> >>                 return fpin;
> >> --
> >> 2.25.1
> >

BTW, is it also possible that user space also wants to map some data
as cont-pte hugepage? just like we have a strong VM_HUGEPAGE
flag for PMD THP?

Thanks
Barry
Barry Song Jan. 12, 2024, 10:54 p.m. UTC | #7
On Sat, Jan 13, 2024 at 11:13 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, Jan 13, 2024 at 12:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > On 12/01/2024 10:13, Barry Song wrote:
> > > On Fri, Jan 12, 2024 at 4:41 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> > >>
> > >> Change the readahead config so that if it is being requested for an
> > >> executable mapping, do a synchronous read of an arch-specified size in a
> > >> naturally aligned manner.
> > >>
> > >> On arm64 if memory is physically contiguous and naturally aligned to the
> > >> "contpte" size, we can use contpte mappings, which improves utilization
> > >> of the TLB. When paired with the "multi-size THP" changes, this works
> > >> well to reduce dTLB pressure. However iTLB pressure is still high due to
> > >> executable mappings having a low liklihood of being in the required
> > >> folio size and mapping alignment, even when the filesystem supports
> > >> readahead into large folios (e.g. XFS).
> > >>
> > >> The reason for the low liklihood is that the current readahead algorithm
> > >> starts with an order-2 folio and increases the folio order by 2 every
> > >> time the readahead mark is hit. But most executable memory is faulted in
> > >> fairly randomly and so the readahead mark is rarely hit and most
> > >> executable folios remain order-2. This is observed impirically and
> > >> confirmed from discussion with a gnu linker expert; in general, the
> > >> linker does nothing to group temporally accessed text together
> > >> spacially. Additionally, with the current read-around approach there are
> > >> no alignment guarrantees between the file and folio. This is
> > >> insufficient for arm64's contpte mapping requirement (order-4 for 4K
> > >> base pages).
> > >>
> > >> So it seems reasonable to special-case the read(ahead) logic for
> > >> executable mappings. The trade-off is performance improvement (due to
> > >> more efficient storage of the translations in iTLB) vs potential read
> > >> amplification (due to reading too much data around the fault which won't
> > >> be used), and the latter is independent of base page size. I've chosen
> > >> 64K folio size for arm64 which benefits both the 4K and 16K base page
> > >> size configs and shouldn't lead to any further read-amplification since
> > >> the old read-around path was (usually) reading blocks of 128K (with the
> > >> last 32K being async).
> > >>
> > >> Performance Benchmarking
> > >> ------------------------
> > >>
> > >> The below shows kernel compilation and speedometer javascript benchmarks
> > >> on Ampere Altra arm64 system. (The contpte patch series is applied in
> > >> the baseline).
> > >>
> > >> First, confirmation that this patch causes more memory to be contained
> > >> in 64K folios (this is for all file-backed memory so includes
> > >> non-executable too):
> > >>
> > >> | File-backed folios      |   Speedometer   |  Kernel Compile |
> > >> | by size as percentage   |-----------------|-----------------|
> > >> | of all mapped file mem  | before |  after | before |  after |
> > >> |=========================|========|========|========|========|
> > >> |file-thp-aligned-16kB    |    45% |     9% |    46% |     7% |
> > >> |file-thp-aligned-32kB    |     2% |     0% |     3% |     1% |
> > >> |file-thp-aligned-64kB    |     3% |    63% |     5% |    80% |
> > >> |file-thp-aligned-128kB   |    11% |    11% |     0% |     0% |
> > >> |file-thp-unaligned-16kB  |     1% |     0% |     3% |     1% |
> > >> |file-thp-unaligned-128kB |     1% |     0% |     0% |     0% |
> > >> |file-thp-partial         |     0% |     0% |     0% |     0% |
> > >> |-------------------------|--------|--------|--------|--------|
> > >> |file-cont-aligned-64kB   |    16% |    75% |     5% |    80% |
> > >>
> > >> The above shows that for both use cases, the amount of file memory
> > >> backed by 16K folios reduces and the amount backed by 64K folios
> > >> increases significantly. And the amount of memory that is contpte-mapped
> > >> significantly increases (last line).
> > >>
> > >> And this is reflected in performance improvement:
> > >>
> > >> Kernel Compilation (smaller is faster):
> > >> | kernel   |   real-time |   kern-time |   user-time |   peak memory |
> > >> |----------|-------------|-------------|-------------|---------------|
> > >> | before   |        0.0% |        0.0% |        0.0% |          0.0% |
> > >> | after    |       -1.6% |       -2.1% |       -1.7% |          0.0% |
> > >>
> > >> Speedometer (bigger is faster):
> > >> | kernel   |   runs_per_min |   peak memory |
> > >> |----------|----------------|---------------|
> > >> | before   |           0.0% |          0.0% |
> > >> | after    |           1.3% |          1.0% |
> > >>
> > >> Both benchmarks show a ~1.5% improvement once the patch is applied.
> > >>
> > >> Alternatives
> > >> ------------
> > >>
> > >> I considered (and rejected for now - but I anticipate this patch will
> > >> stimulate discussion around what the best approach is) alternative
> > >> approaches:
> > >>
> > >>   - Expose a global user-controlled knob to set the preferred folio
> > >>     size; this would move policy to user space and allow (e.g.) setting
> > >>     it to PMD-size for even better iTLB utilizaiton. But this would add
> > >>     ABI, and I prefer to start with the simplest approach first. It also
> > >>     has the downside that a change wouldn't apply to memory already in
> > >>     the page cache that is in active use (e.g. libc) so we don't get the
> > >>     same level of utilization as for something that is fixed from boot.
> > >>
> > >>   - Add a per-vma attribute to allow user space to specify preferred
> > >>     folio size for memory faulted from the range. (we've talked about
> > >>     such a control in the context of mTHP). The dynamic loader would
> > >>     then be responsible for adding the annotations. Again this feels
> > >>     like something that could be added later if value was demonstrated.
> > >>
> > >>   - Enhance MADV_COLLAPSE to collapse to THP sizes less than PMD-size.
> > >>     This would still require dynamic linker involvement, but would
> > >>     additionally neccessitate a copy and all memory in the range would
> > >>     be synchronously faulted in, adding to application load time. It
> > >>     would work for filesystems that don't support large folios though.
> > >>
> > >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> > >> ---
> > >>
> > >> Hi all,
> > >>
> > >> I originally concocted something similar to this, with Matthew's help, as a
> > >> quick proof of concept hack. Since then I've tried a few different approaches
> > >> but always came back to this as the simplest solution. I expect this will raise
> > >> a few eyebrows but given it is providing a real performance win, I hope we can
> > >> converge to something that can be upstreamed.
> > >>
> > >> This depends on my contpte series to actually set the contiguous bit in the page
> > >> table.
> > >>
> > >> Thanks,
> > >> Ryan
> > >>
> > >>
> > >>  arch/arm64/include/asm/pgtable.h | 12 ++++++++++++
> > >>  include/linux/pgtable.h          | 12 ++++++++++++
> > >>  mm/filemap.c                     | 19 +++++++++++++++++++
> > >>  3 files changed, 43 insertions(+)
> > >>
> > >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > >> index f5bf059291c3..8f8f3f7eb8d8 100644
> > >> --- a/arch/arm64/include/asm/pgtable.h
> > >> +++ b/arch/arm64/include/asm/pgtable.h
> > >> @@ -1143,6 +1143,18 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
> > >>   */
> > >>  #define arch_wants_old_prefaulted_pte  cpu_has_hw_af
> > >>
> > >> +/*
> > >> + * Request exec memory is read into pagecache in at least 64K folios. The
> > >> + * trade-off here is performance improvement due to storing translations more
> > >> + * effciently in the iTLB vs the potential for read amplification due to reading
> > >> + * data from disk that won't be used. The latter is independent of base page
> > >> + * size, so we set a page-size independent block size of 64K. This size can be
> > >> + * contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB entry),
> > >> + * and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base pages are in
> > >> + * use.
> > >> + */
> > >> +#define arch_wants_exec_folio_order(void) ilog2(SZ_64K >> PAGE_SHIFT)
> > >> +
> > >>  static inline bool pud_sect_supported(void)
> > >>  {
> > >>         return PAGE_SIZE == SZ_4K;
> > >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > >> index 170925379534..57090616d09c 100644
> > >> --- a/include/linux/pgtable.h
> > >> +++ b/include/linux/pgtable.h
> > >> @@ -428,6 +428,18 @@ static inline bool arch_has_hw_pte_young(void)
> > >>  }
> > >>  #endif
> > >>
> > >> +#ifndef arch_wants_exec_folio_order
> > >> +/*
> > >> + * Returns preferred minimum folio order for executable file-backed memory. Must
> > >> + * be in range [0, PMD_ORDER]. Negative value implies that the HW has no
> > >> + * preference and mm will not special-case executable memory in the pagecache.
> > >> + */
> > >> +static inline int arch_wants_exec_folio_order(void)
> > >> +{
> > >> +       return -1;
> > >> +}
> > >> +#endif
> > >> +
> > >>  #ifndef arch_check_zapped_pte
> > >>  static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
> > >>                                          pte_t pte)
> > >> diff --git a/mm/filemap.c b/mm/filemap.c
> > >> index 67ba56ecdd32..80a76d755534 100644
> > >> --- a/mm/filemap.c
> > >> +++ b/mm/filemap.c
> > >> @@ -3115,6 +3115,25 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> > >>         }
> > >>  #endif
> > >>
> > >> +       /*
> > >> +        * Allow arch to request a preferred minimum folio order for executable
> > >> +        * memory. This can often be beneficial to performance if (e.g.) arm64
> > >> +        * can contpte-map the folio. Executable memory rarely benefits from
> > >> +        * read-ahead anyway, due to its random access nature.
> > >> +        */
> > >> +       if (vm_flags & VM_EXEC) {
> > >> +               int order = arch_wants_exec_folio_order();
> > >> +
> > >> +               if (order >= 0) {
> > >> +                       fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> > >> +                       ra->size = 1UL << order;
> > >> +                       ra->async_size = 0;
> > >> +                       ractl._index &= ~((unsigned long)ra->size - 1);
> > >> +                       page_cache_ra_order(&ractl, ra, order);
> > >> +                       return fpin;
> > >> +               }
> > >> +       }
> > >
> > > I don't know, but most filesystems don't support large mapping,even iomap.
> >
> > True, but more are coming. For example ext4 is in the works:
> > https://lore.kernel.org/all/20240102123918.799062-1-yi.zhang@huaweicloud.com/
>
> right, hopefully more filesystems will join.
>
> >
> > > This patch might negatively affect them. i feel we need to check
> > > mapping_large_folio_support() at least.
> >
> > page_cache_ra_order() does this check and falls back to small folios if needed.
> > So correctness-wise it all works out. I guess your concern is performance due to
> > effectively removing the async readahead aspect? But if that is a problem, then
> > it's not just a problem if we are reading small folios, so I don't think the
> > proposed check is correct.
>
> My point is that this patch is actually changing two things.
> 1. readahead index/size and async_size=0
> 2. try to use CONT-PTE
>
> for filesystems which support large mapping, we are getting 2 to help
> improve performance; for filesystems without large_mapping, 1 has
> been changed from losing read-around,
>         /*
>          * mmap read-around
>          */
>         fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>         ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
>         ra->size = ra->ra_pages;
>         ra->async_size = ra->ra_pages / 4;
>         ractl._index = ra->start;
>         page_cache_ra_order(&ractl, ra, 0);
>
> We probably need data to prove this makes no regression. otherwise,
> it is safer to let the code have no side effects on other file systems if
> we haven't data.
>
> >
> > Perhaps an alternative would be to double ra->size and set ra->async_size to
> > (ra->size / 2)? That would ensure we always have 64K aligned blocks but would
> > give us an async portion so readahead can still happen.
>
> this might be worth to try as PMD is exactly doing this because async
> can decrease
> the latency of subsequent page faults.
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>         /* Use the readahead code, even if readahead is disabled */
>         if (vm_flags & VM_HUGEPAGE) {
>                 fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>                 ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
>                 ra->size = HPAGE_PMD_NR;
>                 /*
>                  * Fetch two PMD folios, so we get the chance to actually
>                  * readahead, unless we've been told not to.
>                  */
>                 if (!(vm_flags & VM_RAND_READ))
>                         ra->size *= 2;
>                 ra->async_size = HPAGE_PMD_NR;
>                 page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
>                 return fpin;
>         }
> #endif
>

BTW, rather than simply always reading backwards,  we did something very
"ugly" to simulate "read-around" for CONT-PTE exec before[1]

if page faults happen in the first half of cont-pte, we read this 64KiB
and its previous 64KiB. otherwise, we read it and its next 64KiB.

struct file *do_cont_pte_sync_mmap_readahead(struct vm_fault *vmf)
{
        ...
        unsigned long haddr = vmf->address & HPAGE_CONT_PTE_MASK;
        ...

        if (vmf->address <= haddr + HPAGE_CONT_PTE_SIZE / 2) {
                /* Readahead a hugepage forward */
                if (haddr - HPAGE_CONT_PTE_SIZE < vmf->vma->vm_start)
                        goto no_readahead;
                ra->start = ALIGN_DOWN(vmf->pgoff, HPAGE_CONT_PTE_NR)
- HPAGE_CONT_PTE_NR;
        } else {
                /* Readahead a hugepage backwards */
                if (haddr + 2 * HPAGE_CONT_PTE_SIZE > vmf->vma->vm_end)
                        goto no_readahead;
                ra->start = ALIGN_DOWN(vmf->pgoff, HPAGE_CONT_PTE_NR);
        }
        ra->size = 2 * HPAGE_CONT_PTE_NR;
         ...
}

the reason is if we are faulting in the 0nd and 1st subpage at address X,
it seems more sensible to read
aligned(X)- 64KiB           ~                 aligned(X) + 64KiB
than
aligned(X)         ~                 aligned(X) + 128KiB
?

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c#L2338

Thanks
Barry
Matthew Wilcox Jan. 12, 2024, 11:04 p.m. UTC | #8
On Sat, Jan 13, 2024 at 11:54:23AM +1300, Barry Song wrote:
> > > Perhaps an alternative would be to double ra->size and set ra->async_size to
> > > (ra->size / 2)? That would ensure we always have 64K aligned blocks but would
> > > give us an async portion so readahead can still happen.
> >
> > this might be worth to try as PMD is exactly doing this because async
> > can decrease
> > the latency of subsequent page faults.
> >
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >         /* Use the readahead code, even if readahead is disabled */
> >         if (vm_flags & VM_HUGEPAGE) {
> >                 fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> >                 ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
> >                 ra->size = HPAGE_PMD_NR;
> >                 /*
> >                  * Fetch two PMD folios, so we get the chance to actually
> >                  * readahead, unless we've been told not to.
> >                  */
> >                 if (!(vm_flags & VM_RAND_READ))
> >                         ra->size *= 2;
> >                 ra->async_size = HPAGE_PMD_NR;
> >                 page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
> >                 return fpin;
> >         }
> > #endif
> >
> 
> BTW, rather than simply always reading backwards,  we did something very
> "ugly" to simulate "read-around" for CONT-PTE exec before[1]
> 
> if page faults happen in the first half of cont-pte, we read this 64KiB
> and its previous 64KiB. otherwise, we read it and its next 64KiB.

I don't think that makes sense.  The CPU executes instructions forwards,
not "around".  I honestly think we should treat text as "random access"
because function A calls function B and functions A and B might well be
very far apart from each other.  The only time I see you actually
getting "readahead" hits is if a function is split across two pages (for
whatever size of page), but that's a false hit!  The function is not,
generally, 64kB long, so doing readahead is no more likely to bring in
the next page of text that we want than reading any other random page.

Unless somebody finds the GNU Rope source code from 1998, or recreates it:
https://lwn.net/1998/1029/als/rope.html
Then we might actually have some locality.

Did you actually benchmark what you did?  Is there really some locality
between the code at offset 256-288kB in the file and then in the range
192kB-256kB?
Barry Song Jan. 13, 2024, 12:11 a.m. UTC | #9
On Sat, Jan 13, 2024 at 12:04 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, Jan 13, 2024 at 11:54:23AM +1300, Barry Song wrote:
> > > > Perhaps an alternative would be to double ra->size and set ra->async_size to
> > > > (ra->size / 2)? That would ensure we always have 64K aligned blocks but would
> > > > give us an async portion so readahead can still happen.
> > >
> > > this might be worth to try as PMD is exactly doing this because async
> > > can decrease
> > > the latency of subsequent page faults.
> > >
> > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > >         /* Use the readahead code, even if readahead is disabled */
> > >         if (vm_flags & VM_HUGEPAGE) {
> > >                 fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> > >                 ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
> > >                 ra->size = HPAGE_PMD_NR;
> > >                 /*
> > >                  * Fetch two PMD folios, so we get the chance to actually
> > >                  * readahead, unless we've been told not to.
> > >                  */
> > >                 if (!(vm_flags & VM_RAND_READ))
> > >                         ra->size *= 2;
> > >                 ra->async_size = HPAGE_PMD_NR;
> > >                 page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
> > >                 return fpin;
> > >         }
> > > #endif
> > >
> >
> > BTW, rather than simply always reading backwards,  we did something very
> > "ugly" to simulate "read-around" for CONT-PTE exec before[1]
> >
> > if page faults happen in the first half of cont-pte, we read this 64KiB
> > and its previous 64KiB. otherwise, we read it and its next 64KiB.
>
> I don't think that makes sense.  The CPU executes instructions forwards,
> not "around".  I honestly think we should treat text as "random access"
> because function A calls function B and functions A and B might well be
> very far apart from each other.  The only time I see you actually
> getting "readahead" hits is if a function is split across two pages (for
> whatever size of page), but that's a false hit!  The function is not,
> generally, 64kB long, so doing readahead is no more likely to bring in
> the next page of text that we want than reading any other random page.
>

it seems you are in favor of Ryan's modification even for filesystems
which don't support large mapping?

> Unless somebody finds the GNU Rope source code from 1998, or recreates it:
> https://lwn.net/1998/1029/als/rope.html
> Then we might actually have some locality.
>
> Did you actually benchmark what you did?  Is there really some locality
> between the code at offset 256-288kB in the file and then in the range
> 192kB-256kB?

I really didn't have benchmark data, at that point I was like,
instinctively didn’t
want to break the logic of read-around, so made the code just that.
The info your provide makes me re-think if the read-around code is necessary,
thanks!

was using filesystems without large-mapping support but worked around
the problem by
1. preparing 16*n normals pages
2. insert normal pages into xa
3. let filesystem read 16 normal pages
4. after all IO completion, transform 16 pages into mTHP and reinsert
mTHP to xa

that was very painful and finally made no improvement probably because
of due to various sync overhead. so  ran away and didn't dig more data.

Thanks
Barry
Ryan Roberts Jan. 15, 2024, 9:33 a.m. UTC | #10
On 13/01/2024 00:11, Barry Song wrote:
> On Sat, Jan 13, 2024 at 12:04 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Sat, Jan 13, 2024 at 11:54:23AM +1300, Barry Song wrote:
>>>>> Perhaps an alternative would be to double ra->size and set ra->async_size to
>>>>> (ra->size / 2)? That would ensure we always have 64K aligned blocks but would
>>>>> give us an async portion so readahead can still happen.
>>>>
>>>> this might be worth to try as PMD is exactly doing this because async
>>>> can decrease
>>>> the latency of subsequent page faults.
>>>>
>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>         /* Use the readahead code, even if readahead is disabled */
>>>>         if (vm_flags & VM_HUGEPAGE) {
>>>>                 fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>>>>                 ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
>>>>                 ra->size = HPAGE_PMD_NR;
>>>>                 /*
>>>>                  * Fetch two PMD folios, so we get the chance to actually
>>>>                  * readahead, unless we've been told not to.
>>>>                  */
>>>>                 if (!(vm_flags & VM_RAND_READ))
>>>>                         ra->size *= 2;
>>>>                 ra->async_size = HPAGE_PMD_NR;
>>>>                 page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
>>>>                 return fpin;
>>>>         }
>>>> #endif
>>>>
>>>
>>> BTW, rather than simply always reading backwards,  we did something very
>>> "ugly" to simulate "read-around" for CONT-PTE exec before[1]
>>>
>>> if page faults happen in the first half of cont-pte, we read this 64KiB
>>> and its previous 64KiB. otherwise, we read it and its next 64KiB.

I actually tried something very similar to this while prototyping. I found that
it was about 10% less effective at getting text into 64K folios as the approach
I posted. I didn't investigate why, as I came to the conclusion that text
unlikely benefits from readahead anyway.

>>
>> I don't think that makes sense.  The CPU executes instructions forwards,
>> not "around".  I honestly think we should treat text as "random access"
>> because function A calls function B and functions A and B might well be
>> very far apart from each other.  The only time I see you actually
>> getting "readahead" hits is if a function is split across two pages (for
>> whatever size of page), but that's a false hit!  The function is not,
>> generally, 64kB long, so doing readahead is no more likely to bring in
>> the next page of text that we want than reading any other random page.
>>
> 
> it seems you are in favor of Ryan's modification even for filesystems
> which don't support large mapping?
> 
>> Unless somebody finds the GNU Rope source code from 1998, or recreates it:
>> https://lwn.net/1998/1029/als/rope.html
>> Then we might actually have some locality.
>>
>> Did you actually benchmark what you did?  Is there really some locality
>> between the code at offset 256-288kB in the file and then in the range
>> 192kB-256kB?
> 
> I really didn't have benchmark data, at that point I was like,
> instinctively didn’t
> want to break the logic of read-around, so made the code just that.
> The info your provide makes me re-think if the read-around code is necessary,
> thanks!

As a quick experiment, I modified my thpmaps script to collect data *only* for
executable mappings. This is run *without* my change:

| File-backed exec folios |    Speedometer | Kernel Compile |
|=========================|================|================|
|file-thp-aligned-16kB    |            56% |            46% |
|file-thp-aligned-32kB    |             2% |             3% |
|file-thp-aligned-64kB    |             4% |             5% |
|file-thp-unaligned-16kB  |             0% |             3% |
|file-thp-unaligned-128kB |             2% |             0% |
|file-thp-partial         |             0% |             0% |

It's implied that the rest of the memory (up to 100%) is small (single page)
folios. I think the only reason we would see small folios is if we would
otherwise run off the end of the file?

If so, then I think any text in folios > 16K is a rough proxy for how effective
readahead is for text: Not very.

Intuitively, I agree with Matthew that readahead doesn't make much sense for
text, and this rough data seems to agree.


> 
> was using filesystems without large-mapping support but worked around
> the problem by
> 1. preparing 16*n normals pages
> 2. insert normal pages into xa
> 3. let filesystem read 16 normal pages
> 4. after all IO completion, transform 16 pages into mTHP and reinsert
> mTHP to xa

I had a go at something like this too, but was doing it in the dynamic loader
and having it do MADV_COLLAPSE to generate PMD-sized THPs for the text. I
actaully found this to be even faster for the use cases I was measuring. But of
course its using more memory due to the 2M page size, and I expect it is slowing
down app load time because it is potentially reading in a lot more text than is
actually faulting. Ultimately I think the better strategy is to make the
filesystems large folio capable.

> 
> that was very painful and finally made no improvement probably because
> of due to various sync overhead. so  ran away and didn't dig more data.
> 
> Thanks
> Barry
Ryan Roberts Jan. 17, 2024, 3:10 p.m. UTC | #11
On 11/01/2024 15:41, Ryan Roberts wrote:
> Change the readahead config so that if it is being requested for an
> executable mapping, do a synchronous read of an arch-specified size in a
> naturally aligned manner.
> 
> On arm64 if memory is physically contiguous and naturally aligned to the
> "contpte" size, we can use contpte mappings, which improves utilization
> of the TLB. When paired with the "multi-size THP" changes, this works
> well to reduce dTLB pressure. However iTLB pressure is still high due to
> executable mappings having a low liklihood of being in the required
> folio size and mapping alignment, even when the filesystem supports
> readahead into large folios (e.g. XFS).
> 
> The reason for the low liklihood is that the current readahead algorithm
> starts with an order-2 folio and increases the folio order by 2 every
> time the readahead mark is hit. But most executable memory is faulted in
> fairly randomly and so the readahead mark is rarely hit and most
> executable folios remain order-2. This is observed impirically and
> confirmed from discussion with a gnu linker expert; in general, the
> linker does nothing to group temporally accessed text together
> spacially. Additionally, with the current read-around approach there are
> no alignment guarrantees between the file and folio. This is
> insufficient for arm64's contpte mapping requirement (order-4 for 4K
> base pages).
> 
> So it seems reasonable to special-case the read(ahead) logic for
> executable mappings. The trade-off is performance improvement (due to
> more efficient storage of the translations in iTLB) vs potential read
> amplification (due to reading too much data around the fault which won't
> be used), and the latter is independent of base page size. I've chosen
> 64K folio size for arm64 which benefits both the 4K and 16K base page
> size configs and shouldn't lead to any further read-amplification since
> the old read-around path was (usually) reading blocks of 128K (with the
> last 32K being async).
> 
> Performance Benchmarking
> ------------------------
> 
> The below shows kernel compilation and speedometer javascript benchmarks
> on Ampere Altra arm64 system. (The contpte patch series is applied in
> the baseline).
> 
> First, confirmation that this patch causes more memory to be contained
> in 64K folios (this is for all file-backed memory so includes
> non-executable too):
> 
> | File-backed folios      |   Speedometer   |  Kernel Compile |
> | by size as percentage   |-----------------|-----------------|
> | of all mapped file mem  | before |  after | before |  after |
> |=========================|========|========|========|========|
> |file-thp-aligned-16kB    |    45% |     9% |    46% |     7% |
> |file-thp-aligned-32kB    |     2% |     0% |     3% |     1% |
> |file-thp-aligned-64kB    |     3% |    63% |     5% |    80% |
> |file-thp-aligned-128kB   |    11% |    11% |     0% |     0% |
> |file-thp-unaligned-16kB  |     1% |     0% |     3% |     1% |
> |file-thp-unaligned-128kB |     1% |     0% |     0% |     0% |
> |file-thp-partial         |     0% |     0% |     0% |     0% |
> |-------------------------|--------|--------|--------|--------|
> |file-cont-aligned-64kB   |    16% |    75% |     5% |    80% |
> 
> The above shows that for both use cases, the amount of file memory
> backed by 16K folios reduces and the amount backed by 64K folios
> increases significantly. And the amount of memory that is contpte-mapped
> significantly increases (last line).
> 
> And this is reflected in performance improvement:
> 
> Kernel Compilation (smaller is faster):
> | kernel   |   real-time |   kern-time |   user-time |   peak memory |
> |----------|-------------|-------------|-------------|---------------|
> | before   |        0.0% |        0.0% |        0.0% |          0.0% |
> | after    |       -1.6% |       -2.1% |       -1.7% |          0.0% |
> 
> Speedometer (bigger is faster):
> | kernel   |   runs_per_min |   peak memory |
> |----------|----------------|---------------|
> | before   |           0.0% |          0.0% |
> | after    |           1.3% |          1.0% |
> 
> Both benchmarks show a ~1.5% improvement once the patch is applied.
> 
> Alternatives
> ------------
> 
> I considered (and rejected for now - but I anticipate this patch will
> stimulate discussion around what the best approach is) alternative
> approaches:
> 
>   - Expose a global user-controlled knob to set the preferred folio
>     size; this would move policy to user space and allow (e.g.) setting
>     it to PMD-size for even better iTLB utilizaiton. But this would add
>     ABI, and I prefer to start with the simplest approach first. It also
>     has the downside that a change wouldn't apply to memory already in
>     the page cache that is in active use (e.g. libc) so we don't get the
>     same level of utilization as for something that is fixed from boot.
> 
>   - Add a per-vma attribute to allow user space to specify preferred
>     folio size for memory faulted from the range. (we've talked about
>     such a control in the context of mTHP). The dynamic loader would
>     then be responsible for adding the annotations. Again this feels
>     like something that could be added later if value was demonstrated.
> 
>   - Enhance MADV_COLLAPSE to collapse to THP sizes less than PMD-size.
>     This would still require dynamic linker involvement, but would
>     additionally neccessitate a copy and all memory in the range would
>     be synchronously faulted in, adding to application load time. It
>     would work for filesystems that don't support large folios though.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> 
> Hi all,
> 
> I originally concocted something similar to this, with Matthew's help, as a
> quick proof of concept hack. Since then I've tried a few different approaches
> but always came back to this as the simplest solution. I expect this will raise
> a few eyebrows but given it is providing a real performance win, I hope we can
> converge to something that can be upstreamed.
> 
> This depends on my contpte series to actually set the contiguous bit in the page
> table.
> 
> Thanks,
> Ryan
> 
> 
>  arch/arm64/include/asm/pgtable.h | 12 ++++++++++++
>  include/linux/pgtable.h          | 12 ++++++++++++
>  mm/filemap.c                     | 19 +++++++++++++++++++
>  3 files changed, 43 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index f5bf059291c3..8f8f3f7eb8d8 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1143,6 +1143,18 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
>   */
>  #define arch_wants_old_prefaulted_pte	cpu_has_hw_af
> 
> +/*
> + * Request exec memory is read into pagecache in at least 64K folios. The
> + * trade-off here is performance improvement due to storing translations more
> + * effciently in the iTLB vs the potential for read amplification due to reading
> + * data from disk that won't be used. The latter is independent of base page
> + * size, so we set a page-size independent block size of 64K. This size can be
> + * contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB entry),
> + * and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base pages are in
> + * use.
> + */
> +#define arch_wants_exec_folio_order(void) ilog2(SZ_64K >> PAGE_SHIFT)

Just noticed this errant "void". I originally had this as an inline function,
but changed it to a macro to align with the other "arch_wants" macros above. So
I missed it when refactoring. Anyway, it should be benign but I'll fix it in the
next version, if we keep going in this direction.

> +
>  static inline bool pud_sect_supported(void)
>  {
>  	return PAGE_SIZE == SZ_4K;
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 170925379534..57090616d09c 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -428,6 +428,18 @@ static inline bool arch_has_hw_pte_young(void)
>  }
>  #endif
> 
> +#ifndef arch_wants_exec_folio_order
> +/*
> + * Returns preferred minimum folio order for executable file-backed memory. Must
> + * be in range [0, PMD_ORDER]. Negative value implies that the HW has no
> + * preference and mm will not special-case executable memory in the pagecache.
> + */
> +static inline int arch_wants_exec_folio_order(void)
> +{
> +	return -1;
> +}
> +#endif
> +
>  #ifndef arch_check_zapped_pte
>  static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
>  					 pte_t pte)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 67ba56ecdd32..80a76d755534 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3115,6 +3115,25 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  	}
>  #endif
> 
> +	/*
> +	 * Allow arch to request a preferred minimum folio order for executable
> +	 * memory. This can often be beneficial to performance if (e.g.) arm64
> +	 * can contpte-map the folio. Executable memory rarely benefits from
> +	 * read-ahead anyway, due to its random access nature.
> +	 */
> +	if (vm_flags & VM_EXEC) {
> +		int order = arch_wants_exec_folio_order();
> +
> +		if (order >= 0) {
> +			fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> +			ra->size = 1UL << order;
> +			ra->async_size = 0;
> +			ractl._index &= ~((unsigned long)ra->size - 1);
> +			page_cache_ra_order(&ractl, ra, order);
> +			return fpin;
> +		}
> +	}
> +
>  	/* If we don't want any read-ahead, don't bother */
>  	if (vm_flags & VM_RAND_READ)
>  		return fpin;
> --
> 2.25.1
>
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index f5bf059291c3..8f8f3f7eb8d8 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1143,6 +1143,18 @@  static inline void update_mmu_cache_range(struct vm_fault *vmf,
  */
 #define arch_wants_old_prefaulted_pte	cpu_has_hw_af

+/*
+ * Request exec memory is read into pagecache in at least 64K folios. The
+ * trade-off here is performance improvement due to storing translations more
+ * effciently in the iTLB vs the potential for read amplification due to reading
+ * data from disk that won't be used. The latter is independent of base page
+ * size, so we set a page-size independent block size of 64K. This size can be
+ * contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB entry),
+ * and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base pages are in
+ * use.
+ */
+#define arch_wants_exec_folio_order(void) ilog2(SZ_64K >> PAGE_SHIFT)
+
 static inline bool pud_sect_supported(void)
 {
 	return PAGE_SIZE == SZ_4K;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 170925379534..57090616d09c 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -428,6 +428,18 @@  static inline bool arch_has_hw_pte_young(void)
 }
 #endif

+#ifndef arch_wants_exec_folio_order
+/*
+ * Returns preferred minimum folio order for executable file-backed memory. Must
+ * be in range [0, PMD_ORDER]. Negative value implies that the HW has no
+ * preference and mm will not special-case executable memory in the pagecache.
+ */
+static inline int arch_wants_exec_folio_order(void)
+{
+	return -1;
+}
+#endif
+
 #ifndef arch_check_zapped_pte
 static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
 					 pte_t pte)
diff --git a/mm/filemap.c b/mm/filemap.c
index 67ba56ecdd32..80a76d755534 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3115,6 +3115,25 @@  static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	}
 #endif

+	/*
+	 * Allow arch to request a preferred minimum folio order for executable
+	 * memory. This can often be beneficial to performance if (e.g.) arm64
+	 * can contpte-map the folio. Executable memory rarely benefits from
+	 * read-ahead anyway, due to its random access nature.
+	 */
+	if (vm_flags & VM_EXEC) {
+		int order = arch_wants_exec_folio_order();
+
+		if (order >= 0) {
+			fpin = maybe_unlock_mmap_for_io(vmf, fpin);
+			ra->size = 1UL << order;
+			ra->async_size = 0;
+			ractl._index &= ~((unsigned long)ra->size - 1);
+			page_cache_ra_order(&ractl, ra, order);
+			return fpin;
+		}
+	}
+
 	/* If we don't want any read-ahead, don't bother */
 	if (vm_flags & VM_RAND_READ)
 		return fpin;