[0/3] mm, thp: introduce a new sysfs interface to facilitate file THP for .text

Message ID 20211009092658.59665-1-rongwei.wang@linux.alibaba.com (mailing list archive)

Message

Rongwei Wang Oct. 9, 2021, 9:26 a.m. UTC
Hi, all

Recently, our team has been focusing on huge pages for executable
binaries and shared libraries; we refer to these huge pages as
'hugetext' in the description below. Hugetext does improve application
performance, e.g. for mysql, as shown in [1][2], and the improvement
becomes more obvious as the text section grows. Based on [1][2], we
make some improvements so that file-backed THP is more usable and
easier for applications to adopt.

In the current kernel, ref [1] introduced READ_ONLY_THP_FOR_FS, and
ref [2] added support for shared libraries on top of it. However,
hugetext is still not convenient to use. For example, we need to
explicitly madvise MADV_HUGEPAGE for .text and set
"transparent_hugepage/enabled" to always or madvise. In addition,
hugetext requires 2M alignment of vma->vm_start and vma->vm_pgoff,
which is not guaranteed by the kernel or the loader.
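
As a small illustration of the first prerequisite, the sketch below reads
"transparent_hugepage/enabled" and reports which mode is active. The
sysfs file marks the active mode with square brackets (for example
"always [madvise] never"). The helper names selected_thp_mode and
thp_allowed_for_madvised_vma are hypothetical and only used for
illustration; this is not part of the patch series.

```python
import re

THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def selected_thp_mode(text):
    """Return the bracketed (active) mode, e.g. 'madvise' for 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active mode found in {text!r}")
    return match.group(1)


def thp_allowed_for_madvised_vma(text):
    """A VMA marked with MADV_HUGEPAGE is only eligible when the mode is not 'never'."""
    return selected_thp_mode(text) in ("always", "madvise")


if __name__ == "__main__":
    try:
        with open(THP_ENABLED_PATH) as f:
            content = f.read()
    except OSError as exc:
        # The file is absent on kernels built without THP support.
        print(f"cannot read {THP_ENABLED_PATH}: {exc}")
    else:
        print(selected_thp_mode(content), thp_allowed_for_madvised_vma(content))
```

On a system where THP is set to "never", the second helper returns
False, which is the situation in which madvise on .text has no effect.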

Our design:
To address the drawbacks of file THP described above, we have mainly
improved the two points below.
(1) Introduce a new sysfs interface "transparent_hugepage/hugetext_enabled"
to automatically (i.e., transparently) enable file THP for suitable
.text vmas. Usage:

    to disable hugetext:
    $ echo 0 > /sys/kernel/mm/transparent_hugepage/hugetext_enabled

    to enable hugetext:
    $ echo 1 > /sys/kernel/mm/transparent_hugepage/hugetext_enabled

    to enable or disable in boot options: hugetext=1 or hugetext=0

Q: Why not add a new option, e.g., "text_always", to
"transparent_hugepage/enabled" in addition to "always", "madvise",
and "never"?

A: A new option in "transparent_hugepage/enabled" cannot handle the
scenario where THP is always used for .text while madvise/never still
applies to everything else (e.g., anon vmas).

The .text is usually small in size. In our production environment, at
most 10G out of 500G total memory is used as .text. The .text is also
performance critical. More importantly, we don't want to change the
user's default behavior too much. So we think that a new independent
sysfs interface for file THP is worthwhile.

(2) Make the vm_start of .text 2M-aligned consistently with vm_pgoff,
especially for PIE/PIC binaries and shared libraries.

For binaries that are compiled with '--pie -fPIC' and whose LOAD
alignment is smaller than 2M (typically 4K or 64K), change
maximum_alignment to 2M.
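
The effect of raising the alignment can be modelled with a few lines of
arithmetic. The sketch below is a simplified model under stated
assumptions, not the code from the patch: the alignment is taken as the
largest power-of-two p_align among the PT_LOAD segments, it is raised to
2M when hugetext is enabled, and a candidate load address is rounded
down to that alignment. The names load_alignment and align_down are
hypothetical.

```python
HPAGE_PMD_SIZE = 2 * 1024 * 1024  # 2M, assumed PMD size (x86-64 with 4K pages)


def maximum_alignment(p_aligns):
    """Largest power-of-two p_align among the PT_LOAD segments (0 if none)."""
    best = 0
    for align in p_aligns:
        # Skip zero and values that are not a power of two.
        if align and (align & (align - 1)) == 0:
            best = max(best, align)
    return best


def load_alignment(p_aligns, hugetext_enabled):
    """Alignment used for the load address; at least 2M when hugetext is on."""
    align = maximum_alignment(p_aligns)
    if hugetext_enabled and align < HPAGE_PMD_SIZE:
        align = HPAGE_PMD_SIZE
    return align


def align_down(addr, align):
    """Round addr down to a multiple of align (align must be 0 or a power of two)."""
    if align == 0:
        return addr
    return addr & ~(align - 1)


if __name__ == "__main__":
    segments = [0x1000, 0x10000]  # typical 4K and 64K LOAD alignments
    for enabled in (False, True):
        align = load_alignment(segments, enabled)
        print(enabled, hex(align), hex(align_down(0x7fecc4072000, align)))
```

With hugetext disabled the address only needs to respect the 64K
alignment from the ELF headers; with it enabled the same candidate
address is rounded down to a 2M boundary.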

For shared libraries, ld.so does not seem to honor p_align properly, as
shown below.
$ readelf -l /usr/lib64/libc-2.17.so
LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
               0x00000000001c2fe8 0x00000000001c2fe8  R E    200000
$ cat /proc/1/smaps
7fecc4072000-7fecc4235000 r-xp 00000000 08:03 655802  /usr/lib64/libc-2.17.so
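
To make the mismatch in this output concrete, the following sketch
parses a mapping line of the kind shown above and checks whether the
start address and the file offset agree modulo 2M. The function names
are hypothetical and the 2M value assumes the usual PMD size on x86-64;
this is an illustration, not a tool from the series.

```python
PMD_SIZE = 0x200000  # 2M, assumed PMD size


def parse_maps_line(line):
    """Split one /proc/<pid>/maps style line into its main fields."""
    fields = line.split()
    start_s, end_s = fields[0].split("-")
    return {
        "start": int(start_s, 16),
        "end": int(end_s, 16),
        "perms": fields[1],
        "offset": int(fields[2], 16),
        "path": fields[5] if len(fields) > 5 else "",
    }


def pmd_misalignment(start, offset):
    """Bytes by which start and file offset differ modulo 2M (0 means congruent)."""
    return (start - offset) % PMD_SIZE


if __name__ == "__main__":
    line = ("7fecc4072000-7fecc4235000 r-xp 00000000 08:03 655802  "
            "/usr/lib64/libc-2.17.so")
    m = parse_maps_line(line)
    print(m["path"], hex(pmd_misalignment(m["start"], m["offset"])))
```

For the libc mapping above the result is 0x72000, so the segment is not
placed on a 2M boundary even though its p_align is 0x200000.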

Finally, why is this feature implemented in the kernel rather than in
userspace or ld.so?

Userspace methods like libhugetlbfs have various disadvantages:
 * require recompiling applications;
 * the anonymous mapping cannot be shared;
 * debugging is not convenient.

Calling madvise MADV_HUGEPAGE for .text in ld.so was suggested on the
glibc mailing list [3], but there was no response.

Considering that this feature requires very little code and is not too
difficult to implement on top of the existing file-backed THP support,
we chose to implement it in the kernel.

Thanks!

Reference:
[1] https://patchwork.kernel.org/project/linux-mm/cover/20190801184244.3169074-1-songliubraving@fb.com/
[2] https://patchwork.kernel.org/project/linux-fsdevel/patch/20210406000930.3455850-1-cfijalkovich@google.com/
[3] https://sourceware.org/pipermail/libc-alpha/2021-February/122334.html

Rongwei Wang (3):
  mm, thp: support binaries transparent use of file THP
  mm, thp: make mapping address of libraries THP align
  mm, thp: make mapping address of PIC binaries THP align

 fs/binfmt_elf.c            |  5 +++
 include/linux/huge_mm.h    | 36 +++++++++++++++++++
 include/linux/khugepaged.h |  9 +++++
 mm/Kconfig                 | 11 ++++++
 mm/huge_memory.c           | 72 ++++++++++++++++++++++++++++++++++++++
 mm/khugepaged.c            |  4 +++
 mm/memory.c                | 12 +++++++
 mm/mmap.c                  | 18 ++++++++++
 8 files changed, 167 insertions(+)

Comments

Christoph Hellwig Oct. 11, 2021, 8:06 a.m. UTC | #1
Can we please just get proper pagecache THP (through folios) merged
instead of piling hacks over hacks here?  The whole readonly THP already
was more than painful enough due to all the hacks involved.

Matthew Wilcox Oct. 12, 2021, 1:50 a.m. UTC | #2
On Mon, Oct 11, 2021 at 09:06:37AM +0100, Christoph Hellwig wrote:
> Can we please just get proper pagecache THP (through folios) merged
> instead of piling hacks over hacks here?  The whole readonly THP already
> was more than painful enough due to all the hacks involved.

This was my initial reaction too.

But read the patches.  They're nothing to do with the implementation of
THP / folios in the page cache.  They're all to make sure that mappings
are PMD aligned.

I think there's a lot to criticise in the patches (eg, a system-wide
setting is probably a bad idea.  and a lot of this stuff seems to
be fixing userspace bugs in the kernel).  But let's criticise what's
actually in the patches, because these are problems that exist regardless
of RO_THP vs folios.

Rongwei Wang Oct. 12, 2021, 7:04 a.m. UTC | #3
On 10/12/21 9:50 AM, Matthew Wilcox wrote:
> On Mon, Oct 11, 2021 at 09:06:37AM +0100, Christoph Hellwig wrote:
>> Can we please just get proper pagecache THP (through folios) merged
>> instead of piling hacks over hacks here?  The whole readonly THP already
>> was more than painful enough due to all the hacks involved.
> 
> This was my initial reaction too.
> 
> But read the patches.  They're nothing to do with the implementation of
> THP / folios in the page cache.  They're all to make sure that mappings
> are PMD aligned.
Hi, Matthew

In fact, we had considered implementing this by handling the page cache
directly. We then found that we only need to align the mapping address
and let khugepaged scan these 'mm_struct's based on
READ_ONLY_THP_FOR_FS.

> 
> I think there's a lot to criticise in the patches (eg, a system-wide
> setting is probably a bad idea.  and a lot of this stuff seems to
At the beginning, we did not introduce a new sysfs interface and simply
reused 'transparent_hugepage/enabled'. But some production systems
disable THP outright, especially for applications that are sensitive to
THP. Considering these scenarios, we had to design a new sysfs
interface ('transparent_hugepage/hugetext_enabled').

If you have other ideas, we are willing to adopt them to improve these
patches.

Thanks!

> be fixing userspace bugs in the kernel).  But let's criticise what's
> actually in the patches, because these are problems that exist regardless
> of RO_THP vs folios.
>