
[V5] x86/mm: Tracking linear mapping split events

Message ID 20210128104934.2916679-1-saravanand@fb.com (mailing list archive)
State New, archived
Series [V5] x86/mm: Tracking linear mapping split events

Commit Message

Saravanan D Jan. 28, 2021, 10:49 a.m. UTC
To help with debugging the sluggishness caused by TLB miss/reload,
we introduce monotonic lifetime hugepage split event counts since
system state: SYSTEM_RUNNING to be displayed as part of
/proc/vmstat in x86 servers

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
....
swap_ra 0
swap_ra_hit 0
direct_map_level2_splits 94
direct_map_level3_splits 4
nr_unstable 0
....

One of the many lasting (as we don't coalesce back) sources for huge page
splits is tracing as the granular page attribute/permission changes would
force the kernel to split code segments mapped to huge pages to smaller
ones thereby increasing the probability of TLB miss/reload even after
tracing has been stopped.

Documentation regarding linear mapping split events added to admin-guide
as requested in V3 of the patch.

Signed-off-by: Saravanan D <saravanand@fb.com>
---
 .../admin-guide/mm/direct_mapping_splits.rst  | 59 +++++++++++++++++++
 Documentation/admin-guide/mm/index.rst        |  1 +
 arch/x86/mm/pat/set_memory.c                  |  8 +++
 include/linux/vm_event_item.h                 |  4 ++
 mm/vmstat.c                                   |  4 ++
 5 files changed, 76 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst

Comments

Matthew Wilcox Jan. 28, 2021, 3:04 p.m. UTC | #1
On Thu, Jan 28, 2021 at 02:49:34AM -0800, Saravanan D wrote:
> One of the many lasting (as we don't coalesce back) sources for huge page
> splits is tracing as the granular page attribute/permission changes would
> force the kernel to split code segments mapped to huge pages to smaller
> ones thereby increasing the probability of TLB miss/reload even after
> tracing has been stopped.

You didn't answer my question.

Is this tracing of userspace programs causing splits, or is it kernel
tracing?  Also, we have lots of kinds of tracing these days; are you
referring to kprobes?  tracepoints?  ftrace?  Something else?
Zi Yan Jan. 28, 2021, 4:33 p.m. UTC | #2
On 28 Jan 2021, at 5:49, Saravanan D wrote:

> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic lifetime hugepage split event counts since
> system state: SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
>
> The lifetime split event information will be displayed at the bottom of
> /proc/vmstat
> ....
> swap_ra 0
> swap_ra_hit 0
> direct_map_level2_splits 94
> direct_map_level3_splits 4
> nr_unstable 0
> ....
>
> One of the many lasting (as we don't coalesce back) sources for huge page
> splits is tracing as the granular page attribute/permission changes would
> force the kernel to split code segments mapped to huge pages to smaller
> ones thereby increasing the probability of TLB miss/reload even after
> tracing has been stopped.

It is interesting to see this statement saying splitting kernel direct mappings
causes performance loss, when Zhengjun (cc’d) from Intel recently posted
a kernel direct mapping performance report[1] saying 1GB mappings are good
but not much better than 2MB and 4KB mappings.

I would love to hear the stories from both sides. Or maybe I am
misunderstanding something.


[1]https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
>
> Documentation regarding linear mapping split events added to admin-guide
> as requested in V3 of the patch.
>
> Signed-off-by: Saravanan D <saravanand@fb.com>
> ---
>  .../admin-guide/mm/direct_mapping_splits.rst  | 59 +++++++++++++++++++
>  Documentation/admin-guide/mm/index.rst        |  1 +
>  arch/x86/mm/pat/set_memory.c                  |  8 +++
>  include/linux/vm_event_item.h                 |  4 ++
>  mm/vmstat.c                                   |  4 ++
>  5 files changed, 76 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst
>
> diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst b/Documentation/admin-guide/mm/direct_mapping_splits.rst
> new file mode 100644
> index 000000000000..298751391deb
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
> @@ -0,0 +1,59 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Direct Mapping Splits
> +=====================
> +
> +The kernel maps all of physical memory in linear/direct mapped pages, where
> +translation of a virtual kernel address to a physical address is achieved
> +through a simple offset subtraction. CPUs maintain a cache of these
> +translations in fast caches called TLBs. CPU architectures like x86 allow
> +direct mapping large portions of memory into hugepages (2M, 1G, etc.) at
> +various page table levels.
> +
> +Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
> +The splintering of huge direct pages into smaller ones does result in
> +a measurable performance hit caused by frequent TLB miss and reloads.
> +
> +One of the many lasting (as we don't coalesce back) sources for huge page
> +splits is tracing as the granular page attribute/permission changes would
> +force the kernel to split code segments mapped to hugepages to smaller
> +ones thus increasing the probability of TLB miss/reloads even after
> +tracing has been stopped.
> +
> +On x86 systems, we can track the splitting of huge direct mapped pages
> +through lifetime event counters in ``/proc/vmstat``
> +
> +	direct_map_level2_splits xxx
> +	direct_map_level3_splits yyy
> +
> +where:
> +
> +direct_map_level2_splits
> +	are 2M/4M hugepage split events
> +direct_map_level3_splits
> +	are 1G hugepage split events
> +
> +The distribution of direct mapped system memory in various page sizes
> +post splits can be viewed through ``/proc/meminfo`` whose output
> +will include the following lines depending upon supporting CPU
> +architecture
> +
> +	DirectMap4k:    xxxxx kB
> +	DirectMap2M:    yyyyy kB
> +	DirectMap1G:    zzzzz kB
> +
> +where:
> +
> +DirectMap4k
> +	is the total amount of direct mapped memory (in kB)
> +	accessed through 4k pages
> +DirectMap2M
> +	is the total amount of direct mapped memory (in kB)
> +	accessed through 2M pages
> +DirectMap1G
> +	is the total amount of direct mapped memory (in kB)
> +	accessed through 1G pages
> +
> +
> +-- Saravanan D, Jan 27, 2021
> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
> index 4b14d8b50e9e..9439780f3f07 100644
> --- a/Documentation/admin-guide/mm/index.rst
> +++ b/Documentation/admin-guide/mm/index.rst
> @@ -38,3 +38,4 @@ the Linux memory management.
>     soft-dirty
>     transhuge
>     userfaultfd
> +   direct_mapping_splits
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 16f878c26667..a7b3c5f1d316 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -16,6 +16,8 @@
>  #include <linux/pci.h>
>  #include <linux/vmalloc.h>
>  #include <linux/libnvdimm.h>
> +#include <linux/vmstat.h>
> +#include <linux/kernel.h>
>
>  #include <asm/e820/api.h>
>  #include <asm/processor.h>
> @@ -91,6 +93,12 @@ static void split_page_count(int level)
>  		return;
>
>  	direct_pages_count[level]--;
> +	if (system_state == SYSTEM_RUNNING) {
> +		if (level == PG_LEVEL_2M)
> +			count_vm_event(DIRECT_MAP_LEVEL2_SPLIT);
> +		else if (level == PG_LEVEL_1G)
> +			count_vm_event(DIRECT_MAP_LEVEL3_SPLIT);
> +	}
>  	direct_pages_count[level - 1] += PTRS_PER_PTE;
>  }
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 18e75974d4e3..7c06c2bdc33b 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  #ifdef CONFIG_SWAP
>  		SWAP_RA,
>  		SWAP_RA_HIT,
> +#endif
> +#ifdef CONFIG_X86
> +		DIRECT_MAP_LEVEL2_SPLIT,
> +		DIRECT_MAP_LEVEL3_SPLIT,
>  #endif
>  		NR_VM_EVENT_ITEMS
>  };
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index f8942160fc95..a43ac4ac98a2 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = {
>  	"swap_ra",
>  	"swap_ra_hit",
>  #endif
> +#ifdef CONFIG_X86
> +	"direct_map_level2_splits",
> +	"direct_map_level3_splits",
> +#endif
>  #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
>  };
>  #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
> -- 
> 2.24.1


—
Best Regards,
Yan Zi
Dave Hansen Jan. 28, 2021, 4:41 p.m. UTC | #3
On 1/28/21 8:33 AM, Zi Yan wrote:
>> One of the many lasting (as we don't coalesce back) sources for
>> huge page splits is tracing as the granular page
>> attribute/permission changes would force the kernel to split code
>> segments mapped to huge pages to smaller ones thereby increasing
>> the probability of TLB miss/reload even after tracing has been
>> stopped.
> It is interesting to see this statement saying splitting kernel
> direct mappings causes performance loss, when Zhengjun (cc’d) from
> Intel recently posted a kernel direct mapping performance report[1]
> saying 1GB mappings are good but not much better than 2MB and 4KB
> mappings.

No, that's not what the report said.

*Overall*, there is no clear winner between 4k, 2M and 1G.  In other
words, no one page size is best for *ALL* workloads.

There were *ABSOLUTELY* individual workloads in those tests that saw
significant deltas between the direct map sizes.  There are also
real-world workloads that feel the impact here.
Zi Yan Jan. 28, 2021, 4:56 p.m. UTC | #4
On 28 Jan 2021, at 11:41, Dave Hansen wrote:

> On 1/28/21 8:33 AM, Zi Yan wrote:
>>> One of the many lasting (as we don't coalesce back) sources for
>>> huge page splits is tracing as the granular page
>>> attribute/permission changes would force the kernel to split code
>>> segments mapped to huge pages to smaller ones thereby increasing
>>> the probability of TLB miss/reload even after tracing has been
>>> stopped.
>> It is interesting to see this statement saying splitting kernel
>> direct mappings causes performance loss, when Zhengjun (cc’d) from
>> Intel recently posted a kernel direct mapping performance report[1]
>> saying 1GB mappings are good but not much better than 2MB and 4KB
>> mappings.
>
> No, that's not what the report said.
>
> *Overall*, there is no clear winner between 4k, 2M and 1G.  In other
> words, no one page size is best for *ALL* workloads.
>
> There were *ABSOLUTELY* individual workloads in those tests that saw
> significant deltas between the direct map sizes.  There are also
> real-world workloads that feel the impact here.

Yes, that is what I understand from the report. But this patch says
“
Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
The splintering of huge direct pages into smaller ones does result in
a measurable performance hit caused by frequent TLB miss and reloads.
”,

indicating that large mappings (2MB, 1GB) are generally better. That is
different from what the report said, right?

The above text could be improved to make sure readers get both sides
of the story and do not become afraid of performance loss after seeing
a lot of direct_map_xxx_splits events.



—
Best Regards,
Yan Zi
Song Liu Jan. 28, 2021, 4:59 p.m. UTC | #5
> On Jan 28, 2021, at 8:33 AM, Zi Yan <ziy@nvidia.com> wrote:
> 
> On 28 Jan 2021, at 5:49, Saravanan D wrote:
> 
>> To help with debugging the sluggishness caused by TLB miss/reload,
>> we introduce monotonic lifetime hugepage split event counts since
>> system state: SYSTEM_RUNNING to be displayed as part of
>> /proc/vmstat in x86 servers
>> 
>> The lifetime split event information will be displayed at the bottom of
>> /proc/vmstat
>> ....
>> swap_ra 0
>> swap_ra_hit 0
>> direct_map_level2_splits 94
>> direct_map_level3_splits 4
>> nr_unstable 0
>> ....
>> 
>> One of the many lasting (as we don't coalesce back) sources for huge page
>> splits is tracing as the granular page attribute/permission changes would
>> force the kernel to split code segments mapped to huge pages to smaller
>> ones thereby increasing the probability of TLB miss/reload even after
>> tracing has been stopped.
> 
> It is interesting to see this statement saying splitting kernel direct mappings
> causes performance loss, when Zhengjun (cc’d) from Intel recently posted
> a kernel direct mapping performance report[1] saying 1GB mappings are good
> but not much better than 2MB and 4KB mappings.
> 
> I would love to hear the stories from both sides. Or maybe I am
> misunderstanding something.

We had an issue about 1.5 years ago, when ftrace split a 2MB kernel text page
table entry into 512x 4kB ones. This split caused a ~1% performance regression.
That instance was fixed in [1].

Saravanan, could you please share more information about the split? Is it
possible to avoid the split? If not, can we regroup after tracing is disabled?

We have the split-and-regroup logic for application .text on THP. When a uprobe
is attached to the THP text, we have to split the 2MB page table entry. So we
introduced a mechanism to regroup the 2MB page table entry once all uprobes are
removed from the THP [2].

Thanks,
Song

[1] commit 7af0145067bc ("x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text")
[2] commit f385cb85a42f ("uprobe: collapse THP pmd after removing all uprobes")

> 
> 
> [1]https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
>> 
>> Documentation regarding linear mapping split events added to admin-guide
>> as requested in V3 of the patch.
>> 
>> Signed-off-by: Saravanan D <saravanand@fb.com>
>> ---
>> .../admin-guide/mm/direct_mapping_splits.rst  | 59 +++++++++++++++++++
>> Documentation/admin-guide/mm/index.rst        |  1 +
>> arch/x86/mm/pat/set_memory.c                  |  8 +++
>> include/linux/vm_event_item.h                 |  4 ++
>> mm/vmstat.c                                   |  4 ++
>> 5 files changed, 76 insertions(+)
>> create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst
>> 
>> diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst b/Documentation/admin-guide/mm/direct_mapping_splits.rst
>> new file mode 100644
>> index 000000000000..298751391deb
>> --- /dev/null
>> +++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
>> @@ -0,0 +1,59 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +=====================
>> +Direct Mapping Splits
>> +=====================
>> +
>> +The kernel maps all of physical memory in linear/direct mapped pages, where
>> +translation of a virtual kernel address to a physical address is achieved
>> +through a simple offset subtraction. CPUs maintain a cache of these
>> +translations in fast caches called TLBs. CPU architectures like x86 allow
>> +direct mapping large portions of memory into hugepages (2M, 1G, etc.) at
>> +various page table levels.
>> +
>> +Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
>> +The splintering of huge direct pages into smaller ones does result in
>> +a measurable performance hit caused by frequent TLB miss and reloads.
>> +
>> +One of the many lasting (as we don't coalesce back) sources for huge page
>> +splits is tracing as the granular page attribute/permission changes would
>> +force the kernel to split code segments mapped to hugepages to smaller
>> +ones thus increasing the probability of TLB miss/reloads even after
>> +tracing has been stopped.
>> +
>> +On x86 systems, we can track the splitting of huge direct mapped pages
>> +through lifetime event counters in ``/proc/vmstat``
>> +
>> +	direct_map_level2_splits xxx
>> +	direct_map_level3_splits yyy
>> +
>> +where:
>> +
>> +direct_map_level2_splits
>> +	are 2M/4M hugepage split events
>> +direct_map_level3_splits
>> +	are 1G hugepage split events
>> +
>> +The distribution of direct mapped system memory in various page sizes
>> +post splits can be viewed through ``/proc/meminfo`` whose output
>> +will include the following lines depending upon supporting CPU
>> +architecture
>> +
>> +	DirectMap4k:    xxxxx kB
>> +	DirectMap2M:    yyyyy kB
>> +	DirectMap1G:    zzzzz kB
>> +
>> +where:
>> +
>> +DirectMap4k
>> +	is the total amount of direct mapped memory (in kB)
>> +	accessed through 4k pages
>> +DirectMap2M
>> +	is the total amount of direct mapped memory (in kB)
>> +	accessed through 2M pages
>> +DirectMap1G
>> +	is the total amount of direct mapped memory (in kB)
>> +	accessed through 1G pages
>> +
>> +
>> +-- Saravanan D, Jan 27, 2021
>> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
>> index 4b14d8b50e9e..9439780f3f07 100644
>> --- a/Documentation/admin-guide/mm/index.rst
>> +++ b/Documentation/admin-guide/mm/index.rst
>> @@ -38,3 +38,4 @@ the Linux memory management.
>>    soft-dirty
>>    transhuge
>>    userfaultfd
>> +   direct_mapping_splits
>> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
>> index 16f878c26667..a7b3c5f1d316 100644
>> --- a/arch/x86/mm/pat/set_memory.c
>> +++ b/arch/x86/mm/pat/set_memory.c
>> @@ -16,6 +16,8 @@
>> #include <linux/pci.h>
>> #include <linux/vmalloc.h>
>> #include <linux/libnvdimm.h>
>> +#include <linux/vmstat.h>
>> +#include <linux/kernel.h>
>> 
>> #include <asm/e820/api.h>
>> #include <asm/processor.h>
>> @@ -91,6 +93,12 @@ static void split_page_count(int level)
>> 		return;
>> 
>> 	direct_pages_count[level]--;
>> +	if (system_state == SYSTEM_RUNNING) {
>> +		if (level == PG_LEVEL_2M)
>> +			count_vm_event(DIRECT_MAP_LEVEL2_SPLIT);
>> +		else if (level == PG_LEVEL_1G)
>> +			count_vm_event(DIRECT_MAP_LEVEL3_SPLIT);
>> +	}
>> 	direct_pages_count[level - 1] += PTRS_PER_PTE;
>> }
>> 
>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>> index 18e75974d4e3..7c06c2bdc33b 100644
>> --- a/include/linux/vm_event_item.h
>> +++ b/include/linux/vm_event_item.h
>> @@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>> #ifdef CONFIG_SWAP
>> 		SWAP_RA,
>> 		SWAP_RA_HIT,
>> +#endif
>> +#ifdef CONFIG_X86
>> +		DIRECT_MAP_LEVEL2_SPLIT,
>> +		DIRECT_MAP_LEVEL3_SPLIT,
>> #endif
>> 		NR_VM_EVENT_ITEMS
>> };
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index f8942160fc95..a43ac4ac98a2 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = {
>> 	"swap_ra",
>> 	"swap_ra_hit",
>> #endif
>> +#ifdef CONFIG_X86
>> +	"direct_map_level2_splits",
>> +	"direct_map_level3_splits",
>> +#endif
>> #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
>> };
>> #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
>> -- 
>> 2.24.1
> 
> 
> —
> Best Regards,
> Yan Zi
Dave Hansen Jan. 28, 2021, 7:17 p.m. UTC | #6
On 1/28/21 2:49 AM, Saravanan D wrote:
> +++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
> @@ -0,0 +1,59 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Direct Mapping Splits
> +=====================
> +
> +The kernel maps all of physical memory in linear/direct mapped pages, where
> +translation of a virtual kernel address to a physical address is achieved
> +through a simple offset subtraction. CPUs maintain a cache of these
> +translations in fast caches called TLBs. CPU architectures like x86 allow
> +direct mapping large portions of memory into hugepages (2M, 1G, etc.) at
> +various page table levels.
> +
> +Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
> +The splintering of huge direct pages into smaller ones does result in
> +a measurable performance hit caused by frequent TLB miss and reloads.

Eek.  There really doesn't appear to be a place in Documentation/ where
we've documented vmstat entries.

Maybe you can start:

	Documentation/admin-guide/mm/vmstat.rst

Also, I don't think we need background on the direct map or TLBs here.
Just get to the point and describe what the files do, don't justify why
they are there.

I also agree with Willy that you should qualify some of the strong
statements (if they remain) in your changelog and documentation.

This:

	Maintaining huge direct mapped pages
	greatly reduces TLB miss pressure.

for instance isn't universally true.  There were CPUs with a very small
number of 1G TLB entries.  Using 1G pages on those systems often led to
*GREATER* TLB pressure and lower performance.
Saravanan D Jan. 28, 2021, 7:49 p.m. UTC | #7
Hi Matthew,

> Is this tracing of userspace programs causing splits, or is it kernel
> tracing?  Also, we have lots of kinds of tracing these days; are you
> referring to kprobes?  tracepoints?  ftrace?  Something else?

It has to be kernel tracing (kprobes, tracepoints) as we are dealing with 
direct mapping splits.

Kernel's direct mapping:
``ffff888000000000 | -119.5 TB | ffffc87fffffffff | 64 TB | direct mapping of all physical memory (page_offset_base)``

The kernel text range:
``ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0``

Source : Documentation/x86/x86_64/mm.rst

The kernel code segment points to the same physical addresses already mapped
in the direct mapping range (0x20000000 = 512 MB).

When we enable kernel tracing, we have to modify the attributes/permissions
of the text segment pages that are direct mapped, causing them to split.

When we track direct_pages_count[] in arch/x86/mm/pat/set_memory.c, we only
see splits from higher levels; they never coalesce back.

Splits when we turn on dynamic tracing
....
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 784
direct_map_level3_splits 12
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @ [pid, comm] = count(); }'
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 789
direct_map_level3_splits 12
....
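
For reference, a minimal before/after check along the same lines (just a
sketch; it assumes the same kind of bpftrace one-liner plus standard
grep/diff tools, and the /tmp paths are only placeholders):
....
# snapshot the lifetime split counters before a tracing session
grep direct_map_level /proc/vmstat > /tmp/splits.before
# run any short kernel tracing session, e.g. a bpftrace one-liner for ~10s
timeout -s INT 10 bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[pid, comm] = count(); }'
# snapshot again; the counters only ever grow, so the diff shows the new splits
grep direct_map_level /proc/vmstat > /tmp/splits.after
diff /tmp/splits.before /tmp/splits.after
....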

Thanks,
Saravanan D
Saravanan D Jan. 28, 2021, 9:20 p.m. UTC | #8
Hi Dave,
> 
> Eek.  There really doesn't appear to be a place in Documentation/ where
> we've documented vmstat entries.
> 
> Maybe you can start:
> 
> 	Documentation/admin-guide/mm/vmstat.rst
> 
I was also very surprised that there is no documentation for vmstat;
that led me to add a page to the admin-guide which now requires a lot
of caveats.

Starting new documentation for vmstat goes beyond the scope of this patch.
I am inclined to remove the Documentation changes from the next version [V6] of the patch.

I presume that a detailed commit log [V6] explaining why direct mapped kernel
page splits will never coalesce, how kernel tracing causes some of those
splits, and why it is worth tracking them can do the job.

Proposed [V6] Commit Log:
>>>
To help with debugging the sluggishness caused by TLB misses/reloads,
we introduce monotonic hugepage [direct mapped] split event counts,
accumulated since system state SYSTEM_RUNNING, to be displayed as part
of /proc/vmstat on x86 servers.

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
....
swap_ra 0
swap_ra_hit 0
direct_map_level2_splits 94
direct_map_level3_splits 4
nr_unstable 0
....

One of the many lasting sources of direct hugepage splits is kernel
tracing (kprobes, tracepoints).

Note that the kernel's code segment [512 MB] points to the same
physical addresses that have already been mapped in the kernel's
direct mapping range.

Source : Documentation/x86/x86_64/mm.rst

When we enable kernel tracing, the kernel has to modify the attributes/permissions
of the text segment hugepages that are direct mapped, causing them to split.

The kernel's direct mapped hugepages do not coalesce back after a split and
remain in place for the remainder of the system's lifetime.

An instance of direct page splits when we turn on
dynamic kernel tracing
....
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 784
direct_map_level3_splits 12
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @ [pid, comm] = count(); }'
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 789
direct_map_level3_splits 12
....
<<<

Thanks,
Saravanan D

Patch

diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst b/Documentation/admin-guide/mm/direct_mapping_splits.rst
new file mode 100644
index 000000000000..298751391deb
--- /dev/null
+++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
@@ -0,0 +1,59 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Direct Mapping Splits
+=====================
+
+The kernel maps all of physical memory in linear/direct mapped pages, where
+translation of a virtual kernel address to a physical address is achieved
+through a simple offset subtraction. CPUs maintain a cache of these
+translations in fast caches called TLBs. CPU architectures like x86 allow
+direct mapping large portions of memory into hugepages (2M, 1G, etc.) at
+various page table levels.
+
+Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
+The splintering of huge direct pages into smaller ones does result in
+a measurable performance hit caused by frequent TLB miss and reloads.
+
+One of the many lasting (as we don't coalesce back) sources for huge page
+splits is tracing as the granular page attribute/permission changes would
+force the kernel to split code segments mapped to hugepages to smaller
+ones thus increasing the probability of TLB miss/reloads even after
+tracing has been stopped.
+
+On x86 systems, we can track the splitting of huge direct mapped pages
+through lifetime event counters in ``/proc/vmstat``
+
+	direct_map_level2_splits xxx
+	direct_map_level3_splits yyy
+
+where:
+
+direct_map_level2_splits
+	are 2M/4M hugepage split events
+direct_map_level3_splits
+	are 1G hugepage split events
+
+The distribution of direct mapped system memory in various page sizes
+post splits can be viewed through ``/proc/meminfo`` whose output
+will include the following lines depending upon supporting CPU
+architecture
+
+	DirectMap4k:    xxxxx kB
+	DirectMap2M:    yyyyy kB
+	DirectMap1G:    zzzzz kB
+
+where:
+
+DirectMap4k
+	is the total amount of direct mapped memory (in kB)
+	accessed through 4k pages
+DirectMap2M
+	is the total amount of direct mapped memory (in kB)
+	accessed through 2M pages
+DirectMap1G
+	is the total amount of direct mapped memory (in kB)
+	accessed through 1G pages
+
+
+-- Saravanan D, Jan 27, 2021
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index 4b14d8b50e9e..9439780f3f07 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -38,3 +38,4 @@  the Linux memory management.
    soft-dirty
    transhuge
    userfaultfd
+   direct_mapping_splits
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..a7b3c5f1d316 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,8 @@ 
 #include <linux/pci.h>
 #include <linux/vmalloc.h>
 #include <linux/libnvdimm.h>
+#include <linux/vmstat.h>
+#include <linux/kernel.h>
 
 #include <asm/e820/api.h>
 #include <asm/processor.h>
@@ -91,6 +93,12 @@  static void split_page_count(int level)
 		return;
 
 	direct_pages_count[level]--;
+	if (system_state == SYSTEM_RUNNING) {
+		if (level == PG_LEVEL_2M)
+			count_vm_event(DIRECT_MAP_LEVEL2_SPLIT);
+		else if (level == PG_LEVEL_1G)
+			count_vm_event(DIRECT_MAP_LEVEL3_SPLIT);
+	}
 	direct_pages_count[level - 1] += PTRS_PER_PTE;
 }
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..7c06c2bdc33b 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -120,6 +120,10 @@  enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_SWAP
 		SWAP_RA,
 		SWAP_RA_HIT,
+#endif
+#ifdef CONFIG_X86
+		DIRECT_MAP_LEVEL2_SPLIT,
+		DIRECT_MAP_LEVEL3_SPLIT,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f8942160fc95..a43ac4ac98a2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1350,6 +1350,10 @@  const char * const vmstat_text[] = {
 	"swap_ra",
 	"swap_ra_hit",
 #endif
+#ifdef CONFIG_X86
+	"direct_map_level2_splits",
+	"direct_map_level3_splits",
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
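
A quick usage sketch for the interfaces this patch touches (assuming an x86
kernel with VM event counters enabled; the DirectMap* lines already exist in
/proc/meminfo on x86):
....
# lifetime split events accumulated since SYSTEM_RUNNING (added by this patch)
grep direct_map_level /proc/vmstat
# current distribution of the direct mapping across page sizes
grep DirectMap /proc/meminfo
....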