Message ID: 20210128104934.2916679-1-saravanand@fb.com (mailing list archive)
State: New, archived
Series: [V5] x86/mm: Tracking linear mapping split events
On Thu, Jan 28, 2021 at 02:49:34AM -0800, Saravanan D wrote:
> One of the many lasting (as we don't coalesce back) sources for huge page
> splits is tracing as the granular page attribute/permission changes would
> force the kernel to split code segments mapped to huge pages to smaller
> ones thereby increasing the probability of TLB miss/reload even after
> tracing has been stopped.

You didn't answer my question.  Is this tracing of userspace programs
causing splits, or is it kernel tracing?  Also, we have lots of kinds of
tracing these days; are you referring to kprobes?  tracepoints?  ftrace?
Something else?
On 28 Jan 2021, at 5:49, Saravanan D wrote:

> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic lifetime hugepage split event counts since
> system state: SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
>
> The lifetime split event information will be displayed at the bottom of
> /proc/vmstat
> ....
> swap_ra 0
> swap_ra_hit 0
> direct_map_level2_splits 94
> direct_map_level3_splits 4
> nr_unstable 0
> ....
>
> One of the many lasting (as we don't coalesce back) sources for huge page
> splits is tracing as the granular page attribute/permission changes would
> force the kernel to split code segments mapped to huge pages to smaller
> ones thereby increasing the probability of TLB miss/reload even after
> tracing has been stopped.

It is interesting to see this statement saying splitting kernel direct mappings
causes performance loss, when Zhengjun (cc’d) from Intel recently posted
a kernel direct mapping performance report[1] saying 1GB mappings are good
but not much better than 2MB and 4KB mappings.

I would love to hear the stories from both sides. Or maybe I misunderstand
anything.

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

> [... remainder of the patch (documentation and code changes) quoted
> without further comment, trimmed; see the full diff at the bottom of
> this page ...]

—
Best Regards,
Yan Zi
On 1/28/21 8:33 AM, Zi Yan wrote:
>> One of the many lasting (as we don't coalesce back) sources for
>> huge page splits is tracing as the granular page
>> attribute/permission changes would force the kernel to split code
>> segments mapped to huge pages to smaller ones thereby increasing
>> the probability of TLB miss/reload even after tracing has been
>> stopped.
> It is interesting to see this statement saying splitting kernel
> direct mappings causes performance loss, when Zhengjun (cc’d) from
> Intel recently posted a kernel direct mapping performance report[1]
> saying 1GB mappings are good but not much better than 2MB and 4KB
> mappings.

No, that's not what the report said.

*Overall*, there is no clear winner between 4k, 2M and 1G.  In other
words, no one page size is best for *ALL* workloads.

There were *ABSOLUTELY* individual workloads in those tests that saw
significant deltas between the direct map sizes.  There are also
real-world workloads that feel the impact here.
On 28 Jan 2021, at 11:41, Dave Hansen wrote:

> On 1/28/21 8:33 AM, Zi Yan wrote:
>>> One of the many lasting (as we don't coalesce back) sources for
>>> huge page splits is tracing as the granular page
>>> attribute/permission changes would force the kernel to split code
>>> segments mapped to huge pages to smaller ones thereby increasing
>>> the probability of TLB miss/reload even after tracing has been
>>> stopped.
>> It is interesting to see this statement saying splitting kernel
>> direct mappings causes performance loss, when Zhengjun (cc’d) from
>> Intel recently posted a kernel direct mapping performance report[1]
>> saying 1GB mappings are good but not much better than 2MB and 4KB
>> mappings.
>
> No, that's not what the report said.
>
> *Overall*, there is no clear winner between 4k, 2M and 1G.  In other
> words, no one page size is best for *ALL* workloads.
>
> There were *ABSOLUTELY* individual workloads in those tests that saw
> significant deltas between the direct map sizes.  There are also
> real-world workloads that feel the impact here.

Yes, it is what I understand from the report. But this patch says
“Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
The splintering of huge direct pages into smaller ones does result in
a measurable performance hit caused by frequent TLB miss and reloads.”,
indicating large mappings (2MB, 1GB) are generally better. It is
different from what the report said, right?

The above text could be improved to make sure readers get both sides of
the story and not get afraid of performance loss after seeing a lot of
direct_map_xxx_splits events.

—
Best Regards,
Yan Zi
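For a rough sense of the TLB-reach tradeoff being debated here (a
back-of-envelope illustration with made-up but plausible numbers, not
figures from the report or from this patch): a TLB with 1,536 entries
reaches about 6 MB of the direct map with 4 KB pages, about 3 GB with
2 MB pages, and effectively the whole machine with 1 GB pages, so splits
normally cost something. On the other hand, a CPU that provides only a
handful of dedicated 1 GB TLB entries can end up thrashing them, which is
how "no clear winner overall" and "splits hurt some workloads" can both
be true.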
> On Jan 28, 2021, at 8:33 AM, Zi Yan <ziy@nvidia.com> wrote:
>
> On 28 Jan 2021, at 5:49, Saravanan D wrote:
>
>> To help with debugging the sluggishness caused by TLB miss/reload,
>> we introduce monotonic lifetime hugepage split event counts since
>> system state: SYSTEM_RUNNING to be displayed as part of
>> /proc/vmstat in x86 servers
>>
>> The lifetime split event information will be displayed at the bottom of
>> /proc/vmstat
>> ....
>> swap_ra 0
>> swap_ra_hit 0
>> direct_map_level2_splits 94
>> direct_map_level3_splits 4
>> nr_unstable 0
>> ....
>>
>> One of the many lasting (as we don't coalesce back) sources for huge page
>> splits is tracing as the granular page attribute/permission changes would
>> force the kernel to split code segments mapped to huge pages to smaller
>> ones thereby increasing the probability of TLB miss/reload even after
>> tracing has been stopped.
>
> It is interesting to see this statement saying splitting kernel direct mappings
> causes performance loss, when Zhengjun (cc’d) from Intel recently posted
> a kernel direct mapping performance report[1] saying 1GB mappings are good
> but not much better than 2MB and 4KB mappings.
>
> I would love to hear the stories from both sides. Or maybe I misunderstand
> anything.

We had an issue about 1.5 years ago, when ftrace splits 2MB kernel text
page table entry into 512x 4kB ones. This split caused ~1% performance
regression. That instance was fixed in [1].

Saravanan, could you please share more information about the split. Is it
possible to avoid the split? If not, can we regroup after tracing is
disabled?

We have the split-and-regroup logic for application .text on THP. When
uprobe is attached to the THP text, we have to split the 2MB page table
entry. So we introduced mechanism to regroup the 2MB page table entry
when all uprobes are removed from the THP [2].

Thanks,
Song

[1] commit 7af0145067bc ("x86/mm/cpa: Prevent large page split when
    ftrace flips RW on kernel text")
[2] commit f385cb85a42f ("uprobe: collapse THP pmd after removing all
    uprobes")

> [... remainder of Zi Yan's mail, quoting the rest of the patch without
> further comment, trimmed ...]
>
> —
> Best Regards,
> Yan Zi
On 1/28/21 2:49 AM, Saravanan D wrote:
> +++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
> @@ -0,0 +1,59 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Direct Mapping Splits
> +=====================
> +
> +Kernel maps all of physical memory in linear/direct mapped pages with
> +translation of virtual kernel address to physical address is achieved
> +through a simple subtraction of offset. CPUs maintain a cache of these
> +translations on fast caches called TLBs. CPU architectures like x86 allow
> +direct mapping large portions of memory into hugepages (2M, 1G, etc) in
> +various page table levels.
> +
> +Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
> +The splintering of huge direct pages into smaller ones does result in
> +a measurable performance hit caused by frequent TLB miss and reloads.

Eek.  There really doesn't appear to be a place in Documentation/ that
we've documented vmstat entries.

Maybe you can start:

	Documentation/admin-guide/mm/vmstat.rst

Also, I don't think we need background on the direct map or TLBs here.
Just get to the point and describe what the files do, don't justify why
they are there.

I also agree with Willy that you should qualify some of the strong
statements (if they remain) in your changelog and documentation.

This:

	Maintaining huge direct mapped pages greatly reduces TLB miss
	pressure.

for instance isn't universally true.  There were CPUs with a very small
number of 1G TLB entries.  Using 1G pages on those systems often led to
*GREATER* TLB pressure and lower performance.
Hi Matthew,

> Is this tracing of userspace programs causing splits, or is it kernel
> tracing?  Also, we have lots of kinds of tracing these days; are you
> referring to kprobes?  tracepoints?  ftrace?  Something else?

It has to be kernel tracing (kprobes, tracepoints) as we are dealing with
direct mapping splits.

Kernel's direct mapping:
``ffff888000000000 | -119.5 TB | ffffc87fffffffff | 64 TB | direct mapping of all physical memory (page_offset_base)``

The kernel text range:
``ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0``

Source: Documentation/x86/x86_64/mm.rst

The kernel code segment points to the same physical addresses that are
already mapped in the direct mapping range (0x20000000 = 512 MB).

When we enable kernel tracing, we would have to modify the
attributes/permissions of the text segment pages that are direct mapped,
causing them to split.

When we track direct_pages_count[] in arch/x86/mm/pat/set_memory.c,
there are only splits from higher levels; they never coalesce back.

Splits when we turn on dynamic tracing:
....
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 784
direct_map_level3_splits 12

bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @ [pid, comm] = count(); }'

cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 789
direct_map_level3_splits 12
....

Thanks,
Saravanan D
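A minimal userspace sketch (illustrative only, not part of this patch)
of how the same measurement can be scripted: it snapshots the two
counters, waits while a tracing session runs, and prints the delta. It
assumes a kernel carrying this patch so that both direct_map_level*_splits
lines exist in /proc/vmstat; the 10-second default window and every name
in it are arbitrary.

/*
 * Illustrative only, not part of the patch. Reads the two split
 * counters this series adds to /proc/vmstat, waits, and prints the
 * delta (e.g. around a tracing session). On an unpatched kernel the
 * counters are simply reported as missing.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static long read_counter(const char *name)
{
	FILE *f = fopen("/proc/vmstat", "r");
	char key[64];
	long val;

	if (!f)
		return -1;
	while (fscanf(f, "%63s %ld", key, &val) == 2) {
		if (!strcmp(key, name)) {
			fclose(f);
			return val;
		}
	}
	fclose(f);
	return -1;	/* counter not present (unpatched kernel?) */
}

int main(int argc, char **argv)
{
	unsigned int secs = argc > 1 ? atoi(argv[1]) : 10;
	long l2_before = read_counter("direct_map_level2_splits");
	long l3_before = read_counter("direct_map_level3_splits");
	long l2_after, l3_after;

	if (l2_before < 0 || l3_before < 0) {
		fprintf(stderr, "split counters not found in /proc/vmstat\n");
		return 1;
	}

	sleep(secs);	/* run bpftrace/kprobes in another shell meanwhile */

	l2_after = read_counter("direct_map_level2_splits");
	l3_after = read_counter("direct_map_level3_splits");

	printf("level2 (2M/4M) splits: %ld -> %ld (+%ld)\n",
	       l2_before, l2_after, l2_after - l2_before);
	printf("level3 (1G)    splits: %ld -> %ld (+%ld)\n",
	       l3_before, l3_after, l3_after - l3_before);
	return 0;
}

Start bpftrace (or any other kprobe/tracepoint user) in another shell
inside the window; the printed deltas correspond to the jumps shown
above.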
Hi Dave,

>
> Eek.  There really doesn't appear to be a place in Documentation/ that
> we've documented vmstat entries.
>
> Maybe you can start:
>
> 	Documentation/admin-guide/mm/vmstat.rst
>

I was also very surprised that there is no existing documentation for
vmstat; that is what led me to add a page in admin-guide, which now
requires a lot of caveats.

Starting new documentation for vmstat goes beyond the scope of this
patch. I am inclined to remove the Documentation change from the next
version [V6] of the patch.

I presume that a detailed commit log [V6] explaining why direct mapped
kernel page splits never coalesce, how kernel tracing causes some of
those splits and why it is worth tracking them can do the job.

Proposed [V6] commit log:

>>>
To help with debugging the sluggishness caused by TLB miss/reload, we
introduce monotonic hugepage [direct mapped] split event counts since
system state: SYSTEM_RUNNING to be displayed as part of /proc/vmstat
in x86 servers

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
....
swap_ra 0
swap_ra_hit 0
direct_map_level2_splits 94
direct_map_level3_splits 4
nr_unstable 0
....

One of the many lasting sources of direct hugepage splits is kernel
tracing (kprobes, tracepoints).

Note that the kernel's code segment [512 MB] points to the same physical
addresses that have already been mapped in the kernel's direct mapping
range.

Source: Documentation/x86/x86_64/mm.rst

When we enable kernel tracing, the kernel has to modify the
attributes/permissions of the text segment hugepages that are direct
mapped, causing them to split.

The kernel's direct mapped hugepages do not coalesce back after split
and remain in place for the remainder of the lifetime.

An instance of direct page splits when we turn on dynamic kernel tracing:
....
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 784
direct_map_level3_splits 12

bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @ [pid, comm] = count(); }'

cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 789
direct_map_level3_splits 12
....
<<<

Thanks,
Saravanan D
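For the distribution side of the story, the DirectMap4k/2M/1G lines that
the to-be-dropped documentation described remain available from
/proc/meminfo. Below is a small illustrative sketch (again not part of
the patch) that prints just those lines; which of them appear depends on
the CPU and kernel, e.g. DirectMap1G only shows up when 1 GB pages are
supported.

/*
 * Illustrative only. Prints the DirectMap* breakdown from
 * /proc/meminfo so the post-split page-size distribution can be
 * read next to the vmstat split counters.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/meminfo", "r");
	char line[256];

	if (!f) {
		perror("/proc/meminfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		/* keep only the DirectMap4k/2M/1G lines */
		if (!strncmp(line, "DirectMap", strlen("DirectMap")))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}

Comparing this breakdown before and after a tracing session shows which
page sizes the split hugepages ended up in.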
diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst b/Documentation/admin-guide/mm/direct_mapping_splits.rst
new file mode 100644
index 000000000000..298751391deb
--- /dev/null
+++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
@@ -0,0 +1,59 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Direct Mapping Splits
+=====================
+
+Kernel maps all of physical memory in linear/direct mapped pages with
+translation of virtual kernel address to physical address is achieved
+through a simple subtraction of offset. CPUs maintain a cache of these
+translations on fast caches called TLBs. CPU architectures like x86 allow
+direct mapping large portions of memory into hugepages (2M, 1G, etc) in
+various page table levels.
+
+Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
+The splintering of huge direct pages into smaller ones does result in
+a measurable performance hit caused by frequent TLB miss and reloads.
+
+One of the many lasting (as we don't coalesce back) sources for huge page
+splits is tracing as the granular page attribute/permission changes would
+force the kernel to split code segments mapped to hugepages to smaller
+ones thus increasing the probability of TLB miss/reloads even after
+tracing has been stopped.
+
+On x86 systems, we can track the splitting of huge direct mapped pages
+through lifetime event counters in ``/proc/vmstat``
+
+    direct_map_level2_splits xxx
+    direct_map_level3_splits yyy
+
+where:
+
+direct_map_level2_splits
+    are 2M/4M hugepage split events
+direct_map_level3_splits
+    are 1G hugepage split events
+
+The distribution of direct mapped system memory in various page sizes
+post splits can be viewed through ``/proc/meminfo`` whose output
+will include the following lines depending upon supporting CPU
+architecture
+
+    DirectMap4k:    xxxxx kB
+    DirectMap2M:    yyyyy kB
+    DirectMap1G:    zzzzz kB
+
+where:
+
+DirectMap4k
+    is the total amount of direct mapped memory (in kB)
+    accessed through 4k pages
+DirectMap2M
+    is the total amount of direct mapped memory (in kB)
+    accessed through 2M pages
+DirectMap1G
+    is the total amount of direct mapped memory (in kB)
+    accessed through 1G pages
+
+
+-- Saravanan D, Jan 27, 2021
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index 4b14d8b50e9e..9439780f3f07 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -38,3 +38,4 @@ the Linux memory management.
    soft-dirty
    transhuge
    userfaultfd
+   direct_mapping_splits
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..a7b3c5f1d316 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,8 @@
 #include <linux/pci.h>
 #include <linux/vmalloc.h>
 #include <linux/libnvdimm.h>
+#include <linux/vmstat.h>
+#include <linux/kernel.h>
 
 #include <asm/e820/api.h>
 #include <asm/processor.h>
@@ -91,6 +93,12 @@ static void split_page_count(int level)
 		return;
 
 	direct_pages_count[level]--;
+	if (system_state == SYSTEM_RUNNING) {
+		if (level == PG_LEVEL_2M)
+			count_vm_event(DIRECT_MAP_LEVEL2_SPLIT);
+		else if (level == PG_LEVEL_1G)
+			count_vm_event(DIRECT_MAP_LEVEL3_SPLIT);
+	}
 	direct_pages_count[level - 1] += PTRS_PER_PTE;
 }
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..7c06c2bdc33b 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_SWAP
 		SWAP_RA,
 		SWAP_RA_HIT,
+#endif
+#ifdef CONFIG_X86
+		DIRECT_MAP_LEVEL2_SPLIT,
+		DIRECT_MAP_LEVEL3_SPLIT,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f8942160fc95..a43ac4ac98a2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = {
 	"swap_ra",
 	"swap_ra_hit",
 #endif
+#ifdef CONFIG_X86
+	"direct_map_level2_splits",
+	"direct_map_level3_splits",
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
To help with debugging the sluggishness caused by TLB miss/reload,
we introduce monotonic lifetime hugepage split event counts since
system state: SYSTEM_RUNNING to be displayed as part of
/proc/vmstat in x86 servers

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
....
swap_ra 0
swap_ra_hit 0
direct_map_level2_splits 94
direct_map_level3_splits 4
nr_unstable 0
....

One of the many lasting (as we don't coalesce back) sources for huge page
splits is tracing as the granular page attribute/permission changes would
force the kernel to split code segments mapped to huge pages to smaller
ones thereby increasing the probability of TLB miss/reload even after
tracing has been stopped.

Documentation regarding linear mapping split events added to admin-guide
as requested in V3 of the patch.

Signed-off-by: Saravanan D <saravanand@fb.com>
---
 .../admin-guide/mm/direct_mapping_splits.rst | 59 +++++++++++++++++++
 Documentation/admin-guide/mm/index.rst       |  1 +
 arch/x86/mm/pat/set_memory.c                 |  8 +++
 include/linux/vm_event_item.h                |  4 ++
 mm/vmstat.c                                  |  4 ++
 5 files changed, 76 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst