Message ID | 20240804080107.21094-4-laoar.shao@gmail.com (mailing list archive)
---|---
State | New
Series | mm: Introduce a new sysctl knob vm.pcp_batch_scale_max
Yafang Shao <laoar.shao@gmail.com> writes:

[snip]

> Why introduce a sysctl knob?
> ============================
>
> From the above data, it's clear that different CPU types have varying
> allocation latencies concerning zone->lock contention. Typically, people
> don't release individual kernel packages for each type of x86_64 CPU.
>
> Furthermore, for latency-insensitive applications, we can keep the default
> setting for better throughput.

Do you have any data to prove that the default setting is better for
throughput? If so, that will be strong support for your patch.

> In our production environment, we set this
> value to 0 for applications running on Kubernetes servers while keeping it
> at the default value of 5 for other applications like big data. It's not
> common to release individual kernel packages for each application.

[snip]

--
Best Regards,
Huang, Ying
On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote: > > Yafang Shao <laoar.shao@gmail.com> writes: > > [snip] > > > > > Why introduce a systl knob? > > =========================== > > > > From the above data, it's clear that different CPU types have varying > > allocation latencies concerning zone->lock contention. Typically, people > > don't release individual kernel packages for each type of x86_64 CPU. > > > > Furthermore, for latency-insensitive applications, we can keep the default > > setting for better throughput. > > Do you have any data to prove that the default setting is better for > throughput? If so, that will be a strong support for your patch. No, I don't. The primary reason we can't change the default value from 5 to 0 across our fleet of servers is that you initially set it to 5. The sysadmins believe you had a strong reason for setting it to 5 by default; otherwise, it would be considered careless for the upstream kernel. I also believe you must have had a solid justification for setting the default value to 5; otherwise, why would you have submitted your patches? > > > In our production environment, we set this > > value to 0 for applications running on Kubernetes servers while keeping it > > at the default value of 5 for other applications like big data. It's not > > common to release individual kernel packages for each application. > > > > [snip] > > -- > Best Regards, > Huang, Ying -- Regards Yafang
Yafang Shao <laoar.shao@gmail.com> writes: > On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> [snip] >> >> > >> > Why introduce a systl knob? >> > =========================== >> > >> > From the above data, it's clear that different CPU types have varying >> > allocation latencies concerning zone->lock contention. Typically, people >> > don't release individual kernel packages for each type of x86_64 CPU. >> > >> > Furthermore, for latency-insensitive applications, we can keep the default >> > setting for better throughput. >> >> Do you have any data to prove that the default setting is better for >> throughput? If so, that will be a strong support for your patch. > > No, I don't. The primary reason we can't change the default value from > 5 to 0 across our fleet of servers is that you initially set it to 5. > The sysadmins believe you had a strong reason for setting it to 5 by > default; otherwise, it would be considered careless for the upstream > kernel. I also believe you must have had a solid justification for > setting the default value to 5; otherwise, why would you have > submitted your patches? In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long latency"), I tried my best to run test on the machines available with a micro-benchmark (will-it-scale/page_fault1) which exercises kernel page allocator heavily. From the data in commit, larger CONFIG_PCP_BATCH_SCALE_MAX helps throughput a little, but not much. The 99% alloc/free latency can be kept within about 100us with CONFIG_PCP_BATCH_SCALE_MAX == 5. So, we chose 5 as default value. But, we can always improve the default value with more data, on more types of machines and with more types of benchmarks, etc. Your data suggest smaller default value because you have data to show that larger default value has the latency spike issue (as large as tens ms) for some practical workloads. Which weren't tested previously. In contrast, we don't have strong data to show the throughput advantages of larger CONFIG_PCP_BATCH_SCALE_MAX value. So, I suggest to use a smaller default value for CONFIG_PCP_BATCH_SCALE_MAX. But, we may need more test to check the data for 1, 2, 3, and 4, in addtion to 0 and 5 to determine the best choice. >> >> > In our production environment, we set this >> > value to 0 for applications running on Kubernetes servers while keeping it >> > at the default value of 5 for other applications like big data. It's not >> > common to release individual kernel packages for each application. >> > >> >> [snip] >> -- Best Regards, Huang, Ying
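The trade-off described above comes straight from the batch scaling
arithmetic: a single PCP refill or drain moves at most
batch << CONFIG_PCP_BATCH_SCALE_MAX pages while zone->lock is held. A
minimal sketch of that arithmetic, assuming a hypothetical per-zone batch of
63 order-0 pages and 4 KiB pages (the kernel computes the real batch per
zone, so 63 is only an illustration):

    # Sketch only: worst-case pages moved under zone->lock per PCP
    # refill/drain at each scale factor. batch=63 is an assumption for
    # illustration, not a value taken from any particular machine.
    PAGE_SIZE = 4096
    batch = 63

    for scale_max in range(0, 7):      # the knob's valid range is 0..6
        pages = batch << scale_max     # upper bound per refill/drain
        print(f"scale_max={scale_max}: up to {pages} pages "
              f"({pages * PAGE_SIZE // 1024} KiB) per zone->lock hold")

With scale_max == 5 that is up to 2016 pages (roughly 8 MiB) allocated or
freed in one go, which is why the lock hold time, and therefore the tail
latency, grows with the scale factor.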
On Mon, Aug 5, 2024 at 11:05 AM Huang, Ying <ying.huang@intel.com> wrote: > > Yafang Shao <laoar.shao@gmail.com> writes: > > > On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Yafang Shao <laoar.shao@gmail.com> writes: > >> > >> [snip] > >> > >> > > >> > Why introduce a systl knob? > >> > =========================== > >> > > >> > From the above data, it's clear that different CPU types have varying > >> > allocation latencies concerning zone->lock contention. Typically, people > >> > don't release individual kernel packages for each type of x86_64 CPU. > >> > > >> > Furthermore, for latency-insensitive applications, we can keep the default > >> > setting for better throughput. > >> > >> Do you have any data to prove that the default setting is better for > >> throughput? If so, that will be a strong support for your patch. > > > > No, I don't. The primary reason we can't change the default value from > > 5 to 0 across our fleet of servers is that you initially set it to 5. > > The sysadmins believe you had a strong reason for setting it to 5 by > > default; otherwise, it would be considered careless for the upstream > > kernel. I also believe you must have had a solid justification for > > setting the default value to 5; otherwise, why would you have > > submitted your patches? > > In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to > avoid too long latency"), I tried my best to run test on the machines > available with a micro-benchmark (will-it-scale/page_fault1) which > exercises kernel page allocator heavily. From the data in commit, > larger CONFIG_PCP_BATCH_SCALE_MAX helps throughput a little, but not > much. The 99% alloc/free latency can be kept within about 100us with > CONFIG_PCP_BATCH_SCALE_MAX == 5. So, we chose 5 as default value. > > But, we can always improve the default value with more data, on more > types of machines and with more types of benchmarks, etc. > > Your data suggest smaller default value because you have data to show > that larger default value has the latency spike issue (as large as tens > ms) for some practical workloads. Which weren't tested previously. In > contrast, we don't have strong data to show the throughput advantages of > larger CONFIG_PCP_BATCH_SCALE_MAX value. > > So, I suggest to use a smaller default value for > CONFIG_PCP_BATCH_SCALE_MAX. But, we may need more test to check the > data for 1, 2, 3, and 4, in addtion to 0 and 5 to determine the best > choice. Which smaller default value would be better? How can we ensure that other workloads, which we haven't tested, will work well with this new default value? If you have a better default value in mind, would you consider sending a patch for it? I would be happy to test it with my test case. -- Regards Yafang
Yafang Shao <laoar.shao@gmail.com> writes: > On Mon, Aug 5, 2024 at 11:05 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> > On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> [snip] >> >> >> >> > >> >> > Why introduce a systl knob? >> >> > =========================== >> >> > >> >> > From the above data, it's clear that different CPU types have varying >> >> > allocation latencies concerning zone->lock contention. Typically, people >> >> > don't release individual kernel packages for each type of x86_64 CPU. >> >> > >> >> > Furthermore, for latency-insensitive applications, we can keep the default >> >> > setting for better throughput. >> >> >> >> Do you have any data to prove that the default setting is better for >> >> throughput? If so, that will be a strong support for your patch. >> > >> > No, I don't. The primary reason we can't change the default value from >> > 5 to 0 across our fleet of servers is that you initially set it to 5. >> > The sysadmins believe you had a strong reason for setting it to 5 by >> > default; otherwise, it would be considered careless for the upstream >> > kernel. I also believe you must have had a solid justification for >> > setting the default value to 5; otherwise, why would you have >> > submitted your patches? >> >> In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to >> avoid too long latency"), I tried my best to run test on the machines >> available with a micro-benchmark (will-it-scale/page_fault1) which >> exercises kernel page allocator heavily. From the data in commit, >> larger CONFIG_PCP_BATCH_SCALE_MAX helps throughput a little, but not >> much. The 99% alloc/free latency can be kept within about 100us with >> CONFIG_PCP_BATCH_SCALE_MAX == 5. So, we chose 5 as default value. >> >> But, we can always improve the default value with more data, on more >> types of machines and with more types of benchmarks, etc. >> >> Your data suggest smaller default value because you have data to show >> that larger default value has the latency spike issue (as large as tens >> ms) for some practical workloads. Which weren't tested previously. In >> contrast, we don't have strong data to show the throughput advantages of >> larger CONFIG_PCP_BATCH_SCALE_MAX value. >> >> So, I suggest to use a smaller default value for >> CONFIG_PCP_BATCH_SCALE_MAX. But, we may need more test to check the >> data for 1, 2, 3, and 4, in addtion to 0 and 5 to determine the best >> choice. > > Which smaller default value would be better? This depends on further test results. > How can we ensure that other workloads, which we haven't tested, will > work well with this new default value? We cannot. We can only depends on the data available. If there are new data available in the future, we can make the change accordingly. > If you have a better default value in mind, would you consider sending > a patch for it? I would be happy to test it with my test case. If you can test the value 1, 2, 3, and 4 with your workload, that will be very helpful! Both allocation latency and total free time (if possible) are valuable. -- Best Regards, Huang, Ying
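With the proposed sysctl applied, that sweep does not require rebuilding the
kernel for each CONFIG value; a rough sketch of how such a run could be
driven (run_workload.sh is a placeholder for the actual test, and latency is
still collected with "funclatency -T -i 600 rmqueue_bulk" as described in
the patch):

    # Sketch: iterate over the candidate scale factors and run the workload
    # at each setting. Assumes the sysctl patch is applied and root access.
    import subprocess
    import time

    KNOB = "/proc/sys/vm/pcp_batch_scale_max"

    for value in (0, 1, 2, 3, 4, 5):
        with open(KNOB, "w") as f:   # takes effect immediately, no reboot
            f.write(str(value))
        print(f"testing pcp_batch_scale_max={value}")
        subprocess.run(["./run_workload.sh"], check=True)  # placeholder
        time.sleep(5)                # let the system settle between runs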
On Mon, Aug 5, 2024 at 12:36 PM Huang, Ying <ying.huang@intel.com> wrote: > > Yafang Shao <laoar.shao@gmail.com> writes: > > > On Mon, Aug 5, 2024 at 11:05 AM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Yafang Shao <laoar.shao@gmail.com> writes: > >> > >> > On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote: > >> >> > >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> > >> >> [snip] > >> >> > >> >> > > >> >> > Why introduce a systl knob? > >> >> > =========================== > >> >> > > >> >> > From the above data, it's clear that different CPU types have varying > >> >> > allocation latencies concerning zone->lock contention. Typically, people > >> >> > don't release individual kernel packages for each type of x86_64 CPU. > >> >> > > >> >> > Furthermore, for latency-insensitive applications, we can keep the default > >> >> > setting for better throughput. > >> >> > >> >> Do you have any data to prove that the default setting is better for > >> >> throughput? If so, that will be a strong support for your patch. > >> > > >> > No, I don't. The primary reason we can't change the default value from > >> > 5 to 0 across our fleet of servers is that you initially set it to 5. > >> > The sysadmins believe you had a strong reason for setting it to 5 by > >> > default; otherwise, it would be considered careless for the upstream > >> > kernel. I also believe you must have had a solid justification for > >> > setting the default value to 5; otherwise, why would you have > >> > submitted your patches? > >> > >> In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to > >> avoid too long latency"), I tried my best to run test on the machines > >> available with a micro-benchmark (will-it-scale/page_fault1) which > >> exercises kernel page allocator heavily. From the data in commit, > >> larger CONFIG_PCP_BATCH_SCALE_MAX helps throughput a little, but not > >> much. The 99% alloc/free latency can be kept within about 100us with > >> CONFIG_PCP_BATCH_SCALE_MAX == 5. So, we chose 5 as default value. > >> > >> But, we can always improve the default value with more data, on more > >> types of machines and with more types of benchmarks, etc. > >> > >> Your data suggest smaller default value because you have data to show > >> that larger default value has the latency spike issue (as large as tens > >> ms) for some practical workloads. Which weren't tested previously. In > >> contrast, we don't have strong data to show the throughput advantages of > >> larger CONFIG_PCP_BATCH_SCALE_MAX value. > >> > >> So, I suggest to use a smaller default value for > >> CONFIG_PCP_BATCH_SCALE_MAX. But, we may need more test to check the > >> data for 1, 2, 3, and 4, in addtion to 0 and 5 to determine the best > >> choice. > > > > Which smaller default value would be better? > > This depends on further test results. I believe you agree with me that you can't test all workloads. > > > How can we ensure that other workloads, which we haven't tested, will > > work well with this new default value? > > We cannot. We can only depends on the data available. If there are > new data available in the future, we can make the change accordingly. So, your solution is to change the hardcoded value for untested workloads and then release the kernel package again? > > > If you have a better default value in mind, would you consider sending > > a patch for it? I would be happy to test it with my test case. 
> > If you can test the value 1, 2, 3, and 4 with your workload, that will > be very helpful! Both allocation latency and total free time (if > possible) are valuable. You know I can't verify it with all workloads, right? You have so much data to verify, which indicates uncertainty about any default value. Why not make it tunable and let the user choose the value they prefer?
Yafang Shao <laoar.shao@gmail.com> writes: > On Mon, Aug 5, 2024 at 12:36 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> > On Mon, Aug 5, 2024 at 11:05 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> > On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> >> >> [snip] >> >> >> >> >> >> > >> >> >> > Why introduce a systl knob? >> >> >> > =========================== >> >> >> > >> >> >> > From the above data, it's clear that different CPU types have varying >> >> >> > allocation latencies concerning zone->lock contention. Typically, people >> >> >> > don't release individual kernel packages for each type of x86_64 CPU. >> >> >> > >> >> >> > Furthermore, for latency-insensitive applications, we can keep the default >> >> >> > setting for better throughput. >> >> >> >> >> >> Do you have any data to prove that the default setting is better for >> >> >> throughput? If so, that will be a strong support for your patch. >> >> > >> >> > No, I don't. The primary reason we can't change the default value from >> >> > 5 to 0 across our fleet of servers is that you initially set it to 5. >> >> > The sysadmins believe you had a strong reason for setting it to 5 by >> >> > default; otherwise, it would be considered careless for the upstream >> >> > kernel. I also believe you must have had a solid justification for >> >> > setting the default value to 5; otherwise, why would you have >> >> > submitted your patches? >> >> >> >> In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to >> >> avoid too long latency"), I tried my best to run test on the machines >> >> available with a micro-benchmark (will-it-scale/page_fault1) which >> >> exercises kernel page allocator heavily. From the data in commit, >> >> larger CONFIG_PCP_BATCH_SCALE_MAX helps throughput a little, but not >> >> much. The 99% alloc/free latency can be kept within about 100us with >> >> CONFIG_PCP_BATCH_SCALE_MAX == 5. So, we chose 5 as default value. >> >> >> >> But, we can always improve the default value with more data, on more >> >> types of machines and with more types of benchmarks, etc. >> >> >> >> Your data suggest smaller default value because you have data to show >> >> that larger default value has the latency spike issue (as large as tens >> >> ms) for some practical workloads. Which weren't tested previously. In >> >> contrast, we don't have strong data to show the throughput advantages of >> >> larger CONFIG_PCP_BATCH_SCALE_MAX value. >> >> >> >> So, I suggest to use a smaller default value for >> >> CONFIG_PCP_BATCH_SCALE_MAX. But, we may need more test to check the >> >> data for 1, 2, 3, and 4, in addtion to 0 and 5 to determine the best >> >> choice. >> > >> > Which smaller default value would be better? >> >> This depends on further test results. > > I believe you agree with me that you can't test all workloads. > >> >> > How can we ensure that other workloads, which we haven't tested, will >> > work well with this new default value? >> >> We cannot. We can only depends on the data available. If there are >> new data available in the future, we can make the change accordingly. > > So, your solution is to change the hardcoded value for untested > workloads and then release the kernel package again? > >> >> > If you have a better default value in mind, would you consider sending >> > a patch for it? 
I would be happy to test it with my test case. >> >> If you can test the value 1, 2, 3, and 4 with your workload, that will >> be very helpful! Both allocation latency and total free time (if >> possible) are valuable. > > You know I can't verify it with all workloads, right? > You have so much data to verify, which indicates uncertainty about any > default value. Why not make it tunable and let the user choose the > value they prefer? We only make decision based on data available. In theory, we cannot test all workloads, because there will be new workloads in the future. If we have data to show that smaller value will cause performance regressions for some reasonable workloads, we can make it user tunable. -- Best Regards, Huang, Ying
On Mon, Aug 5, 2024 at 1:04 PM Huang, Ying <ying.huang@intel.com> wrote: > > Yafang Shao <laoar.shao@gmail.com> writes: > > > On Mon, Aug 5, 2024 at 12:36 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Yafang Shao <laoar.shao@gmail.com> writes: > >> > >> > On Mon, Aug 5, 2024 at 11:05 AM Huang, Ying <ying.huang@intel.com> wrote: > >> >> > >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> > >> >> > On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> > >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> >> > >> >> >> [snip] > >> >> >> > >> >> >> > > >> >> >> > Why introduce a systl knob? > >> >> >> > =========================== > >> >> >> > > >> >> >> > From the above data, it's clear that different CPU types have varying > >> >> >> > allocation latencies concerning zone->lock contention. Typically, people > >> >> >> > don't release individual kernel packages for each type of x86_64 CPU. > >> >> >> > > >> >> >> > Furthermore, for latency-insensitive applications, we can keep the default > >> >> >> > setting for better throughput. > >> >> >> > >> >> >> Do you have any data to prove that the default setting is better for > >> >> >> throughput? If so, that will be a strong support for your patch. > >> >> > > >> >> > No, I don't. The primary reason we can't change the default value from > >> >> > 5 to 0 across our fleet of servers is that you initially set it to 5. > >> >> > The sysadmins believe you had a strong reason for setting it to 5 by > >> >> > default; otherwise, it would be considered careless for the upstream > >> >> > kernel. I also believe you must have had a solid justification for > >> >> > setting the default value to 5; otherwise, why would you have > >> >> > submitted your patches? > >> >> > >> >> In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to > >> >> avoid too long latency"), I tried my best to run test on the machines > >> >> available with a micro-benchmark (will-it-scale/page_fault1) which > >> >> exercises kernel page allocator heavily. From the data in commit, > >> >> larger CONFIG_PCP_BATCH_SCALE_MAX helps throughput a little, but not > >> >> much. The 99% alloc/free latency can be kept within about 100us with > >> >> CONFIG_PCP_BATCH_SCALE_MAX == 5. So, we chose 5 as default value. > >> >> > >> >> But, we can always improve the default value with more data, on more > >> >> types of machines and with more types of benchmarks, etc. > >> >> > >> >> Your data suggest smaller default value because you have data to show > >> >> that larger default value has the latency spike issue (as large as tens > >> >> ms) for some practical workloads. Which weren't tested previously. In > >> >> contrast, we don't have strong data to show the throughput advantages of > >> >> larger CONFIG_PCP_BATCH_SCALE_MAX value. > >> >> > >> >> So, I suggest to use a smaller default value for > >> >> CONFIG_PCP_BATCH_SCALE_MAX. But, we may need more test to check the > >> >> data for 1, 2, 3, and 4, in addtion to 0 and 5 to determine the best > >> >> choice. > >> > > >> > Which smaller default value would be better? > >> > >> This depends on further test results. > > > > I believe you agree with me that you can't test all workloads. > > > >> > >> > How can we ensure that other workloads, which we haven't tested, will > >> > work well with this new default value? > >> > >> We cannot. We can only depends on the data available. If there are > >> new data available in the future, we can make the change accordingly. 
> > > > So, your solution is to change the hardcoded value for untested > > workloads and then release the kernel package again? > > > >> > >> > If you have a better default value in mind, would you consider sending > >> > a patch for it? I would be happy to test it with my test case. > >> > >> If you can test the value 1, 2, 3, and 4 with your workload, that will > >> be very helpful! Both allocation latency and total free time (if > >> possible) are valuable. > > > > You know I can't verify it with all workloads, right? > > You have so much data to verify, which indicates uncertainty about any > > default value. Why not make it tunable and let the user choose the > > value they prefer? > > We only make decision based on data available. In theory, we cannot > test all workloads, because there will be new workloads in the future. > If we have data to show that smaller value will cause performance > regressions for some reasonable workloads, we can make it user tunable. The issue arises when a new workload is discovered; you have to release a new kernel package for it. If that's your expectation, why not make it tunable from the start? Had you made it tunable in your original commit, we wouldn't be having this non-intuitive discussion repeatedly. Which came first, the chicken or the egg?
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index f48eaa98d22d..4971289dfb79 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -66,6 +66,7 @@ Currently, these files are in /proc/sys/vm: - page-cluster - page_lock_unfairness - panic_on_oom +- pcp_batch_scale_max - percpu_pagelist_high_fraction - stat_interval - stat_refresh @@ -883,6 +884,22 @@ panic_on_oom=2+kdump gives you very strong tool to investigate why oom happens. You can get snapshot. +pcp_batch_scale_max +=================== + +In page allocator, PCP (Per-CPU pageset) is refilled and drained in +batches. The batch number is scaled automatically to improve page +allocation/free throughput. But too large scale factor may hurt +latency. This option sets the upper limit of scale factor to limit +the maximum latency. + +The range for this parameter spans from 0 to 6, with a default value of 5. +The value assigned to 'N' signifies that during each refilling or draining +process, a maximum of (batch << N) pages will be involved, where "batch" +represents the default batch size automatically computed by the kernel for +each zone. + + percpu_pagelist_high_fraction ============================= diff --git a/mm/Kconfig b/mm/Kconfig index 7b716ac80272..14f64b4f744a 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -690,17 +690,6 @@ config HUGETLB_PAGE_SIZE_VARIABLE config CONTIG_ALLOC def_bool (MEMORY_ISOLATION && COMPACTION) || CMA -config PCP_BATCH_SCALE_MAX - int "Maximum scale factor of PCP (Per-CPU pageset) batch allocate/free" - default 5 - range 0 6 - help - In page allocator, PCP (Per-CPU pageset) is refilled and drained in - batches. The batch number is scaled automatically to improve page - allocation/free throughput. But too large scale factor may hurt - latency. This option sets the upper limit of scale factor to limit - the maximum latency. - config PHYS_ADDR_T_64BIT def_bool 64BIT diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5a842cc13314..bf0c94a0b659 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -273,6 +273,8 @@ int min_free_kbytes = 1024; int user_min_free_kbytes = -1; static int watermark_boost_factor __read_mostly = 15000; static int watermark_scale_factor = 10; +static int pcp_batch_scale_max = 5; +static int sysctl_6 = 6; /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */ int movable_zone; @@ -2391,7 +2393,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone) count = pcp->count; if (count) { int to_drain = min(count, - pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX); + pcp->batch << pcp_batch_scale_max); free_pcppages_bulk(zone, to_drain, pcp, 0); count -= to_drain; @@ -2519,7 +2521,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free /* Free as much as possible if batch freeing high-order pages. 
*/ if (unlikely(free_high)) - return min(pcp->count, batch << CONFIG_PCP_BATCH_SCALE_MAX); + return min(pcp->count, batch << pcp_batch_scale_max); /* Check for PCP disabled or boot pageset */ if (unlikely(high < batch)) @@ -2551,7 +2553,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone, return 0; if (unlikely(free_high)) { - pcp->high = max(high - (batch << CONFIG_PCP_BATCH_SCALE_MAX), + pcp->high = max(high - (batch << pcp_batch_scale_max), high_min); return 0; } @@ -2621,9 +2623,9 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, } else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) { pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER; } - if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX)) + if (pcp->free_count < (batch << pcp_batch_scale_max)) pcp->free_count = min(pcp->free_count + (1 << order), - batch << CONFIG_PCP_BATCH_SCALE_MAX); + batch << pcp_batch_scale_max); high = nr_pcp_high(pcp, zone, batch, free_high); if (pcp->count >= high) { free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high), @@ -2964,7 +2966,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order) * subsequent allocation of order-0 pages without any freeing. */ if (batch <= max_nr_alloc && - pcp->alloc_factor < CONFIG_PCP_BATCH_SCALE_MAX) + pcp->alloc_factor < pcp_batch_scale_max) pcp->alloc_factor++; batch = min(batch, max_nr_alloc); } @@ -6341,6 +6343,15 @@ static struct ctl_table page_alloc_sysctl_table[] = { .proc_handler = percpu_pagelist_high_fraction_sysctl_handler, .extra1 = SYSCTL_ZERO, }, + { + .procname = "pcp_batch_scale_max", + .data = &pcp_batch_scale_max, + .maxlen = sizeof(pcp_batch_scale_max), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = &sysctl_6, + }, { .procname = "lowmem_reserve_ratio", .data = &sysctl_lowmem_reserve_ratio,
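Assuming the patch above is applied, the knob can be read and changed at run
time without a reboot. A minimal sketch (writes require root; values outside
0..6 are rejected by proc_dointvec_minmax, so such a write fails with
EINVAL):

    # Sketch: inspect and adjust the proposed knob through procfs.
    KNOB = "/proc/sys/vm/pcp_batch_scale_max"

    with open(KNOB) as f:
        print("current value:", f.read().strip())   # 5 by default

    with open(KNOB, "w") as f:
        f.write("0")                                 # favour low latency

The same can be done with "sysctl -w vm.pcp_batch_scale_max=0", or made
persistent with a drop-in file under /etc/sysctl.d/.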
A larger page allocation/freeing batch number may cause longer run time of
the code holding zone->lock. If zone->lock is heavily contended at the same
time, latency spikes may occur even for casual page allocation/freeing.
Although reducing the batch number cannot make zone->lock contention any
lighter, it does reduce the latency spikes effectively.

To demonstrate this, I wrote a Python script:

    import mmap

    size = 6 * 1024**3
    while True:
        mm = mmap.mmap(-1, size)
        mm[:] = b'\xff' * size
        mm.close()

Run this script 10 times in parallel and measure the allocation latency by
measuring the duration of rmqueue_bulk() with the BCC tool funclatency[0]:

    funclatency -T -i 600 rmqueue_bulk

Here are the results for both AMD and Intel CPUs.

AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
=====================================================================

- Default value of 5

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 12       |                                        |
      1024 -> 2047       : 9116     |                                        |
      2048 -> 4095       : 2004     |                                        |
      4096 -> 8191       : 2497     |                                        |
      8192 -> 16383      : 2127     |                                        |
     16384 -> 32767      : 2483     |                                        |
     32768 -> 65535      : 10102    |                                        |
     65536 -> 131071     : 212730   |*******************                     |
    131072 -> 262143     : 314692   |*****************************           |
    262144 -> 524287     : 430058   |****************************************|
    524288 -> 1048575    : 224032   |********************                    |
   1048576 -> 2097151    : 73567    |******                                  |
   2097152 -> 4194303    : 17079    |*                                       |
   4194304 -> 8388607    : 3900     |                                        |
   8388608 -> 16777215   : 750      |                                        |
  16777216 -> 33554431   : 88       |                                        |
  33554432 -> 67108863   : 2        |                                        |

  avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242

The average allocation latency can be as high as 449us, and the maximum
latency can exceed 30ms.

- Value set to 0

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 92       |                                        |
      1024 -> 2047       : 8594     |                                        |
      2048 -> 4095       : 2042818  |******                                  |
      4096 -> 8191       : 8737624  |**************************              |
      8192 -> 16383      : 13147872 |****************************************|
     16384 -> 32767      : 8799951  |**************************              |
     32768 -> 65535      : 2879715  |********                                |
     65536 -> 131071     : 659600   |**                                      |
    131072 -> 262143     : 204004   |                                        |
    262144 -> 524287     : 78246    |                                        |
    524288 -> 1048575    : 30800    |                                        |
   1048576 -> 2097151    : 12251    |                                        |
   2097152 -> 4194303    : 2950     |                                        |
   4194304 -> 8388607    : 78       |                                        |

  avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636

The average latency is reduced significantly to 19us, and the maximum
latency is reduced to less than 8ms.

- Conclusion

On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
latency. Latency-sensitive applications will benefit from this tuning.
However, I don't have access to other types of AMD CPUs, so I was unable to
test it on different AMD models.
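For reference, the "run this script 10 times in parallel" step can also be
driven from a small wrapper; a sketch (the worker count matches the test
above, while the 600-second window is an arbitrary choice to line up with
the funclatency interval):

    # Sketch: launch 10 copies of the mmap stress loop for a fixed window,
    # then stop them. Measure with funclatency while the workers run.
    import mmap
    import multiprocessing
    import time

    def stress(size=6 * 1024**3):
        while True:
            mm = mmap.mmap(-1, size)
            mm[:] = b'\xff' * size
            mm.close()

    if __name__ == "__main__":
        workers = [multiprocessing.Process(target=stress, daemon=True)
                   for _ in range(10)]
        for w in workers:
            w.start()
        time.sleep(600)
        for w in workers:
            w.terminate()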
Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
============================================================

- Default value of 5

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 2419     |                                        |
      1024 -> 2047       : 34499    |*                                       |
      2048 -> 4095       : 4272     |                                        |
      4096 -> 8191       : 9035     |                                        |
      8192 -> 16383      : 4374     |                                        |
     16384 -> 32767      : 2963     |                                        |
     32768 -> 65535      : 6407     |                                        |
     65536 -> 131071     : 884806   |****************************************|
    131072 -> 262143     : 145931   |******                                  |
    262144 -> 524287     : 13406    |                                        |
    524288 -> 1048575    : 1874     |                                        |
   1048576 -> 2097151    : 249      |                                        |
   2097152 -> 4194303    : 28       |                                        |

  avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263

- Conclusion

This Intel CPU works fine with the default setting.

Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
==============================================================

Using the cpuset cgroup, we can restrict the test script to run on NUMA
node 0 only.

- Default value of 5

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 46       |                                        |
       512 -> 1023       : 695      |                                        |
      1024 -> 2047       : 19950    |*                                       |
      2048 -> 4095       : 1788     |                                        |
      4096 -> 8191       : 3392     |                                        |
      8192 -> 16383      : 2569     |                                        |
     16384 -> 32767      : 2619     |                                        |
     32768 -> 65535      : 3809     |                                        |
     65536 -> 131071     : 616182   |****************************************|
    131072 -> 262143     : 295587   |*******************                     |
    262144 -> 524287     : 75357    |****                                    |
    524288 -> 1048575    : 15471    |*                                       |
   1048576 -> 2097151    : 2939     |                                        |
   2097152 -> 4194303    : 243      |                                        |
   4194304 -> 8388607    : 3        |                                        |

  avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651

The zone->lock contention becomes severe when there is only a single NUMA
node. The average latency is approximately 144us, with the maximum latency
exceeding 4ms.

- Value set to 0

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 24       |                                        |
       512 -> 1023       : 2686     |                                        |
      1024 -> 2047       : 10246    |                                        |
      2048 -> 4095       : 4061529  |*********                               |
      4096 -> 8191       : 16894971 |****************************************|
      8192 -> 16383      : 6279310  |**************                          |
     16384 -> 32767      : 1658240  |***                                     |
     32768 -> 65535      : 445760   |*                                       |
     65536 -> 131071     : 110817   |                                        |
    131072 -> 262143     : 20279    |                                        |
    262144 -> 524287     : 4176     |                                        |
    524288 -> 1048575    : 436      |                                        |
   1048576 -> 2097151    : 8        |                                        |
   2097152 -> 4194303    : 2        |                                        |

  avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508

After setting it to 0, the average latency is reduced to around 8us, and the
maximum latency is less than 4ms.

- Conclusion

On this Intel CPU, the tuning doesn't help much: latency-sensitive
applications already work well with the default setting.

It is worth noting that all of the above data were collected with the
upstream kernel.

Why introduce a sysctl knob?
============================

From the above data, it's clear that different CPU types have varying
allocation latencies concerning zone->lock contention. Typically, people
don't release individual kernel packages for each type of x86_64 CPU.

Furthermore, for latency-insensitive applications, we can keep the default
setting for better throughput. In our production environment, we set this
value to 0 for applications running on Kubernetes servers while keeping it
at the default value of 5 for other applications like big data. It's not
common to release individual kernel packages for each application.
Future work
===========

To ultimately mitigate the zone->lock contention issue, several approaches
have been suggested. One is to divide large zones into multiple smaller
zones, as proposed by Matthew[1]; another is to split zone->lock using a
mechanism similar to memory arenas and to move away from relying solely on
zone_id to identify the range of free lists a particular page belongs to, as
suggested by Mel[2]. However, implementing either solution is likely to
require a more extended development effort.

Link: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py [0]
Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [1]
Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [2]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
---
 Documentation/admin-guide/sysctl/vm.rst | 17 +++++++++++++++++
 mm/Kconfig                              | 11 -----------
 mm/page_alloc.c                         | 23 +++++++++++++++++------
 3 files changed, 34 insertions(+), 17 deletions(-)