Message ID: 20220104202227.2903605-1-yuzhao@google.com (mailing list archive)
Series: Multigenerational LRU Framework
On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and it
> often makes poor choices about what to evict. This patchset offers an
> alternative solution that is performant, versatile and
> straightforward.

<snipped>

> Summary
> =======
> The facts are:
> 1. The independent lab results and the real-world applications
>    indicate substantial improvements; there are no known regressions.
> 2. Thrashing prevention, working set estimation and proactive reclaim
>    work out of the box; there are no equivalent solutions.
> 3. There is a lot of new code; nobody has demonstrated smaller changes
>    with similar effects.
>
> Our options, accordingly, are:
> 1. Given the amount of evidence, the reported improvements will likely
>    materialize for a wide range of workloads.
> 2. Gauging the interest from the past discussions [14][15][16], the
>    new features will likely be put to use for both personal computers
>    and data centers.
> 3. Based on Google's track record, the new code will likely be well
>    maintained in the long term. It'd be more difficult if not
>    impossible to achieve similar effects on top of the existing
>    design.

Hi Andrew, Linus,

Can you please take a look at this patchset and let me know if it's
5.17 material?

My goal is to get it merged asap so that users can reap the benefits
and I can push the sequels. Please examine the data provided -- I
think the unprecedented coverage and the magnitude of the improvements
warrant a green light.

Thanks!
On Tue, Jan 4, 2022 at 12:30 PM Yu Zhao <yuzhao@google.com> wrote:
>
> My goal is to get it merged asap so that users can reap the benefits
> and I can push the sequels. Please examine the data provided -- I
> think the unprecedented coverage and the magnitude of the improvements
> warrant a green light.

I'll leave this to Andrew. I had some stylistic nits, but all the
actual complexity is in that aging and eviction, and while I looked at
the patches, I certainly couldn't make much of a judgement on them.

The proof is in the numbers, and they look fine, but who knows what
happens when others test it. I don't see anything that looks worrisome
per se, I just see the silly small things that made me go "Eww".

                Linus
Fio / pmem benchmark with MGLRU

TLDR
====
With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
and [9.26, 10.36]% higher throughput, respectively, for random access,
Zipfian (distribution) access and Gaussian (distribution) access, when
the average number of jobs per CPU is 1; 95% CIs [42.32, 49.15]%,
[9.44, 9.89]% and [20.99, 22.86]% higher throughput, respectively, for
random access, Zipfian access and Gaussian access, when the average
number of jobs per CPU is 2.

Background
==========
Many applications running on warehouse-scale computers heavily use
POSIX read(2)/write(2) and the page cache, e.g., Apache Kafka, a
distributed streaming application used by "more than 80% of all
Fortune 100 companies" [1], and PostgreSQL, "the world's most advanced
open source relational database" [2].

Intel DC Persistent Memory, as an affordable alternative to DRAM, can
deliver large capacity and data persistence. Specifically, the device
used in this benchmark can achieve up to 36 GiB/s and 15 GiB/s
throughput, respectively, for sequential and random read access.

Our research group at the University of Rochester focuses on the
intersection of computer architecture and system software. My current
research interest is memory management on tiered memory systems.

Matrix
======
Kernels: version [+ patchset]
* Baseline: 5.15
* Patched: 5.15 + MGLRU

Access patterns (4KB read):
* Random (uniform)
* Zipfian (theta 0.8; the recommended range is 0-2)
* Gaussian (deviation 40; the possible range is 0-100)

Concurrency conditions (the average number of jobs per CPU):
* 1
* 2

Total file size (GB): 400 (~2x memory capacity)
Total configurations: 12
Data points per configuration: 10
Total run duration (minutes) per data point: ~30

Notes
-----
1. All files were stored on pmem. Each job had exclusive access to a
   single file.
2. Due to the hardware limitation when accessing remote pmem [3],
   numactl was used to bind the fio processes to the local pmem.
   Only one of the two NUMA nodes was used during the benchmark.
3. During dry runs, we observed that the throughput doesn't improve
   beyond 2 jobs per CPU for random access. Moreover, the patched
   kernel showed consistent improvements over the baseline kernel when
   using 3 or 4 jobs per CPU.
4. We wanted to simulate real-world scenarios and therefore used the
   default swap configuration (on). Moreover, we didn't observe any
   negative impact on performance with dry runs that disabled swap.

Procedure
=========
<for each kernel>
    grub2-reboot <baseline, patched>
    <for each concurrency condition>
        <generate test files>
        <for each access pattern>
            <for each data point>
                <reboot>
                <run fio>

Hardware
--------
Memory (GiB per socket): 192
CPU (# per socket): 40
Pmem (GiB per socket): 768

Fio
---
$ fio -version
fio-3.28

$ numactl --cpubind=0 --membind=0 fio --name=randread \
    --directory=/mnt/pmem/ --size={10G, 5G} --io_size=1000TB \
    --time_based --numjobs={40, 80} --ioengine=io_uring \
    --ramp_time=20m --runtime=10m --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution={random, zipf:0.8, normal:40} \
    --direct=0 --norandommap --group_reporting

Results
=======

Throughput
----------
The patched kernel achieved substantially higher throughput for all
three access patterns and both concurrency conditions. Specifically,
comparing the patched with the baseline kernel, fio achieved 95% CIs
[38.95, 40.26]%, [4.12, 6.64]% and [9.26, 10.36]% higher throughput,
respectively, for random access, Zipfian access and Gaussian access,
when the average number of jobs per CPU is 1; 95% CIs [42.32, 49.15]%,
[9.44, 9.89]% and [20.99, 22.86]% higher throughput, respectively, for
random access, Zipfian access and Gaussian access, when the average
number of jobs per CPU is 2.
+---------------------+---------------+---------------+
| Mean MiB/s [95% CI] | 1 job / CPU   | 2 jobs / CPU  |
+---------------------+---------------+---------------+
| Random access       | 8411 / 11742  | 8417 / 12267  |
|                     | [3275, 3387]  | [3562, 4137]  |
+---------------------+---------------+---------------+
| Zipfian access      | 14576 / 15360 | 12932 / 14181 |
|                     | [600, 967]    | [1220, 1279]  |
+---------------------+---------------+---------------+
| Gaussian access     | 14564 / 15993 | 11513 / 14037 |
|                     | [1348, 1508]  | [2417, 2631]  |
+---------------------+---------------+---------------+
Table 1. Throughput comparison between the baseline and the patched
kernels

The patched kernel exhibited less degradation in throughput when
running more concurrent jobs. Comparing 2 jobs per CPU with 1 job per
CPU, fio achieved 95% CIs [-11.54, -11.02]% and [-21.61, -20.30]%
higher throughput, respectively, for Zipfian access and Gaussian
access, when using the baseline kernel; 95% CIs [2.04, 6.92]%,
[-8.86, -6.48]% and [-12.83, -11.64]% higher throughput, respectively,
for random access, Zipfian access and Gaussian access, when using the
patched kernel. There were no statistically significant changes in
throughput for the rest of the test matrix.

+---------------------+-----------------+----------------+
| Mean MiB/s [95% CI] | Baseline kernel | Patched kernel |
+---------------------+-----------------+----------------+
| Random access       | 8411 / 8417     | 11741 / 12267  |
|                     | [-55, 69]       | [239, 812]     |
+---------------------+-----------------+----------------+
| Zipfian access      | 14576 / 12932   | 15360 / 14181  |
|                     | [-1682, -1607]  | [-1361, -996]  |
+---------------------+-----------------+----------------+
| Gaussian access     | 14565 / 11513   | 15993 / 14037  |
|                     | [-3147, -2957]  | [-2051, -1861] |
+---------------------+-----------------+----------------+
Table 2.
Throughput comparison between 1 job per CPU and 2 jobs per CPU

Tail Latency
------------
Comparing the patched with the baseline kernel, fio experienced 95%
CIs [-41.77, -40.35]% and [6.64, 13.95]% higher latency at the 99th
percentile, respectively, for random access and Gaussian access, when
the average number of jobs per CPU is 1; 95% CIs [-41.97, -40.59]%,
[-47.74, -47.04]% and [-51.32, -50.27]% higher latency at the 99th
percentile, respectively, for random access, Zipfian access and
Gaussian access, when the average number of jobs per CPU is 2. There
were no statistically significant changes in latency at the 99th
percentile for the rest of the test matrix.

+------------------------------+----------------+------------------+
| 99th percentile latency (us) | 1 job / CPU    | 2 jobs / CPU     |
+------------------------------+----------------+------------------+
| Random access                | 12466 / 7347   | 25560 / 15008    |
|                              | [-5207, -5030] | [-10729, -10375] |
+------------------------------+----------------+------------------+
| Zipfian access               | 3395 / 3382    | 14563 / 7661     |
|                              | [-131, 105]    | [-6953, -6850]   |
+------------------------------+----------------+------------------+
| Gaussian access              | 3280 / 3618    | 15611 / 7681     |
|                              | [217, 457]     | [-8012, -7848]   |
+------------------------------+----------------+------------------+
Table 3. Comparison of the 99th percentile latency between the
baseline and the patched kernels (lower is better)

Metrics collected during each run are available at:
https://github.com/zhaishuang1/MglruPerf/tree/master

A peek at 5.16-rc6
------------------
We also ran the benchmark on 5.16-rc6 with swap off. However, we
haven't collected enough data points to establish a 95% CI.
Here are a few numbers we've collected:

+----------------+------------+----------+----------------+----------+
| Access pattern | Jobs / CPU | 5.16-rc6 | 5.16-rc6-mglru | % change |
+----------------+------------+----------+----------------+----------+
| Random access  | 1          | 7467     | 10440          | 39.8%    |
+----------------+------------+----------+----------------+----------+
| Random access  | 2          | 7504     | 13417          | 78.8%    |
+----------------+------------+----------+----------------+----------+
| Random access  | 3          | 7511     | 13954          | 85.8%    |
+----------------+------------+----------+----------------+----------+
| Random access  | 4          | 7542     | 13925          | 84.6%    |
+----------------+------------+----------+----------------+----------+

References
==========
[1] https://kafka.apache.org/documentation/#design_filesystem
[2] https://www.postgresql.org/docs/11/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
[3] System Evaluation of the Intel Optane byte-addressable NVM,
    MEMSYS 2019.

Appendix
========

Throughput
----------
$ cat raw_data_fio.r
v <- c(
    # baseline 40 procs random
    8467.89, 8428.34, 8383.32, 8253.12, 8464.65,
    8307.42, 8424.78, 8434.44, 8474.88, 8468.26,
    # baseline 40 procs zipf
    14570.44, 14598.03, 14550.74, 14640.29, 14591.4,
    14573.35, 14503.18, 14613.39, 14598.61, 14522.27,
    # baseline 40 procs gaussian
    14504.95, 14427.23, 14652.19, 14519.47, 14557.97,
    14617.92, 14555.87, 14446.94, 14678.12, 14688.33,
    # baseline 80 procs random
    8427.51, 8267.23, 8437.48, 8432.37, 8441.4,
    8454.26, 8413.13, 8412.44, 8444.36, 8444.32,
    # baseline 80 procs zipf
    12980.12, 12946.43, 12911.95, 12925.83, 12952.75,
    12841.44, 12920.35, 12924.19, 12944.38, 12967.72,
    # baseline 80 procs gaussian
    11666.29, 11624.72, 11454.82, 11482.36, 11462.24,
    11379.46, 11691.5, 11471.19, 11402.08, 11494.13,
    # patched 40 procs random
    11706.69, 11778.1, 11774.07, 11750.07, 11744.97,
    11766.65, 11727.79, 11708.41, 11745.3, 11716.45,
    # patched 40 procs zipf
    15498.31, 14647.94, 15423.35, 15467.32, 15467.05, 15342.49,
    15511.34, 15414.06, 15401.1, 15431.57,
    # patched 40 procs gaussian
    15957.86, 15957.13, 16022.69, 16035.85, 16150.2,
    15904.5, 15943.36, 16036.78, 16025.95, 15900.56,
    # patched 80 procs random
    12568.51, 11772.25, 11622.15, 12057.66, 11971.72,
    12693.36, 12399.71, 12553.23, 12242.74, 12793.34,
    # patched 80 procs zipf
    14194.78, 14213.61, 14148.66, 14182.35, 14183.91,
    14192.23, 14163.2, 14179.7, 14162.12, 14196.34,
    # patched 80 procs gaussian
    14084.86, 13706.34, 14089.42, 14058.4, 14096.74,
    14108.06, 14043.41, 14072.15, 14088.44, 14024.51
)

a <- array(v, dim = c(10, 3, 2, 2))

# baseline vs patched
for (concurr in 1:2) {
    for (dist in 1:3) {
        r <- t.test(a[, dist, concurr, 1], a[, dist, concurr, 2])
        print(r)
        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("concurr%d dist%d: no significance", concurr, dist)
        } else {
            s <- sprintf("concurr%d dist%d: [%.2f, %.2f]%%",
                         concurr, dist, -p[2], -p[1])
        }
        print(s)
    }
}

# low concurr vs high concurr
for (kern in 1:2) {
    for (dist in 1:3) {
        r <- t.test(a[, dist, 1, kern], a[, dist, 2, kern])
        print(r)
        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("kern%d dist%d: no significance", kern, dist)
        } else {
            s <- sprintf("kern%d dist%d: [%.2f, %.2f]%%",
                         kern, dist, -p[2], -p[1])
        }
        print(s)
    }
}

$ R -q -s -f raw_data_fio.r

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -132.15, df = 11.177, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3386.514 -3275.766
sample estimates:
mean of x mean of y
  8410.71  11741.85

[1] "concurr1 dist1: [38.95, 40.26]%"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -9.5917, df = 9.4797, p-value = 3.463e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -967.8353 -600.7307
sample estimates:
mean of x mean of y
 14576.17  15360.45

[1] "concurr1 dist2: [4.12, 6.64]%"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -37.744, df = 17.33, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1508.328 -1348.850
sample estimates:
mean of x mean of y
 14564.90  15993.49

[1] "concurr1 dist3: [9.26, 10.36]%"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -30.144, df = 9.3334, p-value = 1.281e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4137.381 -3562.653
sample estimates:
mean of x mean of y
  8417.45  12267.47

[1] "concurr2 dist1: [42.32, 49.15]%"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -92.164, df = 13.276, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1279.417 -1220.931
sample estimates:
mean of x mean of y
 12931.52  14181.69

[1] "concurr2 dist2: [9.44, 9.89]%"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -49.453, df = 17.863, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2631.656 -2417.052
sample estimates:
mean of x mean of y
 11512.88  14037.23

[1] "concurr2 dist3: [20.99, 22.86]%"

	Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = -0.22947, df = 16.403, p-value = 0.8213
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -68.88155  55.40155
sample estimates:
mean of x mean of y
  8410.71   8417.45

[1] "kern1 dist1: no significance"

	Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 91.86, df = 17.875, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1607.021 1682.287
sample estimates:
mean of x mean of y
 14576.17  12931.52

[1] "kern1 dist2: [-11.54, -11.02]%"

	Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 67.477, df = 17.539, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 2956.815 3147.225
sample estimates:
mean of x mean of y
 14564.90  11512.88

[1] "kern1 dist3: [-21.61, -20.30]%"

	Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = -4.1443, df = 9.0781, p-value = 0.002459
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -812.1507 -239.0833
sample estimates:
mean of x mean of y
 11741.85  12267.47

[1] "kern2 dist1: [2.04, 6.92]%"

	Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 14.566, df = 9.1026, p-value = 1.291e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  996.0064 1361.5196
sample estimates:
mean of x mean of y
 15360.45  14181.69

[1] "kern2 dist2: [-8.86, -6.48]%"

	Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 43.826, df = 15.275, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1861.263 2051.247
sample estimates:
mean of x mean of y
 15993.49  14037.23

[1] "kern2 dist3: [-12.83, -11.64]%"

99th Percentile Latency
-----------------------
$ cat raw_data_fio_lat.r
v <- c(
    # baseline 40 procs random
    12649, 12387, 12518, 12518, 12518,
    12387, 12518, 12518, 12387, 12256,
    # baseline 40 procs zipf
    3458, 3294, 3425, 3294, 3294,
    3359, 3752, 3326, 3294, 3458,
    # baseline 40 procs gaussian
    3326, 3458, 3195, 3392, 3326,
    3228, 3228, 3326, 3130, 3195,
    # baseline 80 procs random
    25560, 26084, 25560, 25560, 25297,
    25297, 25822, 25560, 25560, 25297,
    # baseline 80 procs zipf
    14484, 14615, 14615, 14484, 14484,
    14615, 14615, 14615, 14615, 14484,
    # baseline 80 procs gaussian
    15664, 15664, 15533, 15533, 15533,
    15664, 15795, 15533, 15664, 15533,
    # patched 40 procs random
    7439, 7242, 7373, 7373, 7373,
    7439, 7242, 7308, 7308, 7373,
    # patched 40 procs zipf
    3261, 3425, 3392, 3294, 3359,
    3556, 3228, 3490, 3458, 3359,
    # patched 40 procs gaussian
    3687, 3523, 3556, 3523, 3752,
    3654, 3884, 3490, 3392, 3720,
    # patched 80 procs random
    15008, 15008, 15008, 15008, 15008,
    15008, 15008, 15008, 15008, 15008,
    # patched 80 procs zipf
    7701, 7635, 7701, 7701, 7635,
    7635, 7701, 7635, 7635, 7635,
    # patched 80 procs gaussian
    7635, 7898, 7701, 7635, 7635,
    7635, 7635, 7635, 7701, 7701
)

a <- array(v, dim = c(10, 3, 2, 2))

# baseline vs patched
for (concurr in 1:2) {
    for (dist in 1:3) {
        r <- t.test(a[, dist, concurr, 1], a[, dist, concurr, 2])
        print(r)
        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("concurr%d dist%d: no significance", concurr, dist)
        } else {
            s <- sprintf("concurr%d dist%d: [%.2f, %.2f]%%",
                         concurr, dist, -p[2], -p[1])
        }
        print(s)
    }
}

$ R -q -s -f raw_data_fio_lat.r

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = 123.52, df = 15.287, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 5030.417 5206.783
sample estimates:
mean of x mean of y
  12465.6    7347.0

[1] "concurr1 dist1: [-41.77, -40.35]%"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = 0.23667, df = 16.437, p-value = 0.8158
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -104.7812  131.1812
sample estimates:
mean of x mean of y
   3395.4    3382.2

[1] "concurr1 dist2: no significance"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = -5.9754, df = 16.001, p-value = 1.94e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -457.5065 -217.8935
sample estimates:
mean of x mean of y
   3280.4    3618.1

[1] "concurr1 dist3: [6.64, 13.95]%"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = 134.89, df = 9, p-value = 3.437e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 10374.74 10728.66
sample estimates:
mean of x mean of y
  25559.7   15008.0

[1] "concurr2 dist1: [-41.97, -40.59]%"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = 288.1, df = 13.292, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 6849.566 6952.834
sample estimates:
mean of x mean of y
  14562.6    7661.4

[1] "concurr2 dist2: [-47.74, -47.04]%"

	Welch Two Sample t-test

data:  a[, dist, concurr, 1] and a[, dist, concurr, 2]
t = 203.64, df = 17.798, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 7848.616 8012.384
sample estimates:
mean of x mean of y
  15611.6    7681.1

[1] "concurr2 dist3: [-51.32, -50.27]%"
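[Editorial note: the R session in the appendix can be sanity-checked
independently. The Python sketch below is an illustration added here, not
part of the original report; it recomputes the Welch t statistic and
degrees of freedom for the first comparison (baseline vs. patched, 40
jobs, random access) from the raw throughput numbers above, and verifies
one entry of the 5.16-rc6 "% change" column.]

```python
import math

# Raw throughput (MiB/s), copied verbatim from raw_data_fio.r above.
baseline = [8467.89, 8428.34, 8383.32, 8253.12, 8464.65,
            8307.42, 8424.78, 8434.44, 8474.88, 8468.26]
patched = [11706.69, 11778.1, 11774.07, 11750.07, 11744.97,
           11766.65, 11727.79, 11708.41, 11745.3, 11716.45]

def welch(x, y):
    """Welch's two-sample t statistic and (fractional) degrees of freedom."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)  # sample variance of x
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)  # sample variance of y
    se2 = vx / nx + vy / ny
    t = (mx - my) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

t, df = welch(baseline, patched)
# R reported: t = -132.15, df = 11.177, means 8410.71 and 11741.85.
print(round(t, 2), round(df, 3))

# One entry of the "% change" column in the 5.16-rc6 table
# (random access, 2 jobs/CPU): (13417 - 7504) / 7504.
pct = (13417 - 7504) / 7504 * 100
print(round(pct, 1))
```

R's t.test() applies the same Welch correction by default (var.equal =
FALSE), which is why the reported degrees of freedom are fractional.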
Hi Yu,

On Tue, 4 Jan 2022 13:22:19 -0700 Yu Zhao <yuzhao@google.com> wrote:

> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and it
> often makes poor choices about what to evict. This patchset offers an
> alternative solution that is performant, versatile and
> straightforward.
>
[...]
> Summary
> =======
> The facts are:
> 1. The independent lab results and the real-world applications
>    indicate substantial improvements; there are no known regressions.

So impressive results!

> 2. Thrashing prevention, working set estimation and proactive reclaim
>    work out of the box; there are no equivalent solutions.

I think similar works are already available out of the box with the
latest mainline tree, though they might be suboptimal in some cases.

First, you can do thrashing prevention using a DAMON-based Operation
Scheme (DAMOS)[1] with the MADV_COLD action. Second, for working set
estimation, you can either use DAMOS again with the statistics action,
or the damon_aggregated tracepoint[2]. The DAMON user space tool[3]
helps with the tracepoint analysis and visualization. Finally, for
proactive reclaim, you can again use DAMOS with the MADV_PAGEOUT
action, or simply the DAMON-based proactive reclaim module
(DAMON_RECLAIM)[4].

Nevertheless, as noted above, the current DAMON-based solutions might
be suboptimal in some cases. First of all, DAMON currently doesn't
provide page granularity monitoring. Though its monitoring results
were useful for our users' production usages, there could be different
requirements and situations. Secondly, the DAMON-based thrashing
prevention wouldn't reduce the CPU usage of the reclamation logic's
access scanning.

So, to me, the MGLRU patchset looks like it provides something that
DAMON doesn't, but also something that DAMON already provides.
Specifically, the efficient page granularity access scanning is what
DAMON doesn't provide for now.
However, the utilization of the access information for LRU list
manipulation (thrashing prevention) and proactive reclamation is
similar to what DAMON (specifically, DAMOS) provides. Also, this
patchset reduces the reclamation logic's CPU usage by using the
efficient page granularity access scanning.

IMHO, we might be able to reduce the duplicates by integrating MGLRU
into DAMON. What I'm saying is, we could 1) introduce the efficient
page granularity access scanning, 2) reduce the reclamation logic's
CPU usage by making it use the efficient page granularity access
scanning, and 3) extend DAMON for page granularity monitoring with the
efficient access scanning[5]. Then, users could get the benefit of
MGLRU by using DAMOS but setting it to use your efficient page
granularity access scanning. To make it simpler, we could extend
existing kernel logic to use DAMON in that way, or implement a new
kernel module. Additional advantages of this approach would be 1)
reducing the changes to the existing code, and 2) making the efficient
page granularity access information available for more general cases.

Of course, the integration might not be as simple as it seems to me
now. We could put DAMON and MGLRU together as they are for now, and
let users select what they really want. I think it's up to you.

I didn't read this patchset thoroughly yet, so I might be missing many
things. If so, please feel free to let me know.

[1] https://docs.kernel.org/admin-guide/mm/damon/usage.html#schemes
[2] https://docs.kernel.org/admin-guide/mm/damon/usage.html#tracepoint-for-monitoring-results
[3] https://github.com/awslabs/damo
[4] https://docs.kernel.org/admin-guide/mm/damon/reclaim.html
[5] https://docs.kernel.org/vm/damon/design.html#configurable-layers

Thanks,
SJ

[...]
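[Editorial note, for readers unfamiliar with DAMON_RECLAIM: per the
documentation linked in [4] above, the module is controlled through
module parameters. A minimal sketch, assuming a 5.16+ kernel built with
CONFIG_DAMON_RECLAIM; the parameter names and units are taken from that
document, not from this thread:]

```shell
# Sketch only: treat regions unaccessed for 30 seconds as cold
# (min_age is in microseconds), then enable the module.
echo 30000000 > /sys/module/damon_reclaim/parameters/min_age
echo Y > /sys/module/damon_reclaim/parameters/enabled
```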
On Wed, Jan 05, 2022 at 08:55:34AM +0000, SeongJae Park wrote:
> Hi Yu,
>
> On Tue, 4 Jan 2022 13:22:19 -0700 Yu Zhao <yuzhao@google.com> wrote:
>
> > TLDR
> > ====
> > The current page reclaim is too expensive in terms of CPU usage and it
> > often makes poor choices about what to evict. This patchset offers an
> > alternative solution that is performant, versatile and
> > straightforward.
> >
> [...]
> > Summary
> > =======
> > The facts are:
> > 1. The independent lab results and the real-world applications
> >    indicate substantial improvements; there are no known regressions.
>
> So impressive results!
>
> > 2. Thrashing prevention, working set estimation and proactive reclaim
> >    work out of the box; there are no equivalent solutions.
>
> I think similar works are already available out of the box with the latest
> mainline tree, though it might be suboptimal in some cases.

Ok, I will sound harsh because I hate it when people challenge facts
while having no idea what they are talking about.

Our jobs are to help the leadership make the best decisions by
providing them with facts, not feeding them crap.

Don't get me wrong -- you are welcome to start another thread and have
a casual discussion with me. But this thread is not for that; it's for
the leadership and stakeholders to make a decision. Check who are in
"To" and "Cc" and what my request is.

> I didn't read this patchset thoroughly yet, so I might missing many things. If
> so, please feel free to let me know.

Yes, apparently you didn't read this patchset thoroughly, and you have
missed all the things that matter to this thread.

> First, you can do thrashing prevention using DAMON-based Operation Scheme
> (DAMOS)[1] with MADV_COLD action.

Here is what thrashing prevention really means, from patch 8:

+Personal computers
+------------------
+:Thrashing prevention: Write ``N`` to
+  ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
+  ``N`` milliseconds from getting evicted.
The OOM killer is invoked if
+  this working set can't be kept in memory. Based on the average human
+  detectable lag (~100ms), ``N=1000`` usually eliminates intolerable
+  lags due to thrashing. Larger values like ``N=3000`` make lags less
+  noticeable at the cost of more OOM kills.

It's about when to trigger OOM kills. Got it? Or probably you don't
understand what MADV_COLD is either?

> Second, for working set estimation, you can either use the DAMOS
> again with statistics action, or the damon_aggregated tracepoint[2].

This is what you are suggesting:

TRACE_EVENT(damon_aggregated,

	TP_printk("target_id=%lu nr_regions=%u %lu-%lu: %u",
		__entry->target_id, __entry->nr_regions,
		__entry->start, __entry->end, __entry->nr_accesses)

Now read my doc again:

+Data centers
+------------
+:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
+  format:
+
+    memcg memcg_id memcg_path
+      node node_id

Have you heard of something called memcg? And NUMA node? How exactly
can this tracepoint provide information about different memcgs and
NUMA nodes?

> The DAMON user space tool[3] helps the tracepoint analysis and
> visualization.

What does "work out of the box" mean? Should every Linux desktop,
laptop and phone user install this tool?

> Finally, for the proactive reclaim, you can again use the DAMOS
> with MADV_PAGEOUT action

How exactly does MADV_PAGEOUT find pages that are NOT mapped in page
tables? Let me tell you another fact: they are usually the cheapest to
reclaim.

> or simply the DAMON-based proactive reclaim module (DAMON_RECLAIM)[4].
> [4] https://docs.kernel.org/admin-guide/mm/damon/reclaim.html

How many knobs does DAMON_RECLAIM have? 14? I lost count.

> Of course, the integration might not be so simple as seems to me now.

Look, I'm open to your suggestion. I probably should have been nicer.
So I'm sorry. I just don't appreciate alternative facts.
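[Editorial note: the thrashing-prevention interface quoted in the patch-8
excerpt above amounts to a single sysfs write. A minimal sketch, assuming
a kernel with this patchset applied:]

```shell
# From the quoted documentation: keep the working set of the last
# N=1000 milliseconds in memory; the OOM killer is invoked if that
# working set can't be kept. Larger N trades more OOM kills for
# less noticeable lag.
echo 1000 > /sys/kernel/mm/lru_gen/min_ttl_ms
```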
On Wed, Jan 05, 2022 at 03:53:07AM -0700, Yu Zhao wrote:
> Look, I'm open to your suggestion. I probably should have been nicer.
> So I'm sorry. I just don't appreciate alternative facts.

Yes, you should've been *much* nicer. I've been reading lkml for
pretty much 20 years now and you just made my eyebrows go up -
something which pretty much never happens these days. So you need to
check yourself before replying.

Looking at git history, you're not a newbie, so you've probably picked
up - at least from the sidelines - all those code of conduct
discussions. And I'm not going to point you to it - I'm sure you can
find it yourself and peruse it at your own convenience.

Long story short: we all try to be civil to each other now, even if it
is hard sometimes.
On Wed, 5 Jan 2022 03:53:07 -0700 Yu Zhao <yuzhao@google.com> wrote:

> On Wed, Jan 05, 2022 at 08:55:34AM +0000, SeongJae Park wrote:
> > Hi Yu,
> >
> > On Tue, 4 Jan 2022 13:22:19 -0700 Yu Zhao <yuzhao@google.com> wrote:
[...]
> > I think similar works are already available out of the box with the latest
> > mainline tree, though it might be suboptimal in some cases.
>
> Ok, I will sound harsh because I hate it when people challenge facts
> while having no idea what they are talking about.
>
> Our jobs are to help the leadership make the best decisions by
> providing them with facts, not feeding them crap.

I was using the word "similar" to indicate only a rough concept-level
similarity, rather than detailed facts. But it seems that was not
clear enough; sorry. Anyway, I will not talk more and thus disturb the
important discussion you are having with the leaders here, as you are
asking.

>
> Don't get me wrong -- you are welcome to start another thread and have
> a casual discussion with me. But this thread is not for that; it's for
> the leadership and stakeholders to make a decision. Check who are in
> "To" and "Cc" and what my request is.

Haha. Ok, good luck to you.

Thanks,
SJ

[...]
On Wed, Jan 05, 2022 at 11:25:27AM +0000, SeongJae Park wrote:
> On Wed, 5 Jan 2022 03:53:07 -0700 Yu Zhao <yuzhao@google.com> wrote:
>
> > On Wed, Jan 05, 2022 at 08:55:34AM +0000, SeongJae Park wrote:
> > > Hi Yu,
> > >
> > > On Tue, 4 Jan 2022 13:22:19 -0700 Yu Zhao <yuzhao@google.com> wrote:
> [...]
> > > I think similar works are already available out of the box with the latest
> > > mainline tree, though it might be suboptimal in some cases.
> >
> > Ok, I will sound harsh because I hate it when people challenge facts
> > while having no idea what they are talking about.
> >
> > Our jobs are to help the leadership make the best decisions by
> > providing them with facts, not feeding them crap.
>
> I was using the word "similar", to represent this is only for a rough concept
> level similarity, rather than detailed facts. But, seems it was not enough,
> sorry. Anyway, I will not talk more and thus disturb you having the important
> discussion with leaders here, as you are asking.

First of all, I want to apologize. I detested what I read, and I still
don't like "a rough concept level similarity" sitting next to a
factual statement. But as Borislav has reminded me, my tone did cross
the line. I should have used an objective approach to express my
(very) different views.

I hope that's all water under the bridge now. And I do plan to carry
on with what I should have done.
On Tue, Jan 04, 2022 at 01:43:13PM -0800, Linus Torvalds wrote:
> On Tue, Jan 4, 2022 at 12:30 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > My goal is to get it merged asap so that users can reap the benefits
> > and I can push the sequels. Please examine the data provided -- I
> > think the unprecedented coverage and the magnitude of the improvements
> > warrant a green light.
>
> I'll leave this to Andrew. I had some stylistic nits, but all the
> actual complexity is in that aging and eviction, and while I looked at
> the patches, I certainly couldn't make much of a judgement on them.
>
> The proof is in the numbers, and they look fine, but who knows what
> happens when others test it. I don't see anything that looks worrisome
> per se, I just see the silly small things that made me go "Eww".

I appreciate your time. I'll address all your comments together with
others' in the next spin, after I hear from Andrew. (I'm assuming he
will have comments too.)
On Tue 04-01-22 13:30:00, Yu Zhao wrote:
[...]
> Hi Andrew, Linus,
>
> Can you please take a look at this patchset and let me know if it's
> 5.17 material?

I am still not done with the review and have seen at least a few
problems that would need to be addressed.

But more fundamentally I believe there are really some important
questions to be answered. First and foremost this is a major addition
to the memory reclaim and there should be a wider consensus that we
really want to go that way. The patchset doesn't have a single ack nor
reviewed-by AFAICS. I haven't seen a lot of discussion since v2
(http://lkml.kernel.org/r/20210413065633.2782273-1-yuzhao@google.com)
nor do I see any clarification on how concerns raised there have been
addressed or at least how they are planned to be addressed.

Johannes has made some excellent points
http://lkml.kernel.org/r/YHcpzZYD2fQyWvEQ@cmpxchg.org. Let me quote
for reference the part of it I find the most important:

: Realistically, I think incremental changes are unavoidable to get this
: merged upstream.
:
: Not just in the sense that they need to be smaller changes, but also
: in the sense that they need to replace old code. It would be
: impossible to maintain both, focus development and testing resources,
: and provide a reasonably stable experience with both systems tugging
: at a complicated shared code base.
:
: On the other hand, the existing code also has billions of hours of
: production testing and tuning. We can't throw this all out overnight -
: it needs to be surgical and the broader consequences of each step need
: to be well understood.
:
: We also have millions of servers relying on being able to do upgrades
: for drivers and fixes in other subsystems that we can't put on hold
: until we stabilized a new reclaim implementation from scratch.

Fully agreed on all points here.

I do appreciate there is a lot of work behind this patchset and I also
do understand it has gained a considerable amount of testing as well.
Your numbers are impressive but my experience tells me that it is
equally important to understand the worst case behavior, and there is
not really much mentioned about that in the changelogs.

We also shouldn't ignore the costs the code is adding. One of them
would be further page flags depletion. We have been hitting problems on
that front for years and many features had to be reworked to bypass the
lack of space in page->flags.

I will be looking more into the code (especially the memcg side of it)
but I really believe that a consensus on Johannes' points above needs
to be found first before this work can move forward.

Thanks!
On Fri, Jan 07, 2022 at 10:38:18AM +0100, Michal Hocko wrote: > On Tue 04-01-22 13:30:00, Yu Zhao wrote: > [...] > > Hi Andrew, Linus, > > > > Can you please take a look at this patchset and let me know if it's > > 5.17 material? > > I am still not done with the review and have seen at least few problems > that would need to be addressed. > > But more fundamentally I believe there are really some important > questions to be answered. First and foremost this is a major addition > to the memory reclaim and there should be a wider consensus that we > really want to go that way. The patchset doesn't have a single ack nor > reviewed-by AFAICS. I haven't seen a lot of discussion since v2 > (http://lkml.kernel.org/r/20210413065633.2782273-1-yuzhao@google.com) > nor do I see any clarification on how concerns raised there have been > addressed or at least how they are planned to be addressed. > > Johannes has made some excellent points > http://lkml.kernel.org/r/YHcpzZYD2fQyWvEQ@cmpxchg.org. Let me quote > for reference part of it I find the most important: > : Realistically, I think incremental changes are unavoidable to get this > : merged upstream. > : > : Not just in the sense that they need to be smaller changes, but also > : in the sense that they need to replace old code. It would be > : impossible to maintain both, focus development and testing resources, > : and provide a reasonably stable experience with both systems tugging > : at a complicated shared code base. > : > : On the other hand, the existing code also has billions of hours of > : production testing and tuning. We can't throw this all out overnight - > : it needs to be surgical and the broader consequences of each step need > : to be well understood. > : > : We also have millions of servers relying on being able to do upgrades > : for drivers and fixes in other subsystems that we can't put on hold > : until we stabilized a new reclaim implementation from scratch. > > Fully agreed on all points here. 
>
> I do appreciate there is a lot of work behind this patchset and I
> also do understand it has gained a considerable amount of testing as
> well. Your numbers are impressive but my experience tells me that it is
> equally important to understand the worst case behavior and there is not
> really much mentioned about those in changelogs.
>
> We also shouldn't ignore costs the code is adding. One of them would be
> a further page flags depletion. We have been hitting problems on that
> front for years and many features had to be reworked to bypass a lack of
> space in page->flags.
>
> I will be looking more into the code (especially the memcg side of it)
> but I really believe that a consensus on above Johannes' points need to
> be found first before this work can move forward.

Thanks for the summary. I appreciate your time and I agree your
assessment is fair.

So I've acknowledged your concerns, and you've acknowledged my numbers
(the performance improvements) are impressive. Now we are in agreement,
cheers.

Next, I argue that the benefits of this patchset outweigh its risks,
because, drawing from my past experience:
1. There have been many larger and/or riskier patchsets taken; I'll
   assemble a list if you disagree. And this patchset is fully guarded
   by #ifdef; Linus has also assessed on this point.
2. There have been none that came with the testing/benchmarking
   coverage as this one did. Please point me to some if I'm mistaken,
   and I'll gladly match them.

The numbers might not materialize in the real world; the code is not
perfect; and there are many other risks. But all the top eight open
source memory hogs were covered, which is unprecedented; memcached and
fio showed significant improvements, and it only takes a few commands
to see for yourselves.

Regarding the acks and the reviewed-bys, I certainly can ask people
who have reaped the benefits of this patchset to do them, if it's
required. But I see less fun in that.
I prefer to provide empirical evidence and convince people who are on the other side of the aisle.
Note that with vm.swappiness=0, the vm.watermark_scale_factor value
does not affect whether swap occurs: MGLRU ignores sc->file_is_tiny.

With the classic 2-gen LRU, swapping works at swappiness=0 and a high
vm.watermark_scale_factor, which is expected according to the
documentation: "At 0, the kernel will not initiate swap until the
amount of free and file-backed pages is less than the high watermark
in a zone." [1]

With vm.swappiness=0, no swap occurs with MGLRU at any
vm.watermark_scale_factor value. In practice (with MGLRU v3) I saw
that swapping was impossible when vm.swappiness=0 and
vm.watermark_scale_factor=1000.

At a minimum, this will require updating the documentation for
vm.swappiness.

BTW, why doesn't MGLRU use something like sc->file_is_tiny?

[1] https://github.com/torvalds/linux/blob/v5.16/Documentation/admin-guide/sysctl/vm.rst#swappiness
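[Editorial note: a minimal sketch for anyone wanting to reproduce the observation above. The inspection commands are standard procfs reads; the tuning steps are shown commented out because they require root and affect the whole system, and the memory-pressure workload is left as a placeholder:]

```shell
# Current values of the two knobs discussed above (read-only, no root).
cat /proc/sys/vm/swappiness
cat /proc/sys/vm/watermark_scale_factor

# To reproduce, as root:
#   sysctl vm.swappiness=0
#   sysctl vm.watermark_scale_factor=1000
#   <run a memory-pressure workload>
# then check whether any swap-outs were counted:
grep -E '^(pswpin|pswpout) ' /proc/vmstat
```

With the classic LRU the pswpout counter keeps climbing under pressure in this configuration; the report above says it stays flat under MGLRU.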
On Fri 07-01-22 11:45:40, Yu Zhao wrote:
[...]
> Next, I argue that the benefits of this patchset outweigh its risks,
> because, drawing from my past experience,
> 1. There have been many larger and/or riskier patchsets taken; I'll
>    assemble a list if you disagree.

No question about that. Changes in the reclaim path are paved with
failures and reverts and fine tuning on top of existing fine tuning.
The difference from your patchset is that they tend to be much much
smaller and go incremental and are therefore easier to review.

> And this patchset is fully guarded
> by #ifdef; Linus has also assessed on this point.

I appreciate you made the new behavior an opt-in and therefore existing
workloads are less likely to regress. I do not think ifdefs help all
that much, though, because a) realistically the config will likely be
enabled for most distribution kernels and b) the parallel reclaim
implementation adds a maintenance overhead regardless of those ifdefs.
The latter point is especially worrying because the memory reclaim is a
complex and hard to review beast already. Any future changes would need
to consider both reclaim algorithms of course. Hence I argue we really
need a wider consensus that this is the right direction we want to
pursue.

> 2. There have been none that came with the testing/benchmarking
>    coverage as this one did. Please point me to some if I'm mistaken,
>    and I'll gladly match them.

I do appreciate your numbers but you should realize that this is an
area that is really hard to get any conclusive testing for. We keep
learning about fallouts on workloads we haven't really anticipated or
where the runtime effects happen to disagree with our intuition. So
while those numbers are nice there are other important aspects to
consider, like the maintenance cost for example.

> The numbers might not materialize in the real world; the code is not
> perfect; and many other risks... But all the top eight open source
> memory hogs were covered, which is unprecedented; memcached and fio
> showed significant improvements and it only takes a few commands to
> see for yourselves.
>
> Regarding the acks and the reviewed-bys, I certainly can ask people
> who have reaped the benefits of this patchset to do them, if it's
> required. But I see less fun in that. I prefer to provide empirical
> evidence and convince people who are on the other side of the aisle.

I like to hear from users who benefit from your work, and that
certainly gives more credit to it. But it will be the MM community that
maintains the code and addresses future issues. We do not have a
dedicated maintainer for the memory reclaim, but certainly there are
people who have helped shape the existing code and have learned a lot
from past issues - like Johannes, Rik and Mel, just to name a few. If I
were you I would be really looking into finding an agreement with them.
I myself can help you with the memcg and oom side of things (we already
have discussions about those).

Thanks!
On Mon, Jan 10, 2022 at 04:39:51PM +0100, Michal Hocko wrote:
> On Fri 07-01-22 11:45:40, Yu Zhao wrote:
> [...]
> > Next, I argue that the benefits of this patchset outweigh its risks,
> > because, drawing from my past experience,
> > 1. There have been many larger and/or riskier patchsets taken; I'll
> >    assemble a list if you disagree.
>
> No question about that. Changes in the reclaim path are paved with
> failures and reverts and fine tuning on top of existing fine tuning.
> The difference from your patchset is that they tend to be much much
> smaller and go incremental and therefore easier to review.

No argument here.

> > And this patchset is fully guarded
> > by #ifdef; Linus has also assessed on this point.
>
> I appreciate you made the new behavior an opt-in and therefore existing
> workloads are less likely to regress. I do not think ifdefs help
> all that much, though, because a) realistically the config will
> likely be enabled for most distribution kernels

There is also a runtime kill switch.

> b) the parallel
> reclaim implementation adds a maintenance overhead regardless of those
> ifdefs. The latter point is especially worrying because the memory reclaim
> is a complex and hard to review beast already. Any future changes would
> need to consider both reclaim algorithms of course.

A perfectly legitimate concern. If this patchset is taken:
1. There will be refactoring that makes the long-term maintenance as
   affordable as possible, i.e., similar to the SL.B model, but with a
   runtime switch as well.
2. There will also be optimizations for mmu notifier (KVM), THP, etc.
3. Most importantly, Google will be committing more resources to this.
   And that's why we need to hear a decision -- our resource planning
   depends on it.

> Hence I argue we really need a wider consensus this is the right
> direction we want to pursue.
We've been doing our best to get this consensus -- we invited all the
stakeholders to meetings a long time ago -- but unfortunately we
couldn't move the needle. I agree consensus is important. But, IMO,
progress is even more important. And personally, I'd rather try
something wrong than do nothing.

> > 2. There have been none that came with the testing/benchmarking
> >    coverage as this one did. Please point me to some if I'm mistaken,
> >    and I'll gladly match them.
>
> I do appreciate your numbers but you should realize that this is an area
> that is really hard to get any conclusive testing for.

Fully agreed. That's why we started a new initiative, and we hope more
people will follow these practices:
1. All results in this area should be reported with at least standard
   deviations, or preferably confidence intervals.
2. Real applications should be benchmarked (with synthetic load
   generators), not just synthetic benchmarks (which are not real
   applications).
3. A wide range of devices should be covered, i.e., servers, desktops,
   laptops and phones.

I'm very confident to say our benchmark reports were held to the
highest standards. We have worked with MariaDB (company), EnterpriseDB
(Postgres), Redis (company), etc. on these reports. They have copies of
these reports (PDF version):
https://linux-mm.googlesource.com/benchmarks/

We welcome any expert in those applications to examine our reports, and
we'll be happy to run any other benchmarks, or the same benchmarks with
different configurations, that anybody thinks are important and we've
missed.

> We keep learning
> about fallouts on workloads we haven't really anticipated or where the
> runtime effects happen to disagree with our intuition. So while those
> numbers are nice there are other important aspects to consider like the
> maintenance cost for example.

I assume we agree this is not an easy decision. Can I also assume we
agree that this decision should be made within a reasonable time frame?
> > The numbers might not materialize in the real world; the code is not
> > perfect; and many other risks... But all the top eight open source
> > memory hogs were covered, which is unprecedented; memcached and fio
> > showed significant improvements and it only takes a few commands to
> > see for yourselves.
> >
> > Regarding the acks and the reviewed-bys, I certainly can ask people
> > who have reaped the benefits of this patchset to do them, if it's
> > required. But I see less fun in that. I prefer to provide empirical
> > evidence and convince people who are on the other side of the aisle.
>
> I like to hear from users who benefit from your work and that certainly
> gives more credit to it. But it will be the MM community to maintain the
> code and address future issues.

I'll ask downstream kernel maintainers (from different distros) that
have taken this patchset to ACK. I'll ask credible testers who are
professionals, researchers, and contributors to other subsystems to
provide Tested-by's.

There are many other individual testers whose efforts I may not be able
to acknowledge; e.g., my coworker just sent this to me:

"Using that v5 for some time and confirm that difference under heavy
load and memory pressure is significant."
https://www.phoronix.com/forums/forum/software/general-linux-open-source/1301258-mglru-is-a-very-enticing-enhancement-for-linux-in-2022#post1301275

I'll leave the reviews in your capable hands. As I said, I prefer to
convince people with empirical evidence.

> We do not have a dedicated maintainer for the memory reclaim but
> certainly there are people who have helped shaping the existing code and
> have learned a lot from the past issues - like Johannes, Rik, Mel just
> to name few. If I were you I would be really looking into finding an
> agreement with them. I myself can help you with memcg and oom side of
> the things (we already have discussions about those).

Unfortunately people have different priorities.
As I said, we tried to get all the stakeholders in the same
(conference) room so that we could make some good progress. But we
failed. Rest assured, we'll keep trying.

But please understand we need to do cost control and therefore we can't
keep investing in this effort forever. So I think it's not
unreasonable, after I've addressed all pending comments, to ask for
some clear instructions from the leadership:
1. Yes
2. No
3. Or something specific

Thanks!
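[Editorial note: practice 1 from the list above (report means with standard deviations or confidence intervals) is cheap to apply. A minimal sketch follows; the throughput samples are made up for illustration, and a normal approximation (z = 1.96) stands in for a proper t-interval to stay within the Python stdlib:]

```python
import math
import statistics

def summarize(samples, z=1.96):
    """Return (mean, sample stddev, ~95% CI) for repeated benchmark runs."""
    n = len(samples)
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)   # sample standard deviation (n - 1)
    half = z * sd / math.sqrt(n)     # normal-approximation CI half-width
    return mean, sd, (mean - half, mean + half)

# hypothetical throughput samples (MB/s) from ten runs of one benchmark
runs = [1083, 1101, 1095, 1088, 1107, 1092, 1099, 1085, 1103, 1090]
mean, sd, (low, high) = summarize(runs)
print(f"{mean:.1f} MB/s ± {sd:.1f} (95% CI {low:.1f}..{high:.1f})")
```

Non-overlapping intervals between two kernels are strong evidence of a real difference; heavily overlapping ones suggest more runs are needed before claiming a win.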
> > > 2. There have been none that came with the testing/benchmarking > > > coverage as this one did. Please point me to some if I'm mistaken, > > > and I'll gladly match them. > > > > I do appreciate your numbers but you should realize that this is an area > > that is really hard to get any conclusive testing for. > > Fully agreed. That's why we started a new initiative, and we hope more > people will following these practices: > 1. All results in this area should be reported with at least standard > deviations, or preferably confidence intervals. > 2. Real applications should be benchmarked (with synthetic load > generator), not just synthetic benchmarks (not real applications). > 3. A wide range of devices should be covered, i.e., servers, desktops, > laptops and phones. > > I'm very confident to say our benchmark reports were hold to the > highest standards. We have worked with MariaDB (company), EnterpriseDB > (Postgres), Redis (company), etc. on these reports. They have copies > of these reports (PDF version): > https://linux-mm.googlesource.com/benchmarks/ > > We welcome any expert in those applications to examine our reports, > and we'll be happy to run any other benchmarks or same benchmarks with > different configurations that anybody thinks it's important and we've > missed. I really think this gets at the heart of the issue with mm development, and is one of the reasons it's been extra frustrating to not have an MM conf for the past couple of years; I think sorting out how we measure & proceed on changes would be easier done f2f. E.g. concluding with a consensus that if something doesn't regress on X, Y, and Z, and has reasonably maintainable and readable code, we should merge it and try it out. But since f2f isn't an option until 2052 at the earliest... I understand the desire for an "incremental approach that gets us from A->B". In the abstract it sounds great. 
However, with a change like this one, I think it's highly likely that such a path would be littered with regressions both large and small, and would probably be more difficult to reason about than the relatively clean design of MGLRU. On top of that, I don't think we'll get the kind of user feedback we need for something like this *without* merging it. Yu has done a tremendous job collecting data here (and the results are really incredible), but I think we can all agree that without extensive testing in the field with all sorts of weird codes, we're not going to find the problematic behaviors we're concerned about. So unless we want to eschew big mm changes entirely (we shouldn't! look at net or scheduling for how important big rewrites are to progress), I think we should be open to experimenting with new stuff. We can always revert if things get too unwieldy. None of this is to say that there may not be lots more comments on the code or potential fixes/changes to incorporate before merging; I'm mainly arguing about the mindset we should have to changes like this, not all the stuff the community is already really good at (i.e. testing and reviewing code on a nuts & bolts level). Thanks, Jesse
On Mon, Jan 10, 2022 at 2:46 PM Jesse Barnes <jsbarnes@google.com> wrote: > > So unless we want to eschew big mm changes entirely (we shouldn't! > look at net or scheduling for how important big rewrites are to > progress), I think we should be open to experimenting with new stuff. So I personally think this is worth going with, partly simply due to the reported improvements that have been measured. But also to a large extent because the whole notion of doing multi-generational LRU isn't exactly some wackadoodle crazy thing. We already do active vs inactive, the whole multi-generational thing just doesn't seem to be so "far out". But yes, numbers talk, and I get the feeling that we just need to try it. Maybe not 5.17, but.. Linus
On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote: > On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote: > > TLDR > > ==== > > The current page reclaim is too expensive in terms of CPU usage and it > > often makes poor choices about what to evict. This patchset offers an > > alternative solution that is performant, versatile and > > straightforward. > > <snipped> > > > Summery > > ======= > > The facts are: > > 1. The independent lab results and the real-world applications > > indicate substantial improvements; there are no known regressions. > > 2. Thrashing prevention, working set estimation and proactive reclaim > > work out of the box; there are no equivalent solutions. > > 3. There is a lot of new code; nobody has demonstrated smaller changes > > with similar effects. > > > > Our options, accordingly, are: > > 1. Given the amount of evidence, the reported improvements will likely > > materialize for a wide range of workloads. > > 2. Gauging the interest from the past discussions [14][15][16], the > > new features will likely be put to use for both personal computers > > and data centers. > > 3. Based on Google's track record, the new code will likely be well > > maintained in the long term. It'd be more difficult if not > > impossible to achieve similar effects on top of the existing > > design. > > Hi Andrew, Linus, > > Can you please take a look at this patchset and let me know if it's > 5.17 material? > > My goal is to get it merged asap so that users can reap the benefits > and I can push the sequels. Please examine the data provided -- I > think the unprecedented coverage and the magnitude of the improvements > warrant a green light. Downstream kernel maintainers who have been carrying MGLRU for more than 3 versions, can you please provide your Acked-by tags? 
Having this patchset in the mainline will make your job easier :)

Alexandre - the XanMod Kernel maintainer
            https://xanmod.org

Brian - the Chrome OS kernel memory maintainer
        https://www.chromium.org

Jan - the Arch Linux Zen kernel maintainer
      https://archlinux.org

Steven - the Liquorix kernel maintainer
         https://liquorix.net

Suleiman - the ARCVM (Android downstream) kernel memory maintainer
           https://chromium.googlesource.com/chromiumos/third_party/kernel

Also my gratitude to those who have helped test MGLRU:

Daniel - researcher at Michigan Tech
         benchmarked memcached

Holger - who has been testing/patching/contributing to various
         subsystems since ~2008

Shuang - researcher at University of Rochester
         benchmarked fio and provided a report

Sofia - EDI https://www.edi.works
        benchmarked the top eight memory hogs and provided reports

Can you please provide your Tested-by tags? This will ensure the credit
for your contributions.

Thanks!
On 2022-01-11 09:41, Yu Zhao wrote: > On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote: >> On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote: >>> TLDR >>> ==== >>> The current page reclaim is too expensive in terms of CPU usage and it >>> often makes poor choices about what to evict. This patchset offers an >>> alternative solution that is performant, versatile and >>> straightforward. >> >> <snipped> >> >>> Summery >>> ======= >>> The facts are: >>> 1. The independent lab results and the real-world applications >>> indicate substantial improvements; there are no known regressions. >>> 2. Thrashing prevention, working set estimation and proactive reclaim >>> work out of the box; there are no equivalent solutions. >>> 3. There is a lot of new code; nobody has demonstrated smaller changes >>> with similar effects. >>> >>> Our options, accordingly, are: >>> 1. Given the amount of evidence, the reported improvements will likely >>> materialize for a wide range of workloads. >>> 2. Gauging the interest from the past discussions [14][15][16], the >>> new features will likely be put to use for both personal computers >>> and data centers. >>> 3. Based on Google's track record, the new code will likely be well >>> maintained in the long term. It'd be more difficult if not >>> impossible to achieve similar effects on top of the existing >>> design. >> >> Hi Andrew, Linus, >> >> Can you please take a look at this patchset and let me know if it's >> 5.17 material? >> >> My goal is to get it merged asap so that users can reap the benefits >> and I can push the sequels. Please examine the data provided -- I >> think the unprecedented coverage and the magnitude of the improvements >> warrant a green light. > > Downstream kernel maintainers who have been carrying MGLRU for more than > 3 versions, can you please provide your Acked-by tags? 
> > Having this patchset in the mainline will make your job easier :) > > Alexandre - the XanMod Kernel maintainer > https://xanmod.org > > Brian - the Chrome OS kernel memory maintainer > https://www.chromium.org > > Jan - the Arch Linux Zen kernel maintainer > https://archlinux.org > > Steven - the Liquorix kernel maintainer > https://liquorix.net > > Suleiman - the ARCVM (Android downstream) kernel memory maintainer > https://chromium.googlesource.com/chromiumos/third_party/kernel > > Also my gratitude to those who have helped test MGLRU: > > Daniel - researcher at Michigan Tech > benchmarked memcached > > Holger - who has been testing/patching/contributing to various > subsystems since ~2008 > > Shuang - researcher at University of Rochester > benchmarked fio and provided a report > > Sofia - EDI https://www.edi.works > benchmarked the top eight memory hogs and provided reports > > Can you please provide your Tested-by tags? This will ensure the credit > for your contributions. > > Thanks! > Have been pounding on this "in production" on several different machines (server, desktop, laptop) and 5.15.x without any issues, so: Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Looking forward to seeing this in mainline! cheers, Holger
In some of my benchmarks MGLRU really gave unrivaled performance. I assume the adoption of MGLRU into the kernel would save billions of dollars and greatly reduce carbon dioxide emissions. However, there are also cases where MGLRU loses. There are cases where MGLRU does not achieve the performance that the classic LRU gives (at least I got such results when testing MGLRU before[1], but I did not report them here). As a Linux user, I would like to see both variants of LRU in the kernel, so that it is possible to switch to the suitable variant when needed: none of the LRU variants allowed me to squeeze the maximum for all cases. I hope to test MGLRU v6 later and show you some of its weaknesses and anomalies with specific logs and benchmarks. [1] I didn't have enough time and energy to decipher the results at that time: https://github.com/hakavlad/cache-tests/tree/main/mg-LRU-v3_vs_classic-LRU (but you can try to guess what it all means)
On Mon 10-01-22 14:46:08, Jesse Barnes wrote: > > > > 2. There have been none that came with the testing/benchmarking > > > > coverage as this one did. Please point me to some if I'm mistaken, > > > > and I'll gladly match them. > > > > > > I do appreciate your numbers but you should realize that this is an area > > > that is really hard to get any conclusive testing for. > > > > Fully agreed. That's why we started a new initiative, and we hope more > > people will following these practices: > > 1. All results in this area should be reported with at least standard > > deviations, or preferably confidence intervals. > > 2. Real applications should be benchmarked (with synthetic load > > generator), not just synthetic benchmarks (not real applications). > > 3. A wide range of devices should be covered, i.e., servers, desktops, > > laptops and phones. > > > > I'm very confident to say our benchmark reports were hold to the > > highest standards. We have worked with MariaDB (company), EnterpriseDB > > (Postgres), Redis (company), etc. on these reports. They have copies > > of these reports (PDF version): > > https://linux-mm.googlesource.com/benchmarks/ > > > > We welcome any expert in those applications to examine our reports, > > and we'll be happy to run any other benchmarks or same benchmarks with > > different configurations that anybody thinks it's important and we've > > missed. > > I really think this gets at the heart of the issue with mm > development, and is one of the reasons it's been extra frustrating to > not have an MM conf for the past couple of years; I think sorting out > how we measure & proceed on changes would be easier done f2f. E.g. > concluding with a consensus that if something doesn't regress on X, Y, > and Z, and has reasonably maintainable and readable code, we should > merge it and try it out. I am fully with you on that! I hope we can have LSFMM this year finally. > But since f2f isn't an option until 2052 at the earliest... 
Let's be more optimistic than that ;)

> I understand the desire for an "incremental approach that gets us from
> A->B". In the abstract it sounds great. However, with a change like
> this one, I think it's highly likely that such a path would be
> littered with regressions both large and small, and would probably be
> more difficult to reason about than the relatively clean design of
> MGLRU.

There are certainly things that do not make much sense to split up of
course. On the other hand the patchset is making a lot of decisions and
assumptions that are neither documented in the code nor in the
changelog. From my past experience these are really problematic from a
long term maintenance POV. We are struggling with those already because
changelogs tended to be much more coarse in the past, yet the code
stays with us and we have been really "great" at not touching many of
those because "something might break". This results in complexity
growth and further maintenance burden.

> On top of that, I don't think we'll get the kind of user
> feedback we need for something like this *without* merging it. Yu has
> done a tremendous job collecting data here (and the results are really
> incredible), but I think we can all agree that without extensive
> testing in the field with all sorts of weird codes, we're not going to
> find the problematic behaviors we're concerned about.

This is understood.

> So unless we want to eschew big mm changes entirely (we shouldn't!
> look at net or scheduling for how important big rewrites are to
> progress), I think we should be open to experimenting with new stuff.
> We can always revert if things get too unwieldy.

As long as the patchset doesn't include new user visible interfaces,
which have proven to be really hard to revert.
> None of this is to say that there may not be lots more comments on the
> code or potential fixes/changes to incorporate before merging; I'm
> mainly arguing about the mindset we should have to changes like this,
> not all the stuff the community is already really good at (i.e.
> testing and reviewing code on a nuts & bolts level).

From my reading of this and previous discussions I have gathered that
there was no opposition just for the sake of it. There have been very
specific questions regarding the implementation and/or future plans to
address issues expressed in the past.

So far I have only managed to check the memcg and oom integration,
finding some issues there. All of them should be fixable reasonably
easily, but it also shows that a deep dive into this is really
necessary. I have also raised questions about future maintainability of
the resulting code. As you may have noticed, the review power in the MM
community is lagging behind, and we tend to have more code producers
than reviewers and maintainers. Not to mention other things like page
flags depletion, which is something we have been struggling with for
quite some time already.

All that being said, there is a lot of work ahead for such a large
change to be merged.
Yu Zhao wrote:
> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
>> On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
>>> TLDR
>>> ====
>>> The current page reclaim is too expensive in terms of CPU usage and it
>>> often makes poor choices about what to evict. This patchset offers an
>>> alternative solution that is performant, versatile and
>>> straightforward.
>>
>> <snipped>
>>
>> Hi Andrew, Linus,
>>
>> Can you please take a look at this patchset and let me know if it's
>> 5.17 material?
>>
>> My goal is to get it merged asap so that users can reap the benefits
>> and I can push the sequels. Please examine the data provided -- I
>> think the unprecedented coverage and the magnitude of the improvements
>> warrant a green light.
>
> Downstream kernel maintainers who have been carrying MGLRU for more than
> 3 versions, can you please provide your Acked-by tags?
>
> Having this patchset in the mainline will make your job easier :)
>
> Alexandre - the XanMod kernel maintainer
>             https://xanmod.org
>
> Brian - the Chrome OS kernel memory maintainer
>         https://www.chromium.org
>
> Jan - the Arch Linux Zen kernel maintainer
>       https://archlinux.org
>
> Steven - the Liquorix kernel maintainer
>          https://liquorix.net
>
> Suleiman - the ARCVM (Android downstream) kernel memory maintainer
>            https://chromium.googlesource.com/chromiumos/third_party/kernel
>
> Also my gratitude to those who have helped test MGLRU:
>
> Daniel - researcher at Michigan Tech
>          benchmarked memcached
>
> Holger - who has been testing/patching/contributing to various
>          subsystems since ~2008
>
> Shuang - researcher at University of Rochester
>          benchmarked fio and provided a report
>
> Sofia - EDI https://www.edi.works
>         benchmarked the top eight memory hogs and provided reports
>
> Can you please provide your Tested-by tags? This will ensure the credit
> for your contributions.
>
> Thanks!

I have tested MGLRU using fio [1]. The performance improvement is
fabulous. I hope this patchset can eventually be merged, to enable
large-scale testing and let more users talk about their experience.

Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>

[1] https://lore.kernel.org/lkml/20220105024423.26409-1-szhai2@cs.rochester.edu/
On Tue, Jan 11, 2022 at 5:41 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
> > On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
> > > TLDR
> > > ====
> > > The current page reclaim is too expensive in terms of CPU usage and it
> > > often makes poor choices about what to evict. This patchset offers an
> > > alternative solution that is performant, versatile and
> > > straightforward.
> >
> > <snipped>
> >
> > Hi Andrew, Linus,
> >
> > Can you please take a look at this patchset and let me know if it's
> > 5.17 material?
> >
> > My goal is to get it merged asap so that users can reap the benefits
> > and I can push the sequels. Please examine the data provided -- I
> > think the unprecedented coverage and the magnitude of the improvements
> > warrant a green light.
>
> Downstream kernel maintainers who have been carrying MGLRU for more than
> 3 versions, can you please provide your Acked-by tags?
>
> Having this patchset in the mainline will make your job easier :)
>
> <snipped>
>
> Suleiman - the ARCVM (Android downstream) kernel memory maintainer
>            https://chromium.googlesource.com/chromiumos/third_party/kernel

Android on ChromeOS has been using MGLRU for a while now, with great
results. It would be great for more people to more easily be able to
benefit from it.

Acked-by: Suleiman Souhlal <suleiman@google.com>

-- Suleiman
On Tue, Jan 11, 2022 at 12:41 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
> > On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
> > > TLDR
> > > ====
> > > The current page reclaim is too expensive in terms of CPU usage and it
> > > often makes poor choices about what to evict. This patchset offers an
> > > alternative solution that is performant, versatile and
> > > straightforward.
> >
> > <snipped>
> >
> > Hi Andrew, Linus,
> >
> > Can you please take a look at this patchset and let me know if it's
> > 5.17 material?
> >
> > My goal is to get it merged asap so that users can reap the benefits
> > and I can push the sequels. Please examine the data provided -- I
> > think the unprecedented coverage and the magnitude of the improvements
> > warrant a green light.
>
> Downstream kernel maintainers who have been carrying MGLRU for more than
> 3 versions, can you please provide your Acked-by tags?
>
> Having this patchset in the mainline will make your job easier :)
>
> <snipped>
>
> Also my gratitude to those who have helped test MGLRU:
>
> <snipped>
>
> Sofia - EDI https://www.edi.works
>         benchmarked the top eight memory hogs and provided reports

Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Hello.

On Tuesday, January 4, 2022 at 21:22:19 CET, Yu Zhao wrote:
> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and it
> often makes poor choices about what to evict. This patchset offers an
> alternative solution that is performant, versatile and
> straightforward.
>
> Design objectives
> =================
> The design objectives are:
> 1. Better representation of access recency
> 2. Try to profit from spatial locality
> 3. Clear fast path making obvious choices
> 4. Simple self-correcting heuristics
>
> The representation of access recency is at the core of all LRU
> approximations. The multigenerational LRU (MGLRU) divides pages into
> multiple lists (generations), each having bounded access recency (a
> time interval). Generations establish a common frame of reference and
> help make better choices, e.g., between different memcgs on a computer
> or different computers in a data center (for cluster job scheduling).
>
> Exploiting spatial locality improves the efficiency of gathering the
> accessed bit. An rmap walk targets a single page and doesn't try to
> profit from discovering an accessed PTE. A page table walk can sweep
> all hotspots in an address space, but its search space can be too
> large to make a profit. The key is to optimize both methods and use
> them in combination. (The PMU is another option for further
> exploration.)
>
> The fast path reduces code complexity and runtime overhead. Unmapped
> pages don't require TLB flushes; clean pages don't require writeback.
> These facts are only helpful when other conditions, e.g., access
> recency, are similar. With generations as a common frame of reference,
> additional factors stand out. But obvious choices might not be good
> choices; thus self-correction is required (the next objective).
>
> The benefits of simple self-correcting heuristics are self-evident.
> Again with generations as a common frame of reference, this becomes
> attainable.
> Specifically, pages in the same generation are categorized
> based on additional factors, and a closed-loop control statistically
> compares the refault percentages across all categories and throttles
> the eviction of those that have higher percentages.
>
> Patchset overview
> =================
> 1. mm: x86, arm64: add arch_has_hw_pte_young()
> 2. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
> Materializing hardware optimizations when trying to clear the accessed
> bit in many PTEs. If hardware automatically sets the accessed bit in
> PTEs, there is no need to worry about bursty page faults (emulating
> the accessed bit). If it also sets the accessed bit in non-leaf PMD
> entries, there is no need to search the PTE table pointed to by a PMD
> entry that doesn't have the accessed bit set.
>
> 3. mm/vmscan.c: refactor shrink_node()
> A minor refactor.
>
> 4. mm: multigenerational lru: groundwork
> Adds the basic data structure and the functions to initialize it and
> insert/remove pages.
>
> 5. mm: multigenerational lru: mm_struct list
> An infrastructure that keeps track of mm_structs for page table
> walkers and provides them with optimizations, i.e., switch_mm()
> tracking and Bloom filters.
>
> 6. mm: multigenerational lru: aging
> 7. mm: multigenerational lru: eviction
> "The page reclaim" is a producer/consumer model. "The aging" produces
> cold pages, whereas "the eviction" consumes them. Cold pages flow
> through generations. The aging uses the mm_struct list infrastructure
> to sweep dense hotspots in page tables. During a page table walk, the
> aging clears the accessed bit and tags accessed pages with the
> youngest generation number. The eviction sorts those pages when it
> encounters them. For pages in the oldest generation, the eviction
> walks the rmap to check the accessed bit one more time before evicting
> them. During an rmap walk, the eviction feeds dense hotspots back to
> the aging. Dense hotspots flow through the Bloom filters.
> For pages not mapped in page
> tables, the eviction uses the PID controller to statistically
> determine whether they have higher refaults. If so, the eviction
> throttles their eviction by moving them to the next generation (the
> second oldest).
>
> 8. mm: multigenerational lru: user interface
> The knobs to turn on/off MGLRU and provide the userspace with
> thrashing prevention, working set estimation (the aging) and proactive
> reclaim (the eviction).
>
> 9. mm: multigenerational lru: Kconfig
> The Kconfig options.
>
> Benchmark results
> =================
> Independent lab results
> -----------------------
> Based on the popularity of searches [01] and the memory usage in
> Google's public cloud, the most popular open-source memory-hungry
> applications, in alphabetical order, are:
>       Apache Cassandra      Memcached
>       Apache Hadoop         MongoDB
>       Apache Spark          PostgreSQL
>       MariaDB (MySQL)       Redis
>
> An independent lab evaluated MGLRU with the most widely used benchmark
> suites for the above applications. They posted 960 data points along
> with kernel metrics and perf profiles collected over more than 500
> hours of total benchmark time. Their final reports show that, with 95%
> confidence intervals (CIs), the above applications all performed
> significantly better for at least part of their benchmark matrices.
>
> On 5.14:
> 1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
>    less wall time to sort three billion random integers, respectively,
>    under the medium- and the high-concurrency conditions, when
>    overcommitting memory. There were no statistically significant
>    changes in wall time for the rest of the benchmark matrix.
> 2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
>    more transactions per minute (TPM), respectively, under the medium-
>    and the high-concurrency conditions, when overcommitting memory.
>    There were no statistically significant changes in TPM for the rest
>    of the benchmark matrix.
> 3.
>    Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
>    and [21.59, 30.02]% more operations per second (OPS), respectively,
>    for sequential access, random access and Gaussian (distribution)
>    access, when THP=always; 95% CIs [13.85, 15.97]% and
>    [23.94, 29.92]% more OPS, respectively, for random access and
>    Gaussian access, when THP=never. There were no statistically
>    significant changes in OPS for the rest of the benchmark matrix.
> 4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
>    [2.16, 3.55]% more operations per second (OPS), respectively, for
>    exponential (distribution) access, random access and Zipfian
>    (distribution) access, when underutilizing memory; 95% CIs
>    [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
>    respectively, for exponential access, random access and Zipfian
>    access, when overcommitting memory.
>
> On 5.15:
> 5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
>    and [4.11, 7.50]% more operations per second (OPS), respectively,
>    for exponential (distribution) access, random access and Zipfian
>    (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
>    [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
>    exponential access, random access and Zipfian access, when swap was
>    on.
> 6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
>    less average wall time to finish twelve parallel TeraSort jobs,
>    respectively, under the medium- and the high-concurrency
>    conditions, when swap was on. There were no statistically
>    significant changes in average wall time for the rest of the
>    benchmark matrix.
> 7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
>    minute (TPM) under the high-concurrency condition, when swap was
>    off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
>    respectively, under the medium- and the high-concurrency
>    conditions, when swap was on.
>    There were no statistically
>    significant changes in TPM for the rest of the benchmark matrix.
> 8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
>    [11.47, 19.36]% more total operations per second (OPS),
>    respectively, for sequential access, random access and Gaussian
>    (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
>    [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
>    for sequential access, random access and Gaussian access, when
>    THP=never.
>
> Our lab results
> ---------------
> To supplement the above results, we ran the following benchmark suites
> on 5.16-rc7 and found no regressions [10]. (These synthetic benchmarks
> are popular among MM developers, but we prefer large-scale A/B
> experiments to validate improvements.)
>       fs_fio_bench_hdd_mq      pft
>       fs_lmbench               pgsql-hammerdb
>       fs_parallelio            redis
>       fs_postmark              stream
>       hackbench                sysbenchthread
>       kernbench                tpcc_spark
>       memcached                unixbench
>       multichase               vm-scalability
>       mutilate                 will-it-scale
>       nginx
>
> [01] https://trends.google.com
> [02] https://lore.kernel.org/linux-mm/20211102002002.92051-1-bot@edi.works/
> [03] https://lore.kernel.org/linux-mm/20211009054315.47073-1-bot@edi.works/
> [04] https://lore.kernel.org/linux-mm/20211021194103.65648-1-bot@edi.works/
> [05] https://lore.kernel.org/linux-mm/20211109021346.50266-1-bot@edi.works/
> [06] https://lore.kernel.org/linux-mm/20211202062806.80365-1-bot@edi.works/
> [07] https://lore.kernel.org/linux-mm/20211209072416.33606-1-bot@edi.works/
> [08] https://lore.kernel.org/linux-mm/20211218071041.24077-1-bot@edi.works/
> [09] https://lore.kernel.org/linux-mm/20211122053248.57311-1-bot@edi.works/
> [10] https://lore.kernel.org/linux-mm/20220104202247.2903702-1-yuzhao@google.com/
>
> Real-world applications
> =======================
> Third-party testimonials
> ------------------------
> Konstantin wrote [11]:
>    I have Archlinux with 8G RAM + zswap + swap.
>    While developing, I
>    have lots of apps opened such as multiple LSP-servers for different
>    langs, chats, two browsers, etc... Usually, my system gets quickly
>    to a point of SWAP-storms, where I have to kill LSP-servers,
>    restart browsers to free memory, etc, otherwise the system lags
>    heavily and is barely usable.
>
>    1.5 days ago I migrated from the 5.11.15 kernel to 5.12 + the LRU
>    patchset, and I started up by opening lots of apps to create memory
>    pressure, and worked for a day like this. Till now I had *not a
>    single SWAP-storm*, and mind you I got 3.4G in SWAP. I was never
>    getting to the point of 3G in SWAP before without a single
>    SWAP-storm.
>
> The Arch Linux Zen kernel [12] has been using MGLRU since 5.12. Many
> of its users reported their positive experiences to me, e.g., Shivodit
> wrote:
>    I've tried the latest Zen kernel (5.14.13-zen1-1-zen in the
>    archlinux testing repos), everything's been smooth so far. I also
>    decided to copy a large volume of files to check performance under
>    I/O load, and everything went smoothly - no stuttering was present,
>    everything was responsive.
>
> Large-scale deployments
> -----------------------
> We've rolled out MGLRU to tens of millions of Chrome OS users and
> about a million Android users. Google's fleetwide profiling [13] shows
> an overall 40% decrease in kswapd CPU usage, in addition to
> improvements in other UX metrics, e.g., an 85% decrease in the number
> of low-memory kills at the 75th percentile and an 18% decrease in
> rendering latency at the 50th percentile.
>
> [11] https://lore.kernel.org/linux-mm/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
> [12] https://github.com/zen-kernel/zen-kernel/
> [13] https://research.google/pubs/pub44271/
>
> Summary
> =======
> The facts are:
> 1. The independent lab results and the real-world applications
>    indicate substantial improvements; there are no known regressions.
> 2.
>    Thrashing prevention, working set estimation and proactive reclaim
>    work out of the box; there are no equivalent solutions.
> 3. There is a lot of new code; nobody has demonstrated smaller changes
>    with similar effects.
>
> Our options, accordingly, are:
> 1. Given the amount of evidence, the reported improvements will likely
>    materialize for a wide range of workloads.
> 2. Gauging the interest from the past discussions [14][15][16], the
>    new features will likely be put to use for both personal computers
>    and data centers.
> 3. Based on Google's track record, the new code will likely be well
>    maintained in the long term. It'd be more difficult if not
>    impossible to achieve similar effects on top of the existing
>    design.
>
> [14] https://lore.kernel.org/lkml/20201005081313.732745-1-andrea.righi@canonical.com/
> [15] https://lore.kernel.org/lkml/20210716081449.22187-1-sj38.park@gmail.com/
> [16] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/
>
> Yu Zhao (9):
>   mm: x86, arm64: add arch_has_hw_pte_young()
>   mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
>   mm/vmscan.c: refactor shrink_node()
>   mm: multigenerational lru: groundwork
>   mm: multigenerational lru: mm_struct list
>   mm: multigenerational lru: aging
>   mm: multigenerational lru: eviction
>   mm: multigenerational lru: user interface
>   mm: multigenerational lru: Kconfig
>
>  Documentation/vm/index.rst          |    1 +
>  Documentation/vm/multigen_lru.rst   |   80 +
>  arch/Kconfig                        |    9 +
>  arch/arm64/include/asm/cpufeature.h |    5 +
>  arch/arm64/include/asm/pgtable.h    |   13 +-
>  arch/arm64/kernel/cpufeature.c      |   19 +
>  arch/arm64/tools/cpucaps            |    1 +
>  arch/x86/Kconfig                    |    1 +
>  arch/x86/include/asm/pgtable.h      |    9 +-
>  arch/x86/mm/pgtable.c               |    5 +-
>  fs/exec.c                           |    2 +
>  fs/fuse/dev.c                       |    3 +-
>  include/linux/cgroup.h              |   15 +-
>  include/linux/memcontrol.h          |   11 +
>  include/linux/mm.h                  |   42 +
>  include/linux/mm_inline.h           |  204 ++
>  include/linux/mm_types.h            |   78 +
>  include/linux/mmzone.h              |  175 ++
>  include/linux/nodemask.h            |
>    1 +
>  include/linux/oom.h                 |   16 +
>  include/linux/page-flags-layout.h   |   19 +-
>  include/linux/page-flags.h          |    4 +-
>  include/linux/pgtable.h             |   17 +-
>  include/linux/sched.h               |    4 +
>  include/linux/swap.h                |    4 +
>  kernel/bounds.c                     |    3 +
>  kernel/cgroup/cgroup-internal.h     |    1 -
>  kernel/exit.c                       |    1 +
>  kernel/fork.c                       |    9 +
>  kernel/sched/core.c                 |    1 +
>  mm/Kconfig                          |   48 +
>  mm/huge_memory.c                    |    3 +-
>  mm/memcontrol.c                     |   26 +
>  mm/memory.c                         |   21 +-
>  mm/mm_init.c                        |    6 +-
>  mm/oom_kill.c                       |    4 +-
>  mm/page_alloc.c                     |    1 +
>  mm/rmap.c                           |    7 +
>  mm/swap.c                           |   51 +-
>  mm/vmscan.c                         | 2691 ++++++++++++++++++++++++++-
>  mm/workingset.c                     |  119 +-
>  41 files changed, 3591 insertions(+), 139 deletions(-)
>  create mode 100644 Documentation/vm/multigen_lru.rst

For the series:

Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>

I have been running this (and one of the previous spins) on nine
machines (physical, virtual, workstations, servers) for quite some time
with no hassle.

Thanks for your job, and please keep me in Cc once you post new spins.
I'm more than happy to deploy those across the fleet.
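[Editor's note: the generation/tier mechanics in the quoted cover letter -- the aging tagging accessed pages with the youngest generation, the eviction reclaiming from the oldest, and the closed-loop refault feedback throttling categories ("tiers") that refault more -- can be illustrated with a small userspace sketch. All names, sizes and data structures below are illustrative only, not the kernel's; the real logic lives in mm/vmscan.c and is far more involved.]

```c
/*
 * Userspace sketch of the MGLRU concepts described above. Pages carry a
 * generation number (0 = oldest); the aging promotes accessed pages to
 * the youngest generation; the eviction reclaims from the oldest, but
 * defers pages whose tier shows a statistically higher refault ratio,
 * moving them to the second-oldest generation instead.
 */
#include <assert.h>
#include <stdbool.h>

#define MAX_NR_GENS 4
#define NR_PAGES    3

struct page {
	int gen;	/* generation number, 0 = oldest, -1 = reclaimed */
	int tier;	/* category, e.g., by access frequency */
	bool accessed;	/* simulated accessed bit */
};

static struct page pages[NR_PAGES];
static long refaults[2], evictions[2];	/* per-tier counters */

/* Aging: clear the accessed bit and move the page to the youngest gen. */
static void age_page(struct page *p)
{
	if (p->accessed) {
		p->accessed = false;
		p->gen = MAX_NR_GENS - 1;
	}
}

/* Closed-loop feedback: does this tier refault more than the base tier? */
static bool tier_refaults_more(int tier)
{
	/* compare refaults/evictions ratios without dividing */
	return refaults[tier] * evictions[0] > refaults[0] * evictions[tier];
}

/* Eviction: reclaim from the oldest generation, throttling hot tiers. */
static int evict_oldest(void)
{
	int reclaimed = 0;

	for (int i = 0; i < NR_PAGES; i++) {
		struct page *p = &pages[i];

		if (p->gen != 0)
			continue;
		if (p->tier && tier_refaults_more(p->tier)) {
			p->gen = 1;	/* second oldest: eviction deferred */
			continue;
		}
		evictions[p->tier]++;
		p->gen = -1;		/* reclaimed */
		reclaimed++;
	}
	return reclaimed;
}
```

A run of this model with one cold page, one frequently refaulting page and one recently accessed page reclaims only the cold page; the accessed page moves to the youngest generation and the refaulting tier is deferred, which is the shape of the behavior the cover letter describes.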
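[Editor's note: the other feedback path quoted above -- "dense hotspots flow through the Bloom filters" -- can likewise be sketched as a plain Bloom filter keyed by the address a PMD entry covers, so a later aging pass can cheaply ask "was this PTE table a hotspot?". This is a hypothetical userspace model: the hash function and sizes are made up, and the kernel's implementation differs in detail.]

```c
/*
 * Sketch of a Bloom filter recording "dense hotspots": PTE tables that
 * an rmap walk found to contain many accessed entries. Membership tests
 * may return false positives but never false negatives, which is fine
 * here -- a false positive only costs one wasted page table walk.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS   4096	/* must be a power of two */
#define BLOOM_HASHES 2

static uint64_t bloom[BLOOM_BITS / 64];

/* splitmix64 finalizer as a stand-in hash function */
static uint64_t hash64(uint64_t x, uint64_t seed)
{
	x += seed + 0x9e3779b97f4a7c15ULL;
	x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
	x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
	return x ^ (x >> 31);
}

/* Remember that the PTE table covering this address was a hotspot. */
static void bloom_set(uint64_t pmd_addr)
{
	for (uint64_t seed = 0; seed < BLOOM_HASHES; seed++) {
		uint64_t bit = hash64(pmd_addr, seed) & (BLOOM_BITS - 1);
		bloom[bit / 64] |= 1ULL << (bit % 64);
	}
}

/* Should the next aging pass bother walking this PTE table? */
static bool bloom_test(uint64_t pmd_addr)
{
	for (uint64_t seed = 0; seed < BLOOM_HASHES; seed++) {
		uint64_t bit = hash64(pmd_addr, seed) & (BLOOM_BITS - 1);
		if (!(bloom[bit / 64] & (1ULL << (bit % 64))))
			return false;
	}
	return true;
}
```

The design choice the cover letter leans on is that the filter is allowed to be wrong in only one direction: sparse regions are (almost always) skipped, while every recorded hotspot is guaranteed to be revisited.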
On Wed, Jan 12, 2022 at 09:56:58PM +0100, Oleksandr Natalenko wrote:
> Hello.
>
> On Tuesday, January 4, 2022 at 21:22:19 CET, Yu Zhao wrote:
> > TLDR
> > ====
> > The current page reclaim is too expensive in terms of CPU usage and it
> > often makes poor choices about what to evict. This patchset offers an
> > alternative solution that is performant, versatile and
> > straightforward.

<snipped>

> For the series:
>
> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
>
> I have been running this (and one of the previous spins) on nine
> machines (physical, virtual, workstations, servers) for quite some
> time with no hassle.
>
> Thanks for your job, and please keep me in Cc once you post new spins.
> I'm more than happy to deploy those across the fleet.

Thanks, Oleksandr. And if I may take the liberty of introducing you as:

Oleksandr - the post-factum kernel maintainer
            https://gitlab.com/post-factum/pf-kernel

in addition to the other downstream kernel maintainers I've introduced:

> Alexandre - the XanMod kernel maintainer
>             https://xanmod.org
>
> Brian - the Chrome OS kernel memory maintainer
>         https://www.chromium.org
>
> Jan - the Arch Linux Zen kernel maintainer
>       https://archlinux.org
>
> Steven - the Liquorix kernel maintainer
>          https://liquorix.net
>
> Suleiman - the ARCVM (Android downstream) kernel memory maintainer
>            https://chromium.googlesource.com/chromiumos/third_party/kernel
On Tue, Jan 11, 2022 at 01:41:22AM -0700, Yu Zhao wrote:
> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
> > On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
> > > TLDR
> > > ====
> > > The current page reclaim is too expensive in terms of CPU usage and it
> > > often makes poor choices about what to evict. This patchset offers an
> > > alternative solution that is performant, versatile and
> > > straightforward.
> >
> > <snipped>
> >
> > Hi Andrew, Linus,
> >
> > Can you please take a look at this patchset and let me know if it's
> > 5.17 material?
> >
> > My goal is to get it merged asap so that users can reap the benefits
> > and I can push the sequels. Please examine the data provided -- I
> > think the unprecedented coverage and the magnitude of the improvements
> > warrant a green light.
My gratitude to Donald, who has been helping test MGLRU since v2:

Donald Carr (d@chaos-reins.com)

Founder of Chaos Reins (http://chaos-reins.com), an SF-based
consultancy company specializing in designing/creating embedded Linux
appliances.

Can you please provide your Tested-by tags? This will ensure the credit
for your contributions.

Thanks!
January 18, 2022 1:21 AM, "Yu Zhao" <yuzhao@google.com> wrote:

> On Tue, Jan 11, 2022 at 01:41:22AM -0700, Yu Zhao wrote:
>
>> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
>> On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
>>> TLDR
>>> ====
>>> The current page reclaim is too expensive in terms of CPU usage and it
>>> often makes poor choices about what to evict. This patchset offers an
>>> alternative solution that is performant, versatile and
>>> straightforward.
>>
>> <snipped>
>>
>> Hi Andrew, Linus,
>>
>> Can you please take a look at this patchset and let me know if it's
>> 5.17 material?
>>
>> My goal is to get it merged asap so that users can reap the benefits
>> and I can push the sequels. Please examine the data provided -- I
>> think the unprecedented coverage and the magnitude of the improvements
>> warrant a green light.
>
> My gratitude to Donald, who has been helping test MGLRU since v2:
>
> Donald Carr (d@chaos-reins.com)
>
> Founder of Chaos Reins (http://chaos-reins.com), an SF-based
> consultancy company specializing in designing/creating embedded
> Linux appliances.

Tested-by: Donald Carr <d@chaos-reins.com>

> Can you please provide your Tested-by tags? This will ensure the credit
> for your contributions.
>
> Thanks!
On Tue, Jan 11, 2022, at 2:41 AM, Yu Zhao wrote:
> On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote:
> > On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote:
> > > TLDR
> > > ====
> > > The current page reclaim is too expensive in terms of CPU usage and it
> > > often makes poor choices about what to evict. This patchset offers an
> > > alternative solution that is performant, versatile and
> > > straightforward.
> >
> > <snipped>
> >
> > Hi Andrew, Linus,
> >
> > Can you please take a look at this patchset and let me know if it's
> > 5.17 material?
> >
> > My goal is to get it merged asap so that users can reap the benefits
> > and I can push the sequels. Please examine the data provided -- I
> > think the unprecedented coverage and the magnitude of the improvements
> > warrant a green light.
>
> Downstream kernel maintainers who have been carrying MGLRU for more than
> 3 versions, can you please provide your Acked-by tags?
>
> Having this patchset in the mainline will make your job easier :)
>
> <snipped>
>
> Steven - the Liquorix kernel maintainer
>          https://liquorix.net

This feature has been a huge improvement for desktop Linux; system
responsiveness has hit a new level under high memory pressure. Thanks
Yu!

Acked-by: Steven Barrett <steven@liquorix.net>
On Tue, Jan 11, 2022 at 3:41 AM Yu Zhao <yuzhao@google.com> wrote: > > On Tue, Jan 04, 2022 at 01:30:00PM -0700, Yu Zhao wrote: > > On Tue, Jan 04, 2022 at 01:22:19PM -0700, Yu Zhao wrote: > > > TLDR > > > ==== > > > The current page reclaim is too expensive in terms of CPU usage and it > > > often makes poor choices about what to evict. This patchset offers an > > > alternative solution that is performant, versatile and > > > straightforward. > > > > <snipped> > > > > > Summary > > > ======= > > > The facts are: > > > 1. The independent lab results and the real-world applications > > > indicate substantial improvements; there are no known regressions. > > > 2. Thrashing prevention, working set estimation and proactive reclaim > > > work out of the box; there are no equivalent solutions. > > > 3. There is a lot of new code; nobody has demonstrated smaller changes > > > with similar effects. > > > > > > Our options, accordingly, are: > > > 1. Given the amount of evidence, the reported improvements will likely > > > materialize for a wide range of workloads. > > > 2. Gauging the interest from the past discussions [14][15][16], the > > > new features will likely be put to use for both personal computers > > > and data centers. > > > 3. Based on Google's track record, the new code will likely be well > > > maintained in the long term. It'd be more difficult if not > > > impossible to achieve similar effects on top of the existing > > > design. > > > > Hi Andrew, Linus, > > > > Can you please take a look at this patchset and let me know if it's > > 5.17 material? > > > > My goal is to get it merged asap so that users can reap the benefits > > and I can push the sequels. Please examine the data provided -- I > > think the unprecedented coverage and the magnitude of the improvements > > warrant a green light. > > Downstream kernel maintainers who have been carrying MGLRU for more than > 3 versions, can you please provide your Acked-by tags? 
> > Having this patchset in the mainline will make your job easier :) > > Alexandre - the XanMod Kernel maintainer > https://xanmod.org > > Brian - the Chrome OS kernel memory maintainer > https://www.chromium.org MGLRU has been maturing in ChromeOS for quite some time, we've maintained it in a number of different kernels between 4.14 and 5.15, and it's become the default for tens of millions of users. We've seen substantial improvements in terms of CPU utilization and memory pressure resulting in fewer OOM kills and reduced UI latency. I would love to see this make it upstream so more desktop users can benefit. Acked-by: Brian Geffon <bgeffon@google.com> > > Jan - the Arch Linux Zen kernel maintainer > https://archlinux.org > > Steven - the Liquorix kernel maintainer > https://liquorix.net > > Suleiman - the ARCVM (Android downstream) kernel memory maintainer > https://chromium.googlesource.com/chromiumos/third_party/kernel > > Also my gratitude to those who have helped test MGLRU: > > Daniel - researcher at Michigan Tech > benchmarked memcached > > Holger - who has been testing/patching/contributing to various > subsystems since ~2008 > > Shuang - researcher at University of Rochester > benchmarked fio and provided a report > > Sofia - EDI https://www.edi.works > benchmarked the top eight memory hogs and provided reports > > Can you please provide your Tested-by tags? This will ensure the credit > for your contributions. > > Thanks!
On Wed, Jan 5, 2022 at 7:17 PM Yu Zhao <yuzhao@google.com> wrote: > > TLDR > ==== > The current page reclaim is too expensive in terms of CPU usage and it > often makes poor choices about what to evict. This patchset offers an > alternative solution that is performant, versatile and > straightforward. > > Design objectives > ================= > The design objectives are: > 1. Better representation of access recency > 2. Try to profit from spatial locality > 3. Clear fast path making obvious choices > 4. Simple self-correcting heuristics > > The representation of access recency is at the core of all LRU > approximations. The multigenerational LRU (MGLRU) divides pages into > multiple lists (generations), each having bounded access recency (a > time interval). Generations establish a common frame of reference and > help make better choices, e.g., between different memcgs on a computer > or different computers in a data center (for cluster job scheduling). > > Exploiting spatial locality improves the efficiency when gathering the > accessed bit. A rmap walk targets a single page and doesn't try to > profit from discovering an accessed PTE. A page table walk can sweep > all hotspots in an address space, but its search space can be too > large to make a profit. The key is to optimize both methods and use > them in combination. (PMU is another option for further exploration.) > > Fast path reduces code complexity and runtime overhead. Unmapped pages > don't require TLB flushes; clean pages don't require writeback. These > facts are only helpful when other conditions, e.g., access recency, > are similar. With generations as a common frame of reference, > additional factors stand out. But obvious choices might not be good > choices; thus self-correction is required (the next objective). > > The benefits of simple self-correcting heuristics are self-evident. > Again with generations as a common frame of reference, this becomes > attainable. 
Specifically, pages in the same generation are categorized > based on additional factors, and a closed-loop control statistically > compares the refault percentages across all categories and throttles > the eviction of those that have higher percentages. > > Patchset overview > ================= > 1. mm: x86, arm64: add arch_has_hw_pte_young() > 2. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG > Materializing hardware optimizations when trying to clear the accessed > bit in many PTEs. If hardware automatically sets the accessed bit in > PTEs, there is no need to worry about bursty page faults (emulating > the accessed bit). If it also sets the accessed bit in non-leaf PMD > entries, there is no need to search the PTE table pointed to by a PMD > entry that doesn't have the accessed bit set. > > 3. mm/vmscan.c: refactor shrink_node() > A minor refactor. > > 4. mm: multigenerational lru: groundwork > Adding the basic data structure and the functions to initialize it and > insert/remove pages. > > 5. mm: multigenerational lru: mm_struct list > An infra keeps track of mm_struct's for page table walkers and > provides them with optimizations, i.e., switch_mm() tracking and Bloom > filters. > > 6. mm: multigenerational lru: aging > 7. mm: multigenerational lru: eviction > "The page reclaim" is a producer/consumer model. "The aging" produces > cold pages, whereas "the eviction" consumes them. Cold pages flow > through generations. The aging uses the mm_struct list infra to sweep > dense hotspots in page tables. During a page table walk, the aging > clears the accessed bit and tags accessed pages with the youngest > generation number. The eviction sorts those pages when it encounters > them. For pages in the oldest generation, eviction walks the rmap to > check the accessed bit one more time before evicting them. During an > rmap walk, the eviction feeds dense hotspots back to the aging. Dense > hotspots flow through the Bloom filters. 
For pages not mapped in page > tables, the eviction uses the PID controller to statistically > determine whether they have higher refaults. If so, the eviction > throttles their eviction by moving them to the next generation (the > second oldest). > > 8. mm: multigenerational lru: user interface > The knobs to turn on/off MGLRU and provide the userspace with > thrashing prevention, working set estimation (the aging) and proactive > reclaim (the eviction). > > 9. mm: multigenerational lru: Kconfig > The Kconfig options. > > Benchmark results > ================= > Independent lab results > ----------------------- > Based on the popularity of searches [01] and the memory usage in > Google's public cloud, the most popular open-source memory-hungry > applications, in alphabetical order, are: > Apache Cassandra Memcached > Apache Hadoop MongoDB > Apache Spark PostgreSQL > MariaDB (MySQL) Redis > > An independent lab evaluated MGLRU with the most widely used benchmark > suites for the above applications. They posted 960 data points along > with kernel metrics and perf profiles collected over more than 500 > hours of total benchmark time. Their final reports show that, with 95% > confidence intervals (CIs), the above applications all performed > significantly better for at least part of their benchmark matrices. > > On 5.14: > 1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]% > less wall time to sort three billion random integers, respectively, > under the medium- and the high-concurrency conditions, when > overcommitting memory. There were no statistically significant > changes in wall time for the rest of the benchmark matrix. > 2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]% > more transactions per minute (TPM), respectively, under the medium- > and the high-concurrency conditions, when overcommitting memory. > There were no statistically significant changes in TPM for the rest > of the benchmark matrix. > 3. 
Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]% > and [21.59, 30.02]% more operations per second (OPS), respectively, > for sequential access, random access and Gaussian (distribution) > access, when THP=always; 95% CIs [13.85, 15.97]% and > [23.94, 29.92]% more OPS, respectively, for random access and > Gaussian access, when THP=never. There were no statistically > significant changes in OPS for the rest of the benchmark matrix. > 4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and > [2.16, 3.55]% more operations per second (OPS), respectively, for > exponential (distribution) access, random access and Zipfian > (distribution) access, when underutilizing memory; 95% CIs > [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS, > respectively, for exponential access, random access and Zipfian > access, when overcommitting memory. > > On 5.15: > 5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]% > and [4.11, 7.50]% more operations per second (OPS), respectively, > for exponential (distribution) access, random access and Zipfian > (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%, > [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for > exponential access, random access and Zipfian access, when swap was > on. > 6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]% > less average wall time to finish twelve parallel TeraSort jobs, > respectively, under the medium- and the high-concurrency > conditions, when swap was on. There were no statistically > significant changes in average wall time for the rest of the > benchmark matrix. > 7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per > minute (TPM) under the high-concurrency condition, when swap was > off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM, > respectively, under the medium- and the high-concurrency > conditions, when swap was on. 
There were no statistically > significant changes in TPM for the rest of the benchmark matrix. > 8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and > [11.47, 19.36]% more total operations per second (OPS), > respectively, for sequential access, random access and Gaussian > (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%, > [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively, > for sequential access, random access and Gaussian access, when > THP=never. > > Our lab results > --------------- > To supplement the above results, we ran the following benchmark suites > on 5.16-rc7 and found no regressions [10]. (These synthetic benchmarks > are popular among MM developers, but we prefer large-scale A/B > experiments to validate improvements.) > fs_fio_bench_hdd_mq pft > fs_lmbench pgsql-hammerdb > fs_parallelio redis > fs_postmark stream > hackbench sysbenchthread > kernbench tpcc_spark > memcached unixbench > multichase vm-scalability > mutilate will-it-scale > nginx > > [01] https://trends.google.com > [02] https://lore.kernel.org/linux-mm/20211102002002.92051-1-bot@edi.works/ > [03] https://lore.kernel.org/linux-mm/20211009054315.47073-1-bot@edi.works/ > [04] https://lore.kernel.org/linux-mm/20211021194103.65648-1-bot@edi.works/ > [05] https://lore.kernel.org/linux-mm/20211109021346.50266-1-bot@edi.works/ > [06] https://lore.kernel.org/linux-mm/20211202062806.80365-1-bot@edi.works/ > [07] https://lore.kernel.org/linux-mm/20211209072416.33606-1-bot@edi.works/ > [08] https://lore.kernel.org/linux-mm/20211218071041.24077-1-bot@edi.works/ > [09] https://lore.kernel.org/linux-mm/20211122053248.57311-1-bot@edi.works/ > [10] https://lore.kernel.org/linux-mm/20220104202247.2903702-1-yuzhao@google.com/ > > Real-world applications > ======================= > Third-party testimonials > ------------------------ > Konstantin wrote [11]: > I have Archlinux with 8G RAM + zswap + swap. 
While developing, I > have lots of apps opened such as multiple LSP-servers for different > langs, chats, two browsers, etc... Usually, my system gets quickly > to a point of SWAP-storms, where I have to kill LSP-servers, > restart browsers to free memory, etc, otherwise the system lags > heavily and is barely usable. > > 1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU > patchset, and I started up by opening lots of apps to create memory > pressure, and worked for a day like this. Till now I had *not a > single SWAP-storm*, and mind you I got 3.4G in SWAP. I was never > getting to the point of 3G in SWAP before without a single > SWAP-storm. > > The Arch Linux Zen kernel [12] has been using MGLRU since 5.12. Many > of its users reported their positive experiences to me, e.g., Shivodit > wrote: > I've tried the latest Zen kernel (5.14.13-zen1-1-zen in the > archlinux testing repos), everything's been smooth so far. I also > decided to copy a large volume of files to check performance under > I/O load, and everything went smoothly - no stuttering was present, > everything was responsive. > > Large-scale deployments > ----------------------- > We've rolled out MGLRU to tens of millions of Chrome OS users and > about a million Android users. Google's fleetwide profiling [13] shows > an overall 40% decrease in kswapd CPU usage, in addition to Hi Yu, Was the overall 40% decrease of kswapd CPU usage seen on x86 or arm64? And I am curious how much we are taking advantage of NONLEAF_PMD_YOUNG. Does it help a lot in decreasing the cpu usage? If so, this might be a good proof that arm64 also needs this hardware feature? In short, I am curious how much the improvement in this patchset depends on the hardware ability of NONLEAF_PMD_YOUNG. > improvements in other UX metrics, e.g., an 85% decrease in the number > of low-memory kills at the 75th percentile and an 18% decrease in > rendering latency at the 50th percentile. 
> > [11] https://lore.kernel.org/linux-mm/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/ > [12] https://github.com/zen-kernel/zen-kernel/ > [13] https://research.google/pubs/pub44271/ > Thanks Barry
On Sun, Jan 23, 2022 at 06:43:06PM +1300, Barry Song wrote: > On Wed, Jan 5, 2022 at 7:17 PM Yu Zhao <yuzhao@google.com> wrote: <snipped> > > Large-scale deployments > > ----------------------- > > We've rolled out MGLRU to tens of millions of Chrome OS users and > > about a million Android users. Google's fleetwide profiling [13] shows > > an overall 40% decrease in kswapd CPU usage, in addition to > > Hi Yu, > > Was the overall 40% decrease of kswapd CPU usage seen on x86 or arm64? > And I am curious how much we are taking advantage of NONLEAF_PMD_YOUNG. > Does it help a lot in decreasing the cpu usage? Hi Barry, The fleet-wide profiling data I shared was from x86. For arm64, I only have data from synthetic benchmarks at the moment, and it also shows similar improvements. For Chrome OS (individual users), walk_pte_range(), the function that would benefit from ARCH_HAS_NONLEAF_PMD_YOUNG, only uses a small portion (<4%) of kswapd CPU time. So ARCH_HAS_NONLEAF_PMD_YOUNG isn't that helpful. > If so, this might be > a good proof that arm64 also needs this hardware feature? > In short, I am curious how much the improvement in this patchset depends > on the hardware ability of NONLEAF_PMD_YOUNG. For data centers, I do think ARCH_HAS_NONLEAF_PMD_YOUNG has some value. In addition to cold/hot memory scanning, there are other use cases like dirty tracking, which can benefit from the accessed bit on non-leaf entries. I know some proprietary software uses this capability on x86 for different purposes than this patchset does. And AFAIK, x86 is the only arch that supports this capability, e.g., risc-v and ppc can only set the accessed bit in PTEs. In fact, I've discussed this with one of the arm maintainers, Will. So please check with him too if you are interested in moving forward with the idea. I might be able to provide additional data if you need it to make a decision. Thanks.
On Tue, Jan 25, 2022 at 7:48 PM Yu Zhao <yuzhao@google.com> wrote: > > On Sun, Jan 23, 2022 at 06:43:06PM +1300, Barry Song wrote: > > On Wed, Jan 5, 2022 at 7:17 PM Yu Zhao <yuzhao@google.com> wrote: > > <snipped> > > > > Large-scale deployments > > > ----------------------- > > > We've rolled out MGLRU to tens of millions of Chrome OS users and > > > about a million Android users. Google's fleetwide profiling [13] shows > > > an overall 40% decrease in kswapd CPU usage, in addition to > > > > Hi Yu, > > > > Was the overall 40% decrease of kswapd CPU usage seen on x86 or arm64? > > And I am curious how much we are taking advantage of NONLEAF_PMD_YOUNG. > > Does it help a lot in decreasing the cpu usage? > > Hi Barry, > > The fleet-wide profiling data I shared was from x86. For arm64, I only > have data from synthetic benchmarks at the moment, and it also shows > similar improvements. > > For Chrome OS (individual users), walk_pte_range(), the function that > would benefit from ARCH_HAS_NONLEAF_PMD_YOUNG, only uses a small > portion (<4%) of kswapd CPU time. So ARCH_HAS_NONLEAF_PMD_YOUNG isn't > that helpful. Hi Yu, Thanks! In the current kernel, depending on reverse mapping, while memory is under pressure, the cpu usage of kswapd can be very very high especially while a lot of pages have large mapcount, thus a huge reverse mapping cost. Regarding <4%, I guess the figure came from machines with NONLEAF_PMD_YOUNG? In this case, we can skip many PTE scans while PMD has no accessed bit set. But for a machine without NONLEAF, will the figure of cpu usage be much larger? > > > If so, this might be > > a good proof that arm64 also needs this hardware feature? > > In short, I am curious how much the improvement in this patchset depends > > on the hardware ability of NONLEAF_PMD_YOUNG. > > For data centers, I do think ARCH_HAS_NONLEAF_PMD_YOUNG has some value. 
> In addition to cold/hot memory scanning, there are other use cases like > dirty tracking, which can benefit from the accessed bit on non-leaf > entries. I know some proprietary software uses this capability on x86 > for different purposes than this patchset does. And AFAIK, x86 is the > only arch that supports this capability, e.g., risc-v and ppc can only > set the accessed bit in PTEs. Yep. NONLEAF is a nice feature. btw, page table should have a separate DIRTY bit, right? wouldn't dirty page tracking depend on the DIRTY bit rather than the accessed bit? so x86 also has NONLEAF dirty bit? Or they are scanning accessed bit of PMD before scanning DIRTY bits of PTEs? > > In fact, I've discussed this with one of the arm maintainers, Will. So > please check with him too if you are interested in moving forward with > the idea. I might be able to provide additional data if you need > it to make a decision. I am interested in running it and have some data without NONLEAF especially while free memory is very limited and the system has memory thrashing. > > Thanks. Thanks Barry
On Fri, Jan 28, 2022 at 09:54:09PM +1300, Barry Song wrote: > On Tue, Jan 25, 2022 at 7:48 PM Yu Zhao <yuzhao@google.com> wrote: > > > > On Sun, Jan 23, 2022 at 06:43:06PM +1300, Barry Song wrote: > > > On Wed, Jan 5, 2022 at 7:17 PM Yu Zhao <yuzhao@google.com> wrote: > > > > <snipped> > > > > > > Large-scale deployments > > > > ----------------------- > > > > We've rolled out MGLRU to tens of millions of Chrome OS users and > > > > about a million Android users. Google's fleetwide profiling [13] shows > > > > an overall 40% decrease in kswapd CPU usage, in addition to > > > > > > Hi Yu, > > > > > > Was the overall 40% decrease of kswapd CPU usage seen on x86 or arm64? > > > And I am curious how much we are taking advantage of NONLEAF_PMD_YOUNG. > > > Does it help a lot in decreasing the cpu usage? > > > > Hi Barry, > > > > The fleet-wide profiling data I shared was from x86. For arm64, I only > > have data from synthetic benchmarks at the moment, and it also shows > > similar improvements. > > > > For Chrome OS (individual users), walk_pte_range(), the function that > > would benefit from ARCH_HAS_NONLEAF_PMD_YOUNG, only uses a small > > portion (<4%) of kswapd CPU time. So ARCH_HAS_NONLEAF_PMD_YOUNG isn't > > that helpful. > > Hi Yu, > Thanks! > > In the current kernel, depending on reverse mapping, while memory is > under pressure, > the cpu usage of kswapd can be very very high especially while a lot of pages > have large mapcount, thus a huge reverse mapping cost. Agreed. I've posted v7 which includes kswapd profiles collected from an arm64 v8.2 laptop under memory pressure. > Regarding <4%, I guess the figure came from machines with NONLEAF_PMD_YOUNG? No, it's from Snapdragon 7c. Please see the kswapd profiles in v7. > In this case, we can skip many PTE scans while PMD has no accessed bit > set. But for > a machine without NONLEAF, will the figure of cpu usage be much larger? So NONLEAF_PMD_YOUNG at most can save 4% CPU usage from kswapd. 
But this definitely can vary, depending on the workloads. > > > If so, this might be > > > a good proof that arm64 also needs this hardware feature? > > > In short, I am curious how much the improvement in this patchset depends > > > on the hardware ability of NONLEAF_PMD_YOUNG. > > > > For data centers, I do think ARCH_HAS_NONLEAF_PMD_YOUNG has some value. > > In addition to cold/hot memory scanning, there are other use cases like > > dirty tracking, which can benefit from the accessed bit on non-leaf > > entries. I know some proprietary software uses this capability on x86 > > for different purposes than this patchset does. And AFAIK, x86 is the > > only arch that supports this capability, e.g., risc-v and ppc can only > > set the accessed bit in PTEs. > > Yep. NONLEAF is a nice feature. > > btw, page table should have a separate DIRTY bit, right? Yes. > wouldn't dirty page > tracking depend on the DIRTY bit rather than the accessed bit? It depends on the goal. > so x86 also has > NONLEAF dirty bit? No. > Or they are scanning accessed bit of PMD before > scanning DIRTY bits of PTEs? A mandatory sync to disk must use the dirty bit to ensure data integrity. But for a voluntary sync to disk, it can use the accessed bit to narrow the search of dirty pages. A mandatory sync is used to free specific dirty pages. A voluntary sync is used to keep the number of dirty pages low in general and it doesn't target any specific dirty pages. > > In fact, I've discussed this with one of the arm maintainers, Will. So > > please check with him too if you are interested in moving forward with > > the idea. I might be able to provide additional data if you need > > it to make a decision. > > I am interested in running it and have some data without NONLEAF > especially while free memory is very limited and the system has memory > thrashing. The v7 has a switch to disable this feature on x86. 
If you can run your workloads on x86, then it might be able to help you measure the difference.