Message ID: 20230414130303.2345383-1-ryan.roberts@arm.com (mailing list archive)
Series: variable-order, large folios for anonymous memory
On 4/14/2023 9:02 PM, Ryan Roberts wrote:
> Hi All,
>
> This is a second RFC and my first proper attempt at implementing variable
> order, large folios for anonymous memory. The first RFC [1] was a partial
> implementation and a plea for help in debugging an issue I was hitting; thanks
> to Yin Fengwei and Matthew Wilcox for their advice in solving that!
>
> The objective of variable order anonymous folios is to improve performance by
> allocating larger chunks of memory during anonymous page faults:
>
> - Since SW (the kernel) is dealing with larger chunks of memory than base
>   pages, there are efficiency savings to be had; fewer page faults, batched
>   PTE and RMAP manipulation, fewer items on lists, etc. In short, we reduce
>   kernel overhead. This should benefit all architectures.
> - Since we are now mapping physically contiguous chunks of memory, we can take
>   advantage of HW TLB compression techniques. A reduction in TLB pressure
>   speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>   TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].
>
> This patch set deals with the SW side of things only, but sets us up nicely
> for taking advantage of the HW improvements in the near future.
>
> I'm not yet benchmarking a wide variety of use cases, but those that I have
> looked at are positive; I see kernel compilation time improved by up to 10%,
> which I expect to improve further once I add in the arm64 "contiguous bit".
> Memory consumption is somewhere between 1% less and 2% more, depending on how
> it's measured. More on perf and memory below.
>
> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one
> minor conflict resolution). I have a tree at [4].
>
> [1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
> [3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
> [4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
>
> Approach
> ========
>
> There are 4 fault paths that have been modified:
> - write fault on unallocated address: do_anonymous_page()
> - write fault on zero page: wp_page_copy()
> - write fault on non-exclusive CoW page: wp_page_copy()
> - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
>
> In the first 2 cases, we will determine the preferred order folio to allocate,
> limited by a max order (currently order-4; see below), VMA and PMD bounds, and
> the state of neighboring PTEs. In the 3rd case, we aim to allocate the same
> order folio as the source, subject to constraints that may arise if the source
> has been mremapped or partially munmapped. And in the 4th case, we reuse as
> much of the folio as we can, subject to the same mremap/munmap constraints.
>
> If allocation of our preferred folio order fails, we gracefully fall back to
> lower orders all the way to 0.
>
> Note that none of this affects the behavior of traditional PMD-sized THP. If
> we take a fault in an MADV_HUGEPAGE region, you still get PMD-sized mappings.
>
> Open Questions
> ==============
>
> How to Move Forwards
> --------------------
>
> While the series is a small-ish code change, it represents a big shift in the
> way things are done. So I'd appreciate any help in scaling up performance
> testing, review and general advice on how best to guide a change like this
> into the kernel.
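As an illustration of the graceful order fallback described under "Approach"
above, the allocation loop might look roughly like the sketch below. This is a
sketch only, not code lifted from the series (the series plumbs the order
through vma_alloc_zeroed_movable_folio() and try_vma_alloc_movable_folio());
the helper name and the exact gfp flags here are assumptions.

static struct folio *alloc_anon_folio_fallback(struct vm_area_struct *vma,
					       unsigned long addr, int order)
{
	struct folio *folio;

	/* Opportunistic attempts: don't try hard for the higher orders. */
	for (; order > 0; order--) {
		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_NORETRY |
					__GFP_NOWARN, order, vma, addr, false);
		if (folio)
			return folio;
	}

	/* Final order-0 attempt behaves like the existing base-page path. */
	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr, false);
}

In the series the starting order is additionally capped by the max order, VMA
and PMD bounds, and the state of neighbouring PTEs, as described above.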
>
> Folio Allocation Order Policy
> -----------------------------
>
> The current code is hardcoded to use a maximum order of 4. This was chosen for
> a couple of reasons:
> - From the SW performance perspective, I see a knee around here where
>   increasing it doesn't lead to much more performance gain.
> - Intuitively I assume that higher orders become increasingly difficult to
>   allocate.
> - From the HW performance perspective, arm64's HPA works on order-2 blocks and
>   "the contiguous bit" works on order-4 for 4KB base pages (although it's
>   order-7 for 16KB and order-5 for 64KB), so there is no HW benefit to going
>   any higher.
>
> I suggest that ultimately setting the max order should be left to the
> architecture. arm64 would take advantage of this and set it to the order
> required for the contiguous bit for the configured base page size.
>
> However, I also have a (mild) concern about increased memory consumption. If
> an app has a pathological fault pattern (e.g. sparsely touches memory every
> 64KB) we would end up allocating 16x as much memory as we used to. One
> potential approach I see here is to track fault addresses per-VMA, and
> increase a per-VMA max allocation order for consecutive faults that extend a
> contiguous range, and decrement when discontiguous. Alternatively/
> additionally, we could use the VMA size as an indicator. I'd be interested in
> your thoughts/opinions.
>
> Deferred Split Queue Lock Contention
> ------------------------------------
>
> The results below show that we are spending a much greater proportion of time
> in the kernel when doing a kernel compile using 160 CPUs vs 8 CPUs.
>
> I think this is (at least partially) related to contention on the deferred
> split queue lock. This is a per-memcg spinlock, which means a single spinlock
> shared among all 160 CPUs. I've solved part of the problem with the last patch
> in the series (which cuts down the need to take the lock), but at folio free
> time (free_transhuge_page()), the lock is still taken and I think this could
> be a problem. Now that most anonymous pages are large folios, this lock is
> taken a lot more.
>
> I think we could probably avoid taking the lock unless !list_empty(), but I
> haven't convinced myself it's definitely safe, so haven't applied it yet.

Yes. It's safe. We also identified other lock contention with large folios for
anonymous mappings, like the lru lock and the zone lock. My understanding is
that anonymous pages have a much higher alloc/free frequency than page cache
pages, so the lock contention was not exposed by large folios for the page
cache.

I posted the related patch to:
https://lore.kernel.org/linux-mm/20230417075643.3287513-1-fengwei.yin@intel.com/T/#t

Regards
Yin, Fengwei
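For reference, the kind of change being discussed here (only taking the split
queue lock at free time when the folio is actually queued) might look roughly
like the sketch below. It is illustrative only, not the posted patch, and it
assumes the current shape of free_transhuge_page(); the unlocked list_empty()
check is exactly the part whose safety was being questioned above.

void free_transhuge_page(struct page *page)
{
	struct deferred_split *ds_queue = get_deferred_split_queue(page);
	unsigned long flags;

	/*
	 * Only take the per-memcg split_queue_lock if the folio was ever
	 * added to the deferred split queue; recheck under the lock.
	 */
	if (!list_empty(page_deferred_list(page))) {
		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
		if (!list_empty(page_deferred_list(page))) {
			ds_queue->split_queue_len--;
			list_del(page_deferred_list(page));
		}
		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
	}

	free_compound_page(page);
}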
>
> Roadmap
> =======
>
> Beyond scaling up perf testing, I'm planning to enable use of the "contiguous
> bit" on arm64 to validate predictions about HW speedups.
>
> I also think there are some opportunities with madvise to split folios to
> non-0 orders, which might improve performance in some cases. madvise is also
> mistaking exclusive large folios for non-exclusive ones at the moment (due to
> the "small pages" mapcount scheme), so that needs to be fixed so that
> MADV_FREE correctly frees the folio.
>
> Results
> =======
>
> Performance
> -----------
>
> Test: Kernel Compilation, on Ampere Altra (160 CPU machine), with 8 jobs and
> with 160 jobs. First run discarded, next 3 runs averaged. Git repo cleaned
> before each run.
>
> make defconfig && time make -jN Image
>
> First with -j8:
>
> |           | baseline time  | anonfolio time | percent change |
> |           | to compile (s) | to compile (s) | SMALLER=better |
> |-----------|---------------:|---------------:|---------------:|
> | real-time |          373.0 |          342.8 |          -8.1% |
> | user-time |         2333.9 |         2275.3 |          -2.5% |
> | sys-time  |          510.7 |          340.9 |         -33.3% |
>
> Above shows 8.1% improvement in real time execution, and 33.3% saving in
> kernel execution. The next 2 tables show a breakdown of the cycles spent in
> the kernel for the 8 job config:
>
> |                      | baseline | anonfolio | percent change |
> |                      | (cycles) | (cycles)  | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | data abort           |     683B |      316B |         -53.8% |
> | instruction abort    |      93B |       76B |         -18.4% |
> | syscall              |     887B |      767B |         -13.6% |
>
> |                      | baseline | anonfolio | percent change |
> |                      | (cycles) | (cycles)  | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | arm64_sys_openat     |     194B |      188B |          -3.3% |
> | arm64_sys_exit_group |     192B |      124B |         -35.7% |
> | arm64_sys_read       |     124B |      108B |         -12.7% |
> | arm64_sys_execve     |      75B |       67B |         -11.0% |
> | arm64_sys_mmap       |      51B |       50B |          -3.0% |
> | arm64_sys_mprotect   |      15B |       13B |         -12.0% |
> | arm64_sys_write      |      43B |       42B |          -2.9% |
> | arm64_sys_munmap     |      15B |       12B |         -17.0% |
> | arm64_sys_newfstatat |      46B |       41B |          -9.7% |
> | arm64_sys_clone      |      26B |       24B |         -10.0% |
>
> And now with -j160:
>
> |           | baseline time  | anonfolio time | percent change |
> |           | to compile (s) | to compile (s) | SMALLER=better |
> |-----------|---------------:|---------------:|---------------:|
> | real-time |           53.7 |           48.2 |         -10.2% |
> | user-time |         2705.8 |         2842.1 |           5.0% |
> | sys-time  |         1370.4 |         1064.3 |         -22.3% |
>
> Above shows a 10.2% improvement in real time execution. But ~3x more time is
> spent in the kernel than for the -j8 config. I think this is related to the
> lock contention issue I highlighted above, but haven't bottomed it out yet.
> It's also not yet clear to me why user-time increases by 5%.
>
> I've also run all the will-it-scale microbenchmarks for a single task, using
> the process mode. Results for multiple runs on the same kernel are noisy - I
> see ~5% fluctuation. So I'm just calling out tests with results that have
> gt 5% improvement or lt -5% regression. Results are average of 3 runs. Only 2
> tests are regressed:
>
> | benchmark            | baseline | anonfolio | percent change |
> |                      | ops/s    | ops/s     | BIGGER=better  |
> |----------------------|---------:|----------:|---------------:|
> | context_switch1.csv  |   328744 |    351150 |           6.8% |
> | malloc1.csv          |    96214 |     50890 |         -47.1% |
> | mmap1.csv            |   410253 |    375746 |          -8.4% |
> | page_fault1.csv      |   624061 |   3185678 |         410.5% |
> | page_fault2.csv      |   416483 |    557448 |          33.8% |
> | page_fault3.csv      |   724566 |   1152726 |          59.1% |
> | read1.csv            |  1806908 |   1905752 |           5.5% |
> | read2.csv            |   587722 |   1942062 |         230.4% |
> | tlb_flush1.csv       |   143910 |    152097 |           5.7% |
> | tlb_flush2.csv       |   266763 |    322320 |          20.8% |
>
> I believe malloc1 is an unrealistic test, since it does malloc/free for a 128M
> object in a loop and never touches the allocated memory. I think the malloc
> implementation is maintaining a header just before the allocated object, which
> causes a single page fault. Previously that page fault allocated 1 page. Now
> it is allocating 16 pages. This cost would be repaid if the test code wrote to
> the allocated object. Alternatively the folio allocation order policy
> described above would also solve this.
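For illustration, the per-VMA allocation order heuristic referred to above
might look something like the sketch below. The anon_fault_next_addr and
anon_fault_order fields are hypothetical (they do not exist upstream) and
locking/races are ignored; this is only meant to make the idea concrete.

static int vma_anon_fault_order(struct vm_area_struct *vma,
				unsigned long addr, int max_order)
{
	if (addr == vma->anon_fault_next_addr) {
		/* Faults are marching through the VMA: grow the order. */
		if (vma->anon_fault_order < max_order)
			vma->anon_fault_order++;
	} else {
		/* Sparse/discontiguous fault pattern: back off. */
		if (vma->anon_fault_order > 0)
			vma->anon_fault_order--;
	}

	/* The next contiguous fault would land just after this allocation. */
	vma->anon_fault_next_addr =
		ALIGN_DOWN(addr, PAGE_SIZE << vma->anon_fault_order) +
		(PAGE_SIZE << vma->anon_fault_order);

	return vma->anon_fault_order;
}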
>
> It is not clear to me why mmap1 has slowed down. This remains a todo.
>
> Memory
> ------
>
> I measured memory consumption while doing a kernel compile with 8 jobs on a
> system limited to 4GB RAM. I polled /proc/meminfo every 0.5 seconds during the
> workload, then calculated "memory used" high and low watermarks using both
> MemFree and MemAvailable. If there is a better way of measuring system memory
> consumption, please let me know!
>
> mem-used = 4GB - /proc/meminfo:MemFree
>
> |                      | baseline | anonfolio | percent change |
> |                      | (MB)     | (MB)      | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | mem-used-low         |      825 |       842 |           2.1% |
> | mem-used-high        |     2697 |      2672 |          -0.9% |
>
> mem-used = 4GB - /proc/meminfo:MemAvailable
>
> |                      | baseline | anonfolio | percent change |
> |                      | (MB)     | (MB)      | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | mem-used-low         |      518 |       530 |           2.3% |
> | mem-used-high        |     1522 |      1537 |           1.0% |
>
> For the high watermark, the methods disagree; we are either saving 1% or using
> 1% more. For the low watermark, both methods agree that we are using about 2%
> more. I plan to investigate whether the proposed folio allocation order policy
> can reduce this to zero.
>
> Thanks for making it this far!
> Ryan
>
>
> Ryan Roberts (17):
>   mm: Expose clear_huge_page() unconditionally
>   mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
>   mm: Introduce try_vma_alloc_movable_folio()
>   mm: Implement folio_add_new_anon_rmap_range()
>   mm: Routines to determine max anon folio allocation order
>   mm: Allocate large folios for anonymous memory
>   mm: Allow deferred splitting of arbitrary large anon folios
>   mm: Implement folio_move_anon_rmap_range()
>   mm: Update wp_page_reuse() to operate on range of pages
>   mm: Reuse large folios for anonymous memory
>   mm: Split __wp_page_copy_user() into 2 variants
>   mm: ptep_clear_flush_range_notify() macro for batch operation
>   mm: Implement folio_remove_rmap_range()
>   mm: Copy large folios for anonymous memory
>   mm: Convert zero page to large folios on write
>   mm: mmap: Align unhinted maps to highest anon folio order
>   mm: Batch-zap large anonymous folio PTE mappings
>
>  arch/alpha/include/asm/page.h   |   5 +-
>  arch/arm64/include/asm/page.h   |   3 +-
>  arch/arm64/mm/fault.c           |   7 +-
>  arch/ia64/include/asm/page.h    |   5 +-
>  arch/m68k/include/asm/page_no.h |   7 +-
>  arch/s390/include/asm/page.h    |   5 +-
>  arch/x86/include/asm/page.h     |   5 +-
>  include/linux/highmem.h         |  23 +-
>  include/linux/mm.h              |   8 +-
>  include/linux/mmu_notifier.h    |  31 ++
>  include/linux/rmap.h            |   6 +
>  mm/memory.c                     | 877 ++++++++++++++++++++++++++++----
>  mm/mmap.c                       |   4 +-
>  mm/rmap.c                       | 147 +++++-
>  14 files changed, 1000 insertions(+), 133 deletions(-)
>
> --
> 2.25.1
>
On 4/14/2023 9:02 PM, Ryan Roberts wrote:
> [...]
> For the high watermark, the methods disagree; we are either saving 1% or using
> 1% more. For the low watermark, both methods agree that we are using about 2%
> more. I plan to investigate whether the proposed folio allocation order policy
> can reduce this to zero.

Besides the memory consumption, large folios could also impact the tail latency
of page allocation (extra memory zeroing, and more work on the slow path of the
page allocator). Again, my understanding is that the tail latency of page
allocation matters more for anonymous pages than for page cache pages, because
anonymous pages are allocated/freed more frequently.

Regards
Yin, Fengwei

> [...]
On 17/04/2023 09:04, Yin, Fengwei wrote:
>
> On 4/14/2023 9:02 PM, Ryan Roberts wrote:
>> [...]
>>
>> I think we could probably avoid taking the lock unless !list_empty(), but I
>> haven't convinced myself it's definitely safe, so haven't applied it yet.
> Yes. It's safe. We also identified other lock contention with large folios for
> anonymous mappings, like the lru lock and the zone lock. My understanding is
> that anonymous pages have a much higher alloc/free frequency than page cache
> pages, so the lock contention was not exposed by large folios for the page
> cache.
>
> I posted the related patch to:
> https://lore.kernel.org/linux-mm/20230417075643.3287513-1-fengwei.yin@intel.com/T/#t

Thanks for posting! I added these patches on top of mine and reran my perf test
for -j160. Results as follows:

|           | baseline time  | anonfolio time | percent change |
|           | to compile (s) | to compile (s) | SMALLER=better |
|-----------|---------------:|---------------:|---------------:|
| real-time |           53.7 |           49.1 |          -8.4% |
| user-time |         2705.8 |         2939.1 |           8.6% |
| sys-time  |         1370.4 |          993.4 |         -27.5% |

So comparing to the original table below, it's definitely reduced the time
spent in the kernel, but time in user space has increased, which is unexpected!
I think this is something I will have to dig into more deeply, but I don't have
access to the HW for the next couple of weeks, so will come back to it then.

>
> Regards
> Yin, Fengwei
>
>> [...]
>>
>> And now with -j160:
>>
>> |           | baseline time  | anonfolio time | percent change |
>> |           | to compile (s) | to compile (s) | SMALLER=better |
>> |-----------|---------------:|---------------:|---------------:|
>> | real-time |           53.7 |           48.2 |         -10.2% |
>> | user-time |         2705.8 |         2842.1 |           5.0% |
>> | sys-time  |         1370.4 |         1064.3 |         -22.3% |
>>
>> Above shows a 10.2% improvement in real time execution. But ~3x more time is
>> spent in the kernel than for the -j8 config. I think this is related to the
>> lock contention issue I highlighted above, but haven't bottomed it out yet.
>> It's also not yet clear to me why user-time increases by 5%.
>>
>> [...]
On 17/04/2023 09:19, Yin, Fengwei wrote:
>
> On 4/14/2023 9:02 PM, Ryan Roberts wrote:
>> [...]
>
> Besides the memory consumption, large folios could also impact the tail
> latency of page allocation (extra memory zeroing, and more work on the slow
> path of the page allocator).

I agree; this series could be thought of as trading latency for throughput.
There are a couple of mitigations to try to limit the extra latency; on the
Altra at least, we are taking advantage of the CPU's streaming mode for the
memory zeroing - it's >2x faster to zero a block of 64K than it is to zero
16x 4K. And currently I'm using __GFP_NORETRY for high order folio allocations,
which I understood to mean we shouldn't suffer high allocation latency due to
reclaim, etc. Although, I know we are discussing whether this is correct in the
thread for patch 3.

> Again, my understanding is that the tail latency of page allocation matters
> more for anonymous pages than for page cache pages, because anonymous pages
> are allocated/freed more frequently.

I had assumed that any serious application that needs to guarantee no latency
due to page faults would preallocate and lock all memory during init? Perhaps
that's wishful thinking?

>
> Regards
> Yin, Fengwei
>
>> [...]
On 14.04.23 15:02, Ryan Roberts wrote:
> [...]
>
> There are 4 fault paths that have been modified:
> - write fault on unallocated address: do_anonymous_page()
> - write fault on zero page: wp_page_copy()
> - write fault on non-exclusive CoW page: wp_page_copy()
> - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
>
> In the first 2 cases, we will determine the preferred order folio to allocate,
> limited by a max order (currently order-4; see below), VMA and PMD bounds, and
> the state of neighboring PTEs. In the 3rd case, we aim to allocate the same
> order folio as the source, subject to constraints that may arise if the source
> has been mremapped or partially munmapped. And in the 4th case, we reuse as
> much of the folio as we can, subject to the same mremap/munmap constraints.

Just a note (that you maybe already know) that we have to be a bit careful in
the wp_copy path when replacing sub-pages that are marked exclusive.

Currently, we only ever replace a single shared anon (sub)page by a fresh
exclusive base page during a write fault/unsharing. As the sub-page is already
marked "maybe shared", it cannot get pinned concurrently and everybody is
happy. If you now decide to replace more sub-pages, you have to be careful that
none of them are still exclusive -- because they could get pinned concurrently
and replacing them would result in memory corruptions.
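To make that concrete, a rough sketch of the kind of check that would be needed
before replacing a whole range (illustrative only, not from the series): under
the PT lock, every sub-page in the range must already be "maybe shared".

static bool anon_range_safe_to_replace(struct page *first_subpage, int nr)
{
	struct page *page = first_subpage;
	int i;

	/* Must be called with the page table lock held. */
	for (i = 0; i < nr; i++, page++) {
		/*
		 * An exclusive sub-page may be pinned concurrently (e.g. via
		 * GUP-fast), so it must not be replaced by a copy.
		 */
		if (PageAnonExclusive(page))
			return false;
	}
	return true;
}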
There are scenarios (most prominently: MADV_WIPEONFORK), but also a failed partial fork(), that could result in something like that. Further, we have to be a bit careful regarding replacing ranges that are backed by different anon pages (for example, due to fork() deciding to copy some sub-pages of a PTE-mapped folio instead of sharing all sub-pages). So what should be safe is replacing all sub-pages of a folio that are marked "maybe shared" by a new folio under PT lock. However, I wonder if it's really worth the complexity. For THP we were happy so far to *not* optimize this, implying that maybe we shouldn't worry about optimizing the fork() case for now that heavily. One optimization one could think of instead (that I raised previously in another context) is the detection of exclusivity after fork()+exit in the child (IOW, only the parent continues to exist). Once PG_anon_exclusive was cleared for all sub-pages of the THP-mapped folio during fork(), we'd always decide to copy instead of reuse (because page_count() > 1, as the folio is PTE mapped). Scanning the surrounding page table, if it makes sense (e.g., page_count() <= folio_nr_pages()), to test whether all page references are from the current process would allow for reusing the folio (setting PG_anon_exclusive) for the sub-pages. The smaller the folio order, the cheaper this "scan surrounding PTEs" check is. For THP, which are usually PMD-mapped even after fork()+exit, we didn't add this optimization.
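A minimal sketch of the scan described above (illustrative only, not code from this series; the helper name and the assumption that the folio is still mapped contiguously starting at "start" are hypothetical): reuse is only safe when the folio's reference count is fully explained by this process's own PTE mappings, which is exactly what the cheap page_count() <= folio_nr_pages() filter makes worth checking.

#include <linux/mm.h>

/*
 * Illustrative sketch only: can this PTE-mapped anon folio be reused
 * (sub-pages re-marked PG_anon_exclusive) because every reference comes
 * from this process's own page table?  Assumes the folio is still mapped
 * contiguously at "start" and that the caller holds the PT lock with
 * "ptep" pointing at the first of those PTEs.  A real implementation
 * must also account for transient references the fault path itself holds.
 */
static bool anon_folio_refs_all_local(struct folio *folio,
				      struct vm_area_struct *vma,
				      unsigned long start, pte_t *ptep)
{
	long nr = folio_nr_pages(folio);
	long i;

	/* Cheap pre-check: more refs than sub-pages means someone else holds one. */
	if (folio_ref_count(folio) > nr)
		return false;

	for (i = 0; i < nr; i++, ptep++) {
		pte_t pte = ptep_get(ptep);
		struct page *page;

		if (!pte_present(pte))
			return false;
		page = vm_normal_page(vma, start + i * PAGE_SIZE, pte);
		if (!page || page_folio(page) != folio)
			return false;
	}

	/* All nr references are our own PTE mappings; no stray pin can exist. */
	return folio_ref_count(folio) == nr;
}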
On 17/04/2023 11:54, David Hildenbrand wrote: > On 14.04.23 15:02, Ryan Roberts wrote: >> Hi All, >> >> This is a second RFC and my first proper attempt at implementing variable order, >> large folios for anonymous memory. The first RFC [1], was a partial >> implementation and a plea for help in debugging an issue I was hitting; thanks >> to Yin Fengwei and Matthew Wilcox for their advice in solving that! >> >> The objective of variable order anonymous folios is to improve performance by >> allocating larger chunks of memory during anonymous page faults: >> >> - Since SW (the kernel) is dealing with larger chunks of memory than base >> pages, there are efficiency savings to be had; fewer page faults, batched PTE >> and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel >> overhead. This should benefit all architectures. >> - Since we are now mapping physically contiguous chunks of memory, we can take >> advantage of HW TLB compression techniques. A reduction in TLB pressure >> speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce >> TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2]. >> >> This patch set deals with the SW side of things only but sets us up nicely for >> taking advantage of the HW improvements in the near future. >> >> I'm not yet benchmarking a wide variety of use cases, but those that I have >> looked at are positive; I see kernel compilation time improved by up to 10%, >> which I expect to improve further once I add in the arm64 "contiguous bit". >> Memory consumption is somewhere between 1% less and 2% more, depending on how >> its measured. More on perf and memory below. >> >> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor >> conflict resolution). I have a tree at [4]. >> >> [1] >> https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/ >> [2] >> https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/ >> [3] >> https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/ >> [4] >> https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2 >> >> Approach >> ======== >> >> There are 4 fault paths that have been modified: >> - write fault on unallocated address: do_anonymous_page() >> - write fault on zero page: wp_page_copy() >> - write fault on non-exclusive CoW page: wp_page_copy() >> - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse() >> >> In the first 2 cases, we will determine the preferred order folio to allocate, >> limited by a max order (currently order-4; see below), VMA and PMD bounds, and >> state of neighboring PTEs. In the 3rd case, we aim to allocate the same order >> folio as the source, subject to constraints that may arise if the source has >> been mremapped or partially munmapped. And in the 4th case, we reuse as much of >> the folio as we can, subject to the same mremap/munmap constraints. > > Just a note (that you maybe already know) that we have to be a bit careful in > the wp_copy path with replacing sub-pages that are marked exclusive. Ahh, no I wasn't aware of this - thanks for taking the time to explain it. I think I have a bug. (I'm guessing the GUP fast path assumes that if it sees an exclusive page then that page can't go away? And if I then wp_copy it, I'm breaking the assumption? But surely user space can munmap it at any time and the consequences are similar? 
It's probably clear that I don't know much about the GUP implementation details...) My current patch always prefers reuse over copy, and expands the size of the reuse to the biggest set of pages that are all exclusive (determined either by the presence of the anon_exclusive flag or from the refcount), and covered by the same folio (and a few other bounds constraints - see calc_anon_folio_range_reuse()). If I determine I must copy (because the "anchor" page is not exclusive), then I determine the size of the copy region based on a few constraints (see calc_anon_folio_order_copy()). But I think you are saying that no pages in that region are allowed to have the anon_exclusive flag set? In which case, this could be fixed by adding that check in the function. > > Currently, we always only replace a single shared anon (sub)page by a fresh > exclusive base-page during a write-fault/unsharing. As the sub-page is already > marked "maybe shared", it cannot get pinned concurrently and everybody is happy. When you say, "maybe shared" is that determined by the absence of the "exclusive" flag? > > If you now decide to replace more subpages, you have to be careful that none of > them are still exclusive -- because they could get pinned concurrently and > replacing them would result in memory corruptions. > > There are scenarios (most prominently: MADV_WIPEONFORK), but also failed partial > fork() that could result in something like that. Are there any test cases that stress the kernel in this area that I could use to validate my fix? > > Further, we have to be a bit careful regarding replacing ranges that are backed > by different anon pages (for example, due to fork() deciding to copy some > sub-pages of a PTE-mapped folio instead of sharing all sub-pages). I don't understand this statement; do you mean "different anon _folios_"? I am scanning the page table to expand the region that I reuse/copy and as part of that scan, make sure that I only cover a single folio. So I think I conform here - the scan would give up once it gets to the hole. > > > So what should be safe is replacing all sub-pages of a folio that are marked > "maybe shared" by a new folio under PT lock. However, I wonder if it's really > worth the complexity. For THP we were happy so far to *not* optimize this, > implying that maybe we shouldn't worry about optimizing the fork() case for now > that heavily. I don't have the exact numbers to hand, but I'm pretty sure I remember enabling large copies was contributing a measurable amount to the performance improvement. (Certainly, the zero-page copy case, is definitely a big contributer). I don't have access to the HW at the moment but can rerun later with and without to double check. > > > One optimization once could think of instead (that I raised previously in other > context) is the detection of exclusivity after fork()+exit in the child (IOW, > only the parent continues to exist). Once PG_anon_exclusive was cleared for all > sub-pages of the THP-mapped folio during fork(), we'd always decide to copy > instead of reuse (because page_count() > 1, as the folio is PTE mapped). > Scanning the surrounding page table if it makes sense (e.g., page_count() <= > folio_nr_pages()), to test if all page references are from the current process > would allow for reusing the folio (setting PG_anon_exclusive) for the sub-pages. > The smaller the folio order, the cheaper this "scan surrounding PTEs" scan is. 
> For THP, which are usually PMD-mapped even after fork()+exit, we didn't add this > optimization. Yes, I have already implemented this in my series; see patch 10. Thanks, Ryan
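To make the constraint under discussion concrete, the per-PTE condition reduces to something like the predicate below. This is an illustrative sketch, not the series' calc_anon_folio_range_reuse()/calc_anon_folio_order_copy() helpers (whose bodies are not shown in this thread): a batched reuse wants exclusive sub-pages, while a batched copy that replaces PTEs wants "maybe shared" ones so that no concurrent GUP-fast pin is possible.

#include <linux/mm.h>

/*
 * Illustrative predicate: may this PTE's sub-page be included in a batched
 * reuse (reuse == true, requires exclusivity) or in a batched COW copy that
 * replaces the PTE (reuse == false, requires "maybe shared" so it cannot be
 * pinned concurrently by GUP-fast)?  Caller holds the PT lock.
 */
static bool anon_subpage_ok_for_batch(struct vm_area_struct *vma,
				      unsigned long addr, pte_t pte,
				      struct folio *folio, bool reuse)
{
	struct page *page;

	if (!pte_present(pte))
		return false;
	page = vm_normal_page(vma, addr, pte);
	if (!page || page_folio(page) != folio)
		return false;

	return reuse ? PageAnonExclusive(page) : !PageAnonExclusive(page);
}

As noted in the discussion, the reuse side can also infer exclusivity from the refcount rather than the flag alone; only the flag-based check is shown here.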
[...] >> Just a note (that you maybe already know) that we have to be a bit careful in >> the wp_copy path with replacing sub-pages that are marked exclusive. > > Ahh, no I wasn't aware of this - thanks for taking the time to explain it. I > think I have a bug. > > (I'm guessing the GUP fast path assumes that if it sees an exclusive page then > that page can't go away? And if I then wp_copy it, I'm breaking the assumption? > But surely user space can munmap it at any time and the consequences are > similar? It's probably clear that I don't know much about the GUP implementation > details...) If GUP finds a read-only PTE pointing at an exclusive subpage, it assumes that this page cannot randomly be replaced by core MM due to COW. See gup_must_unshare(). So it can go ahead and pin the page. As long as user space doesn't do something stupid with the mapping (MADV_DONTNEED, munmap()) the pinned page must correspond to the mapped page. If GUP finds a writeable PTE, it assumes that this page cannot randomly be replaced by core MM due to COW -- because writable implies exclusive. See, for example the VM_BUG_ON_PAGE() in follow_page_pte(). So, similarly, GUP can simply go ahead and pin the page. GUP-fast runs lockless, not even taking the PT locks. It syncs against concurrent fork() using a special seqlock, and essentially unpins whatever it temporarily pinned when it detects that fork() was running concurrently. But it might result in some pages temporarily being flagged as "maybe pinned". In other cases (!fork()), GUP-fast synchronizes against concurrent sharing (KSM) or unmapping (migration, swapout) that implies clearing of the PG_anon_flag of the subpage by first unmapping the PTE and conditionally remapping it. See mm/ksm.c:write_protect_page() as an example for the sharing side (especially: if page_try_share_anon_rmap() fails because the page may be pinned). Long story short: replacing a r-o "maybe shared" (!exclusive) PTE is easy. Replacing an exclusive PTE (including writable PTEs) requires some work to sync with GUP-fast and goes rather in the "maybe just don't bother" terriroty. > > My current patch always prefers reuse over copy, and expands the size of the > reuse to the biggest set of pages that are all exclusive (determined either by > the presence of the anon_exclusive flag or from the refcount), and covered by > the same folio (and a few other bounds constraints - see > calc_anon_folio_range_reuse()). > > If I determine I must copy (because the "anchor" page is not exclusive), then I > determine the size of the copy region based on a few constraints (see > calc_anon_folio_order_copy()). But I think you are saying that no pages in that > region are allowed to have the anon_exclusive flag set? In which case, this > could be fixed by adding that check in the function. Yes, changing a PTE that points at an anonymous subpage that has the "exclusive" flag set requires more care. > >> >> Currently, we always only replace a single shared anon (sub)page by a fresh >> exclusive base-page during a write-fault/unsharing. As the sub-page is already >> marked "maybe shared", it cannot get pinned concurrently and everybody is happy. > > When you say, "maybe shared" is that determined by the absence of the > "exclusive" flag? Yes. Semantics of PG_anon_exclusive are "exclusive" vs. "maybe shared". Once "maybe shared", we must only go back to "exclusive (set the flag) if we are sure that there are no other references to the page. 
> >> >> If you now decide to replace more subpages, you have to be careful that none of >> them are still exclusive -- because they could get pinned concurrently and >> replacing them would result in memory corruptions. >> >> There are scenarios (most prominently: MADV_WIPEONFORK), but also failed partial >> fork() that could result in something like that. > > Are there any test cases that stress the kernel in this area that I could use to > validate my fix? tools/testing/selftests/mm/cow.c does excessive tests (including some MADV_DONTFORK -- that's what I actually meant -- and partial mremap tests), but mostly focuses on ordinary base pages (order-0), THP, and hugetlb. We don't have any "GUP-fast racing with fork()" tests or similar yet (tests that rely on races are not a good candidate for selftests). We might want to extend tools/testing/selftests/mm/cow.c to test for some of the cases you extend. We may also change the detection of THP (I think, by luck, it would currently also test your patches to some degree set the way it tests for THP) if (!pagemap_is_populated(pagemap_fd, mem + pagesize)) { ksft_test_result_skip("Did not get a THP populated\n"); goto munmap; } Would have to be, for example, if (!pagemap_is_populated(pagemap_fd, mem + thpsize - pagesize)) { ksft_test_result_skip("Did not get a THP populated\n"); goto munmap; } Because we touch the first PTE in a PMD and want to test if core-mm gave us a full THP (last PTE also populated). Extending the tests to cover other anon THP sizes could work by aligning a VMA to THP/2 size (so we're sure we don't get a full THP), and then testing if we get more PTEs populated -> your code active. > >> >> Further, we have to be a bit careful regarding replacing ranges that are backed >> by different anon pages (for example, due to fork() deciding to copy some >> sub-pages of a PTE-mapped folio instead of sharing all sub-pages). > > I don't understand this statement; do you mean "different anon _folios_"? I am > scanning the page table to expand the region that I reuse/copy and as part of > that scan, make sure that I only cover a single folio. So I think I conform here > - the scan would give up once it gets to the hole. During fork(), what could happen (temporary detection of pinned page resulting in a copy) is something weird like: PTE 0: subpage0 of anon page #1 (maybe shared) PTE 1: subpage1 of anon page #1 (maybe shared PTE 2: anon page #2 (exclusive) PTE 3: subpage2 of anon page #1 (maybe shared Of course, any combination of above. Further, with mremap() we might get completely crazy layouts, randomly mapping sub-pages of anon pages, mixed with other sub-pages or base-page folios. Maybe it's all handled already by your code, just pointing out which kind of mess we might get :) > >> >> >> So what should be safe is replacing all sub-pages of a folio that are marked >> "maybe shared" by a new folio under PT lock. However, I wonder if it's really >> worth the complexity. For THP we were happy so far to *not* optimize this, >> implying that maybe we shouldn't worry about optimizing the fork() case for now >> that heavily. > > I don't have the exact numbers to hand, but I'm pretty sure I remember enabling > large copies was contributing a measurable amount to the performance > improvement. (Certainly, the zero-page copy case, is definitely a big > contributer). I don't have access to the HW at the moment but can rerun later > with and without to double check. In which test exactly? Some micro-benchmark? 
> >> >> >> One optimization once could think of instead (that I raised previously in other >> context) is the detection of exclusivity after fork()+exit in the child (IOW, >> only the parent continues to exist). Once PG_anon_exclusive was cleared for all >> sub-pages of the THP-mapped folio during fork(), we'd always decide to copy >> instead of reuse (because page_count() > 1, as the folio is PTE mapped). >> Scanning the surrounding page table if it makes sense (e.g., page_count() <= >> folio_nr_pages()), to test if all page references are from the current process >> would allow for reusing the folio (setting PG_anon_exclusive) for the sub-pages. >> The smaller the folio order, the cheaper this "scan surrounding PTEs" scan is. >> For THP, which are usually PMD-mapped even after fork()+exit, we didn't add this >> optimization. > > Yes, I have already implemented this in my series; see patch 10. Oh, good! That's the most important part.
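The VMA-alignment idea above can be illustrated with a small standalone program (this is not the proposed cow.c change; the 2 MiB PMD size and the direct /proc/self/pagemap parsing are assumptions): map a half-PMD-sized VMA that cannot hold a full THP, take a single write fault, and count how many PTEs end up populated. More than one populated PTE indicates a large anon folio rather than a base page.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Is the page backing vaddr populated (present or swapped)? */
static int page_populated(int pagemap_fd, void *vaddr)
{
	uint64_t entry;
	off_t off = (uintptr_t)vaddr / getpagesize() * sizeof(entry);

	if (pread(pagemap_fd, &entry, sizeof(entry), off) != sizeof(entry))
		return -1;
	return !!(entry & ((1ULL << 63) | (1ULL << 62)));
}

int main(void)
{
	const size_t pmdsize = 2UL << 20;	/* assumed PMD/THP size */
	size_t pagesize = getpagesize();
	int pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
	char *map, *mem;
	size_t populated = 0, i;

	if (pagemap_fd < 0)
		return 1;

	/* Reserve 2 PMDs, then place a half-PMD VMA at a non-PMD-aligned spot. */
	map = mmap(NULL, 2 * pmdsize, PROT_NONE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return 1;
	mem = (char *)(((uintptr_t)map + pmdsize - 1) & ~(pmdsize - 1))
	      + pmdsize / 2;
	mem = mmap(mem, pmdsize / 2, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	if (mem == MAP_FAILED)
		return 1;

	mem[0] = 1;	/* a single write fault on the first page */

	for (i = 0; i < (pmdsize / 2) / pagesize; i++)
		if (page_populated(pagemap_fd, mem + i * pagesize) == 1)
			populated++;

	/* More than one populated PTE from one fault suggests a large folio. */
	printf("populated PTEs after one fault: %zu\n", populated);
	return 0;
}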
On 17/04/2023 15:05, David Hildenbrand wrote: > [...] > >>> Just a note (that you maybe already know) that we have to be a bit careful in >>> the wp_copy path with replacing sub-pages that are marked exclusive. >> >> Ahh, no I wasn't aware of this - thanks for taking the time to explain it. I >> think I have a bug. >> >> (I'm guessing the GUP fast path assumes that if it sees an exclusive page then >> that page can't go away? And if I then wp_copy it, I'm breaking the assumption? >> But surely user space can munmap it at any time and the consequences are >> similar? It's probably clear that I don't know much about the GUP implementation >> details...) > > If GUP finds a read-only PTE pointing at an exclusive subpage, it assumes that > this page cannot randomly be replaced by core MM due to COW. See > gup_must_unshare(). So it can go ahead and pin the page. As long as user space > doesn't do something stupid with the mapping (MADV_DONTNEED, munmap()) the > pinned page must correspond to the mapped page. > > If GUP finds a writeable PTE, it assumes that this page cannot randomly be > replaced by core MM due to COW -- because writable implies exclusive. See, for > example the VM_BUG_ON_PAGE() in follow_page_pte(). So, similarly, GUP can simply > go ahead and pin the page. > > GUP-fast runs lockless, not even taking the PT locks. It syncs against > concurrent fork() using a special seqlock, and essentially unpins whatever it > temporarily pinned when it detects that fork() was running concurrently. But it > might result in some pages temporarily being flagged as "maybe pinned". > > In other cases (!fork()), GUP-fast synchronizes against concurrent sharing (KSM) > or unmapping (migration, swapout) that implies clearing of the PG_anon_flag of > the subpage by first unmapping the PTE and conditionally remapping it. See > mm/ksm.c:write_protect_page() as an example for the sharing side (especially: if > page_try_share_anon_rmap() fails because the page may be pinned). > > Long story short: replacing a r-o "maybe shared" (!exclusive) PTE is easy. > Replacing an exclusive PTE (including writable PTEs) requires some work to sync > with GUP-fast and goes rather in the "maybe just don't bother" terriroty. Yep agreed. I'll plan to fix this by adding the constraint that all pages of the copy range (calc_anon_folio_order_copy()) must be "maybe shared". > >> >> My current patch always prefers reuse over copy, and expands the size of the >> reuse to the biggest set of pages that are all exclusive (determined either by >> the presence of the anon_exclusive flag or from the refcount), and covered by >> the same folio (and a few other bounds constraints - see >> calc_anon_folio_range_reuse()). >> >> If I determine I must copy (because the "anchor" page is not exclusive), then I >> determine the size of the copy region based on a few constraints (see >> calc_anon_folio_order_copy()). But I think you are saying that no pages in that >> region are allowed to have the anon_exclusive flag set? In which case, this >> could be fixed by adding that check in the function. > > Yes, changing a PTE that points at an anonymous subpage that has the "exclusive" > flag set requires more care. > >> >>> >>> Currently, we always only replace a single shared anon (sub)page by a fresh >>> exclusive base-page during a write-fault/unsharing. As the sub-page is already >>> marked "maybe shared", it cannot get pinned concurrently and everybody is happy. 
>> >> When you say, "maybe shared" is that determined by the absence of the >> "exclusive" flag? > > Yes. Semantics of PG_anon_exclusive are "exclusive" vs. "maybe shared". Once > "maybe shared", we must only go back to "exclusive (set the flag) if we are sure > that there are no other references to the page. > >> >>> >>> If you now decide to replace more subpages, you have to be careful that none of >>> them are still exclusive -- because they could get pinned concurrently and >>> replacing them would result in memory corruptions. >>> >>> There are scenarios (most prominently: MADV_WIPEONFORK), but also failed partial >>> fork() that could result in something like that. >> >> Are there any test cases that stress the kernel in this area that I could use to >> validate my fix? > > tools/testing/selftests/mm/cow.c does excessive tests (including some > MADV_DONTFORK -- that's what I actually meant -- and partial mremap tests), but > mostly focuses on ordinary base pages (order-0), THP, and hugetlb. > > We don't have any "GUP-fast racing with fork()" tests or similar yet (tests that > rely on races are not a good candidate for selftests). > > We might want to extend tools/testing/selftests/mm/cow.c to test for some of the > cases you extend. > > We may also change the detection of THP (I think, by luck, it would currently > also test your patches to some degree set the way it tests for THP) > > if (!pagemap_is_populated(pagemap_fd, mem + pagesize)) { > ksft_test_result_skip("Did not get a THP populated\n"); > goto munmap; > } > > Would have to be, for example, > > if (!pagemap_is_populated(pagemap_fd, mem + thpsize - pagesize)) { > ksft_test_result_skip("Did not get a THP populated\n"); > goto munmap; > } > > Because we touch the first PTE in a PMD and want to test if core-mm gave us a > full THP (last PTE also populated). > > > Extending the tests to cover other anon THP sizes could work by aligning a VMA > to THP/2 size (so we're sure we don't get a full THP), and then testing if we > get more PTEs populated -> your code active. Thanks. I'll run all these and make sure they pass and look at adding new variants for the next rev. > >> >>> >>> Further, we have to be a bit careful regarding replacing ranges that are backed >>> by different anon pages (for example, due to fork() deciding to copy some >>> sub-pages of a PTE-mapped folio instead of sharing all sub-pages). >> >> I don't understand this statement; do you mean "different anon _folios_"? I am >> scanning the page table to expand the region that I reuse/copy and as part of >> that scan, make sure that I only cover a single folio. So I think I conform here >> - the scan would give up once it gets to the hole. > > During fork(), what could happen (temporary detection of pinned page resulting > in a copy) is something weird like: > > PTE 0: subpage0 of anon page #1 (maybe shared) > PTE 1: subpage1 of anon page #1 (maybe shared > PTE 2: anon page #2 (exclusive) > PTE 3: subpage2 of anon page #1 (maybe shared Hmm... I can see how this could happen if you mremap PTE2 to PTE3, then mmap something new in PTE2. But I don't see how it happens at fork. For PTE3, did you mean subpage _3_? > > Of course, any combination of above. > > Further, with mremap() we might get completely crazy layouts, randomly mapping > sub-pages of anon pages, mixed with other sub-pages or base-page folios. 
> > Maybe it's all handled already by your code, just pointing out which kind of > mess we might get :) Yep, this is already handled; the scan to expand the range ensures that all the PTEs map to the expected contiguous pages in the same folio. > >> >>> >>> >>> So what should be safe is replacing all sub-pages of a folio that are marked >>> "maybe shared" by a new folio under PT lock. However, I wonder if it's really >>> worth the complexity. For THP we were happy so far to *not* optimize this, >>> implying that maybe we shouldn't worry about optimizing the fork() case for now >>> that heavily. >> >> I don't have the exact numbers to hand, but I'm pretty sure I remember enabling >> large copies was contributing a measurable amount to the performance >> improvement. (Certainly, the zero-page copy case, is definitely a big >> contributer). I don't have access to the HW at the moment but can rerun later >> with and without to double check. > > In which test exactly? Some micro-benchmark? The kernel compile benchmark that I quoted numbers for in the cover letter. I have some trace points (not part of the submitted series) that tell me how many mappings of each order we get for each code path. I'm pretty sure I remember all of these 4 code paths contributing non-negligible amounts. > >> >>> >>> >>> One optimization once could think of instead (that I raised previously in other >>> context) is the detection of exclusivity after fork()+exit in the child (IOW, >>> only the parent continues to exist). Once PG_anon_exclusive was cleared for all >>> sub-pages of the THP-mapped folio during fork(), we'd always decide to copy >>> instead of reuse (because page_count() > 1, as the folio is PTE mapped). >>> Scanning the surrounding page table if it makes sense (e.g., page_count() <= >>> folio_nr_pages()), to test if all page references are from the current process >>> would allow for reusing the folio (setting PG_anon_exclusive) for the sub-pages. >>> The smaller the folio order, the cheaper this "scan surrounding PTEs" scan is. >>> For THP, which are usually PMD-mapped even after fork()+exit, we didn't add this >>> optimization. >> >> Yes, I have already implemented this in my series; see patch 10. > > Oh, good! That's the most important part. >
>>> >>>> >>>> Further, we have to be a bit careful regarding replacing ranges that are backed >>>> by different anon pages (for example, due to fork() deciding to copy some >>>> sub-pages of a PTE-mapped folio instead of sharing all sub-pages). >>> >>> I don't understand this statement; do you mean "different anon _folios_"? I am >>> scanning the page table to expand the region that I reuse/copy and as part of >>> that scan, make sure that I only cover a single folio. So I think I conform here >>> - the scan would give up once it gets to the hole. >> >> During fork(), what could happen (temporary detection of pinned page resulting >> in a copy) is something weird like: >> >> PTE 0: subpage0 of anon page #1 (maybe shared) >> PTE 1: subpage1 of anon page #1 (maybe shared >> PTE 2: anon page #2 (exclusive) >> PTE 3: subpage2 of anon page #1 (maybe shared > > Hmm... I can see how this could happen if you mremap PTE2 to PTE3, then mmap > something new in PTE2. But I don't see how it happens at fork. For PTE3, did you > mean subpage _3_? > Yes, fat fingers :) Thanks for paying attention! Above could be optimized by processing all consecutive PTEs at once: meaning, we check if the page maybe pinned only once, and then either copy all PTEs or share all PTEs. It's unlikely to happen in practice, I guess, though. >> >> Of course, any combination of above. >> >> Further, with mremap() we might get completely crazy layouts, randomly mapping >> sub-pages of anon pages, mixed with other sub-pages or base-page folios. >> >> Maybe it's all handled already by your code, just pointing out which kind of >> mess we might get :) > > Yep, this is already handled; the scan to expand the range ensures that all the > PTEs map to the expected contiguous pages in the same folio. Okay, great. > >> >>> >>>> >>>> >>>> So what should be safe is replacing all sub-pages of a folio that are marked >>>> "maybe shared" by a new folio under PT lock. However, I wonder if it's really >>>> worth the complexity. For THP we were happy so far to *not* optimize this, >>>> implying that maybe we shouldn't worry about optimizing the fork() case for now >>>> that heavily. >>> >>> I don't have the exact numbers to hand, but I'm pretty sure I remember enabling >>> large copies was contributing a measurable amount to the performance >>> improvement. (Certainly, the zero-page copy case, is definitely a big >>> contributer). I don't have access to the HW at the moment but can rerun later >>> with and without to double check. >> >> In which test exactly? Some micro-benchmark? > > The kernel compile benchmark that I quoted numbers for in the cover letter. I > have some trace points (not part of the submitted series) that tell me how many > mappings of each order we get for each code path. I'm pretty sure I remember all > of these 4 code paths contributing non-negligible amounts. Interesting! It would be great to see if there is an actual difference after patch #10 was applied without the other COW replacement.
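The scan described here (give up as soon as a PTE stops mapping the expected next sub-page of the same folio) might look roughly like the sketch below; the helper name is hypothetical and the real calc_anon_folio_*() logic in the series will differ in detail. It folds the contiguity walk together with the per-sub-page exclusivity test sketched earlier for the copy path.

#include <linux/mm.h>

/*
 * Illustrative sketch: starting from (addr, ptep), which maps "page" of
 * "folio", count how many consecutive PTEs map consecutive sub-pages of
 * that same folio and are safe to replace (present, expected PFN, not
 * PG_anon_exclusive).  The messy post-fork()/mremap() layouts discussed
 * above are handled simply by stopping at the first mismatch.  Caller
 * holds the PT lock and limits max_nr to the VMA/PMD/PTE-page bounds.
 */
static int anon_cow_batch_size(struct vm_area_struct *vma, unsigned long addr,
			       pte_t *ptep, struct page *page, int max_nr)
{
	struct folio *folio = page_folio(page);
	unsigned long expected_pfn = page_to_pfn(page);
	int nr;

	for (nr = 0; nr < max_nr; nr++, ptep++, expected_pfn++,
	     addr += PAGE_SIZE) {
		pte_t pte = ptep_get(ptep);
		struct page *p;

		if (!pte_present(pte) || pte_pfn(pte) != expected_pfn)
			break;
		p = vm_normal_page(vma, addr, pte);
		if (!p || page_folio(p) != folio || PageAnonExclusive(p))
			break;
	}
	return nr;
}

Applied to the PTE 0-3 layout quoted above with the fault at PTE 0, such a walk simply yields a batch of two.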
On 17/04/2023 16:44, David Hildenbrand wrote: >>>> >>>>> >>>>> >>>>> So what should be safe is replacing all sub-pages of a folio that are marked >>>>> "maybe shared" by a new folio under PT lock. However, I wonder if it's really >>>>> worth the complexity. For THP we were happy so far to *not* optimize this, >>>>> implying that maybe we shouldn't worry about optimizing the fork() case for >>>>> now >>>>> that heavily. >>>> >>>> I don't have the exact numbers to hand, but I'm pretty sure I remember enabling >>>> large copies was contributing a measurable amount to the performance >>>> improvement. (Certainly, the zero-page copy case, is definitely a big >>>> contributer). I don't have access to the HW at the moment but can rerun later >>>> with and without to double check. >>> >>> In which test exactly? Some micro-benchmark? >> >> The kernel compile benchmark that I quoted numbers for in the cover letter. I >> have some trace points (not part of the submitted series) that tell me how many >> mappings of each order we get for each code path. I'm pretty sure I remember all >> of these 4 code paths contributing non-negligible amounts. > > Interesting! It would be great to see if there is an actual difference after > patch #10 was applied without the other COW replacement. > I'll aim to get some formal numbers when I next have access to the HW.
Hi David, On 17/04/2023 15:05, David Hildenbrand wrote: > [...] > >>> Just a note (that you maybe already know) that we have to be a bit careful in >>> the wp_copy path with replacing sub-pages that are marked exclusive. >> >> Ahh, no I wasn't aware of this - thanks for taking the time to explain it. I >> think I have a bug. >> >> (I'm guessing the GUP fast path assumes that if it sees an exclusive page then >> that page can't go away? And if I then wp_copy it, I'm breaking the assumption? >> But surely user space can munmap it at any time and the consequences are >> similar? It's probably clear that I don't know much about the GUP implementation >> details...) > > If GUP finds a read-only PTE pointing at an exclusive subpage, it assumes that > this page cannot randomly be replaced by core MM due to COW. See > gup_must_unshare(). So it can go ahead and pin the page. As long as user space > doesn't do something stupid with the mapping (MADV_DONTNEED, munmap()) the > pinned page must correspond to the mapped page. > > If GUP finds a writeable PTE, it assumes that this page cannot randomly be > replaced by core MM due to COW -- because writable implies exclusive. See, for > example the VM_BUG_ON_PAGE() in follow_page_pte(). So, similarly, GUP can simply > go ahead and pin the page. > > GUP-fast runs lockless, not even taking the PT locks. It syncs against > concurrent fork() using a special seqlock, and essentially unpins whatever it > temporarily pinned when it detects that fork() was running concurrently. But it > might result in some pages temporarily being flagged as "maybe pinned". > > In other cases (!fork()), GUP-fast synchronizes against concurrent sharing (KSM) > or unmapping (migration, swapout) that implies clearing of the PG_anon_flag of > the subpage by first unmapping the PTE and conditionally remapping it. See > mm/ksm.c:write_protect_page() as an example for the sharing side (especially: if > page_try_share_anon_rmap() fails because the page may be pinned). > > Long story short: replacing a r-o "maybe shared" (!exclusive) PTE is easy. > Replacing an exclusive PTE (including writable PTEs) requires some work to sync > with GUP-fast and goes rather in the "maybe just don't bother" terriroty. I'm looking to fix this problem in my code, but am struggling to see how the current code is safe. I'm thinking about the following scenario: - A page is CoW mapped into processes A and B. - The page takes a fault in process A, and do_wp_page() determines that it is "maybe-shared" and therefore must copy. So drops the PTL and calls wp_page_copy(). - Process B exits. - Another thread in process A faults on the page. This time dw_wp_page() determines that the page is exclusive (due to the ref count), and reuses it, marking it exclusive along the way. - wp_page_copy() from the original thread in process A retakes the PTL and copies the _now exclusive_ page. Having typed it up, I guess this can't happen, because wp_page_copy() will only do the copy if the PTE hasn't changed and it will have changed because it is now writable? So this is safe? To make things more convoluted, what happens if the second thread does an mprotect() to make the page RO after its write fault was handled? I think mprotect() will serialize on the mmap write lock so this is safe too? Sorry if this is a bit rambly, just trying to make sure I've understood everything correctly. Thanks, Ryan
> I'm looking to fix this problem in my code, but am struggling to see how the > current code is safe. I'm thinking about the following scenario: > Let's see :) > - A page is CoW mapped into processes A and B. > - The page takes a fault in process A, and do_wp_page() determines that it is > "maybe-shared" and therefore must copy. So drops the PTL and calls > wp_page_copy(). Note that before calling wp_page_copy(), we do a folio_get(folio); Further, the page table reference is only dropped once we actually replace the page in the page table. So while in wp_page_copy(), the folio should have at least 2 references if the page is still mapped. > - Process B exits. > - Another thread in process A faults on the page. This time dw_wp_page() > determines that the page is exclusive (due to the ref count), and reuses it, > marking it exclusive along the way. The refcount should not be 1 (other reference from the wp_page_copy() caller), so A won't be able to reuse it, and ... > - wp_page_copy() from the original thread in process A retakes the PTL and > copies the _now exclusive_ page. > > Having typed it up, I guess this can't happen, because wp_page_copy() will only > do the copy if the PTE hasn't changed and it will have changed because it is now > writable? So this is safe? this applies as well. If the pte changed (when reusing due to a write failt it's now writable, or someone else broke COW), we back off. For FAULT_FLAG_UNSHARE, however, the PTE may not change. But the additional reference should make it work. I think it works as intended. It would be clearer if we'd also recheck in wp_page_copy() whether we still don't have an exclusive anon page under PT lock -- and if we would, back off. > > To make things more convoluted, what happens if the second thread does an > mprotect() to make the page RO after its write fault was handled? I think > mprotect() will serialize on the mmap write lock so this is safe too? Yes, mprotect() synchronizes that. There are other mechanisms to write-protect a page, though, under mmap lock in read mode (uffd-wp). So it's a valid concern. In all of these cases, reuse should be prevented due to the additional reference on the folio when entering wp_page_copy() right from the start, not turning the page exclusive but instead replacing it by a copy. An additional sanity check sounds like the right thing to do. > > Sorry if this is a bit rambly, just trying to make sure I've understood > everything correctly. It's a very interesting corner case, thanks for bringing that up. I think the old mapcount based approach could have suffered from this theoretical issue, but I might be wrong.
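The ordering being walked through corresponds to a skeleton like the one below, a heavy simplification of the do_wp_page()/wp_page_copy() flow rather than the real code; the PageAnonExclusive() recheck under the PT lock is the extra sanity check suggested above and is not something the current kernel does.

#include <linux/mm.h>

/*
 * Skeleton of the COW-copy ordering discussed above (heavily simplified).
 * The folio_get() before dropping the PT lock keeps the refcount elevated,
 * so a racing fault in another thread cannot see page_count() == 1 and
 * flip the page back to "exclusive" for reuse while the copy is underway.
 */
static vm_fault_t cow_copy_sketch(struct vm_fault *vmf, struct folio *old)
{
	struct vm_area_struct *vma = vmf->vma;
	struct folio *new_folio;
	spinlock_t *ptl;
	pte_t *ptep;

	folio_get(old);				/* the extra reference */
	pte_unmap_unlock(vmf->pte, vmf->ptl);	/* drop PT lock to allocate */

	new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma,
				    vmf->address, false);
	if (!new_folio) {
		folio_put(old);
		return VM_FAULT_OOM;
	}
	/* ... copy contents, set up rmap, mmu notifiers elided ... */

	ptep = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &ptl);
	if (!pte_same(ptep_get(ptep), vmf->orig_pte) ||
	    PageAnonExclusive(folio_page(old, 0))) {
		/*
		 * The PTE changed (COW already broken, page reused or zapped)
		 * or the page became exclusive: back off and let the fault be
		 * retried.  The exclusivity test is the suggested extra
		 * sanity check, not present in mainline today.
		 */
		pte_unmap_unlock(ptep, ptl);
		folio_put(new_folio);
		folio_put(old);
		return 0;
	}
	/* ... install the new folio's PTE, flush the TLB, then unlock ... */
	pte_unmap_unlock(ptep, ptl);
	folio_put(old);
	return 0;
}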
On 19/04/2023 11:51, David Hildenbrand wrote: >> I'm looking to fix this problem in my code, but am struggling to see how the >> current code is safe. I'm thinking about the following scenario: >> > > Let's see :) > >> - A page is CoW mapped into processes A and B. >> - The page takes a fault in process A, and do_wp_page() determines that it is >> "maybe-shared" and therefore must copy. So drops the PTL and calls >> wp_page_copy(). > > Note that before calling wp_page_copy(), we do a folio_get(folio); Further, the > page table reference is only dropped once we actually replace the page in the > page table. So while in wp_page_copy(), the folio should have at least 2 > references if the page is still mapped. Ahh, yes, of course! > >> - Process B exits. >> - Another thread in process A faults on the page. This time dw_wp_page() >> determines that the page is exclusive (due to the ref count), and reuses it, >> marking it exclusive along the way. > > The refcount should not be 1 (other reference from the wp_page_copy() caller), > so A won't be able to reuse it, and ... > >> - wp_page_copy() from the original thread in process A retakes the PTL and >> copies the _now exclusive_ page. >> >> Having typed it up, I guess this can't happen, because wp_page_copy() will only >> do the copy if the PTE hasn't changed and it will have changed because it is now >> writable? So this is safe? > > this applies as well. If the pte changed (when reusing due to a write failt it's > now writable, or someone else broke COW), we back off. For FAULT_FLAG_UNSHARE, > however, the PTE may not change. But the additional reference should make it work. > > I think it works as intended. It would be clearer if we'd also recheck in > wp_page_copy() whether we still don't have an exclusive anon page under PT lock > -- and if we would, back off. Yes, agreed. I now have a fix for my code that adds a check during the PTE scan that each page is non-exclusive. And the scan is (already) done once without the PTL to determine the size of destination folio, and then again with the lock to ensure there was no race. I'm not seeing any change in the relative counts of folio orders (which is expected due to the case being rare). Thanks!
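Structurally, the fix described here is an optimistic two-pass pattern. The compressed sketch below uses hypothetical helper names and is not the series code; scan_copy_range() stands in for a walk like the batch-size scan sketched earlier (same folio, contiguous sub-pages, none exclusive).

#include <linux/mm.h>

/* Hypothetical helper, definition omitted; see the earlier batch-size sketch. */
int scan_copy_range(struct vm_fault *vmf, struct page *page, int max_nr);

/*
 * Illustrative two-pass structure only: size the destination folio without
 * the PT lock, then re-verify the range under the lock before installing
 * anything, so a concurrent reuse or exclusivity change makes us fall back.
 */
static int size_then_verify_copy(struct vm_fault *vmf, struct page *page,
				 int max_nr)
{
	int nr;

	/* Pass 1: unlocked, only used to pick an allocation size. */
	nr = scan_copy_range(vmf, page, max_nr);

	/* ... allocate the destination folio and copy the contents ... */

	/* Pass 2: retake the PT lock and confirm nothing raced with us. */
	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
				       vmf->address, &vmf->ptl);
	if (scan_copy_range(vmf, page, nr) != nr) {
		/* A PTE changed or a sub-page became exclusive: fall back. */
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return -EAGAIN;
	}
	/* ... install the new PTEs under the lock ... */
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	return nr;
}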
Hi David, On 17/04/2023 16:44, David Hildenbrand wrote: >>>>> >>>>> So what should be safe is replacing all sub-pages of a folio that are marked >>>>> "maybe shared" by a new folio under PT lock. However, I wonder if it's really >>>>> worth the complexity. For THP we were happy so far to *not* optimize this, >>>>> implying that maybe we shouldn't worry about optimizing the fork() case for >>>>> now >>>>> that heavily. >>>> >>>> I don't have the exact numbers to hand, but I'm pretty sure I remember enabling >>>> large copies was contributing a measurable amount to the performance >>>> improvement. (Certainly, the zero-page copy case, is definitely a big >>>> contributer). I don't have access to the HW at the moment but can rerun later >>>> with and without to double check. >>> >>> In which test exactly? Some micro-benchmark? >> >> The kernel compile benchmark that I quoted numbers for in the cover letter. I >> have some trace points (not part of the submitted series) that tell me how many >> mappings of each order we get for each code path. I'm pretty sure I remember all >> of these 4 code paths contributing non-negligible amounts. > > Interesting! It would be great to see if there is an actual difference after > patch #10 was applied without the other COW replacement. > Sorry about the delay. I now have some numbers for this... I rearranged the patch order so that all the "utility" stuff (new rmap functions, etc) are first (1, 2, 3, 4, 5, 8, 9, 11, 12, 13), followed by a couple of general improvements (7, 17), which should be dormant until we have the final patches, then finally (6, 10, 14, 15), which implement large anon folios the allocate, reuse, copy-non-zero and copy-zero paths respectively. I've dropped patch 16 and fixed the copy-exclusive bug you spotted (by ensuring we never replace an exclusive page). I've measured performance at the following locations in the patch set: - baseline: none of my patches applied - utility: has utility and general improvement patches applied - alloc: utility + 6 - reuse: utility + 6 + 10 - copy: utility + 6 + 10 + 14 - zero-alloc: utility + 6 + 19 + 14 + 15 The test is `make defconfig && time make -jN Image` for a clean checkout of v6.3-rc3. The first result is thrown away, and the next 3 are kept. I saw some per-boot variance (probably down to kaslr, etc). So have booted each kernel 7 times for a total of 3x7=21 samples per kernel. Then I've taken the mean: jobs=8: | label | real | user | kernel | |:-----------|-------:|-------:|---------:| | baseline | 0.0% | 0.0% | 0.0% | | utility | -2.7% | -2.8% | -3.1% | | alloc | -6.0% | -2.3% | -24.1% | | reuse | -9.5% | -5.8% | -28.5% | | copy | -10.6% | -6.9% | -29.4% | | zero-alloc | -9.2% | -5.1% | -29.8% | jobs=160: | label | real | user | kernel | |:-----------|-------:|-------:|---------:| | baseline | 0.0% | 0.0% | 0.0% | | utility | -1.8% | -0.0% | -7.7% | | alloc | -6.0% | 1.8% | -20.9% | | reuse | -7.8% | -1.6% | -24.1% | | copy | -7.8% | -2.5% | -26.8% | | zero-alloc | -7.7% | 1.5% | -29.4% | So it looks like patch 10 (reuse) is making a difference, but copy and zero-alloc are not adding a huge amount, as you hypothesized. Personally I would prefer not to drop those patches though, as it will all help towards utilization of contiguous PTEs on arm64, which is the second part of the change that I'm now working on. For the final config ("zero-alloc") I also collected stats on how many operations each of the 4 paths was performing, using ftrace and histograms. 
"pnr" is the number of pages allocated/reused/copied, and "fnr" is the number of pages in the source folio): do_anonymous_page: { pnr: 1 } hitcount: 2749722 { pnr: 4 } hitcount: 387832 { pnr: 8 } hitcount: 409628 { pnr: 16 } hitcount: 4296115 pages: 76315914 faults: 7843297 pages per fault: 9.7 wp_page_reuse (anon): { pnr: 1, fnr: 1 } hitcount: 47887 { pnr: 3, fnr: 4 } hitcount: 2 { pnr: 4, fnr: 4 } hitcount: 6131 { pnr: 6, fnr: 8 } hitcount: 1 { pnr: 7, fnr: 8 } hitcount: 10 { pnr: 8, fnr: 8 } hitcount: 3794 { pnr: 1, fnr: 16 } hitcount: 36 { pnr: 2, fnr: 16 } hitcount: 23 { pnr: 3, fnr: 16 } hitcount: 5 { pnr: 4, fnr: 16 } hitcount: 9 { pnr: 5, fnr: 16 } hitcount: 8 { pnr: 6, fnr: 16 } hitcount: 9 { pnr: 7, fnr: 16 } hitcount: 3 { pnr: 8, fnr: 16 } hitcount: 24 { pnr: 9, fnr: 16 } hitcount: 2 { pnr: 10, fnr: 16 } hitcount: 1 { pnr: 11, fnr: 16 } hitcount: 9 { pnr: 12, fnr: 16 } hitcount: 2 { pnr: 13, fnr: 16 } hitcount: 27 { pnr: 14, fnr: 16 } hitcount: 2 { pnr: 15, fnr: 16 } hitcount: 54 { pnr: 16, fnr: 16 } hitcount: 6673 pages: 211393 faults: 64712 pages per fault: 3.3 wp_page_copy (anon): { pnr: 1, fnr: 1 } hitcount: 81242 { pnr: 4, fnr: 4 } hitcount: 5974 { pnr: 1, fnr: 8 } hitcount: 1 { pnr: 4, fnr: 8 } hitcount: 1 { pnr: 8, fnr: 8 } hitcount: 12933 { pnr: 1, fnr: 16 } hitcount: 19 { pnr: 4, fnr: 16 } hitcount: 3 { pnr: 8, fnr: 16 } hitcount: 7 { pnr: 16, fnr: 16 } hitcount: 4106 pages: 274390 faults: 104286 pages per fault: 2.6 wp_page_copy (zero): { pnr: 1 } hitcount: 178699 { pnr: 4 } hitcount: 14498 { pnr: 8 } hitcount: 23644 { pnr: 16 } hitcount: 257940 pages: 4552883 faults: 474781 pages per fault: 9.6 Thanks, Ryan
On 26.04.23 12:41, Ryan Roberts wrote: > Hi David, > > On 17/04/2023 16:44, David Hildenbrand wrote: > >>>>>> >>>>>> So what should be safe is replacing all sub-pages of a folio that are marked >>>>>> "maybe shared" by a new folio under PT lock. However, I wonder if it's really >>>>>> worth the complexity. For THP we were happy so far to *not* optimize this, >>>>>> implying that maybe we shouldn't worry about optimizing the fork() case for >>>>>> now >>>>>> that heavily. >>>>> >>>>> I don't have the exact numbers to hand, but I'm pretty sure I remember enabling >>>>> large copies was contributing a measurable amount to the performance >>>>> improvement. (Certainly, the zero-page copy case, is definitely a big >>>>> contributer). I don't have access to the HW at the moment but can rerun later >>>>> with and without to double check. >>>> >>>> In which test exactly? Some micro-benchmark? >>> >>> The kernel compile benchmark that I quoted numbers for in the cover letter. I >>> have some trace points (not part of the submitted series) that tell me how many >>> mappings of each order we get for each code path. I'm pretty sure I remember all >>> of these 4 code paths contributing non-negligible amounts. >> >> Interesting! It would be great to see if there is an actual difference after >> patch #10 was applied without the other COW replacement. >> > > Sorry about the delay. I now have some numbers for this... > Dito, I'm swamped :) Thanks for running these benchmarks! As LSF/MM reminded me again of this topic ... > I rearranged the patch order so that all the "utility" stuff (new rmap > functions, etc) are first (1, 2, 3, 4, 5, 8, 9, 11, 12, 13), followed by a > couple of general improvements (7, 17), which should be dormant until we have > the final patches, then finally (6, 10, 14, 15), which implement large anon > folios the allocate, reuse, copy-non-zero and copy-zero paths respectively. I've > dropped patch 16 and fixed the copy-exclusive bug you spotted (by ensuring we > never replace an exclusive page). > > I've measured performance at the following locations in the patch set: > > - baseline: none of my patches applied > - utility: has utility and general improvement patches applied > - alloc: utility + 6 > - reuse: utility + 6 + 10 > - copy: utility + 6 + 10 + 14 > - zero-alloc: utility + 6 + 19 + 14 + 15 > > The test is `make defconfig && time make -jN Image` for a clean checkout of > v6.3-rc3. The first result is thrown away, and the next 3 are kept. I saw some > per-boot variance (probably down to kaslr, etc). So have booted each kernel 7 > times for a total of 3x7=21 samples per kernel. Then I've taken the mean: > > > jobs=8: > > | label | real | user | kernel | > |:-----------|-------:|-------:|---------:| > | baseline | 0.0% | 0.0% | 0.0% | > | utility | -2.7% | -2.8% | -3.1% | > | alloc | -6.0% | -2.3% | -24.1% | > | reuse | -9.5% | -5.8% | -28.5% | > | copy | -10.6% | -6.9% | -29.4% | > | zero-alloc | -9.2% | -5.1% | -29.8% | > > > jobs=160: > > | label | real | user | kernel | > |:-----------|-------:|-------:|---------:| > | baseline | 0.0% | 0.0% | 0.0% | > | utility | -1.8% | -0.0% | -7.7% | > | alloc | -6.0% | 1.8% | -20.9% | > | reuse | -7.8% | -1.6% | -24.1% | > | copy | -7.8% | -2.5% | -26.8% | > | zero-alloc | -7.7% | 1.5% | -29.4% | > > > So it looks like patch 10 (reuse) is making a difference, but copy and > zero-alloc are not adding a huge amount, as you hypothesized. 
Personally I would > prefer not to drop those patches though, as it will all help towards utilization > of contiguous PTEs on arm64, which is the second part of the change that I'm now > working on. Yes, pretty much what I expected :) I can only suggest to (1) Make the initial support as simple and minimal as possible. That means, strip anything that is not absolutely required. That is, exclude *at least* copy and zero-alloc. We can always add selected optimizations on top later. You'll do yourself a favor to get as much review coverage, faster review for inclusion, and less chances for nasty BUGs. (2) Keep the COW logic simple. We've had too many issues in that area for my taste already. As 09854ba94c6a ("mm: do_wp_page() simplification") from Linus puts it: "Simplify, simplify, simplify.". If it doesn't add significant benefit, rather keep it simple. > > > For the final config ("zero-alloc") I also collected stats on how many > operations each of the 4 paths was performing, using ftrace and histograms. > "pnr" is the number of pages allocated/reused/copied, and "fnr" is the number of > pages in the source folio): > > > do_anonymous_page: > > { pnr: 1 } hitcount: 2749722 > { pnr: 4 } hitcount: 387832 > { pnr: 8 } hitcount: 409628 > { pnr: 16 } hitcount: 4296115 > > pages: 76315914 > faults: 7843297 > pages per fault: 9.7 > > > wp_page_reuse (anon): > > { pnr: 1, fnr: 1 } hitcount: 47887 > { pnr: 3, fnr: 4 } hitcount: 2 > { pnr: 4, fnr: 4 } hitcount: 6131 > { pnr: 6, fnr: 8 } hitcount: 1 > { pnr: 7, fnr: 8 } hitcount: 10 > { pnr: 8, fnr: 8 } hitcount: 3794 > { pnr: 1, fnr: 16 } hitcount: 36 > { pnr: 2, fnr: 16 } hitcount: 23 > { pnr: 3, fnr: 16 } hitcount: 5 > { pnr: 4, fnr: 16 } hitcount: 9 > { pnr: 5, fnr: 16 } hitcount: 8 > { pnr: 6, fnr: 16 } hitcount: 9 > { pnr: 7, fnr: 16 } hitcount: 3 > { pnr: 8, fnr: 16 } hitcount: 24 > { pnr: 9, fnr: 16 } hitcount: 2 > { pnr: 10, fnr: 16 } hitcount: 1 > { pnr: 11, fnr: 16 } hitcount: 9 > { pnr: 12, fnr: 16 } hitcount: 2 > { pnr: 13, fnr: 16 } hitcount: 27 > { pnr: 14, fnr: 16 } hitcount: 2 > { pnr: 15, fnr: 16 } hitcount: 54 > { pnr: 16, fnr: 16 } hitcount: 6673 > > pages: 211393 > faults: 64712 > pages per fault: 3.3 > > > wp_page_copy (anon): > > { pnr: 1, fnr: 1 } hitcount: 81242 > { pnr: 4, fnr: 4 } hitcount: 5974 > { pnr: 1, fnr: 8 } hitcount: 1 > { pnr: 4, fnr: 8 } hitcount: 1 > { pnr: 8, fnr: 8 } hitcount: 12933 > { pnr: 1, fnr: 16 } hitcount: 19 > { pnr: 4, fnr: 16 } hitcount: 3 > { pnr: 8, fnr: 16 } hitcount: 7 > { pnr: 16, fnr: 16 } hitcount: 4106 > > pages: 274390 > faults: 104286 > pages per fault: 2.6 > > > wp_page_copy (zero): > > { pnr: 1 } hitcount: 178699 > { pnr: 4 } hitcount: 14498 > { pnr: 8 } hitcount: 23644 > { pnr: 16 } hitcount: 257940 > > pages: 4552883 > faults: 474781 > pages per fault: 9.6 I'll have to set aside more time to digest these values :)
On 17/05/2023 14:58, David Hildenbrand wrote: > On 26.04.23 12:41, Ryan Roberts wrote: >> Hi David, >> >> On 17/04/2023 16:44, David Hildenbrand wrote: >> >>>>>>> >>>>>>> So what should be safe is replacing all sub-pages of a folio that are marked >>>>>>> "maybe shared" by a new folio under PT lock. However, I wonder if it's >>>>>>> really >>>>>>> worth the complexity. For THP we were happy so far to *not* optimize this, >>>>>>> implying that maybe we shouldn't worry about optimizing the fork() case for >>>>>>> now >>>>>>> that heavily. >>>>>> >>>>>> I don't have the exact numbers to hand, but I'm pretty sure I remember >>>>>> enabling >>>>>> large copies was contributing a measurable amount to the performance >>>>>> improvement. (Certainly, the zero-page copy case, is definitely a big >>>>>> contributer). I don't have access to the HW at the moment but can rerun later >>>>>> with and without to double check. >>>>> >>>>> In which test exactly? Some micro-benchmark? >>>> >>>> The kernel compile benchmark that I quoted numbers for in the cover letter. I >>>> have some trace points (not part of the submitted series) that tell me how many >>>> mappings of each order we get for each code path. I'm pretty sure I remember >>>> all >>>> of these 4 code paths contributing non-negligible amounts. >>> >>> Interesting! It would be great to see if there is an actual difference after >>> patch #10 was applied without the other COW replacement. >>> >> >> Sorry about the delay. I now have some numbers for this... >> > > Dito, I'm swamped :) Thanks for running these benchmarks! > > As LSF/MM reminded me again of this topic ... > >> I rearranged the patch order so that all the "utility" stuff (new rmap >> functions, etc) are first (1, 2, 3, 4, 5, 8, 9, 11, 12, 13), followed by a >> couple of general improvements (7, 17), which should be dormant until we have >> the final patches, then finally (6, 10, 14, 15), which implement large anon >> folios the allocate, reuse, copy-non-zero and copy-zero paths respectively. I've >> dropped patch 16 and fixed the copy-exclusive bug you spotted (by ensuring we >> never replace an exclusive page). >> >> I've measured performance at the following locations in the patch set: >> >> - baseline: none of my patches applied >> - utility: has utility and general improvement patches applied >> - alloc: utility + 6 >> - reuse: utility + 6 + 10 >> - copy: utility + 6 + 10 + 14 >> - zero-alloc: utility + 6 + 19 + 14 + 15 >> >> The test is `make defconfig && time make -jN Image` for a clean checkout of >> v6.3-rc3. The first result is thrown away, and the next 3 are kept. I saw some >> per-boot variance (probably down to kaslr, etc). So have booted each kernel 7 >> times for a total of 3x7=21 samples per kernel. 
Then I've taken the mean: >> >> >> jobs=8: >> >> | label | real | user | kernel | >> |:-----------|-------:|-------:|---------:| >> | baseline | 0.0% | 0.0% | 0.0% | >> | utility | -2.7% | -2.8% | -3.1% | >> | alloc | -6.0% | -2.3% | -24.1% | >> | reuse | -9.5% | -5.8% | -28.5% | >> | copy | -10.6% | -6.9% | -29.4% | >> | zero-alloc | -9.2% | -5.1% | -29.8% | >> >> >> jobs=160: >> >> | label | real | user | kernel | >> |:-----------|-------:|-------:|---------:| >> | baseline | 0.0% | 0.0% | 0.0% | >> | utility | -1.8% | -0.0% | -7.7% | >> | alloc | -6.0% | 1.8% | -20.9% | >> | reuse | -7.8% | -1.6% | -24.1% | >> | copy | -7.8% | -2.5% | -26.8% | >> | zero-alloc | -7.7% | 1.5% | -29.4% | >> >> >> So it looks like patch 10 (reuse) is making a difference, but copy and >> zero-alloc are not adding a huge amount, as you hypothesized. Personally I would >> prefer not to drop those patches though, as it will all help towards utilization >> of contiguous PTEs on arm64, which is the second part of the change that I'm now >> working on. > > Yes, pretty much what I expected :) I can only suggest to > > (1) Make the initial support as simple and minimal as possible. That > means, strip anything that is not absolutely required. That is, > exclude *at least* copy and zero-alloc. We can always add selected > optimizations on top later. > > You'll do yourself a favor to get as much review coverage, faster > review for inclusion, and less chances for nasty BUGs. As I'm building out the testing capability, I'm seeing that with different HW configs and workloads, things move a bit and zero-alloc can account for up to 1% in some cases. Copy is still pretty marginal, but I wonder if we might see more value from it on Android where the Zygote is constantly forked? Regardless, I appreciate your point about making the initial contribution as small and simple as possible, so as I get closer to posting a v1, I'll keep it in mind and make sure I follow your advice. Thanks, Ryan > > (2) Keep the COW logic simple. We've had too many issues > in that area for my taste already. As 09854ba94c6a ("mm: > do_wp_page() simplification") from Linus puts it: "Simplify, > simplify, simplify.". If it doesn't add significant benefit, rather > keep it simple. > >> >> >> For the final config ("zero-alloc") I also collected stats on how many >> operations each of the 4 paths was performing, using ftrace and histograms. 
>> "pnr" is the number of pages allocated/reused/copied, and "fnr" is the number of >> pages in the source folio): >> >> >> do_anonymous_page: >> >> { pnr: 1 } hitcount: 2749722 >> { pnr: 4 } hitcount: 387832 >> { pnr: 8 } hitcount: 409628 >> { pnr: 16 } hitcount: 4296115 >> >> pages: 76315914 >> faults: 7843297 >> pages per fault: 9.7 >> >> >> wp_page_reuse (anon): >> >> { pnr: 1, fnr: 1 } hitcount: 47887 >> { pnr: 3, fnr: 4 } hitcount: 2 >> { pnr: 4, fnr: 4 } hitcount: 6131 >> { pnr: 6, fnr: 8 } hitcount: 1 >> { pnr: 7, fnr: 8 } hitcount: 10 >> { pnr: 8, fnr: 8 } hitcount: 3794 >> { pnr: 1, fnr: 16 } hitcount: 36 >> { pnr: 2, fnr: 16 } hitcount: 23 >> { pnr: 3, fnr: 16 } hitcount: 5 >> { pnr: 4, fnr: 16 } hitcount: 9 >> { pnr: 5, fnr: 16 } hitcount: 8 >> { pnr: 6, fnr: 16 } hitcount: 9 >> { pnr: 7, fnr: 16 } hitcount: 3 >> { pnr: 8, fnr: 16 } hitcount: 24 >> { pnr: 9, fnr: 16 } hitcount: 2 >> { pnr: 10, fnr: 16 } hitcount: 1 >> { pnr: 11, fnr: 16 } hitcount: 9 >> { pnr: 12, fnr: 16 } hitcount: 2 >> { pnr: 13, fnr: 16 } hitcount: 27 >> { pnr: 14, fnr: 16 } hitcount: 2 >> { pnr: 15, fnr: 16 } hitcount: 54 >> { pnr: 16, fnr: 16 } hitcount: 6673 >> >> pages: 211393 >> faults: 64712 >> pages per fault: 3.3 >> >> >> wp_page_copy (anon): >> >> { pnr: 1, fnr: 1 } hitcount: 81242 >> { pnr: 4, fnr: 4 } hitcount: 5974 >> { pnr: 1, fnr: 8 } hitcount: 1 >> { pnr: 4, fnr: 8 } hitcount: 1 >> { pnr: 8, fnr: 8 } hitcount: 12933 >> { pnr: 1, fnr: 16 } hitcount: 19 >> { pnr: 4, fnr: 16 } hitcount: 3 >> { pnr: 8, fnr: 16 } hitcount: 7 >> { pnr: 16, fnr: 16 } hitcount: 4106 >> >> pages: 274390 >> faults: 104286 >> pages per fault: 2.6 >> >> >> wp_page_copy (zero): >> >> { pnr: 1 } hitcount: 178699 >> { pnr: 4 } hitcount: 14498 >> { pnr: 8 } hitcount: 23644 >> { pnr: 16 } hitcount: 257940 >> >> pages: 4552883 >> faults: 474781 >> pages per fault: 9.6 > > I'll have to set aside more time to digest these values :) >