Message ID: 20241212073711.82300-1-21cnbao@gmail.com (mailing list archive)
State:      New
Series:     [RFC] mm: map zero-filled pages to zero_pfn while doing swap-in
On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> While developing the zeromap series, Usama observed that certain
> workloads may contain over 10% zero-filled pages. This may present
> an opportunity to save memory by mapping zero-filled pages to zero_pfn
> in do_swap_page(). If a write occurs later, do_wp_page() can
> allocate a new page using the Copy-on-Write mechanism.

Shouldn't this be done during swap-out, or rather instead of swap-out?
Swapping all zero pages out just to optimize the in-memory
representation seems rather backwards.
On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> > [...]
>
> Shouldn't this be done during swap-out, or rather instead of swap-out?
> Swapping all zero pages out just to optimize the in-memory
> representation seems rather backwards.

I'm having trouble understanding your point; it seems like you might
not have fully read the code. :-)

The situation is as follows: for a zero-filled page, we currently
allocate a new page unconditionally at swap-in. By mapping this
zero-filled page to zero_pfn, we can save the memory used by this page.
We don't need to allocate the memory until the page is written (which
may never happen).

Thanks
Barry
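[The zero_pfn-plus-CoW behavior being discussed here is the same one visible
from userspace on never-written private anonymous mappings: read faults map
the shared zero page without allocating, and the first write CoW-breaks into
a real page. Below is a minimal stand-alone demo sketch, assuming Linux and
using VmRSS as a rough proxy for allocation; it is an illustration, not part
of the patch.]

#include <stdio.h>
#include <sys/mman.h>

static long rss_kb(void)
{
	FILE *f = fopen("/proc/self/status", "r");
	char line[256];
	long kb = -1;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
			break;
	fclose(f);
	return kb;
}

int main(void)
{
	size_t len = 64UL << 20;	/* 64 MiB of private anonymous memory */
	volatile char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	long sum = 0;

	if (p == MAP_FAILED)
		return 1;
	printf("baseline     RSS: %ld kB\n", rss_kb());

	/* Read every page: the faults map the shared zero page read-only,
	 * so RSS barely grows. */
	for (size_t i = 0; i < len; i += 4096)
		sum += p[i];
	printf("after reads  RSS: %ld kB (sum=%ld)\n", rss_kb(), sum);

	/* Write every page: each write CoW-breaks in do_wp_page() and
	 * allocates a real page, so RSS grows by roughly 64 MiB. */
	for (size_t i = 0; i < len; i += 4096)
		p[i] = 1;
	printf("after writes RSS: %ld kB\n", rss_kb());

	munmap((void *)p, len);
	return 0;
}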
On Thu, Dec 12, 2024 at 09:46:03PM +1300, Barry Song wrote:
> On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
> > [...]
> > Shouldn't this be done during swap-out, or rather instead of swap-out?
> > Swapping all zero pages out just to optimize the in-memory
> > representation seems rather backwards.
>
> I'm having trouble understanding your point; it seems like you might
> not have fully read the code. :-)

I've not read the code at all, I've read your commit log.

> The situation is as follows: for a zero-filled page, we currently
> allocate a new page unconditionally at swap-in. By mapping this
> zero-filled page to zero_pfn, we can save the memory used by this page.

Yes. But why do that at swap-in and not at swap-out?
On 12.12.24 09:46, Barry Song wrote:
> On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
>> [...]
>> Shouldn't this be done during swap-out, or rather instead of swap-out?
>> Swapping all zero pages out just to optimize the in-memory
>> representation seems rather backwards.
>
> The situation is as follows: for a zero-filled page, we currently
> allocate a new page unconditionally at swap-in. By mapping this
> zero-filled page to zero_pfn, we can save the memory used by this page.
>
> We don't need to allocate the memory until the page is written (which
> may never happen).

I think what Christoph means is that you would determine that at PTE
unmap time, and directly place the zero page in there. So there would be
no need to have the page fault at all.

I suspect doing it at PTE unmap time might be problematic, because we
might still have other (e.g., GUP) references modifying that page, and
we can only rely on the page content being stable after we have flushed
the TLB as well. (I recall some deferred flushing optimizations.)
On Thu, Dec 12, 2024 at 9:50 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Dec 12, 2024 at 09:46:03PM +1300, Barry Song wrote:
> > [...]
> > The situation is as follows: for a zero-filled page, we currently
> > allocate a new page unconditionally at swap-in. By mapping this
> > zero-filled page to zero_pfn, we can save the memory used by this page.
>
> Yes. But why do that at swap-in and not at swap-out?

Usama implemented this at swap-out, where no I/O occurs after his
zeromap series. A bit is set in the swap->zeromap bitmap if the
swapped-out page is zero-filled, and all swap-out I/O for it is skipped.

Now, the situation is that when we re-access a swapped-out page, we
don't always need to allocate a new page. Instead, we can map it to
zero_pfn and defer the allocation until the page is written.
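[For readers unfamiliar with the zeromap series: conceptually it keeps one
bit per swap slot. A toy model follows; the helper names are invented for
illustration, while the real bitmap lives in swap_info_struct.]

#include <stdbool.h>
#include <limits.h>

#define BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))

/* Swap-out marks the slot instead of writing 4 KiB of zeroes. */
static inline void zeromap_set(unsigned long *map, unsigned long slot)
{
	map[slot / BITS_PER_LONG] |= 1UL << (slot % BITS_PER_LONG);
}

/* Swap-in (see the patch later in this thread) tests the bit to decide
 * whether the fault can be served by zero_pfn, with no I/O and, with
 * this RFC, no page allocation either. */
static inline bool zeromap_test(const unsigned long *map, unsigned long slot)
{
	return map[slot / BITS_PER_LONG] & (1UL << (slot % BITS_PER_LONG));
}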
On Thu, Dec 12, 2024 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
>
> [...]
> I think what Christoph means is that you would determine that at PTE
> unmap time, and directly place the zero page in there. So there would be
> no need to have the page fault at all.
>
> I suspect doing it at PTE unmap time might be problematic, because we
> might still have other (e.g., GUP) references modifying that page, and
> we can only rely on the page content being stable after we have flushed
> the TLB as well. (I recall some deferred flushing optimizations.)

Yes, we need to follow a strict sequence:

1. try_to_unmap - unmap PTEs in all processes;
2. try_to_unmap_flush_dirty - flush the deferred TLB shootdown;
3. pageout - zeromap sets a bit in the bitmap if the page is zero-filled.

Only at the moment of pageout() can we be confident that the page is
zero-filled. Mapping to the zero page during unmap seems quite risky.

Thanks
Barry
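[The ordering constraint above can be sketched in straight-line form. This
is a heavily simplified illustration: the stubs are named after their kernel
counterparts but are empty stand-ins, and the page model is a toy.]

#include <stdbool.h>
#include <stddef.h>

struct page { unsigned long words[512]; bool zeromap_bit; };

static void try_to_unmap(struct page *p) { (void)p; }	/* step 1 */
static void try_to_unmap_flush_dirty(void) { }		/* step 2 */
static void submit_swap_io(struct page *p) { (void)p; }

static bool page_is_all_zero(const struct page *p)
{
	for (size_t i = 0; i < 512; i++)
		if (p->words[i])
			return false;
	return true;
}

static void reclaim_one_page(struct page *page)
{
	try_to_unmap(page);		/* no PTE can start a new write...   */
	try_to_unmap_flush_dirty();	/* ...and no stale TLB entry remains */

	/* Step 3: only here (absent GUP pins) is the content stable, so
	 * pageout() can safely classify the page as zero-filled. */
	if (page_is_all_zero(page))
		page->zeromap_bit = true;	/* mark in zeromap, skip I/O */
	else
		submit_swap_io(page);
}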
On Thu, Dec 12, 2024 at 10:16:22PM +1300, Barry Song wrote:
> [...]
> Yes, we need to follow a strict sequence:
>
> 1. try_to_unmap - unmap PTEs in all processes;
> 2. try_to_unmap_flush_dirty - flush the deferred TLB shootdown;
> 3. pageout - zeromap sets a bit in the bitmap if the page is zero-filled.
>
> Only at the moment of pageout() can we be confident that the page is
> zero-filled. Mapping to the zero page during unmap seems quite risky.

You have to unmap and flush to stop modifications, but I don't think
you need to do that in all processes before it's safe to decide. Shared
anon pages have COW semantics; when you enter try_to_unmap() with a
page and rmap gives you a pte, it's one of these:

a) never forked, no sibling ptes
b) cow-broken into a private copy, no sibling ptes
c) cow/WP; any writes to this or another pte will go to a new page.

In cases a and b you need to unmap and flush the current pte, but then
it's safe to check the contents and set the zero pte right away, even
before finishing the rmap walk.

In case c, modifications to the page are impossible due to WP, so you
don't even need to unmap and flush before checking the contents. The
pte lock holds up COW-breaking to a new page until you're done.

It's definitely more complicated than the current implementation, but
if it can be made to work, we could get rid of the bitmap.

You might also reduce faults, but I'm a bit skeptical. Presumably
zero-filled regions are mostly considered invalid by the application,
not useful data, so a populating write that will cow-break seems more
likely to happen next than a faultless read from the zeropage.
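[A compilable toy model of the case split above; all types and helpers are
invented stand-ins, and this is not how try_to_unmap_one() is written. The
pte lock is assumed held, which is what blocks concurrent COW-breaking.]

#include <stdbool.h>

struct rmap_pte { bool write_protected; };
struct anon_page { int mapcount; /* 1 => cases a/b, >1 => case c */ };

static void unmap_and_flush_one(struct rmap_pte *pte) { (void)pte; }
static bool content_is_zero(const struct anon_page *p) { (void)p; return true; }

static bool try_install_zero_pte(struct anon_page *page, struct rmap_pte *pte)
{
	if (page->mapcount == 1) {
		/* a) never forked, or b) already cow-broken: this is the
		 * only pte, so unmap + flush it and the content is stable. */
		unmap_and_flush_one(pte);
		return content_is_zero(page);
	}

	/* c) shared anon is necessarily write-protected: nobody can modify
	 * the page, so the content can be checked without unmapping, and
	 * the held pte lock holds up COW-breaking until we are done. */
	return pte->write_protected && content_is_zero(page);
}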
On Fri, Dec 13, 2024 at 5:25 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> [...]
> It's definitely more complicated than the current implementation, but
> if it can be made to work, we could get rid of the bitmap.
>
> You might also reduce faults, but I'm a bit skeptical. Presumably
> zero-filled regions are mostly considered invalid by the application,
> not useful data, so a populating write that will cow-break seems more
> likely to happen next than a faultless read from the zeropage.

Yes. That is right.

I created the following debug patch to count the proportional distribution
of zero_swpin reads, as well as the comparison between zero_swpin and
zero_swpout:

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f70d0958095c..ed9d1a6cc565 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -136,6 +136,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		SWAP_RA_HIT,
 		SWPIN_ZERO,
 		SWPOUT_ZERO,
+		SWPIN_ZERO_READ,
 #ifdef CONFIG_KSM
 		KSM_SWPIN_COPY,
 #endif
diff --git a/mm/memory.c b/mm/memory.c
index f3040c69f648..3aacfbe7bd77 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4400,6 +4400,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			/* Count SWPIN_ZERO since page_io was skipped */
 			objcg = get_obj_cgroup_from_swap(entry);
 			count_vm_events(SWPIN_ZERO, 1);
+			count_vm_events(SWPIN_ZERO_READ, 1);
 			if (objcg) {
 				count_objcg_events(objcg, SWPIN_ZERO, 1);
 				obj_cgroup_put(objcg);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4d016314a56c..9465fe9bda9e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1420,6 +1420,7 @@ const char * const vmstat_text[] = {
 	"swap_ra_hit",
 	"swpin_zero",
 	"swpout_zero",
+	"swpin_zero_read",
 #ifdef CONFIG_KSM
 	"ksm_swpin_copy",
 #endif

For a kernel-build workload in a single memcg with only 1GB of memory, use
the script below:

#!/bin/bash

echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

vmstat_path="/proc/vmstat"
thp_base_path="/sys/kernel/mm/transparent_hugepage"

read_values() {
    pswpin=$(grep "pswpin" $vmstat_path | awk '{print $2}')
    pswpout=$(grep "pswpout" $vmstat_path | awk '{print $2}')
    pgpgin=$(grep "pgpgin" $vmstat_path | awk '{print $2}')
    pgpgout=$(grep "pgpgout" $vmstat_path | awk '{print $2}')
    swpout_zero=$(grep "swpout_zero" $vmstat_path | awk '{print $2}')
    # -w so that "swpin_zero" does not also match swpin_zero_read
    swpin_zero=$(grep -w "swpin_zero" $vmstat_path | awk '{print $2}')
    swpin_zero_read=$(grep "swpin_zero_read" $vmstat_path | awk '{print $2}')

    echo "$pswpin $pswpout $pgpgin $pgpgout $swpout_zero $swpin_zero $swpin_zero_read"
}

for ((i=1; i<=5; i++))
do
    echo
    echo "*** Executing round $i ***"
    make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
    sync; echo 3 > /proc/sys/vm/drop_caches; sleep 1
    # kernel build
    initial_values=($(read_values))
    time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
        CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j10 1>/dev/null 2>/dev/null
    final_values=($(read_values))

    echo "pswpin: $((final_values[0] - initial_values[0]))"
    echo "pswpout: $((final_values[1] - initial_values[1]))"
    echo "pgpgin: $((final_values[2] - initial_values[2]))"
    echo "pgpgout: $((final_values[3] - initial_values[3]))"
    echo "swpout_zero: $((final_values[4] - initial_values[4]))"
    echo "swpin_zero: $((final_values[5] - initial_values[5]))"
    echo "swpin_zero_read: $((final_values[6] - initial_values[6]))"
done

The results I am seeing are as follows:

real    6m43.998s
user    47m3.800s
sys     5m7.169s
pswpin: 342041
pswpout: 1470846
pgpgin: 11744932
pgpgout: 14466564
swpout_zero: 318030
swpin_zero: 93621
swpin_zero_read: 13118

The proportion of zero_swpout is quite large (> 10%): 318,030 vs. 1,470,846.
The percentage is 17.8% = 318,030 / (318,030 + 1,470,846).

About 29.4% (93,621 / 318,030) of these will be swapped in, and 14% of those
zero_swpin pages are read (13,118 / 93,621).

Therefore, a total of 17.8% * 29.4% * 14% = 0.73% of all swapped-out pages
will be re-mapped to zero_pfn, potentially saving up to 0.73% RSS in this
kernel-build workload. Thus, the total build time in my final results falls
within the testing jitter range, showing no noticeable difference, while a
conceptual test program with many zero-filled pages and read swap-ins shows
significant differences.

I'm not sure if we can identify another real workload with more read
swap-ins to observe noticeable improvements. Perhaps Usama has some?

Thanks
Barry
On Fri, Dec 13, 2024 at 2:47 PM Barry Song <21cnbao@gmail.com> wrote:
>
> [...]
> The proportion of zero_swpout is quite large (> 10%): 318,030 vs. 1,470,846.
> The percentage is 17.8% = 318,030 / (318,030 + 1,470,846).
>
> About 29.4% (93,621 / 318,030) of these will be swapped in, and 14% of those
> zero_swpin pages are read (13,118 / 93,621).
>
> Therefore, a total of 17.8% * 29.4% * 14% = 0.73% of all swapped-out pages
> will be re-mapped to zero_pfn, potentially saving up to 0.73% RSS in this
> kernel-build workload. Thus, the total build time in my final results falls

Apologies for the mistake in my math. I shouldn't have used swpout as the
denominator and swpin as the numerator. Instead, both the numerator and
denominator should be based on swpin: potentially, 13,118 swpin_zero_read /
(342,041 pswpin + 13,118 swpin_zero_read) could be saved for swap-in,
meaning 3.7% of all swap-ins can be served from zero_pfn.

> within the testing jitter range, showing no noticeable difference, while a
> conceptual test program with many zero-filled pages and read swap-ins shows
> significant differences.

Although 3.7% of swap-ins can be saved, my x86 PC is too weak to
demonstrate the difference, as numerous factors, such as temperature and
unstable I/O latency, can impact the final build time. Hopefully, others
can share test results conducted on stable hardware and more workloads.

> I'm not sure if we can identify another real workload with more read
> swap-ins to observe noticeable improvements. Perhaps Usama has some?

Thanks
Barry
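[As a quick sanity check of the corrected ratio, a trivial stand-alone
program restating the arithmetic above:]

#include <stdio.h>

/* Share of swap-ins that could be served from zero_pfn instead of
 * allocating and filling a page, from the vmstat deltas above. */
int main(void)
{
	double pswpin = 342041, swpin_zero_read = 13118;
	double saved = swpin_zero_read / (pswpin + swpin_zero_read);

	printf("swap-ins served by zero_pfn: %.1f%%\n", saved * 100); /* 3.7% */
	return 0;
}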
diff --git a/mm/memory.c b/mm/memory.c
index 2bacebbf4cf6..b37f0f61d0bc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4294,6 +4294,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	struct swap_info_struct *si = NULL;
 	rmap_t rmap_flags = RMAP_NONE;
 	bool need_clear_cache = false;
+	bool map_zero_pfn = false;
 	bool exclusive = false;
 	swp_entry_t entry;
 	pte_t pte;
@@ -4364,6 +4365,39 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	swapcache = folio;
 
 	if (!folio) {
+		/* Use the zero-page for reads */
+		if (!(vmf->flags & FAULT_FLAG_WRITE) &&
+		    !mm_forbids_zeropage(vma->vm_mm) &&
+		    __swap_count(entry) == 1) {
+			swap_zeromap_batch(entry, 1, &map_zero_pfn);
+			if (map_zero_pfn) {
+				if (swapcache_prepare(entry, 1)) {
+					add_wait_queue(&swapcache_wq, &wait);
+					schedule_timeout_uninterruptible(1);
+					remove_wait_queue(&swapcache_wq, &wait);
+					goto out;
+				}
+				nr_pages = 1;
+				need_clear_cache = true;
+				pte = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
+						vma->vm_page_prot));
+				vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+						vmf->address, &vmf->ptl);
+				if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte),
+						vmf->orig_pte)))
+					goto unlock;
+
+				page = pfn_to_page(my_zero_pfn(vmf->address));
+				arch_swap_restore(entry, page_folio(page));
+				swap_free_nr(entry, 1);
+				add_mm_counter(vma->vm_mm, MM_SWAPENTS, -1);
+				set_ptes(vma->vm_mm, vmf->address, vmf->pte, pte, 1);
+				arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address, pte, pte, 1);
+				update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+				goto unlock;
+			}
+		}
+
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
 			/* skip swapcache */
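[The write side is not in this hunk, but the commit message relies on it:
the installed pte is special and read-only, so a later write faults into the
kernel's CoW path (do_wp_page()), which allocates a fresh zeroed page. A
toy, userspace-compilable sketch of that flow with invented stand-in types;
it is an analogy, not the kernel code.]

#include <stdlib.h>
#include <stdbool.h>

#define PAGE_SIZE 4096

struct toy_pte {
	void *page;	/* NULL models the special zero-page pte */
	bool writable;
};

static char zero_page[PAGE_SIZE];	/* stand-in for the shared zero page */

/* Read fault on a zeromap-marked swap entry: install a read-only
 * mapping backed by the shared zero page, allocating nothing. */
static void *fault_read(struct toy_pte *pte)
{
	pte->page = NULL;
	pte->writable = false;
	return zero_page;
}

/* Write fault: CoW-break. Since the "old page" is the zero page, a
 * fresh zeroed page is allocated and mapped writable (the analogue of
 * do_wp_page() -> wp_page_copy() clearing the new page). */
static void *fault_write(struct toy_pte *pte)
{
	void *new_page = calloc(1, PAGE_SIZE);

	pte->page = new_page;
	pte->writable = true;
	return new_page;
}

int main(void)
{
	struct toy_pte pte;
	char *ro = fault_read(&pte);	/* zero page mapped: no allocation */
	char *rw = fault_write(&pte);	/* CoW-break: real page allocated  */

	(void)ro; (void)rw;
	free(pte.page);
	return 0;
}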