
[RFC] mm: map zero-filled pages to zero_pfn while doing swap-in

Message ID 20241212073711.82300-1-21cnbao@gmail.com
State New
Series [RFC] mm: map zero-filled pages to zero_pfn while doing swap-in

Commit Message

Barry Song Dec. 12, 2024, 7:37 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

While developing the zeromap series, Usama observed that certain
workloads may contain over 10% zero-filled pages. This may present
an opportunity to save memory by mapping zero-filled pages to zero_pfn
in do_swap_page(). If a write occurs later, do_wp_page() can
allocate a new page using the Copy-on-Write mechanism.

For workloads with numerous zero-filled pages, this can greatly
reduce the RSS.

For example:
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
 #include <sys/mman.h>

 #define SIZE (20 * 1024 * 1024)
 int main()
 {
 	volatile char *buffer = (char *)mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
 			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 	volatile char data;

 	if (buffer == MAP_FAILED) {
 		perror("mmap failed");
 		exit(EXIT_FAILURE);
 	}

 	memset((void *)buffer, 0, SIZE);

 	if (madvise((void *)buffer, SIZE, MADV_PAGEOUT) != 0)
 		perror("madvise MADV_PAGEOUT failed");

 	for (size_t i = 0; i < SIZE; i++)
 		data = buffer[i];
 	sleep(1000);

 	return 0;
 }

~ # ./a.out &

w/o patch:
~ # ps aux | head -n 1; ps aux | grep '[a]\.out'
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       101  2.9 10.6  22540 21268 ttyAMA0  S    06:50   0:00 ./a.out

w/ patch:
~ # ps aux | head -n 1; ps aux | grep '[a]\.out'
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       141  0.1  0.3  22540   792 ttyAMA0  S    06:38   0:00 ./a.out

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

Comments

Christoph Hellwig Dec. 12, 2024, 8:29 a.m. UTC | #1
On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> While developing the zeromap series, Usama observed that certain
> workloads may contain over 10% zero-filled pages. This may present
> an opportunity to save memory by mapping zero-filled pages to zero_pfn
> in do_swap_page(). If a write occurs later, do_wp_page() can
> allocate a new page using the Copy-on-Write mechanism.

Shouldn't this be done during, or rather instead of, swap-out?
Swapping all zero pages out just to optimize the in-memory
representation seems rather backwards.
Barry Song Dec. 12, 2024, 8:46 a.m. UTC | #2
On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > While developing the zeromap series, Usama observed that certain
> > workloads may contain over 10% zero-filled pages. This may present
> > an opportunity to save memory by mapping zero-filled pages to zero_pfn
> > in do_swap_page(). If a write occurs later, do_wp_page() can
> > allocate a new page using the Copy-on-Write mechanism.
>
> Shouldn't this be done during, or rather instead of, swap-out?
> Swapping all zero pages out just to optimize the in-memory
> representation seems rather backwards.

I’m having trouble understanding your point—it seems like you might
not have fully read the code. :-)

The situation is as follows: for a zero-filled page, we are currently
allocating a new
page unconditionally. By mapping this zero-filled page to zero_pfn, we could
save the memory used by this page.

We don't need to allocate the memory until the page is written (which may never
happen).

>

Thanks
Barry
Christoph Hellwig Dec. 12, 2024, 8:50 a.m. UTC | #3
On Thu, Dec 12, 2024 at 09:46:03PM +1300, Barry Song wrote:
> On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > While developing the zeromap series, Usama observed that certain
> > > workloads may contain over 10% zero-filled pages. This may present
> > > an opportunity to save memory by mapping zero-filled pages to zero_pfn
> > > in do_swap_page(). If a write occurs later, do_wp_page() can
> > > allocate a new page using the Copy-on-Write mechanism.
> >
> > Shouldn't this be done during, or rather instead of, swap-out?
> > Swapping all zero pages out just to optimize the in-memory
> > representation seems rather backwards.
> 
> I’m having trouble understanding your point—it seems like you might
> not have fully read the code. :-)

I've not read the code at all, I've read your commit log.

> The situation is as follows: for a zero-filled page, we are currently
> allocating a new
> page unconditionally. By mapping this zero-filled page to zero_pfn, we could
> save the memory used by this page.

Yes.  But why do that in swap-in and not swap-out?
David Hildenbrand Dec. 12, 2024, 8:50 a.m. UTC | #4
On 12.12.24 09:46, Barry Song wrote:
> On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
>>
>> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> While developing the zeromap series, Usama observed that certain
>>> workloads may contain over 10% zero-filled pages. This may present
>>> an opportunity to save memory by mapping zero-filled pages to zero_pfn
>>> in do_swap_page(). If a write occurs later, do_wp_page() can
>>> allocate a new page using the Copy-on-Write mechanism.
>>
>> Shouldn't this be done during, or rather instead of, swap-out?
>> Swapping all zero pages out just to optimize the in-memory
>> representation seems rather backwards.
> 
> I’m having trouble understanding your point—it seems like you might
> not have fully read the code. :-)
> 
> The situation is as follows: for a zero-filled page, we are currently
> allocating a new
> page unconditionally. By mapping this zero-filled page to zero_pfn, we could
> save the memory used by this page.
> 
> We don't need to allocate the memory until the page is written (which may never
> happen).

I think what Christoph means is that you would determine that at PTE 
unmap time, and directly place the zero page in there. So there would be 
no need to have the page fault at all.

I suspect doing this at PTE unmap time might be problematic, because we might
still have other (e.g., GUP) references modifying that page, and we can only
rely on the page content being stable after we flushed the TLB as well. 
(I recall some deferred flushing optimizations)
Barry Song Dec. 12, 2024, 8:54 a.m. UTC | #5
On Thu, Dec 12, 2024 at 9:50 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Dec 12, 2024 at 09:46:03PM +1300, Barry Song wrote:
> > On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > > > From: Barry Song <v-songbaohua@oppo.com>
> > > >
> > > > While developing the zeromap series, Usama observed that certain
> > > > workloads may contain over 10% zero-filled pages. This may present
> > > > an opportunity to save memory by mapping zero-filled pages to zero_pfn
> > > > in do_swap_page(). If a write occurs later, do_wp_page() can
> > > > allocate a new page using the Copy-on-Write mechanism.
> > >
> > > Shouldn't this be done during, or rather instead of, swap-out?
> > > Swapping all zero pages out just to optimize the in-memory
> > > representation seems rather backwards.
> >
> > I’m having trouble understanding your point—it seems like you might
> > not have fully read the code. :-)
>
> I've not read the code at all, I've read your commit log.
>
> > The situation is as follows: for a zero-filled page, we are currently
> > allocating a new
> > page unconditionally. By mapping this zero-filled page to zero_pfn, we could
> > save the memory used by this page.
>
> Yes.  But why do that in swap-in and not swap-out?

Usama implemented this on the swap-out side: after his zeromap series,
no I/O occurs for zero-filled pages. A bit is set in the swap->zeromap
bitmap if the swapped-out page is zero-filled, and all swap-out I/O is
skipped.
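
On the swap-out side, this looks roughly like the following (a
simplified sketch of the swap_writepage() logic from the zeromap
series; error paths omitted, so not the exact code):

	int swap_writepage(struct page *page, struct writeback_control *wbc)
	{
		struct folio *folio = page_folio(page);
		...
		if (is_folio_zero_filled(folio)) {
			/* record "all zeroes" in si->zeromap and skip the I/O */
			swap_zeromap_folio_set(folio);
			folio_unlock(folio);
			return 0;
		}
		/* not zero-filled: clear any stale zeromap bits */
		swap_zeromap_folio_clear(folio);
		...
	}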

Now, the situation is that when we re-access a swapped-out
page, we don’t always need to allocate a new page. Instead,
we can map it to zero_pfn and defer the allocation until the
page is written.

>
Barry Song Dec. 12, 2024, 9:16 a.m. UTC | #6
On Thu, Dec 12, 2024 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.12.24 09:46, Barry Song wrote:
> > On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
> >>
> >> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> >>> From: Barry Song <v-songbaohua@oppo.com>
> >>>
> >>> While developing the zeromap series, Usama observed that certain
> >>> workloads may contain over 10% zero-filled pages. This may present
> >>> an opportunity to save memory by mapping zero-filled pages to zero_pfn
> >>> in do_swap_page(). If a write occurs later, do_wp_page() can
> >>> allocate a new page using the Copy-on-Write mechanism.
> >>
> >> Shouldn't this be done during, or rather instead of, swap-out?
> >> Swapping all zero pages out just to optimize the in-memory
> >> representation seems rather backwards.
> >
> > I’m having trouble understanding your point—it seems like you might
> > not have fully read the code. :-)
> >
> > The situation is as follows: for a zero-filled page, we are currently
> > allocating a new
> > page unconditionally. By mapping this zero-filled page to zero_pfn, we could
> > save the memory used by this page.
> >
> > We don't need to allocate the memory until the page is written (which may never
> > happen).
>
> I think what Christoph means is that you would determine that at PTE
> unmap time, and directly place the zero page in there. So there would be
> no need to have the page fault at all.
>
> I suspect doing this at PTE unmap time might be problematic, because we might
> still have other (e.g., GUP) references modifying that page, and we can only
> rely on the page content being stable after we flushed the TLB as well.
> (I recall some deferred flushing optimizations)

Yes, we need to follow a strict sequence:

1. try_to_unmap - unmap PTEs in all processes;
2. try_to_unmap_flush_dirty - flush deferred TLB shootdown;
3. pageout - zeromap will set a bit in the bitmap if the page is zero-filled

At the moment of pageout(), we can be confident that the page is zero-filled.
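
In pseudo-C, that sequence is roughly (illustrative only; in the real
kernel these steps are spread across shrink_folio_list() and pageout()):

	/* 1. unmap the PTEs in all processes; the TLB flush may be deferred */
	try_to_unmap(folio, TTU_BATCH_FLUSH);

	/* 2. complete the deferred TLB shootdown; only after this can no CPU
	 * still be writing to the page through a stale TLB entry
	 */
	try_to_unmap_flush_dirty();

	/* 3. the content is now stable: pageout() -> swap_writepage() can
	 * test for zero fill and set the zeromap bit instead of doing I/O
	 */
	pageout(folio, mapping, &plug);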

Mapping to the zeropage during unmap seems quite risky.

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
Johannes Weiner Dec. 12, 2024, 4:25 p.m. UTC | #7
On Thu, Dec 12, 2024 at 10:16:22PM +1300, Barry Song wrote:
> On Thu, Dec 12, 2024 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 12.12.24 09:46, Barry Song wrote:
> > > On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
> > >>
> > >> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > >>> From: Barry Song <v-songbaohua@oppo.com>
> > >>>
> > >>> While developing the zeromap series, Usama observed that certain
> > >>> workloads may contain over 10% zero-filled pages. This may present
> > >>> an opportunity to save memory by mapping zero-filled pages to zero_pfn
> > >>> in do_swap_page(). If a write occurs later, do_wp_page() can
> > >>> allocate a new page using the Copy-on-Write mechanism.
> > >>
> > >> Shouldn't this be done during, or rather instead of, swap-out?
> > >> Swapping all zero pages out just to optimize the in-memory
> > >> representation seems rather backwards.
> > >
> > > I’m having trouble understanding your point—it seems like you might
> > > not have fully read the code. :-)
> > >
> > > The situation is as follows: for a zero-filled page, we are currently
> > > allocating a new
> > > page unconditionally. By mapping this zero-filled page to zero_pfn, we could
> > > save the memory used by this page.
> > >
> > > We don't need to allocate the memory until the page is written (which may never
> > > happen).
> >
> > I think what Christoph means is that you would determine that at PTE
> > unmap time, and directly place the zero page in there. So there would be
> > no need to have the page fault at all.
> >
> > I suspect doing this at PTE unmap time might be problematic, because we might
> > still have other (e.g., GUP) references modifying that page, and we can only
> > rely on the page content being stable after we flushed the TLB as well.
> > (I recall some deferred flushing optimizations)
> 
> Yes, we need to follow a strict sequence:
> 
> 1. try_to_unmap - unmap PTEs in all processes;
> 2. try_to_unmap_flush_dirty - flush deferred TLB shootdown;
> 3. pageout - zeromap will set a bit in the bitmap if the page is zero-filled
> 
> At the moment of pageout(), we can be confident that the page is zero-filled.
> 
> Mapping to the zeropage during unmap seems quite risky.

You have to unmap and flush to stop modifications, but I don't think you
need to do that in all processes before it's safe to decide. Shared anon pages have COW
semantics; when you enter try_to_unmap() with a page and rmap gives
you a pte, it's one of these:

  a) never forked, no sibling ptes
  b) cow broken into private copy, no sibling ptes
  c) cow/WP; any writes to this or another pte will go to a new page.

In cases a and b you need to unmap and flush the current pte, but then
it's safe to check contents and set the zero pte right away, even
before finishing the rmap walk.

In case c, modifications to the page are impossible due to WP, so you
don't even need to unmap and flush before checking the contents. The
pte lock holds up COW breaking to a new page until you're done.
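
To make these cases concrete, a very rough per-pte sketch (rollback and
swap-entry handling are omitted, and using PageAnonExclusive() as the
"no sibling ptes" test plus an open-coded memchr_inv() zero check are
assumptions of this sketch, not existing code):

	void *kaddr;
	bool zero;

	if (PageAnonExclusive(page)) {
		/* cases a/b: sole pte; unmap and flush first so no CPU
		 * can still write through a stale TLB entry
		 */
		ptep_clear_flush(vma, addr, ptep);
	}
	/* case c needs no unmap/flush before checking: the pte is
	 * write-protected and the pte lock holds off COW-breaking
	 */
	kaddr = kmap_local_page(page);
	zero = !memchr_inv(kaddr, 0, PAGE_SIZE);
	kunmap_local(kaddr);
	if (zero)
		set_pte_at(mm, addr, ptep, pte_mkspecial(
			   pfn_pte(my_zero_pfn(addr), vma->vm_page_prot)));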

It's definitely more complicated than the current implementation, but
if it can be made to work, we could get rid of the bitmap.

You might also reduce faults, but I'm a bit skeptical. Presumably
zerofilled regions are mostly considered invalid by the application,
not useful data, so a populating write that will cowbreak seems more
likely to happen next than a faultless read from the zeropage.
Barry Song Dec. 13, 2024, 1:47 a.m. UTC | #8
On Fri, Dec 13, 2024 at 5:25 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Dec 12, 2024 at 10:16:22PM +1300, Barry Song wrote:
> > On Thu, Dec 12, 2024 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > On 12.12.24 09:46, Barry Song wrote:
> > > > On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > >>
> > > >> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > > >>> From: Barry Song <v-songbaohua@oppo.com>
> > > >>>
> > > >>> While developing the zeromap series, Usama observed that certain
> > > >>> workloads may contain over 10% zero-filled pages. This may present
> > > >>> an opportunity to save memory by mapping zero-filled pages to zero_pfn
> > > >>> in do_swap_page(). If a write occurs later, do_wp_page() can
> > > >>> allocate a new page using the Copy-on-Write mechanism.
> > > >>
> > > >> Shouldn't this be done during, or rather instead of, swap-out?
> > > >> Swapping all zero pages out just to optimize the in-memory
> > > >> representation seems rather backwards.
> > > >
> > > > I’m having trouble understanding your point—it seems like you might
> > > > not have fully read the code. :-)
> > > >
> > > > The situation is as follows: for a zero-filled page, we are currently
> > > > allocating a new
> > > > page unconditionally. By mapping this zero-filled page to zero_pfn, we could
> > > > save the memory used by this page.
> > > >
> > > > We don't need to allocate the memory until the page is written (which may never
> > > > happen).
> > >
> > > I think what Christoph means is that you would determine that at PTE
> > > unmap time, and directly place the zero page in there. So there would be
> > > no need to have the page fault at all.
> > >
> > > I suspect doing this at PTE unmap time might be problematic, because we might
> > > still have other (e.g., GUP) references modifying that page, and we can only
> > > rely on the page content being stable after we flushed the TLB as well.
> > > (I recall some deferred flushing optimizations)
> >
> > Yes, we need to follow a strict sequence:
> >
> > 1. try_to_unmap - unmap PTEs in all processes;
> > 2. try_to_unmap_flush_dirty - flush deferred TLB shootdown;
> > 3. pageout - zeromap will set a bit in the bitmap if the page is zero-filled
> >
> > At the moment of pageout(), we can be confident that the page is zero-filled.
> >
> > Mapping to the zeropage during unmap seems quite risky.
>
> You have to unmap and flush to stop modifications, but I don't think you
> need to do that in all processes before it's safe to decide. Shared anon pages have COW
> semantics; when you enter try_to_unmap() with a page and rmap gives
> you a pte, it's one of these:
>
>   a) never forked, no sibling ptes
>   b) cow broken into private copy, no sibling ptes
>   c) cow/WP; any writes to this or another pte will go to a new page.
>
> In cases a and b you need to unmap and flush the current pte, but then
> it's safe to check contents and set the zero pte right away, even
> before finishing the rmap walk.
>
> In case c, modifications to the page are impossible due to WP, so you
> don't even need to unmap and flush before checking the contents. The
> pte lock holds up COW breaking to a new page until you're done.
>
> It's definitely more complicated than the current implementation, but
> if it can be made to work, we could get rid of the bitmap.
>
> You might also reduce faults, but I'm a bit skeptical. Presumably
> zerofilled regions are mostly considered invalid by the application,
> not useful data, so a populating write that will cowbreak seems more
> likely to happen next than a faultless read from the zeropage.

Yes. That is right.

I created the following debug patch to count what proportion of zero
swap-ins are reads, as well as to compare swpin_zero against
swpout_zero:

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f70d0958095c..ed9d1a6cc565 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -136,6 +136,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		SWAP_RA_HIT,
 		SWPIN_ZERO,
 		SWPOUT_ZERO,
+		SWPIN_ZERO_READ,
 #ifdef CONFIG_KSM
 		KSM_SWPIN_COPY,
 #endif
diff --git a/mm/memory.c b/mm/memory.c
index f3040c69f648..3aacfbe7bd77 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4400,6 +4400,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                                /* Count SWPIN_ZERO since page_io was skipped */
                                objcg = get_obj_cgroup_from_swap(entry);
                                count_vm_events(SWPIN_ZERO, 1);
+                               count_vm_events(SWPIN_ZERO_READ, 1);
                                if (objcg) {
                                        count_objcg_events(objcg, SWPIN_ZERO, 1);
                                        obj_cgroup_put(objcg);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4d016314a56c..9465fe9bda9e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1420,6 +1420,7 @@ const char * const vmstat_text[] = {
 	"swap_ra_hit",
 	"swpin_zero",
 	"swpout_zero",
+	"swpin_zero_read",
 #ifdef CONFIG_KSM
 	"ksm_swpin_copy",
 #endif


For a kernel-build workload in a single memcg with only 1GB of memory, I
used the script below:

#!/bin/bash

echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

vmstat_path="/proc/vmstat"
thp_base_path="/sys/kernel/mm/transparent_hugepage"

read_values() {
    # -w keeps "swpin_zero" from also matching "swpin_zero_read"
    pswpin=$(grep -w "pswpin" $vmstat_path | awk '{print $2}')
    pswpout=$(grep -w "pswpout" $vmstat_path | awk '{print $2}')
    pgpgin=$(grep -w "pgpgin" $vmstat_path | awk '{print $2}')
    pgpgout=$(grep -w "pgpgout" $vmstat_path | awk '{print $2}')
    swpout_zero=$(grep -w "swpout_zero" $vmstat_path | awk '{print $2}')
    swpin_zero=$(grep -w "swpin_zero" $vmstat_path | awk '{print $2}')
    swpin_zero_read=$(grep -w "swpin_zero_read" $vmstat_path | awk '{print $2}')
    
    echo "$pswpin $pswpout $pgpgin $pgpgout $swpout_zero $swpin_zero $swpin_zero_read"
}

for ((i=1; i<=5; i++))
do
  echo
  echo "*** Executing round $i ***"
  make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
  sync; echo 3 > /proc/sys/vm/drop_caches; sleep 1
  #kernel build
  initial_values=($(read_values))
  time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
        CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j10 1>/dev/null 2>/dev/null
  final_values=($(read_values))

  echo "pswpin: $((final_values[0] - initial_values[0]))"
  echo "pswpout: $((final_values[1] - initial_values[1]))"
  echo "pgpgin: $((final_values[2] - initial_values[2]))"
  echo "pgpgout: $((final_values[3] - initial_values[3]))"
  echo "swpout_zero: $((final_values[4] - initial_values[4]))"
  echo "swpin_zero: $((final_values[5] - initial_values[5]))"
  echo "swpin_zero_read: $((final_values[6] - initial_values[6]))"
done


The results I am seeing are as follows:

real	6m43.998s
user	47m3.800s
sys	5m7.169s
pswpin: 342041
pswpout: 1470846
pgpgin: 11744932
pgpgout: 14466564
swpout_zero: 318030
swpin_zero: 93621
swpin_zero_read: 13118

The proportion of zero_swpout is quite large (> 10%): 318,030 vs. 1,470,846.
The percentage is 17.8% = 318,030 / (318,030 + 1,470,846).

About 29.4% (93,621 / 318,030) of these will be swapped in, and 14% of those
zero_swpin pages are read (13,118 / 93,621).

Therefore, a total of 17.8% * 29.4% * 14% = 0.73% of all swapped-out pages
will be re-mapped to zero_pfn, potentially saving up to 0.73% RSS in this
kernel-build workload. Thus, the total build time in my final results falls
within the testing jitter range, showing no noticeable difference, while
the conceptual model program, with its many zero-filled pages and read
swap-ins, shows a significant difference.

I'm not sure if we can identify another real workload with more read swpin
to observe noticeable improvements. Perhaps Usama has some?

Thanks
Barry
Barry Song Dec. 13, 2024, 2:27 a.m. UTC | #9
On Fri, Dec 13, 2024 at 2:47 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Dec 13, 2024 at 5:25 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Dec 12, 2024 at 10:16:22PM +1300, Barry Song wrote:
> > > On Thu, Dec 12, 2024 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
> > > >
> > > > On 12.12.24 09:46, Barry Song wrote:
> > > > > On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > > >>
> > > > >> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > > > >>> From: Barry Song <v-songbaohua@oppo.com>
> > > > >>>
> > > > >>> While developing the zeromap series, Usama observed that certain
> > > > >>> workloads may contain over 10% zero-filled pages. This may present
> > > > >>> an opportunity to save memory by mapping zero-filled pages to zero_pfn
> > > > >>> in do_swap_page(). If a write occurs later, do_wp_page() can
> > > > >>> allocate a new page using the Copy-on-Write mechanism.
> > > > >>
> > > > >> Shouldn't this be done during, or rather instead of, swap-out?
> > > > >> Swapping all zero pages out just to optimize the in-memory
> > > > >> representation seems rather backwards.
> > > > >
> > > > > I’m having trouble understanding your point—it seems like you might
> > > > > not have fully read the code. :-)
> > > > >
> > > > > The situation is as follows: for a zero-filled page, we are currently
> > > > > allocating a new
> > > > > page unconditionally. By mapping this zero-filled page to zero_pfn, we could
> > > > > save the memory used by this page.
> > > > >
> > > > > We don't need to allocate the memory until the page is written (which may never
> > > > > happen).
> > > >
> > > > I think what Christoph means is that you would determine that at PTE
> > > > unmap time, and directly place the zero page in there. So there would be
> > > > no need to have the page fault at all.
> > > >
> > > > I suspect doing this at PTE unmap time might be problematic, because we might
> > > > still have other (e.g., GUP) references modifying that page, and we can only
> > > > rely on the page content being stable after we flushed the TLB as well.
> > > > (I recall some deferred flushing optimizations)
> > >
> > > Yes, we need to follow a strict sequence:
> > >
> > > 1. try_to_unmap - unmap PTEs in all processes;
> > > 2. try_to_unmap_flush_dirty - flush deferred TLB shootdown;
> > > 3. pageout - zeromap will set a bit in the bitmap if the page is zero-filled
> > >
> > > At the moment of pageout(), we can be confident that the page is zero-filled.
> > >
> > > Mapping to the zeropage during unmap seems quite risky.
> >
> > You have to unmap and flush to stop modifications, but I don't think you
> > need to do that in all processes before it's safe to decide. Shared anon pages have COW
> > semantics; when you enter try_to_unmap() with a page and rmap gives
> > you a pte, it's one of these:
> >
> >   a) never forked, no sibling ptes
> >   b) cow broken into private copy, no sibling ptes
> >   c) cow/WP; any writes to this or another pte will go to a new page.
> >
> > In cases a and b you need to unmap and flush the current pte, but then
> > it's safe to check contents and set the zero pte right away, even
> > before finishing the rmap walk.
> >
> > In case c, modifications to the page are impossible due to WP, so you
> > don't even need to unmap and flush before checking the contents. The
> > pte lock holds up COW breaking to a new page until you're done.
> >
> > It's definitely more complicated than the current implementation, but
> > if it can be made to work, we could get rid of the bitmap.
> >
> > You might also reduce faults, but I'm a bit skeptical. Presumably
> > zerofilled regions are mostly considered invalid by the application,
> > not useful data, so a populating write that will cowbreak seems more
> > likely to happen next than a faultless read from the zeropage.
>
> Yes. That is right.
>
> I created the following debug patch to count what proportion of zero
> swap-ins are reads, as well as to compare swpin_zero against
> swpout_zero:
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index f70d0958095c..ed9d1a6cc565 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -136,6 +136,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>                 SWAP_RA_HIT,
>                 SWPIN_ZERO,
>                 SWPOUT_ZERO,
> +               SWPIN_ZERO_READ,
>  #ifdef CONFIG_KSM
>                 KSM_SWPIN_COPY,
>  #endif
> diff --git a/mm/memory.c b/mm/memory.c
> index f3040c69f648..3aacfbe7bd77 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4400,6 +4400,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                                 /* Count SWPIN_ZERO since page_io was skipped */
>                                 objcg = get_obj_cgroup_from_swap(entry);
>                                 count_vm_events(SWPIN_ZERO, 1);
> +                               count_vm_events(SWPIN_ZERO_READ, 1);
>                                 if (objcg) {
>                                         count_objcg_events(objcg, SWPIN_ZERO, 1);
>                                         obj_cgroup_put(objcg);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4d016314a56c..9465fe9bda9e 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1420,6 +1420,7 @@ const char * const vmstat_text[] = {
>         "swap_ra_hit",
>         "swpin_zero",
>         "swpout_zero",
> +       "swpin_zero_read",
>  #ifdef CONFIG_KSM
>         "ksm_swpin_copy",
>  #endif
>
>
> For a kernel-build workload in a single memcg with only 1GB of memory, I
> used the script below:
>
> #!/bin/bash
>
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>
> vmstat_path="/proc/vmstat"
> thp_base_path="/sys/kernel/mm/transparent_hugepage"
>
> read_values() {
>     # -w keeps "swpin_zero" from also matching "swpin_zero_read"
>     pswpin=$(grep -w "pswpin" $vmstat_path | awk '{print $2}')
>     pswpout=$(grep -w "pswpout" $vmstat_path | awk '{print $2}')
>     pgpgin=$(grep -w "pgpgin" $vmstat_path | awk '{print $2}')
>     pgpgout=$(grep -w "pgpgout" $vmstat_path | awk '{print $2}')
>     swpout_zero=$(grep -w "swpout_zero" $vmstat_path | awk '{print $2}')
>     swpin_zero=$(grep -w "swpin_zero" $vmstat_path | awk '{print $2}')
>     swpin_zero_read=$(grep -w "swpin_zero_read" $vmstat_path | awk '{print $2}')
>
>     echo "$pswpin $pswpout $pgpgin $pgpgout $swpout_zero $swpin_zero $swpin_zero_read"
> }
>
> for ((i=1; i<=5; i++))
> do
>   echo
>   echo "*** Executing round $i ***"
>   make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
>   sync; echo 3 > /proc/sys/vm/drop_caches; sleep 1
>   #kernel build
>   initial_values=($(read_values))
>   time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>         CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j10 1>/dev/null 2>/dev/null
>   final_values=($(read_values))
>
>   echo "pswpin: $((final_values[0] - initial_values[0]))"
>   echo "pswpout: $((final_values[1] - initial_values[1]))"
>   echo "pgpgin: $((final_values[2] - initial_values[2]))"
>   echo "pgpgout: $((final_values[3] - initial_values[3]))"
>   echo "swpout_zero: $((final_values[4] - initial_values[4]))"
>   echo "swpin_zero: $((final_values[5] - initial_values[5]))"
>   echo "swpin_zero_read: $((final_values[6] - initial_values[6]))"
> done
>
>
> The results I am seeing are as follows:
>
> real    6m43.998s
> user    47m3.800s
> sys     5m7.169s
> pswpin: 342041
> pswpout: 1470846
> pgpgin: 11744932
> pgpgout: 14466564
> swpout_zero: 318030
> swpin_zero: 93621
> swpin_zero_read: 13118
>
> The proportion of zero_swpout is quite large (> 10%): 318,030 vs. 1,470,846.
> The percentage is 17.8% = 318,030 / (318,030 + 1,470,846).
>
> About 29.4% (93,621 / 318,030) of these will be swapped in, and 14% of those
> zero_swpin pages are read (13,118 / 93,621).
>
> Therefore, a total of 17.8% * 29.4% * 14% = 0.73% of all swapped-out pages
> will be re-mapped to zero_pfn, potentially saving up to 0.73% RSS in this
> kernel-build workload. Thus, the total build time in my final results falls

Apologies for the mistake in my math. I shouldn't have used swpout as the
denominator and swpin as the numerator. Instead, both the numerator and
denominator should be based on swpin.

Potentially, 13,118 swpin_zero_read / (342,041 pswpin + 13,118 swpin_zero_read)
could be saved for swap-in, meaning 3.7% of all swap-ins can be saved using
zero_pfn.

> within the testing jitter range, showing no noticeable difference, while
> the conceptual model program, with its many zero-filled pages and read
> swap-ins, shows a significant difference.

Although 3.7% of swap-ins can be saved, my x86 PC is too weak to
demonstrate the differences, as numerous factors—such as temperature
and unstable I/O latency—can impact the final build time.

Hopefully, others can share test results conducted on stable hardware and
more workloads.

>
> I'm not sure if we can identify another real workload with more read swpin
> to observe noticeable improvements. Perhaps Usama has some?
>

Thanks
Barry

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 2bacebbf4cf6..b37f0f61d0bc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4294,6 +4294,7 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	struct swap_info_struct *si = NULL;
 	rmap_t rmap_flags = RMAP_NONE;
 	bool need_clear_cache = false;
+	bool map_zero_pfn = false;
 	bool exclusive = false;
 	swp_entry_t entry;
 	pte_t pte;
@@ -4364,6 +4365,39 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	swapcache = folio;
 
 	if (!folio) {
+		/* Use the zero-page for reads */
+		if (!(vmf->flags & FAULT_FLAG_WRITE) &&
+		    !mm_forbids_zeropage(vma->vm_mm) &&
+		    __swap_count(entry) == 1)  {
+			swap_zeromap_batch(entry, 1, &map_zero_pfn);
+			if (map_zero_pfn) {
+				if (swapcache_prepare(entry, 1)) {
+					add_wait_queue(&swapcache_wq, &wait);
+					schedule_timeout_uninterruptible(1);
+					remove_wait_queue(&swapcache_wq, &wait);
+					goto out;
+				}
+				nr_pages = 1;
+				need_clear_cache = true;
+				pte = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
+						vma->vm_page_prot));
+				vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
+						&vmf->ptl);
+				if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte),
+						vmf->orig_pte)))
+					goto unlock;
+
+				page = pfn_to_page(my_zero_pfn(vmf->address));
+				arch_swap_restore(entry, page_folio(page));
+				swap_free_nr(entry, 1);
+				add_mm_counter(vma->vm_mm, MM_SWAPENTS, -1);
+				set_ptes(vma->vm_mm, vmf->address, vmf->pte, pte, 1);
+				arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address, pte, pte, 1);
+				update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+				goto unlock;
+			}
+		}
+
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
 			/* skip swapcache */