
[v7,2/2] mm: support large folios swap-in for sync io devices

Message ID 20240821074541.516249-3-hanchuanhua@oppo.com (mailing list archive)
State New
Series mm: Ignite large folios swap-in support

Commit Message

Chuanhua Han Aug. 21, 2024, 7:45 a.m. UTC
From: Chuanhua Han <hanchuanhua@oppo.com>

Currently, we have mTHP features, but unfortunately, without support for
large folio swap-in, these large folios are lost once they are swapped
out: mTHP swap is a one-way process. The lack of mTHP swap-in
functionality prevents mTHP from being used on devices like Android that
rely heavily on swap.

This patch introduces mTHP swap-in support. It starts from sync devices
such as zRAM. This is probably the simplest and most common use case,
benefiting billions of Android phones and similar devices with minimal
implementation cost. In this straightforward scenario, large folios are
always exclusive, eliminating the need to handle complex rmap and
swapcache issues.

It offers several benefits:
1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
   swap-out and swap-in. Large folios in the buddy system are also
   preserved as much as possible, rather than being fragmented due
   to swap-in.

2. Eliminates fragmentation in swap slots and supports successful
   THP_SWPOUT.

   w/o this patch (Refer to the data from Chris's and Kairui's latest
   swap allocator optimization while running ./thp_swap_allocator_test
   w/o "-a" option [1]):

   ./thp_swap_allocator_test
   Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
   Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
   Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
   Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
   Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
   Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
   Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
   Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
   Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
   Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
   Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
   Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
   Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
   Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
   Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
   Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%

   w/ this patch (always 0%):
   Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 9: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
   ...

3. With both mTHP swap-out and swap-in supported, we offer the option to enable
   zsmalloc compression/decompression with larger granularity[2]. The upcoming
   optimization in zsmalloc will significantly increase swap speed and improve
   compression efficiency. Tested by running 100 iterations of swapping 100MiB
   of anon memory, the swap speed improved dramatically:
                time consumption of swapin(ms)   time consumption of swapout(ms)
     lz4 4k                  45274                    90540
     lz4 64k                 22942                    55667
     zstdn 4k                85035                    186585
     zstdn 64k               46558                    118533

    The compression ratio also improved, as evaluated with 1 GiB of data:
     granularity   orig_data_size   compr_data_size
     4KiB-zstd      1048576000       246876055
     64KiB-zstd     1048576000       199763892

   Without mTHP swap-in, the potential optimizations in zsmalloc cannot be
   realized.

4. Even mTHP swap-in itself can reduce swap-in page faults by a factor
   of nr_pages. Swapping in content filled with the same data 0x11, w/o
   and w/ the patch for five rounds (Since the content is the same,
   decompression will be very fast. This primarily assesses the impact of
   reduced page faults):

  swp in bandwidth(bytes/ms)    w/o              w/
   round1                     624152          1127501
   round2                     631672          1127501
   round3                     620459          1139756
   round4                     606113          1139756
   round5                     624152          1152281
   avg                        621310          1137359      +83%

[1] https://lore.kernel.org/all/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
[2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/

Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 223 insertions(+), 27 deletions(-)

Comments

Shakeel Butt Aug. 21, 2024, 5:31 p.m. UTC | #1
On Wed, Aug 21, 2024 at 03:45:40PM GMT, hanchuanhua@oppo.com wrote:
> From: Chuanhua Han <hanchuanhua@oppo.com>
> 
> 
> 3. With both mTHP swap-out and swap-in supported, we offer the option to enable
>    zsmalloc compression/decompression with larger granularity[2]. The upcoming
>    optimization in zsmalloc will significantly increase swap speed and improve
>    compression efficiency. Tested by running 100 iterations of swapping 100MiB
>    of anon memory, the swap speed improved dramatically:
>                 time consumption of swapin(ms)   time consumption of swapout(ms)
>      lz4 4k                  45274                    90540
>      lz4 64k                 22942                    55667
>      zstdn 4k                85035                    186585
>      zstdn 64k               46558                    118533

Are the above number with the patch series at [2] or without? Also can
you explain your experiment setup or how can someone reproduce these?

> [2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
Barry Song Aug. 21, 2024, 9:13 p.m. UTC | #2
On Thu, Aug 22, 2024 at 1:31 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Aug 21, 2024 at 03:45:40PM GMT, hanchuanhua@oppo.com wrote:
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> >
> > 3. With both mTHP swap-out and swap-in supported, we offer the option to enable
> >    zsmalloc compression/decompression with larger granularity[2]. The upcoming
> >    optimization in zsmalloc will significantly increase swap speed and improve
> >    compression efficiency. Tested by running 100 iterations of swapping 100MiB
> >    of anon memory, the swap speed improved dramatically:
> >                 time consumption of swapin(ms)   time consumption of swapout(ms)
> >      lz4 4k                  45274                    90540
> >      lz4 64k                 22942                    55667
> >      zstdn 4k                85035                    186585
> >      zstdn 64k               46558                    118533
>
> Are the above number with the patch series at [2] or without? Also can
> you explain your experiment setup or how can someone reproduce these?

Hi Shakeel,

The data was recorded after applying both this patch (swap-in mTHP) and
patch [2] (compressing/decompressing mTHP instead of page). However,
without the swap-in series, patch [2] becomes useless because:

If we have a large object, such as 16 pages in zsmalloc:
do_swap_page will happen 16 times:
1. decompress the whole large object and copy one page;
2. decompress the whole large object and copy one page;
3. decompress the whole large object and copy one page;
....
16.  decompress the whole large object and copy one page;

So, patchset [2] will actually degrade performance rather than
enhance it if we don't have this swap-in series. This swap-in
series is a prerequisite for the zsmalloc/zram series.

We reproduced the data through the following simple steps:
1. Collected anonymous pages from a running phone and saved them to a file.
2. Used a small program to open and read the file into a mapped anonymous
memory region.
3. Did the following in the small program:
swapout_start_time
madv_pageout()
swapout_end_time

swapin_start_time
read_data()
swapin_end_time

We calculate the throughput of swapout and swapin using the difference between
end_time and start_time. Additionally, we record the memory usage of zram after
the swapout is complete.

>
> > [2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
>

Thanks
Barry
Shakeel Butt Aug. 23, 2024, 5:56 p.m. UTC | #3
Hi Barry,

On Thu, Aug 22, 2024 at 05:13:06AM GMT, Barry Song wrote:
> On Thu, Aug 22, 2024 at 1:31 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Wed, Aug 21, 2024 at 03:45:40PM GMT, hanchuanhua@oppo.com wrote:
> > > From: Chuanhua Han <hanchuanhua@oppo.com>
> > >
> > >
> > > 3. With both mTHP swap-out and swap-in supported, we offer the option to enable
> > >    zsmalloc compression/decompression with larger granularity[2]. The upcoming
> > >    optimization in zsmalloc will significantly increase swap speed and improve
> > >    compression efficiency. Tested by running 100 iterations of swapping 100MiB
> > >    of anon memory, the swap speed improved dramatically:
> > >                 time consumption of swapin(ms)   time consumption of swapout(ms)
> > >      lz4 4k                  45274                    90540
> > >      lz4 64k                 22942                    55667
> > >      zstdn 4k                85035                    186585
> > >      zstdn 64k               46558                    118533
> >
> > Are the above number with the patch series at [2] or without? Also can
> > you explain your experiment setup or how can someone reproduce these?
> 
> Hi Shakeel,
> 
> The data was recorded after applying both this patch (swap-in mTHP) and
> patch [2] (compressing/decompressing mTHP instead of page). However,
> without the swap-in series, patch [2] becomes useless because:
> 
> If we have a large object, such as 16 pages in zsmalloc:
> do_swap_page will happen 16 times:
> 1. decompress the whole large object and copy one page;
> 2. decompress the whole large object and copy one page;
> 3. decompress the whole large object and copy one page;
> ....
> 16.  decompress the whole large object and copy one page;
> 
> So, patchset [2] will actually degrade performance rather than
> enhance it if we don't have this swap-in series. This swap-in
> series is a prerequisite for the zsmalloc/zram series.

Thanks for the explanation.

> 
> We reproduced the data through the following simple steps:
> 1. Collected anonymous pages from a running phone and saved them to a file.
> 2. Used a small program to open and read the file into a mapped anonymous
> memory region.
> 3. Did the following in the small program:
> swapout_start_time
> madv_pageout()
> swapout_end_time
> 
> swapin_start_time
> read_data()
> swapin_end_time
> 
> We calculate the throughput of swapout and swapin using the difference between
> end_time and start_time. Additionally, we record the memory usage of zram after
> the swapout is complete.
> 

Please correct me if I am wrong but you are saying in your experiment,
100 MiB took 90540 ms to compress/swapout and 45274 ms to
decompress/swapin if backed by 4k pages but took 55667 ms and 22942 ms
if backed by 64k pages. Basically the table shows total time to compress
or decompress 100 MiB of memory, right?

> >
> > > [2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
> >
> 
> Thanks
> Barry
Barry Song Aug. 26, 2024, 7:46 p.m. UTC | #4
On Sat, Aug 24, 2024 at 5:56 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> Hi Barry,
>
> On Thu, Aug 22, 2024 at 05:13:06AM GMT, Barry Song wrote:
> > On Thu, Aug 22, 2024 at 1:31 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Wed, Aug 21, 2024 at 03:45:40PM GMT, hanchuanhua@oppo.com wrote:
> > > > From: Chuanhua Han <hanchuanhua@oppo.com>
> > > >
> > > >
> > > > 3. With both mTHP swap-out and swap-in supported, we offer the option to enable
> > > >    zsmalloc compression/decompression with larger granularity[2]. The upcoming
> > > >    optimization in zsmalloc will significantly increase swap speed and improve
> > > >    compression efficiency. Tested by running 100 iterations of swapping 100MiB
> > > >    of anon memory, the swap speed improved dramatically:
> > > >                 time consumption of swapin(ms)   time consumption of swapout(ms)
> > > >      lz4 4k                  45274                    90540
> > > >      lz4 64k                 22942                    55667
> > > >      zstdn 4k                85035                    186585
> > > >      zstdn 64k               46558                    118533
> > >
> > > Are the above number with the patch series at [2] or without? Also can
> > > you explain your experiment setup or how can someone reproduce these?
> >
> > Hi Shakeel,
> >
> > The data was recorded after applying both this patch (swap-in mTHP) and
> > patch [2] (compressing/decompressing mTHP instead of page). However,
> > without the swap-in series, patch [2] becomes useless because:
> >
> > If we have a large object, such as 16 pages in zsmalloc:
> > do_swap_page will happen 16 times:
> > 1. decompress the whole large object and copy one page;
> > 2. decompress the whole large object and copy one page;
> > 3. decompress the whole large object and copy one page;
> > ....
> > 16.  decompress the whole large object and copy one page;
> >
> > So, patchset [2] will actually degrade performance rather than
> > enhance it if we don't have this swap-in series. This swap-in
> > series is a prerequisite for the zsmalloc/zram series.
>
> Thanks for the explanation.
>
> >
> > We reproduced the data through the following simple steps:
> > 1. Collected anonymous pages from a running phone and saved them to a file.
> > 2. Used a small program to open and read the file into a mapped anonymous
> > memory region.
> > 3. Did the following in the small program:
> > swapout_start_time
> > madv_pageout()
> > swapout_end_time
> >
> > swapin_start_time
> > read_data()
> > swapin_end_time
> >
> > We calculate the throughput of swapout and swapin using the difference between
> > end_time and start_time. Additionally, we record the memory usage of zram after
> > the swapout is complete.
> >
>
> Please correct me if I am wrong but you are saying in your experiment,
> 100 MiB took 90540 ms to compress/swapout and 45274 ms to
> decompress/swapin if backed by 4k pages but took 55667 ms and 22942 ms
> if backed by 64k pages. Basically the table shows total time to compress
> or decompress 100 MiB of memory, right?

Hi Shakeel,
Tangquan (CC'd) collected the data and double-checked the case to confirm
the answer to your question.

We have three cases:
1. no mTHP swap-in, no zsmalloc/zram multi-page compression/decompression
2. have mTHP swap-in, no zsmalloc/zram multi-page compression/decompression
3. have mTHP swap-in, have zsmalloc/zram multi-page compression/decompression

The data above was 1 vs. 3.

To provide more precise data that covers each change, Tangquan tested
1 vs. 2 and 2 vs. 3 yesterday using LZ4 (the hardware might differ from
the previous test, but the data shows the same trend) per my request.

1. no mTHP swapin, no zsmalloc/zram patch
swapin_ms.   30336
swapout_ms. 65651

2. have mTHP swapin, no zsmalloc/zram patch
swapin_ms.   27161
swapout_ms. 61135

3. have mTHP swapin, have zsmalloc/zram patch
swapin_ms.   13683
swapout_ms. 43305

The test pseudocode is as follows:

addr=mmap(100M)
read_anon_data_from_file_to_addr();

for(i=0;i<100;i++) {
      swapout_start_time;
      madv_pageout();
      swapout_end_time;
      swapin_start_time;
      read_addr_to_swapin();
      swapin_end_time;
}
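
For reference, here is a minimal runnable sketch of that loop. The dump
file name "anon_dump.bin", the MADV_PAGEOUT fallback define and the error
handling are assumptions, since the exact test program is not posted here:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21                 /* available since Linux 5.4 */
#endif

#define SIZE (100UL << 20)              /* 100 MiB of anon data */

static long now_ms(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

int main(void)
{
        char *addr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        int fd = open("anon_dump.bin", O_RDONLY);  /* hypothetical dump file */
        long out_ms = 0, in_ms = 0, t;
        volatile char sink = 0;
        size_t off = 0, i;
        ssize_t n;
        int iter;

        if (addr == MAP_FAILED || fd < 0)
                return 1;
        while (off < SIZE && (n = read(fd, addr + off, SIZE - off)) > 0)
                off += n;               /* populate the anon mapping */

        for (iter = 0; iter < 100; iter++) {
                t = now_ms();
                madvise(addr, SIZE, MADV_PAGEOUT);      /* swap the range out */
                out_ms += now_ms() - t;

                t = now_ms();
                for (i = 0; i < SIZE; i += 4096)        /* fault everything back in */
                        sink += addr[i];
                in_ms += now_ms() - t;
        }
        printf("swapout_ms %ld swapin_ms %ld\n", out_ms, in_ms);
        return 0;
}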

So, while we saw some improvement from 1 to 2, the significant gains
come from using large blocks for compression and decompression.

This mTHP swap-in series ensures that mTHPs aren't lost after the first swap-in,
so the following 99 iterations continue to involve THP swap-out and
mTHP swap-in.
The improvement from 1 to 2 is due to this mTHP swap-in series, while the
improvement from 2 to 3 comes from the zsmalloc/zram patchset [2] you
mentioned.

[2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/

> > >
> >

Thanks
Barry
Kanchana P Sridhar Aug. 29, 2024, 1:01 a.m. UTC | #5
Hi Shakeel,

We submitted an RFC patchset [1] with the Intel In-Memory Analytics
Accelerator (Intel IAA) sometime back. This introduces a new 'canned-by_n'
compression algorithm in the IAA crypto driver.

Relative to software compressors, we could get a 10X improvement in zram
write latency and 7X improvement in zram read latency.

[1] https://lore.kernel.org/all/cover.1714581792.git.andre.glover@linux.intel.com/

Thanks,
Kanchana
Barry Song Aug. 29, 2024, 2:24 a.m. UTC | #6
On Thu, Aug 29, 2024 at 1:01 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi Shakeel,
>
> We submitted an RFC patchset [1] with the Intel In-Memory Analytics
> Accelerator (Intel IAA) sometime back. This introduces a new 'canned-by_n'
> compression algorithm in the IAA crypto driver.
>
> Relative to software compressors, we could get a 10X improvement in zram
> write latency and 7X improvement in zram read latency.
>
> [1] https://lore.kernel.org/all/cover.1714581792.git.andre.glover@linux.intel.com/

Hi Kanchana,
Thanks for sharing. I understand you’ll need this mTHP swap-in series
to leverage your IAA for parallel decompression, right? Without mTHP
swap-in, you won't get this 7X improvement, right?

This is another important use case for the mTHP swap-in series,
highlighting the strong need to start the work from the sync IO device.

I’ll try to find some time to review your patch and explore how we can
better support both software and hardware improvements in zsmalloc/zram
with a more compatible approach. Also, I have a talk [1] at LPC2024;
would you mind if I include a description of your use case?

[1] https://lpc.events/event/18/contributions/1780/

>
> Thanks,
> Kanchana

Thanks
Barry
Kanchana P Sridhar Aug. 29, 2024, 2:38 a.m. UTC | #7
Hi Barry,

> -----Original Message-----
> From: Barry Song <21cnbao@gmail.com>
> Sent: Wednesday, August 28, 2024 7:25 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: akpm@linux-foundation.org; baolin.wang@linux.alibaba.com;
> chrisl@kernel.org; david@redhat.com; hanchuanhua@oppo.com;
> hannes@cmpxchg.org; hch@infradead.org; hughd@google.com;
> kaleshsingh@google.com; kasong@tencent.com; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; mhocko@suse.com;
> minchan@kernel.org; nphamcs@gmail.com; ryan.roberts@arm.com;
> ryncsn@gmail.com; senozhatsky@chromium.org; shakeel.butt@linux.dev;
> shy828301@gmail.com; surenb@google.com; v-songbaohua@oppo.com;
> willy@infradead.org; xiang@kernel.org; Huang, Ying
> <ying.huang@intel.com>; yosryahmed@google.com;
> zhengtangquan@oppo.com; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 2/2] mm: support large folios swap-in for sync io
> devices
> 
> On Thu, Aug 29, 2024 at 1:01 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi Shakeel,
> >
> > We submitted an RFC patchset [1] with the Intel In-Memory Analytics
> > Accelerator (Intel IAA) sometime back. This introduces a new 'canned-by_n'
> > compression algorithm in the IAA crypto driver.
> >
> > Relative to software compressors, we could get a 10X improvement in zram
> > write latency and 7X improvement in zram read latency.
> >
> > [1] https://lore.kernel.org/all/cover.1714581792.git.andre.glover@linux.intel.com/
> 
> Hi Kanchana,
> Thanks for sharing. I understand you’ll need this mTHP swap-in series
> to leverage your IAA for parallel decompression, right? Without mTHP
> swap-in, you won't get this 7X improvement, right?

Yes, that is correct.

> 
> This is another important use case for the mTHP swap-in series,
> highlighting the strong need to start the work from the sync IO device.

Sure, this makes sense!

> 
> I’ll try to find some time to review your patch and explore how we can
> better support both software and hardware improvements in zsmalloc/zram
> with a more compatible approach. Also, I have a talk [1] at LPC2024;
> would you mind if I include a description of your use case?

Sure, this sounds good.

Thanks,
Kanchana

> 
> [1] https://lpc.events/event/18/contributions/1780/
> 
> >
> > Thanks,
> > Kanchana
> 
> Thanks
> Barry
Kairui Song Sept. 3, 2024, 6:24 p.m. UTC | #8
Hi All,


On Wed, Aug 21, 2024 at 3:46 PM <hanchuanhua@oppo.com> wrote:
>
> From: Chuanhua Han <hanchuanhua@oppo.com>
>
> Currently, we have mTHP features, but unfortunately, without support for
> large folio swap-in, these large folios are lost once they are swapped
> out: mTHP swap is a one-way process. The lack of mTHP swap-in
> functionality prevents mTHP from being used on devices like Android that
> rely heavily on swap.
>
> This patch introduces mTHP swap-in support. It starts from sync devices
> such as zRAM. This is probably the simplest and most common use case,
> benefiting billions of Android phones and similar devices with minimal
> implementation cost. In this straightforward scenario, large folios are
> always exclusive, eliminating the need to handle complex rmap and
> swapcache issues.
>
> It offers several benefits:
> 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
>    swap-out and swap-in. Large folios in the buddy system are also
>    preserved as much as possible, rather than being fragmented due
>    to swap-in.
>
> 2. Eliminates fragmentation in swap slots and supports successful
>    THP_SWPOUT.
>
>    w/o this patch (Refer to the data from Chris's and Kairui's latest
>    swap allocator optimization while running ./thp_swap_allocator_test
>    w/o "-a" option [1]):
>
>    ./thp_swap_allocator_test
>    Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
>    Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
>    Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
>    Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
>    Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
>    Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
>    Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
>    Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
>    Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
>    Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
>    Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
>    Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
>    Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
>    Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
>    Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
>    Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%
>
>    w/ this patch (always 0%):
>    Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 9: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
>    ...
>
> 3. With both mTHP swap-out and swap-in supported, we offer the option to enable
>    zsmalloc compression/decompression with larger granularity[2]. The upcoming
>    optimization in zsmalloc will significantly increase swap speed and improve
>    compression efficiency. Tested by running 100 iterations of swapping 100MiB
>    of anon memory, the swap speed improved dramatically:
>                 time consumption of swapin(ms)   time consumption of swapout(ms)
>      lz4 4k                  45274                    90540
>      lz4 64k                 22942                    55667
>      zstdn 4k                85035                    186585
>      zstdn 64k               46558                    118533
>
>     The compression ratio also improved, as evaluated with 1 GiB of data:
>      granularity   orig_data_size   compr_data_size
>      4KiB-zstd      1048576000       246876055
>      64KiB-zstd     1048576000       199763892
>
>    Without mTHP swap-in, the potential optimizations in zsmalloc cannot be
>    realized.
>
> 4. Even mTHP swap-in itself can reduce swap-in page faults by a factor
>    of nr_pages. Swapping in content filled with the same data 0x11, w/o
>    and w/ the patch for five rounds (Since the content is the same,
>    decompression will be very fast. This primarily assesses the impact of
>    reduced page faults):
>
>   swp in bandwidth(bytes/ms)    w/o              w/
>    round1                     624152          1127501
>    round2                     631672          1127501
>    round3                     620459          1139756
>    round4                     606113          1139756
>    round5                     624152          1152281
>    avg                        621310          1137359      +83%
>
> [1] https://lore.kernel.org/all/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
> [2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
>
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 223 insertions(+), 27 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b9fe2f354878..7aa0358a4160 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3986,6 +3986,184 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>         return VM_FAULT_SIGBUS;
>  }
>
> +static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
> +{
> +       struct vm_area_struct *vma = vmf->vma;
> +       struct folio *folio;
> +       swp_entry_t entry;
> +
> +       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma,
> +                               vmf->address, false);
> +       if (!folio)
> +               return NULL;
> +
> +       entry = pte_to_swp_entry(vmf->orig_pte);
> +       if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
> +                                          GFP_KERNEL, entry)) {
> +               folio_put(folio);
> +               return NULL;
> +       }
> +
> +       return folio;
> +}
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +/*
> + * Check if the PTEs within a range are contiguous swap entries
> + * and have no cache when check_no_cache is true.
> + */
> +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
> +                          int nr_pages, bool check_no_cache)
> +{
> +       struct swap_info_struct *si;
> +       unsigned long addr;
> +       swp_entry_t entry;
> +       pgoff_t offset;
> +       int idx, i;
> +       pte_t pte;
> +
> +       addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> +       idx = (vmf->address - addr) / PAGE_SIZE;
> +       pte = ptep_get(ptep);
> +
> +       if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
> +               return false;
> +       entry = pte_to_swp_entry(pte);
> +       offset = swp_offset(entry);
> +       if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
> +               return false;
> +
> +       if (!check_no_cache)
> +               return true;
> +
> +       si = swp_swap_info(entry);
> +       /*
> +        * While allocating a large folio and doing swap_read_folio, which is
> +        * the case the being faulted pte doesn't have swapcache. We need to
> +        * ensure all PTEs have no cache as well, otherwise, we might go to
> +        * swap devices while the content is in swapcache.
> +        */
> +       for (i = 0; i < nr_pages; i++) {
> +               if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
> +                       return false;
> +       }
> +
> +       return true;
> +}
> +
> +static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
> +                                                    unsigned long addr,
> +                                                    unsigned long orders)
> +{
> +       int order, nr;
> +
> +       order = highest_order(orders);
> +
> +       /*
> +        * To swap in a THP with nr pages, we require that its first swap_offset
> +        * is aligned with that number, as it was when the THP was swapped out.
> +        * This helps filter out most invalid entries.
> +        */
> +       while (orders) {
> +               nr = 1 << order;
> +               if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr)
> +                       break;
> +               order = next_order(&orders, order);
> +       }
> +
> +       return orders;
> +}
> +
> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> +{
> +       struct vm_area_struct *vma = vmf->vma;
> +       unsigned long orders;
> +       struct folio *folio;
> +       unsigned long addr;
> +       swp_entry_t entry;
> +       spinlock_t *ptl;
> +       pte_t *pte;
> +       gfp_t gfp;
> +       int order;
> +
> +       /*
> +        * If uffd is active for the vma we need per-page fault fidelity to
> +        * maintain the uffd semantics.
> +        */
> +       if (unlikely(userfaultfd_armed(vma)))
> +               goto fallback;
> +
> +       /*
> +        * A large swapped out folio could be partially or fully in zswap. We
> +        * lack handling for such cases, so fallback to swapping in order-0
> +        * folio.
> +        */
> +       if (!zswap_never_enabled())
> +               goto fallback;
> +
> +       entry = pte_to_swp_entry(vmf->orig_pte);
> +       /*
> +        * Get a list of all the (large) orders below PMD_ORDER that are enabled
> +        * and suitable for swapping THP.
> +        */
> +       orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> +                       TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> +       orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> +       orders = thp_swap_suitable_orders(swp_offset(entry),
> +                                         vmf->address, orders);
> +
> +       if (!orders)
> +               goto fallback;
> +
> +       pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> +                                 vmf->address & PMD_MASK, &ptl);
> +       if (unlikely(!pte))
> +               goto fallback;
> +
> +       /*
> +        * For do_swap_page, find the highest order where the aligned range is
> +        * completely swap entries with contiguous swap offsets.
> +        */
> +       order = highest_order(orders);
> +       while (orders) {
> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +               if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order, true))
> +                       break;
> +               order = next_order(&orders, order);
> +       }
> +
> +       pte_unmap_unlock(pte, ptl);
> +
> +       /* Try allocating the highest of the remaining orders. */
> +       gfp = vma_thp_gfp_mask(vma);
> +       while (orders) {
> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +               folio = vma_alloc_folio(gfp, order, vma, addr, true);
> +               if (folio) {
> +                       if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
> +                                                           gfp, entry))
> +                               return folio;
> +                       folio_put(folio);
> +               }
> +               order = next_order(&orders, order);
> +       }
> +
> +fallback:
> +       return __alloc_swap_folio(vmf);
> +}
> +#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
> +static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
> +                                 int nr_pages, bool check_no_cache)
> +{
> +       return false;
> +}
> +
> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> +{
> +       return __alloc_swap_folio(vmf);
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -4074,34 +4252,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (!folio) {
>                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>                     __swap_count(entry) == 1) {
> -                       /*
> -                        * Prevent parallel swapin from proceeding with
> -                        * the cache flag. Otherwise, another thread may
> -                        * finish swapin first, free the entry, and swapout
> -                        * reusing the same entry. It's undetectable as
> -                        * pte_same() returns true due to entry reuse.
> -                        */
> -                       if (swapcache_prepare(entry, 1)) {
> -                               /* Relax a bit to prevent rapid repeated page faults */
> -                               schedule_timeout_uninterruptible(1);
> -                               goto out;
> -                       }
> -                       need_clear_cache = true;
> -
>                         /* skip swapcache */
> -                       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> -                                               vma, vmf->address, false);
> +                       folio = alloc_swap_folio(vmf);
>                         if (folio) {
>                                 __folio_set_locked(folio);
>                                 __folio_set_swapbacked(folio);
>
> -                               if (mem_cgroup_swapin_charge_folio(folio,
> -                                                       vma->vm_mm, GFP_KERNEL,
> -                                                       entry)) {
> -                                       ret = VM_FAULT_OOM;
> +                               nr_pages = folio_nr_pages(folio);
> +                               if (folio_test_large(folio))
> +                                       entry.val = ALIGN_DOWN(entry.val, nr_pages);
> +                               /*
> +                                * Prevent parallel swapin from proceeding with
> +                                * the cache flag. Otherwise, another thread
> +                                * may finish swapin first, free the entry, and
> +                                * swapout reusing the same entry. It's
> +                                * undetectable as pte_same() returns true due
> +                                * to entry reuse.
> +                                */
> +                               if (swapcache_prepare(entry, nr_pages)) {
> +                                       /*
> +                                        * Relax a bit to prevent rapid
> +                                        * repeated page faults.
> +                                        */
> +                                       schedule_timeout_uninterruptible(1);
>                                         goto out_page;
>                                 }
> -                               mem_cgroup_swapin_uncharge_swap(entry, 1);
> +                               need_clear_cache = true;
> +
> +                               mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
>
>                                 shadow = get_shadow_from_swap_cache(entry);
>                                 if (shadow)
> @@ -4207,6 +4385,23 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 goto out_nomap;
>         }
>
> +       /* allocated large folios for SWP_SYNCHRONOUS_IO */
> +       if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
> +               unsigned long nr = folio_nr_pages(folio);
> +               unsigned long folio_start = ALIGN_DOWN(vmf->address,
> +                                                      nr * PAGE_SIZE);
> +               unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
> +               pte_t *folio_ptep = vmf->pte - idx;
> +
> +               if (!can_swapin_thp(vmf, folio_ptep, nr, false))
> +                       goto out_nomap;
> +
> +               page_idx = idx;
> +               address = folio_start;
> +               ptep = folio_ptep;
> +               goto check_folio;
> +       }
> +
>         nr_pages = 1;
>         page_idx = 0;
>         address = vmf->address;
> @@ -4338,11 +4533,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 folio_add_lru_vma(folio, vma);
>         } else if (!folio_test_anon(folio)) {
>                 /*
> -                * We currently only expect small !anon folios, which are either
> -                * fully exclusive or fully shared. If we ever get large folios
> -                * here, we have to be careful.
> +                * We currently only expect small !anon folios which are either
> +                * fully exclusive or fully shared, or new allocated large
> +                * folios which are fully exclusive. If we ever get large
> +                * folios within swapcache here, we have to be careful.
>                  */
> -               VM_WARN_ON_ONCE(folio_test_large(folio));
> +               VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
>                 VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
>                 folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
>         } else {
> @@ -4385,7 +4581,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  out:
>         /* Clear the swap cache pin for direct swapin after PTL unlock */
>         if (need_clear_cache)
> -               swapcache_clear(si, entry, 1);
> +               swapcache_clear(si, entry, nr_pages);
>         if (si)
>                 put_swap_device(si);
>         return ret;
> @@ -4401,7 +4597,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 folio_put(swapcache);
>         }
>         if (need_clear_cache)
> -               swapcache_clear(si, entry, 1);
> +               swapcache_clear(si, entry, nr_pages);
>         if (si)
>                 put_swap_device(si);
>         return ret;
> --
> 2.43.0
>

With latest mm-unstable, I'm seeing following WARN followed by user
space segfaults (multiple mTHP enabled):

[   39.145686] ------------[ cut here ]------------
[   39.145969] WARNING: CPU: 24 PID: 11159 at mm/page_io.c:535
swap_read_folio+0x4db/0x520
[   39.146307] Modules linked in:
[   39.146507] CPU: 24 UID: 1000 PID: 11159 Comm: sh Kdump: loaded Not
tainted 6.11.0-rc6.orig+ #131
[   39.146887] Hardware name: Tencent Cloud CVM, BIOS
seabios-1.9.1-qemu-project.org 04/01/2014
[   39.147206] RIP: 0010:swap_read_folio+0x4db/0x520
[   39.147430] Code: 00 e0 ff ff 09 c1 83 f8 08 0f 42 d1 e9 c4 fe ff
ff 48 63 85 34 02 00 00 48 03 45 08 49 39 c4 0f 85 63 fe ff ff e9 db
fe ff ff <0f> 0b e9 91 fd ff ff 31 d2 e9 9d fe ff ff 48 c7 c6 38 b6 4e
82 48
[   39.148079] RSP: 0000:ffffc900045c3ce0 EFLAGS: 00010202
[   39.148390] RAX: 0017ffffd0020061 RBX: ffffea00064d4c00 RCX: 03ffffffffffffff
[   39.148737] RDX: ffffea00064d4c00 RSI: 0000000000000000 RDI: ffffea00064d4c00
[   39.149102] RBP: 0000000000000001 R08: ffffea00064d4c00 R09: 0000000000000078
[   39.149482] R10: 00000000000000f0 R11: 0000000000000004 R12: 0000000000001000
[   39.149832] R13: ffff888102df5c00 R14: ffff888102df5c00 R15: 0000000000000003
[   39.150177] FS:  00007f51a56c9540(0000) GS:ffff888fffc00000(0000)
knlGS:0000000000000000
[   39.150623] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   39.150950] CR2: 000055627b13fda0 CR3: 00000001083e2000 CR4: 00000000003506b0
[   39.151317] Call Trace:
[   39.151565]  <TASK>
[   39.151778]  ? __warn+0x84/0x130
[   39.152044]  ? swap_read_folio+0x4db/0x520
[   39.152345]  ? report_bug+0xfc/0x1e0
[   39.152614]  ? handle_bug+0x3f/0x70
[   39.152891]  ? exc_invalid_op+0x17/0x70
[   39.153178]  ? asm_exc_invalid_op+0x1a/0x20
[   39.153467]  ? swap_read_folio+0x4db/0x520
[   39.153753]  do_swap_page+0xc6d/0x14f0
[   39.154054]  ? srso_return_thunk+0x5/0x5f
[   39.154361]  __handle_mm_fault+0x758/0x850
[   39.154645]  handle_mm_fault+0x134/0x340
[   39.154945]  do_user_addr_fault+0x2e5/0x760
[   39.155245]  exc_page_fault+0x6a/0x140
[   39.155546]  asm_exc_page_fault+0x26/0x30
[   39.155847] RIP: 0033:0x55627b071446
[   39.156124] Code: f6 7e 19 83 e3 01 74 14 41 83 ee 01 44 89 35 25
72 0c 00 45 85 ed 0f 88 73 02 00 00 8b 05 ea 74 0c 00 85 c0 0f 85 da
03 00 00 <44> 8b 15 53 e9 0c 00 45 85 d2 74 2e 44 8b 0d 37 e3 0c 00 45
85 c9
[   39.156944] RSP: 002b:00007ffd619d54f0 EFLAGS: 00010246
[   39.157237] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f51a44f968b
[   39.157594] RDX: 0000000000000000 RSI: 00007ffd619d5518 RDI: 00000000ffffffff
[   39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
[   39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[   39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518
[   39.158998]  </TASK>
[   39.159226] ---[ end trace 0000000000000000 ]---

After reverting this or Usama's "mm: store zero pages to be swapped
out in a bitmap", the problem is gone. I think these two patches may
have some conflict that needs to be resolved.
Yosry Ahmed Sept. 3, 2024, 6:38 p.m. UTC | #9
[..]
>
> With latest mm-unstable, I'm seeing following WARN followed by user
> space segfaults (multiple mTHP enabled):
>
> [   39.145686] ------------[ cut here ]------------
> [   39.145969] WARNING: CPU: 24 PID: 11159 at mm/page_io.c:535
> swap_read_folio+0x4db/0x520
> [   39.146307] Modules linked in:
> [   39.146507] CPU: 24 UID: 1000 PID: 11159 Comm: sh Kdump: loaded Not
> tainted 6.11.0-rc6.orig+ #131
> [   39.146887] Hardware name: Tencent Cloud CVM, BIOS
> seabios-1.9.1-qemu-project.org 04/01/2014
> [   39.147206] RIP: 0010:swap_read_folio+0x4db/0x520
> [   39.147430] Code: 00 e0 ff ff 09 c1 83 f8 08 0f 42 d1 e9 c4 fe ff
> ff 48 63 85 34 02 00 00 48 03 45 08 49 39 c4 0f 85 63 fe ff ff e9 db
> fe ff ff <0f> 0b e9 91 fd ff ff 31 d2 e9 9d fe ff ff 48 c7 c6 38 b6 4e
> 82 48
> [   39.148079] RSP: 0000:ffffc900045c3ce0 EFLAGS: 00010202
> [   39.148390] RAX: 0017ffffd0020061 RBX: ffffea00064d4c00 RCX: 03ffffffffffffff
> [   39.148737] RDX: ffffea00064d4c00 RSI: 0000000000000000 RDI: ffffea00064d4c00
> [   39.149102] RBP: 0000000000000001 R08: ffffea00064d4c00 R09: 0000000000000078
> [   39.149482] R10: 00000000000000f0 R11: 0000000000000004 R12: 0000000000001000
> [   39.149832] R13: ffff888102df5c00 R14: ffff888102df5c00 R15: 0000000000000003
> [   39.150177] FS:  00007f51a56c9540(0000) GS:ffff888fffc00000(0000)
> knlGS:0000000000000000
> [   39.150623] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   39.150950] CR2: 000055627b13fda0 CR3: 00000001083e2000 CR4: 00000000003506b0
> [   39.151317] Call Trace:
> [   39.151565]  <TASK>
> [   39.151778]  ? __warn+0x84/0x130
> [   39.152044]  ? swap_read_folio+0x4db/0x520
> [   39.152345]  ? report_bug+0xfc/0x1e0
> [   39.152614]  ? handle_bug+0x3f/0x70
> [   39.152891]  ? exc_invalid_op+0x17/0x70
> [   39.153178]  ? asm_exc_invalid_op+0x1a/0x20
> [   39.153467]  ? swap_read_folio+0x4db/0x520
> [   39.153753]  do_swap_page+0xc6d/0x14f0
> [   39.154054]  ? srso_return_thunk+0x5/0x5f
> [   39.154361]  __handle_mm_fault+0x758/0x850
> [   39.154645]  handle_mm_fault+0x134/0x340
> [   39.154945]  do_user_addr_fault+0x2e5/0x760
> [   39.155245]  exc_page_fault+0x6a/0x140
> [   39.155546]  asm_exc_page_fault+0x26/0x30
> [   39.155847] RIP: 0033:0x55627b071446
> [   39.156124] Code: f6 7e 19 83 e3 01 74 14 41 83 ee 01 44 89 35 25
> 72 0c 00 45 85 ed 0f 88 73 02 00 00 8b 05 ea 74 0c 00 85 c0 0f 85 da
> 03 00 00 <44> 8b 15 53 e9 0c 00 45 85 d2 74 2e 44 8b 0d 37 e3 0c 00 45
> 85 c9
> [   39.156944] RSP: 002b:00007ffd619d54f0 EFLAGS: 00010246
> [   39.157237] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f51a44f968b
> [   39.157594] RDX: 0000000000000000 RSI: 00007ffd619d5518 RDI: 00000000ffffffff
> [   39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
> [   39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> [   39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518
> [   39.158998]  </TASK>
> [   39.159226] ---[ end trace 0000000000000000 ]---
>
> After reverting this or Usama's "mm: store zero pages to be swapped
> out in a bitmap", the problem is gone. I think these two patches may
> have some conflict that needs to be resolved.

Yup. I saw this conflict coming and specifically asked for this
warning to be added in Usama's patch to catch it [1]. It served its
purpose.

Usama's patch does not handle large folio swapin, because at the time
it was written we didn't have it. We expected Usama's series to land
sooner than this one, so the warning was to make sure that this series
handles large folio swapin in the zeromap code. Now that they are both
in mm-unstable, we are gonna have to figure this out.

I suspect Usama's patches are closer to land so it's better to handle
this in this series, but I will leave it up to Usama and
Chuanhua/Barry to figure this out :)

[1]https://lore.kernel.org/lkml/CAJD7tkbpXjg00CRSrXU_pbaHwEaW1b3k8AQgu8y2PAh7EkTOug@mail.gmail.com/
Andrew Morton Sept. 3, 2024, 8:07 p.m. UTC | #10
On Tue, 3 Sep 2024 11:38:37 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:

> > [   39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
> > [   39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> > [   39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518
> > [   39.158998]  </TASK>
> > [   39.159226] ---[ end trace 0000000000000000 ]---
> >
> > After reverting this or Usama's "mm: store zero pages to be swapped
> > out in a bitmap", the problem is gone. I think these two patches may
> > have some conflict that needs to be resolved.
> 
> Yup. I saw this conflict coming and specifically asked for this
> warning to be added in Usama's patch to catch it [1]. It served its
> purpose.
> 
> Usama's patch does not handle large folio swapin, because at the time
> it was written we didn't have it. We expected Usama's series to land
> sooner than this one, so the warning was to make sure that this series
> handles large folio swapin in the zeromap code. Now that they are both
> in mm-unstable, we are gonna have to figure this out.
> 
> I suspect Usama's patches are closer to land so it's better to handle
> this in this series, but I will leave it up to Usama and
> Chuanhua/Barry to figure this out :)
> 
> [1]https://lore.kernel.org/lkml/CAJD7tkbpXjg00CRSrXU_pbaHwEaW1b3k8AQgu8y2PAh7EkTOug@mail.gmail.com/

Thanks.  To unbreak -next I'll drop the two-patch series "mm: Ignite
large folios swap-in support" for now.

btw, next time can we please call it "enable large folios swap-in
support"?  "ignite" doesn't make much sense here.
Barry Song Sept. 3, 2024, 9:36 p.m. UTC | #11
On Wed, Sep 4, 2024 at 8:08 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 3 Sep 2024 11:38:37 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > > [   39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
> > > [   39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> > > [   39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518
> > > [   39.158998]  </TASK>
> > > [   39.159226] ---[ end trace 0000000000000000 ]---
> > >
> > > After reverting this or Usama's "mm: store zero pages to be swapped
> > > out in a bitmap", the problem is gone. I think these two patches may
> > > have some conflict that needs to be resolved.
> >
> > Yup. I saw this conflict coming and specifically asked for this
> > warning to be added in Usama's patch to catch it [1]. It served its
> > purpose.
> >
> > Usama's patch does not handle large folio swapin, because at the time
> > it was written we didn't have it. We expected Usama's series to land
> > sooner than this one, so the warning was to make sure that this series
> > handles large folio swapin in the zeromap code. Now that they are both
> > in mm-unstable, we are gonna have to figure this out.
> >
> > I suspect Usama's patches are closer to land so it's better to handle
> > this in this series, but I will leave it up to Usama and
> > Chuanhua/Barry to figure this out :)

I believe handling this in swap-in might violate layer separation.
`swap_read_folio()` should be a reliable API to call, regardless of
whether `zeromap` is present. Therefore, the fix should likely be
within `zeromap` but not this `swap-in`. I’ll take a look at this with
Usama :-)

> >
> > [1]https://lore.kernel.org/lkml/CAJD7tkbpXjg00CRSrXU_pbaHwEaW1b3k8AQgu8y2PAh7EkTOug@mail.gmail.com/
>
> Thanks.  To unbreak -next I'll drop the two-patch series "mm: Ignite
> large folios swap-in support" for now.
>
> btw, next time can we please call it "enable large folios swap-in
> support"?  "ignite" doesn't make much sense here.

sure.
>

Thanks
Barry
Yosry Ahmed Sept. 3, 2024, 10:05 p.m. UTC | #12
On Tue, Sep 3, 2024 at 2:36 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Sep 4, 2024 at 8:08 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Tue, 3 Sep 2024 11:38:37 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > > > [   39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
> > > > [   39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> > > > [   39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518
> > > > [   39.158998]  </TASK>
> > > > [   39.159226] ---[ end trace 0000000000000000 ]---
> > > >
> > > > After reverting this or Usama's "mm: store zero pages to be swapped
> > > > out in a bitmap", the problem is gone. I think these two patches may
> > > > have some conflict that needs to be resolved.
> > >
> > > Yup. I saw this conflict coming and specifically asked for this
> > > warning to be added in Usama's patch to catch it [1]. It served its
> > > purpose.
> > >
> > > Usama's patch does not handle large folio swapin, because at the time
> > > it was written we didn't have it. We expected Usama's series to land
> > > sooner than this one, so the warning was to make sure that this series
> > > handles large folio swapin in the zeromap code. Now that they are both
> > > in mm-unstable, we are gonna have to figure this out.
> > >
> > > I suspect Usama's patches are closer to land so it's better to handle
> > > this in this series, but I will leave it up to Usama and
> > > Chuanhua/Barry to figure this out :)
>
> I believe handling this in swap-in might violate layer separation.
> `swap_read_folio()` should be a reliable API to call, regardless of
> whether `zeromap` is present. Therefore, the fix should likely be
> within `zeromap` but not this `swap-in`. I’ll take a look at this with
> Usama :-)

I meant handling it within this series to avoid blocking Usama
patches, not within this code. Thanks for taking a look, I am sure you
and Usama will figure out the best way forward :)
Usama Arif Sept. 4, 2024, 9:30 p.m. UTC | #13
On 03/09/2024 23:05, Yosry Ahmed wrote:
> On Tue, Sep 3, 2024 at 2:36 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Wed, Sep 4, 2024 at 8:08 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>>>
>>> On Tue, 3 Sep 2024 11:38:37 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:
>>>
>>>>> [   39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
>>>>> [   39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
>>>>> [   39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518
>>>>> [   39.158998]  </TASK>
>>>>> [   39.159226] ---[ end trace 0000000000000000 ]---
>>>>>
>>>>> After reverting this or Usama's "mm: store zero pages to be swapped
>>>>> out in a bitmap", the problem is gone. I think these two patches may
>>>>> have some conflict that needs to be resolved.
>>>>
>>>> Yup. I saw this conflict coming and specifically asked for this
>>>> warning to be added in Usama's patch to catch it [1]. It served its
>>>> purpose.
>>>>
>>>> Usama's patch does not handle large folio swapin, because at the time
>>>> it was written we didn't have it. We expected Usama's series to land
>>>> sooner than this one, so the warning was to make sure that this series
>>>> handles large folio swapin in the zeromap code. Now that they are both
>>>> in mm-unstable, we are gonna have to figure this out.
>>>>
>>>> I suspect Usama's patches are closer to land so it's better to handle
>>>> this in this series, but I will leave it up to Usama and
>>>> Chuanhua/Barry to figure this out :)
>>
>> I believe handling this in swap-in might violate layer separation.
>> `swap_read_folio()` should be a reliable API to call, regardless of
>> whether `zeromap` is present. Therefore, the fix should likely be
>> within `zeromap` but not this `swap-in`. I’ll take a look at this with
>> Usama :-)
> 
> I meant handling it within this series to avoid blocking Usama
> patches, not within this code. Thanks for taking a look, I am sure you
> and Usama will figure out the best way forward :)

Hi Barry and Yosry,

Is the best (and quickest) way forward to have a v8 of this with
https://lore.kernel.org/all/20240904055522.2376-1-21cnbao@gmail.com/
as the first patch, and using swap_zeromap_entries_count in alloc_swap_folio
in this support large folios swap-in patch?

Thanks,
Usama
Barry Song Sept. 4, 2024, 11:10 p.m. UTC | #14
On Thu, Sep 5, 2024 at 9:30 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 03/09/2024 23:05, Yosry Ahmed wrote:
> > On Tue, Sep 3, 2024 at 2:36 PM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Wed, Sep 4, 2024 at 8:08 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >>>
> >>> On Tue, 3 Sep 2024 11:38:37 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:
> >>>
> >>>>> [   39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
> >>>>> [   39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> >>>>> [   39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518
> >>>>> [   39.158998]  </TASK>
> >>>>> [   39.159226] ---[ end trace 0000000000000000 ]---
> >>>>>
> >>>>> After reverting this or Usama's "mm: store zero pages to be swapped
> >>>>> out in a bitmap", the problem is gone. I think these two patches may
> >>>>> have some conflict that needs to be resolved.
> >>>>
> >>>> Yup. I saw this conflict coming and specifically asked for this
> >>>> warning to be added in Usama's patch to catch it [1]. It served its
> >>>> purpose.
> >>>>
> >>>> Usama's patch does not handle large folio swapin, because at the time
> >>>> it was written we didn't have it. We expected Usama's series to land
> >>>> sooner than this one, so the warning was to make sure that this series
> >>>> handles large folio swapin in the zeromap code. Now that they are both
> >>>> in mm-unstable, we are gonna have to figure this out.
> >>>>
> >>>> I suspect Usama's patches are closer to land so it's better to handle
> >>>> this in this series, but I will leave it up to Usama and
> >>>> Chuanhua/Barry to figure this out :)
> >>
> >> I believe handling this in swap-in might violate layer separation.
> >> `swap_read_folio()` should be a reliable API to call, regardless of
> >> whether `zeromap` is present. Therefore, the fix should likely be
> >> within `zeromap` but not this `swap-in`. I’ll take a look at this with
> >> Usama :-)
> >
> > I meant handling it within this series to avoid blocking Usama
> > patches, not within this code. Thanks for taking a look, I am sure you
> > and Usama will figure out the best way forward :)
>
> Hi Barry and Yosry,
>
> Is the best (and quickest) way forward to have a v8 of this with
> https://lore.kernel.org/all/20240904055522.2376-1-21cnbao@gmail.com/
> as the first patch, and using swap_zeromap_entries_count in alloc_swap_folio
> in this support large folios swap-in patch?

Yes, Usama. I can actually do a check:

zeromap_cnt = swap_zeromap_entries_count(entry, nr);

/* swap_read_folio() can't handle an inconsistent zeromap across multiple entries */
if (zeromap_cnt > 0 && zeromap_cnt < nr)
       try next order;
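
Roughly, that check could be wrapped up as below. This is only a sketch and
not the final code; it assumes the swap_zeromap_entries_count() helper from
the patch linked above:

static inline bool swap_zeromap_consistent(swp_entry_t entry, int nr)
{
	int zeromap_cnt = swap_zeromap_entries_count(entry, nr);

	/* Either every entry in the range is zero-mapped or none of them is */
	return zeromap_cnt == 0 || zeromap_cnt == nr;
}

alloc_swap_folio() would then simply fall back to the next lower order
whenever this returns false.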

On the other hand, if you read the zRAM code, you will find that zRAM has
exactly the same mechanism as zeromap, and zRAM can even do more via its
same_pages-filled handling. Since zRAM does the job in the swapfile layer,
it has none of the consistency issues zeromap has.

So I feel that for the zRAM case we don't need zeromap at all, as the
efforts are duplicated, though I really appreciate your work, which can
benefit all swapfiles. I mean, zRAM has the ability to check for "zero"
(and also non-zero but identical content). After the swap layer checks
zeromap, zRAM will check again:

static int zram_write_page(struct zram *zram, struct page *page, u32 index)
{
       ...

        if (page_same_filled(mem, &element)) {
                kunmap_local(mem);
                /* Free memory associated with this sector now. */
                flags = ZRAM_SAME;
                atomic64_inc(&zram->stats.same_pages);
                goto out;
        }
        ...
}
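
For reference, the page_same_filled() check mentioned above looks roughly
like this in recent kernels (slightly abridged; see
drivers/block/zram/zram_drv.c for the authoritative version):

static bool page_same_filled(void *ptr, unsigned long *element)
{
	unsigned long *page;
	unsigned long val;
	unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;

	page = (unsigned long *)ptr;
	val = page[0];

	/* Cheap early reject: compare the first and last words */
	if (val != page[last_pos])
		return false;

	/* Then require every remaining word to match the first one */
	for (pos = 1; pos < last_pos; pos++) {
		if (val != page[pos])
			return false;
	}

	/* Remember the repeated value so it can be stored as ZRAM_SAME */
	*element = val;

	return true;
}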

So it seems that zeromap might slightly impact my zRAM use case. I'm not
blaming you, just pointing out that there might be some overlap in effort
here :-)

>
> Thanks,
> Usama

Thanks
Barry
Usama Arif Sept. 4, 2024, 11:23 p.m. UTC | #15
On 05/09/2024 00:10, Barry Song wrote:
> On Thu, Sep 5, 2024 at 9:30 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 03/09/2024 23:05, Yosry Ahmed wrote:
>>> On Tue, Sep 3, 2024 at 2:36 PM Barry Song <21cnbao@gmail.com> wrote:
>>>>
>>>> On Wed, Sep 4, 2024 at 8:08 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>>>>>
>>>>> On Tue, 3 Sep 2024 11:38:37 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:
>>>>>
>>>>>>> [   39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
>>>>>>> [   39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
>>>>>>> [   39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518
>>>>>>> [   39.158998]  </TASK>
>>>>>>> [   39.159226] ---[ end trace 0000000000000000 ]---
>>>>>>>
>>>>>>> After reverting this or Usama's "mm: store zero pages to be swapped
>>>>>>> out in a bitmap", the problem is gone. I think these two patches may
>>>>>>> have some conflict that needs to be resolved.
>>>>>>
>>>>>> Yup. I saw this conflict coming and specifically asked for this
>>>>>> warning to be added in Usama's patch to catch it [1]. It served its
>>>>>> purpose.
>>>>>>
>>>>>> Usama's patch does not handle large folio swapin, because at the time
>>>>>> it was written we didn't have it. We expected Usama's series to land
>>>>>> sooner than this one, so the warning was to make sure that this series
>>>>>> handles large folio swapin in the zeromap code. Now that they are both
>>>>>> in mm-unstable, we are gonna have to figure this out.
>>>>>>
>>>>>> I suspect Usama's patches are closer to land so it's better to handle
>>>>>> this in this series, but I will leave it up to Usama and
>>>>>> Chuanhua/Barry to figure this out :)
>>>>
>>>> I believe handling this in swap-in might violate layer separation.
>>>> `swap_read_folio()` should be a reliable API to call, regardless of
>>>> whether `zeromap` is present. Therefore, the fix should likely be
>>>> within `zeromap` but not this `swap-in`. I’ll take a look at this with
>>>> Usama :-)
>>>
>>> I meant handling it within this series to avoid blocking Usama
>>> patches, not within this code. Thanks for taking a look, I am sure you
>>> and Usama will figure out the best way forward :)
>>
>> Hi Barry and Yosry,
>>
>> Is the best (and quickest) way forward to have a v8 of this with
>> https://lore.kernel.org/all/20240904055522.2376-1-21cnbao@gmail.com/
>> as the first patch, and using swap_zeromap_entries_count in alloc_swap_folio
>> in this support large folios swap-in patch?
> 
> Yes, Usama. i can actually do a check:
> 
> zeromap_cnt = swap_zeromap_entries_count(entry, nr);
> 
> /* swap_read_folio() can handle inconsistent zeromap in multiple entries */
> if (zeromap_cnt > 0 && zeromap_cnt < nr)
>        try next order;
> 
> On the other hand, if you read the code of zRAM, you will find zRAM has
> exactly the same mechanism as zeromap but zRAM can even do more
> by same_pages filled. since zRAM does the job in swapfile layer, there
> is no this kind of consistency issue like zeromap.
> 
> So I feel for zRAM case, we don't need zeromap at all as there are duplicated
> efforts while I really appreciate your job which can benefit all swapfiles.
> i mean, zRAM has the ability to check "zero"(and also non-zero but same
> content). after zeromap checks zeromap, zRAM will check again:
> 

Yes, so there is a reason for having the zeromap patches, which I have outlined
in the cover letter.

https://lore.kernel.org/all/20240627105730.3110705-1-usamaarif642@gmail.com/

There are use cases where zswap/zram might not be used in production.
We can reduce I/O and flash wear in those cases by a large amount.

Also, running in Meta production, we found that the number of non-zero-filled
complete pages was less than 1%, so essentially it's only the zero-filled pages
that matter.

I believe that after zeromap, it might be a good idea to remove the page_same_filled
check from the zram code? It's not really a problem if it's kept either, as I don't
believe any zero-filled pages should reach zram_write_page.

> static int zram_write_page(struct zram *zram, struct page *page, u32 index)
> {
>        ...
> 
>         if (page_same_filled(mem, &element)) {
>                 kunmap_local(mem);
>                 /* Free memory associated with this sector now. */
>                 flags = ZRAM_SAME;
>                 atomic64_inc(&zram->stats.same_pages);
>                 goto out;
>         }
>         ...
> }
> 
> So it seems that zeromap might slightly impact my zRAM use case. I'm not
> blaming you, just pointing out that there might be some overlap in effort
> here :-)
> 
>>
>> Thanks,
>> Usama
> 
> Thanks
> Barry
Barry Song Sept. 4, 2024, 11:27 p.m. UTC | #16
On Thu, Sep 5, 2024 at 11:23 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 05/09/2024 00:10, Barry Song wrote:
> > On Thu, Sep 5, 2024 at 9:30 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>
> >>
> >>
> >> On 03/09/2024 23:05, Yosry Ahmed wrote:
> >>> On Tue, Sep 3, 2024 at 2:36 PM Barry Song <21cnbao@gmail.com> wrote:
> >>>>
> >>>> On Wed, Sep 4, 2024 at 8:08 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >>>>>
> >>>>> On Tue, 3 Sep 2024 11:38:37 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:
> >>>>>
> >>>>>>> [   39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
> >>>>>>> [   39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> >>>>>>> [   39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518
> >>>>>>> [   39.158998]  </TASK>
> >>>>>>> [   39.159226] ---[ end trace 0000000000000000 ]---
> >>>>>>>
> >>>>>>> After reverting this or Usama's "mm: store zero pages to be swapped
> >>>>>>> out in a bitmap", the problem is gone. I think these two patches may
> >>>>>>> have some conflict that needs to be resolved.
> >>>>>>
> >>>>>> Yup. I saw this conflict coming and specifically asked for this
> >>>>>> warning to be added in Usama's patch to catch it [1]. It served its
> >>>>>> purpose.
> >>>>>>
> >>>>>> Usama's patch does not handle large folio swapin, because at the time
> >>>>>> it was written we didn't have it. We expected Usama's series to land
> >>>>>> sooner than this one, so the warning was to make sure that this series
> >>>>>> handles large folio swapin in the zeromap code. Now that they are both
> >>>>>> in mm-unstable, we are gonna have to figure this out.
> >>>>>>
> >>>>>> I suspect Usama's patches are closer to land so it's better to handle
> >>>>>> this in this series, but I will leave it up to Usama and
> >>>>>> Chuanhua/Barry to figure this out :)
> >>>>
> >>>> I believe handling this in swap-in might violate layer separation.
> >>>> `swap_read_folio()` should be a reliable API to call, regardless of
> >>>> whether `zeromap` is present. Therefore, the fix should likely be
> >>>> within `zeromap` but not this `swap-in`. I’ll take a look at this with
> >>>> Usama :-)
> >>>
> >>> I meant handling it within this series to avoid blocking Usama
> >>> patches, not within this code. Thanks for taking a look, I am sure you
> >>> and Usama will figure out the best way forward :)
> >>
> >> Hi Barry and Yosry,
> >>
> >> Is the best (and quickest) way forward to have a v8 of this with
> >> https://lore.kernel.org/all/20240904055522.2376-1-21cnbao@gmail.com/
> >> as the first patch, and using swap_zeromap_entries_count in alloc_swap_folio
> >> in this support large folios swap-in patch?
> >
> > Yes, Usama. i can actually do a check:
> >
> > zeromap_cnt = swap_zeromap_entries_count(entry, nr);
> >
> > /* swap_read_folio() can handle inconsistent zeromap in multiple entries */
> > if (zeromap_cnt > 0 && zeromap_cnt < nr)
> >        try next order;
> >
> > On the other hand, if you read the code of zRAM, you will find zRAM has
> > exactly the same mechanism as zeromap but zRAM can even do more
> > by same_pages filled. since zRAM does the job in swapfile layer, there
> > is no this kind of consistency issue like zeromap.
> >
> > So I feel for zRAM case, we don't need zeromap at all as there are duplicated
> > efforts while I really appreciate your job which can benefit all swapfiles.
> > i mean, zRAM has the ability to check "zero"(and also non-zero but same
> > content). after zeromap checks zeromap, zRAM will check again:
> >
>
> Yes, so there is a reason for having the zeromap patches, which I have outlined
> in the coverletter.
>
> https://lore.kernel.org/all/20240627105730.3110705-1-usamaarif642@gmail.com/
>
> There are usecases where zswap/zram might not be used in production.
> We can reduce I/O and flash wear in those cases by a large amount.
>
> Also running in Meta production, we found that the number of non-zero filled
> complete pages were less than 1%, so essentially its only the zero-filled pages
> that matter.

I don't have data on Android phones; I'd like to see whether phones show
the same ratio, i.e. whether non-zero same-filled pages are similarly rare.

>
> I believe after zeromap, it might be a good idea to remove the page_same_filled
> check from zram code? Its not really a problem if its kept as well as I dont
> believe any zero-filled pages should reach zram_write_page?
>
> > static int zram_write_page(struct zram *zram, struct page *page, u32 index)
> > {
> >        ...
> >
> >         if (page_same_filled(mem, &element)) {
> >                 kunmap_local(mem);
> >                 /* Free memory associated with this sector now. */
> >                 flags = ZRAM_SAME;
> >                 atomic64_inc(&zram->stats.same_pages);
> >                 goto out;
> >         }
> >         ...
> > }
> >
> > So it seems that zeromap might slightly impact my zRAM use case. I'm not
> > blaming you, just pointing out that there might be some overlap in effort
> > here :-)
> >
> >>
> >> Thanks,
> >> Usama
> >

Thanks
Barry
Yosry Ahmed Sept. 4, 2024, 11:35 p.m. UTC | #17
[..]
> >
> > On the other hand, if you read the code of zRAM, you will find zRAM has
> > exactly the same mechanism as zeromap but zRAM can even do more
> > by same_pages filled. since zRAM does the job in swapfile layer, there
> > is no this kind of consistency issue like zeromap.
> >
> > So I feel for zRAM case, we don't need zeromap at all as there are duplicated
> > efforts while I really appreciate your job which can benefit all swapfiles.
> > i mean, zRAM has the ability to check "zero"(and also non-zero but same
> > content). after zeromap checks zeromap, zRAM will check again:
> >
>
> Yes, so there is a reason for having the zeromap patches, which I have outlined
> in the coverletter.
>
> https://lore.kernel.org/all/20240627105730.3110705-1-usamaarif642@gmail.com/
>
> There are usecases where zswap/zram might not be used in production.
> We can reduce I/O and flash wear in those cases by a large amount.
>
> Also running in Meta production, we found that the number of non-zero filled
> complete pages were less than 1%, so essentially its only the zero-filled pages
> that matter.
>
> I believe after zeromap, it might be a good idea to remove the page_same_filled
> check from zram code? Its not really a problem if its kept as well as I dont
> believe any zero-filled pages should reach zram_write_page?

I brought this up before and Sergey pointed out that zram is sometimes
used as a block device without swap, and that use case would benefit
from having this handling in zram. That being said, I have no idea how
many people care about this specific scenario.
Barry Song Sept. 22, 2024, 11:57 p.m. UTC | #18
On Thu, Sep 5, 2024 at 7:36 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [..]
> > >
> > > On the other hand, if you read the code of zRAM, you will find zRAM has
> > > exactly the same mechanism as zeromap but zRAM can even do more
> > > by same_pages filled. since zRAM does the job in swapfile layer, there
> > > is no this kind of consistency issue like zeromap.
> > >
> > > So I feel for zRAM case, we don't need zeromap at all as there are duplicated
> > > efforts while I really appreciate your job which can benefit all swapfiles.
> > > i mean, zRAM has the ability to check "zero"(and also non-zero but same
> > > content). after zeromap checks zeromap, zRAM will check again:
> > >
> >
> > Yes, so there is a reason for having the zeromap patches, which I have outlined
> > in the coverletter.
> >
> > https://lore.kernel.org/all/20240627105730.3110705-1-usamaarif642@gmail.com/
> >
> > There are usecases where zswap/zram might not be used in production.
> > We can reduce I/O and flash wear in those cases by a large amount.
> >
> > Also running in Meta production, we found that the number of non-zero filled
> > complete pages were less than 1%, so essentially its only the zero-filled pages
> > that matter.
> >
> > I believe after zeromap, it might be a good idea to remove the page_same_filled
> > check from zram code? Its not really a problem if its kept as well as I dont
> > believe any zero-filled pages should reach zram_write_page?
>
> I brought this up before and Sergey pointed out that zram is sometimes
> used as a block device without swap, and that use case would benefit
> from having this handling in zram. That being said, I have no idea how
> many people care about this specific scenario.

Hi Usama/Yosry,

We successfully gathered page_same_filled data for zram on Android.
Interestingly, our findings differ from yours on zswap.

Hailong discovered that around 85-86% of the page_same_filled data consists
of zeros, while about 15% is non-zero. We suspect that on Android or similar
systems, some graphics or media data might be duplicated at times, such as
a red block displayed on the screen.

Does this suggest that page_same_filled could still provide some benefits
in zram cases?

Thanks
Barry
Usama Arif Sept. 23, 2024, 10:22 a.m. UTC | #19
On 23/09/2024 00:57, Barry Song wrote:
> On Thu, Sep 5, 2024 at 7:36 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>>
>> [..]
>>>>
>>>> On the other hand, if you read the code of zRAM, you will find zRAM has
>>>> exactly the same mechanism as zeromap but zRAM can even do more
>>>> by same_pages filled. since zRAM does the job in swapfile layer, there
>>>> is no this kind of consistency issue like zeromap.
>>>>
>>>> So I feel for zRAM case, we don't need zeromap at all as there are duplicated
>>>> efforts while I really appreciate your job which can benefit all swapfiles.
>>>> i mean, zRAM has the ability to check "zero"(and also non-zero but same
>>>> content). after zeromap checks zeromap, zRAM will check again:
>>>>
>>>
>>> Yes, so there is a reason for having the zeromap patches, which I have outlined
>>> in the coverletter.
>>>
>>> https://lore.kernel.org/all/20240627105730.3110705-1-usamaarif642@gmail.com/
>>>
>>> There are usecases where zswap/zram might not be used in production.
>>> We can reduce I/O and flash wear in those cases by a large amount.
>>>
>>> Also running in Meta production, we found that the number of non-zero filled
>>> complete pages were less than 1%, so essentially its only the zero-filled pages
>>> that matter.
>>>
>>> I believe after zeromap, it might be a good idea to remove the page_same_filled
>>> check from zram code? Its not really a problem if its kept as well as I dont
>>> believe any zero-filled pages should reach zram_write_page?
>>
>> I brought this up before and Sergey pointed out that zram is sometimes
>> used as a block device without swap, and that use case would benefit
>> from having this handling in zram. That being said, I have no idea how
>> many people care about this specific scenario.
> 
> Hi Usama/Yosry,
> 
> We successfully gathered page_same_filled data for zram on Android.
> Interestingly,
> our findings differ from yours on zswap.
> 
> Hailong discovered that around 85-86% of the page_same_filled data
> consists of zeros,
> while about 15% are non-zero. We suspect that on Android or similar
> systems, some
> graphics or media data might be duplicated at times, such as a red
> block displayed
> on the screen.
> 
> Does this suggest that page_same_filled could still provide some
> benefits in zram
> cases?

Hi Barry,

Thanks for the data, it's very interesting to know this from the mobile side.
Even though it's not the 99% that I observed, I do feel 85% is still quite high.

The 2 main reasons for the patchset to store zero pages to be
swapped out in a bitmap were applications that use swap but
not zswap/zram (reducing I/O and flash wear), and simplifying the zswap
code. It also meant fewer zswap_entry structs in memory, which would
offset the memory usage of the bitmap.

Yosry mentioned that Sergey pointed out that zram is sometimes
used as a block device without swap, and that use case would benefit
from having this handling in zram. Will that case also not go through
swap_writepage and therefore be taken care of by swap_info_struct->zeromap?

Also just curious if there are cases in mobile where only swap is used,
but not zswap/zram?

I think even with 85% zero-filled pages, we could get rid of page_same_filled,
especially if the zram-without-swap case is handled by swap_info_struct->zeromap,
but I don't have a strong preference.

Thanks,
Usama

> 
> Thanks
> Barry
Johannes Weiner Sept. 23, 2024, 12:10 p.m. UTC | #20
On Mon, Sep 23, 2024 at 11:22:30AM +0100, Usama Arif wrote:
> On 23/09/2024 00:57, Barry Song wrote:
> > On Thu, Sep 5, 2024 at 7:36 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >>>> On the other hand, if you read the code of zRAM, you will find zRAM has
> >>>> exactly the same mechanism as zeromap but zRAM can even do more
> >>>> by same_pages filled. since zRAM does the job in swapfile layer, there
> >>>> is no this kind of consistency issue like zeromap.
> >>>>
> >>>> So I feel for zRAM case, we don't need zeromap at all as there are duplicated
> >>>> efforts while I really appreciate your job which can benefit all swapfiles.
> >>>> i mean, zRAM has the ability to check "zero"(and also non-zero but same
> >>>> content). after zeromap checks zeromap, zRAM will check again:
> >>>>
> >>>
> >>> Yes, so there is a reason for having the zeromap patches, which I have outlined
> >>> in the coverletter.
> >>>
> >>> https://lore.kernel.org/all/20240627105730.3110705-1-usamaarif642@gmail.com/
> >>>
> >>> There are usecases where zswap/zram might not be used in production.
> >>> We can reduce I/O and flash wear in those cases by a large amount.
> >>>
> >>> Also running in Meta production, we found that the number of non-zero filled
> >>> complete pages were less than 1%, so essentially its only the zero-filled pages
> >>> that matter.
> >>>
> >>> I believe after zeromap, it might be a good idea to remove the page_same_filled
> >>> check from zram code? Its not really a problem if its kept as well as I dont
> >>> believe any zero-filled pages should reach zram_write_page?
> >>
> >> I brought this up before and Sergey pointed out that zram is sometimes
> >> used as a block device without swap, and that use case would benefit
> >> from having this handling in zram. That being said, I have no idea how
> >> many people care about this specific scenario.
> > 
> > Hi Usama/Yosry,
> > 
> > We successfully gathered page_same_filled data for zram on Android.
> > Interestingly,
> > our findings differ from yours on zswap.
> > 
> > Hailong discovered that around 85-86% of the page_same_filled data
> > consists of zeros,
> > while about 15% are non-zero. We suspect that on Android or similar
> > systems, some
> > graphics or media data might be duplicated at times, such as a red
> > block displayed
> > on the screen.
> > 
> > Does this suggest that page_same_filled could still provide some
> > benefits in zram
> > cases?
> 
> Hi Barry,
> 
> Thanks for the data, its very interesting to know this from mobile side.
> Eventhough its not 99% that I observed, I do feel 85% is still quite high.

Would it be possible to benchmark Android with zram only optimizing
zero pages?

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index c3d245617083..f6ded491fd00 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -211,6 +211,9 @@ static bool page_same_filled(void *ptr, unsigned long *element)
 	page = (unsigned long *)ptr;
 	val = page[0];
 
+	if (val)
+		return false;
+
 	if (val != page[last_pos])
 		return false;
 
My take is that, if this is worth optimizing for, then it's probably
worth optimizing for in the generic swap layer too. It makes sense to
maintain feature parity if we one day want Android to work with zswap.

> The 2 main reasons for the patcheset to store zero pages to be
> swapped out in a bitmap were for applications that use swap but
> not zswap/zram (reducing I/O and flash wear), and simplifying zswap
> code. It also meant fewer zswap_entry structs in memory which would
> offset the memory usage by bitmap.
> 
> Yosry mentioned that Sergey pointed out that zram is sometimes
> used as a block device without swap, and that use case would benefit
> from having this handling in zram. Will that case also not go through
> swap_writepage and therefore be takencare of by swap_info_struct->zeromap?

No, it won't.

However, the write bios sent from swap have REQ_SWAP set. Zram could
make its same-filled optimization conditional on that flag if it wants
to maintain it for the raw block device use case.
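
A minimal sketch of that idea (the helper name is made up and it assumes
the bio's opf flags are plumbed down to zram's write path; purely
illustrative, not actual zram code):

#include <linux/blk_types.h>

/*
 * Keep the same-filled shortcut only for non-swap writes (raw block
 * device users). Swap-originated writes carry REQ_SWAP and are already
 * filtered by the swap layer's zeromap.
 */
static inline bool zram_want_same_filled(blk_opf_t opf)
{
	return !(opf & REQ_SWAP);
}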
Yosry Ahmed Sept. 23, 2024, 4:53 p.m. UTC | #21
On Mon, Sep 23, 2024 at 5:10 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Sep 23, 2024 at 11:22:30AM +0100, Usama Arif wrote:
> > On 23/09/2024 00:57, Barry Song wrote:
> > > On Thu, Sep 5, 2024 at 7:36 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >>>> On the other hand, if you read the code of zRAM, you will find zRAM has
> > >>>> exactly the same mechanism as zeromap but zRAM can even do more
> > >>>> by same_pages filled. since zRAM does the job in swapfile layer, there
> > >>>> is no this kind of consistency issue like zeromap.
> > >>>>
> > >>>> So I feel for zRAM case, we don't need zeromap at all as there are duplicated
> > >>>> efforts while I really appreciate your job which can benefit all swapfiles.
> > >>>> i mean, zRAM has the ability to check "zero"(and also non-zero but same
> > >>>> content). after zeromap checks zeromap, zRAM will check again:
> > >>>>
> > >>>
> > >>> Yes, so there is a reason for having the zeromap patches, which I have outlined
> > >>> in the coverletter.
> > >>>
> > >>> https://lore.kernel.org/all/20240627105730.3110705-1-usamaarif642@gmail.com/
> > >>>
> > >>> There are usecases where zswap/zram might not be used in production.
> > >>> We can reduce I/O and flash wear in those cases by a large amount.
> > >>>
> > >>> Also running in Meta production, we found that the number of non-zero filled
> > >>> complete pages were less than 1%, so essentially its only the zero-filled pages
> > >>> that matter.
> > >>>
> > >>> I believe after zeromap, it might be a good idea to remove the page_same_filled
> > >>> check from zram code? Its not really a problem if its kept as well as I dont
> > >>> believe any zero-filled pages should reach zram_write_page?
> > >>
> > >> I brought this up before and Sergey pointed out that zram is sometimes
> > >> used as a block device without swap, and that use case would benefit
> > >> from having this handling in zram. That being said, I have no idea how
> > >> many people care about this specific scenario.
> > >
> > > Hi Usama/Yosry,
> > >
> > > We successfully gathered page_same_filled data for zram on Android.
> > > Interestingly,
> > > our findings differ from yours on zswap.
> > >
> > > Hailong discovered that around 85-86% of the page_same_filled data
> > > consists of zeros,
> > > while about 15% are non-zero. We suspect that on Android or similar
> > > systems, some
> > > graphics or media data might be duplicated at times, such as a red
> > > block displayed
> > > on the screen.
> > >
> > > Does this suggest that page_same_filled could still provide some
> > > benefits in zram
> > > cases?
> >
> > Hi Barry,
> >
> > Thanks for the data, its very interesting to know this from mobile side.
> > Eventhough its not 99% that I observed, I do feel 85% is still quite high.
>
> Would it be possible to benchmark Android with zram only optimizing
> zero pages?
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index c3d245617083..f6ded491fd00 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -211,6 +211,9 @@ static bool page_same_filled(void *ptr, unsigned long *element)
>         page = (unsigned long *)ptr;
>         val = page[0];
>
> +       if (val)
> +               return false;
> +
>         if (val != page[last_pos])
>                 return false;
>
> My take is that, if this is worth optimizing for, then it's probably
> worth optimizing for in the generic swap layer too. It makes sense to
> maintain feature parity if we one day want Android to work with zswap.

I am not sure if it's worth it for the generic swap layer. We would
need to store 8 bytes per swap entry to maintain feature parity;
that's about 0.2% of swap capacity as consistent memory overhead. Swap
capacity is usually higher than the actual size of swapped data.

IIUC the data you gathered from prod showed that 1% of same filled
pages were non-zero, and 10-20% of swapped data was same filled [1].
That means that ~0.15% of swapped data is non-zero same-filled.

With zswap, assuming a 3:1 compression ratio, we'd be paying 0.2% of
swap capacity to save around 0.05% of swapped data in memory. I think
it may be worse because that compression ratio may be higher for
same-filled data.

With SSD swap, I am not sure if 0.15% reduction in IO is worth the
memory overhead.
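
For concreteness, the arithmetic above works out roughly as follows
(assumptions: 4K pages, 8 bytes stored per swap slot, ~15% of swapped
data same-filled with ~1% of that non-zero, and a 3:1 compression ratio):

#include <stdio.h>

int main(void)
{
	double overhead = 8.0 / 4096;		/* 8 bytes per 4K slot ~= 0.2% of swap capacity */
	double nonzero_same = 0.15 * 0.01;	/* ~0.15% of swapped data is non-zero same-filled */
	double saved = nonzero_same / 3.0;	/* ~0.05% of swapped data saved after compression */

	printf("overhead %.2f%%, non-zero same-filled %.2f%%, saved %.2f%%\n",
	       overhead * 100, nonzero_same * 100, saved * 100);
	return 0;
}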

OTOH, zram keeps track of the same-filled value for free because it
overlays the zsmalloc handle (like zswap used to do). So the same
tradeoffs do not apply.
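
(For context, the overlay referred to here is zram's per-slot table entry,
simplified below; the exact layout varies across kernel versions:)

struct zram_table_entry {
	union {
		unsigned long handle;	/* zsmalloc handle for compressed slots */
		unsigned long element;	/* repeated word for ZRAM_SAME slots */
	};
	unsigned long flags;
	/* ... accounting fields omitted ... */
};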

Barry mentioned that 15% of same-filled pages are non-zero in their
Android experiment, but what % of total swapped memory is this, and
how much space does it take if we just compress it instead?

IOW, how much memory is this really saving with zram (especially since the
metadata is statically allocated)?

Patch

diff --git a/mm/memory.c b/mm/memory.c
index b9fe2f354878..7aa0358a4160 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3986,6 +3986,184 @@  static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 	return VM_FAULT_SIGBUS;
 }
 
+static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct folio *folio;
+	swp_entry_t entry;
+
+	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma,
+				vmf->address, false);
+	if (!folio)
+		return NULL;
+
+	entry = pte_to_swp_entry(vmf->orig_pte);
+	if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
+					   GFP_KERNEL, entry)) {
+		folio_put(folio);
+		return NULL;
+	}
+
+	return folio;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * Check if the PTEs within a range are contiguous swap entries
+ * and have no cache when check_no_cache is true.
+ */
+static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
+			   int nr_pages, bool check_no_cache)
+{
+	struct swap_info_struct *si;
+	unsigned long addr;
+	swp_entry_t entry;
+	pgoff_t offset;
+	int idx, i;
+	pte_t pte;
+
+	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+	idx = (vmf->address - addr) / PAGE_SIZE;
+	pte = ptep_get(ptep);
+
+	if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
+		return false;
+	entry = pte_to_swp_entry(pte);
+	offset = swp_offset(entry);
+	if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
+		return false;
+
+	if (!check_no_cache)
+		return true;
+
+	si = swp_swap_info(entry);
+	/*
+	 * When we allocate a large folio and read it with swap_read_folio()
+	 * directly, the faulting PTE has no swapcache. We need to ensure that
+	 * all the other PTEs have no swapcache either; otherwise, we might read
+	 * from the swap device while the content is still in the swapcache.
+	 */
+	for (i = 0; i < nr_pages; i++) {
+		if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
+			return false;
+	}
+
+	return true;
+}
+
+static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
+						     unsigned long addr,
+						     unsigned long orders)
+{
+	int order, nr;
+
+	order = highest_order(orders);
+
+	/*
+	 * To swap in a THP with nr pages, we require that its first swap_offset
+	 * is aligned with that number, as it was when the THP was swapped out.
+	 * This helps filter out most invalid entries.
+	 */
+	while (orders) {
+		nr = 1 << order;
+		if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr)
+			break;
+		order = next_order(&orders, order);
+	}
+
+	return orders;
+}
+
+static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long orders;
+	struct folio *folio;
+	unsigned long addr;
+	swp_entry_t entry;
+	spinlock_t *ptl;
+	pte_t *pte;
+	gfp_t gfp;
+	int order;
+
+	/*
+	 * If uffd is active for the vma we need per-page fault fidelity to
+	 * maintain the uffd semantics.
+	 */
+	if (unlikely(userfaultfd_armed(vma)))
+		goto fallback;
+
+	/*
+	 * A large swapped out folio could be partially or fully in zswap. We
+	 * lack handling for such cases, so fallback to swapping in order-0
+	 * folio.
+	 */
+	if (!zswap_never_enabled())
+		goto fallback;
+
+	entry = pte_to_swp_entry(vmf->orig_pte);
+	/*
+	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
+	 * and suitable for swapping THP.
+	 */
+	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+	orders = thp_swap_suitable_orders(swp_offset(entry),
+					  vmf->address, orders);
+
+	if (!orders)
+		goto fallback;
+
+	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+				  vmf->address & PMD_MASK, &ptl);
+	if (unlikely(!pte))
+		goto fallback;
+
+	/*
+	 * For do_swap_page, find the highest order where the aligned range is
+	 * completely swap entries with contiguous swap offsets.
+	 */
+	order = highest_order(orders);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order, true))
+			break;
+		order = next_order(&orders, order);
+	}
+
+	pte_unmap_unlock(pte, ptl);
+
+	/* Try allocating the highest of the remaining orders. */
+	gfp = vma_thp_gfp_mask(vma);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		folio = vma_alloc_folio(gfp, order, vma, addr, true);
+		if (folio) {
+			if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
+							    gfp, entry))
+				return folio;
+			folio_put(folio);
+		}
+		order = next_order(&orders, order);
+	}
+
+fallback:
+	return __alloc_swap_folio(vmf);
+}
+#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
+static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
+				  int nr_pages, bool check_no_cache)
+{
+	return false;
+}
+
+static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+{
+	return __alloc_swap_folio(vmf);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -4074,34 +4252,34 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!folio) {
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
-			/*
-			 * Prevent parallel swapin from proceeding with
-			 * the cache flag. Otherwise, another thread may
-			 * finish swapin first, free the entry, and swapout
-			 * reusing the same entry. It's undetectable as
-			 * pte_same() returns true due to entry reuse.
-			 */
-			if (swapcache_prepare(entry, 1)) {
-				/* Relax a bit to prevent rapid repeated page faults */
-				schedule_timeout_uninterruptible(1);
-				goto out;
-			}
-			need_clear_cache = true;
-
 			/* skip swapcache */
-			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
-						vma, vmf->address, false);
+			folio = alloc_swap_folio(vmf);
 			if (folio) {
 				__folio_set_locked(folio);
 				__folio_set_swapbacked(folio);
 
-				if (mem_cgroup_swapin_charge_folio(folio,
-							vma->vm_mm, GFP_KERNEL,
-							entry)) {
-					ret = VM_FAULT_OOM;
+				nr_pages = folio_nr_pages(folio);
+				if (folio_test_large(folio))
+					entry.val = ALIGN_DOWN(entry.val, nr_pages);
+				/*
+				 * Prevent parallel swapin from proceeding with
+				 * the cache flag. Otherwise, another thread
+				 * may finish swapin first, free the entry, and
+				 * swapout reusing the same entry. It's
+				 * undetectable as pte_same() returns true due
+				 * to entry reuse.
+				 */
+				if (swapcache_prepare(entry, nr_pages)) {
+					/*
+					 * Relax a bit to prevent rapid
+					 * repeated page faults.
+					 */
+					schedule_timeout_uninterruptible(1);
 					goto out_page;
 				}
-				mem_cgroup_swapin_uncharge_swap(entry, 1);
+				need_clear_cache = true;
+
+				mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
 
 				shadow = get_shadow_from_swap_cache(entry);
 				if (shadow)
@@ -4207,6 +4385,23 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out_nomap;
 	}
 
+	/* allocated large folios for SWP_SYNCHRONOUS_IO */
+	if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
+		unsigned long nr = folio_nr_pages(folio);
+		unsigned long folio_start = ALIGN_DOWN(vmf->address,
+						       nr * PAGE_SIZE);
+		unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
+		pte_t *folio_ptep = vmf->pte - idx;
+
+		if (!can_swapin_thp(vmf, folio_ptep, nr, false))
+			goto out_nomap;
+
+		page_idx = idx;
+		address = folio_start;
+		ptep = folio_ptep;
+		goto check_folio;
+	}
+
 	nr_pages = 1;
 	page_idx = 0;
 	address = vmf->address;
@@ -4338,11 +4533,12 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_add_lru_vma(folio, vma);
 	} else if (!folio_test_anon(folio)) {
 		/*
-		 * We currently only expect small !anon folios, which are either
-		 * fully exclusive or fully shared. If we ever get large folios
-		 * here, we have to be careful.
+		 * We currently only expect small !anon folios which are either
+		 * fully exclusive or fully shared, or new allocated large
+		 * folios which are fully exclusive. If we ever get large
+		 * folios within swapcache here, we have to be careful.
 		 */
-		VM_WARN_ON_ONCE(folio_test_large(folio));
+		VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
 		VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
 		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
 	} else {
@@ -4385,7 +4581,7 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 out:
 	/* Clear the swap cache pin for direct swapin after PTL unlock */
 	if (need_clear_cache)
-		swapcache_clear(si, entry, 1);
+		swapcache_clear(si, entry, nr_pages);
 	if (si)
 		put_swap_device(si);
 	return ret;
@@ -4401,7 +4597,7 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_put(swapcache);
 	}
 	if (need_clear_cache)
-		swapcache_clear(si, entry, 1);
+		swapcache_clear(si, entry, nr_pages);
 	if (si)
 		put_swap_device(si);
 	return ret;