diff mbox series

[2/5] mm/zswap: change dstmem size to one page

Message ID 20231213-zswap-dstmem-v1-2-896763369d04@bytedance.com (mailing list archive)
State New
Series mm/zswap: dstmem reuse optimizations and cleanups

Commit Message

Chengming Zhou Dec. 13, 2023, 4:17 a.m. UTC
Change the dstmem size from 2 * PAGE_SIZE to one page, since compression
needs at most one page of output and the "dlen" passed to
acomp_request_set_params() is already PAGE_SIZE. If the output is larger
than PAGE_SIZE, we don't want to store it in zswap anyway.

So change it to one page, and delete the stale comment.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)
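
The sizing argument in the commit message can be illustrated outside the kernel. The sketch below is a toy run-length encoder, not any compressor that zswap actually uses; `toy_compress`, `TOY_PAGE_SIZE` and the 4 KiB page size are assumptions made for illustration. It only shows that a compressor which treats the destination length as a hard limit never touches memory beyond one page, so a second page of dstmem would go unused.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/*
 * Toy run-length encoder standing in for a real compressor (hypothetical,
 * for illustration only). It emits (run length, byte) pairs into dst and
 * never writes at or beyond dst[dst_cap].
 *
 * Returns the number of bytes written, or -ENOSPC if the output would not
 * fit in dst_cap bytes.
 */
int toy_compress(const unsigned char *src, size_t slen,
		 unsigned char *dst, size_t dst_cap)
{
	size_t in = 0, out = 0;

	while (in < slen) {
		unsigned char byte = src[in];
		size_t run = 1;

		/* Extend the run while the byte repeats, up to 255. */
		while (in + run < slen && src[in + run] == byte && run < 255)
			run++;

		/* Refuse to overrun the destination buffer. */
		if (out + 2 > dst_cap)
			return -ENOSPC;

		dst[out++] = (unsigned char)run;
		dst[out++] = byte;
		in += run;
	}

	return (int)out;
}
```

With a zero-filled 4096-byte source the encoder returns 34 bytes. With a source whose neighbouring bytes all differ, it returns -ENOSPC when dst_cap is TOY_PAGE_SIZE and leaves everything past the first page untouched.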

Comments

Yosry Ahmed Dec. 13, 2023, 11:34 p.m. UTC | #1
On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> Change the dstmem size from 2 * PAGE_SIZE to only one page since
> we only need at most one page when compress, and the "dlen" is also
> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
> we don't wanna store the output in zswap anyway.
>
> So change it to one page, and delete the stale comment.

I couldn't find the history of why we needed 2 * PAGE_SIZE. It would
be nice if someone had the context, perhaps one of the maintainers.

One potential reason is that we used to store a zswap header
containing the swap entry in the compressed page for writeback
purposes, but we don't do that anymore. Maybe we wanted to be able to
handle the case where an incompressible page would exceed PAGE_SIZE
because of that?
Nhat Pham Dec. 14, 2023, 12:18 a.m. UTC | #2
On Wed, Dec 13, 2023 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
> >
> > Change the dstmem size from 2 * PAGE_SIZE to only one page since
> > we only need at most one page when compress, and the "dlen" is also
> > PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
> > we don't wanna store the output in zswap anyway.
> >
> > So change it to one page, and delete the stale comment.
>
> I couldn't find the history of why we needed 2 * PAGE_SIZE, it would
> be nice if someone has the context, perhaps one of the maintainers.

It'd be very nice indeed.

>
> One potential reason is that we used to store a zswap header
> containing the swap entry in the compressed page for writeback
> purposes, but we don't do that anymore. Maybe we wanted to be able to
> handle the case where an incompressible page would exceed PAGE_SIZE
> because of that?

It could be hmm. I didn't study the old zswap architecture too much,
but it has been 2 * PAGE_SIZE since the time zswap was first merged
last I checked.
I'm not 100% comfortable ACK-ing the undoing of something that looks
so intentional, but FTR, AFAICT, this looks correct to me.
Chengming Zhou Dec. 14, 2023, 1:33 p.m. UTC | #3
On 2023/12/14 08:18, Nhat Pham wrote:
> On Wed, Dec 13, 2023 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>
>> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
>> <zhouchengming@bytedance.com> wrote:
>>>
>>> Change the dstmem size from 2 * PAGE_SIZE to only one page since
>>> we only need at most one page when compress, and the "dlen" is also
>>> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
>>> we don't wanna store the output in zswap anyway.
>>>
>>> So change it to one page, and delete the stale comment.
>>
>> I couldn't find the history of why we needed 2 * PAGE_SIZE, it would
>> be nice if someone has the context, perhaps one of the maintainers.
> 
> It'd be very nice indeed.
> 
>>
>> One potential reason is that we used to store a zswap header
>> containing the swap entry in the compressed page for writeback
>> purposes, but we don't do that anymore. Maybe we wanted to be able to
>> handle the case where an incompressible page would exceed PAGE_SIZE
>> because of that?
> 
> It could be hmm. I didn't study the old zswap architecture too much,
> but it has been 2 * PAGE_SIZE since the time zswap was first merged
> last I checked.
> I'm not 100% comfortable ACK-ing the undoing of something that looks
> so intentional, but FTR, AFAICT, this looks correct to me.

Right, there is no history explaining why we needed 2 pages.
But the current code clearly needs only one page, and kernel build
stress testing found no problems.

Thanks!
Yosry Ahmed Dec. 14, 2023, 1:37 p.m. UTC | #4
On Thu, Dec 14, 2023 at 5:33 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2023/12/14 08:18, Nhat Pham wrote:
> > On Wed, Dec 13, 2023 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >>
> >> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> >> <zhouchengming@bytedance.com> wrote:
> >>>
> >>> Change the dstmem size from 2 * PAGE_SIZE to only one page since
> >>> we only need at most one page when compress, and the "dlen" is also
> >>> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
> >>> we don't wanna store the output in zswap anyway.
> >>>
> >>> So change it to one page, and delete the stale comment.
> >>
> >> I couldn't find the history of why we needed 2 * PAGE_SIZE, it would
> >> be nice if someone has the context, perhaps one of the maintainers.
> >
> > It'd be very nice indeed.
> >
> >>
> >> One potential reason is that we used to store a zswap header
> >> containing the swap entry in the compressed page for writeback
> >> purposes, but we don't do that anymore. Maybe we wanted to be able to
> >> handle the case where an incompressible page would exceed PAGE_SIZE
> >> because of that?
> >
> > It could be hmm. I didn't study the old zswap architecture too much,
> > but it has been 2 * PAGE_SIZE since the time zswap was first merged
> > last I checked.
> > I'm not 100% comfortable ACK-ing the undoing of something that looks
> > so intentional, but FTR, AFAICT, this looks correct to me.
>
> Right, there is no any history about the reason why we needed 2 pages.
> But obviously only one page is needed from the current code and no any
> problem found in the kernel build stress testing.

Could you try manually stressing the compression with data that
doesn't compress at all (i.e. dlen == PAGE_SIZE)? I want to make sure
that this case is specifically handled. I think using data from
/dev/random will do that but please double check that dlen ==
PAGE_SIZE.
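
One way to run the stress test suggested above from user space is sketched below. This is not what was run in this thread; `fill_random` and `stress_incompressible` are hypothetical helper names, /dev/urandom is used instead of /dev/random so the read does not block, and MADV_PAGEOUT (Linux 5.4 and later) is only a reclaim hint, so whether the pages reach zswap depends on the system's swap configuration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fill buf with len random bytes. Returns 0 on success, -1 on failure. */
int fill_random(unsigned char *buf, size_t len)
{
	FILE *f = fopen("/dev/urandom", "rb");
	size_t got;

	if (!f)
		return -1;
	got = fread(buf, 1, len, f);
	fclose(f);
	return got == len ? 0 : -1;
}

/*
 * Map npages anonymous pages, fill them with random (hence practically
 * incompressible) bytes, and ask the kernel to reclaim them.
 *
 * Returns 0 if the kernel accepted the hint, -1 otherwise (including when
 * MADV_PAGEOUT is not available in the build environment).
 */
int stress_incompressible(size_t npages)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = npages * page;
	unsigned char *buf;
	int ret = -1;

	assert(npages > 0 && len / npages == page);	/* no overflow */

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	if (fill_random(buf, len) == 0) {
#ifdef MADV_PAGEOUT
		ret = madvise(buf, len, MADV_PAGEOUT);
#endif
	}

	munmap(buf, len);
	return ret;
}
```

A return value of 0 only means the hint was accepted; the resulting dlen still has to be checked separately, for example by tracing zpool_malloc.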
Chengming Zhou Dec. 14, 2023, 1:57 p.m. UTC | #5
On 2023/12/14 21:37, Yosry Ahmed wrote:
> On Thu, Dec 14, 2023 at 5:33 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> On 2023/12/14 08:18, Nhat Pham wrote:
>>> On Wed, Dec 13, 2023 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>>>
>>>> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
>>>> <zhouchengming@bytedance.com> wrote:
>>>>>
>>>>> Change the dstmem size from 2 * PAGE_SIZE to only one page since
>>>>> we only need at most one page when compress, and the "dlen" is also
>>>>> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
>>>>> we don't wanna store the output in zswap anyway.
>>>>>
>>>>> So change it to one page, and delete the stale comment.
>>>>
>>>> I couldn't find the history of why we needed 2 * PAGE_SIZE, it would
>>>> be nice if someone has the context, perhaps one of the maintainers.
>>>
>>> It'd be very nice indeed.
>>>
>>>>
>>>> One potential reason is that we used to store a zswap header
>>>> containing the swap entry in the compressed page for writeback
>>>> purposes, but we don't do that anymore. Maybe we wanted to be able to
>>>> handle the case where an incompressible page would exceed PAGE_SIZE
>>>> because of that?
>>>
>>> It could be hmm. I didn't study the old zswap architecture too much,
>>> but it has been 2 * PAGE_SIZE since the time zswap was first merged
>>> last I checked.
>>> I'm not 100% comfortable ACK-ing the undoing of something that looks
>>> so intentional, but FTR, AFAICT, this looks correct to me.
>>
>> Right, there is no any history about the reason why we needed 2 pages.
>> But obviously only one page is needed from the current code and no any
>> problem found in the kernel build stress testing.
> 
> Could you try manually stressing the compression with data that
> doesn't compress at all (i.e. dlen == PAGE_SIZE)? I want to make sure
> that this case is specifically handled. I think using data from
> /dev/random will do that but please double check that dlen ==
> PAGE_SIZE.

I just reran the same kernel build test; there are indeed a few cases
where the output dlen == PAGE_SIZE.

bpftrace -e 'k:zpool_malloc {@[(uint32)arg1==4096]=count()}'

@[1]: 2
@[0]: 12011430
Chengming Zhou Dec. 14, 2023, 3:03 p.m. UTC | #6
On 2023/12/14 21:57, Chengming Zhou wrote:
> On 2023/12/14 21:37, Yosry Ahmed wrote:
>> On Thu, Dec 14, 2023 at 5:33 AM Chengming Zhou
>> <zhouchengming@bytedance.com> wrote:
>>>
>>> On 2023/12/14 08:18, Nhat Pham wrote:
>>>> On Wed, Dec 13, 2023 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>>>>
>>>>> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
>>>>> <zhouchengming@bytedance.com> wrote:
>>>>>>
>>>>>> Change the dstmem size from 2 * PAGE_SIZE to only one page since
>>>>>> we only need at most one page when compress, and the "dlen" is also
>>>>>> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
>>>>>> we don't wanna store the output in zswap anyway.
>>>>>>
>>>>>> So change it to one page, and delete the stale comment.
>>>>>
>>>>> I couldn't find the history of why we needed 2 * PAGE_SIZE, it would
>>>>> be nice if someone has the context, perhaps one of the maintainers.
>>>>
>>>> It'd be very nice indeed.
>>>>
>>>>>
>>>>> One potential reason is that we used to store a zswap header
>>>>> containing the swap entry in the compressed page for writeback
>>>>> purposes, but we don't do that anymore. Maybe we wanted to be able to
>>>>> handle the case where an incompressible page would exceed PAGE_SIZE
>>>>> because of that?
>>>>
>>>> It could be hmm. I didn't study the old zswap architecture too much,
>>>> but it has been 2 * PAGE_SIZE since the time zswap was first merged
>>>> last I checked.
>>>> I'm not 100% comfortable ACK-ing the undoing of something that looks
>>>> so intentional, but FTR, AFAICT, this looks correct to me.
>>>
>>> Right, there is no any history about the reason why we needed 2 pages.
>>> But obviously only one page is needed from the current code and no any
>>> problem found in the kernel build stress testing.
>>
>> Could you try manually stressing the compression with data that
>> doesn't compress at all (i.e. dlen == PAGE_SIZE)? I want to make sure
>> that this case is specifically handled. I think using data from
>> /dev/random will do that but please double check that dlen ==
>> PAGE_SIZE.
> 
> I just did the same kernel build testing, indeed there are a few cases
> that output dlen == PAGE_SIZE.
> 
> bpftrace -e 'k:zpool_malloc {@[(uint32)arg1==4096]=count()}'
> 
> @[1]: 2
> @[0]: 12011430

I think we shouldn't put this poorly compressed output into zswap.
Maybe it's better to return early in these cases, when the compression
ratio is below a threshold ratio that the user can tune?

e.g. in the same kernel build testing:

bpftrace -e 'k:zpool_malloc {@[(uint32)arg1>2048]=count()}'

@[1]: 1597706
@[0]: 10886138
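
The threshold idea above could look roughly like the following check. This is only a sketch: `toy_should_store`, `max_percent` and `TOY_PAGE_SIZE` are hypothetical names, a 4 KiB page is assumed, and no such tunable is added by this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/*
 * Hypothetical policy check: accept a compressed object only if it is at
 * most max_percent percent of a page. Integer arithmetic avoids floating
 * point, which kernel code generally cannot use.
 */
bool toy_should_store(size_t dlen, unsigned int max_percent)
{
	return dlen * 100 <= (size_t)TOY_PAGE_SIZE * max_percent;
}
```

With max_percent set to 50, a dlen of 2048 is accepted and 2049 is rejected, which matches the `arg1 > 2048` bucket in the bpftrace output: 1597706 of 12483844 calls, about 12.8%, would have been rejected in that run.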
Yosry Ahmed Dec. 14, 2023, 6:30 p.m. UTC | #7
On Thu, Dec 14, 2023 at 5:57 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2023/12/14 21:37, Yosry Ahmed wrote:
> > On Thu, Dec 14, 2023 at 5:33 AM Chengming Zhou
> > <zhouchengming@bytedance.com> wrote:
> >>
> >> On 2023/12/14 08:18, Nhat Pham wrote:
> >>> On Wed, Dec 13, 2023 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >>>>
> >>>> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> >>>> <zhouchengming@bytedance.com> wrote:
> >>>>>
> >>>>> Change the dstmem size from 2 * PAGE_SIZE to only one page since
> >>>>> we only need at most one page when compress, and the "dlen" is also
> >>>>> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
> >>>>> we don't wanna store the output in zswap anyway.
> >>>>>
> >>>>> So change it to one page, and delete the stale comment.
> >>>>
> >>>> I couldn't find the history of why we needed 2 * PAGE_SIZE, it would
> >>>> be nice if someone has the context, perhaps one of the maintainers.
> >>>
> >>> It'd be very nice indeed.
> >>>
> >>>>
> >>>> One potential reason is that we used to store a zswap header
> >>>> containing the swap entry in the compressed page for writeback
> >>>> purposes, but we don't do that anymore. Maybe we wanted to be able to
> >>>> handle the case where an incompressible page would exceed PAGE_SIZE
> >>>> because of that?
> >>>
> >>> It could be hmm. I didn't study the old zswap architecture too much,
> >>> but it has been 2 * PAGE_SIZE since the time zswap was first merged
> >>> last I checked.
> >>> I'm not 100% comfortable ACK-ing the undoing of something that looks
> >>> so intentional, but FTR, AFAICT, this looks correct to me.
> >>
> >> Right, there is no any history about the reason why we needed 2 pages.
> >> But obviously only one page is needed from the current code and no any
> >> problem found in the kernel build stress testing.
> >
> > Could you try manually stressing the compression with data that
> > doesn't compress at all (i.e. dlen == PAGE_SIZE)? I want to make sure
> > that this case is specifically handled. I think using data from
> > /dev/random will do that but please double check that dlen ==
> > PAGE_SIZE.
>
> I just did the same kernel build testing, indeed there are a few cases
> that output dlen == PAGE_SIZE.
>
> bpftrace -e 'k:zpool_malloc {@[(uint32)arg1==4096]=count()}'
>
> @[1]: 2
> @[0]: 12011430

That's very useful information, thanks for testing that. Please
include this in the commit log. Please also include the fact that we
used to store a zswap header with the compressed page but don't do
that anymore, which *may* be the reason why this was needed back then.

I still want someone who knows the history to Ack this, but FWIW it
looks correct to me, so low-key:
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Yosry Ahmed Dec. 14, 2023, 6:34 p.m. UTC | #8
[..]
>
> I think we shouldn't put these poorly compressed output into zswap,
> maybe it's better to early return in these cases when compress ratio
> < threshold ratio, which can be tune by the user?

We have something similar at Google, but because we use zswap without
a backing swapfile, we make those pages unevictable. For the upstream
code, the pages will go to a backing swapfile, which arguably violates
the LRU ordering, but may be the correct thing to do. There was a
recent upstream attempt to solidify storing those incompressible pages
in zswap in their uncompressed form to retain the LRU ordering.

If you want, feel free to start a discussion about this separately,
it's out of context for this patch series.

Thanks!
Nhat Pham Dec. 14, 2023, 8:29 p.m. UTC | #9
On Thu, Dec 14, 2023 at 10:30 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Dec 14, 2023 at 5:57 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
> >
> > On 2023/12/14 21:37, Yosry Ahmed wrote:
> > > On Thu, Dec 14, 2023 at 5:33 AM Chengming Zhou
> > > <zhouchengming@bytedance.com> wrote:
> > >>
> > >> On 2023/12/14 08:18, Nhat Pham wrote:
> > >>> On Wed, Dec 13, 2023 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >>>>
> > >>>> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> > >>>> <zhouchengming@bytedance.com> wrote:
> > >>>>>
> > >>>>> Change the dstmem size from 2 * PAGE_SIZE to only one page since
> > >>>>> we only need at most one page when compress, and the "dlen" is also
> > >>>>> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
> > >>>>> we don't wanna store the output in zswap anyway.
> > >>>>>
> > >>>>> So change it to one page, and delete the stale comment.
> > >>>>
> > >>>> I couldn't find the history of why we needed 2 * PAGE_SIZE, it would
> > >>>> be nice if someone has the context, perhaps one of the maintainers.
> > >>>
> > >>> It'd be very nice indeed.
> > >>>
> > >>>>
> > >>>> One potential reason is that we used to store a zswap header
> > >>>> containing the swap entry in the compressed page for writeback
> > >>>> purposes, but we don't do that anymore. Maybe we wanted to be able to
> > >>>> handle the case where an incompressible page would exceed PAGE_SIZE
> > >>>> because of that?
> > >>>
> > >>> It could be hmm. I didn't study the old zswap architecture too much,
> > >>> but it has been 2 * PAGE_SIZE since the time zswap was first merged
> > >>> last I checked.
> > >>> I'm not 100% comfortable ACK-ing the undoing of something that looks
> > >>> so intentional, but FTR, AFAICT, this looks correct to me.
> > >>
> > >> Right, there is no any history about the reason why we needed 2 pages.
> > >> But obviously only one page is needed from the current code and no any
> > >> problem found in the kernel build stress testing.
> > >
> > > Could you try manually stressing the compression with data that
> > > doesn't compress at all (i.e. dlen == PAGE_SIZE)? I want to make sure
> > > that this case is specifically handled. I think using data from
> > > /dev/random will do that but please double check that dlen ==
> > > PAGE_SIZE.

FWIW, zsmalloc supports the storing of pages that are PAGE_SIZE in
length, so a use case is probably there (although it could be for
ZRAM). We tested it during the storing-uncompressed-pages patch.
Architecturally, it seems that zswap just lets the backend allocator
handle the rejection of compressed objects that are too large, and the
compressor to reject pages that are too poorly compressed.

> >
> > I just did the same kernel build testing, indeed there are a few cases
> > that output dlen == PAGE_SIZE.
> >
> > bpftrace -e 'k:zpool_malloc {@[(uint32)arg1==4096]=count()}'
> >
> > @[1]: 2
> > @[0]: 12011430
>
> That's very useful information, thanks for testing that. Please
> include this in the commit log. Please also include the fact that we
> used to store a zswap header with the compressed page but don't do
> that anymore, which *may* be the reason why this was needed back then.
>
> I still want someone who knows the history to Ack this, but FWIW it
> looks correct to me, so low-key:
> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>

Anyway:
Reviewed-by: Nhat Pham <nphamcs@gmail.com>

Patch

diff --git a/mm/zswap.c b/mm/zswap.c
index edb8b45ed5a1..fa186945010d 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -707,7 +707,7 @@  static int zswap_dstmem_prepare(unsigned int cpu)
 	struct mutex *mutex;
 	u8 *dst;
 
-	dst = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
+	dst = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
 	if (!dst)
 		return -ENOMEM;
 
@@ -1662,8 +1662,7 @@  bool zswap_store(struct folio *folio)
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
 
-	/* zswap_dstmem is of size (PAGE_SIZE * 2). Reflect same in sg_list */
-	sg_init_one(&output, dst, PAGE_SIZE * 2);
+	sg_init_one(&output, dst, PAGE_SIZE);
 	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
 	/*
 	 * it maybe looks a little bit silly that we send an asynchronous request,