[v4,1/6] mm/zswap: change dstmem size to one page

Message ID 20231213-zswap-dstmem-v4-1-f228b059dd89@bytedance.com (mailing list archive)
State New
Series mm/zswap: dstmem reuse optimizations and cleanups

Commit Message

Chengming Zhou Dec. 26, 2023, 3:54 p.m. UTC
Change the dstmem size from 2 * PAGE_SIZE to only one page, since we
need at most one page when compressing, and the "dlen" passed to
acomp_request_set_params() is also PAGE_SIZE. If the output size is
> PAGE_SIZE, we don't want to store the output in zswap anyway.

So change it to one page, and delete the stale comment.

There is no recorded history explaining why we needed 2 pages; it has
been 2 * PAGE_SIZE since zswap was first merged.

According to Yosry and Nhat, one potential reason is that we used to
store a zswap header containing the swap entry in the compressed page
for writeback purposes, but we don't do that anymore.

This patch works well in kernel build testing, even when the input data
doesn't compress at all (i.e. dlen == PAGE_SIZE), which we can see
with bpftrace:

bpftrace -e 'k:zpool_malloc {@[(uint32)arg1==4096]=count()}'
@[1]: 2
@[0]: 12011430

Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Chris Li <chrisl@kernel.org> (Google)
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Comments

Barry Song Dec. 27, 2023, 1:07 a.m. UTC | #1
On Wed, Dec 27, 2023 at 4:55 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> Change the dstmem size from 2 * PAGE_SIZE to only one page since
> we only need at most one page when compress, and the "dlen" is also
> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
> we don't wanna store the output in zswap anyway.
>
> So change it to one page, and delete the stale comment.
>
> There is no any history about the reason why we needed 2 pages, it has
> been 2 * PAGE_SIZE since the time zswap was first merged.

I remember there was an over-compression case, meaning the compressed
data can be bigger than the source data. A similar thing is done in zram,
in drivers/block/zram/zcomp.c:

int zcomp_compress(struct zcomp_strm *zstrm,
                const void *src, unsigned int *dst_len)
{
        /*
         * Our dst memory (zstrm->buffer) is always `2 * PAGE_SIZE' sized
         * because sometimes we can endup having a bigger compressed data
         * due to various reasons: for example compression algorithms tend
         * to add some padding to the compressed buffer. Speaking of padding,
         * comp algorithm `842' pads the compressed length to multiple of 8
         * and returns -ENOSP when the dst memory is not big enough, which
         * is not something that ZRAM wants to see. We can handle the
         * `compressed_size > PAGE_SIZE' case easily in ZRAM, but when we
         * receive -ERRNO from the compressing backend we can't help it
         * anymore. To make `842' happy we need to tell the exact size of
         * the dst buffer, zram_drv will take care of the fact that
         * compressed buffer is too big.
         */
        *dst_len = PAGE_SIZE * 2;

        return crypto_comp_compress(zstrm->tfm,
                        src, PAGE_SIZE,
                        zstrm->buffer, dst_len);
}
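
To put a rough number on the over-compression case described in that comment, here is a minimal arithmetic sketch. It assumes a hypothetical toy format (not the real 842 or LZO layout) in which every chunk of up to 255 incompressible bytes is stored verbatim behind a 1-byte header, and the total is then padded to a multiple of 8 as the comment says `842' does. The names pad_to_8() and toy_worst_case() are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical toy format: up to 255 literal bytes per 1-byte header. */
#define TOY_CHUNK 255u

/* Round len up to the next multiple of 8. */
static size_t pad_to_8(size_t len)
{
	return (len + 7u) & ~(size_t)7u;
}

/* Worst-case output size when nothing in the input compresses. */
static size_t toy_worst_case(size_t src_len)
{
	/* One header byte per started chunk. */
	size_t headers = (src_len + TOY_CHUNK - 1) / TOY_CHUNK;

	return pad_to_8(src_len + headers);
}
```

Under these assumptions a 4096-byte page needs 4096 + 17 header bytes = 4113 bytes, padded to 4120. That does not fit in one page but fits easily in 2 * PAGE_SIZE.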



Thanks
Barry
Chengming Zhou Dec. 27, 2023, 6:11 a.m. UTC | #2
On 2023/12/27 09:07, Barry Song wrote:
> On Wed, Dec 27, 2023 at 4:55 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> Change the dstmem size from 2 * PAGE_SIZE to only one page since
>> we only need at most one page when compress, and the "dlen" is also
>> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
>> we don't wanna store the output in zswap anyway.
>>
>> So change it to one page, and delete the stale comment.
>>
>> There is no any history about the reason why we needed 2 pages, it has
>> been 2 * PAGE_SIZE since the time zswap was first merged.
> 
> i remember there was an over-compression case,  that means the compressed
> data can be bigger than the source data. the similar thing is also done in zram
> drivers/block/zram/zcomp.c

Right, there is a buffer overflow report[1] that I just added you to.

I think over-compression is all right, but buffer overflow is not acceptable,
so we should fix any buffer overflow problem IMHO. Anyway, 2 pages may
overflow too, just with a smaller probability, right?

Thanks.

Barry Song Dec. 27, 2023, 6:32 a.m. UTC | #3
On Wed, Dec 27, 2023 at 7:11 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2023/12/27 09:07, Barry Song wrote:
> > On Wed, Dec 27, 2023 at 4:55 AM Chengming Zhou
> > <zhouchengming@bytedance.com> wrote:
> >>
> >> Change the dstmem size from 2 * PAGE_SIZE to only one page since
> >> we only need at most one page when compress, and the "dlen" is also
> >> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
> >> we don't wanna store the output in zswap anyway.
> >>
> >> So change it to one page, and delete the stale comment.
> >>
> >> There is no any history about the reason why we needed 2 pages, it has
> >> been 2 * PAGE_SIZE since the time zswap was first merged.
> >
> > i remember there was an over-compression case,  that means the compressed
> > data can be bigger than the source data. the similar thing is also done in zram
> > drivers/block/zram/zcomp.c
>
> Right, there is a buffer overflow report[1] that I just +to you.
>
> I think over-compression is all right, but buffer overflow is not acceptable,
> so we should fix any buffer overflow problem IMHO. Anyway, 2 pages maybe
> overflowed too, just with smaller probability, right?

In practice, the typical page size is 4KB or above, and we have never seen
2 pages overflow. For CPU-based compression code we may have a chance to
return early before overflowing, though that is still very tough.
But for accelerator-based compression in drivers/crypto, the only choice is
to give the DMA engine a buffer that is long enough: 2 * PAGE_SIZE.

So I don't think this patch is correct.
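
A user-space sketch of that argument, under loud assumptions: toy_engine_compress() is a hypothetical stand-in for an engine that never looks at the destination capacity, and TOY_EXPANSION is an invented overhead, not a measured one. The caller keeps the write in bounds by supplying a 2-page buffer and only afterwards rejects anything larger than one page.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u   /* stand-in for PAGE_SIZE */
#define TOY_EXPANSION 17u     /* hypothetical worst-case overhead in bytes */

/*
 * Hypothetical engine that ignores the destination capacity: it always
 * writes src_len + TOY_EXPANSION bytes starting at dst.
 */
static size_t toy_engine_compress(const unsigned char *src, size_t src_len,
				  unsigned char *dst)
{
	memset(dst, 0xEE, TOY_EXPANSION);          /* fake framing bytes */
	memcpy(dst + TOY_EXPANSION, src, src_len); /* incompressible payload */
	return src_len + TOY_EXPANSION;
}

/*
 * Caller-side policy: dst2pages must be 2 * TOY_PAGE_SIZE bytes so the
 * engine's write stays in bounds; results above one page are not stored.
 */
static bool toy_store(const unsigned char *page, unsigned char *dst2pages,
		      size_t *dlen)
{
	*dlen = toy_engine_compress(page, TOY_PAGE_SIZE, dst2pages);
	return *dlen <= TOY_PAGE_SIZE;
}
```

With a one-page destination the same engine would write 17 bytes past the end of the buffer before the caller ever gets to compare dlen against the page size.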


Thanks
Barry
Andrew Morton Dec. 27, 2023, 8:58 p.m. UTC | #4
On Wed, 27 Dec 2023 14:11:06 +0800 Chengming Zhou <zhouchengming@bytedance.com> wrote:

> > i remember there was an over-compression case,  that means the compressed
> > data can be bigger than the source data. the similar thing is also done in zram
> > drivers/block/zram/zcomp.c
> 
> Right, there is a buffer overflow report[1] that I just +to you.

What does "[1]" refer to?  Is there a bug report about this series?
Nhat Pham Dec. 27, 2023, 11:21 p.m. UTC | #5
On Wed, Dec 27, 2023 at 12:58 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Wed, 27 Dec 2023 14:11:06 +0800 Chengming Zhou <zhouchengming@bytedance.com> wrote:
>
> > > i remember there was an over-compression case,  that means the compressed
> > > data can be bigger than the source data. the similar thing is also done in zram
> > > drivers/block/zram/zcomp.c
> >
> > Right, there is a buffer overflow report[1] that I just +to you.
>
> What does "[1]" refer to?  Is there a bug report about this series?

I think Chengming was referring to this:

https://lore.kernel.org/lkml/0000000000000b05cd060d6b5511@google.com/

Syzkaller/syzbot found an edge case where the page's "compressed" form
was larger than one page, which tripped up the compression code (since
we reduced the compression buffer size to 1 page here).
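
A minimal sketch of that edge case, assuming a hypothetical "stored block" encoder rather than any real kernel compressor: toy_compress() copies each chunk of up to 255 bytes verbatim behind a 1-byte header and checks the remaining space before every write. Given an incompressible page, it fails with -ENOSPC when the destination is one page and succeeds with dlen > PAGE_SIZE when the destination is two pages.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_CHUNK 255u   /* hypothetical maximum literal run per header */

/*
 * Hypothetical bounds-checked encoder. Returns 0 and sets *dlen on
 * success, or -ENOSPC if the output would not fit in dst_cap bytes.
 */
static int toy_compress(const unsigned char *src, size_t src_len,
			unsigned char *dst, size_t dst_cap, size_t *dlen)
{
	size_t in = 0, out = 0;

	while (in < src_len) {
		size_t n = src_len - in;

		if (n > TOY_CHUNK)
			n = TOY_CHUNK;
		/* Need room for the header byte plus n literal bytes. */
		if (dst_cap - out < n + 1)
			return -ENOSPC;
		dst[out++] = (unsigned char)n;
		memcpy(dst + out, src + in, n);
		out += n;
		in += n;
	}
	*dlen = out;
	return 0;
}
```

A compressor written this way is safe with either buffer size; the concern raised in this thread is about drivers that do not perform such a check.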
Chengming Zhou Dec. 28, 2023, 6:41 a.m. UTC | #6
On 2023/12/28 07:21, Nhat Pham wrote:
> On Wed, Dec 27, 2023 at 12:58 PM Andrew Morton
> <akpm@linux-foundation.org> wrote:
>>
>> On Wed, 27 Dec 2023 14:11:06 +0800 Chengming Zhou <zhouchengming@bytedance.com> wrote:
>>
>>>> i remember there was an over-compression case,  that means the compressed
>>>> data can be bigger than the source data. the similar thing is also done in zram
>>>> drivers/block/zram/zcomp.c
>>>
>>> Right, there is a buffer overflow report[1] that I just +to you.
>>
>> What does "[1]" refer to?  Is there a bug report about this series?
> 
> I think Chengming was referring to this:
> 
> https://lore.kernel.org/lkml/0000000000000b05cd060d6b5511@google.com/
> 
> Syzkaller/syzbot found an edge case where the page's "compressed" form
> was larger than one page, which tripped up the compression code (since
> we reduced the compression buffer size to 1 page here).

Right, thanks Nhat!

The reported bug can be fixed by a patch I posted:
https://lore.kernel.org/all/20231227093523.2735484-1-chengming.zhou@linux.dev/

Although this bug is fixed, we still have to revert the first patch and use
a 2-page buffer in zswap, since not all compressor drivers respect the
buffer size we pass in, and some may overflow our output buffer.

Barry Song has explained the background in:
https://lore.kernel.org/all/CAGsJ_4xuuaPnQzkkQVaRyZL6ZdwkiQ_B7_c2baNaCKVg_O7ZQA@mail.gmail.com/

I will send an updated series later.

Thanks!

Patch

diff --git a/mm/zswap.c b/mm/zswap.c
index 7ee54a3d8281..976f278aa507 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -707,7 +707,7 @@  static int zswap_dstmem_prepare(unsigned int cpu)
 	struct mutex *mutex;
 	u8 *dst;
 
-	dst = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
+	dst = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
 	if (!dst)
 		return -ENOMEM;
 
@@ -1662,8 +1662,7 @@  bool zswap_store(struct folio *folio)
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
 
-	/* zswap_dstmem is of size (PAGE_SIZE * 2). Reflect same in sg_list */
-	sg_init_one(&output, dst, PAGE_SIZE * 2);
+	sg_init_one(&output, dst, PAGE_SIZE);
 	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
 	/*
 	 * it maybe looks a little bit silly that we send an asynchronous request,